# Hi Maddie!
This is a Jupyter notebook. It runs Python code but also allows markdown (like this) and separates code into different cells that each have their own output window. I find it a much more organized way to write code for data analysis because you can separate different bits into their own cells and have their corresponding output right beneath them.
I'll use markdown to walk you through what I'm doing.

In [3]:
import pandas as pd
import os
import re
df = pd.read_csv('master.csv')


Just looking at a bit of the data to get a feel for it. Usually I would delete this after running it, but I'll leave this kind of thing in so you can get a better idea of the process I took.

In [4]:
print(df.columns)
print(df.head(10))


Index(['subreddit', 'author', 'date', 'post'], dtype='object')
  subreddit               author      date  \
0      adhd              Seftari  1/1/2018   
1      adhd           LisaLoves2  1/1/2018   
2      adhd       featherflutter  1/1/2018   
3      adhd          vanillasky0  1/1/2018   
4      adhd           JITTERdUdE  1/1/2018   
5      adhd  currentlystruggling  1/1/2018   
6      adhd          itsbrainfog  1/1/2018   
7      adhd             alnyland  1/1/2018   
8      adhd           proweruser  1/1/2018   
9      adhd            MObrien37  1/1/2018   

                                                post  
0  Lethargic/Depressed when off meds First I'll g...  
1  Concerta not working on the first day?! Update...  
2  Comorbid anxiety and ADHD-PI Medication Questi...  
3  Fist Day on Concerta 18mg UPDATE! Update!: Tha...  
4  I absolutely hate being so motivated but equal...  
5  ADHD-related sensory issues and layering cloth...  
6  How’d you guys decide on a career?? I kind

In [5]:
# gives an array of the unique values in the column
print(df['subreddit'].unique())
# gives each unique value in the column, in this case subreddit, and how many times it appears
subs = df['subreddit'].value_counts() 
print(subs)

['adhd' 'autism' 'conspiracy' 'divorce' 'fitness' 'guns' 'jokes'
 'meditation' 'parenting' 'personalfinance' 'relationships' 'teaching']
jokes              48256
personalfinance    31799
fitness            17651
relationships      16744
adhd               15333
guns               14366
conspiracy          7457
divorce             3428
parenting           3080
meditation          2205
autism              2084
teaching             549
Name: subreddit, dtype: int64


In [6]:
autism = df.loc[df['subreddit'] == 'autism']
print(autism.shape)

(2084, 4)


2084 total records for autism didn't look right to me, so I went and downloaded the original CSVs for r/autism. Here I'm printing the shape of each file, which tells me how many rows and columns it has. As I suspected, there are more than 2084 rows for r/autism.

In [7]:
for file in os.listdir('autism_data'):
    print(file)
    aut_df = pd.read_csv('./autism_data/%s' % file)
    print(aut_df.shape)

autism_2019_features_tfidf_256.csv
(1401, 350)
autism_post_features_tfidf_256.csv
(2209, 350)
autism_2018_features_tfidf_256.csv
(683, 350)
autism_pre_features_tfidf_256.csv
(4576, 350)
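Adding up the four shapes printed above gives 1401 + 2209 + 683 + 4576 = 8869 rows, compared with 2084 in the master file. Here's a minimal sketch of doing that tally in code. The helper name `total_rows` and the toy dataframes are made up for illustration, they are not the real CSVs.

```python
import pandas as pd


def total_rows(frames):
    """Return the summed row count of an iterable of DataFrames."""
    return sum(len(frame) for frame in frames)


# Toy stand-ins for the per-file dataframes (hypothetical data)
toy_frames = [
    pd.DataFrame({"post": ["a", "b"]}),
    pd.DataFrame({"post": ["c"]}),
]

print(total_rows(toy_frames))
```

With the real files, the same helper could be fed a generator that calls `pd.read_csv` on each file in `autism_data`.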


Here I combine the data from all the r/autism CSVs into one dataframe and check its shape. Then I get rid of the columns we won't need. I kept all the columns that start with 'tfidf' because it's a metric that I was planning on computing later, but they already did it for us!

In [9]:
# DataFrame.append was removed in pandas 2.0, so read every file first and concatenate once
frames = [pd.read_csv('./autism_data/%s' % file) for file in os.listdir('autism_data')]
df = pd.concat(frames)
print(df.shape)
print(df.columns)
# Gets all the columns from the dataframe
tfidf = df.columns.tolist()
# Returns a list of the column names that contain the substring 'tfidf'
tfidf = list(filter(lambda x: 'tfidf' in x, tfidf))
# Add the tfidf column names to the other ones we want to keep
columns = ['subreddit', 'author', 'date', 'post'] + tfidf
# Set the dataframe equal to the dataframe but only the columns I want to keep
df = df[columns]
print(df.shape)


Index(['subreddit', 'author', 'date', 'post', 'automated_readability_index',
       'coleman_liau_index', 'flesch_kincaid_grade_level',
       'flesch_reading_ease', 'gulpease_index', 'gunning_fog_index',
       ...
       'tfidf_wish', 'tfidf_without', 'tfidf_wonder', 'tfidf_work',
       'tfidf_worri', 'tfidf_wors', 'tfidf_would', 'tfidf_wrong',
       'tfidf_x200b', 'tfidf_year'],
      dtype='object', length=350)
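The cell above keeps any column whose name contains 'tfidf' anywhere. If we only want names that actually start with it, a prefix check is a bit stricter. This is just a sketch on a made-up one-row dataframe, so the values and the `toy` name are assumptions, not our data.

```python
import pandas as pd

# Hypothetical one-row frame with a few of the column names seen above
toy = pd.DataFrame({
    "subreddit": ["autism"],
    "author": ["someone"],
    "date": ["2019/01/01"],
    "post": ["hello"],
    "gunning_fog_index": [7.1],
    "tfidf_work": [0.2],
    "tfidf_year": [0.0],
})

base_columns = ["subreddit", "author", "date", "post"]
# Keep only the columns whose name begins with the 'tfidf' prefix
tfidf_columns = [name for name in toy.columns if name.startswith("tfidf")]
trimmed = toy[base_columns + tfidf_columns]

print(trimmed.columns.tolist())
```

For the column names printed above, both approaches should pick the same columns, since 'tfidf' only shows up as a prefix there.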


I went into the data to find an example of weird characters, and it looks like it was an encoding issue in Excel that was showing apostrophes as weird characters. Python gets it right though, so nothing for us to worry about.

In [7]:
print(df.iloc[8]['post'])

Any tips for getting a 3 year old to sleep in his bed? My girlfriend’s 3 year old is highly non verbal autistic (not sure if that’s the proper description) but he never sleeps in his bed. He has a stair gate on his door and he lays by it all night until he falls asleep. It’s not a problem but I just can’t imagine he’s comfy. Is this just a phase he’ll grow out off or should we be doing something to encourage him to sleep in his bed?


Here's an example of some markdown remnants coming through in a post (the `&amp;#x200B;` lines, which are an escaped zero-width-space entity). I'm going to remove them next.

In [8]:
print(df.iloc[944]['post'])

I don't have autism. My experience. Hi. I don't have autism.

&amp;#x200B;

I have two children, my three year old is severely autistic(destructive, licking biting smearing, ect), my four year old is mild(super clever, impossible to deal with). 

&amp;#x200B;

Wife is also autistic, just only found out. Honestly before the kids I had no idea what what autism even was. Often heard the term window licker without knowing the meaning as a child, now completely know the meaning and ramifications behind the term(youngest prefers radiators over windows). 

&amp;#x200B;

I think the worst part of it all is that you can't go to your usual friends to talk about it, they don;t understand, can't really expect them to because a year ago I wouldn't have a clue either, unless you've lived it, you don't know.

&amp;#x200B;

One thing I really struggle with is knowing if my eldest is being naughty or it's the autism and she needs support. 

&amp;#x200B;

I guess im not even asking any questions just ve

In [9]:
# removing punctuation, unnecessary whitespace, markdown remnants and making everything lowercase
df['post'] = df['post'].map(lambda x: x.lower().strip())
df['post'] = df['post'].map(lambda x: x.replace('\n', ''))
# raw strings keep the regex escapes from triggering invalid-escape warnings
df['post'] = df['post'].map(lambda x: re.sub(r'[,\.!?;:#]', '', x))
df['post'] = df['post'].map(lambda x: re.sub(r'[()]', ' ', x))
# by this point '&amp;#x200b;' has lost its ';' and '#', so what is left is '&ampx200b'
df['post'] = df['post'].map(lambda x: x.replace('&ampx200b', ''))


print(df.iloc[944]['post'])

i don't have autism my experience hi i don't have autismi have two children my three year old is severely autistic destructive licking biting smearing ect  my four year old is mild super clever impossible to deal with  wife is also autistic just only found out honestly before the kids i had no idea what what autism even was often heard the term window licker without knowing the meaning as a child now completely know the meaning and ramifications behind the term youngest prefers radiators over windows  i think the worst part of it all is that you can't go to your usual friends to talk about it they dont understand can't really expect them to because a year ago i wouldn't have a clue either unless you've lived it you don't knowone thing i really struggle with is knowing if my eldest is being naughty or it's the autism and she needs support i guess im not even asking any questions just venting
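One thing visible in the output above: because newlines are replaced with nothing, words on either side of a paragraph break get glued together ("autismi", "knowone"). Here's a sketch of a variant that swaps newlines for a space and collapses repeated whitespace. The function name `clean_post` and the sample string are made up for illustration, and this is only one possible way to do it.

```python
import re


def clean_post(text):
    """Lowercase, drop markdown remnants and punctuation, and normalise whitespace."""
    text = text.lower()
    # Remove the zero-width-space entity before punctuation is stripped
    text = text.replace('&amp;#x200b;', ' ')
    # Parentheses become spaces so adjacent words stay separate
    text = re.sub(r'[()]', ' ', text)
    # Same punctuation set as the cell above
    text = re.sub(r'[,.!?;:#]', '', text)
    # Collapse newlines and runs of spaces into single spaces
    return re.sub(r'\s+', ' ', text).strip()


# Hypothetical sample in the style of the post above
sample = "Hi. I don't have autism.\n\n&amp;#x200B;\n\nI have two children(ages 3, 4)."
print(clean_post(sample))
```

It could be applied with `df['post'].map(clean_post)` in place of the chain of `map` calls.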


Here I'm checking for missing or placeholder values, but I didn't find any.

In [10]:
df['date'].unique()
# I thought there might be some placeholders here that wouldn't show up as NaN but nope, all good

array(['2019/01/01', '2019/01/02', '2019/01/03', '2019/01/04',
       '2019/01/05', '2019/01/06', '2019/01/07', '2019/01/08',
       '2019/01/09', '2019/01/10', '2019/01/11', '2019/01/12',
       '2019/01/13', '2019/01/14', '2019/01/15', '2019/01/16',
       '2019/01/17', '2019/01/18', '2019/01/19', '2019/01/20',
       '2019/01/21', '2019/01/22', '2019/01/23', '2019/01/24',
       '2019/01/25', '2019/01/26', '2019/01/27', '2019/01/28',
       '2019/01/29', '2019/01/30', '2019/01/31', '2019/02/01',
       '2019/02/02', '2019/02/03', '2019/02/04', '2019/02/05',
       '2019/02/06', '2019/02/07', '2019/02/08', '2019/02/09',
       '2019/02/10', '2019/02/11', '2019/02/12', '2019/02/13',
       '2019/02/14', '2019/02/15', '2019/02/16', '2019/02/17',
       '2019/02/18', '2019/02/19', '2019/02/20', '2019/02/21',
       '2019/02/22', '2019/02/23', '2019/02/24', '2019/02/25',
       '2019/02/26', '2019/02/27', '2019/02/28', '2019/03/01',
       '2019/03/02', '2019/03/03', '2019/03/04', '2019/

In [11]:
print(df.shape)
print(df.dropna().shape)
# if these are the same then there were no NaN values

(8869, 260)
(8869, 260)
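The two checks above (eyeballing the unique dates and comparing shapes before and after `dropna`) can also be done in a way that points at the offending values directly. This is a sketch on a made-up dataframe. The 'unknown' placeholder and the `toy` name are assumptions for illustration, and the date format is taken from the values printed above.

```python
import pandas as pd

# Hypothetical frame with one missing post and one placeholder date
toy = pd.DataFrame({
    "date": ["2019/01/01", "2019/01/02", "unknown"],
    "post": ["first", None, "third"],
})

# Number of NaN values in each column
missing_per_column = toy.isna().sum()

# Dates that don't match the expected format are coerced to NaT
parsed = pd.to_datetime(toy["date"], format="%Y/%m/%d", errors="coerce")
bad_dates = toy.loc[parsed.isna(), "date"].tolist()

print(missing_per_column.to_dict())
print(bad_dates)
```

On data as clean as ours, every count should come back zero and the list of bad dates should be empty.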


In [14]:
# number of distinct authors (no NaN values here, so this matches len(unique()))
print(df['author'].nunique())

7177


Overall, this data was pretty clean to begin with, which makes sense because it comes from a published paper, so they likely handled a lot of preprocessing before releasing it. I'm gonna write the final result to a CSV, then we can start the analysis from that. Also note that I only did this with the data from r/autism; I don't think the control subreddits will be relevant for us. We can include r/adhd if you'd like, but I'm a little worried that we'll run into computing time issues if we include it, so I'm skipping it for now.

In [416]:
df.to_csv('autism_master.csv')
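One thing to keep in mind when we load this file back: by default `to_csv` also writes the dataframe index, which comes back as an extra unnamed column on `read_csv`. Here's a small sketch of the difference using a temporary file and a made-up dataframe. The file name `toy_master.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a repeated index, like rows combined from several files
toy = pd.DataFrame({"author": ["a", "b"], "post": ["x", "y"]}, index=[0, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_master.csv")

    # Default behaviour writes the index as an unnamed first column
    toy.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False writes only the data columns
    toy.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

If we don't need the original row index, passing `index=False` when writing keeps the reloaded columns the same as the ones we saved.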