import pandas as pd
numerics = pd.read_csv('data/numerics.csv')
print(f'\
Score min/max: {numerics.score.min()} / {numerics.score.max()}\n\
Post Age min/max: {numerics.post_age.min()} / {numerics.post_age.max()}\n\
Upvote Ratio min/max: {numerics.upvote_ratio.min()} / {numerics.upvote_ratio.max()}\n\
')
Score min/max: 45 / 193501
Post Age min/max: 14846.5434589386 / 101274.16840600967
Upvote Ratio min/max: 0.51 / 1.0
Upvote Ratio is already between 0 and 1, so that's a third of the work out of the way for free.
def normalize_numerics(col):
    # Divide every value by the column's maximum, so the largest value becomes 1.0
    col_max = numerics[col].max()
    return [(val / col_max) for val in numerics[col]]
I wasn't sure how often I would need to do this, so I wrote a function that divides each value by its column's maximum.
numerics['norm_score'] = normalize_numerics('score')
numerics = numerics.drop('score',axis=1) # Prevent name collision with column for word 'score'
numerics['post_age'] = normalize_numerics('post_age')
So if we check again...
print(f'\
Score min/max: {numerics.norm_score.min()} / {numerics.norm_score.max()}\n\
Post Age min/max: {numerics.post_age.min()} / {numerics.post_age.max()}\n\
Upvote Ratio min/max: {numerics.upvote_ratio.min()} / {numerics.upvote_ratio.max()}\n\
')
Score min/max: 0.0002325569376902445 / 1.0
Post Age min/max: 0.14659753511298737 / 1.0
Upvote Ratio min/max: 0.51 / 1.0
It's less human-readable, but better for the model, which is not human.
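For reference, the same scaling can be written without a Python-level loop. This is only a sketch on a made-up `toy` frame: the helper names `max_scale` and `min_max_scale` are hypothetical, and `min_max_scale` is an alternative I did not use above, shown only because it explains why the minimums above are not 0.

```python
import pandas as pd

# Toy stand-in for the real numerics frame (values are made up for illustration)
toy = pd.DataFrame({
    "score": [45, 1000, 193501],
    "post_age": [15000.0, 50000.0, 100000.0],
})

def max_scale(frame, col):
    """Divide a column by its maximum, so the largest value becomes 1.0."""
    return frame[col] / frame[col].max()

def min_max_scale(frame, col):
    """Alternative (not used above): stretch the column over the full [0, 1] range."""
    lo, hi = frame[col].min(), frame[col].max()
    return (frame[col] - lo) / (hi - lo)

toy["norm_score"] = max_scale(toy, "score")
toy["norm_post_age"] = max_scale(toy, "post_age")
toy["minmax_score"] = min_max_scale(toy, "score")
print(toy[["norm_score", "norm_post_age", "minmax_score"]])
```

Dividing by the maximum keeps the ratios between values intact, but the smallest value only lands on 0 if the column's minimum is 0. Min-max scaling forces the smallest value to 0 instead.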
df = pd.read_csv('data/workingdf.csv')
Before tokenizing, anything that would end up as gibberish tokens needs to be removed: emojis, punctuation, special characters, accents, and so on. I've decided to replace underscores and hyphens with whitespace, then remove anything that is not a letter or whitespace, and finally strip leading and trailing whitespace.
df['title'] = [title.lower().replace('_',' ').replace('-',' ') for title in df.title]
df['title'] = df.title.replace(r"[^a-zA-Z\s]",'',regex=True) # raw string so \s is not treated as an invalid escape
df['title'] = [title.strip() for title in df.title]
df.drop(df[df.title==''].index,inplace=True) # drop now empty titles
df.to_csv('data/df_clean.csv',index=False)
numerics.to_csv('data/numerics_clean.csv',index=False)
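To sanity-check the title cleaning, here is a sketch that runs the same steps on a few made-up titles using pandas string methods. The `clean_titles` helper and the sample titles are hypothetical. It also collapses runs of internal whitespace, which is an extra step: the code above only trims the ends.

```python
import pandas as pd

# Made-up titles covering underscores, hyphens, emojis, punctuation and accents
toy = pd.DataFrame({
    "title": ["Hello_World!", "  self-driving   cars 🚗 ", "¡¡¡", "Café #1"],
})

def clean_titles(titles):
    """Lowercase, turn _ and - into spaces, keep only letters and whitespace, trim."""
    cleaned = titles.str.lower().str.replace(r"[_-]", " ", regex=True)
    cleaned = cleaned.str.replace(r"[^a-z\s]", "", regex=True)
    # Extra step, not in the code above: collapse runs of whitespace to one space
    cleaned = cleaned.str.replace(r"\s+", " ", regex=True)
    return cleaned.str.strip()

toy["title"] = clean_titles(toy["title"])
toy = toy[toy["title"] != ""].reset_index(drop=True)  # drop titles that are now empty
print(toy["title"].tolist())
# ['hello world', 'self driving cars', 'caf']
```

Note what happens to "Café": the accented letter is dropped rather than converted to "e", so the title becomes "caf". If that matters for the vocabulary, accents would have to be transliterated before this step.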
Next up: the model!