import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# import spacy
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
df = pd.read_csv('data/df_clean.csv')
numerics = pd.read_csv('data/numerics_clean.csv')
ymed = df.num_comments.median()
y = pd.Series([1 if val > ymed else 0 for val in df.num_comments])
df.drop('num_comments',axis=1,inplace=True) # get rid of this immediately
# # Lemmatize and filter out ' ' tokens
# nlp = spacy.load('en_core_web_sm')
# df['title'] = [' '.join([word.lemma_ for word in nlp(title) if word.lemma_ != ' '])\
# for title in df.title] # This should be optimized
Lemmatising, to my surprise, seems to add no value. I thought it would be the most important step, but it turns out to add nothing but run time. I suspect this is because post titles are so short that there is little real meaning to extract. It could still be useful when analyzing comments.
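If lemmatization is revisited later (for comment text, say), a faster pipeline is worth sketching. This is a minimal sketch assuming spaCy 3.x; the `lemmatize_series` helper and the batch size are hypothetical, not part of the notebook above.

import spacy

# Hypothetical helper: lemmatize a column of text with nlp.pipe, which streams
# documents in batches instead of calling nlp() once per row. The parser and
# NER are disabled because only the tagger/lemmatizer are needed for lemmas.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize_series(texts, batch_size=256):
    out = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        out.append(' '.join(tok.lemma_ for tok in doc if not tok.is_space))
    return out

# e.g. df['title'] = lemmatize_series(df.title)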
tf = TfidfVectorizer(stop_words='english',max_features=500)
tfvec = tf.fit(df.title)
X = pd.DataFrame(tfvec.transform(df.title).todense(),columns=tfvec.get_feature_names_out())
df = df.join(numerics)
del numerics # done with numerics
def make_dummies(df):
    # One-hot encode object and boolean columns, dropping the first level of each.
    for col_name in df.columns:
        if (df[col_name].dtype == 'O') or (df[col_name].dtype == 'bool'):
            dums = pd.get_dummies(df[col_name], prefix=col_name, dtype=int, drop_first=True)
            df = df.drop(labels=[col_name], axis=1)
            df = df.join(dums)
    return df
dums = make_dummies(df[df.columns[1:]]) # [1:] excludes first column, 'title'
del df # done with df
X = X.join(dums)
del dums # done with dums
# Do a split
X_train, X_test, y_train, y_test = train_test_split(X,y)
del X
del y
print('Create Random Forest...')
rf = RandomForestClassifier(n_jobs=-1)
print('Create Logistic Regression...')
# knn = KNeighborsClassifier(n_jobs=-1)
print('fit RF...')
model_rf = rf.fit(X_train,y_train)
print('fit KNN...')
# model_knn = knn.fit(X_train,y_train)
# Model Scores
def score(model, X, y):
    # 3-fold stratified cross-validation; report mean ± 2 standard deviations.
    cv = StratifiedKFold(n_splits=3, shuffle=True)
    s = cross_val_score(model, X, y, cv=cv)  # n_jobs=-1 actually makes it slower here
    print("Score:\t{:0.2} ± {:0.2}".format(s.mean(), 2 * s.std()))
print('Scoring...')
score(model_rf,X_train,y_train)
score(model_rf,X_test,y_test)
# score(model_knn,X_train,y_train)
# score(model_knn,X_test,y_test)
Create Random Forest...
Create Logistic Regression...
fit RF...
fit KNN...
Scoring...
Score:  0.62 ± 0.0056
Score:  0.6 ± 0.0052
Between Random Forest, K Neighbors, and Logistic Regression, they all score about the same, but Random Forest takes two minutes to run while the others take a lifetime. I don't think the data or the problem is complex enough to warrant more than a Random Forest.
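For reference, here is a rough sketch of how the three candidates could be compared on the same folds; `LogisticRegression` and its `max_iter=1000` setting are assumptions here, since only the Random Forest and the commented-out KNN appear above.

from sklearn.linear_model import LogisticRegression

# Hypothetical comparison sketch: score each candidate with the same 3-fold CV.
candidates = {
    'Random Forest': RandomForestClassifier(n_jobs=-1),
    'KNN': KNeighborsClassifier(n_jobs=-1),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}
cv = StratifiedKFold(n_splits=3, shuffle=True)
for name, model in candidates.items():
    s = cross_val_score(model, X_train, y_train, cv=cv)
    print("{}:\t{:0.2} ± {:0.2}".format(name, s.mean(), 2 * s.std()))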
Everything performs about 10% above the baseline of 50%. The target is split cleanly in half at the median, so predicting 1 across the board would give 50% accuracy. A cross-val score above 50% means the model is learning something, but it doesn't say much about exactly what it's predicting.
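That 50% baseline can be sanity-checked directly. This sketch uses scikit-learn's DummyClassifier, which is not part of the run above; it simply predicts the majority class for every row.

from sklearn.dummy import DummyClassifier

# Majority-class baseline: with the target split at the median, this should
# land right around 0.5, which is the number the models are compared against.
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print('Baseline accuracy: {:0.2}'.format(baseline.score(X_test, y_test)))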
pd.DataFrame({'Variable': X_train.columns,
              'Importance': rf.feature_importances_}).sort_values('Importance', ascending=False).head(25)
| | Variable | Importance |
|---|---|---|
| 500 | post_age | 0.080172 |
| 502 | norm_score | 0.077290 |
| 501 | upvote_ratio | 0.047547 |
| 5545 | is_self_True | 0.020376 |
| 232 | like | 0.003663 |
| 212 | just | 0.003490 |
| 4518 | subreddit_memes | 0.003337 |
| 428 | time | 0.003296 |
| 287 | new | 0.003230 |
| 293 | oc | 0.002981 |
| 198 | im | 0.002539 |
| 85 | day | 0.002455 |
| 157 | got | 0.002369 |
| 99 | dont | 0.002164 |
| 431 | today | 0.002146 |
| 247 | love | 0.002137 |
| 253 | man | 0.002102 |
| 156 | good | 0.002101 |
| 5546 | spoiler_True | 0.002013 |
| 5544 | is_original_content_True | 0.002012 |
| 141 | game | 0.001961 |
| 5027 | subreddit_shitposting | 0.001955 |
| 11 | art | 0.001921 |
| 5543 | over_18_True | 0.001905 |
| 311 | people | 0.001879 |
Here are the top 25 predictors ranked by importance, i.e. how much influence each has on the model. At the time of writing, the top three are post age, normalized score, and upvote ratio. Of course a popular post is predicted by measures of its popularity, but that is somewhat circular, so we can't lean on it too much. Beyond that, self posts see more activity, the memes and shitposting subreddits have been very active and popular over the past few days, and over-18 content is popular. Some keywords that may get a post to the top are 'like', 'just', 'time', 'new', 'oc', 'good', 'got', 'day', 'today', 'im', 'dont', and 'love'. I don't go on Reddit often, but I'm not exactly a stranger either, and this looks about right: memes are very popular and an easy karma farm, people love their OC (and their porn), and a lot of people on Reddit talk about what's going on in their day ('today', 'day').
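One caveat: impurity-based importances from a Random Forest can favour high-cardinality numeric columns like age and score. A permutation-importance cross-check is one way to confirm the ranking; the sketch below was not part of the run above (and would be slow with this many columns), so treat it as an assumption about how that check could be done.

from sklearn.inspection import permutation_importance

# Hypothetical cross-check: shuffle each feature on the test set and measure
# how much accuracy drops; larger drops mean more influential features.
perm = permutation_importance(model_rf, X_test, y_test, n_repeats=5, n_jobs=-1)
print(pd.DataFrame({'Variable': X_test.columns,
                    'Importance': perm.importances_mean})
        .sort_values('Importance', ascending=False)
        .head(10))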
This is a fairly simple model built on simple data. To go beyond it, I think the comments would have to be analyzed. I expected tokenization to be the most influential piece, and I still think that intuition is correct; it just doesn't apply here because there is no real meaning to be had from Reddit post titles, at least to a computer. A human sees a lot more than the text in the title: there is often an image attached, most posts reference a recent or current event, and some are an inside joke of sorts. Some titles contain emojis, and depending on their combination they can take on a meaning completely different from their individual meanings.
The next step from here, I believe, is to analyze the comments section of these posts, because right now that seems like the easiest way to truly describe the meaning of a post to a computer. With what was gathered here I'm only able to get 10% above baseline, and I think that's about all there is to be had; we could probably tweak our way to a few more percent, but I don't think there's much left on the table.