Using a Movie’s Plot Description to Classify Genre

The goal is to use natural language processing to create a model that predicts a movie’s genre using it’s plot summary. We have a dataset of 10,000 movies, each of which is classified as one of nine genres. To prepare the data for modeling, we’ll use sklearn’s CountVectorizer TF-IDF transformer. The countvectorizer uses a custom lemmatizer built with NLTK’s WordNetLemmatizer. There is a significant class imbalance; certain genres are more prevelant than others. We will address this using SMOTE, or Synthetic Minority Over Sampling Technique, and a stratified Kfold split during the model cross-validation process. Finally, we’ll tune the model’s parameters using sklearn’s RandomSearch and evaluate our results on a previously unseen test set.

You can check out all of the code and download the data here.

Import Libraries and Load the Data

import pandas as pd
#import text_processing as text
from sklearn import metrics
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE

df = pd.read_csv("movie_train.csv",index_col=0,)

df.reset_index(drop=False,inplace=True)
df.rename(mapper={'index':'ID'},axis=1,inplace=True)

X = df['Plot']
y = df['Genre']

print(df.shape)
df.head()

(10682, 7)

	ID	Release Year	Title	Plot	Director	Cast	Genre
0	10281	1984	Silent Madness	A computer error leads to the accidental relea...	Simon Nuchtern	Belinda Montgomery, Viveca Lindfors	horror
1	7341	1960	Desire in the Dust	Lonnie Wilson (Ken Scott), the son of a sharec...	Robert L. Lippert	Raymond Burr, Martha Hyer, Joan Bennett	drama
2	10587	1986	On the Edge	A gaunt, bushy-bearded, 44-year-old Wes Holman...	Rob Nilsson	Bruce Dern, Pam Grier	drama
3	25495	1988	Ram-Avtar	Ram and Avtar are both childhood best friends....	Sunil Hingorani	Sunny Deol, Anil Kapoor, Sridevi	drama
4	16607	2013	Machete Kills	Machete Cortez (Danny Trejo) and Sartana River...	Robert Rodriguez	Danny Trejo, Michelle Rodriguez, Sofía Vergara...	action

Tokenizing, Lemmatizing and a TF-IDF Transformer

To prepare text documents for Machine Learning pipelines, we need to convert the documents into a matrix of word frequencies. The result is a sparse matrix with columns representing every word in the dataset and rows containing the frequency of each word in a particular document. Sklearn’s CountVectorizer can perform this operation in the context of a ML pipeline.

The CountVectorizer object as a ‘tokenizer’ argument that can take in a custom tokenizer lemmatizer. Lemmatizing refers to the process of breaking words down to their root. For example, ‘running’ and ‘runs’ are counted as the same token for the classification algorithm. To improve the performance of our model, we’ll use a lemmatizer built with NLTK, a library that was specifically built for NLP.

After we’ve lemmatized and tokenized the documents to create our sparse matrix of word frequencies, we still need to control for documents that are longer than others. If one document is significantly shorter than another, the comparison of word frequencies won’t yield compelling results. To fix this, we’ll use a TF-IDF transformer that weights the word frequencies according to document length.

SMOTE and SGDClassifier

Once the data has been prepared for modeling, we’ll want to account for the class imbalance using synthetic oversampling, or SMOTE.

Finally, we’re ready to fit the data and make our predictions. I chose sklearn’s Stochastic Gradient Descent Classifier because it takes a shorter time to converge than other models, and it offers several different loss functions that we can compare during the tuning process.

The following code encompasses all of the steps I’ve just described.

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

# Custom Lemmatizer
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
    
### Make the SMOTE Pipeline
smote_pipeline = make_pipeline(CountVectorizer(tokenizer=LemmaTokenizer()),
                         TfidfTransformer(),
                         SMOTE(n_jobs=-1,random_state=42),
                         SGDClassifier(n_jobs=-1,verbose=0,random_state=42)
                        )

Tuning the Model

Now that we have our pipeline, let’s see how effective its predictions are. The following function returns the cross-validated results of our model by taking in the number of splits for a stratified Kfold cross-validation, plot descriptions as our input vector (X), genres as our targets (Y), and our pipeline. I used a stratified Kfold split to ensure that each fold has the same proportion of classes before the over-sampling step.

def pipeline_cv(splits, X, Y, pipeline):
    
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    
    reports = []
    for train, test in kfold.split(X, Y):
        fit = pipeline.fit(X.iloc[train], Y.iloc[train])
        prediction = fit.predict(X.iloc[test])
        
        reports.append(
            pd.DataFrame(
                metrics.classification_report(
                    Y.iloc[test],prediction,output_dict=True
                )
            )
        )

    df_concat = pd.concat([x for x in reports])

    by_row_index = df_concat.groupby(df_concat.index)
    df_means = by_row_index.mean()

    return df_means

pipeline_cv(5,X,y,smote_pipeline)

	action	adventure	comedy	crime	drama	horror	romance	thriller	western	accuracy	macro avg	weighted avg
f1-score	0.508273	0.492725	0.640416	0.342903	0.541284	0.695955	0.427871	0.280562	0.798318	0.55907	0.525367	0.557589
precision	0.449549	0.422445	0.624285	0.290811	0.711246	0.607127	0.355346	0.290636	0.704056	0.55907	0.495056	0.589748
recall	0.585542	0.592266	0.658962	0.420886	0.437666	0.817857	0.539296	0.271533	0.921905	0.55907	0.582879	0.559070
support	166.000000	66.200000	544.800000	65.600000	754.000000	168.000000	129.800000	137.000000	105.000000	0.55907	2136.400000	2136.400000

The result of our cross-validated model funtion gives us an overview of classification metrics for each genre. To further tune our model parameters, we’ll focus on the weighted F1 score.

### Create scorer
scorer = metrics.make_scorer(metrics.f1_score, average = 'weighted')

### Tuning with Random Search

params = {
    'countvectorizer__ngram_range':[(1,2),(1,3)],
    'countvectorizer__max_df':np.linspace(.5,.7,5),
    'countvectorizer__min_df':[1,2,3,4],
    'tfidftransformer__use_idf':[True],
    'tfidftransformer__smooth_idf':[True],
    'sgdclassifier__alpha':np.linspace(.00005,.0002),
    'sgdclassifier__loss':['squared_hinge']
}

random_search = RandomizedSearchCV(smote_pipeline,params,cv=5,n_jobs=-1,scoring=scorer,verbose=0)

pipeline_cv(5,X,y,random_search)

	action	adventure	comedy	crime	drama	horror	romance	thriller	western	accuracy	macro avg	weighted avg
f1-score	0.541071	0.521904	0.669948	0.344754	0.645931	0.737875	0.447168	0.289791	0.835373	0.618421	0.559313	0.612447
precision	0.520969	0.540179	0.659452	0.431400	0.648325	0.692932	0.432808	0.376150	0.793755	0.618421	0.566219	0.611383
recall	0.563855	0.507689	0.681355	0.289790	0.643767	0.790476	0.465259	0.236496	0.883810	0.618421	0.562500	0.618421
support	166.000000	66.200000	544.800000	65.600000	754.000000	168.000000	129.800000	137.000000	105.000000	0.618421	2136.400000	2136.400000

best_model = random_search.best_estimator_
best_model

Pipeline(memory=None,
         steps=[('countvectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True,
                                 max_df=0.6499999999999999, max_features=None,
                                 min_df=3, ngram_range=(1, 3),
                                 preprocessor=None, stop_words=None,
                                 strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tok...
                 SGDClassifier(alpha=0.00011428571428571428, average=False,
                               class_weight=None, early_stopping=False,
                               epsilon=0.1, eta0=0.0, fit_intercept=True,
                               l1_ratio=0.15, learning_rate='optimal',
                               loss='squared_hinge', max_iter=1000,
                               n_iter_no_change=5, n_jobs=-1, penalty='l2',
                               power_t=0.5, random_state=42, shuffle=True,
                               tol=0.001, validation_fraction=0.1, verbose=0,
                               warm_start=False))],
         verbose=False)

Evaluating our Model

Now that we have a tuned model, we’re ready to dive deeper into some performance metrics. We have a set of observations that we previously set aside to ensure that we aren’t over-fitting anything. We previously fit the model using Kfolds, but now we’ll fit our model on all of the training data, apply it to our test set for predictions, and evaluate the results.

train_set = pd.read_csv('datasets/movie_train.csv',index_col=0)

X_train = train_set['Plot']
y_train = train_set['Genre']

X_test = pd.read_csv('datasets/movie_test.csv',index_col=0)['Plot']
y_test = pd.read_csv('datasets/test_actuals.csv',index_col=0,header=None,names=['genre'])['genre']

[data.sort_index(inplace=True) for data in [X_test,X_train,y_test,y_train]]

print(X_test.shape,y_test.shape)

(3561,) (3561,)

fit = best_model.fit(X_train,y_train)
y_pred = fit.predict(X_test)

report = pd.DataFrame(
    metrics.classification_report(y_test,y_pred,output_dict=True)
)
report

	action	adventure	comedy	crime	drama	horror	romance	thriller	western	accuracy	macro avg	weighted avg
precision	0.477352	0.515789	0.697826	0.423913	0.656832	0.750779	0.457143	0.433333	0.803030	0.638585	0.579555	0.632100
recall	0.548000	0.480392	0.688103	0.325000	0.686688	0.803333	0.468293	0.274262	0.873626	0.638585	0.571966	0.638585
f1-score	0.510242	0.497462	0.692930	0.367925	0.671429	0.776167	0.462651	0.335917	0.836842	0.638585	0.572396	0.633465
support	250.000000	102.000000	933.000000	120.000000	1232.000000	300.000000	205.000000	237.000000	182.000000	0.638585	3561.000000	3561.000000

Confusion Matrix

The confusion matrix can help us see where our model is going wrong by plotting predicted class values vs actual class values. It looks like ‘Drama’ and ‘Comedy’ are getting confused for each other, ‘Thriller’, ‘Adventure’, and ‘Crime’ movies are hard to pin down, often being confused for ‘Drama’. Ultimately, there are at least as many correct predictions than false predictions for each row of the matrix, so I’m happy with these results. Especially when considering that genre classification is a tricky task for many humans as well.

from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(true,predicted,classes):
    import itertools
    cm=confusion_matrix(true,predicted,labels=classes)
    
    fig = plt.figure(figsize=(15,9))
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm,cmap=plt.cm.Blues)
    plt.title('Confusion matrix',fontdict={'size':20})
    fig.colorbar(cax)
    
    ax.set_xticklabels([''] + classes,fontdict={'size':14})
    ax.set_yticklabels([''] + classes,fontdict={'size':14})
    
    plt.xlabel('Predicted',fontdict={'size':14})
    plt.ylabel('True',fontdict={'size':14})
    
    plt.grid(b=None)
    fmt = 'd'

    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
             horizontalalignment="center",
             fontdict={'size':14,'weight':'heavy'})

classes = list(report.columns)[:-3]
plot_confusion_matrix(y_test,y_pred,classes)

png

Further Considerations

How can we make this model better? We can re-examine our tokenizing and lemmatizing. We only really did the bear minimum on that step. We could continue tuning the parameters of the model with grid-search.

Most importantly, we could choose a different metric to tune or model after taking into consideration what this model could be useful for. Maybe we want to use this model to recommend movies to users. If we know a user likes comedy, we might want to optimize for recall so that we can be sure that every comedy is represented in the output even if a few non-comedies make it through. If a user only likes drama and hates comdedy, we could optimize for precision to be sure that no comedies make it through. I chose weighted-F1 because it’s a balanced metric between precision and recall, and the results show that. But if we really want to apply this to the real world, we’ll need to think further about what really matters to end-users and other stakeholders.