Movie Genre Classification

This notebook demonstrates a simple process for multilabel classification of movie genres. It is not intended to show what a fully optimized model would look like.

Dependencies

Functions

Text Preprocessing

Reads the genre column value and determines whether each genre string appears in the genre description, producing one binary label per genre. This makes it a multilabel task.
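A minimal sketch of this encoding step, assuming a pandas DataFrame with a `genre` column; the `GENRES` list and `encode_genres` helper shown here are illustrative, not the notebook's actual names:

```python
import pandas as pd

GENRES = ["drama", "comedy", "horror"]  # illustrative subset

def encode_genres(df, genre_col="genre"):
    """Add one binary column per genre: 1 if the genre string
    appears anywhere in the row's genre description, else 0."""
    out = df.copy()
    for g in GENRES:
        out[g] = out[genre_col].str.contains(g, case=False).astype(int)
    return out

df = pd.DataFrame({"genre": ["comedy-drama", "horror"]})
encoded = encode_genres(df)
```

Because a row like "comedy-drama" matches two genre strings, the resulting label vectors are not mutually exclusive.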

Helper function to convert the one-hot vector into a comma-separated string of movie genres

Wrapper to run the decoder across all rows in the dataset
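The decoder and its row-wise wrapper might look like the following sketch (function names are hypothetical):

```python
import numpy as np

GENRES = ["drama", "comedy", "horror"]  # illustrative subset

def decode_row(onehot):
    """Turn a single one-hot label vector into 'genre1, genre2'."""
    return ", ".join(g for g, flag in zip(GENRES, onehot) if flag)

def decode_matrix(matrix):
    """Apply the decoder to every row of a label matrix."""
    return [decode_row(row) for row in matrix]

labels = np.array([[1, 1, 0], [0, 0, 1]])
decoded = decode_matrix(labels)
```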

Clean the text
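A typical cleaning function for this kind of pipeline lowercases, strips punctuation and digits, and collapses whitespace. A sketch (the exact cleaning steps in the notebook may differ):

```python
import re
import string

def clean_text(text):
    """Lowercase, replace punctuation and digits with spaces,
    then collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(rf"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\d+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("A Thrilling, 2-part Story!!")
```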

Model Evaluation

Appends the predictions and actuals to the original dataframe, along with a boolean column `complete_correct` indicating whether all true labels were recovered
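A minimal sketch of this evaluation helper (the name `append_predictions` is hypothetical; here "completely correct" is treated as an exact match between the decoded actual and predicted strings):

```python
import pandas as pd

def append_predictions(df, actual_strs, pred_strs):
    """Attach decoded actual/predicted label strings and a
    complete_correct flag (True only when the full label set matches)."""
    out = df.copy()
    out["actual"] = actual_strs
    out["predicted"] = pred_strs
    out["complete_correct"] = out["actual"] == out["predicted"]
    return out

df = pd.DataFrame({"title": ["a", "b"]})
res = append_predictions(df, ["drama", "comedy, drama"], ["drama", "comedy"])
```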

Feature Selection

Load Data

Remove rows where genre was unknown

Making a list of genres to focus on, then encoding each row based on whether that genre text is found in the genre column of the training dataset.
NOTE: This is set up as a multilabel classification task since some entries combine multiple genres.

Just looking at counts of each genre in the dataset. Note that since genres are not mutually exclusive, some combinations may be inherently more imbalanced.
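With one binary column per genre, per-genre counts reduce to a column sum, for example:

```python
import pandas as pd

# One-hot encoded genre columns (illustrative values)
df = pd.DataFrame({"drama": [1, 1, 0], "comedy": [1, 0, 0], "horror": [0, 0, 1]})

# Sum each column to count occurrences; rows can contribute to several genres
counts = df.sum().sort_values(ascending=False)
```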

Clean Data

Split Training Test
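With a multilabel target, `train_test_split` accepts the full 2-D label matrix. A sketch (toy data; the notebook's actual split proportion may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)          # toy feature column
Y = np.random.randint(0, 2, size=(10, 3))  # 3 binary labels per row

# Split features and the multilabel matrix together
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42)
```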

Vectorize Documents

Count Vectorizer

Used as a raw baseline for the Bag of Words approach

Raw

Clean, limit, ngram

TF-IDF Vectorizer

Used as a more robust approach to identifying important words.

Raw

Clean, Limit, Ngram

Show transformed data

Feature Selection

Using Chi-square


Make a union of all selected features from all genres

Printing out the top 5 most significant terms from the chi-squared test for each genre, and creating a complete set of important words across all genres.
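A sketch of the per-genre chi-squared selection with a union over genres (the `select_union` helper and the toy document-term matrix are illustrative):

```python
import numpy as np
from sklearn.feature_selection import chi2

def select_union(X, Y, feature_names, p_threshold=0.05):
    """Run a chi-squared test of each term against each genre column
    and union the terms significant for any genre."""
    selected = set()
    for col in range(Y.shape[1]):
        _, p_values = chi2(X, Y[:, col])
        selected |= {feature_names[i] for i, p in enumerate(p_values)
                     if p < p_threshold}
    return selected

# Toy counts: "spooky" only in genre-0 docs, "funny" only in genre-1 docs
X = np.array([[5, 0], [4, 0], [0, 5], [0, 4]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
selected = select_union(X, Y, ["spooky", "funny"])
```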

RandomForest Baseline

TODO: Test RF on all the different feature sets.
Currently we are looking at a bare baseline without lemmatization/stemming or feature selection.
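scikit-learn's `RandomForestClassifier` handles a 2-D multilabel target natively, so the baseline can fit the full label matrix directly. A sketch on toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 5))
Y = (X[:, :3] > 0.5).astype(int)  # 3 binary labels derived from features

# Fitting against the whole label matrix trains one multioutput forest
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, Y)
preds = clf.predict(X)
```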

Evaluating Multilabel models

Resource

Convert the genre matrices to string combinations and append to original dataframe for evaluation of Training Set

Perform the same decoding to evaluate the Test Set

ExactMatchRatio

$$\frac{1}{n}\sum^n_{i=1}I(Y_i=Z_i)$$

where $I$ is the indicator function.
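Exact match ratio counts a prediction as correct only when the entire label vector matches. A minimal NumPy sketch:

```python
import numpy as np

def exact_match_ratio(Y_true, Y_pred):
    """Fraction of rows where every label matches exactly."""
    return np.all(Y_true == Y_pred, axis=1).mean()

Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
emr = exact_match_ratio(Y_true, Y_pred)  # 2 of 3 rows match exactly
```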

Train Set

Out of 21,602 records, 21,475 were predicted completely correctly, resulting in 0.99 accuracy.

Test Set

Out of 7,201 records, 2,253 were predicted completely correctly, resulting in 0.31 accuracy.
(Note: this improved from 0.30 to 0.31 after adding min_df=5 to the TF-IDF vectorizer.)

Accuracy

Proportion of correctly predicted labels to the total number of labels (predicted and actual) for each instance, averaged over all instances.
$$\frac{1}{n} \sum^n_{i=1} \frac{| Y_i \cap Z_i |}{| Y_i \cup Z_i |} $$

Precision

Proportion of correctly predicted labels to the total number of predicted labels
$$\frac{1}{n} \sum^n_{i=1} \frac{| Y_i \cap Z_i |}{|Z_i |} $$

Recall

Proportion of correctly predicted labels to the total number of actual labels

$$\frac{1}{n} \sum^n_{i=1} \frac{| Y_i \cap Z_i |}{| Y_i |} $$

F1-Measure

Harmonic mean of precision and recall.

$$\frac{1}{n} \sum^n_{i=1} \frac{2 | Y_i \cap Z_i |}{| Y_i | + | Z_i |} $$
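The four example-based metrics above can be computed directly from the binary label matrices. A minimal sketch, taking $Y$ as actual and $Z$ as predicted, and assuming every row has at least one actual and one predicted label (so no division by zero):

```python
import numpy as np

def multilabel_scores(Y, Z):
    """Example-based accuracy, precision, recall, and F1 for
    binary label matrices Y (actual) and Z (predicted)."""
    inter = np.logical_and(Y, Z).sum(axis=1)
    union = np.logical_or(Y, Z).sum(axis=1)
    acc = np.mean(inter / union)                            # |Y∩Z| / |Y∪Z|
    prec = np.mean(inter / Z.sum(axis=1))                   # |Y∩Z| / |Z|
    rec = np.mean(inter / Y.sum(axis=1))                    # |Y∩Z| / |Y|
    f1 = np.mean(2 * inter / (Y.sum(axis=1) + Z.sum(axis=1)))
    return acc, prec, rec, f1

Y = np.array([[1, 1, 0], [1, 0, 1]])
Z = np.array([[1, 0, 0], [1, 0, 1]])
acc, prec, rec, f1 = multilabel_scores(Y, Z)
```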

Feature Importance
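For a random forest, per-feature importance scores come straight from the fitted model's `feature_importances_` attribute. A sketch on toy data where only the first feature is informative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((100, 4))
y = (X[:, 0] > 0.5).astype(int)  # only feature 0 determines the label

clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Indices of features ranked from most to least important
top = np.argsort(clf.feature_importances_)[::-1]
```

Pairing `top` with the vectorizer's `get_feature_names_out()` gives the most informative words per model.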

Export to HTML

Resources and Notes

https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794