
I am working on a multi-label text classification problem (90 target labels in total). The data has a long-tailed, imbalanced class distribution and around 1900k records. Currently, I am working on a small sample of around 100k records with a similar target distribution. I am using the OAA (one-against-all, i.e. one-vs-rest) strategy and have tried several algorithms on the data.

Currently, each label I model has at least 5000 data rows. The class imbalance is high: the most common label has around 80k records, while the rarest has just one data row, which I have not included in the modelling. The dataset contains text from academic journals and has Title and Abstract columns.

I am using HashingVectorizer (n_features=2**20, char analyzer) to generate features and TruncatedSVD (n_components=200) to reduce the dimensionality.
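For reference, a minimal sketch of this feature pipeline (the ngram_range and random_state below are illustrative placeholders, not my exact settings):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Hash character n-grams into 2**20 dimensions, then reduce to 200 dense components.
vectorizer = HashingVectorizer(n_features=2**20, analyzer='char', ngram_range=(2, 4))
svd = TruncatedSVD(n_components=200, random_state=42)

features = make_pipeline(vectorizer, svd)
X = features.fit_transform(texts)  # texts: Title + Abstract strings, one per record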

LinearSVC(class_weight='balanced')  # Got many warnings that it could not converge. I read that this may be because the data is not scaled properly. How can I scale text data? (See the sketch after this list.)
LogisticRegression(solver='lbfgs')  # Converged very quickly
RandomForestClassifier(n_estimators=40, class_weight='balanced')  # Train time ~2 hr
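On the scaling question: since the TruncatedSVD output is dense, is standardizing it before LinearSVC a reasonable approach? A sketch of what I have in mind (OneVsRestClassifier stands in for my OAA setup; the max_iter value is just a placeholder):

from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Standardize the 200 dense SVD components so they have comparable variance,
# which usually helps LinearSVC converge; max_iter is raised as an extra safeguard.
clf = OneVsRestClassifier(
    make_pipeline(
        StandardScaler(),
        LinearSVC(class_weight='balanced', max_iter=10000)
    )
)
clf.fit(X, Y)  # X: SVD features, Y: binary indicator matrix of the 90 labels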

I noticed that LinearSVC has good recall (fewer false negatives), while LogisticRegression and RandomForest have good precision (fewer false positives). Can anyone help me identify the reasons behind these scores and how I can improve them?
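To make the comparison concrete, per-label precision/recall can be inspected with something like the following sketch (Y_true, Y_pred and label_names are placeholders):

from sklearn.metrics import classification_report

# Y_true and Y_pred are binary indicator matrices of shape (n_samples, n_labels).
# The report lists precision, recall and F1 per label plus micro/macro averages,
# showing where LinearSVC gains recall and where LogisticRegression/RF gain precision.
print(classification_report(Y_true, Y_pred, target_names=label_names, zero_division=0))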

[Image: precision/recall/F1 scores for the models above]

Currently, I am not using deep learning/transformer models due to limited computational resources.

joel
  • 1) Can you give more info on your class imbalance? E.g. a list of how many records per class. 2) What kind of text data are you dealing with? Social media? How long are the records on average? – Bruno Lubascher Apr 21 '20 at 09:16
  • This Kaggle competition might help you, and this blog up to an extent. – Kalsi Apr 21 '20 at 09:34
  • @BrunoLubascher I have updated the post. – joel Apr 21 '20 at 13:18
  • @Kalsi Thanks. I am looking into it. – joel Apr 21 '20 at 13:19
  • @joel One last question: how did you calculate your F1/P/R? I ask because you say that your most common label has 80k records out of the 100k total records, so a simple classifier that always predicts the top label should have a micro-average F1 of 0.8. – Bruno Lubascher Apr 21 '20 at 14:15
  • @BrunoLubascher I am actually working on a small sample of data which is 100k. I have updated the post. – joel Apr 21 '20 at 18:15
  • @joel What exactly do you mean by "identifying the reasons behind these scores"? Can you please elaborate? – aivanov Apr 25 '20 at 16:11