
I am working on a multi-label text classification problem (90 target labels in total). The data has a long-tailed, imbalanced class distribution and around 1900k records. Currently, I am working on a small sample of around 100k records with a similar target distribution. I am using the OAA (one-against-all, i.e. one-vs-rest) strategy and have tried several algorithms on the data.

Currently, each label I model has at least 5000 data rows. The class imbalance is high: the most common label has around 80k records, while the rarest has just one data row, which I have not included in the modelling. The dataset contains text from academic journals and has Title and Abstract columns.

I am using HashingVectorizer (n_features=2**20, char analyzer) to generate features and TruncatedSVD (n_components=200) to reduce the dimensionality.
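For reference, a minimal sketch of this feature pipeline (the ngram_range and random_state below are illustrative placeholders, not my exact settings):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Hash character n-grams into 2**20 dimensions, then reduce to 200 dense components.
vectorizer = HashingVectorizer(n_features=2**20, analyzer='char', ngram_range=(2, 4))
svd = TruncatedSVD(n_components=200, random_state=42)

features = make_pipeline(vectorizer, svd)
X = features.fit_transform(texts)  # texts: Title + Abstract strings, one per record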

LinearSVC(class_weight='balanced')  # Got many warnings that it could not converge. I read that this may be because the data is not scaled properly. How can I scale text data? (See the sketch after this list.)
LogisticRegression(solver='lbfgs')  # Converged very quickly
RandomForestClassifier(n_estimators=40, class_weight='balanced')  # Train time ~2 hr
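On the scaling question: since the TruncatedSVD output is dense, is standardizing it before LinearSVC a reasonable approach? A sketch of what I have in mind (OneVsRestClassifier stands in for my OAA setup; the max_iter value is just a placeholder):

from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Standardize the 200 dense SVD components so they have comparable variance,
# which usually helps LinearSVC converge; max_iter is raised as an extra safeguard.
clf = OneVsRestClassifier(
    make_pipeline(
        StandardScaler(),
        LinearSVC(class_weight='balanced', max_iter=10000)
    )
)
clf.fit(X, Y)  # X: SVD features, Y: binary indicator matrix of the 90 labels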

I noticed that LinearSVC has good recall (fewer false negatives), while LogisticRegression and RandomForest have good precision (fewer false positives). Can anyone help me identify the reasons behind these scores and how I can improve them?
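To make the comparison concrete, per-label precision/recall can be inspected with something like the following sketch (Y_true, Y_pred and label_names are placeholders):

from sklearn.metrics import classification_report

# Y_true and Y_pred are binary indicator matrices of shape (n_samples, n_labels).
# The report lists precision, recall and F1 per label plus micro/macro averages,
# showing where LinearSVC gains recall and where LogisticRegression/RF gain precision.
print(classification_report(Y_true, Y_pred, target_names=label_names, zero_division=0))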

[Image: precision/recall/F1 scores for the models above]

Currently, I am not using deep learning/transformer models due to limited computational resources.

joel
  • 1) Can you give more info on your class imbalance? E.g. a list of how many records per class. 2) What kind of text data are you dealing with? Social media? How long are the records on average? – Bruno Lubascher Apr 21 '20 at 09:16
  • This Kaggle competition might help you, and this blog up to an extent. – Kalsi Apr 21 '20 at 09:34
  • @BrunoLubascher I have updated the post. – joel Apr 21 '20 at 13:18
  • @Kalsi Thanks. I am looking into it. – joel Apr 21 '20 at 13:19
  • @joel One last question: how did you calculate your F1/P/R? I ask because you say that your most common label has 80k records out of the 100k total records, so a simple classifier that always predicts the top label should have a micro-average F1 of 0.8. – Bruno Lubascher Apr 21 '20 at 14:15
  • @BrunoLubascher I am actually working on a small sample of data which is 100k. I have updated the post. – joel Apr 21 '20 at 18:15
  • @joel What exactly do you mean by "identifying the reasons behind these scores"? Can you please elaborate? – aivanov Apr 25 '20 at 16:11