
I am trying to build a model that predicts whether an email is spam or not spam. After building a logistic regression model, I got the following results:

              precision    recall  f1-score   support

         0.0       0.92      0.99      0.95       585
         1.0       0.76      0.35      0.48        74

    accuracy                           0.92       659
   macro avg       0.84      0.67      0.72       659
weighted avg       0.91      0.92      0.90       659

Confusion matrix:

[[577   8]
 [ 48  26]]

Accuracy: 0.9150227617602428

The F1-score is the metric I am focusing on, and I am having difficulty interpreting the results: I think these are very bad results! May I ask how I could improve them? I am currently considering a model that looks at the full text of the emails (subject + body).

After Erwan's answer:

I oversampled the dataset and these are my results:

Logistic regression

              precision    recall  f1-score   support

         0.0       0.94      0.77      0.85       573
         1.0       0.81      0.96      0.88       598

    accuracy                           0.86      1171
   macro avg       0.88      0.86      0.86      1171
weighted avg       0.88      0.86      0.86      1171

Random Forest

              precision    recall  f1-score   support

         0.0       0.97      0.54      0.69       573
         1.0       0.69      0.98      0.81       598

    accuracy                           0.77      1171
   macro avg       0.83      0.76      0.75      1171
weighted avg       0.83      0.77      0.75      1171


1 Answer


In your results you can observe the usual problem with imbalanced data: the classifier favors the majority class 0 (I assume this is the "ham" class). In other words, it tends to assign "ham" to instances which are actually "spam" (false negative errors). You can think of it like this: on the "easy" instances the classifier predicts the correct class, but on the difficult instances (where the classifier "doesn't know") it chooses the majority class because that is the most likely one.
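
You can see this directly in the confusion matrix posted above: of the 74 actual spam emails, 48 are predicted as ham and only 26 are caught, which is exactly the low spam recall in the report:

    recall(spam) = 26 / (26 + 48) ≈ 0.35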

There are many things you could do:

  • Undersampling the majority class or oversampling the minority class is the easy way to deal with class imbalance.
  • Better feature engineering is more work, but it is often how you get the biggest improvement. For example, I guess that you use all the words in the emails as features, right? Then you probably have too many features, and that probably causes overfitting; try reducing dimensionality by removing rare words (see the sketch after this list).
  • Try different models, for instance Naive Bayes or Decision Trees. Btw Decision Trees are a good way to investigate what happens inside the model.
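
As an illustration of the rare-words point, here is a minimal sketch assuming scikit-learn, where `emails` and `labels` are placeholder names for your raw texts and targets, and `min_df=5` is just an example threshold:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # emails: list of raw email strings, labels: 0 = ham, 1 = spam
    train_texts, test_texts, y_train, y_test = train_test_split(
        emails, labels, test_size=0.2, random_state=40, stratify=labels)

    # min_df=5 drops words appearing in fewer than 5 training emails,
    # which shrinks the vocabulary and limits overfitting on rare words
    vectorizer = CountVectorizer(min_df=5)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)   # reuse the training vocabulary

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))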
Erwan
  • Hello Erwan, thank you so much for your answer and tips. I have tried to apply the first and the last bullets in your answer and I think the results have slightly improved. Could you please confirm and let me know what you think: do you think better feature engineering would improve it further, or do you think it overfits? Thanks a lot – LdM Dec 07 '20 at 00:45
  • @LdM: resampling is simple, but there is something to be careful about: here I assume you split into training/test sets after resampling, right? That is not correct, because this way the test set does not follow the true class proportions, so there is a good chance that the performance is over-estimated. You should split first, resample the training set only, then evaluate on the test set. – Erwan Dec 07 '20 at 00:59
  • Thanks, Erwan. So what I did was: `smote_over_sample = SMOTE(sampling_strategy='minority')  # Testing Count Vectorizer  X, y = bag_of_words(df)  X, y = smote_over_sample.fit_resample(X, y)  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)  smote_result = smote_result.append(training_log(X_train, X_test, y_train, y_test, 'Count Vectorize'), ignore_index=True)` Is it wrong? I have followed this example: https://www.kaggle.com/ruzarx/oversampling-smote-and-adasyn – LdM Dec 07 '20 at 01:02
  • Yes, this is wrong; the test results are biased this way. It's a common error, we see it regularly on DataScienceSE, for instance in this question. See also this one. – Erwan Dec 07 '20 at 01:16
  • Thanks a lot, Erwan. I was wondering why I was getting such good results. I am worried that many papers and tutorials I am following applied resampling in the wrong order. So what I need to do is: `smote_over_sample = SMOTE(sampling_strategy='minority')  # Testing Count Vectorizer  X, y = bag_of_words(df)  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)  X, y = smote_over_sample.fit_resample(X_train, y_train)` Am I right or am I missing steps? – LdM Dec 07 '20 at 01:19
  • I have tried the code I mentioned in the comment above; however, if I apply this, I still get imbalanced classes, while if I apply the one I mentioned earlier (the wrong one), it actually resamples the data – LdM Dec 07 '20 at 01:26
  • @LdM Sorry, I don't really understand the issue, maybe because I'm not very familiar with scikit functions. I can only tell you the principle: (1) split training/test; (2) apply resampling on the training set, so that the model is trained with balanced data (note that you may do some evaluation on the training set or some part of it if you want, but that's not the final evaluation); (3) apply the model to the test set and evaluate on it. The test set is still imbalanced, as it is supposed to be. Normally you should see fewer false negative errors but probably more false positive errors (a sketch of this order follows these comments). – Erwan Dec 07 '20 at 01:50
  • Thanks a lot, Erwan! I will try to implement code in the right way as you suggested. Many thanks for your time and help! – LdM Dec 07 '20 at 02:01
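
A minimal sketch of the split-first, resample-the-training-set-only order described in the comments above, assuming scikit-learn and imbalanced-learn (`bag_of_words`, `df` and `training_log` are the names from LdM's comments and are assumed to work as shown there):

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split

    # 1) vectorize, then split BEFORE any resampling
    X, y = bag_of_words(df)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=40)

    # 2) resample the training set only; give the resampled arrays
    #    their own names so the imbalanced test set stays untouched
    smote_over_sample = SMOTE(sampling_strategy='minority')
    X_train_res, y_train_res = smote_over_sample.fit_resample(X_train, y_train)

    # 3) train on the balanced training data, evaluate on the untouched test set
    result = training_log(X_train_res, X_test, y_train_res, y_test, 'Count Vectorize')

The test set keeps the original class proportions, so the reported scores stay comparable to the first (imbalanced) results.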