46

I am currently training a neural network and I cannot decide which to use to implement my early stopping criterion: the validation loss or a metric such as accuracy/F1-score/AUC computed on the validation set.

In my research, I came upon articles defending both standpoints. Keras seems to default to the validation loss but I have also come across convincing answers for the opposite approach (e.g. here).

Does anyone have guidance on when to prefer the validation loss and when to use a specific metric?

Green Falcon
qmeeus

4 Answers

36

TL;DR: Monitor the loss rather than the accuracy

I will answer my own question since I think that the answers received missed the point and someone might have the same problem one day.

First, let me quickly clarify that using early stopping is perfectly normal when training neural networks (see the relevant sections in Goodfellow et al.'s Deep Learning book, most DL papers, and the documentation for Keras' EarlyStopping callback).

Now, regarding the quantity to monitor: prefer the loss to the accuracy. Why? The loss quantifies how certain the model is about a prediction (basically, having a value close to 1 for the right class and close to 0 for the other classes). The accuracy merely accounts for the number of correct predictions. Similarly, any metric that uses hard predictions rather than probabilities has the same problem.
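
To make this concrete, here is a tiny illustrative calculation (the numbers are made up; accuracy_score and log_loss come from scikit-learn):

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([1, 1, 1, 1])

# Two hypothetical models that both classify every example correctly...
p_confident = np.array([0.99, 0.98, 0.97, 0.99])  # high confidence
p_hesitant = np.array([0.51, 0.52, 0.51, 0.53])   # barely above the 0.5 threshold

for name, p in [("confident", p_confident), ("hesitant", p_hesitant)]:
    acc = accuracy_score(y_true, (p > 0.5).astype(int))
    ce = log_loss(y_true, p, labels=[0, 1])
    print(f"{name}: accuracy={acc:.2f}, cross-entropy={ce:.3f}")

# Both models reach accuracy 1.00, but the cross-entropy (~0.02 vs ~0.66)
# clearly separates the confident model from the hesitant one.
```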

Obviously, whatever metric you end up choosing, it has to be calculated on a validation set and not on the training set (otherwise, you are completely missing the point of using EarlyStopping in the first place).
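
For reference, this is what it looks like with Keras' EarlyStopping callback (a minimal sketch; the model and the data variables are placeholders):

```python
from tensorflow import keras

# Stop when the *validation loss* stops improving, not the accuracy.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss",         # quantity to watch on the validation set
    patience=10,                # epochs to wait for an improvement before stopping
    restore_best_weights=True,  # roll back to the weights of the best epoch
)

# `model`, `x_train`, `y_train`, `x_val` and `y_val` are assumed to exist.
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=1000,                # an upper bound; early stopping usually ends training sooner
    callbacks=[early_stopping],
)
```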

qmeeus
  • @qmeeus sorry if I am missing your point, but why is loss better than accuracy? We want to do well on the accuracy at "test time" so I'd personally track the accuracy not the loss. The loss is usually a made up quantity that upper bounds what we really want to do (convex surrogate functions). – Charlie Parker Mar 04 '21 at 17:53
  • 2
    @CharlieParker, accuracy is rarely what you want (problem with class imbalance, etc.) but even ignoring this problem, a model that predicts each example correctly with a large confidence is preferable to a model that predicts each example correctly with a 51% confidence. The accuracy being a discrete transformation of the class probabilities, it does not allow you to make this distinction. Cross-entropy does. It does not impact the error rate on out of distribution samples but what does anyway? ;) – qmeeus Mar 05 '21 at 09:16
6

In my opinion, this is subjective and problem-specific. You should use whichever factor is most important to you as the driving metric, as this might make your decisions on how to alter the model better focused.

Most metrics one can compute will be correlated/similar in many ways: e.g. if you use MSE for your loss, then recording MAPE (mean absolute percentage error) or the simple $L_1$ loss will give you comparable loss curves.

For example, if you will report an F1-score in your report/to your boss etc. (and assuming that is what they really care about), then using that metric could make most sense. The F1-score, for example, takes precision and recall into account i.e. it describes the relationship between two more fine-grained metrics.

Bringing those things together, computing scores other than the normal loss can be nice for the overview and to see how your final metric is optimised over the course of the training iterations. That relationship could perhaps give you a deeper insight into the problem.

It is usually best to try several options, however, as optimising for the validation loss may allow training to run for longer, which eventually may also produce a superior F1-score. Precision and recall might sway around some local minima, producing an almost static F1-score - so you would stop training. If you had been optimising for pure loss, you might have recorded enough fluctuation in loss to allow you to train for longer.
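
If you do decide to drive early stopping from a metric rather than the loss, one way to do it (sketched here with Keras since the question mentions it; the model and data variables are placeholders, and AUC is used only because it ships with tf.keras, while F1 would need a custom metric or callback) is to log the metric and point EarlyStopping at it with mode="max":

```python
from tensorflow import keras

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="auc")],  # logged as "auc" / "val_auc"
)

early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_auc",  # watch the metric instead of the loss
    mode="max",         # higher AUC is better, so stop when it stops increasing
    patience=10,
    restore_best_weights=True,
)

model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=1000,
    callbacks=[early_stopping],
)
```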

n1k31t4
1

Usually the loss function is just a surrogate, because we cannot optimize the metric of interest directly. If the metric is representative of the task (ideally of the business value), then the value of that metric on the evaluation dataset is a better signal than the loss on that dataset. For instance, if data imbalance is a serious problem, try the PR curve.
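
As a rough sketch of the PR-curve suggestion (y_val and val_probs are hypothetical validation labels and predicted probabilities from your model), scikit-learn can compute it directly:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# y_val: true labels of the validation set, val_probs: predicted P(y = 1)
precision, recall, thresholds = precision_recall_curve(y_val, val_probs)

# Average precision summarises the PR curve as a single number,
# which makes it usable as an early-stopping criterion on imbalanced data.
ap = average_precision_score(y_val, val_probs)
print(f"Average precision on the validation set: {ap:.3f}")
```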

Lerner Zhang
-1

I am currently training a neural network and I cannot decide which to use to implement my early stopping criterion: the validation loss or a metric such as accuracy/F1-score/AUC computed on the validation set.

If you are training a deep network, I highly recommend that you not use early stopping. In deep learning it is not very customary; instead, you can employ other techniques like dropout to generalize well. If you insist on it, the choice of criterion depends on your task. If you have imbalanced data, you should use the F1 score and evaluate it on your cross-validation data. If you have balanced data, use accuracy on your cross-validation data. Other techniques depend highly on your task.

I highly encourage you to find a model which fits your data very well and employ dropout after that. This is the most customary approach for deep models.
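
For completeness, a minimal Keras sketch of the dropout suggestion (the layer sizes, the dropout rate and num_features are arbitrary placeholders):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(num_features,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),  # randomly zero out 50% of activations during training
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation="sigmoid"),
])
```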

Green Falcon
  • 4
    I am using dropout as well. However, I can't find a reason why early stopping should not be used though... – qmeeus Aug 21 '18 at 19:21
  • Early stopping tries to solve both the learning and the generalization problems. On the other hand, dropout just tries to overcome the generalization problem. – Green Falcon Aug 21 '18 at 19:35
  • 2
    You don't answer my question... I don't deny the fact that dropout is useful and should be used to protect against overfitting, I couldn't agree more on that. My question is: why do you say that early stop should not be used with ANN? (cf your first sentence: If you are training a deep network, I highly recommend you not to use early stop.) – qmeeus Aug 21 '18 at 20:22
  • Did you read my last comment? It exactly answers your question. It's a famous quote from Prof. Ng in his deep learning class, second course. The latter case is an easier task because you are not struggling to solve multiple tasks simultaneously. – Green Falcon Aug 21 '18 at 21:43
  • Interesting! In my experience, I found early stopping quite useful because you don't lose time and resources on models that do not generalize. Even using dropout does not make a poorly configured model generalize better. If doing some kind of automated grid search, it is often useful to have a mechanism to stop the training if necessary without human intervention. – qmeeus Aug 22 '18 at 07:35
  • As I referred in the answer, I highly encourage you to find a model which fits your data very well and employ drop out after that. – Green Falcon Aug 22 '18 at 08:23
  • 3
    And in order to find it and find the right set of hyperparameters, I'm employing some kind of directed grid search with early stopping for the reasons I explained above. Point taken though, and once I have selected the final model and train it, I will not use early stopping. Thank you for this interesting discussion and for your advice. – qmeeus Aug 22 '18 at 11:37
  • Early stop is an ugly way to solve overfitting. It's ugly because you are very dependent on the luck of your initialization and learning algorithm. Besides if you change any detail about those, the impact will be huge. If overfitting is a problem, it is wiser to just reduce the size of the neural network or use other regularization techniques. – Ricardo Cruz Aug 23 '18 at 21:26
  • From Deep Learning (Goodfellow et al., available at http://www.deeplearningbook.org), section 11.2: "(...) you should include some mild forms of regularization from the start. Early stopping should be used almost universally. Dropout is an excellent regularizer that is easy to implement (...). Batch normalization also sometimes reduces generalization error (...)." In any case, early stopping or not, how good a model is must not depend on a lucky initialisation. However, choosing the right initialisation method (see Glorot) and learning algorithm plays a role indeed. – qmeeus Sep 19 '18 at 13:30
  • Whether you should include some mild forms of regularization from the start highly depends on your task, at least based on my experience. I prefer not to do that for regression tasks where there are numerous real-valued outputs. – Green Falcon Sep 21 '18 at 18:24
  • Early stopping is an essential mechanism when training neural networks. If you like, you can train your model for some fixed number of epochs and then add the early stopping mechanism to make sure that the model doesn't waste more time on pointless training. In the case of imbalanced datasets, the F1 metric can also be used in the early stopping mechanism; in that case, your model should monitor the F1 score rather than the validation loss. However, monitoring the metric rather than the loss could result in badly calibrated output probabilities from the model. – LazyAnalyst Aug 26 '22 at 11:27
  • Dear @LazyAnalyst, this suggestion is not mine. I heard about it years ago from Prof. Ng in his popular Deep Learning course. The main reason he suggested it was that early stopping tries to handle multiple things simultaneously with a single tool, whereas you can use more appropriate tools for each specific case. For dealing with overfitting, we have better ideas that are orthogonal to the training phase and do not affect that process. – Green Falcon Aug 26 '22 at 11:32