BERT pre-trains the special `[CLS]` token on the NSP task: for every pair `A-B`, it predicts whether sentence B follows sentence A in the corpus or not.
When fine-tuning BERT for single-sentence classification (e.g. spam vs. not spam), it is recommended to feed a degenerate pair `A-null` and use the `[CLS]` token output for our task.
How does that make sense? In the pre-training stage BERT never saw such pairs, so how does it handle them just fine and "know" that, since there is no sentence B, it should extract the meaning of sentence A rather than the relation between A and B?
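For concreteness, this is what I understand the degenerate `A-null` input to look like in practice. This is a minimal sketch, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (both my own choices for illustration, not something prescribed by the paper):

```python
# Sketch of the "degenerate pair" A-null: only sentence A, no second segment.
# Assumption: Hugging Face `transformers`, `bert-base-uncased`.
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer("Win a free prize now", return_tensors="pt")
# The input is [CLS] A [SEP] with no sentence B, and all segment ids are 0:
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
print(enc["token_type_ids"])  # tensor of all zeros -> only segment A present

# The classification head sits on top of the [CLS] (pooled) output and is
# trained during fine-tuning.
logits = model(**enc).logits  # shape (1, 2): spam vs. not-spam scores
```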
Is there another practice of fine-tuning the model with the pairs `A-spam` and `A-notspam` for every sentence A, and seeing which pair gets the better NSP score? Or is that totally equivalent to fine-tuning with `A-null`?
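To make that second idea concrete, here is a rough sketch of the kind of pairing and scoring I have in mind, shown here with the pre-trained NSP head before any fine-tuning. Again this assumes Hugging Face `transformers` and the label sentences "this is spam" / "this is not spam" are just placeholders of my own:

```python
# Sketch of scoring A-"spam" vs. A-"not spam" with the NSP head.
# Assumption: Hugging Face `transformers`, `bert-base-uncased`.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

def nsp_score(sent_a: str, sent_b: str) -> float:
    """Probability (under the NSP head) that sent_b follows sent_a."""
    enc = tokenizer(sent_a, sent_b, return_tensors="pt")
    logits = model(**enc).logits  # index 0 = "is next", index 1 = "random"
    return torch.softmax(logits, dim=-1)[0, 0].item()

sentence_a = "Win a free prize now"
# Which "label sentence" does the NSP head prefer as a continuation?
print(nsp_score(sentence_a, "this is spam"))
print(nsp_score(sentence_a, "this is not spam"))
```

The question is whether fine-tuning on such pairs is an established practice, and whether it ends up equivalent to the standard `A-null` setup.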