
I am reading this article on how to use BERT by Jay Alammar and I understand things up until:

For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything else.

I have read this topic, but still have some questions:

Isn't the [CLS] token at the very beginning of each sentence? Why is it that "we are only interested in BERT's output for the [CLS] token"? Can anyone help me get my head around this? Thanks!

user3768495

4 Answers


[CLS] stands for classification, and it's there to represent sentence-level classification.

In short, this token was introduced to make BERT's pooling scheme work. I suggest reading this blog, where this is also covered in detail.
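
To make the pooling concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library with a PyTorch backend; the model name and example sentence are just illustrative):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a visually stunning rumination on love", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size);
# the [CLS] token always sits at position 0, so this slice is the
# sentence-level vector the pooling scheme is built around.
cls_vector = outputs.last_hidden_state[:, 0, :]  # shape: (batch, 768)

# BERT's built-in "pooler" is this same [CLS] slice passed through a
# tanh-activated dense layer.
pooled = outputs.pooler_output                   # shape: (batch, 768)
```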

Noah Weber

[CLS] stands for classification. It is added at the beginning because the training task here is sentence classification. And because they need an input that can represent the meaning of the entire sentence, they introduce a new tag.

They can't take just any other word from the input sequence, because that word's output is its own word representation. So they add a tag that has no other purpose than being a sentence-level representation for classification.
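
To make that concrete, here is a rough sketch of a classifier sitting on top of the [CLS] output; the single linear layer and the number of classes are illustrative choices, not something prescribed by BERT itself:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class CLSSentenceClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # A single linear layer maps the 768-dim [CLS] vector to class logits.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0, :]  # only the [CLS] slice
        return self.classifier(cls_vector)
```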

Malgo

In order to better understand the role of [CLS], let's recall that the BERT model was trained on 2 main tasks:

  1. Masked language modeling: some random words are masked with the [MASK] token, and the model learns to predict those words during training. For that task we need the [MASK] token.
  2. Next sentence prediction: given 2 sentences, the model learns to predict whether the 2nd sentence really follows the 1st sentence. For this task, we need another token whose output tells us how likely it is that the current sentence is the next sentence of the 1st sentence. And here comes [CLS]. You can think of the output of [CLS] as a probability.

Now you may ask: instead of using [CLS]'s output, can we just output a number (as a probability)? Yes, we could do that if predicting the next sentence were a separate task. However, BERT was trained on both tasks simultaneously. Organizing inputs and outputs in such a format (with both [MASK] and [CLS]) helps BERT learn both tasks at the same time and boosts its performance.

When it comes to a classification task (e.g. sentiment classification), as mentioned in other answers, the output of [CLS] can be helpful because it contains BERT's understanding at the sentence level.
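
If it helps, here is a small sketch of the next-sentence-prediction setup, assuming the Hugging Face `transformers` library; the sentence pair is made up, but it shows how the "is this the next sentence" probability is read off the [CLS] position:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

first = "The man went to the store."
second = "He bought a gallon of milk."
encoding = tokenizer(first, second, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits  # shape: (1, 2)

# Softmax over the two classes ("is next" vs "is not next") gives the
# probability discussed above; it is computed from the [CLS] position only.
prob_is_next = torch.softmax(logits, dim=-1)[0, 0].item()
print(f"P(second follows first) = {prob_is_next:.3f}")
```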

hoang tran
  • As I understand it, the [CLS] token is a representation of the whole text (sentence 1 and sentence 2), which means the model was trained so that the [CLS] token carries the probability that the 2nd sentence is the next sentence of the 1st. So how can people generate sentence embeddings from [CLS] tokens? How does [CLS] carry the meaning of the sentence in a single-sentence classification task without ever being trained for it? – canP Jul 14 '22 at 04:02

Here is my understanding:

(1) [CLS] appears at the very beginning of each sentence; it has a fixed embedding and a fixed positional embedding, so the token itself contains no information. (2) However, the output of [CLS] is computed from all the other words in the sentence, so [CLS] ends up containing the information carried by those words.

This makes [CLS] a good representation for sentence-level classification.

Of course, you may use an average vector instead; that makes sense, too.
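
For comparison, here is a small sketch (assuming the Hugging Face `transformers` library with PyTorch) that computes both the [CLS] vector and an averaged vector from the same output:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT pools the whole sentence into one vector.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)

# Option 1: the [CLS] slice at position 0.
cls_vector = hidden[:, 0, :]

# Option 2: average all token vectors, ignoring padding via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
mean_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_vector.shape, mean_vector.shape)       # both: torch.Size([1, 768])
```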

BigMoyan