
I am reading this article on how to use BERT by Jay Alammar and I understand things up until:

For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything else.

I have read this topic, but still have some questions:

Isn't the [CLS] token at the very beginning of each sentence? Why is it that "we are only interested in BERT's output for the [CLS] token"? Can anyone help me get my head around this? Thanks!

user3768495

4 Answers


[CLS] stands for classification, and it's there to represent sentence-level classification.

In short, this token was introduced to make BERT's pooling scheme work. I suggest reading this blog, where this is also covered in detail.
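
To make the pooling concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library with a PyTorch backend; the model name and example sentence are just illustrative):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("a visually stunning rumination on love", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size);
# the [CLS] token always sits at position 0, so this slice is the
# sentence-level vector the pooling scheme is built around.
cls_vector = outputs.last_hidden_state[:, 0, :]  # shape: (batch, 768)

# BERT's built-in "pooler" is this same [CLS] slice passed through a
# tanh-activated dense layer.
pooled = outputs.pooler_output                   # shape: (batch, 768)
```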

Noah Weber

[CLS] stands for classification. It is added at the beginning because the training task here is sentence classification. And because they need an input that can represent the meaning of the entire sentence, they introduce a new tag.

They can't take just any other word from the input sequence, because that word's output is its own word representation. So they add a tag that has no other purpose than being a sentence-level representation for classification.
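
To make that concrete, here is a rough sketch of a classifier sitting on top of the [CLS] output; the single linear layer and the number of classes are illustrative choices, not something prescribed by BERT itself:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class CLSSentenceClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # A single linear layer maps the 768-dim [CLS] vector to class logits.
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0, :]  # only the [CLS] slice
        return self.classifier(cls_vector)
```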

Malgo

In order to better understand the role of [CLS], let's recall that the BERT model was trained on 2 main tasks:

  1. Masked language modeling: some random words are masked with the [MASK] token, and the model learns to predict those words during training. For that task we need the [MASK] token.
  2. Next sentence prediction: given 2 sentences, the model learns to predict whether the 2nd sentence really follows the 1st sentence. For this task, we need another token whose output tells us how likely it is that the current sentence is the next sentence of the 1st sentence. And here comes [CLS]. You can think of the output of [CLS] as a probability.

Now you may ask: instead of using [CLS]'s output, can we just output a number (as a probability)? Yes, we could do that if predicting the next sentence were a separate task. However, BERT was trained on both tasks simultaneously. Organizing inputs and outputs in such a format (with both [MASK] and [CLS]) helps BERT learn both tasks at the same time and boosts its performance.

When it comes to a classification task (e.g. sentiment classification), as mentioned in other answers, the output of [CLS] can be helpful because it contains BERT's understanding at the sentence level.
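
If it helps, here is a small sketch of the next-sentence-prediction setup, assuming the Hugging Face `transformers` library; the sentence pair is made up, but it shows how the "is this the next sentence" probability is read off the [CLS] position:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

first = "The man went to the store."
second = "He bought a gallon of milk."
encoding = tokenizer(first, second, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits  # shape: (1, 2)

# Softmax over the two classes ("is next" vs "is not next") gives the
# probability discussed above; it is computed from the [CLS] position only.
prob_is_next = torch.softmax(logits, dim=-1)[0, 0].item()
print(f"P(second follows first) = {prob_is_next:.3f}")
```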

hoang tran
  • As I understand it, the [CLS] token is a representation of the whole text (sentence 1 and sentence 2), which means the model was trained so that the [CLS] token carries the probability that the 2nd sentence is the next sentence of the 1st. So how can people generate sentence embeddings from [CLS] tokens? How does [CLS] carry the meaning of the sentence in a single-sentence classification task without ever being trained for it? – canP Jul 14 '22 at 04:02

Here is my understanding:

(1) [CLS] appears at the very beginning of each sentence; it has a fixed embedding and a fixed positional embedding, so the token itself contains no information. (2) However, the output of [CLS] is computed from all the other words in the sentence, so [CLS] ends up containing the information carried by those words.

This makes [CLS] a good representation for sentence-level classification.

Of course, you may use an average vector instead; that makes sense, too.
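
For comparison, here is a small sketch (assuming the Hugging Face `transformers` library with PyTorch) that computes both the [CLS] vector and an averaged vector from the same output:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT pools the whole sentence into one vector.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)

# Option 1: the [CLS] slice at position 0.
cls_vector = hidden[:, 0, :]

# Option 2: average all token vectors, ignoring padding via the attention mask.
mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
mean_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_vector.shape, mean_vector.shape)       # both: torch.Size([1, 768])
```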

BigMoyan