
Can we say that BERT extracts local features? For example, consider the input "This is my first sentence. This is my second sentence." How does BERT extract features here: is attention computed for each sentence separately, or over the input as a whole?

SS Varshini

1 Answer


BERT's self-attention is computed for every pair of tokens: if the input sequence has $N$ tokens, then attention weights are computed over all $N^2$ token pairs.
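As a minimal sketch of what this means (plain NumPy, with illustrative dimensions rather than BERT's real ones), a single self-attention head turns $N$ token vectors into an $N \times N$ matrix of attention weights:

```python
import numpy as np

N, d = 12, 64                      # N tokens, head dimension d (illustrative values)
rng = np.random.default_rng(0)
Q = rng.normal(size=(N, d))        # queries, one row per token
K = rng.normal(size=(N, d))        # keys, one row per token

scores = Q @ K.T / np.sqrt(d)      # compatibility score for every token pair -> (N, N)
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row

print(weights.shape)               # (12, 12): one attention weight per token pair
```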

Moreover, this attention is computed independently in each attention head of every BERT layer.
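You can inspect these weights yourself; here is a small sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint. Note that both sentences are fed as a single sequence, and each layer returns one $N \times N$ matrix per head:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "This is my first sentence. This is my second sentence."
inputs = tokenizer(text, return_tensors="pt")   # both sentences encoded as one sequence

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer (12 for bert-base),
# each of shape (batch, num_heads, N, N): attention over all token pairs
print(len(outputs.attentions))        # 12 layers
print(outputs.attentions[0].shape)    # torch.Size([1, 12, N, N])
```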

If you want to understand the self-attention patterns that are normally found in BERT, you can check this article, where you will find analyses like this one:

[Figure: visualization of common BERT self-attention patterns across heads and layers]

noe
  • From the given image, I can conclude that BERT extracts global features. Is this conclusion right? – SS Varshini Aug 02 '21 at 15:05
  • If by "global" you mean that the representation of each token also depends on the other tokens, the answer is yes. – noe Aug 02 '21 at 15:38
  • Does BERT capture both local and global semantics? – SS Varshini Sep 04 '21 at 08:01
  • Yes, BERT captures both local and global information. – noe Sep 04 '21 at 08:25
  • But each token representation depends on all other tokens, which extracts global information. – SS Varshini Sep 04 '21 at 08:46
  • For example, consider the following review: "I bought this product for my father. The mobile is working fine, but the charge is bad and the camera is good." I would like to classify this review as positive, negative, or neutral. But how does the attention between (camera, bad) differ from the attention between (camera, good)? – SS Varshini Sep 04 '21 at 10:10