
BERT is a pre-trained model which can be fine-tuned for text classification. How can I extract local features using BERT?

SS Varshini

1 Answer


First, fine-tuning BERT is different from extracting features from it.

  • In feature extraction, you normally take BERT's output together with the internal representation of all or some of BERT's layers, and then train some other separate model on those features.

  • In fine-tuning, you re-train the whole BERT model on the specific task you are interested in (e.g. classification). Normally, you choose a very low learning rate to avoid catastrophic forgetting.

To choose which approach is appropriate in your case, you can check this article (basically you will probably get better performance with fine-tuning).

BERT can give you both sentence-level and token-level representations. For sentence-level representations, you just take the BERT representation at the first token position, which is the special token [CLS] (see this). For token-level representations, you just take the representation of the token at the position you need. Take into account that BERT uses subword-level tokens, so it won't give you word-level representations (see this).
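For illustration, here is a minimal feature-extraction sketch with the Hugging Face Transformers library; the checkpoint name bert-base-uncased and the example text are placeholder choices, not something prescribed by the question:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    text = "This is my first sentence. This is my second sentence"
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Token-level features: one vector per (subword) token, shape (1, seq_len, 768)
    token_vectors = outputs.last_hidden_state

    # Sentence-level feature: the vector at the first position, i.e. [CLS], shape (1, 768)
    cls_vector = outputs.last_hidden_state[:, 0, :]

Note that last_hidden_state only exposes the top layer; if you also want the internal layers mentioned above, pass output_hidden_states=True to the model call and read outputs.hidden_states.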

Regardless of the approach you choose, you can use the Hugging Face Transformers Python library, which has plenty of examples available online (e.g. fine-tuning and feature extraction).
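To give a rough idea of the fine-tuning route with that library, here is a sketch using the Trainer API; the imdb dataset, the number of labels and the hyperparameters are assumptions chosen only for illustration:

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # num_labels is an assumption

    # Example classification dataset; replace with your own data
    dataset = load_dataset("imdb")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length")

    dataset = dataset.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir="bert-finetuned",
        learning_rate=2e-5,  # low learning rate to avoid catastrophic forgetting
        num_train_epochs=3,
        per_device_train_batch_size=16,
    )

    Trainer(model=model, args=args,
            train_dataset=dataset["train"],
            eval_dataset=dataset["test"]).train()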

noe
  • My question is about feature extraction, and I have read that transformers extract global features. But does BERT also extract global features, or does it include local features? For example, consider the sentence "This is my first sentence. This is my second sentence". Now, how does BERT extract the features? – SS Varshini Aug 02 '21 at 14:25
  • I added some clarifications to describe what kind of representations you can get with BERT. – noe Aug 02 '21 at 14:30
  • As I comment in the answer, the features BERT gives you are the hidden states of the self-attention layers at specific positions: the first position is for sentence-level representations, while the rest of the positions belong to the token at that position. – noe Aug 02 '21 at 14:31
  • When I have multiple sentences in the text, does the first position hold a sentence-level representation or the whole-text representation? – SS Varshini Aug 02 '21 at 14:37
Will BERT compute the attention between "first" and "second" in the example mentioned in the above comment? – SS Varshini Aug 02 '21 at 14:38
  • When you have multiple sentences, the first position will contain the representation of the whole text. In the example, BERT will internally compute the attention between tokens, not words; if BERT's vocabulary has tokens for the whole words "first" and "second" (and the actual BERT vocabulary does contain them), then it will certainly compute attention between them internally in all of the attention heads of all the model layers. – noe Aug 02 '21 at 14:42
Can we conclude that BERT extracts global features of the text but not local ones, where by local I mean sentence-level or aspect-level features when the text has multiple sentences? – SS Varshini Aug 02 '21 at 15:02
  • BERT extracts context-aware representations, meaning that the obtained vectors depend on the rest of the tokens, not just the token at the specific position. Nevertheless, a different representation is computed for each token in the sentence (apart from the described [CLS] representation). – noe Aug 02 '21 at 15:40
  • Thanks for your time and explanation. – SS Varshini Aug 02 '21 at 16:06
@noe so for this example: "[CLS] SentenceA [SEP] SentenceB", can we not extract features that belong to SentenceA if I have 2 sentences? If so, how is [CLS] able to extract meaning from one sentence, since it's not trained for it? – canP Jul 14 '22 at 03:57