BERT is a pre-trained model which can be fine-tuned for text classification. How can I extract local features using BERT?
1 Answer
First, fine-tuning BERT is different from extracting features from it.
In feature extraction, you normally take BERT's output, together with the internal representations of some or all of BERT's layers, and then train a separate model on those features.
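As a rough sketch of feature extraction with the Transformers library (the model name `bert-base-uncased` and the example sentence are just illustrative assumptions):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("BERT gives contextual features.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

last_layer = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
all_layers = outputs.hidden_states      # tuple: embedding layer + one tensor per encoder layer
# These fixed tensors can now be fed as features into a separate downstream model.
```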
In fine-tuning, you re-train the whole BERT model on the specific task you are interested in (e.g. classification). Normally, you choose a very low learning rate to avoid catastrophic forgetting.
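A minimal fine-tuning sketch, assuming a binary classification task and a learning rate of 2e-5 (both values are illustrative, not prescribed):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # low LR to limit catastrophic forgetting

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
outputs = model(**batch, labels=labels)  # classification loss is computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```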
To choose which approach is appropriate in your case, you can check this article (basically you will probably get better performance with fine-tuning).
BERT can give you both sentence-level and token-level representations. For sentence-level representations, you just take BERT's representation at the first token position, which is the special token [CLS]
(see this). For token-level representations, you just take the representations of the token at the position you need. Take into account that BERT uses subword-level tokens, so it won't give you word-level representations (see this).
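To illustrate the difference, here is a sketch contrasting the sentence-level ([CLS]) and token-level representations; the model name and the token position picked below are arbitrary examples:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("Subword tokenization splits uncommon words.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

sentence_repr = hidden[:, 0, :]  # position 0 is the [CLS] token
token_repr = hidden[:, 3, :]     # representation of the subword token at position 3 (example)

# Shows how the sentence was split into subword tokens
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
```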
Regardless of the approach you choose, you can use the Huggingface Transformers Python library, which has many examples online (e.g. for fine-tuning and feature extraction).
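For instance, the library's built-in `feature-extraction` pipeline returns per-token vectors in just a couple of lines (the model name is again an illustrative assumption):

```python
import numpy as np
from transformers import pipeline

extractor = pipeline("feature-extraction", model="bert-base-uncased")
features = extractor("How to extract local features using BERT")
print(np.array(features).shape)  # roughly (1, num_tokens, hidden_size)
```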
