For any NER task, we need a sequence of words and their corresponding labels. To extract features for these words from BERT, the words first need to be tokenized into subwords.
For example, the word 'infrequent' (with label B-count) will be tokenized into ['in', '##fr', '##e', '##quent']. How should its label be represented?
According to the BERT paper, "We use the representation of the first sub-token as the input to the token-level classifier over the NER label set".
So, for the subwords ['in', '##fr', '##e', '##quent'], should the labels be ['B-count', 'B-count', 'B-count', 'B-count'], where we propagate the word's label to every subword? Or should they be ['B-count', 'X', 'X', 'X'], where we keep the original label on the first subword and use a special label 'X' for the remaining subwords of that word?
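To make the two options concrete, here is a minimal sketch of both alignment strategies. The function name `align_labels` and the strategy names are my own (not from the BERT paper or any library), and the subword split is taken as given rather than produced by an actual tokenizer:

```python
def align_labels(label, subwords, strategy="propagate"):
    """Assign a label to each subword of a single word.

    strategy="propagate": repeat the word's label on every subword.
    strategy="mask":      keep the label on the first subword and
                          use the placeholder "X" for the rest.
    """
    if strategy == "propagate":
        return [label] * len(subwords)
    if strategy == "mask":
        return [label] + ["X"] * (len(subwords) - 1)
    raise ValueError(f"unknown strategy: {strategy}")


subwords = ["in", "##fr", "##e", "##quent"]
print(align_labels("B-count", subwords, "propagate"))
# ['B-count', 'B-count', 'B-count', 'B-count']
print(align_labels("B-count", subwords, "mask"))
# ['B-count', 'X', 'X', 'X']
```

Under the "mask" strategy, the 'X' positions are typically excluded from the loss (e.g. by mapping them to an ignored label id), so only the first subword's prediction is trained and evaluated, which matches the first-sub-token convention quoted above.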
Any help will be appreciated.