For any NER task, we need a sequence of words and their corresponding labels. To extract features for these words from BERT, the words first need to be tokenized into subwords.
For example, the word 'infrequent' (with label B-count) will be tokenized into ['in', '##fr', '##e', '##quent']. How should its label be represented?
According to the BERT paper, "We use the representation of the first sub-token as the input to the token-level classifier over the NER label set".
So, for the subwords ['in', '##fr', '##e', '##quent'], should the labels be ['B-count', 'B-count', 'B-count', 'B-count'], where we propagate the word's label to every subword? Or should they be ['B-count', 'X', 'X', 'X'], where we keep the original label on the first subword and use a special label 'X' for the remaining subwords of that word?
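To make the two options concrete, here is a minimal sketch of both alignment strategies. The function name `align_labels` and the strategy names are my own (not from the BERT paper or any library), and the subword split is taken as given rather than produced by an actual tokenizer:

```python
def align_labels(label, subwords, strategy="propagate"):
    """Assign a label to each subword of a single word.

    strategy="propagate": repeat the word's label on every subword.
    strategy="mask":      keep the label on the first subword and
                          use the placeholder "X" for the rest.
    """
    if strategy == "propagate":
        return [label] * len(subwords)
    if strategy == "mask":
        return [label] + ["X"] * (len(subwords) - 1)
    raise ValueError(f"unknown strategy: {strategy}")


subwords = ["in", "##fr", "##e", "##quent"]
print(align_labels("B-count", subwords, "propagate"))
# ['B-count', 'B-count', 'B-count', 'B-count']
print(align_labels("B-count", subwords, "mask"))
# ['B-count', 'X', 'X', 'X']
```

Under the "mask" strategy, the 'X' positions are typically excluded from the loss (e.g. by mapping them to an ignored label id), so only the first subword's prediction is trained and evaluated, which matches the first-sub-token convention quoted above.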
Any help will be appreciated.