
For any NER task, we need a sequence of words and their corresponding labels. To extract features for these words from BERT, they need to be tokenized into subwords.

For example, the word 'infrequent' (with label B-count) will be tokenized into ['in', '##fr', '##e', '##quent']. How will its label be represented?

According to the BERT paper, "We use the representation of the first sub-token as the input to the token-level classifier over the NER label set".

So, for the subwords ['in', '##fr', '##e', '##quent'], should the labels be ['B-count', 'B-count', 'B-count', 'B-count'], where we propagate the word's label to all of its subwords? Or should they be ['B-count', 'X', 'X', 'X'], where we keep the original label only on the first sub-token of the word and use a dummy label 'X' for the remaining subwords?
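To make the two options concrete, here is a minimal sketch of what I mean (assuming a fast Hugging Face tokenizer such as bert-base-cased; the exact subword split depends on the vocabulary, so the lists above are only illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

words = ["infrequent"]
word_labels = ["B-count"]

encoding = tokenizer(words, is_split_into_words=True, add_special_tokens=False)
subwords = encoding.tokens()    # subword pieces, e.g. something like ['in', '##fr', '##e', '##quent']
word_ids = encoding.word_ids()  # which word each subword came from, e.g. [0, 0, 0, 0]

# Option 1: propagate the word label to every subword.
labels_option1 = [word_labels[i] for i in word_ids]

# Option 2: keep the label on the first subword, dummy 'X' on the rest.
labels_option2 = []
previous = None
for i in word_ids:
    labels_option2.append(word_labels[i] if i != previous else "X")
    previous = i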

Any help will be appreciated.

PinkBanter
  • As far as I've understood, you should label only the root (first sub-token) of a word that gets split into subwords. I'm facing similar issues with a WSD task. After BERT, you need a custom layer that removes the subwords... how to do so is still a mystery to me – Gianmarco F. Mar 15 '20 at 18:53
  • Yes, we need to combine the subword token embeddings with some method (perhaps averaging over them) and get rid of the dummy (X) subword labels; see the sketch after this comment thread. I still need to figure that out myself. – PinkBanter Mar 17 '20 at 13:38
  • @adjective_noun Hi, have you figured it out yet? – Tengerye Apr 10 '21 at 08:57
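On the comments above about collapsing subword representations back to one vector per word: a minimal sketch, assuming a fast tokenizer and a plain BERT encoder (variable names are only illustrative), is to group the hidden states by word_ids() and average them:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

words = ["This", "word", "is", "infrequent"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]      # (num_subwords, hidden_size)

# Average the subword vectors belonging to the same word; special tokens
# ([CLS]/[SEP], whose word id is None) are skipped.
word_ids = enc.word_ids()
word_vectors = torch.stack([
    hidden[[i for i, wid in enumerate(word_ids) if wid == w]].mean(dim=0)
    for w in range(len(words))
])                                                  # (num_words, hidden_size)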

2 Answers


Method 2 is the correct one.

Leave the actual label of the word only on the first sub-token; the other sub-tokens get a dummy label (in this case 'X'). The important thing is that when calculating the loss (e.g., cross-entropy loss) and metrics (e.g., F1), these 'X' labels on the sub-tokens are not taken into account.

This is also the reason we don't use method 1: it would introduce extra labels of the type B-count and inflate the support count for that class (which would make the test set no longer comparable with models that do not increase the number of labels for that class).
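For example, with PyTorch's CrossEntropyLoss you can encode the ignored sub-token positions as -100 so they never contribute to the loss (a minimal sketch; the logits are random and only for illustration):

import torch
import torch.nn as nn

# One row of logits per subword of 'infrequent', 3 entity classes for illustration.
logits = torch.randn(4, 3)
# 'B-count' (class id 1 here) on the first subword, -100 (the 'X' role) on the rest.
labels = torch.tensor([1, -100, -100, -100])

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 is also the default ignore_index
loss = loss_fn(logits, labels)                    # only the first subword contributes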

PinkBanter

I am using Hugging Face for NER. To solve this problem, you can refer to this page: https://huggingface.co/docs/transformers/tasks/token_classification. Specifically, the code for aligning the labels and for the final evaluation is as follows:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map sub-tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first sub-token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
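This function is then mapped over the dataset in batched mode (a sketch assuming the wnut_17 dataset and tokenizer used in the linked tutorial):

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("wnut_17")   # any token-classification dataset with "tokens" and "ner_tags"
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

For the evaluation, compute_metrics (below) drops the -100 positions before handing the predictions to seqeval: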

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

# label_list holds the string names of the NER tags, e.g.
# label_list = dataset["train"].features["ner_tags"].feature.names
# Quick sanity check for a single example (e.g. example = dataset["train"][0]):
labels = [label_list[i] for i in example["ner_tags"]]

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }