
I am doing NER using a BERT model. I have encountered some words in my dataset that are not part of the BERT vocabulary, and I get an error while converting those words to IDs. Can someone help me with this?

Below is the code I am using for BERT.

```
import pandas as pd
import tensorflow_hub as hub

df = pd.read_csv("drive/My Drive/PA_AG_123records.csv", sep=",", encoding="latin1").fillna(method='ffill')

# Fetch BERT's tokenization module (run once, e.g. in Colab):
# !wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
import tokenization

module_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(module_url, trainable=True)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

tokens_list = ['hrct', 'heall', 'government', 'of', 'hem', 'snehal', 'sarjerao',
               'nawale', '12', '12', '9999', 'female', 'mobile', 'no',
               '1155812345', '3333', '3333', '3333', '41st', '3iteir', 'fillow']

max_len = 25
text = tokens_list[:max_len - 2]
input_sequence = ["[CLS]"] + text + ["[SEP]"]
print("After adding flags [CLS] and [SEP]:")
print(input_sequence)

tokens = tokenizer.convert_tokens_to_ids(input_sequence)  # fails here for out-of-vocabulary tokens
print("tokens to ids:")
print(tokens)
```


1 Answer


The problem is that you are not using BERT's tokenizer properly.

Instead of using BERT's tokenizer to actually tokenize the input text, you are splitting the text into tokens yourself in your `tokens_list`, and then asking the tokenizer for the IDs of those tokens. However, if you provide tokens that are not part of BERT's subword vocabulary, the tokenizer will not be able to handle them: `convert_tokens_to_ids` looks each token up directly in the vocabulary, so the lookup fails for out-of-vocabulary tokens.
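Here is a minimal sketch of that failure mode, reusing the `tokenizer` built in your code. It assumes the `tokenization.py` module from `tensorflow/models`, where `convert_tokens_to_ids` is a plain dictionary lookup and an unknown token raises a `KeyError`; `'hrct'` is just one of the out-of-vocabulary tokens from your `tokens_list`:

```
# 'government' is in the WordPiece vocabulary; 'hrct' is not,
# so the direct vocabulary lookup inside convert_tokens_to_ids fails.
try:
    ids = tokenizer.convert_tokens_to_ids(['government', 'hrct'])
except KeyError as err:
    print('out-of-vocabulary token:', err)  # e.g. KeyError: 'hrct'
```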

You must not do this.

Instead, you should let the tokenizer tokenize the text and then ask for the token IDs, e.g.:

```
tokens_list = tokenizer.tokenize('Where are you going?')
```

Remember, nevertheless, that BERT uses subword tokenization, so it splits the input text into pieces that can be represented with the subwords in its vocabulary.
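Putting it together, here is a minimal sketch of the correct pipeline, again reusing the `tokenizer` from your code. The exact subword splits shown in the comments are illustrative, not guaranteed; they depend on the vocabulary of the checkpoint you load:

```
text = 'snehal sarjerao nawale, mobile no 1155812345'

# Let BERT's WordPiece tokenizer split the text into subwords it knows.
tokens = tokenizer.tokenize(text)
print(tokens)  # out-of-vocabulary words come out as subword pieces,
               # e.g. something like ['sne', '##hal', 'sar', '##jer', '##ao', ...]

# Every resulting token is guaranteed to be in the vocabulary,
# so converting to IDs can no longer fail.
input_sequence = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(input_sequence)
print(input_ids)
```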
