
I am doing NER using a BERT model. I have encountered some words in my dataset that are not part of the BERT vocabulary, and I get an error while converting those words to IDs. Can someone help me with this?

Below is the code I am using for BERT.

```
import pandas as pd
import tensorflow_hub as hub

df = pd.read_csv("drive/My Drive/PA_AG_123records.csv", sep=",", encoding="latin1").fillna(method='ffill')

# Fetch BERT's tokenization module (run once, e.g. in Colab):
# !wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
import tokenization

module_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2'
bert_layer = hub.KerasLayer(module_url, trainable=True)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

tokens_list = ['hrct', 'heall', 'government', 'of', 'hem', 'snehal', 'sarjerao',
               'nawale', '12', '12', '9999', 'female', 'mobile', 'no',
               '1155812345', '3333', '3333', '3333', '41st', '3iteir', 'fillow']

max_len = 25
text = tokens_list[:max_len - 2]
input_sequence = ["[CLS]"] + text + ["[SEP]"]
print("After adding flags [CLS] and [SEP]:")
print(input_sequence)

tokens = tokenizer.convert_tokens_to_ids(input_sequence)  # fails here for out-of-vocabulary tokens
print("tokens to ids:")
print(tokens)
```


1 Answer


The problem is that you are not using BERT's tokenizer properly.

Instead of using BERT's tokenizer to actually tokenize the input text, you are splitting the text into tokens yourself in your `tokens_list`, and then asking the tokenizer for the IDs of those tokens. However, if you provide tokens that are not part of BERT's subword vocabulary, the tokenizer will not be able to handle them: `convert_tokens_to_ids` looks each token up directly in the vocabulary, so the lookup fails for out-of-vocabulary tokens.
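Here is a minimal sketch of that failure mode, reusing the `tokenizer` built in your code. It assumes the `tokenization.py` module from `tensorflow/models`, where `convert_tokens_to_ids` is a plain dictionary lookup and an unknown token raises a `KeyError`; `'hrct'` is just one of the out-of-vocabulary tokens from your `tokens_list`:

```
# 'government' is in the WordPiece vocabulary; 'hrct' is not,
# so the direct vocabulary lookup inside convert_tokens_to_ids fails.
try:
    ids = tokenizer.convert_tokens_to_ids(['government', 'hrct'])
except KeyError as err:
    print('out-of-vocabulary token:', err)  # e.g. KeyError: 'hrct'
```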

You must not do this.

Instead, you should let the tokenizer tokenize the text and then ask for the token IDs, e.g.:

```
tokens_list = tokenizer.tokenize('Where are you going?')
```

Remember, nevertheless, that BERT uses subword tokenization, so it splits the input text into pieces that can be represented with the subwords in its vocabulary.
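Putting it together, here is a minimal sketch of the correct pipeline, again reusing the `tokenizer` from your code. The exact subword splits shown in the comments are illustrative, not guaranteed; they depend on the vocabulary of the checkpoint you load:

```
text = 'snehal sarjerao nawale, mobile no 1155812345'

# Let BERT's WordPiece tokenizer split the text into subwords it knows.
tokens = tokenizer.tokenize(text)
print(tokens)  # out-of-vocabulary words come out as subword pieces,
               # e.g. something like ['sne', '##hal', 'sar', '##jer', '##ao', ...]

# Every resulting token is guaranteed to be in the vocabulary,
# so converting to IDs can no longer fail.
input_sequence = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(input_sequence)
print(input_ids)
```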
