
I'm building an internal semantic search engine using BERT/SBERT + Elasticsearch 8, where answers are retrieved based on their cosine similarity with the query.

The documents to be searched are somewhat domain-specific; my rough estimate is that about 10% of the vocabulary is not present in the Wikipedia or Common Crawl datasets on which BERT models were trained. These are essentially "made-up" words - niche product and brand names.

So my question is:

  1. Should I pre-train a BERT/SBERT model first on my specific corpus to learn the embeddings for these words using MLM?

or

  2. Can I skip pre-training and start fine-tuning a selected model for Q/A using SQuAD, synthetic Q/A based on my corpus, and actual logged user queries?

My concern is that if I skip #1, the model would not know the embeddings for some of the "made-up" words, would replace them with the "unknown" token, and this might lead to worse search performance.
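To quantify this concern, here is a minimal sketch (assuming the Hugging Face `transformers` library and a placeholder list of product names) that checks how a stock BERT WordPiece tokenizer actually handles such terms:

```python
# Minimal sketch: see how a stock BERT tokenizer splits domain-specific terms
# and whether any of them fall back to the unknown token.
# The term list below is a made-up placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

domain_terms = ["frobozzium", "QuuxPro 3000", "zorkmid"]  # hypothetical product names

for term in domain_terms:
    pieces = tokenizer.tokenize(term)
    note = " (contains [UNK])" if tokenizer.unk_token in pieces else ""
    print(f"{term!r:>16} -> {pieces}{note}")
```

In practice WordPiece rarely emits [UNK] for Latin-script made-up names; it splits them into generic subword pieces instead, so the risk is more about uninformative representations than missing tokens.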

ruslaniv

1 Answer


Is your corpus big enough (i.e., several GB)?

If yes, you could train a model from scratch and get good results.

https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6

If not, fine-tuning should work better. You can still try training from scratch, but you may sometimes get poor results. Perhaps you can add training data from similar sources to reach a better result.

https://www.tensorflow.org/tfmodels/nlp/fine_tune_bert
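For completeness, continued MLM pre-training of an existing checkpoint on your own corpus (option #1 in the question, as opposed to full from-scratch training) might look roughly like the sketch below. It assumes the Hugging Face `transformers` and `datasets` libraries and a placeholder corpus.txt with one title/description per line; treat the hyperparameters as illustrative only.

```python
# Hedged sketch of continued ("domain-adaptive") MLM pre-training on your
# own corpus. File name, output directory, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One product title/description per line in a plain-text file (placeholder path).
dataset = load_dataset("text", data_files={"train": "corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly masks 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-mlm",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```

After this step you would still wrap the encoder with sentence-embedding pooling (SBERT-style) and fine-tune it for retrieval.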

Nicolas Martin
  • Yes, I have downloaded all product titles and descriptions from the DB and the CSV is about 10 GB. – ruslaniv Oct 23 '22 at 06:12
  • Does it work @ruslaniv? – Nicolas Martin Oct 29 '22 at 14:40
  • 1
    Yes, after all I trained the model with MLM (which I think I failed to train properly https://datascience.stackexchange.com/questions/115717/do-i-need-to-train-a-tokenizer-when-training-sbert-with-mlm) and then fine tuned it with synthetic queries. Still waiting for enough of actual user queries that we just started logging for further training. If you could look at related question I would really appreciate it. – ruslaniv Nov 01 '22 at 10:09
  • ok, I'll check. – Nicolas Martin Nov 01 '22 at 17:13
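A minimal sketch of the fine-tuning step mentioned in the comment above, i.e. training SBERT-style sentence embeddings on synthetic (query, passage) pairs with the sentence-transformers library; the checkpoint path and the example pairs are placeholders:

```python
# Hedged sketch: build an SBERT-style model from the MLM-adapted checkpoint
# and fine-tune it on synthetic (query, relevant passage) pairs.
# "bert-domain-mlm" and the example pairs are placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

word_emb = models.Transformer("bert-domain-mlm", max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# Each example pairs a synthetic query with the passage it was generated from;
# MultipleNegativesRankingLoss treats other passages in the batch as negatives.
train_examples = [
    InputExample(texts=["query about product X", "title/description of product X"]),
    InputExample(texts=["query about product Y", "title/description of product Y"]),
    # ... many more pairs in practice
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
model.save("sbert-domain-qa")
```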