
I read the answers to the question What kind of word embedding is used in the original transformer?. They say that transformers like BERT initialize the first word embedding layer with random values.

Initializing the first word embedding layer in transformers with random values works fine, but wouldn't initializing it with pre-trained word embeddings speed up the training of the transformer?

Isn't starting with pre-trained word embeddings (vectors that already carry semantic meaning) better than starting from scratch?

I am not talking about final performance; I am talking about the speed of training. A minimal sketch of what I mean is below.
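For concreteness, here is a sketch assuming PyTorch; `pretrained_vectors` is a stand-in for real GloVe/word2vec vectors aligned to the tokenizer's vocabulary:

```python
import torch
import torch.nn as nn

# Stand-in for real pre-trained vectors (e.g. GloVe/word2vec rows
# aligned to the transformer's tokenizer vocabulary).
vocab_size, d_model = 30000, 300
pretrained_vectors = torch.randn(vocab_size, d_model)

# What BERT-style models do: random initialization, learned from scratch.
random_init = nn.Embedding(vocab_size, d_model)

# What I am asking about: warm-start from pre-trained vectors,
# but keep them trainable so the model can still adapt them.
warm_start = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
```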

floyd

1 Answer


I'd expect your approach to converge faster indeed.

Not having to learn your embedding vectors from scratch means there are fewer parameters to train. Assuming the pre-trained token embeddings are actually useful for the model, it should converge faster. However, note that a model with a trainable token embedding may eventually surpass one that keeps the pre-trained token embedding fixed in terms of final performance.
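As a rough sketch of that trade-off (assuming PyTorch; the pre-trained matrix here is a placeholder): freezing the pre-trained embedding removes those weights from the set of trainable parameters, while a trainable embedding keeps them learnable:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30000, 300
pretrained_vectors = torch.randn(vocab_size, d_model)  # placeholder for real vectors

# Frozen pre-trained embedding: its weights are excluded from training,
# so there are vocab_size * d_model fewer parameters to learn.
frozen = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

# Trainable embedding (warm-started here, but it could also be random):
# the weights stay learnable and may give better final performance.
trainable = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

def trainable_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print(trainable_params(frozen))     # 0
print(trainable_params(trainable))  # 9000000
```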

Robin van Hoorn