
I read the answers to the question What kind of word embedding is used in the original transformer?. They say that transformers like BERT initialize the first word embedding layer with random values.

Initializing the first word embedding layer in transformers with random values works fine, but wouldn't initializing it with pre-trained word embeddings speed up the training of the transformer?

Isn't starting with pre-trained word embeddings (vectors that already carry semantic meaning) better than starting from scratch?

I am not talking about final performance; I am talking about the speed of training. A minimal sketch of what I mean is below.
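For concreteness, here is a sketch assuming PyTorch; `pretrained_vectors` is a stand-in for real GloVe/word2vec vectors aligned to the tokenizer's vocabulary:

```python
import torch
import torch.nn as nn

# Stand-in for real pre-trained vectors (e.g. GloVe/word2vec rows
# aligned to the transformer's tokenizer vocabulary).
vocab_size, d_model = 30000, 300
pretrained_vectors = torch.randn(vocab_size, d_model)

# What BERT-style models do: random initialization, learned from scratch.
random_init = nn.Embedding(vocab_size, d_model)

# What I am asking about: warm-start from pre-trained vectors,
# but keep them trainable so the model can still adapt them.
warm_start = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
```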

floyd

1 Answer


I'd expect your approach to converge faster indeed.

Not having to learn your embedding vectors from scratch means there are fewer parameters to train. Assuming the pre-trained token embeddings are actually useful for the model, it should converge faster. However, note that a model with a trainable token embedding may eventually surpass one that keeps the pre-trained token embedding fixed in terms of final performance.
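As a rough sketch of that trade-off (assuming PyTorch; the pre-trained matrix here is a placeholder): freezing the pre-trained embedding removes those weights from the set of trainable parameters, while a trainable embedding keeps them learnable:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30000, 300
pretrained_vectors = torch.randn(vocab_size, d_model)  # placeholder for real vectors

# Frozen pre-trained embedding: its weights are excluded from training,
# so there are vocab_size * d_model fewer parameters to learn.
frozen = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

# Trainable embedding (warm-started here, but it could also be random):
# the weights stay learnable and may give better final performance.
trainable = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

def trainable_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print(trainable_params(frozen))     # 0
print(trainable_params(trainable))  # 9000000
```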

Robin van Hoorn