
I'm trying to write a program that uses RoBERTa to calculate word embeddings:

from transformers import RobertaModel, RobertaTokenizer
import torch

model = RobertaModel.from_pretrained('roberta-base')
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
caption = "this bird is yellow has red wings"

encoded_caption = tokenizer(caption, return_tensors='pt')
input_ids = encoded_caption['input_ids']

outputs = model(input_ids)
word_embeddings = outputs.last_hidden_state

I extract the last hidden state after forwarding the input_ids through the RobertaModel class to calculate word embeddings. I don't know if this is the correct way to do this; can anyone help me confirm it? Thanks.

1 Answer


This was studied in the original BERT paper, which concluded that the best-performing approach was to concatenate the hidden states of the last 4 layers:

[Table from the BERT paper comparing feature-based strategies (last hidden layer, second-to-last layer, weighted sum of the last four layers, concatenation of the last four layers, etc.) by dev F1 on the CoNLL-2003 NER task]

Although BERT preceded RoBERTa, we may take this observation to be somewhat applicable to RoBERTa, which is architecturally very similar. You may, nonetheless, experiment with the precise number of layer states to concatenate to see which value gives the best results.
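
For reference, a minimal sketch of the concatenation with Hugging Face Transformers, assuming roberta-base and the caption from the question (it uses the same approach as the snippet in the comments below):

from transformers import RobertaModel, RobertaTokenizer
import torch

model = RobertaModel.from_pretrained('roberta-base')
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

encoded = tokenizer("this bird is yellow has red wings", return_tensors='pt')

with torch.no_grad():
    outputs = model(**encoded, output_hidden_states=True)

# outputs.hidden_states is a tuple of 13 tensors (embedding layer + 12 encoder
# layers), each of shape (batch, seq_len, 768)
last_four = outputs.hidden_states[-4:]

# Concatenate along the hidden dimension -> (batch, seq_len, 4 * 768 = 3072)
token_embeddings = torch.cat(last_four, dim=-1)

Each token then gets a 3072-dimensional vector, which matches the shape discussed in the comments below.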

noe
  • What does this table mean? I don't understand it. –  Jan 12 '24 at 20:02
  • OK, I have read the paper and understand now. Another question: how can I concatenate the last 4 hidden layers in code? –  Jan 12 '24 at 20:06
  • First, you should invoke your model with model(input_ids, output_hidden_states=True) to get the hidden states. Then, you concatenate them with torch.cat, like torch.cat([outputs['hidden_states'][-i] for i in range(1,5)],dim=-1). – noe Jan 12 '24 at 20:14
  • So I did this and the word embeddings have shape torch.Size([1, 9, 3072]); is this normal? I thought the hidden size should stay 768, so why did it increase to 3072? –  Jan 12 '24 at 20:22
  • The hidden state size is 768, but you have concatenated the last 4 hidden states together in a single vector, so the resulting size is 4x. – noe Jan 12 '24 at 20:23
  • Cool, thank you! I want to ask, how can you calculate the weighted sum of the last 4 hidden layers in the code? I have searched on the Internet but didn't find any examples. –  Jan 12 '24 at 21:33
  • With a matrix multiplication with the weights vector. You may need to permute the dimensions first (see the sketch after these comments). – noe Jan 12 '24 at 22:13
  • I also want to ask: do you think it is necessary to forward the attention_mask to the RobertaModel in order to ignore padding tokens when calculating the contextualized word embeddings? Thanks –  Jan 27 '24 at 18:03
  • I would need to look into it. Please, create a new question for this new doubt. – noe Jan 27 '24 at 18:14
  • If I apply the approaches from the table, do they produce "contextualized token embeddings" rather than "word embeddings"? These two terms are different, right? –  Feb 03 '24 at 21:02
  • RoBERTa does not provide word embeddings, just token embeddings. You can check this answer and this other answer. They are about BERT but apply equally to RoBERTa. (A sketch of one way to pool token embeddings into per-word vectors follows these comments.) – noe Feb 04 '24 at 08:27
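
Regarding the weighted-sum question above, a minimal sketch of one way to compute it, again assuming roberta-base; the fixed weights here are purely illustrative assumptions, and in practice they could be fixed hyperparameters or learned parameters:

import torch
from transformers import RobertaModel, RobertaTokenizer

model = RobertaModel.from_pretrained('roberta-base')
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
encoded = tokenizer("this bird is yellow has red wings", return_tensors='pt')

with torch.no_grad():
    outputs = model(**encoded, output_hidden_states=True)

# Stack the last 4 hidden states -> (4, batch, seq_len, 768)
stacked = torch.stack(outputs.hidden_states[-4:], dim=0)

# Illustrative fixed weights, one per layer (assumed values, not from the answer)
weights = torch.tensor([0.1, 0.2, 0.3, 0.4])

# Move the layer axis last and matrix-multiply with the weights vector,
# as suggested in the comment -> (batch, seq_len, 768)
weighted_sum = stacked.permute(1, 2, 3, 0) @ weights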
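
On the last comment: since RoBERTa uses a subword (BPE) tokenizer, a single word may be split into several tokens, so the model only produces token embeddings. If per-word vectors are needed, one common workaround is to pool the subword vectors belonging to each word. A sketch, under the assumption that averaging the subword vectors is acceptable, using the fast tokenizer's word_ids() mapping:

from transformers import RobertaModel, RobertaTokenizerFast
import torch

model = RobertaModel.from_pretrained('roberta-base')
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

caption = "this bird is yellow has red wings"
encoded = tokenizer(caption, return_tensors='pt')

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state[0]  # (seq_len, 768)

# word_ids() maps each token position to the index of the word it belongs to
# (None for special tokens such as <s> and </s>)
word_ids = encoded.word_ids()

word_vectors = []
for word_index in sorted(set(i for i in word_ids if i is not None)):
    positions = [pos for pos, i in enumerate(word_ids) if i == word_index]
    # Average the subword token vectors that make up this word
    word_vectors.append(token_embeddings[positions].mean(dim=0))

word_vectors = torch.stack(word_vectors)  # (num_words, 768)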