3

I was reading the LoRA paper https://arxiv.org/pdf/2106.09685.pdf and there is something I don't understand in Section 4.1, where the update $\Delta W x$ is scaled by $\frac{\alpha}{r}$, with $\alpha$ a constant in $r$. It is said that $\alpha$ is set to the first $r$ tried, and then, if I understand correctly, the authors say that tuning $\alpha$ is roughly the same as tuning the learning rate, so it does not need to be tuned separately. I would really appreciate it if someone could explain this concept to me. To start with, I don't understand why the weight update needs to be scaled by a constant at all; every weight of the update is already optimized during fine-tuning.

I also wanted to understand why A is initialized randomly and B to zero. Would it make a difference if it were the other way around (A zero, B random)? Also, what would go wrong if both were set to zero?
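
For concreteness, this is how I currently picture the layer described in Section 4.1 (a minimal PyTorch sketch of my own, not code from the paper; the class name `LoRALinear` and the initialization scale are my assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + (alpha / r) * B A x, as I read Section 4.1 of the paper."""
    def __init__(self, in_features, out_features, r=8, alpha=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # pretrained W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # A: random init
        self.B = nn.Parameter(torch.zeros(out_features, r))         # B: zero init
        self.scaling = alpha / r                                     # the constant I am asking about

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())
```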

jpotwor

2 Answers

3

I had the same question. I still haven't got a convincing answer, but while searching I found these passages, which might be helpful:

In this blog, they say:

Alpha scales the learned weights. Existing literature, including the original LoRA paper, generally advises fixing Alpha—often at 16—rather than treating it as a tunable hyperparameter

In the literature, they say:

[...] and LoRA alpha is the scaling factor for the weight matrices. The weight matrix is scaled by $\frac{lora\_alpha}{lora\_rank}$, and a higher alpha value assigns more weight to the LoRA activations. We chose 16 since this was common practice in training scripts we reviewed and chose a 1:1 ratio so as not to overpower the base model.

While these two passages were clear, I still don't understand why one should scale the update weights. Also, I wouldn't have expected a ratio $\frac{lora\_alpha}{lora\_rank} > 1$, but in the tutorial I am following for applying LoRA to Whisper (an ASR model), the ratio is equal to $2$, with $lora\_alpha=64$ and $lora\_rank=32$.
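
To make that ratio concrete, here is a small sketch of my own (not from either source): the effective update added to the frozen weight is $\frac{lora\_alpha}{lora\_rank} BA$, so the ratio is just a fixed multiplier on whatever the adapter has learned, and the Whisper tutorial's $lora\_alpha=64$, $lora\_rank=32$ amplifies the adapter output by a factor of 2 compared with a 1:1 ratio.

```python
import torch

def effective_update(A, B, alpha):
    """Return (alpha / r) * B @ A, the matrix added to the frozen weight."""
    r = A.shape[0]
    return (alpha / r) * (B @ A)

d_in, d_out = 16, 16
torch.manual_seed(0)
for r, alpha in [(32, 32), (32, 64)]:   # 1:1 ratio vs. the Whisper tutorial's 2:1
    A = torch.randn(r, d_in)
    B = torch.randn(d_out, r)
    dW = effective_update(A, B, alpha)
    print(f"r={r} alpha={alpha} ratio={alpha / r:.1f} ||dW||={dW.norm():.2f}")
```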

0

I also wanted to understand why A is initialized randomly and B to zero. Would it make a difference if it were the other way around (A zero, B random)?

I think they just want the initial step of fine-tuning to behave as if we were fine-tuning with only the pretrained weights (the additional weights do not affect the result yet). Only one of A or B has to be all zeros for that, so the other way around (A zero, B random) should work too.
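
A quick way to see it (my own sketch, assuming the standard parameterization $\Delta W = BA$): as long as one factor starts at zero, the initial update is exactly zero, so the first forward pass matches the pretrained model regardless of which factor is the random one.

```python
import torch

r, d_in, d_out = 4, 8, 8
A_rand, B_zero = torch.randn(r, d_in), torch.zeros(d_out, r)
A_zero, B_rand = torch.zeros(r, d_in), torch.randn(d_out, r)

zero = torch.zeros(d_out, d_in)
print(torch.equal(B_zero @ A_rand, zero))  # True: the paper's init (A random, B zero)
print(torch.equal(B_rand @ A_zero, zero))  # True: the swapped init starts from zero too
```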

Also, what would go wrong if both were set to zero?

Initializing both of them to zero would cause a problem; I think it is for much the same reason we don't initialize a deep network entirely with zeros. The gradient of each factor is multiplied by the other factor, so with both A and B at zero the gradients of both are zero and the adapter never moves away from zero; every node in A also looks identical to B, so there is nothing to break the symmetry.
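
A small demonstration of that (my own sketch, not from the paper): with both factors at zero, neither A nor B ever receives a nonzero gradient.

```python
import torch

r, d_in, d_out = 4, 8, 8
A = torch.zeros(r, d_in, requires_grad=True)
B = torch.zeros(d_out, r, requires_grad=True)
x = torch.randn(d_in)

y = B @ (A @ x)        # output of the LoRA branch
y.sum().backward()     # stand-in for any loss downstream of y

print(A.grad.abs().max().item(), B.grad.abs().max().item())  # 0.0 0.0 -- no signal to learn from
```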

Related link: Initialize perceptron weights with zero