I was reading the LoRA paper https://arxiv.org/pdf/2106.09685.pdf and there is something I don't understand in section 4.1, where the update ΔWx is scaled by α/r, with α a constant in r. The authors say they set α to the first r they try. Then, if I understand correctly, they say that tuning α is roughly the same as tuning the learning rate, so this makes it unnecessary to tune α separately. I would really appreciate it if someone could explain this to me. To start with, I don't understand why the weight update needs to be scaled by a constant at all, since all the weights of the update are optimized during fine-tuning anyway.
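To make sure I'm reading the paper right, here is a toy sketch of how I picture the scaling (my own code, not the authors' implementation; the class name and init scale are just for illustration). Please correct me if this is not what section 4.1 describes:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer as I understand it: frozen W plus (alpha / r) * B A."""

    def __init__(self, d_in, d_out, r, alpha):
        super().__init__()
        self.frozen = nn.Linear(d_in, d_out, bias=False)
        self.frozen.weight.requires_grad = False       # pretrained W stays fixed
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init
        self.scale = alpha / r                          # the constant I'm asking about

    def forward(self, x):
        # h = W x + (alpha / r) * B A x
        return self.frozen(x) + self.scale * (x @ self.A.T @ self.B.T)
```

My confusion is that `self.scale` looks redundant: whatever it is, the entries of A and B could absorb it during training.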
I also wanted to understand why A is initialized randomly and B to zero. Would it make a difference if it were the other way around (A zero, B random)? Also, what would go wrong if both were initialized to zero?
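For context, the part I do follow is that with B = 0 the initial update BA is the zero matrix, so the adapted layer starts out identical to the pretrained one. A minimal check of that (my own toy code, not from the paper):

```python
import torch

# If B starts at zero, the initial update B @ A is the zero matrix,
# so the adapted layer behaves exactly like the pretrained one
# before any training step.
d, r = 8, 2
A = torch.randn(r, d) * 0.01   # random Gaussian init, as in the paper
B = torch.zeros(d, r)          # zero init
delta_W = B @ A                # (d, d) update matrix
assert torch.equal(delta_W, torch.zeros(d, d))
```

What I don't see is why the asymmetry goes this particular way, and whether starting both at zero would break training somehow.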