ΔW, frozen W, and why naïvely matching sizes defeats the point—setup for the low-rank trick.
Implementing LoRA From Scratch
Let us understand LoRA in more detail.
Suppose the current weights of some layer in the pre-trained model are W, a matrix of dimensions d × k, and we wish to fine-tune the model on another dataset.
During fine-tuning, the gradient update rule tells us to add an update ΔW to W to get the updated parameters:
W_updated = W + ΔW
For simplicity, you can think of ΔW as the update obtained by running gradient descent on the new dataset. For a single step with learning rate η and fine-tuning loss L, that is:
ΔW = −η · ∇_W L(W)
Also, instead of overwriting the original weights W, it is perfectly valid to keep the two matrices, W and ΔW, separate.
During inference, we can then compute the prediction on an input sample x (a vector of length k) as follows:
prediction = (W + ΔW) x = W x + ΔW x
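To make this concrete, here is a minimal NumPy sketch checking that keeping W and ΔW separate gives the same prediction as merging them first. The layer size, the random values, and the names `delta_W`, `merged`, and `separate` are assumptions made for illustration only, and the layer is treated as a plain matrix-vector product with no bias.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 3, 4                               # hypothetical layer dimensions
W = rng.normal(size=(d, k))               # stand-in for frozen pre-trained weights
delta_W = 0.01 * rng.normal(size=(d, k))  # stand-in for a fine-tuning update
x = rng.normal(size=k)                    # one input sample of length k

# Option 1: merge the update into the weights, then predict.
merged = (W + delta_W) @ x

# Option 2: keep both matrices and add their outputs.
separate = W @ x + delta_W @ x

print(np.allclose(merged, separate))
```

Both routes agree up to floating-point error because matrix multiplication distributes over addition, which is why W never has to be modified.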
In fact, throughout all fine-tuning iterations, W can be kept frozen, and every gradient-based weight update can be accumulated into ΔW instead.
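The idea of freezing W and routing every update into ΔW can be sketched with a toy example. This is not the article's implementation: it assumes a single linear layer, a squared-error loss, synthetic data, and plain gradient descent, and names such as `W_target`, `delta_W`, and `lr` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 32, 2, 4                      # hypothetical sample count and layer size
X = rng.normal(size=(n, k))             # inputs, one sample per row
W = rng.normal(size=(d, k))             # "pre-trained" weights, never updated
W_target = W + 0.5 * rng.normal(size=(d, k))  # weights that generate the new data
Y = X @ W_target.T                      # targets for the fine-tuning dataset

W_before = W.copy()                     # kept only to verify W stays frozen
delta_W = np.zeros_like(W)              # all learning happens here
lr = 0.1                                # assumed learning rate


def loss(delta):
    """Mean over samples of half the squared error of (W + delta) x."""
    err = X @ (W + delta).T - Y
    return 0.5 * np.mean(np.sum(err ** 2, axis=1))


initial_loss = loss(delta_W)
for _ in range(200):
    err = X @ (W + delta_W).T - Y       # shape (n, d)
    grad = err.T @ X / n                # gradient of the loss w.r.t. delta_W, shape (d, k)
    delta_W -= lr * grad                # update delta_W only; W is untouched
final_loss = loss(delta_W)

print("W unchanged:", np.array_equal(W, W_before))
print("loss decreased:", final_loss < initial_loss)
```

Because the prediction depends on W and ΔW only through their sum, the gradient with respect to ΔW equals the gradient with respect to W, so training ΔW alone loses nothing at this stage.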
But you might be wondering… how does that even help?
The matrix W is already huge, and we are talking about introducing another matrix that is equally big.
So, we need some smart way to restructure ΔW so that we can still meet the fine-tuning objective while keeping memory consumption low.
Now, we really can’t do much about W, since these weights belong to the pre-trained model. So any optimization we intend to apply must be done on ΔW instead.
While doing so, we must also remember that, as things stand, W and ΔW have the same dimensions. Given that W is already huge, we must ensure that ΔW does not end up with as many parameters, as that would defeat the entire purpose of efficient fine-tuning.
In other words, if we kept ΔW at the same dimensions as W, we might as well have fine-tuned the original model directly.
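A quick back-of-the-envelope calculation shows the cost of a full-size ΔW. The layer dimensions and the 32-bit float storage below are assumptions chosen for illustration, not figures from any particular model.

```python
d, k = 4096, 4096            # hypothetical layer dimensions
bytes_per_param = 4          # assuming 32-bit floats

params_W = d * k             # parameters in the frozen matrix W
params_delta_W = d * k       # a full-size delta W has exactly the same shape


def mib(n_params):
    """Memory in MiB needed to store n_params parameters."""
    return n_params * bytes_per_param / 2 ** 20


total_mib = mib(params_W + params_delta_W)

print(f"W:        {params_W:,} parameters, {mib(params_W):.1f} MiB")
print(f"delta W:  {params_delta_W:,} parameters, {mib(params_delta_W):.1f} MiB")
print(f"together: {total_mib:.1f} MiB for this one layer")
```

Under these assumptions, storing a full-size ΔW alongside W doubles the memory for the layer, and the number of trainable parameters is exactly what full fine-tuning would have required.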