Fine-tuning · topic 4 of 11

Implementing LoRA from scratch

ΔW, frozen W, and why naïvely matching sizes defeats the point—setup for the low-rank trick.

Updates without touching all of W

Let us understand LoRA in more detail.

Suppose the current weights of some layer in the pre-trained model form a matrix W of dimensions d × k, and we wish to fine-tune the model on another dataset.

Layer weight matrix W (d × k) before adaptation.

During fine-tuning, the gradient update rule suggests that we must add ΔW to get the updated parameters:

Fine-tuning adds an update ΔW to frozen pretrained W.

For simplicity, you can think of ΔW as the total update obtained by running gradient descent on the new dataset.
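One way to write this down, as a sketch that assumes plain gradient descent with a fixed learning rate: the symbols η (learning rate), T (number of steps), and 𝓛 (fine-tuning loss) are introduced here for illustration and are not from the original text.

```latex
W' = W + \Delta W,
\qquad
\Delta W = -\eta \sum_{t=0}^{T-1} \nabla_{W}\, \mathcal{L}(W_t),
\qquad
W_0 = W
```

Here W_t denotes the weights after t steps, so ΔW is simply the sum of all individual steps and has the same d × k shape as W.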

Keeping W static

Also, instead of overwriting the original weights W, it is perfectly valid to keep both matrices, W and ΔW, as separate objects.

Keeping pretrained W and update ΔW side by side.

During inference, we can compute the prediction on an input sample x as follows:

Forward pass combines frozen W with ΔW on input x.
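The split forward pass can be checked numerically. The following is a minimal NumPy sketch, assuming a bias-free linear layer that computes W x for an input x of length k; the sizes and the random stand-in for ΔW are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                                # illustrative layer sizes

W = rng.normal(size=(d, k))                # frozen pretrained weights
delta_W = 0.01 * rng.normal(size=(d, k))   # stand-in for a learned update
x = rng.normal(size=k)                     # one input sample

# Merged weights versus two separate matrices applied to the same input.
h_merged = (W + delta_W) @ x
h_split = W @ x + delta_W @ x

print(np.allclose(h_merged, h_split))
```

Both paths give the same output up to floating-point rounding, because matrix multiplication distributes over addition.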

In fact, W can be kept static throughout every fine-tuning iteration, with all gradient-based weight updates accumulated into ΔW instead.
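To make this concrete, here is a small sketch on a toy problem, assuming a linear model with a mean squared error loss and plain gradient descent. The data, the `grad` helper, and the hyperparameters are hypothetical. It compares ordinary fine-tuning against training only ΔW while W stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 3, 5, 32                   # output size, input size, sample count
X = rng.normal(size=(n, k))          # toy fine-tuning inputs
Y = rng.normal(size=(n, d))          # toy fine-tuning targets
W_pretrained = rng.normal(size=(d, k))

def grad(W_eff):
    """Gradient of the loss 0.5 * mean_i ||W_eff @ x_i - y_i||^2 w.r.t. W_eff."""
    residual = X @ W_eff.T - Y       # shape (n, d)
    return residual.T @ X / n        # shape (d, k)

lr, steps = 0.1, 50

# Path 1: ordinary fine-tuning, overwriting the weights in place.
W_full = W_pretrained.copy()
for _ in range(steps):
    W_full -= lr * grad(W_full)

# Path 2: W is never modified; every update is accumulated in delta_W.
W = W_pretrained.copy()
delta_W = np.zeros_like(W)           # start from "no change"
for _ in range(steps):
    delta_W -= lr * grad(W + delta_W)

print(np.allclose(W_full, W + delta_W))   # same effective weights
print(np.array_equal(W, W_pretrained))    # W was left untouched
```

The gradient with respect to ΔW equals the gradient with respect to the effective weights W + ΔW, which is why the two paths agree. Note that `delta_W` here is still a full d × k matrix.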

But you might be wondering… how does that even help?

The matrix W is already huge, and we are talking about introducing another matrix that is equally big.

So we need a smarter way to represent ΔW, one that still meets the fine-tuning objective without consuming a large amount of memory.

Now, we really can’t do much about W, as these weights belong to the pre-trained model. So any optimization (if we intend to use any) must be applied to ΔW instead.

While doing so, we must also remember that, as defined so far, W and ΔW have the same dimensions. Since W is already huge, we must ensure that ΔW does not end up being the same size, as that would defeat the entire purpose of efficient fine-tuning.

In other words, if ΔW has the same dimensions as W, we gain nothing: we might as well have fine-tuned the original model directly.
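A quick back-of-the-envelope count shows the cost of a full-size ΔW. This sketch assumes a hypothetical 4096 × 4096 layer stored in float32; both numbers are chosen for illustration only.

```python
d, k = 4096, 4096            # hypothetical layer shape
bytes_per_param = 4          # assuming float32 storage
mib = 1024 ** 2

params_W = d * k
params_delta_W = d * k       # a full-size update has the same shape as W

mem_W = params_W * bytes_per_param
mem_delta_W = params_delta_W * bytes_per_param

print(f"W:       {params_W:,} parameters, {mem_W / mib:.0f} MiB")
print(f"delta W: {params_delta_W:,} parameters, {mem_delta_W / mib:.0f} MiB")
print(f"total:   {(mem_W + mem_delta_W) / mib:.0f} MiB for this one layer")
```

Under these assumptions each matrix holds 16,777,216 parameters (64 MiB), so storing a same-shape ΔW doubles the footprint of the layer before any optimizer state is counted.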

Key takeaways

  • Fine-tuning can be rewritten as freezing W and learning an update ΔW.
  • Efficient fine-tuning requires ΔW to stay cheap—enter the low-rank factorization next.