LoRAWeights, alpha scaling, attaching adapters to fc layers, and training only A/B.
The following sections pick up after the LoRAWeights listing shown at the end of How LoRA works—tuning initialization, the alpha scaling knob, and wiring adapters into a full network.
As demonstrated above, the LoRAWeights class aims to decompose a matrix of dimensionality d × k into two matrices A and B. Thus, it accepts four parameters:
Also, both self.A and self.B are learnable parameters of the module, representing the matrices used in the decomposition.
As discussed earlier, initializing one of the two matrices to zeros ensures that the product AB is zero when fine-tuning begins. This initialization also guarantees that, before any fine-tuning has been done, the original model weights are retained:
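The property above can be checked with a minimal sketch. It is an illustration, not the original listing: the argument names (d, k, r, alpha), the choice of a random A with an all-zero B, and the layer sizes are all assumptions.

```python
import torch
import torch.nn as nn


class LoRAWeights(nn.Module):
    """Low-rank update for a d x k weight matrix (illustrative sketch)."""

    def __init__(self, d, k, r, alpha):
        super().__init__()
        # Assumed initialization: A is random, B is all zeros, so A @ B == 0.
        self.A = nn.Parameter(torch.randn(d, r))
        self.B = nn.Parameter(torch.zeros(r, k))
        self.alpha = alpha

    def forward(self, x):
        # x: (batch, d) -> (batch, k)
        return self.alpha * (x @ self.A @ self.B)


torch.manual_seed(0)
base = nn.Linear(16, 8)  # stands in for one pre-trained layer (sizes are arbitrary)
lora = LoRAWeights(d=16, k=8, r=2, alpha=1.0)
x = torch.randn(4, 16)

frozen_out = base(x)
adapted_out = base(x) + lora(x)  # the adapter output is added to the frozen output

# Before any fine-tuning, the adapter contributes nothing.
print(torch.allclose(frozen_out, adapted_out))
```

Because B starts at zero, the adapted layer reproduces the pre-trained layer exactly until the first gradient step changes B.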
In the forward method, the input x is multiplied by the matrices A and B, and then scaled by alpha. The result is returned as the output of the module.
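The parameter alpha is another hyperparameter, which acts as a scaling factor. It determines how strongly the low-rank update AB influences the output relative to the frozen pre-trained weights: a larger alpha amplifies the adapter's contribution, and a smaller alpha dampens it.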
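To make the effect of alpha concrete, here is a small sketch with arbitrary shapes. B is given non-zero values here, as if some fine-tuning had already happened. The helper name lora_update is hypothetical.

```python
import torch

torch.manual_seed(0)
d, k, r = 16, 8, 2
A = torch.randn(d, r)
B = torch.randn(r, k)  # non-zero on purpose, to mimic a partially trained adapter
x = torch.randn(4, d)


def lora_update(x, alpha):
    # Same computation as the forward pass described above: alpha * (x A B).
    return alpha * (x @ A @ B)


small = lora_update(x, alpha=0.5)
large = lora_update(x, alpha=2.0)

# The update grows linearly with alpha, so the norms differ by a factor of 2.0 / 0.5.
ratio = (large.norm() / small.norm()).item()
print(ratio)
```

Since alpha multiplies the whole update, changing it rescales the adapter's contribution without touching A, B, or the frozen weights.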
As discussed earlier, LoRA is used on large matrices of a neural network. For instance, say we have the following neural network class:
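The original class listing is not reproduced here. A stand-in with four fully connected layers might look like the sketch below. The class name MyNeuralNetwork, the layer widths, and the ReLU activations are placeholders chosen for illustration. Only the layer names fc1 to fc4 come from the text.

```python
import torch
import torch.nn as nn


class MyNeuralNetwork(nn.Module):
    # Layer widths are illustrative placeholders, not the original values.
    def __init__(self, in_dim=784, hidden=(512, 256, 128), num_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden[0])
        self.fc2 = nn.Linear(hidden[0], hidden[1])
        self.fc3 = nn.Linear(hidden[1], hidden[2])
        self.fc4 = nn.Linear(hidden[2], num_classes)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        return self.fc4(x)


# Quick shape check on a freshly constructed (untrained) instance.
net = MyNeuralNetwork()
logits = net(torch.randn(2, 784))
print(logits.shape)
```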
Because LoRA is applied after training, we already have a trained model available. Let’s say it is accessible using the model object.
Now, our primary objective is to attach the matrices in the LoRAWeights class to the weight matrices in the layers of the above network. Each layer (fc1, fc2, fc3, fc4) gets its own LoRAWeights layer.
Note: It is not necessary for every layer to have its own fine-tuning LoRAWeights layer. In fact, the original paper limited its study to adapting only the attention weights for downstream tasks and froze the multi-layer perceptron (feedforward) units of the Transformer for parameter efficiency.
In our case, for instance, we could skip the adapter for fc4 and simply keep that layer frozen, as it is small compared to the other layers in the network.
Also, remember that the network is trained like any other neural network, except that only the weight matrices A and B are updated; the pre-trained model (model) is frozen. We do this as follows:
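A common way to freeze a PyTorch model is to turn off gradient tracking for each of its parameters. The sketch below uses a small stand-in for model, since the pre-trained network itself is not shown here.

```python
import torch.nn as nn

# Stand-in for the pre-trained network (the real `model` would be used instead).
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))

# Disable gradients so the optimizer can never update these weights.
for param in model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)
```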
Done!
Next, we utilize the LoRAWeights class to define the fine-tuning network below:
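One way such a network could be wired is sketched below. It is an assumption-laden illustration, not the original listing: the LoRAWeights signature, the layer widths, the rank r, and the attribute names lora1 to lora4 are all placeholders. Each adapter's output is added to the output of its frozen layer.

```python
import torch
import torch.nn as nn


class LoRAWeights(nn.Module):
    def __init__(self, d, k, r, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d, r))   # assumed: random init
        self.B = nn.Parameter(torch.zeros(r, k))   # assumed: zero init
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class MyNeuralNetwork(nn.Module):
    # Small placeholder widths so the sketch runs quickly.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(32, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 10)


class MyNeuralNetworkwithLoRA(nn.Module):
    def __init__(self, model, r=4, alpha=1.0):
        super().__init__()
        self.model = model
        # One adapter per layer, sized from that layer's input/output widths.
        self.lora1 = LoRAWeights(model.fc1.in_features, model.fc1.out_features, r, alpha)
        self.lora2 = LoRAWeights(model.fc2.in_features, model.fc2.out_features, r, alpha)
        self.lora3 = LoRAWeights(model.fc3.in_features, model.fc3.out_features, r, alpha)
        self.lora4 = LoRAWeights(model.fc4.in_features, model.fc4.out_features, r, alpha)

    def forward(self, x):
        # Frozen layer output plus the low-rank correction, layer by layer.
        x = torch.relu(self.model.fc1(x) + self.lora1(x))
        x = torch.relu(self.model.fc2(x) + self.lora2(x))
        x = torch.relu(self.model.fc3(x) + self.lora3(x))
        return self.model.fc4(x) + self.lora4(x)


torch.manual_seed(0)
base = MyNeuralNetwork()
for p in base.parameters():
    p.requires_grad = False  # keep the pre-trained weights frozen

lora_model = MyNeuralNetworkwithLoRA(base, r=4, alpha=1.0)
x = torch.randn(2, 32)
out = lora_model(x)

trainable_names = [n for n, p in lora_model.named_parameters() if p.requires_grad]
print(trainable_names)
```

With this wiring, the only parameters that still require gradients are the A and B matrices of the four adapters.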
As depicted above:
Done!
Now this MyNeuralNetworkwithLoRA model can be trained like any other neural network.
We have ensured that the pre-trained model (model) is not updated during fine-tuning and that only the weights in the LoRAWeights class are learned.
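This can be sanity-checked with a single optimization step. The sketch below uses one frozen linear layer and a hand-rolled adapter to stay short. The shapes, the SGD optimizer, the learning rate, and the random data are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Linear(16, 4)  # stand-in for a frozen pre-trained layer
for p in base.parameters():
    p.requires_grad = False

A = nn.Parameter(torch.randn(16, 2))
B = nn.Parameter(torch.zeros(2, 4))
alpha = 1.0

# Only the adapter matrices are handed to the optimizer.
optimizer = torch.optim.SGD([A, B], lr=0.1)
x, y = torch.randn(8, 16), torch.randn(8, 4)

w_before = base.weight.clone()
B_before = B.detach().clone()

loss = nn.functional.mse_loss(base(x) + alpha * (x @ A @ B), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

base_unchanged = torch.equal(base.weight, w_before)
# B moves on the first step; A's gradient is zero at this point because B starts at zero.
adapter_changed = not torch.equal(B.detach(), B_before)
print(base_unchanged, adapter_changed)
```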
So far, we explored how to update model weights efficiently (LoRA and its variants).
But fine-tuning also depends on what data you use to update the model.
This brings us to instruction fine-tuning (IFT)—the process of teaching an LLM how to follow human instructions by training it on curated instruction–response pairs.
IFT is the foundation of supervised fine-tuning (SFT), and most modern LLMs rely on some form of IFT during alignment.
That was pretty simple, wasn't it?