
PyTorch LoRA implementation

LoRAWeights, alpha scaling, attaching adapters to fc layers, and training only A/B.

From LoRAWeights to training details

The following sections pick up after the LoRAWeights listing shown at the end of How LoRA works—tuning initialization, the alpha scaling knob, and wiring adapters into a full network.

As demonstrated above, the LoRAWeights class represents the update to a weight matrix W of dimensionality d × k as the product of two smaller matrices, A (d × r) and B (r × k). It therefore accepts four parameters:

  • d: The number of rows in matrix W.
  • k: The number of columns in matrix W.
  • r: The rank hyperparameter.
  • alpha: A scaling parameter that controls the strength of the adaptation.
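
To get a feel for why the rank r matters, the trainable-parameter counts can be compared directly. This is a minimal sketch that assumes A has shape d × r and B has shape r × k, so that AB has the shape of W; the helper name lora_param_counts and the sizes below are hypothetical and not part of the original listing.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for a d x k matrix."""
    full = d * k        # updating W directly
    lora = r * (d + k)  # A is d x r, B is r x k
    return full, lora


# Hypothetical sizes, chosen only for illustration.
full, lora = lora_param_counts(d=1024, k=1024, r=8)
print(f"full update: {full:,} values, LoRA update: {lora:,} values")
```

With d = k = 1024 and r = 8, the adapter trains 16,384 values instead of 1,048,576.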

Initialization and forward

Both self.A and self.B are learnable parameters of the module; they are the two matrices used in the decomposition.

  • The matrix A has been initialized from a Gaussian distribution.
  • Note: If needed, we can also scale this matrix A so that the initial values are not too large.
  • The matrix B is a zero matrix.

As discussed earlier, this ensures that the product AB is zero when fine-tuning begins. It also means that, before any fine-tuning step has been taken, the adapted model behaves exactly like the original model:

AB starts at zero so the base checkpoint is unchanged at step zero.
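
The initialization described above can be sketched as follows. This is a reconstruction under stated assumptions, not necessarily the exact listing from the previous section: the input x is assumed to have shape (batch, d), and the 0.01 scale on A is an assumed value for keeping the initial entries small.

```python
import torch
import torch.nn as nn


class LoRAWeights(nn.Module):
    """Low-rank adapter that outputs alpha * (x @ A @ B)."""

    def __init__(self, d: int, k: int, r: int, alpha: float = 1.0):
        super().__init__()
        # A: Gaussian init; the 0.01 scale is an assumption, not from the source.
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)
        # B: all zeros, so A @ B is the zero matrix before any training.
        self.B = nn.Parameter(torch.zeros(r, k))
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d) -> result: (batch, k)
        return self.alpha * (x @ self.A @ self.B)


lora = LoRAWeights(d=16, k=8, r=2, alpha=4.0)
x = torch.randn(3, 16)

with torch.no_grad():
    delta_w = lora.A @ lora.B  # the d x k update applied on top of W
    out = lora(x)

print(delta_w.shape)                        # shape of the update matrix
print(torch.count_nonzero(delta_w).item())  # number of nonzero entries at step zero
```

Because B is zero, both the update matrix and the adapter output are exactly zero, whatever values A holds.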

In the forward method, the input x is multiplied by A and then by B, and the result is scaled by alpha before being returned as the output of the module.

The parameter alpha is another hyperparameter, which acts as a scaling factor. It determines the impact of the new layers on the current model.

Alpha scales the adapter’s contribution at the output edge.
  • A higher value of alpha means that the changes made by the LoRA layer will be more significant, potentially leading to more pronounced adjustments in the model's behavior.
  • Conversely, a lower value of alpha results in more subtle changes, as the impact of the transformation is reduced.
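
The effect of alpha can be checked numerically. The sketch below uses plain tensors rather than the LoRAWeights class, and it fills B with random values to stand in for an adapter that has already been trained; the function name adapter_term and all sizes are hypothetical.

```python
import torch

torch.manual_seed(0)

d, k, r = 16, 8, 2
x = torch.randn(3, d)
A = torch.randn(d, r) * 0.01
B = torch.randn(r, k)  # stand-in for a trained B (it would be zero at init)


def adapter_term(x: torch.Tensor, alpha: float) -> torch.Tensor:
    """The part of the output contributed by the low-rank path."""
    return alpha * (x @ A @ B)


small = adapter_term(x, alpha=1.0).norm()
large = adapter_term(x, alpha=4.0).norm()
print((large / small).item())  # ratio of the adapter's contribution
```

The adapter's contribution grows linearly with alpha, so quadrupling alpha quadruples the size of the shift added to the base output. Note that some implementations scale by alpha / r instead; the sketch follows the text here and applies alpha directly.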

Wiring LoRA into a network

As discussed earlier, LoRA is used on large matrices of a neural network. For instance, say we have the following neural network class:

Example MLP with fc1…fc4 where LoRA can attach.
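
A network of this kind might look like the sketch below. The class name MyNeuralNetwork and the layer sizes (784, 512, 256, 128, 10) are assumptions made for illustration; only the layer names fc1 to fc4 come from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MyNeuralNetwork(nn.Module):
    """Plain MLP with four fully connected layers (sizes are assumed)."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.fc4(x)


model = MyNeuralNetwork()

# Weight-matrix size per layer: the larger ones are the natural LoRA targets.
for name, layer in model.named_children():
    print(name, layer.weight.numel())
```

With these assumed sizes, fc1 holds by far the largest weight matrix and fc4 the smallest, which is the kind of imbalance that makes it reasonable to adapt only some layers.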

Since LoRA is applied after training, we will already have a trained model available. Let’s say it is accessible through the model object.

Now, our primary objective is to pair the matrices in the LoRAWeights class with the weight matrices in the layers of the above network. Each layer (fc1, fc2, fc3, fc4) can have its own LoRAWeights layer.

Note: Not every layer needs its own LoRAWeights layer. In fact, the original paper mentions that the authors limited their study to adapting only the attention weights for downstream tasks and froze the multi-layer perceptron (feedforward) units of the Transformer for parameter efficiency.

In our case, for instance, we can leave fc4 without an adapter, since it is small compared to the other layers in the network.

Also, remember that the network is trained like any other neural network, except that only the matrices A and B are updated; the pre-trained model (model) stays frozen. We do this as follows:

Training loop: optimize only LoRA A/B; base frozen.
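
One common way to freeze a pre-trained model in PyTorch is to switch off requires_grad on each of its parameters. The sketch below uses a small nn.Sequential as a hypothetical stand-in for the trained model object.

```python
import torch.nn as nn

# Hypothetical stand-in for the pre-trained network.
model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 4))

# Freeze every pre-trained parameter so no gradients are computed for it.
for param in model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} of {total}")
```

After this step the base model contributes no trainable parameters, so any optimizer built later only needs to be given the LoRA matrices.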

Done!

Next, we utilize the LoRAWeights class to define the fine-tuning network below:

Composite network: LoRA adapters fused with fc1–fc3 forward path.
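
A minimal sketch of such a composite network is shown below. The names MyNeuralNetworkwithLoRA and loralayer1 to loralayer3 follow the text; the base class MyNeuralNetwork, its small layer sizes, the 0.01 scale on A, and the defaults r=4 and alpha=1.0 are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAWeights(nn.Module):
    def __init__(self, d, k, r, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # assumed scale
        self.B = nn.Parameter(torch.zeros(r, k))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class MyNeuralNetwork(nn.Module):
    """Small stand-in for the pre-trained network (sizes are assumed)."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 32)
        self.fc3 = nn.Linear(32, 16)
        self.fc4 = nn.Linear(16, 4)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.fc4(x)


class MyNeuralNetworkwithLoRA(nn.Module):
    def __init__(self, model, r=4, alpha=1.0):
        super().__init__()
        self.model = model
        # One adapter per adapted layer, sized from that layer's dimensions.
        self.loralayer1 = LoRAWeights(model.fc1.in_features, model.fc1.out_features, r, alpha)
        self.loralayer2 = LoRAWeights(model.fc2.in_features, model.fc2.out_features, r, alpha)
        self.loralayer3 = LoRAWeights(model.fc3.in_features, model.fc3.out_features, r, alpha)

    def forward(self, x):
        # Base path plus low-rank path, then the activation.
        x = F.relu(self.model.fc1(x) + self.loralayer1(x))
        x = F.relu(self.model.fc2(x) + self.loralayer2(x))
        x = F.relu(self.model.fc3(x) + self.loralayer3(x))
        return self.model.fc4(x)  # fc4 has no adapter


model = MyNeuralNetwork()
for p in model.parameters():
    p.requires_grad = False  # freeze the pre-trained weights

lora_model = MyNeuralNetworkwithLoRA(model)
x = torch.randn(5, 64)
same_output = torch.allclose(lora_model(x), model(x))
trainable_names = [n for n, p in lora_model.named_parameters() if p.requires_grad]
print(same_output, trainable_names)
```

Before any training, the wrapped model reproduces the base model's output because every B is zero, and the only parameters that require gradients are the A and B matrices of the three adapters.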

As depicted above:

  • The LoRA layers are applied over the fully connected layers (fc1, fc2, fc3) in the existing model. More specifically, we create three LoRAWeights layers (loralayer1, loralayer2, loralayer3) based on the dimensions of the fully connected layers (fc1, fc2, fc3) in the model.
  • In the forward method, we pass the input through the first fully connected layer (fc1) of the original model and add the output to the result of the LoRA layer applied to the same input (self.loralayer1(x)). Next, we apply a ReLU activation function to the sum. We repeat the process for the second and third fully connected layers (fc2, fc3) and lastly, return the final output of the last fully connected layer of the pre-trained model (fc4).

Done!

Now this MyNeuralNetworkwithLoRA model can be trained like any other neural network.

We have ensured that the pre-trained model (model) is not updated during fine-tuning and that only the weights in the LoRAWeights layers are learned.
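
To see this in action, here is a small sketch of a training loop on a single adapted layer. It uses plain A and B parameters next to one frozen nn.Linear rather than the full network, and the data, sizes, learning rate, and choice of Adam are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

base = nn.Linear(16, 4)  # stand-in for one pre-trained layer
for p in base.parameters():
    p.requires_grad = False  # freeze the pre-trained weights

r, alpha = 2, 1.0
A = nn.Parameter(torch.randn(16, r) * 0.01)
B = nn.Parameter(torch.zeros(r, 4))

# Only A and B are handed to the optimizer.
optimizer = torch.optim.Adam([A, B], lr=1e-2)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 16), torch.randn(32, 4)  # synthetic data
w_before = base.weight.clone()

for step in range(20):
    optimizer.zero_grad()
    pred = base(x) + alpha * (x @ A @ B)  # base path + LoRA path
    loss = loss_fn(pred, y)
    loss.backward()
    optimizer.step()

base_unchanged = torch.equal(base.weight, w_before)
b_moved = B.abs().sum().item() > 0
print(base_unchanged, b_moved)
```

After the loop, the base weight matrix is bit-for-bit identical to its starting value, while B has moved away from zero. In the first step only B receives a nonzero gradient, since the gradient with respect to A is multiplied by B, which is still zero at that point.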

Bridge: why data matters next

So far, we explored how to update model weights efficiently (LoRA and its variants).

But fine-tuning also depends on what data you use to update the model.

This brings us to instruction fine-tuning (IFT)—the process of teaching an LLM how to follow human instructions by training it on curated instruction–response pairs.

IFT is the foundation of supervised fine-tuning (SFT), and most modern LLMs rely on some form of IFT during alignment.

That was pretty simple, wasn't it?

Key takeaways

  • Initialize B at zero so training starts from the base checkpoint.
  • Attach adapters to the largest, most influential linear maps—often attention projections first.