sharpbyte.dev
← Fine-tuning
Fine-tuning · topic 3 of 11

Five LLM fine-tuning techniques (and bonuses)

LoRA family, LoRA-drop, QLoRA, DoRA—parameter-efficient paths when full updates are prohibitive.

Why PEFT matters

5 LLM Fine-tuning Techniques. Traditional fine-tuning (depicted below) is infeasible with LLMs because these models have billions of parameters and are hundreds of GBs in size, and not everyone has access to such computing infrastructure.

Traditional full fine-tuning vs the LLM scale problem.
Traditional full fine-tuning vs the LLM scale problem.

Thankfully, today, we have many optimal ways to fine-tune LLMs, and five such popular techniques are depicted below:

Overview of five popular efficient techniques.
Overview of five popular efficient techniques.

Let’s understand these:

1) LoRA

1) LoRA. Add two low-rank matrices A and B alongside weight matrices, which contain the trainable parameters. Instead of fine-tuning W, adjust the updates in these low-rank matrices.

LoRA injects trainable low-rank adapters.
LoRA injects trainable low-rank adapters.

2) LoRA-FA · 3) VeRA

2) LoRA-FA. While LoRA considerably decreases the total trainable parameters, it still requires substantial activation memory to update the low-rank weights. LoRA-FA (FA stands for Frozen-A) freezes the matrix A and only updates matrix B.

LoRA-FA: freeze A, train B.
LoRA-FA: freeze A, train B.

3) VeRA. In LoRA, every layer has a different pair of low-rank matrices A and B, and both matrices are trained. In VeRA, however, matrices A and B are frozen, random, and shared across all model layers. VeRA focuses on learning small, layer-specific scaling vectors, denoted as b and d, which are the only trainable parameters in this setup.

VeRA: shared frozen A/B; train small scale vectors.
VeRA: shared frozen A/B; train small scale vectors.

4) Delta-LoRA · 5) LoRA+

4) Delta-LoRA. Here, in addition to training low-rank matrices, the matrix W is also adjusted but not in the traditional way. Instead, the difference (or delta) between the product of the low-rank matrices A and B in two consecutive training steps is added to W:

Delta-LoRA applies a delta from AB across steps.
Delta-LoRA applies a delta from AB across steps.

5) LoRA+. In LoRA, both matrices A and B are updated with the same learning rate. Authors found that setting a higher learning rate for matrix B results in more optimal convergence.

LoRA+ uses asymmetric learning rates for A and B.
LoRA+ uses asymmetric learning rates for A and B.

Bonus: LoRA-drop

Bonus: LoRA-drop. LoRA-drop observes that not all layers benefit equally from LoRA updates. It first adds low-rank matrices to every layer and trains briefly, then measures each layer’s activation strength to see which layers actually matter.

LoRA-drop measures which layers matter.
LoRA-drop measures which layers matter.

Layers whose LoRA activations stay near zero have minimal influence on the model's output and can be removed.

Prune low-impact LoRA layers after probing.
Prune low-impact LoRA layers after probing.

By keeping LoRA only in high-impact layers, LoRA-drop reduces training cost and speeds up fine-tuning with little to no loss in accuracy.

Bonus: QLoRA

Bonus: Quantized Low-Rank Adaptation (QLoRA). QLoRA is an improvement on the LoRA technique discussed above, which further addresses the memory limitations associated with fine-tuning large models using LoRA.

More specifically, if we recall what we discussed above in LoRA, we saw that we augment the network layers whose weights are W with two matrices A and B.

QLoRA builds on LoRA with quantized base weights.
QLoRA builds on LoRA with quantized base weights.

Now, considering the example where we have 25 Million parameters in the weight matrix W:

Memory footprint of full-precision W.
Memory footprint of full-precision W.

Typically, these 25 million parameters will be represented as float32, which requires 32 bits (or 4 bytes) per parameter. This leads to a significant memory footprint, especially for large LLMs.

This results in a memory utilization of (25 million × 4 bytes/parameter) = 100 million bytes for this matrix alone, which is 0.1GBs.

The idea in QLoRA is to reduce this memory utilization of weight matrix W using quantization.

As you may have guessed, quantization involves using lower-bit representations, such as 16-bit, 8-bit, or 4-bit, to represent parameters.

Quantize W; adapters stay trainable in low rank.
Quantize W; adapters stay trainable in low rank.

This results in a significant decrease in the amount of memory required to store the model's parameters.

For instance, consider your model has over a million parameters, each represented with 32-bit floating-point numbers.

If possible, representing them with 8-bit numbers can result in a significant decrease (~75%) in memory usage while still allowing for a large range of values to be represented.

Bit-width vs size trade-off (conceptual).
Bit-width vs size trade-off (conceptual).

Of course, quantization introduces a trade-off between model size and precision.

While reducing the bit-width of parameters makes the model smaller, it also leads to a loss of precision.

This means the model's predictions become more somewhat approximate than the original, full-precision model.

QLoRA does employ some special techniques to preserve the information as much as possible, but there is definitely some trade-off involved.

This somewhat lies along the lines of quantization in model compression, which we will cover ahead in the LLM optimization section.

Bonus: DoRA

Bonus: DoRA. DoRA (Weight-Decomposed Low-Rank Adaptation) represents a refined approach to fine-tuning large models by addressing a key limitation of LoRA while preserving its efficiency.

At its core, DoRA builds upon the principles of LoRA but introduces a decomposition step that separates a pretrained weight matrix W into two components: magnitude (m) and direction (V).

DoRA separates magnitude and direction of weights.
DoRA separates magnitude and direction of weights.

This separation allows the fine-tuning process to target these components independently, improving parameter efficiency and performance.

Key takeaways

  • PEFT trades full-weight updates for small trainable adapters or shared structures.
  • QLoRA attacks memory by quantizing W; DoRA refines what gets updated in each matrix.