Issues with traditional fine-tuning

Billions of parameters, GPU RAM, multi-tenant storage—why full fine-tuning doesn’t scale for LLMs.

However, a problem arises when we use the same traditional fine-tuning technique on much larger models—LLMs, for instance.

This is because these models are huge, with billions or even trillions of parameters.

Consider the size difference between BERT-large and GPT-3:

BERT-large vs GPT-3 size / memory (deck).

Traditional fine-tuning of BERT-large (about 340M parameters) fits comfortably on a single GPU. The same is not possible with GPT-3, which has 175B parameters.

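At float16 precision (2 bytes per parameter), that is 350 GB of memory just to store the model weights, before counting the gradients and optimizer state that training also needs.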
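The 350 GB figure is simply the parameter count multiplied by the bytes per parameter. Here is a minimal sketch of that arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes) and a BERT-large size of roughly 340M parameters; the function name `weight_memory_gb` is illustrative and not part of any library:

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Return the memory needed to hold the weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


gpt3_gb = weight_memory_gb(175e9)        # 175B parameters
bert_large_gb = weight_memory_gb(340e6)  # ~340M parameters (assumed figure)

print(f"GPT-3 weights (float16): {gpt3_gb:.0f} GB")
print(f"BERT-large weights (float16): {bert_large_gb:.2f} GB")
```

This counts the weights only, so it is a lower bound on what fine-tuning actually requires.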

Imagine OpenAI used traditional fine-tuning within its fine-tuning API:

If 10 users fine-tuned GPT-3 → 3,500 GB (3.5 TB) to store weights.
If 1,000 users fine-tuned GPT-3 → 350,000 GB (350 TB) to store weights.
If 100,000 users fine-tuned GPT-3 → 35,000,000 GB (35 PB) to store weights.

Multi-tenant storage / usage scaling implied by per-user full fine-tuning (deck).
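The scaling above is linear in the number of users, since every user keeps a full copy of the weights. A rough sketch of the calculation, assuming 350 GB per copy and decimal units; the helper names `total_storage_gb` and `human_readable` are hypothetical:

```python
GPT3_WEIGHTS_GB = 350  # 175B parameters at float16, as stated in the text


def total_storage_gb(num_users: int, per_copy_gb: int = GPT3_WEIGHTS_GB) -> int:
    """Storage needed if every user keeps a full copy of the weights."""
    return num_users * per_copy_gb


def human_readable(gb: int) -> str:
    """Format a decimal-GB amount as GB, TB, or PB."""
    if gb >= 1_000_000:
        return f"{gb / 1_000_000:g} PB"
    if gb >= 1_000:
        return f"{gb / 1_000:g} TB"
    return f"{gb} GB"


for users in (10, 1_000, 100_000):
    gb = total_storage_gb(users)
    print(f"{users:>7,} users -> {gb:>10,} GB ({human_readable(gb)})")
```

Nothing is shared between users in this model, which is exactly what makes one full copy per customer so expensive.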

And the problems don't end there:

OpenAI bills solely based on usage, so a stored model that is never queried brings in no revenue.
What if someone fine-tunes the model just to learn and then never uses it?
Since a request can come at any time, should every fine-tuned model stay loaded in memory, given that loading 350 GB of weights on demand is slow?

Traditional fine-tuning is simply not practical here, and not everyone can afford it in any case, because it requires infrastructure on a massive scale.

Additionally, maintaining the infrastructure to serve fine-tuning requests from potentially thousands of customers at the same time would be a huge undertaking for the provider.

Key takeaways

Full-weight LLM fine-tuning drives up GPU memory, storage, and serving complexity.
Multi-tenant APIs make “one copy per customer” economically implausible without parameter-efficient fine-tuning (PEFT).