Billions of parameters, GPU RAM, multi-tenant storage—why full fine-tuning doesn’t scale for LLMs.
However, a problem arises when we use the same traditional fine-tuning technique on much larger models—LLMs, for instance.
This is because, as you may already know, these models are huge—billions or even trillions of parameters.
Consider the size difference between BERT-large and GPT-3:
Fine-tuning BERT-large on a single GPU is easy with traditional fine-tuning. But it’s impossible with GPT-3, which has 175B parameters.
That’s 350GB of memory just to store model weights (float16 precision).
Imagine OpenAI used traditional fine-tuning within its fine-tuning API:
And the problems don't end there:
Traditional fine-tuning is just not practically feasible here, and in fact, not everyone can afford to do it due to a lack of massive infrastructure.
Additionally, maintaining the infrastructure to support fine-tuning requests from potentially thousands of customers simultaneously would be a huge task for them.