Under the hood: turn text into tokens, run them through a deep Transformer stack, and train billions of weights on a cluster of GPUs.
Most modern LLMs are built on the Transformer architecture, introduced in the 2017 paper “Attention Is All You Need” (Vaswani et al.).
Transformers process entire sequences of tokens in parallel and use a mechanism called self-attention to decide which parts of the input are most relevant to each other.
This design largely replaced older recurrent models (RNNs and LSTMs), which processed text one token at a time; removing that sequential bottleneck made training much faster and easier to scale.
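To make self-attention concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes the queries, keys, and values are simply the input vectors themselves; a real Transformer would first pass the input through learned projection matrices and use several attention heads. The function names and the toy numbers are illustrative only.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of vectors, one per token."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Toy input: three tokens with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)  # assumption: Q = K = V = x, no learned projections
for row in out:
    print([round(value, 3) for value in row])
```

Each pass through the outer loop depends only on the inputs, not on the result for any other token, which is why the computation can run for all tokens in parallel on a GPU.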
Raw text is first split into tokens (typically words, subword pieces, or punctuation) by a tokenizer.
Each token is mapped to a unique ID from a fixed vocabulary.
The model never sees raw characters directly; it operates only on these token IDs.
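As a rough illustration of the text-to-ID step, the sketch below uses a hypothetical six-entry vocabulary and splits on whitespace. Production tokenizers typically learn tens of thousands of subword entries (for example with byte-pair encoding), so treat this as a simplification rather than how any particular model tokenizes.

```python
# Hypothetical toy vocabulary; real vocabularies are learned from data and far larger.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

def encode(text):
    """Split on whitespace and map each piece to its ID (0 if unknown)."""
    return [VOCAB.get(piece, VOCAB["<unk>"]) for piece in text.lower().split()]

def decode(ids):
    """Map IDs back to token strings and join them with spaces."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("The cat sat on the mat")
print(ids)          # [1, 2, 3, 4, 1, 5]
print(decode(ids))  # the cat sat on the mat
```

Words outside the vocabulary collapse to the `<unk>` ID here; subword tokenizers avoid most of that loss by breaking unfamiliar words into smaller known pieces.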
Each token ID is converted into a dense vector called an embedding.
Since Transformers do not inherently understand order, positional encodings are added so the model knows where each token appears in the sequence.
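The sketch below shows one way the embedding lookup and position information can be combined. It assumes a randomly initialized embedding table (a real model learns these values during training) and uses the sinusoidal encoding from the original Transformer paper; many current models use learned or rotary position schemes instead. All names and sizes are illustrative.

```python
import math
import random

def make_embedding_table(vocab_size, d_model, seed=0):
    """One row of d_model numbers per token ID. Random here; learned in a real model."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(vocab_size)]

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    vec = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def embed(token_ids, table):
    """Look up each token's embedding and add the encoding for its position."""
    d_model = len(table[0])
    return [
        [e + p for e, p in zip(table[tok], positional_encoding(pos, d_model))]
        for pos, tok in enumerate(token_ids)
    ]

table = make_embedding_table(vocab_size=6, d_model=8)
vectors = embed([1, 2, 1], table)
# Token ID 1 appears at positions 0 and 2, so its two input vectors differ.
print(vectors[0] != vectors[2])  # True
```

Without the positional term, both occurrences of token ID 1 would enter the model as identical vectors, and the model could not tell them apart by position.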
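These vectors then pass through a stack of Transformer layers, each combining self-attention with a feed-forward network.
With each successive layer, the model builds increasingly abstract representations of meaning, structure, and intent.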
The final layer produces a score (a logit) for every token in the vocabulary, and a softmax converts those scores into a probability distribution.
The model selects (or samples) the next token, appends it to the sequence, and repeats the process until it emits an end-of-sequence token or reaches a length limit.
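The generation loop can be sketched as follows. The `toy_model` function is a hypothetical stand-in for a trained Transformer: it just gives a high score to the ID that follows the last one, so the output is predictable. The `eos_id` and `max_new_tokens` parameters are likewise assumptions chosen for illustration.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(token_ids, vocab_size):
    """Hypothetical stand-in for a Transformer: favors the ID after the last one."""
    logits = [0.0] * vocab_size
    logits[(token_ids[-1] + 1) % vocab_size] = 5.0
    return logits

def generate(prompt_ids, vocab_size, eos_id, max_new_tokens, greedy=True, seed=0):
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(ids, vocab_size))
        if greedy:
            # Pick the single most likely token.
            next_id = max(range(vocab_size), key=lambda i: probs[i])
        else:
            # Sample in proportion to the predicted probabilities.
            next_id = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        ids.append(next_id)
        if next_id == eos_id:  # stop once the end-of-sequence token appears
            break
    return ids

print(generate([1], vocab_size=5, eos_id=4, max_new_tokens=10))  # [1, 2, 3, 4]
```

Note that the whole sequence is fed back into the model at every step, so the work done per generated token is one reason usage is metered in tokens.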
Training an LLM involves feeding massive text datasets through this architecture and adjusting billions of parameters to reduce next-token prediction error, using gradient-based optimizers such as Adam.
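For a sense of what an optimizer step looks like, here is a minimal sketch of the Adam update applied to a single parameter. The loss `(w - 3)^2` is a made-up stand-in for next-token prediction loss, and the learning rate is chosen for this toy problem; an LLM applies the same kind of update to billions of parameters at once.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter w at step t (starting from 1)."""
    m = beta1 * m + (1 - beta1) * grad         # running average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (an assumption for illustration): minimize (w - 3)^2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * (w - 3.0)  # derivative of (w - 3)^2 with respect to w
    w, m, v = adam_step(w, grad, m, v, t)

print(w)  # should end close to 3.0, the minimum of the toy loss
```

Adam keeps two extra running averages per parameter, which is part of why training needs considerably more memory than inference.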
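Because the largest models do not fit in the memory of a single GPU, and because the datasets are so large, training is distributed across hundreds or thousands of GPUs using techniques such as data parallelism (each GPU processes a different slice of the batch), model parallelism (each layer's weights are split across GPUs), and pipeline parallelism (different layers live on different GPUs).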
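Of these techniques, data parallelism is the simplest to illustrate. The sketch below simulates it in a single process: each hypothetical worker computes a gradient on its own shard of the batch, and the results are averaged, which plays the role of the all-reduce step on a real cluster. It assumes a one-parameter model `y = w * x` and a batch that divides evenly among workers; model and pipeline parallelism, which split the model itself, are not shown.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_gradient(w, batch, num_workers):
    """Simulate data parallelism: shard the batch, compute locally, then average."""
    shard_size = len(batch) // num_workers  # assumes the batch divides evenly
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # On a real cluster each shard would be processed on a separate GPU.
    local_grads = [gradient(w, shard) for shard in shards]
    # Averaging the local gradients stands in for the all-reduce step.
    return sum(local_grads) / num_workers

# Made-up (x, y) pairs that follow y = 2 * x.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
print(gradient(0.0, batch))                   # -30.0
print(data_parallel_gradient(0.0, batch, 2))  # -30.0
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so every worker can apply the same update and keep its copy of the weights in sync.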
If you are building apps rather than training models, this still matters: it explains why frontier models are expensive to build, why major releases are infrequent, and why inference is priced per token.