sharpbyte.dev

How are LLMs built?

Under the hood: turn text into tokens, run them through a deep Transformer stack, and train billions of weights on a cluster of GPUs.

The Transformer architecture

Most modern LLMs are built on the Transformer architecture, introduced in the 2017 paper “Attention Is All You Need” (Vaswani et al.).

Transformers process entire sequences of tokens in parallel and use a mechanism called self-attention to decide which parts of the input are most relevant to each other.

This design replaced older recurrent models, which processed text one token at a time. Because a Transformer handles all tokens of a sequence in parallel, training is much faster and scales better.

Text flows through tokenization, stacked Transformer layers, and a next-token prediction head.

Tokenization

Raw text is first split into tokens, typically whole words or subword pieces, by a tokenizer.

Each token is mapped to a unique ID from a fixed vocabulary.

The model never sees raw characters directly—it only operates on these token IDs.

Raw text is split into tokens, then mapped to numeric IDs the model can process.
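A minimal sketch of the text-to-ID step, assuming a tiny hand-written vocabulary and a greedy longest-match rule. Both are illustrative assumptions: production tokenizers learn their vocabulary from data (for example with byte-pair encoding), and the names VOCAB, tokenize, encode, and decode are hypothetical.

```python
# Hypothetical toy vocabulary, written by hand for illustration only.
VOCAB = {"<unk>": 0, "un": 1, "break": 2, "able": 3, " ": 4, "the": 5, "code": 6}


def tokenize(text, vocab=VOCAB):
    """Split text into vocabulary pieces using greedy longest-match."""
    tokens = []
    max_len = max(len(piece) for piece in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # No piece matched: emit the unknown token and skip one character.
            tokens.append("<unk>")
            i += 1
    return tokens


def encode(text, vocab=VOCAB):
    """Map text to the list of integer IDs the model would actually see."""
    return [vocab[token] for token in tokenize(text, vocab)]


def decode(ids, vocab=VOCAB):
    """Map IDs back to text by inverting the vocabulary."""
    id_to_piece = {idx: piece for piece, idx in vocab.items()}
    return "".join(id_to_piece[idx] for idx in ids)


print(tokenize("unbreakable code"))  # ['un', 'break', 'able', ' ', 'code']
print(encode("unbreakable code"))    # [1, 2, 3, 4, 6]
```

Note that the word "unbreakable" is not in the vocabulary, yet it is still representable as three known pieces. That is the main reason subword vocabularies are popular: a fixed vocabulary can cover words it has never seen whole.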

Embeddings and positional encoding

Each token ID is converted into a dense vector called an embedding.

Since Transformers do not inherently understand order, positional encodings are added so the model knows where each token appears in the sequence.
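The lookup-then-add step can be sketched as follows. This assumes a randomly initialized embedding table (in a real model it is learned) and the fixed sinusoidal encoding from the original Transformer paper; many current models use learned or rotary position schemes instead. The function names are hypothetical.

```python
import math
import random


def make_embedding_table(vocab_size, d_model, seed=0):
    """One row of d_model numbers per token ID (random here, learned in practice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]


def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    vec = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def embed(token_ids, table):
    """Look up each token's embedding and add the encoding for its position."""
    d_model = len(table[0])
    output = []
    for pos, token_id in enumerate(token_ids):
        token_vec = table[token_id]
        pos_vec = sinusoidal_position(pos, d_model)
        output.append([t + p for t, p in zip(token_vec, pos_vec)])
    return output


table = make_embedding_table(vocab_size=10, d_model=4)
vectors = embed([1, 1], table)
# Same token ID at two positions yields two different input vectors.
print(vectors[0] != vectors[1])  # True
```

The usage lines show why the positional term matters: without it, the two occurrences of token 1 would be indistinguishable to the layers that follow.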

Transformer layers

Each Transformer layer combines the two components listed below. As data passes through more layers, the model builds increasingly abstract representations of meaning, structure, and intent.

  • Self-attention — allows each token to “look at” other tokens and gather context.
  • Feed-forward network — transforms the representation of each token independently.
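The two components above can be sketched in plain Python. This is a simplified, single-head version under several assumptions: no biases, no residual connections or layer normalization, a ReLU activation, and a causal mask (each token only attends to itself and earlier tokens, as is typical for text-generating models). Real layers use multiple heads and optimized tensor libraries; the function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v, causal=True):
    """Single-head scaled dot-product attention over the token vectors in x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    outputs, all_weights = [], []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future tokens
            else:
                scores.append(sum(a * b for a, b in zip(q_i, k_j)) / scale)
        weights = softmax(scores)
        # Weighted sum of value vectors: the context gathered for token i.
        mixed = [sum(w * v_j[d] for w, v_j in zip(weights, v)) for d in range(len(v[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


def feed_forward(x, w_1, w_2):
    """Apply the same two-layer network to every token vector independently."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w_1)]
    return matmul(hidden, w_2)


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended, weights = self_attention(tokens, identity, identity, identity)
print(weights[0])  # [1.0, 0.0] : the first token can only attend to itself
```

Attention is the only place where tokens exchange information; the feed-forward step never mixes rows, which is what "independently" means in the list above.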

Output head and next-token prediction

The final layer produces a score (a logit) for every token in the vocabulary, and a softmax turns those scores into a probability distribution over the next token.

The model selects (or samples) the next token, appends it to the sequence, and repeats the process until it emits an end-of-sequence token or reaches a length limit.
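The loop described above can be sketched like this. The toy_model function is a hypothetical stand-in for a real network: it returns raw scores that simply favor "last ID plus one". The end-of-sequence ID and the max_new_tokens cap are assumed stopping rules, not something fixed by the architecture.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next(probs, greedy=True, rng=random):
    """Greedy: take the most likely token. Otherwise sample from the distribution."""
    if greedy:
        return max(range(len(probs)), key=probs.__getitem__)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(model, prompt_ids, eos_id, max_new_tokens=20, greedy=True):
    """Autoregressive loop: predict, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(model(ids))  # distribution over the whole vocabulary
        next_id = pick_next(probs, greedy)
        ids.append(next_id)
        if next_id == eos_id:        # assumed stop condition
            break
    return ids


def toy_model(ids, vocab_size=5):
    """Hypothetical stand-in for a trained model: favors (last ID + 1)."""
    logits = [0.0] * vocab_size
    logits[(ids[-1] + 1) % vocab_size] = 5.0
    return logits


print(generate(toy_model, [0], eos_id=4))  # [0, 1, 2, 3, 4]
```

Each new token requires another pass through the model with the extended sequence, which is why generation cost grows with the number of tokens produced.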

Training at scale

Training an LLM involves feeding massive datasets through this architecture, measuring how badly the model predicts each next token, and adjusting billions of parameters to reduce that error using gradient-based optimizers like Adam.
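To make the "adjusting parameters" step concrete, here is a minimal sketch of one Adam update in plain Python, applied to a single-parameter toy problem. The hyperparameter defaults are the commonly used ones; the state dictionary layout and the toy loss (w - 3)^2 are assumptions for illustration. Real training applies the same rule to billions of parameters at once through a tensor library.

```python
import math


def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place and return the parameter list."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grads):
        # Running averages of the gradient and of the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction: both averages start at zero.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
params = [0.0]
state = {"t": 0, "m": [0.0], "v": [0.0]}
for _ in range(500):
    grads = [2 * (params[0] - 3.0)]
    adam_step(params, grads, state, lr=0.1)
print(round(params[0], 2))  # ends close to 3
```

Adam keeps two extra numbers per parameter (m and v), so optimizer state alone takes twice as much memory as the weights it updates. That overhead is part of why large models outgrow a single GPU.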

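Because frontier models, together with their gradients and optimizer state, are too large to fit in the memory of a single GPU, training is distributed across thousands of GPUs using techniques like data parallelism (each GPU processes a different slice of the batch), model parallelism (each GPU holds part of each layer's weights), and pipeline parallelism (each GPU holds a different group of layers).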
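Of the three techniques, data parallelism is the easiest to sketch. The example below simulates it in a single process, assuming a one-parameter linear model with mean squared error and equal-sized shards; all_reduce_mean is a hypothetical stand-in for the collective operation a real cluster would run over the network.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker ends up with the average."""
    return sum(values) / len(values)


def data_parallel_grad(w, batch, num_workers):
    """Split the batch into equal shards, compute local gradients, then average."""
    assert len(batch) % num_workers == 0, "sketch assumes equal-sized shards"
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    local_grads = [grad_mse(w, shard) for shard in shards]  # one per simulated worker
    return all_reduce_mean(local_grads)


batch = [(float(x), 2.0 * x + 1.0) for x in range(8)]
single = grad_mse(0.5, batch)
parallel = data_parallel_grad(0.5, batch, num_workers=4)
print(abs(single - parallel) < 1e-9)  # True: same gradient, work split four ways
```

With equal shards the averaged gradient matches the full-batch gradient, so every worker can apply the same update and keep its copy of the weights in sync. Model and pipeline parallelism instead split the weights themselves and are not shown here.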

If you are building apps—not training—this still matters: it explains why frontier models are expensive, why releases are rare, and why inference is priced per token.

Key takeaways

  • LLMs = tokenizer + embeddings + stacked Transformer layers + next-token head.
  • Self-attention lets every token use context from the whole sequence.
  • Distributed training across many GPUs is required at frontier scale.