
The Complete History of LLMs: From RNN to GPT

2026-03-01 · 17 min read · AI · Tutorial

Introduction

LLMs have been in the spotlight for over two years now, and information is everywhere — but most of it is either too academic (formulas on page one) or too fragmented (covering just one model or trick).

I've been looking for a single article that tells the full story of LLMs from start to finish — not a paper survey, but a technical map from a developer's perspective. Couldn't find one I liked, so I wrote it myself.

This article is for developers with programming experience who want to understand the big picture. After reading, you'll know: where LLMs came from, how they work, who the major players are, what the core concepts mean, and how to start learning.


1. A Brief History of NLP: From Rules to Neural Networks

Before we talk about LLMs, let's quickly trace the evolution of NLP (Natural Language Processing). This history helps you understand why the Transformer was a paradigm shift.

1950s ──── Rule-based systems (hand-written grammar rules, brittle but explainable)
  │
1990s ──── Statistical methods (TF-IDF, n-grams, Naive Bayes)
  │         └─ Probability replaced rules, but features were all hand-crafted
  │
2003  ──── Neural language models (Bengio — learning word representations)
  │
2013  ──── Word2Vec ⭐ (The word vector revolution)
  │         └─ "King - Man + Woman ≈ Queen"
  │
2014  ──── Seq2Seq (Encoder-Decoder, machine translation breakthrough)
  │
2015  ──── Attention mechanism (stop treating every word equally)
  │
2017  ──── Transformer ⭐⭐⭐ (Changed everything)
  │
2018+ ──── BERT, GPT series, the LLM era begins

Key Milestones

Word2Vec (2013) was the first moment ordinary developers felt "AI understands language." It mapped each word to a vector (say, 300 dimensions), where semantically similar words were close together. Even more remarkably, vector arithmetic captured semantic relationships.
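The analogy arithmetic is easy to see in a few lines of NumPy. The vectors below are tiny hand-made toys (real Word2Vec embeddings are learned from corpora, typically 100-300 dimensions), but the nearest-neighbor logic is exactly the same:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors, hand-crafted for illustration only.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9, 0.2]),
    "apple": np.array([0.1, 0.1, 0.1, 0.9]),
}

# The famous analogy: king - man + woman should land nearest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # → queen
```

With real embeddings the same nearest-neighbor search is run over hundreds of thousands of words, which is what made the result so striking in 2013.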

Seq2Seq + Attention (2014-2015) solved the "input a sequence, output another sequence" problem (e.g., translation). The Attention mechanism let the model "look back" at the most relevant parts of the input when generating each word, instead of compressing the entire input into a single fixed vector.

These two breakthroughs directly paved the way for the Transformer.


2. Transformer: The Paper That Changed Everything (2017)

In 2017, Google published "Attention Is All You Need," introducing the Transformer architecture. Bold title — but it delivered.

2.1 Core Idea: Self-Attention

Models before Transformer (RNN/LSTM) processed text sequentially — one word at a time, like reading a sentence left to right. This caused two problems: slow training (no parallelism) and long-range dependencies getting lost (forgetting the beginning by the end).

Self-Attention takes a completely different approach. Imagine reading this sentence:

"The cat sat on the mat because it was tired."

When you read "it," your brain automatically links it to "cat." Self-Attention does exactly this — it lets every word in a sentence directly "see" every other word and compute their relevance.

No sequential processing. All words handled simultaneously, relationships computed in one step. That's why Transformers can train in parallel, an order of magnitude faster than RNNs.

2.1.1 Q/K/V: The Mechanics of Self-Attention

The Self-Attention computation is easier to grasp with an analogy: imagine you're searching for a book in a library.

  • Query: The question in your mind — "I want a book about machine learning"
  • Key: The label on each book's cover — "Machine Learning", "Cooking", "History"
  • Value: The actual content of the book

The process: match your Query against each book's Key. Books (Values) with high match scores get more of your attention; low-scoring ones get skimmed over.

In a Transformer, every word generates its own Q, K, and V vectors (via three separate linear transformations). Then:

1. Compute attention scores: Score = Q × K^T (how much each word attends to every other word)
2. Scale: Score / √d_k (large dot products would saturate the softmax and make its gradients vanish; dividing by √d_k keeps values in a stable range)
3. Softmax normalization: convert scores into a probability distribution
4. Weighted sum: weight V by the probabilities to get the final output

Written as a formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V

In practice, Transformers also use Multi-Head Attention — splitting Q/K/V into multiple groups (e.g., 8 or 32), each computing attention independently, then concatenating the results. This lets the model attend to different types of relationships simultaneously (for instance, one head captures syntactic relations while another captures semantic ones).
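The four numbered steps above fit in a few lines of NumPy. This is a bare sketch of single-head scaled dot-product attention over random vectors, not a full Transformer layer (no learned projections, no masking, no multi-head split):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # steps 1-2: score, then scale
    weights = softmax(scores)        # step 3: rows become probability distributions
    return weights @ V               # step 4: weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                  # 4 "words", 8-dimensional vectors
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one context-aware vector per word
```

Note that nothing in this computation is sequential: all four positions are processed in one pair of matrix multiplications, which is exactly where the parallelism over RNNs comes from.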

2.1.2 Positional Encoding

Self-Attention has a blind spot: it's completely order-agnostic. "Cat chases dog" and "dog chases cat" look identical to pure Self-Attention.

The fix is to add a positional encoding vector to each position. The original Transformer used fixed encodings generated by sinusoidal functions. Later models (like RoPE, adopted by LLaMA and most modern architectures) switched to rotary positional encoding, which handles long sequences much more gracefully.
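For reference, the original sinusoidal encoding is straightforward to generate. This sketch follows the formula from the 2017 paper: sine on even dimensions, cosine on odd ones, with wavelengths growing geometrically across dimensions:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed positional encodings from "Attention Is All You Need":
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(seq_len)[:, None]      # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=16, d_model=64)
print(pe.shape)   # (16, 64)
print(pe[0, :4])  # position 0: [0, 1, 0, 1] (sin 0, cos 0, ...)
```

These vectors are simply added to the word embeddings before the first attention layer, giving the model a way to tell "cat chases dog" apart from "dog chases cat."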

2.2 Encoder vs Decoder

The original Transformer has two parts:

┌─────────────┐    ┌─────────────┐
│   Encoder   │───→│   Decoder   │
│(understand) │    │ (generate)  │
└─────────────┘    └─────────────┘
  • Encoder: Reads the entire input, produces contextual representations. Each word sees all other words (bidirectional).
  • Decoder: Generates output step by step. Each word only sees preceding words (unidirectional), plus the Encoder's output.

Later models chose different combinations based on their goals:

  • Encoder only → BERT (excels at understanding)
  • Decoder only → GPT (excels at generation)
  • Both → T5 (translation, summarization)

2.3 Tokenization: How Models "Read" Text

Models don't process raw text directly. They first split text into tokens. The mainstream method is BPE (Byte Pair Encoding):

  • "unhappiness" → ["un", "happiness"] or ["un", "happ", "iness"]
  • Chinese is typically split by character or subword: "大语言模型" → ["大", "语言", "模型"]

Different models use different tokenizers, which is why the same text consumes different token counts across models.
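The core of BPE is simple: repeatedly find the most frequent adjacent token pair and merge it into a new token. Here is a toy sketch of that loop; real tokenizers add byte-level handling, special tokens, and a saved merge table, but the mechanism is the same:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs; BPE repeatedly merges the top one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge(tokens, pair):
    """Replace every occurrence of `pair` with the concatenated token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters and apply a few merges.
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge(tokens, most_frequent_pair(tokens))
print(tokens)  # "low" has become a single token after a few merges
```

Training a tokenizer means running this loop tens of thousands of times over a large corpus and saving the merge order; tokenizing new text replays those merges.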


3. LLM Evolution: Three Paths

After Transformer, LLMs evolved along three routes:

3.1 Encoder Path: Understanding

BERT (2018) was the first model that made people exclaim "NLP has been revolutionized." Using the Encoder architecture, it pre-trained via "fill in the blank" (Masked Language Model) — randomly masking words and having the model predict them.

Input:  The weather today is [MASK] nice
Predict: [MASK] → very

BERT excels at understanding tasks: text classification, sentiment analysis, QA, named entity recognition. RoBERTa and DeBERTa optimized training strategies but kept the core architecture.

3.2 Decoder Path: Generation (Mainstream)

The GPT series took a different road — Decoder only, trained by "predicting the next word":

Input:  The weather today is
Predict: → very → nice → .

Looks simple, but when the model is large enough and the data plentiful enough, this simple objective produces astonishing capabilities.
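The training objective maps directly onto the generation loop: predict a token, append it, repeat. The sketch below fakes the "model" with a hand-written bigram lookup table; a real LLM replaces the table with a Transformer over billions of parameters, but the autoregressive loop is identical:

```python
# Toy "language model": next-token lookup as a hand-written bigram table.
bigram = {
    "The": "weather", "weather": "today", "today": "is",
    "is": "very", "very": "nice", "nice": ".",
}

def generate(prompt, max_new_tokens=10):
    """Autoregressive decoding: predict, append, feed back in, repeat."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        nxt = bigram.get(tokens[-1])
        if nxt is None:  # "." has no successor: stop generating
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("The weather today is"))  # → The weather today is very nice .
```

A real model also samples from a probability distribution at each step (temperature, top-p) rather than following a single deterministic successor.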

  • GPT-1 (2018): 117M parameters, proved the "pre-train + fine-tune" paradigm
  • GPT-2 (2019): 1.5B parameters, generation quality so good OpenAI initially withheld release
  • GPT-3 (2020): 175B parameters, demonstrated few-shot ability (no fine-tuning needed, just give examples)
  • GPT-4 (2023): Multimodal, comprehensive capability leap
  • GPT-4o / o1 / o3 (2024-2025): Native multimodal, enhanced Chain of Thought reasoning

3.3 Encoder-Decoder Path

T5 (Google, 2019) unified all NLP tasks into a "text-to-text" format:

Translation:     "translate English to French: Hello" → "Bonjour"
Summarization:   "summarize: [long text]" → "[summary]"
Classification:  "classify: I love this movie" → "positive"

BART (Meta) follows the same path. These models excel at translation and summarization but have been gradually overtaken by pure Decoder models in general conversation.

3.4 Scaling Laws: Bigger Is (Actually) Better

In 2020, OpenAI published the Scaling Laws paper, revealing a pattern: model performance follows a power law with three factors —

  1. Parameters (model size)
  2. Data (training data)
  3. Compute (training compute)

Scale all three proportionally, and performance keeps improving — with no clear ceiling in sight. That's why companies are racing to stack parameters and hoard GPUs.

But there's a second act to this story. In 2022, DeepMind published the Chinchilla paper, which corrected a key assumption in the original Scaling Laws: previous LLMs were generally "over-parameterized and under-trained." Chinchilla demonstrated that with a fixed compute budget, parameters and training data should scale proportionally. A 70B-parameter model trained on more data can outperform a 280B-parameter model that was starved of data.

This finding directly shaped subsequent training strategies — the LLaMA series is Chinchilla's philosophy in action, pairing relatively modest parameter counts with ample training data to achieve results that far exceeded expectations.
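Chinchilla's result is often summarized as a rule of thumb: train on roughly 20 tokens per parameter, with training compute estimated at about 6 FLOPs per parameter per token. Both numbers are approximations of the paper's fitted curves, not exact laws, but they make the trade-off concrete:

```python
def chinchilla_optimal_tokens(n_params):
    """Rule of thumb from the Chinchilla paper: ~20 training tokens
    per parameter (an approximation of the compute-optimal fit)."""
    return 20 * n_params

def training_flops(n_params, n_tokens):
    """Standard estimate: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

n = 70e9                          # a 70B-parameter model (Chinchilla-sized)
d = chinchilla_optimal_tokens(n)  # ≈ 1.4 trillion tokens
print(f"optimal tokens: {d:.1e}")                     # → optimal tokens: 1.4e+12
print(f"compute: {training_flops(n, d):.1e} FLOPs")
```

Run the same budget through a 280B model and you can only afford a fraction of the tokens, which is exactly why the smaller, better-fed model wins.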

3.5 Emergent Abilities

Even more fascinating: when model scale crosses certain thresholds, capabilities appear that smaller models completely lack — multi-step reasoning, code generation, math problem-solving. These abilities weren't explicitly trained; they "emerged."

This is why LLMs are both exciting and unsettling: we've built systems that exceed expectations, but we don't fully understand why they can do what they do.


4. The LLM Landscape

As of early 2026, here's the major model landscape:

| Model | Company | Parameters | Open Source | Key Strengths |
| --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | Undisclosed | No | Native multimodal, fast |
| o1 / o3 | OpenAI | Undisclosed | No | Enhanced reasoning, math/code |
| Claude 4 series | Anthropic | Undisclosed | No | Long context (200K), safety alignment leader |
| Gemini 2.5 | Google | Undisclosed | No | Strong multimodal, 1M token context |
| LLaMA 3 | Meta | 8B-405B | Yes | Open-source benchmark, rich ecosystem |
| DeepSeek V3 / R1 | DeepSeek | 671B (MoE) | Yes | Exceptional cost-efficiency, strong reasoning |
| Qwen 2.5 | Alibaba | 0.5B-72B | Yes | Excellent Chinese, full size range |
| Mistral Large | Mistral | Undisclosed | Partial | European contender, efficiency-focused |
| GLM-4 | Zhipu AI | Undisclosed | Partial | Native Chinese, strong tool use |

Key trends:

  • Open vs Closed: Open-source models (LLaMA, DeepSeek, Qwen) are rapidly closing the gap with closed-source
  • MoE Architecture: DeepSeek V3 has 671B parameters but only activates 37B, dramatically reducing inference cost
  • Reasoning Enhancement: o1/o3 and DeepSeek R1 represent the "let the model think longer" direction
  • Long Context: From 4K → 32K → 128K → 200K → 1M, context windows keep expanding

5. Deep Dive: Anthropic and Claude — The AI Safety Idealists

Among the major LLM players, Anthropic and its Claude series occupy a unique position. It's not the oldest, nor the largest in parameter count, but it represents a distinctive technical philosophy and set of values. Understanding Anthropic's story reveals a thread that's often overlooked but critically important in LLM development — AI safety and alignment.

5.1 Origins: The OpenAI Exodus

In 2021, Dario Amodei and Daniela Amodei led a group of core researchers out of OpenAI to found Anthropic.

This wasn't an ordinary departure. Dario was OpenAI's VP of Research at the time and a core leader of the GPT-2 and GPT-3 projects. Daniela was VP of Operations. Several researchers who left with them had made key contributions to Scaling Laws and RLHF research.

The reason, in a nutshell: they believed OpenAI was commercializing too fast and not prioritizing AI safety enough.

In hindsight, this concern looks increasingly prescient. OpenAI's shift from nonprofit to "capped profit" — and the controversies that followed — has, to some extent, validated the Amodeis' original worries. Anthropic positioned itself as an "AI safety company" from day one — not a company that makes safe AI products, but a company whose core mission is safety research, that also happens to build products.

5.2 Constitutional AI: The Soul of Claude

Anthropic's most significant theoretical contribution to the LLM field is Constitutional AI, published in late 2022. This is also the core technology that sets Claude apart from other models.

The traditional RLHF pipeline looks like this:

Model generates multiple responses → Human annotators rank them → Train a reward model → Optimize via reinforcement learning

This pipeline has several problems:

  • Expensive: Requires large numbers of human annotators, and annotation quality varies
  • Opaque: The "preferences" the model learns are buried inside the reward model, hard to audit
  • Inconsistent: Different annotators apply different standards, leading to unstable model behavior

Constitutional AI takes a fundamentally different approach:

1. Give the model an explicit set of principles (a "constitution"), e.g.:
   - "Choose the response least likely to be exploited by harmful actors"
   - "Choose the most honest, least deceptive response"
   - "Choose the most cautious response on sensitive topics"

2. Model self-critique:
   Model generates response → Model evaluates its own response against the principles → Model revises

3. Train on the model's self-improved data (RLAIF: Reinforcement Learning from AI Feedback)

The advantages:

  • Scalable: No dependency on massive human annotation
  • Transparent: The principles are written out explicitly, auditable and modifiable
  • Consistent: A single set of principles produces consistent behavioral standards

Constitutional AI is more than a training technique — it's a philosophy. Rather than having the model learn "what's good" by imitating human preferences, give it an explicit value framework and let it learn to judge for itself.
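The self-critique loop can be sketched schematically. Everything below is illustrative: `ask_model` stands in for any LLM completion call, and the principles are paraphrased examples, not Anthropic's actual constitution:

```python
# Schematic sketch of the Constitutional AI critique-and-revise loop.
# The principles here are paraphrases for illustration only.
CONSTITUTION = [
    "Choose the response least likely to be exploited by harmful actors.",
    "Choose the most honest, least deceptive response.",
]

def constitutional_revision(user_prompt, ask_model):
    """One critique-and-revise pass per principle. In RLAIF, the revised
    responses become the training data for the next model iteration."""
    response = ask_model(user_prompt)
    for principle in CONSTITUTION:
        critique = ask_model(
            f"Principle: {principle}\nResponse: {response}\n"
            "Does the response violate the principle? Explain briefly."
        )
        response = ask_model(
            f"Original response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response

# Placeholder model for illustration; plug in a real API call in practice.
demo = constitutional_revision("Explain how vaccines work.",
                               lambda prompt: "draft answer")
print(demo)
```

The key structural point: human effort goes into writing and auditing the constitution, while the per-example labeling work is done by the model itself.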

5.3 Claude Series Evolution

Claude's development arc mirrors Anthropic's rapid growth in technical capability:

2023.03 ── Claude 1
            └─ First public release, demonstrated Constitutional AI in practice
            └─ Found a solid balance between safety and usefulness

2023.07 ── Claude 2
            └─ 100K token context window (a breakthrough number at the time)
            └─ Comprehensive capability upgrade, enterprises started taking notice

2024.03 ── Claude 3 series ⭐
            └─ First three-tier lineup: Haiku (fast) / Sonnet (balanced) / Opus (powerful)
            └─ Opus surpassed GPT-4 on multiple benchmarks
            └─ Marked Anthropic's shift from "safe but mediocre" to "safe AND top-tier"

2024.06 ── Claude 3.5 Sonnet ⭐⭐
            └─ The cost-performance benchmark: mid-range price, near-Opus capability
            └─ Outstanding coding ability, became a developer favorite
            └─ Introduced Artifacts — the model can generate interactive content directly

2025    ── Claude 4 series ⭐⭐⭐
            └─ Full lineup upgrade across Opus / Sonnet / Haiku
            └─ 200K token context, industry-leading long-text processing
            └─ Reasoning, coding, and multilingual capabilities all took a major leap
            └─ Claude Code became one of the strongest AI coding assistants

5.4 Claude's Technical Differentiators

Claude has carved out distinct advantages across several dimensions:

Long Context and Information Retrieval

Claude's 200K token context window isn't just a big number. In "Needle in a Haystack" tests (finding a specific piece of information buried in extremely long text), Claude has consistently performed at the top. This means you can feed it an entire codebase or complete technical documentation, and it can genuinely locate and leverage the key information within.

Honesty First

Most models, when uncertain, tend to "make up a plausible-sounding answer." Claude's training objective explicitly includes honesty — it's more inclined to say "I'm not sure" or "I might be wrong" than to confidently fabricate. In practice, this matters enormously: a model that acknowledges uncertainty is far more trustworthy than one that's always brimming with confidence.

Instruction Following

Claude has strong comprehension of complex, multi-step, multi-constraint instructions. For example: "Write a Python function that doesn't use third-party libraries, runs in O(n log n) time, includes type annotations, and handles edge cases" — Claude's compliance rate on these kinds of compound instructions is notably higher than most competitors.

Coding and Claude Code

Claude Code is one of the most popular AI coding tools among developers. It doesn't just "write code" — it understands the full project context, reads the design intent behind existing code, and maintains style consistency when making changes. This reflects Claude's systematic advantage in code comprehension and generation.

5.5 Anthropic's Impact on the LLM Ecosystem

Anthropic's influence extends well beyond the Claude product itself:

  • Raising industry-wide safety awareness: Anthropic's research (Constitutional AI, model interpretability) has pushed the entire industry to take AI safety more seriously. Alignment research at Google, Meta, and others has been influenced by Anthropic's work.

  • Pioneering interpretability research: Anthropic has invested heavily in understanding what's happening inside models. Their Mechanistic Interpretability research aims to open the neural network black box and understand what each neuron is doing — an extraordinarily difficult but critically important direction.

  • MCP protocol: Anthropic's Model Context Protocol aims to establish an open standard for any model to connect to external tools and data sources in a unified way. If MCP becomes an industry standard, it could dramatically lower the barrier to building AI applications — think of it as "USB for AI."

  • From conversation to action: Claude Code and the Agent SDK represent the shift from "chatbot" to "autonomous executor." This isn't just a change in product form — it's an expansion of AI's capability boundary.

In a sense, Anthropic's existence has added a "who's safer, who's more responsible" dimension to the LLM competition, beyond just "who's smarter." That's healthy for the industry as a whole.


6. Core Concepts Explained

This section breaks down the most important concepts in the LLM space. For each one: what it is, why it matters, and how it's used.

6.1 Pre-training → Fine-tuning → Alignment

LLM training has three stages:

Stage 1: Pre-training
  └─ Train on massive text data, learn "language" itself
  └─ Output: Base Model — can continue text, but can't converse

Stage 2: Instruction Fine-tuning (SFT)
  └─ Train on "instruction-response" pairs
  └─ Output: A model that understands instructions and can chat

Stage 3: Alignment
  └─ RLHF / RLAIF / Constitutional AI
  └─ Output: A safe, helpful, and honest model

The ChatGPT and Claude you use have gone through all three stages. A base model is like a scholar who's read every book but has no social skills; an aligned model is a knowledgeable and tactful assistant.

6.2 RLHF and Constitutional AI

RLHF (Reinforcement Learning from Human Feedback) was one of the key technologies behind ChatGPT's success:

  1. Human annotators rank multiple model responses (which is better)
  2. These rankings train a "reward model"
  3. Reinforcement learning optimizes the model toward "human preferences"

Constitutional AI is the approach taken by Anthropic (the company behind Claude, covered in depth in Section 5): a set of "constitutional principles" (e.g., "be honest," "don't be harmful") guides the model's self-improvement, reducing dependence on human annotation.

6.3 Prompt Engineering

Prompt Engineering is the core skill for interacting with LLMs. Key techniques:

  • System Prompt: Set the model's role and behavioral rules
  • Few-shot: Provide examples so the model understands your desired format and style
  • Chain of Thought (CoT): Ask the model to "think step by step," significantly improving reasoning accuracy
  • ReAct: Combine Reasoning and Acting, enabling the model to use tools

# CoT Example
Q: A room has 3 windows, each window has 2 panes of glass. How many panes total?

Without CoT: 6 (direct answer, sometimes wrong)

With CoT:
- The room has 3 windows
- Each window has 2 panes
- 3 × 2 = 6 panes
Answer: 6

6.4 RAG: Retrieval-Augmented Generation

RAG solves two LLM pain points:

  1. Knowledge freshness: Model training data has a cutoff date; RAG can plug in the latest information
  2. Hallucination: Models may fabricate facts; RAG grounds answers in retrieved real documents

User question → Retrieve relevant docs → Feed docs + question to model → Generate evidence-based answer

RAG is the most common approach for enterprise LLM deployment because it doesn't require fine-tuning — just maintain a good knowledge base.
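That pipeline fits in a short sketch. Retrieval here is a crude word-overlap score over an in-memory list; a real system would use an embedding model and a vector database, but the shape (retrieve → stuff into prompt → generate) is the same:

```python
import re

# Minimal RAG sketch. The documents and question are made up for illustration.
DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The office is open Monday to Friday, 9am to 6pm.",
    "Support can be reached at the help desk during business hours.",
]

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, docs, top_k=1):
    """Rank documents by word overlap with the question (stand-in for
    embedding similarity search in a real system)."""
    q = words(question)
    return sorted(docs, key=lambda d: -len(words(d) & q))[:top_k]

question = "What is the refund policy for returns?"
context = retrieve(question, DOCS)[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this grounded prompt is what the LLM actually receives
```

Because the answer is generated from retrieved text rather than parametric memory, updating the knowledge base updates the model's "knowledge" with no retraining.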

6.5 Function Calling & Tool Use

LLMs can only generate text natively, but through Function Calling, they can invoke external tools:

User: What's the weather in Beijing today?
Model: (decides to call weather API) → get_weather(city="Beijing")
System: (executes API, returns result) → Sunny, 25°C
Model: It's sunny in Beijing today, 25°C.

This transforms models from "can only talk" to "can take action" — a foundational capability for building AI applications.
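The application-side half of that exchange looks like the sketch below. The tool-call structure mimics what a model emits after deciding a tool is needed, and `get_weather` is a stub, not a real API:

```python
import json

def get_weather(city: str) -> str:
    """Stub tool: a real implementation would call a weather API."""
    return json.dumps({"city": city, "condition": "Sunny", "temp_c": 25})

TOOLS = {"get_weather": get_weather}

# What the model would emit after deciding a tool call is needed:
model_tool_call = {"name": "get_weather", "arguments": {"city": "Beijing"}}

# The application (not the model) executes the call, then feeds the result
# back to the model so it can compose the final natural-language answer.
result = TOOLS[model_tool_call["name"]](**model_tool_call["arguments"])
print(result)  # → {"city": "Beijing", "condition": "Sunny", "temp_c": 25}
```

The important boundary: the model only ever produces and consumes text (here, JSON); the surrounding code is what actually touches the outside world.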

6.6 Agents: From Conversation to Autonomous Execution

Agents are the advanced form of LLMs — no longer single Q&A, but autonomous planning and multi-step task execution:

User: Analyze this CSV file, find the top-selling product, and generate a chart

Agent execution:
1. Read the CSV file
2. Analyze data structure
3. Write analysis code
4. Execute code to generate chart
5. Summarize findings and present results

Claude Code is a textbook Agent — you give it a task, and it reads code, writes code, runs tests, and fixes errors on its own.
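Underneath, the Agent loop is surprisingly small: ask the model for the next action, execute it, append the observation, repeat until the model signals it is done. Everything below is schematic; `call_llm` stands in for a real model API and the scripted responses are fabricated for illustration:

```python
def run_agent(task, call_llm, tools, max_steps=10):
    """Plan → act → observe loop. `call_llm` must return either
    {"type": "tool", "tool": ..., "args": {...}} or
    {"type": "finish", "summary": ...} (an assumed toy protocol)."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_llm("\n".join(history))  # model chooses the next action
        if action["type"] == "finish":
            return action["summary"]
        observation = tools[action["tool"]](**action["args"])
        history.append(f"Action: {action}\nObservation: {observation}")
    return "step limit reached"

# Scripted fake model for illustration: read the file, then finish.
script = iter([
    {"type": "tool", "tool": "read_file", "args": {"path": "sales.csv"}},
    {"type": "finish", "summary": "Top product found."},
])
fake_tools = {"read_file": lambda path: f"(contents of {path})"}
print(run_agent("Analyze sales.csv", lambda p: next(script), fake_tools))
```

Real agents add error recovery, sandboxed execution, and far richer action formats, but this predict-execute-observe cycle is the common core.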

6.7 MCP: Model Context Protocol

MCP is an open protocol proposed by Anthropic to standardize how models connect to external tools and data sources. Think of it as "USB for AI":

  • Unified tool description format
  • Standardized invocation protocol
  • Any model can connect to any MCP server

Write one MCP server (say, connecting to a database), and every MCP-compatible model can use it.

6.8 Multimodal

Early LLMs only handled text. Today's mainstream models are going multimodal:

  • Text + Image: GPT-4o, Claude 4, Gemini (image understanding, image generation)
  • Text + Audio: GPT-4o native voice conversation
  • Text + Video: Gemini supports video understanding
  • Text + Code Execution: Claude Code, ChatGPT Code Interpreter

Multimodal isn't simply stitching multiple models together — it's unified understanding of different modalities within a single model.


7. Limitations and Challenges

LLMs are powerful but far from perfect. Understanding limitations helps you use them better.

7.1 Hallucination

Models will confidently fabricate nonexistent facts, fake citations, and made-up data. This isn't a bug — it's an inherent feature of generative models. They optimize for "looks plausible," not "factually correct."

Mitigation: RAG, asking models to cite sources, human review of critical information.

7.2 Context Window Limits

While context windows keep expanding (Claude is at 200K tokens), they're still finite. When processing very long documents, model attention to middle sections degrades (the "Lost in the Middle" problem).

7.3 Inference Cost

LLM inference requires substantial GPU resources. GPT-4 class models cost tens of dollars per million output tokens. This is why DeepSeek's cost-efficient approach has generated enormous interest.

7.4 Safety and Alignment

Ensuring models don't output harmful content and can't be maliciously exploited is an ongoing challenge. Alignment techniques (RLHF, Constitutional AI) are improving but far from solved.

7.5 Data Privacy

Using cloud-based LLMs means your data is sent to third-party servers. For sensitive data, consider local deployment (open-source models like LLaMA, Qwen) or APIs with data protection commitments.


8. A Systematic Learning Path

If you want to learn LLMs systematically, here's my recommended path:

Phase 1: Getting Started (1-2 weeks)

| Goal | Resources |
| --- | --- |
| Understand Transformers | 3Blue1Brown's "But what is a GPT?" video series |
| Intuitive understanding of Attention | Jay Alammar's "The Illustrated Transformer" |
| Make your first API call | OpenAI / Anthropic official docs — build a simple chatbot |
| Learn Prompt techniques | Anthropic's Prompt Engineering guide |

The goal here is building intuition. Don't get stuck on the math.

Phase 2: Intermediate (1-2 months)

| Goal | Resources |
| --- | --- |
| Deep Prompt Engineering | Learn CoT, Few-shot, ReAct patterns |
| Practice RAG | Build a RAG app with LangChain / LlamaIndex |
| Try Fine-tuning | Fine-tune an open-source model with LoRA (e.g., LLaMA 3 8B) |
| Understand Agents | Read AutoGPT / CrewAI source code, understand the Agent loop |

The focus here is hands-on work. Don't just watch tutorials — write code, run experiments.

Phase 3: Going Deep (Ongoing)

| Goal | Resources |
| --- | --- |
| Read core papers | Attention Is All You Need, GPT-3, InstructGPT, Constitutional AI |
| Train a small model | Andrej Karpathy's nanoGPT / minbpe |
| Agent development | Build real tools with Claude Code / OpenAI Agents SDK |
| Track the frontier | Follow arXiv, AI researchers on Twitter/X |

This phase has no endpoint. The LLM field has new developments every week. Staying curious and hands-on matters more than anything.


9. Conclusion

From rule-based systems in the 1950s to today's GPT-4 and Claude 4, NLP has traveled over 70 years. But the real paradigm shift happened after Transformer appeared in 2017 — in just a few years, LLMs went from academic experiments to tools that are changing how everyone works.

Understanding this history isn't for passing exams — it's for having a stable coordinate system in a rapidly changing field. When you know why Attention matters, what RLHF does, and what problem RAG solves, you can better judge which new technologies deserve attention and which are just noise.

The rest is up to you. Pick a direction, write your first line of code, run your first experiment. The world of LLMs is vast, but the entry point is right there in your terminal.


Further Reading