Attention Mechanisms in Transformers: MHA vs MQA vs GQA

This guide explores the core attention variants in modern transformers, focusing on the mechanisms themselves: Multi-Head Attention (MHA), Multi-Query Attention (MQA), and Grouped-Query Attention (GQA). We’ll look at why each exists and how they differ architecturally.

Quick Overview

- Self-Attention: Each token looks at other tokens to build contextualized representations.
- Multi-Head Attention (MHA): Multiple independent attention “heads” run in parallel; each head has its own Q, K, V projections.
- Multi-Query Attention (MQA): All query heads share a single Key/Value projection; this reduces parameters significantly.
- Grouped-Query Attention (GQA): Groups of query heads share K/V projections, balancing expressiveness and efficiency.

1. Self-Attention Foundations

Core Intuition

Self-attention is a content-based lookup over the sequence: ...
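For a concrete feel of how the three variants relate, here is a minimal PyTorch-style sketch (not taken from the post): the only structural knob is the number of K/V heads, with `n_kv_heads` equal to the number of query heads for MHA, 1 for MQA, and something in between for GQA. The function name, shapes, and the per-call projection layers are illustrative assumptions, and masking is omitted for brevity.

```python
# Illustrative sketch only: MHA, MQA, and GQA differ in how many K/V heads
# back the query heads (n_kv_heads == n_heads -> MHA, 1 -> MQA, between -> GQA).
import torch
import torch.nn.functional as F

def attention(x, n_heads, n_kv_heads, d_head=64):
    B, T, D = x.shape
    # Fresh projections per call, purely for illustration (a real module would own these).
    q_proj = torch.nn.Linear(D, n_heads * d_head)
    kv_proj = torch.nn.Linear(D, 2 * n_kv_heads * d_head)

    q = q_proj(x).view(B, T, n_heads, d_head).transpose(1, 2)         # (B, H, T, d)
    k, v = kv_proj(x).view(B, T, 2, n_kv_heads, d_head).unbind(dim=2)
    k, v = k.transpose(1, 2), v.transpose(1, 2)                       # (B, H_kv, T, d)

    # Each group of query heads re-uses the same K/V head.
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)                             # (B, H, T, d)
    v = v.repeat_interleave(group, dim=1)

    att = (q @ k.transpose(-2, -1)) / d_head**0.5                     # (B, H, T, T)
    out = F.softmax(att, dim=-1) @ v                                  # (B, H, T, d)
    return out.transpose(1, 2).reshape(B, T, n_heads * d_head)

x = torch.randn(2, 5, 512)
print(attention(x, n_heads=8, n_kv_heads=8).shape)  # MHA
print(attention(x, n_heads=8, n_kv_heads=1).shape)  # MQA
print(attention(x, n_heads=8, n_kv_heads=2).shape)  # GQA
```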

September 20, 2025 · 8 min · 1589 words

Understanding Tokenization: From Text to Integers

Introduction

Language models are mathematical functions; they operate on numbers, not raw text. Tokenization is the crucial first step in converting human-readable text into a sequence of integers (tokens) that a model can process. These tokens are then mapped to embedding vectors.

1. Naive Approaches and Their Flaws

Word-Level Tokenization

The most intuitive approach: split text by spaces and punctuation.

Problems:

- Vocabulary Explosion: A language like English has hundreds of thousands of words. The model’s vocabulary would be enormous, making the final embedding and output layers computationally massive.
- Out-of-Vocabulary (OOV) Words: If the model encounters a word not seen during training (e.g., a new slang term, a typo, or a technical name), it has no token for it. It typically maps it to an <UNK> (unknown) token, losing all semantic meaning.
- Poor Generalization: The model treats eat, eating, and eaten as three completely separate, unrelated tokens. It fails to capture the shared root eat, making it harder to learn morphological relationships.

Character-Level Tokenization

The opposite extreme: split text into individual characters. ...
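As a rough illustration of the OOV problem described above (not code from the post), here is a toy word-level tokenizer; the tiny corpus, the vocabulary construction, and the <UNK> handling are all assumptions made for the sake of the example.

```python
# Toy word-level tokenizer (illustrative assumption, not the post's code).
corpus = "the cat ate the fish . the cat was eating".split()
vocab = {word: idx for idx, word in enumerate(sorted(set(corpus)), start=1)}
vocab["<UNK>"] = 0  # single catch-all id for anything unseen during "training"

def encode(text):
    # Any word outside the vocabulary collapses to the same <UNK> id,
    # so its meaning is lost entirely.
    return [vocab.get(word, vocab["<UNK>"]) for word in text.split()]

print(encode("the cat ate"))     # known words -> distinct ids
print(encode("the cats eaten"))  # "cats" and "eaten" -> 0 (<UNK>)
```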

September 20, 2025 · 6 min · 1220 words

Reinforcement Learning Foundations: From MDPs to Deep Q-Learning

Introduction

Reinforcement Learning (RL) has exploded in popularity — first in game-playing agents, and now in large language models via methods like RLHF (Reinforcement Learning from Human Feedback). These approaches don’t just help models learn context better; they also improve reasoning by teaching them to “think in steps.”

My fascination with RL began when the GPT-3 paper was published and ChatGPT emerged as the so-called “tool of the decade.” I wanted to go beyond using these models — I wanted to understand how they work under the hood. That meant building RL concepts from the ground up: deriving equations, implementing toy solutions in environments like CartPole and FrozenLake, and seeing theory come alive in code. ...

August 13, 2025 · 10 min · 1983 words