GPT OSS Style Architectures

This document is my personal deep-dive into how modern open-source GPT-style language models are actually built. It’s not meant to be a full survey of the literature; instead, it captures the pieces that helped me build an intuition for these models: how text becomes vectors, how positional information is injected, how attention is optimized, and how FFN variants like MoE increase capacity without blowing up compute.

1. Tokenization & Embeddings

The journey from human-readable text to a format the model understands begins here. ...
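
As a minimal sketch of that first step, token IDs to embedding vectors, here is a toy PyTorch example (the vocabulary size, model dimension, and IDs are illustrative assumptions, not values from the post):

```python
import torch
import torch.nn as nn

# Illustrative sizes only; real GPT-style models use a learned BPE vocabulary
# of tens of thousands of tokens and a model dimension in the hundreds or thousands.
vocab_size, d_model = 50_257, 768

embedding = nn.Embedding(vocab_size, d_model)  # one learned vector per token ID

# Pretend a tokenizer already mapped a short prompt to these IDs (made-up values).
token_ids = torch.tensor([[15496, 995]])       # shape: (batch=1, seq_len=2)

x = embedding(token_ids)                       # shape: (1, 2, d_model)
print(x.shape)                                 # torch.Size([1, 2, 768])
```

From here, positional information and the attention/FFN blocks operate on these vectors.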

November 29, 2025 · 16 min · 3309 words

Attention Mechanisms in Transformers: MHA vs MQA vs GQA

This guide explores the core attention variants in modern transformers, focusing on the mechanisms themselves: Multi-Head Attention (MHA), Multi-Query Attention (MQA), and Grouped-Query Attention (GQA). We’ll look at why each exists and how the variants differ architecturally.

Quick Overview

- Self-Attention: Each token looks at other tokens to build contextualized representations
- Multi-Head Attention (MHA): Multiple independent attention “heads” run in parallel; each head has its own Q, K, V projections
- Multi-Query Attention (MQA): A single Key/Value projection is shared across all query heads, which significantly reduces K/V parameters and KV-cache size
- Grouped-Query Attention (GQA): Groups of query heads share K/V projections, balancing expressiveness and efficiency

1. Self-Attention Foundations

Core Intuition

Self-attention is a content-based lookup over the sequence: ...
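
To make the K/V sharing concrete, here is a minimal PyTorch sketch (my own illustration, not the guide's code; the function name `grouped_attention`, the shapes, and the head counts are assumptions). Setting `n_kv_heads == n_heads` gives MHA, `n_kv_heads == 1` gives MQA, and anything in between gives GQA:

```python
import torch
import torch.nn.functional as F

def grouped_attention(q, k, v, n_heads, n_kv_heads):
    """K/V-sharing sketch: n_kv_heads == n_heads -> MHA, 1 -> MQA, in between -> GQA.

    q: (batch, seq, n_heads * head_dim)
    k, v: (batch, seq, n_kv_heads * head_dim)
    """
    b, t, _ = q.shape
    d = q.shape[-1] // n_heads                            # head dimension

    q = q.view(b, t, n_heads, d).transpose(1, 2)          # (b, n_heads, t, d)
    k = k.view(b, t, n_kv_heads, d).transpose(1, 2)       # (b, n_kv_heads, t, d)
    v = v.view(b, t, n_kv_heads, d).transpose(1, 2)

    # Each group of query heads reuses the same K/V head.
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)                 # (b, n_heads, t, d)
    v = v.repeat_interleave(group, dim=1)

    scores = (q @ k.transpose(-2, -1)) / d ** 0.5         # (b, n_heads, t, t)
    out = F.softmax(scores, dim=-1) @ v                   # (b, n_heads, t, d)
    return out.transpose(1, 2).reshape(b, t, n_heads * d)

# Toy shapes: 8 query heads of dim 16, 2 K/V heads -> GQA with groups of 4 query heads.
b, t, n_heads, d = 1, 4, 8, 16
q = torch.randn(b, t, n_heads * d)
k = torch.randn(b, t, 2 * d)
v = torch.randn(b, t, 2 * d)
print(grouped_attention(q, k, v, n_heads=8, n_kv_heads=2).shape)  # torch.Size([1, 4, 128])
```

The only thing that changes across the three variants is how many distinct K/V heads exist; broadcasting them back to the query heads (here via `repeat_interleave`) is what trades expressiveness against KV-cache size.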

September 20, 2025 · 8 min · 1589 words