Stage 06

LLM Inference Optimization

Profile, optimize, and deploy LLM inference at scale — from KV cache to quantization to multi-GPU serving.

20 notebooks · 13h estimated
INF-00 · 30 min

Inference Basics

Prefill vs decode phases, throughput vs latency, and the inference performance landscape.

Prefill · Decode · Throughput · Latency
INF-01 · 35 min

Profiling Inference

Profile GPU utilization, memory bandwidth, and compute using PyTorch profiler.

Profiling · GPU Utilization · Memory Bandwidth · torch.profiler
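To make the workflow concrete, here is a minimal profiling sketch using torch.profiler; the toy layer, tensor shapes, and table sorting are placeholder choices, not the notebook's code.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

# Toy stand-in for a transformer block; any model is profiled the same way.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).eval()
x = torch.randn(1, 128, 512)  # (batch, seq_len, hidden)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    with record_function("inference_forward"):
        with torch.no_grad():
            model(x)

# Sort by device time to see which kernels dominate the step.
sort_key = "self_cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```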
INF-02 · 35 min

Prefill vs Decode

Deep dive into the two-phase inference process and how to optimize each phase independently.

Prefill Latency · Decode Throughput · TTFT · TBT
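A hedged sketch of how TTFT and TBT could be measured with a Hugging Face causal LM: one full-prompt forward pass for prefill, then a token-by-token decode loop. gpt2 and greedy decoding are stand-ins for whatever model the notebook actually uses.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

input_ids = tok("Explain KV caching in one sentence.", return_tensors="pt").input_ids

with torch.no_grad():
    t0 = time.perf_counter()
    out = model(input_ids, use_cache=True)            # prefill: whole prompt in one pass
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    ttft = time.perf_counter() - t0                   # time to first token

    tbts, past = [], out.past_key_values
    for _ in range(32):                               # decode: one token per step
        t1 = time.perf_counter()
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        tbts.append(time.perf_counter() - t1)

print(f"TTFT: {ttft*1000:.1f} ms, mean TBT: {1000*sum(tbts)/len(tbts):.2f} ms")
```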
INF-03 · 30 min

Baseline Benchmarking

Build a reproducible inference benchmark: tokens/sec, latency P50/P95/P99.

Benchmarking · Tokens/sec · Latency Percentiles · Reproducibility
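A sketch of the benchmarking skeleton under simplifying assumptions: a placeholder linear layer instead of an LLM, fixed seeds and a warmup phase for reproducibility, and percentiles from Python's statistics module.

```python
import statistics
import time
import torch

torch.manual_seed(0)  # fix seeds so runs are comparable

def run_once(model, x):
    with torch.no_grad():
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # include queued GPU work in the timing

model = torch.nn.Linear(4096, 4096).eval()   # stand-in workload
x = torch.randn(8, 4096)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()

for _ in range(5):                    # warmup, excluded from the statistics
    run_once(model, x)

latencies = []
for _ in range(100):
    t0 = time.perf_counter()
    run_once(model, x)
    latencies.append((time.perf_counter() - t0) * 1000)

qs = statistics.quantiles(latencies, n=100)
print(f"P50 {qs[49]:.2f} ms  P95 {qs[94]:.2f} ms  P99 {qs[98]:.2f} ms")
```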
INF-10 · 40 min

KV Cache Implementation

Implement a manual KV cache from scratch. Understand why it gives a 5-10x decode speedup.

KV Cache · Past Key Values · Cache Management · Memory
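A toy single-layer version of the idea, not the course's reference implementation: keys and values from earlier steps are cached, so each decode step only projects the newest token instead of re-running the whole sequence.

```python
import torch
import torch.nn.functional as F

class CachedSelfAttention(torch.nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, x, cache=None):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        if cache is not None:                        # append to previously cached K/V
            k = torch.cat([cache[0], k], dim=2)
            v = torch.cat([cache[1], v], dim=2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=cache is None)
        y = y.transpose(1, 2).reshape(b, t, -1)
        return self.out(y), (k, v)                   # return the updated cache

attn = CachedSelfAttention(dim=64, n_heads=4)
prompt = torch.randn(1, 10, 64)
_, kv = attn(prompt)                                 # "prefill": cache the prompt's K/V
step = torch.randn(1, 1, 64)
_, kv = attn(step, cache=kv)                         # "decode": only 1 new token projected
print(kv[0].shape)                                   # torch.Size([1, 4, 11, 16])
```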
INF-11 · 35 min

Mixed Precision Inference

FP16/BF16 inference: 2x memory reduction, minimal quality loss.

FP16 · BF16 · Half Precision · Memory Reduction
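A small sketch of the memory effect, with gpt2 as a stand-in model: loading weights in BF16 roughly halves parameter memory compared with the FP32 default. The size helper just sums parameter bytes.

```python
import torch
from transformers import AutoModelForCausalLM

def param_mb(model):
    # total bytes occupied by the weights, in MB
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

fp32_model = AutoModelForCausalLM.from_pretrained("gpt2")   # FP32 by default
print(f"FP32: {param_mb(fp32_model):.0f} MB")

bf16_model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)
print(f"BF16: {param_mb(bf16_model):.0f} MB")                # ~2x smaller
```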
INF-12 · 30 min

Static Batching

Batch multiple requests together for 2-8x throughput improvement.

Static Batching · Throughput · Padding · Batch Size
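A sketch of static batching with Hugging Face generate(), using gpt2 as a stand-in: prompts are left-padded to a common length and decoded in one batched call.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token          # gpt2 has no pad token; reuse EOS
tok.padding_side = "left"              # left-pad so generated tokens align on the right
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompts = ["The capital of France is", "KV caching speeds up", "Batching improves"]
batch = tok(prompts, return_tensors="pt", padding=True)   # one padded batch

with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=16, pad_token_id=tok.eos_token_id)

for text in tok.batch_decode(out, skip_special_tokens=True):
    print(text)
```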
INF-13 · 35 min

torch.compile

Use torch.compile for 1.2-1.5x additional speedup through kernel fusion and graph optimization.

torch.compile · Dynamo · Inductor · Graph Optimization
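A minimal usage sketch with a toy MLP; the gap between eager and compiled mode depends entirely on the model and hardware, so no particular speedup is implied by this snippet.

```python
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 2048)
).eval()
x = torch.randn(32, 2048)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()

compiled = torch.compile(model)        # Dynamo traces the graph, Inductor fuses kernels

def bench(fn, iters=50):
    with torch.no_grad():
        fn(x)                          # first call triggers compilation / warmup
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print(f"eager:    {bench(model)*1e3:.3f} ms")
print(f"compiled: {bench(compiled)*1e3:.3f} ms")
```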
INF-14 · 40 min

INT8 Quantization

Post-training INT8 quantization: 4x memory reduction vs FP32 with <1% quality loss.

INT8 · PTQ · Dynamic Quantization · LLM.int8()
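A sketch of PyTorch's built-in dynamic INT8 quantization on a toy MLP. This is the eager quantize_dynamic path, which runs on CPU; the LLM.int8() route through bitsandbytes is a separate API not shown here.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).eval()

# Replace Linear layers with dynamically quantized versions (INT8 weights,
# activations quantized on the fly per forward pass).
qmodel = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 1024)
print(qmodel(x).shape)   # works like the original model, on CPU

fp32_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
print(f"FP32 weights: {fp32_mb:.1f} MB")
# Quantized Linear modules keep weights in packed INT8 buffers rather than parameters,
# so compare saved checkpoint sizes (roughly 4x smaller) instead of summing parameters.
```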
INF-20 · 45 min

Flash Attention Explained

IO-aware FlashAttention: why it's fast, how it tiles computation, and when to use it.

FlashAttention · IO-Aware · Tiling · HBM Bandwidth
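A sketch of requesting the FlashAttention backend through PyTorch's scaled_dot_product_attention, assuming PyTorch 2.3+ (for torch.nn.attention.sdpa_kernel) and a CUDA GPU that supports the flash kernel in FP16.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq, head_dim) in FP16 on the GPU
q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):        # restrict dispatch to the flash kernel
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)   # (1, 8, 2048, 64); no (seq x seq) attention matrix is materialized in HBM
```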
INF-21 · 40 min

MQA & GQA

Multi-Query Attention and Grouped-Query Attention: reduce KV cache size while maintaining quality.

MQA · GQA · KV Heads · LLaMA-2
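A toy tensor-level sketch of grouped-query attention: fewer K/V heads than query heads, expanded with repeat_interleave so each group of query heads shares one K/V head. All dimensions here are arbitrary.

```python
import torch
import torch.nn.functional as F

b, seq, head_dim = 1, 128, 64
n_q_heads, n_kv_heads = 8, 2                     # MQA would be n_kv_heads = 1

q = torch.randn(b, n_q_heads, seq, head_dim)
k = torch.randn(b, n_kv_heads, seq, head_dim)    # KV cache is 4x smaller than full MHA here
v = torch.randn(b, n_kv_heads, seq, head_dim)

# Expand K/V so each group of 4 query heads shares one K/V head.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # (1, 8, 128, 64)
```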
INF-23 · 45 min

Long Context Inference

RoPE scaling, YaRN, and sliding window attention for inference beyond the training context length.

RoPE · YaRN · Sliding Window · Long Context
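A hedged sketch of the simplest of these ideas, linear RoPE position interpolation: positions are divided by a scale factor before the rotary angles are computed, so a longer sequence maps back into the trained position range. This is the position-interpolation trick only, not a full YaRN implementation.

```python
import torch

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    # Standard RoPE frequencies for pairs of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float() / scale      # scale > 1 compresses positions
    angles = torch.outer(pos, inv_freq)              # (seq_len, head_dim / 2)
    return angles.cos(), angles.sin()

# Run at 4x the trained context by interpolating positions with scale=4.
cos_a, sin_a = rope_angles(seq_len=8192, head_dim=128, scale=4.0)
print(cos_a.shape)   # (8192, 64); all angles stay within the trained position range
```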
INF-24 · 40 min

Memory Optimization

Compare FP32 vs FP16 vs BF16 vs INT8 vs INT4 memory and quality tradeoffs.

Memory Optimization · Quantization · Dtype · Model Size
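The core arithmetic is a few lines of Python: weight memory is parameter count times bytes per parameter. Real footprints also include the KV cache and activations, which this back-of-the-envelope count ignores.

```python
params = 7e9   # a 7B-parameter model
bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype:>9}: {params * nbytes / 1e9:.1f} GB")
# FP32: 28.0, FP16/BF16: 14.0, INT8: 7.0, INT4: 3.5 GB of weights
```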
INF-30 · 40 min

Continuous Batching

Orca-style iteration-level batching: insert and remove sequences without stopping the engine.

Continuous Batching · Orca · Request Scheduling · Serving
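A toy scheduler sketch of the iteration-level idea with a simulated model step (no real LLM): the running batch is rebuilt every iteration, so finished sequences leave and queued requests join without draining the whole batch.

```python
import random
from collections import deque

random.seed(0)
waiting = deque({"id": i, "remaining": random.randint(2, 6)} for i in range(8))
running, max_batch, step = [], 4, 0

while waiting or running:
    # Admit new requests whenever a slot is free (no waiting for the batch to finish).
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())

    # One decode iteration for every running sequence (stand-in for a model forward pass).
    for seq in running:
        seq["remaining"] -= 1

    finished = [s["id"] for s in running if s["remaining"] == 0]
    running = [s for s in running if s["remaining"] > 0]
    step += 1
    print(f"step {step}: batch={len(running) + len(finished)} finished={finished}")
```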
INF-31 · 45 min

PagedAttention

vLLM's PagedAttention: virtual memory paging for KV cache — near-zero waste.

PagedAttention · vLLM · Memory Pages · KV Blocks
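A toy block-allocator sketch of the idea, not vLLM's code: the KV cache is carved into fixed-size blocks, each sequence keeps a block table of physical block ids, and blocks return to the free pool when the sequence finishes.

```python
BLOCK_SIZE = 16                      # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}             # seq_id -> list of physical block ids

    def append_token(self, seq_id, token_index):
        table = self.tables.setdefault(seq_id, [])
        if token_index % BLOCK_SIZE == 0:            # current block is full (or first token)
            table.append(self.free.pop())            # allocate a new block lazily
        return table[-1], token_index % BLOCK_SIZE   # (block id, offset within block)

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))    # blocks go back to the pool

alloc = BlockAllocator(num_blocks=64)
for t in range(40):                  # a 40-token sequence uses ceil(40 / 16) = 3 blocks
    alloc.append_token("req-0", t)
print(alloc.tables["req-0"])         # non-contiguous physical blocks, near-zero waste
alloc.release("req-0")
```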
INF-33 · 35 min

Prefix Caching

Cache KV states for repeated system prompts. Eliminate redundant computation across requests.

Prefix Caching · System Prompt · Cache Hit Rate · Reuse
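A hedged request-level sketch with a Hugging Face model (gpt2 as a stand-in): the system prompt's past_key_values are computed once and reused for later requests. Production engines do this per KV block; the dictionary cache and the deepcopy here are illustrative only.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

system_prompt = "You are a terse assistant. Answer in one sentence.\n"
prefix_cache = {}

def get_prefix_kv(prompt):
    if prompt not in prefix_cache:                   # cache miss: prefill the prefix once
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, use_cache=True)
        prefix_cache[prompt] = out.past_key_values
    return prefix_cache[prompt]                      # cache hit on every later request

user_ids = tok("What is a KV cache?", return_tensors="pt").input_ids
# Recent transformers versions mutate the cache object in place during the forward pass,
# so copy it before reusing it across requests.
prefix_kv = copy.deepcopy(get_prefix_kv(system_prompt))
with torch.no_grad():
    out = model(user_ids, past_key_values=prefix_kv, use_cache=True)  # prefix not recomputed
print(out.logits.shape)
```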
INF-34 · 50 min

Multi-GPU Serving

Tensor parallelism and pipeline parallelism for serving models larger than a single GPU.

Tensor Parallelism · Pipeline Parallelism · Multi-GPU · Serving
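A toy sketch of column-parallel tensor parallelism, assuming two visible CUDA devices: one Linear layer's weight is split across GPUs, each device computes its slice of the output, and the slices are gathered naively on the CPU. A real engine shards every layer and uses NCCL collectives instead.

```python
import torch

torch.manual_seed(0)
in_dim, out_dim = 1024, 4096
full = torch.nn.Linear(in_dim, out_dim, bias=False)

# Shard the output dimension: each GPU holds half of the weight rows.
w0, w1 = full.weight.detach().chunk(2, dim=0)
w0, w1 = w0.to("cuda:0"), w1.to("cuda:1")

x = torch.randn(8, in_dim)
y0 = x.to("cuda:0") @ w0.T                     # each device computes its output slice
y1 = x.to("cuda:1") @ w1.T
y = torch.cat([y0.cpu(), y1.cpu()], dim=-1)    # "all-gather" done naively on the CPU

ref = full(x)
print(torch.allclose(y, ref, atol=1e-4))       # should print True: same result, split weights
```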
INF-41 · 50 min

GPTQ Quantization

GPTQ: layer-wise INT4 quantization with Hessian-based optimal rounding. <2% quality loss.

GPTQ · INT4 · Hessian · AutoGPTQ
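A hedged sketch of the quantization flow with the auto-gptq package; the calls follow that project's documented usage but may differ across versions, and the model name and calibration text are placeholders.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"                        # small stand-in model
tok = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# GPTQ needs a small calibration set to estimate per-layer Hessians for the rounding step.
examples = [tok("Quantization trades precision for memory.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")
```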
INF-42 · 50 min

AWQ Quantization

Activation-Aware Weight Quantization: scale weights by activation magnitude for better quality.

AWQ · Activation-Aware · Weight Scaling · AutoAWQ
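A hedged sketch of the flow with the AutoAWQ package; the function names and quant_config keys follow that project's documented usage but may differ by version, and the model and output paths are placeholders.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"                      # small stand-in model
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path)

quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model.quantize(tok, quant_config=quant_config)        # scales weights using activation statistics
model.save_quantized("opt-125m-awq")
tok.save_pretrained("opt-125m-awq")
```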
INF-43 · 45 min

GGUF & CPU Inference

GGUF format and llama.cpp for CPU inference. Run 7B models on a MacBook.

GGUF · llama.cpp · CPU Inference · Q4_0
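A hedged sketch of running a Q4_0 GGUF model on CPU through the llama-cpp-python bindings; the model path is a placeholder for whatever GGUF file you have downloaded.

```python
from llama_cpp import Llama

# Placeholder path to a Q4_0-quantized GGUF checkpoint on disk.
llm = Llama(model_path="./mistral-7b-q4_0.gguf", n_ctx=2048, n_threads=8)

out = llm("Q: What does Q4_0 mean in GGUF? A:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```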
← Previous: Custom Kernels & Production · Next: 2024–2025 Techniques →