Stage 06

LLM Inference Optimization

Profile, optimize, and deploy LLM inference at scale — from KV cache to quantization to multi-GPU serving.

20 notebooks · 13h estimated
INF-00 · 30 min

Inference Basics

Prefill vs decode phases, throughput vs latency, and the inference performance landscape.

Prefill · Decode · Throughput · Latency
INF-01 · 35 min

Profiling Inference

Profile GPU utilization, memory bandwidth, and compute using PyTorch profiler.

Profiling · GPU Utilization · Memory Bandwidth · torch.profiler
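To make the workflow concrete, here is a minimal profiling sketch using torch.profiler; the toy layer, tensor shapes, and table sorting are placeholder choices, not the notebook's code.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

# Toy stand-in for a transformer block; any model is profiled the same way.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).eval()
x = torch.randn(1, 128, 512)  # (batch, seq_len, hidden)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    with record_function("inference_forward"):
        with torch.no_grad():
            model(x)

# Sort by device time to see which kernels dominate the step.
sort_key = "self_cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```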
INF-02 · 35 min

Prefill vs Decode

Deep dive into the two-phase inference process and how to optimize each phase independently.

Prefill Latency · Decode Throughput · TTFT · TBT
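A hedged sketch of how TTFT and TBT could be measured with a Hugging Face causal LM: one full-prompt forward pass for prefill, then a token-by-token decode loop. gpt2 and greedy decoding are stand-ins for whatever model the notebook actually uses.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

input_ids = tok("Explain KV caching in one sentence.", return_tensors="pt").input_ids

with torch.no_grad():
    t0 = time.perf_counter()
    out = model(input_ids, use_cache=True)            # prefill: whole prompt in one pass
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    ttft = time.perf_counter() - t0                   # time to first token

    tbts, past = [], out.past_key_values
    for _ in range(32):                               # decode: one token per step
        t1 = time.perf_counter()
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        tbts.append(time.perf_counter() - t1)

print(f"TTFT: {ttft*1000:.1f} ms, mean TBT: {1000*sum(tbts)/len(tbts):.2f} ms")
```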
INF-03 · 30 min

Baseline Benchmarking

Build a reproducible inference benchmark: tokens/sec, latency P50/P95/P99.

Benchmarking · Tokens/sec · Latency Percentiles · Reproducibility
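A sketch of the benchmarking skeleton under simplifying assumptions: a placeholder linear layer instead of an LLM, fixed seeds and a warmup phase for reproducibility, and percentiles from Python's statistics module.

```python
import statistics
import time
import torch

torch.manual_seed(0)  # fix seeds so runs are comparable

def run_once(model, x):
    with torch.no_grad():
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # include queued GPU work in the timing

model = torch.nn.Linear(4096, 4096).eval()   # stand-in workload
x = torch.randn(8, 4096)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()

for _ in range(5):                    # warmup, excluded from the statistics
    run_once(model, x)

latencies = []
for _ in range(100):
    t0 = time.perf_counter()
    run_once(model, x)
    latencies.append((time.perf_counter() - t0) * 1000)

qs = statistics.quantiles(latencies, n=100)
print(f"P50 {qs[49]:.2f} ms  P95 {qs[94]:.2f} ms  P99 {qs[98]:.2f} ms")
```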
INF-10 · 40 min

KV Cache Implementation

Implement a manual KV cache from scratch. Understand why it gives a 5-10x decode speedup.

KV Cache · Past Key Values · Cache Management · Memory
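A toy single-layer version of the idea, not the course's reference implementation: keys and values from earlier steps are cached, so each decode step only projects the newest token instead of re-running the whole sequence.

```python
import torch
import torch.nn.functional as F

class CachedSelfAttention(torch.nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, x, cache=None):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        if cache is not None:                        # append to previously cached K/V
            k = torch.cat([cache[0], k], dim=2)
            v = torch.cat([cache[1], v], dim=2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=cache is None)
        y = y.transpose(1, 2).reshape(b, t, -1)
        return self.out(y), (k, v)                   # return the updated cache

attn = CachedSelfAttention(dim=64, n_heads=4)
prompt = torch.randn(1, 10, 64)
_, kv = attn(prompt)                                 # "prefill": cache the prompt's K/V
step = torch.randn(1, 1, 64)
_, kv = attn(step, cache=kv)                         # "decode": only 1 new token projected
print(kv[0].shape)                                   # torch.Size([1, 4, 11, 16])
```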
INF-11 · 35 min

Mixed Precision Inference

FP16/BF16 inference: 2x memory reduction, minimal quality loss.

FP16 · BF16 · Half Precision · Memory Reduction
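A small sketch of the memory effect, with gpt2 as a stand-in model: loading weights in BF16 roughly halves parameter memory compared with the FP32 default. The size helper just sums parameter bytes.

```python
import torch
from transformers import AutoModelForCausalLM

def param_mb(model):
    # total bytes occupied by the weights, in MB
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

fp32_model = AutoModelForCausalLM.from_pretrained("gpt2")   # FP32 by default
print(f"FP32: {param_mb(fp32_model):.0f} MB")

bf16_model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)
print(f"BF16: {param_mb(bf16_model):.0f} MB")                # ~2x smaller
```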
INF-12 · 30 min

Static Batching

Batch multiple requests together for 2-8x throughput improvement.

Static Batching · Throughput · Padding · Batch Size
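A sketch of static batching with Hugging Face generate(), using gpt2 as a stand-in: prompts are left-padded to a common length and decoded in one batched call.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token          # gpt2 has no pad token; reuse EOS
tok.padding_side = "left"              # left-pad so generated tokens align on the right
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompts = ["The capital of France is", "KV caching speeds up", "Batching improves"]
batch = tok(prompts, return_tensors="pt", padding=True)   # one padded batch

with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=16, pad_token_id=tok.eos_token_id)

for text in tok.batch_decode(out, skip_special_tokens=True):
    print(text)
```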
INF-13 · 35 min

torch.compile

Use torch.compile for 1.2-1.5x additional speedup through kernel fusion and graph optimization.

torch.compile · Dynamo · Inductor · Graph Optimization
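A minimal usage sketch with a toy MLP; the gap between eager and compiled mode depends entirely on the model and hardware, so no particular speedup is implied by this snippet.

```python
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 2048)
).eval()
x = torch.randn(32, 2048)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()

compiled = torch.compile(model)        # Dynamo traces the graph, Inductor fuses kernels

def bench(fn, iters=50):
    with torch.no_grad():
        fn(x)                          # first call triggers compilation / warmup
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print(f"eager:    {bench(model)*1e3:.3f} ms")
print(f"compiled: {bench(compiled)*1e3:.3f} ms")
```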
INF-14 · 40 min

INT8 Quantization

Post-training INT8 quantization: 4x memory reduction vs FP32 with <1% quality loss.

INT8 · PTQ · Dynamic Quantization · LLM.int8()
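A sketch of PyTorch's built-in dynamic INT8 quantization on a toy MLP. This is the eager quantize_dynamic path, which runs on CPU; the LLM.int8() route through bitsandbytes is a separate API not shown here.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).eval()

# Replace Linear layers with dynamically quantized versions (INT8 weights,
# activations quantized on the fly per forward pass).
qmodel = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 1024)
print(qmodel(x).shape)   # works like the original model, on CPU

fp32_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
print(f"FP32 weights: {fp32_mb:.1f} MB")
# Quantized Linear modules keep weights in packed INT8 buffers rather than parameters,
# so compare saved checkpoint sizes (roughly 4x smaller) instead of summing parameters.
```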
INF-20 · 45 min

Flash Attention Explained

IO-aware FlashAttention: why it's fast, how it tiles computation, and when to use it.

FlashAttention · IO-Aware · Tiling · HBM Bandwidth
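A sketch of requesting the FlashAttention backend through PyTorch's scaled_dot_product_attention, assuming PyTorch 2.3+ (for torch.nn.attention.sdpa_kernel) and a CUDA GPU that supports the flash kernel in FP16.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq, head_dim) in FP16 on the GPU
q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):        # restrict dispatch to the flash kernel
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)   # (1, 8, 2048, 64); no (seq x seq) attention matrix is materialized in HBM
```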
INF-21 · 40 min

MQA & GQA

Multi-Query Attention and Grouped-Query Attention: reduce KV cache size while maintaining quality.

MQA · GQA · KV Heads · LLaMA-2
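A toy tensor-level sketch of grouped-query attention: fewer K/V heads than query heads, expanded with repeat_interleave so each group of query heads shares one K/V head. All dimensions here are arbitrary.

```python
import torch
import torch.nn.functional as F

b, seq, head_dim = 1, 128, 64
n_q_heads, n_kv_heads = 8, 2                     # MQA would be n_kv_heads = 1

q = torch.randn(b, n_q_heads, seq, head_dim)
k = torch.randn(b, n_kv_heads, seq, head_dim)    # KV cache is 4x smaller than full MHA here
v = torch.randn(b, n_kv_heads, seq, head_dim)

# Expand K/V so each group of 4 query heads shares one K/V head.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # (1, 8, 128, 64)
```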
INF-23 · 45 min

Long Context Inference

RoPE scaling, YaRN, and sliding window attention for inference beyond the training context length.

RoPE · YaRN · Sliding Window · Long Context
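A hedged sketch of the simplest of these ideas, linear RoPE position interpolation: positions are divided by a scale factor before the rotary angles are computed, so a longer sequence maps back into the trained position range. This is the position-interpolation trick only, not a full YaRN implementation.

```python
import torch

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    # Standard RoPE frequencies for pairs of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float() / scale      # scale > 1 compresses positions
    angles = torch.outer(pos, inv_freq)              # (seq_len, head_dim / 2)
    return angles.cos(), angles.sin()

# Run at 4x the trained context by interpolating positions with scale=4.
cos_a, sin_a = rope_angles(seq_len=8192, head_dim=128, scale=4.0)
print(cos_a.shape)   # (8192, 64); all angles stay within the trained position range
```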
INF-24 · 40 min

Memory Optimization

Compare FP32 vs FP16 vs BF16 vs INT8 vs INT4 memory and quality tradeoffs.

Memory Optimization · Quantization · Dtype · Model Size
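The core arithmetic is a few lines of Python: weight memory is parameter count times bytes per parameter. Real footprints also include the KV cache and activations, which this back-of-the-envelope count ignores.

```python
params = 7e9   # a 7B-parameter model
bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype:>9}: {params * nbytes / 1e9:.1f} GB")
# FP32: 28.0, FP16/BF16: 14.0, INT8: 7.0, INT4: 3.5 GB of weights
```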
INF-30 · 40 min

Continuous Batching

Orca-style iteration-level batching: insert and remove sequences without stopping the engine.

Continuous Batching · Orca · Request Scheduling · Serving
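A toy scheduler sketch of the iteration-level idea with a simulated model step (no real LLM): the running batch is rebuilt every iteration, so finished sequences leave and queued requests join without draining the whole batch.

```python
import random
from collections import deque

random.seed(0)
waiting = deque({"id": i, "remaining": random.randint(2, 6)} for i in range(8))
running, max_batch, step = [], 4, 0

while waiting or running:
    # Admit new requests whenever a slot is free (no waiting for the batch to finish).
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())

    # One decode iteration for every running sequence (stand-in for a model forward pass).
    for seq in running:
        seq["remaining"] -= 1

    finished = [s["id"] for s in running if s["remaining"] == 0]
    running = [s for s in running if s["remaining"] > 0]
    step += 1
    print(f"step {step}: batch={len(running) + len(finished)} finished={finished}")
```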
INF-31 · 45 min

PagedAttention

vLLM's PagedAttention: virtual memory paging for KV cache — near-zero waste.

PagedAttention · vLLM · Memory Pages · KV Blocks
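A toy block-allocator sketch of the idea, not vLLM's code: the KV cache is carved into fixed-size blocks, each sequence keeps a block table of physical block ids, and blocks return to the free pool when the sequence finishes.

```python
BLOCK_SIZE = 16                      # tokens per KV block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}             # seq_id -> list of physical block ids

    def append_token(self, seq_id, token_index):
        table = self.tables.setdefault(seq_id, [])
        if token_index % BLOCK_SIZE == 0:            # current block is full (or first token)
            table.append(self.free.pop())            # allocate a new block lazily
        return table[-1], token_index % BLOCK_SIZE   # (block id, offset within block)

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))    # blocks go back to the pool

alloc = BlockAllocator(num_blocks=64)
for t in range(40):                  # a 40-token sequence uses ceil(40 / 16) = 3 blocks
    alloc.append_token("req-0", t)
print(alloc.tables["req-0"])         # non-contiguous physical blocks, near-zero waste
alloc.release("req-0")
```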
INF-33 · 35 min

Prefix Caching

Cache KV states for repeated system prompts. Eliminate redundant computation across requests.

Prefix Caching · System Prompt · Cache Hit Rate · Reuse
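A hedged request-level sketch with a Hugging Face model (gpt2 as a stand-in): the system prompt's past_key_values are computed once and reused for later requests. Production engines do this per KV block; the dictionary cache and the deepcopy here are illustrative only.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

system_prompt = "You are a terse assistant. Answer in one sentence.\n"
prefix_cache = {}

def get_prefix_kv(prompt):
    if prompt not in prefix_cache:                   # cache miss: prefill the prefix once
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, use_cache=True)
        prefix_cache[prompt] = out.past_key_values
    return prefix_cache[prompt]                      # cache hit on every later request

user_ids = tok("What is a KV cache?", return_tensors="pt").input_ids
# Recent transformers versions mutate the cache object in place during the forward pass,
# so copy it before reusing it across requests.
prefix_kv = copy.deepcopy(get_prefix_kv(system_prompt))
with torch.no_grad():
    out = model(user_ids, past_key_values=prefix_kv, use_cache=True)  # prefix not recomputed
print(out.logits.shape)
```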
INF-34 · 50 min

Multi-GPU Serving

Tensor parallelism and pipeline parallelism for serving models larger than a single GPU.

Tensor Parallelism · Pipeline Parallelism · Multi-GPU · Serving
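A toy sketch of column-parallel tensor parallelism, assuming two visible CUDA devices: one Linear layer's weight is split across GPUs, each device computes its slice of the output, and the slices are gathered naively on the CPU. A real engine shards every layer and uses NCCL collectives instead.

```python
import torch

torch.manual_seed(0)
in_dim, out_dim = 1024, 4096
full = torch.nn.Linear(in_dim, out_dim, bias=False)

# Shard the output dimension: each GPU holds half of the weight rows.
w0, w1 = full.weight.detach().chunk(2, dim=0)
w0, w1 = w0.to("cuda:0"), w1.to("cuda:1")

x = torch.randn(8, in_dim)
y0 = x.to("cuda:0") @ w0.T                     # each device computes its output slice
y1 = x.to("cuda:1") @ w1.T
y = torch.cat([y0.cpu(), y1.cpu()], dim=-1)    # "all-gather" done naively on the CPU

ref = full(x)
print(torch.allclose(y, ref, atol=1e-4))       # should print True: same result, split weights
```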
INF-41 · 50 min

GPTQ Quantization

GPTQ: layer-wise INT4 quantization with Hessian-based optimal rounding. <2% quality loss.

GPTQ · INT4 · Hessian · AutoGPTQ
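A hedged sketch of the quantization flow with the auto-gptq package; the calls follow that project's documented usage but may differ across versions, and the model name and calibration text are placeholders.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"                        # small stand-in model
tok = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# GPTQ needs a small calibration set to estimate per-layer Hessians for the rounding step.
examples = [tok("Quantization trades precision for memory.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")
```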
INF-42 · 50 min

AWQ Quantization

Activation-Aware Weight Quantization: scale weights by activation magnitude for better quality.

AWQ · Activation-Aware · Weight Scaling · AutoAWQ
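A hedged sketch of the flow with the AutoAWQ package; the function names and quant_config keys follow that project's documented usage but may differ by version, and the model and output paths are placeholders.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"                      # small stand-in model
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path)

quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model.quantize(tok, quant_config=quant_config)        # scales weights using activation statistics
model.save_quantized("opt-125m-awq")
tok.save_pretrained("opt-125m-awq")
```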
INF-43 · 45 min

GGUF & CPU Inference

GGUF format and llama.cpp for CPU inference. Run 7B models on a MacBook.

GGUF · llama.cpp · CPU Inference · Q4_0
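A hedged sketch of running a Q4_0 GGUF model on CPU through the llama-cpp-python bindings; the model path is a placeholder for whatever GGUF file you have downloaded.

```python
from llama_cpp import Llama

# Placeholder path to a Q4_0-quantized GGUF checkpoint on disk.
llm = Llama(model_path="./mistral-7b-q4_0.gguf", n_ctx=2048, n_threads=8)

out = llm("Q: What does Q4_0 mean in GGUF? A:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```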
← Previous: Custom Kernels & Production · Next: 2024–2025 Techniques →