Stage 05

Custom Kernels & Production

Write CUDA/Triton kernels, master quantization, implement speculative decoding, and deploy with vLLM.

10 notebooks · 9h estimated

Notebook 50 · 60 min

CUDA Basics

Write CUDA kernels and launch them from Python with PyTorch. Understand warps, blocks, shared memory, and roofline analysis.

CUDA · Warps · Shared Memory · Roofline Model
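
A minimal sketch of the PyTorch route into CUDA: an illustrative element-wise kernel compiled and bound at runtime with torch.utils.cpp_extension.load_inline (assumes a local CUDA toolkit). One thread handles one element; 256 threads per block is 8 warps of 32.

```python
import torch
from torch.utils.cpp_extension import load_inline

cuda_src = r"""
__global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) out[i] = a[i] + b[i];                // guard the ragged tail
}

torch::Tensor add(torch::Tensor a, torch::Tensor b) {
    auto out = torch::empty_like(a);
    int n = a.numel();
    int threads = 256;                         // 8 warps per block
    int blocks = (n + threads - 1) / threads;  // ceil-divide the grid
    add_kernel<<<blocks, threads>>>(
        a.data_ptr<float>(), b.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

ext = load_inline(
    name="vec_add",
    cpp_sources="torch::Tensor add(torch::Tensor a, torch::Tensor b);",
    cuda_sources=cuda_src,
    functions=["add"],  # auto-generates the Python binding
)

a = torch.randn(1_000_000, device="cuda")
b = torch.randn(1_000_000, device="cuda")
assert torch.allclose(ext.add(a, b), a + b)
```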
Notebook 51 · 60 min

Triton Kernels

Write GPU kernels in pure Python with OpenAI Triton. Fuse operations for speed.

Triton · Kernel Fusion · GPU Programming · torch.compile
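
The same element-wise add, this time as a Triton kernel in pure Python, in the style of the Triton tutorials. Each program instance processes one block of elements; the mask guards the final partial block.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # this program's element indices
    mask = offs < n                           # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

x = torch.randn(1_000_000, device="cuda")
y = torch.randn(1_000_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```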
Notebook 52 · 55 min

Custom Attention Kernel

Implement an optimized causal attention kernel using FlashAttention's memory-tiling pattern.

Attention Kernel · Causal Mask · SDPA · Memory Tiling
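
A sketch of the core trick in plain PyTorch rather than a fused kernel: an online softmax over key/value blocks, keeping a running row max and denominator so the full seq × seq attention matrix never materializes. Shapes and block size are illustrative.

```python
import math
import torch

def tiled_causal_attention(q, k, v, block=128):
    # q, k, v: (seq, head_dim); single head for clarity
    S, D = q.shape
    scale = 1.0 / math.sqrt(D)
    out = torch.zeros_like(q)
    m = torch.full((S, 1), float("-inf"), device=q.device)  # running row max
    l = torch.zeros((S, 1), device=q.device)                # running softmax denominator
    pos_q = torch.arange(S, device=q.device)[:, None]
    for start in range(0, S, block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale
        pos_k = torch.arange(start, start + kb.shape[0], device=q.device)[None, :]
        scores = scores.masked_fill(pos_k > pos_q, float("-inf"))  # causal mask
        m_new = torch.maximum(m, scores.amax(dim=-1, keepdim=True))
        alpha = torch.exp(m - m_new)       # rescale prior partials to the new max
        p = torch.exp(scores - m_new)
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        out = out * alpha + p @ vb
        m = m_new
    return out / l

q, k, v = (torch.randn(512, 64) for _ in range(3))
ref = torch.nn.functional.scaled_dot_product_attention(
    q[None, None], k[None, None], v[None, None], is_causal=True)[0, 0]
assert torch.allclose(tiled_causal_attention(q, k, v), ref, atol=1e-4)
```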
Notebook 53 · 40 min

Fused Operations

Fuse LayerNorm + Linear and GELU + Linear into single kernels, eliminating memory round-trips.

Kernel Fusion · LayerNorm Fusion · Memory Bandwidth · Operator Fusion
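
One way to get fusion without hand-writing a kernel, shown as a sketch: torch.compile's Inductor backend fuses the memory-bound pointwise tail (bias add + GELU) into the surrounding generated code instead of launching it as separate round-trips through HBM.

```python
import torch

def linear_gelu(x, w, b):
    y = x @ w.T + b                       # matmul with a pointwise epilogue
    return torch.nn.functional.gelu(y)    # fusion candidate: bias + GELU

fused = torch.compile(linear_gelu)        # Inductor fuses the pointwise chain

x = torch.randn(4096, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, device="cuda")
assert torch.allclose(fused(x, w, b), linear_gelu(x, w, b), atol=1e-4)
```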
Notebook 54 · 55 min

Quantization Methods

GPTQ, AWQ, and GGUF quantization compared. INT4/INT8 weight compression with quality analysis.

GPTQ · AWQ · GGUF · INT4
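
A minimal sketch of the building block these methods refine: symmetric per-channel INT8 weight quantization, with the round-trip error that the quality analysis measures.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # per-output-channel scale so each row maps into [-127, 127]
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
err = (w - dequantize(q, scale)).abs().mean()  # mean quantization error
print(f"mean abs error: {err:.5f}")            # 4x smaller weights vs fp32
```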
Notebook 55 · 45 min

KV Cache Optimization

PagedAttention, MQA, GQA — strategies to reduce KV cache memory and increase throughput.

KV Cache · PagedAttention · MQA · GQA
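
The arithmetic behind GQA's savings, as a sketch with illustrative Llama-like dimensions: cache size scales with the number of KV heads, so sharing 32 query heads across 8 KV heads cuts the cache 4x.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values; dtype_bytes=2 assumes fp16/bf16
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=8)
gqa = kv_cache_bytes(layers=32, kv_heads=8,  head_dim=128, seq_len=4096, batch=8)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")  # 16.0 vs 4.0
```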
Notebook 56 · 50 min

Speculative Decoding

Use a small draft model to propose tokens, then verify them with the large model in a single forward pass, for a 2-3x generation speedup.

Speculative Decoding · Draft Model · Token Verification · Throughput
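
A simplified sketch of one decode step, assuming Hugging Face-style models that return .logits and no KV cache (for clarity). The published method accepts drafted tokens via rejection sampling on probabilities; the greedy exact-match rule here is a simplification of that idea.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    # 1) draft proposes k tokens autoregressively (cheap)
    proposal = input_ids
    for _ in range(k):
        next_id = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, next_id], dim=1)

    # 2) target scores all k drafted positions in ONE forward pass
    logits = target(proposal).logits
    n = input_ids.shape[1]
    target_pred = logits[:, n - 1:-1].argmax(-1)  # target's pick at each drafted slot
    drafted = proposal[:, n:]

    # 3) keep the longest prefix where draft and target agree
    agree = (target_pred == drafted)[0]
    accepted = int(agree.long().cumprod(0).sum())

    # 4) always gain one more token: the target's own next prediction
    next_tok = logits[:, n - 1 + accepted].argmax(-1, keepdim=True)
    return torch.cat([input_ids, drafted[:, :accepted], next_tok], dim=1)
```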
Notebook 57 · 55 min

vLLM Serving

Deploy LLMs in production with vLLM's continuous batching and PagedAttention.

vLLM · Continuous Batching · PagedAttention · Production Serving
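
A minimal vLLM offline-inference sketch; continuous batching and PagedAttention are handled inside the engine, so submitting many prompts at once is enough to exercise the scheduler. The model name is an illustrative assumption.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          gpu_memory_utilization=0.90)          # headroom for the paged KV cache
params = SamplingParams(temperature=0.7, max_tokens=256)

# Many prompts at once: the engine batches them continuously, iteration by iteration.
outputs = llm.generate(["Explain PagedAttention in one sentence."] * 32, params)
print(outputs[0].outputs[0].text)
```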
Notebook 58 · 55 min

TensorRT-LLM

NVIDIA's TensorRT-LLM for maximum throughput on A100/H100 GPUs.

TensorRT · TensorRT-LLM · NVIDIA · INT8 Inference
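
A sketch using TensorRT-LLM's high-level LLM API (available in recent releases), which mirrors vLLM's interface; the optimized engine is built for the target GPU on first load. The model name is an illustrative assumption.

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # engine built on first load
params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["What makes the H100 fast?"], params):
    print(output.outputs[0].text)
```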
Notebook 59 · 45 min

Continuous Batching

Dynamic batch scheduling for production LLM serving: swap finished sequences out for queued requests without stalling the batch.

Continuous Batching · Dynamic Scheduling · Orca · Iteration-Level Batching
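
A toy sketch of Orca-style iteration-level scheduling: after every decode step, finished sequences are evicted and waiting requests take their slots, instead of draining the whole batch. step_fn is a stand-in for one decode iteration on one sequence.

```python
from collections import deque

def serve(requests, step_fn, max_batch=8):
    """step_fn(seq) advances a sequence by one token; returns True when done."""
    queue = deque(requests)
    active, completed = [], []
    while queue or active:
        # refill free slots from the queue: the 'continuous' part
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # one decode iteration over the current batch
        still_running = []
        for seq in active:
            if step_fn(seq):
                completed.append(seq)   # evict immediately, no batch drain
            else:
                still_running.append(seq)
        active = still_running
    return completed
```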
← Previous: Alignment & Specialized Techniques · Next: LLM Inference Optimization →