Stage 0 · 30 min

Environment Setup

Configure your GPU environment, install PyTorch and HuggingFace libraries, verify CUDA, and set up HuggingFace authentication.

Tags: CUDA · PyTorch · HuggingFace · GPU Setup · bitsandbytes

Why Environment Setup Matters

The most common cause of failed LLM training runs is not the algorithm — it's the environment. Wrong CUDA version, missing libraries, or insufficient GPU memory can silently corrupt results or cause cryptic errors hours into a training run.

GPU Requirements by Task

Task                  | Minimum VRAM | Recommended
Fine-tune 7B (QLoRA)  | 12 GB        | 24 GB
Fine-tune 13B (QLoRA) | 16 GB        | 40 GB
Full fine-tune GPT-2  | 4 GB         | 8 GB
Inference 7B (4-bit)  | 6 GB         | 8 GB
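
A quick way to compare your runtime against this table is to read total VRAM from PyTorch. A minimal sketch (the helper name and the 12 GB threshold are illustrative, taken from the table above):

import torch

def has_enough_vram(min_gb: float) -> bool:
    """True if the GPU's total memory meets a minimum from the table."""
    if not torch.cuda.is_available():
        return False
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return total_bytes / 1e9 >= min_gb

print(has_enough_vram(12.0))  # e.g. QLoRA fine-tune of a 7B model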

Colab GPU Tiers

Google Colab provides free access to NVIDIA T4 (15 GB VRAM) GPUs. Colab Pro adds A100 (40/80 GB) access. For this curriculum, a T4 is sufficient through Stage 4. Stage 5 kernels benefit from A100 for Flash Attention and TensorRT-LLM.
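
To confirm which tier you actually received, the device name is enough. A minimal sketch:

import torch

# A T4 reports ~15 GB total memory; an A100 reports 40 or 80 GB
if torch.cuda.is_available():
    print(f"Runtime GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU attached; switch the Colab runtime type")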

Essential Checks Before Every Session

import gc
import torch

# 1. Verify a GPU is attached
assert torch.cuda.is_available(), "No GPU — switch to a GPU runtime"

# 2. Check free vs. total VRAM (mem_get_info returns bytes)
free, total = torch.cuda.mem_get_info()
print(f"Free: {free/1e9:.1f} GB / {total/1e9:.1f} GB")

# 3. Clear stale memory: collect dead Python objects first so their
#    CUDA tensors are released, then return cached blocks to the driver
gc.collect()
torch.cuda.empty_cache()

HuggingFace Authentication

Models like LLaMA-2, Mistral, and Gemma are "gated" — you must accept the license on HuggingFace Hub before downloading. Set your token as a Colab secret named HF_TOKEN to avoid exposing it in notebook cells.
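
One way to wire this up in a notebook, assuming the Colab secret is named HF_TOKEN as above (google.colab.userdata is Colab-specific; outside Colab, login() falls back to an interactive prompt):

from huggingface_hub import login

try:
    from google.colab import userdata  # Colab's secrets helper
    login(token=userdata.get("HF_TOKEN"))
except ImportError:
    login()  # interactive token prompt outside Colab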

Key Takeaways

  • Verify CUDA availability with torch.cuda.is_available() before any training
  • bitsandbytes and trl are NOT pre-installed on Colab — always pip install them
  • Use a HuggingFace token for gated models like LLaMA-2 and Mistral
  • GPU memory monitoring prevents OOM crashes during training
  • A project directory structure prevents notebook clutter and model confusion (see the sketch after this list)
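
On the last point, a minimal sketch of one possible layout (the directory names are illustrative, not prescribed by the curriculum):

from pathlib import Path

# Separate raw data, checkpoints, final models, and logs
for sub in ("data", "checkpoints", "models", "logs"):
    Path("llm-project", sub).mkdir(parents=True, exist_ok=True)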

Core Concepts

CUDA & GPU Detection

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform. PyTorch uses CUDA to offload tensor operations to the GPU, making training often 50-100x faster than on CPU. Always check torch.cuda.is_available() and inspect compute capability — Flash Attention requires CC ≥ 8.0 (Ampere or newer).

import torch

print(f"CUDA: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")  # Flash Attention needs >= 8.0

Key Libraries for LLM Training

  • transformers: pre-trained models and training utilities
  • datasets: fast, memory-mapped data loading
  • peft: LoRA/QLoRA implementations
  • bitsandbytes: 4-bit and 8-bit quantization
  • trl: PPO and DPO trainers for alignment
  • accelerate: transparent multi-GPU and mixed-precision handling

pip install transformers datasets peft bitsandbytes trl accelerate evaluate
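
After installing, a quick sanity check that the whole stack imports cleanly and a record of the versions in use (output will vary by runtime):

import transformers, datasets, peft, bitsandbytes, trl, accelerate

for lib in (transformers, datasets, peft, bitsandbytes, trl, accelerate):
    print(f"{lib.__name__}: {lib.__version__}")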