Stage 0 · 30 min

Environment Setup

Configure your GPU environment, install PyTorch and HuggingFace libraries, verify CUDA, and set up HuggingFace authentication.

Tags: CUDA · PyTorch · HuggingFace · GPU Setup · bitsandbytes

Why Environment Setup Matters

The most common cause of failed LLM training runs is not the algorithm — it's the environment. Wrong CUDA version, missing libraries, or insufficient GPU memory can silently corrupt results or cause cryptic errors hours into a training run.

GPU Requirements by Task

Task                  | Minimum VRAM | Recommended
Fine-tune 7B (QLoRA)  | 12 GB        | 24 GB
Fine-tune 13B (QLoRA) | 16 GB        | 40 GB
Full fine-tune GPT-2  | 4 GB         | 8 GB
Inference 7B (4-bit)  | 6 GB         | 8 GB
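
A quick way to compare your runtime against this table is to read total VRAM from PyTorch. A minimal sketch (the helper name and the 12 GB threshold are illustrative, taken from the table above):

import torch

def has_enough_vram(min_gb: float) -> bool:
    """True if the GPU's total memory meets a minimum from the table."""
    if not torch.cuda.is_available():
        return False
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return total_bytes / 1e9 >= min_gb

print(has_enough_vram(12.0))  # e.g. QLoRA fine-tune of a 7B model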

Colab GPU Tiers

Google Colab provides free access to NVIDIA T4 (15 GB VRAM) GPUs. Colab Pro adds A100 (40/80 GB) access. For this curriculum, a T4 is sufficient through Stage 4. Stage 5 kernels benefit from A100 for Flash Attention and TensorRT-LLM.
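
To confirm which tier you actually received, the device name is enough. A minimal sketch:

import torch

# A T4 reports ~15 GB total memory; an A100 reports 40 or 80 GB
if torch.cuda.is_available():
    print(f"Runtime GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU attached; switch the Colab runtime type")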

Essential Checks Before Every Session

import gc
import torch

# 1. Verify a GPU is attached
assert torch.cuda.is_available(), "No GPU — switch to a GPU runtime"

# 2. Check free vs. total VRAM (mem_get_info returns bytes)
free, total = torch.cuda.mem_get_info()
print(f"Free: {free/1e9:.1f} GB / {total/1e9:.1f} GB")

# 3. Clear stale memory: collect dead Python objects first so their
#    CUDA tensors are released, then return cached blocks to the driver
gc.collect()
torch.cuda.empty_cache()

HuggingFace Authentication

Models like LLaMA-2, Mistral, and Gemma are "gated" — you must accept the license on HuggingFace Hub before downloading. Set your token as a Colab secret named HF_TOKEN to avoid exposing it in notebook cells.
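
One way to wire this up in a notebook, assuming the Colab secret is named HF_TOKEN as above (google.colab.userdata is Colab-specific; outside Colab, login() falls back to an interactive prompt):

from huggingface_hub import login

try:
    from google.colab import userdata  # Colab's secrets helper
    login(token=userdata.get("HF_TOKEN"))
except ImportError:
    login()  # interactive token prompt outside Colab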

Key Takeaways

  • Verify CUDA availability with torch.cuda.is_available() before any training
  • bitsandbytes and trl are NOT pre-installed on Colab — always pip install them
  • Use a HuggingFace token for gated models like LLaMA-2 and Mistral
  • GPU memory monitoring prevents OOM crashes during training
  • A project directory structure prevents notebook clutter and model confusion (see the sketch after this list)
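
On the last point, a minimal sketch of one possible layout (the directory names are illustrative, not prescribed by the curriculum):

from pathlib import Path

# Separate raw data, checkpoints, final models, and logs
for sub in ("data", "checkpoints", "models", "logs"):
    Path("llm-project", sub).mkdir(parents=True, exist_ok=True)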

Core Concepts

CUDA & GPU Detection

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform. PyTorch uses CUDA to offload tensor operations to the GPU, making training often 50-100x faster than on CPU. Always check torch.cuda.is_available() and inspect compute capability — Flash Attention requires CC ≥ 8.0 (Ampere or newer).

import torch

print(f"CUDA: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")  # Flash Attention needs >= 8.0

Key Libraries for LLM Training

  • transformers: pre-trained models and training utilities
  • datasets: fast, memory-mapped data loading
  • peft: LoRA/QLoRA implementations
  • bitsandbytes: 4-bit and 8-bit quantization
  • trl: PPO and DPO trainers for alignment
  • accelerate: transparent multi-GPU and mixed-precision handling

pip install transformers datasets peft bitsandbytes trl accelerate evaluate
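
After installing, a quick sanity check that the whole stack imports cleanly and a record of the versions in use (output will vary by runtime):

import transformers, datasets, peft, bitsandbytes, trl, accelerate

for lib in (transformers, datasets, peft, bitsandbytes, trl, accelerate):
    print(f"{lib.__name__}: {lib.__version__}")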