# Fine-Tuning Llama 3.2 3B on Windows with QLoRA: Complete Guide
I'll walk you through setting up a Windows machine with an RTX 3060 GPU for fine-tuning Llama 3.2 3B using QLoRA with 8-bit quantization on your Alpaca-format dataset.
## 1. Windows Setup with WSL
First, you'll need to set up Windows Subsystem for Linux (WSL), since it provides a more compatible environment for ML tools:
1. **Install WSL with Ubuntu**:
   - Open PowerShell as administrator and run:
     ```powershell
     wsl --install -d Ubuntu
     ```
   - Set up a username and password when prompted.
2. **Configure GPU Access in WSL**:
   - Ensure you have the latest NVIDIA drivers for your 3060 installed on Windows.
   - In WSL, verify GPU access with:
     ```bash
     nvidia-smi
     ```
   - If the command doesn't work, you may need to install the CUDA on WSL driver from NVIDIA: https://developer.nvidia.com/cuda/wsl
3. **Install Essential Development Tools in WSL**:
   ```bash
   sudo apt-get update
   sudo apt-get install -y build-essential python3-dev git wget
   ```
## 2. Python Environment Setup
1. **Install Miniconda in WSL**:
   ```bash
   wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
   bash Miniconda3-latest-Linux-x86_64.sh
   ```
   - Follow the prompts to complete the installation.
   - Restart your terminal or run `source ~/.bashrc`.
2. **Create and Activate a Conda Environment**:
   ```bash
   conda create -n llama_ft python=3.10
   conda activate llama_ft
   ```
3. **Install PyTorch with CUDA Support** (a quick verification follows this list):
   ```bash
   conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
   ```
4. **Install Fine-Tuning Dependencies**:
   ```bash
   pip install transformers datasets accelerate peft trl bitsandbytes xformers
   pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
   ```
## 3. Fine-Tuning Script with QLoRA 8-bit
Now let's put together a Python script for fine-tuning Llama 3.2 3B with QLoRA in 8-bit.
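Below is a minimal sketch of `fine_tune_llama32_3b.py` built on transformers, peft, datasets, and bitsandbytes (it doesn't use Unsloth's wrappers, though you can swap those in for extra speed). The model ID `meta-llama/Llama-3.2-3B-Instruct`, the LoRA hyperparameters, and the learning rate are reasonable assumptions for a 12GB card, not fixed requirements:

```python
# fine_tune_llama32_3b.py -- minimal sketch (transformers + peft + bitsandbytes)
import argparse

from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Gated repo: accept the license on Hugging Face and run `huggingface-cli login` first.
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset_path", required=True)
    parser.add_argument("--output_dir", default="./finetuned_model")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=2)
    parser.add_argument("--gradient_accumulation", type=int, default=4)
    parser.add_argument("--max_seq_length", type=int, default=2048)
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

    # Load the frozen base model in 8-bit via bitsandbytes.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # Attach small trainable LoRA adapters to the attention projections.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    ))

    # Map each Alpaca record onto Llama 3.2's chat template.
    def to_chat(example):
        user = example["instruction"]
        if example.get("input"):
            user += "\n\n" + example["input"]
        messages = [{"role": "user", "content": user},
                    {"role": "assistant", "content": example["output"]}]
        return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

    dataset = load_dataset("json", data_files=args.dataset_path, split="train")
    dataset = dataset.map(to_chat)
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True,
                                max_length=args.max_seq_length),
        batched=True, remove_columns=dataset.column_names,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=args.output_dir,
            num_train_epochs=args.epochs,
            per_device_train_batch_size=args.batch_size,
            gradient_accumulation_steps=args.gradient_accumulation,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=10,
            save_strategy="epoch",  # checkpoint after every epoch
        ),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained(f"{args.output_dir}/final")  # writes only the adapters
    tokenizer.save_pretrained(f"{args.output_dir}/final")


if __name__ == "__main__":
    main()
```

Note that `save_strategy="epoch"` is what gives you a checkpoint after every epoch, and only the small LoRA adapter weights are written to `final`, not the full 3B model.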
## 4. Running the Fine-Tuning
Now let's create a script to test your fine-tuned model:
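Here's a matching sketch of `test_finetuned_model.py`. It loads the same 8-bit base model, attaches the saved LoRA adapters with PEFT, and wraps your prompt in the chat template used during training; the sampling settings are illustrative assumptions:

```python
# test_finetuned_model.py -- minimal sketch for prompting the fine-tuned model
import argparse

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE_ID = "meta-llama/Llama-3.2-3B-Instruct"  # must match the training base model

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", required=True)  # e.g. ./finetuned_model/final
parser.add_argument("--prompt", required=True)
args = parser.parse_args()

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# Load the 8-bit base model, then attach the trained LoRA adapters on top.
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, args.model_path)
model.eval()

# Wrap the prompt in the same chat template used during training.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": args.prompt}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256,
                            do_sample=True, temperature=0.7, top_p=0.9)

# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```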
## 5. Step-by-Step Usage Instructions
1. **Prepare Your Alpaca Format Dataset**:
   - Ensure your dataset is in Alpaca format with "instruction", "input", and "output" fields (see the example file after these steps).
   - Save it as a JSON file or upload it to Hugging Face datasets.
2. **Start Fine-Tuning**:
   Navigate to your project directory in WSL and run:
   ```bash
   python fine_tune_llama32_3b.py --dataset_path /path/to/your/dataset.json --output_dir ./finetuned_model --epochs 3 --batch_size 2 --gradient_accumulation 4
   ```
   Key parameters:
   - `--dataset_path`: Path to your Alpaca-format dataset JSON file
   - `--batch_size`: Start with 2 for a 3060 (12GB VRAM) and lower it if you hit CUDA out-of-memory errors
   - `--gradient_accumulation`: Higher values allow larger effective batch sizes with limited VRAM
   - `--max_seq_length`: Maximum sequence length (default 2048; reduce if needed)
3. **Test Your Fine-Tuned Model**:
   After training completes, test the model with:
   ```bash
   python test_finetuned_model.py --model_path ./finetuned_model/final --prompt "Your test prompt here"
   ```
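For reference, a minimal Alpaca-format dataset file looks like this (the records themselves are just illustrative):

```json
[
  {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Llama 3.2 3B is a small open-weight language model that can be fine-tuned on consumer GPUs...",
    "output": "Llama 3.2 3B is a compact open-weight model suited to fine-tuning on consumer hardware."
  },
  {
    "instruction": "Write a haiku about GPUs.",
    "input": "",
    "output": "Silicon rivers\nstream through a thousand small cores\nheat blooms into thought"
  }
]
```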
## 6. Memory Optimization Tips for RTX 3060 (12GB VRAM)
1. **If You Encounter CUDA Out-of-Memory Errors** (an example invocation follows this list):
   - Reduce `--max_seq_length` to 1024 or 512
   - Decrease the batch size to 1
   - Increase gradient accumulation steps to 8
   - Use 8-bit quantization (already implemented)
2. **Monitoring GPU Usage**:
   - Run `nvidia-smi` in another terminal to monitor GPU memory usage
   - Use `watch -n 1 nvidia-smi` for continuous monitoring
3. **Efficient Training**:
   - Start with a small subset of your dataset to test the setup
   - Use checkpointing to save progress regularly
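Putting the out-of-memory tips together, a conservative run on a 12GB card might look like this (same hypothetical script and flags as in Section 5):

```bash
python fine_tune_llama32_3b.py \
  --dataset_path /path/to/your/dataset.json \
  --output_dir ./finetuned_model \
  --batch_size 1 \
  --gradient_accumulation 8 \
  --max_seq_length 1024
```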
## 7. Understanding the Process
- **QLoRA** fine-tunes by adding small adapter modules to the frozen quantized model, dramatically reducing memory requirements while maintaining performance.
- **8-bit Quantization** stores the model weights in INT8 instead of FP16/FP32, roughly halving memory use relative to FP16 (and cutting it by about 75% relative to FP32).
- **Unsloth** optimizes the training process with memory-efficient operations designed specifically for Llama models.
- **Format Adaptation**: The script automatically maps your Alpaca dataset onto Llama 3.2's chat template.
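To make the first point concrete, PEFT can report how few parameters the LoRA adapters add on top of the frozen base. This sketch assumes the same model ID and 8-bit setup as the training script above:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

# Prints trainable vs. total parameter counts -- the adapters are a tiny
# fraction (well under 1%) of the frozen 3B base weights.
model.print_trainable_parameters()
```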
## 8. Additional Resources
- You can modify the learning rate and other hyperparameters in the script based on your specific task.
- For very specialized tasks, consider increasing the LoRA rank (the `r` parameter) to 32 or 64, as shown below.
- Checkpoints are saved automatically to your specified output directory after each epoch.
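For example, raising the rank in the training sketch's `LoraConfig` might look like this; the `lora_alpha = 2 * r` pairing is a common rule of thumb, an assumption rather than a requirement:

```python
from peft import LoraConfig

# Higher rank gives the adapters more capacity, at the cost of extra VRAM
# and a higher risk of overfitting on small datasets.
lora = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
```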
With this setup, you should be able to fine-tune Llama 3.2 3B effectively on your Windows machine with an RTX 3060. The 8-bit quantization and QLoRA adapters make this feasible even within the 12GB VRAM limit.