🌌 Fine-Tuning Large Language Models with Axolotl on Windows using WSL2, Docker, and an Nvidia RTX 4060
This manual provides a comprehensive, step-by-step guide for fine-tuning Large Language Models (LLMs) using the axolotl.ai framework on a Windows 10/11 machine equipped with Docker Desktop and an Nvidia RTX 4060 GPU. It assumes the user has properly configured CUDA support via the Windows Subsystem for Linux (WSL2).
🌟 1. Prerequisites
Before embarking on the fine-tuning process, it is essential to ensure the system meets the necessary hardware and software requirements and to install prerequisite tools.
⚡ 1.1 Hardware Requirements
- GPU: Nvidia RTX 4060 (Laptop or Desktop variant) with 8GB GDDR6 VRAM. This GPU supports CUDA and has sufficient compute capability (Ada Lovelace architecture) for modern LLM techniques such as 4-bit quantization and bfloat16 computation.
- RAM: Minimum 16GB system RAM recommended; 32GB or more is preferable, especially during model merging or when handling large datasets.
- Storage: Sufficient free disk space is required for:
  - WSL2 distribution (~1GB base, plus environment dependencies).
  - Docker Desktop and pulled/built Docker images (can be several GBs).
  - Base LLM downloads (e.g., 7B models are ~14GB in FP16).
  - Datasets (variable size).
  - Fine-tuning outputs (checkpoints, adapters - can grow significantly depending on save frequency).
  - Merged models and GGUF conversions.
  A fast SSD (NVMe recommended) is highly advised for performance.
- CPU: A modern multi-core CPU will aid in data preprocessing and general system responsiveness.
⚡ 1.2 Software Requirements
- Operating System: Windows 10 (Version 21H2 or later) or Windows 11.
- WSL2: Windows Subsystem for Linux version 2 is mandatory for GPU passthrough to Docker.
- Nvidia Driver: A recent Nvidia Game Ready or Studio driver with CUDA support compatible with WSL2 and the intended Docker image's CUDA version (e.g., CUDA 11.8 or 12.1).
- Docker Desktop: Latest stable version, configured to use the WSL2 backend.
- Git: Required for cloning the Axolotl repository.
- Conda/Miniconda (Optional but Recommended): Useful for managing Python environments, particularly for the llama.cpp GGUF conversion step outside the main Axolotl Docker container.
⚡ 1.3 RTX 4060 (8GB) Fine-tuning Capability
The 8GB VRAM of the RTX 4060 imposes significant constraints on LLM fine-tuning. Success depends heavily on employing memory optimization techniques.
⚡ Table 1: RTX 4060 (8GB) Fine-tuning Capability Estimate
| Model Size (Parameters) | Feasible Technique(s) | Typical Max Sequence Length | Notes |
|---|---|---|---|
| ~1B - 3B | FP16/BF16 LoRA, 8-bit LoRA (QLoRA optional) | 2048 - 4096+ | Relatively flexible; full fine-tuning might be possible for <1B. |
| ~7B | QLoRA (4-bit quantization) essential | 512 - 2048 | Requires careful tuning of batch size, gradient accumulation, and sequence length. |
| > 10B | Generally infeasible for fine-tuning on 8GB VRAM | N/A | Inference might be possible with heavy quantization (e.g., GGUF). |
⚡ Key Optimization Strategies for 8GB VRAM:
- QLoRA: Quantizing the base model to 4-bit (load_in_4bit: true) drastically reduces the memory footprint.
- Parameter-Efficient Fine-Tuning (PEFT): Using LoRA or QLoRA means training only small adapter layers, not the entire model.
- Gradient Checkpointing: Trades compute time for memory by recomputing activations during the backward pass instead of storing them.
- Gradient Accumulation: Processes smaller batches sequentially and accumulates gradients before updating model weights, simulating a larger batch size without the corresponding memory cost.
- Reduced Sequence Length: Shorter sequences (sequence_len) significantly decrease VRAM usage.
- Small Micro Batch Size: Often necessary to set micro_batch_size: 1.
- Memory-Efficient Optimizers: Optimizers like adamw_bnb_8bit save memory compared to standard AdamW.

Achieving successful fine-tuning on this hardware requires understanding that these techniques are often mandatory, not optional, especially for models in the 7B parameter range.
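To see why PEFT is so effective on 8GB of VRAM, the following Python sketch estimates the number of trainable LoRA parameters for a Llama-2-7B-style model (hidden size 4096, 32 decoder layers, adapters on the four attention projections). The dimensions and rank are illustrative assumptions matching the sample config later in this guide, not values read from any particular checkpoint.

```python
# Rough estimate of trainable LoRA parameters vs. full fine-tuning.
# Assumes a Llama-2-7B-like shape: hidden_size=4096, 32 decoder layers,
# LoRA applied to q_proj, k_proj, v_proj, o_proj (each 4096x4096 here for simplicity).

hidden_size = 4096              # illustrative; check the model's config.json
num_layers = 32                 # illustrative
targeted_modules_per_layer = 4  # q_proj, k_proj, v_proj, o_proj
lora_r = 16                     # LoRA rank, matches lora_r: 16 in the sample config (Section 4.2)

# Each LoRA adapter adds two small matrices: A (r x d_in) and B (d_out x r).
params_per_module = lora_r * (hidden_size + hidden_size)
trainable_lora_params = params_per_module * targeted_modules_per_layer * num_layers

full_model_params = 7_000_000_000  # ~7B, for comparison

print(f"Trainable LoRA parameters : {trainable_lora_params:,}")   # ~16.8M
print(f"Fraction of full model    : {trainable_lora_params / full_model_params:.4%}")
```

With these assumptions, only about 0.24% of the model's parameters receive gradients and optimizer state, which is the main reason QLoRA plus LoRA adapters fits in 8GB at all.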
⚡ 1.4 Necessary Installations
1. Git: Download and install Git for Windows from https://git-scm.com/download/win. Accept the default settings during installation. Verify the installation by opening Command Prompt or PowerShell and typing git --version.
2. Conda/Miniconda (Optional):
   - Download Miniconda (a minimal installer for Conda) from https://docs.conda.io/en/latest/miniconda.html. Choose the Windows 64-bit installer.
   - Run the installer. It's generally recommended to install for "Just Me" and not add Conda to the system PATH (allow the installer to initialize Conda in your chosen shell, e.g., PowerShell or Git Bash, if preferred).
   - This provides environment management capabilities useful for tasks outside the primary Docker workflow, like GGUF conversion.
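As an optional sanity check before moving on, this small Python sketch (run from a Windows Python interpreter, where the wsl and docker commands are expected) verifies that the command-line tools used throughout this guide are on the PATH. The tool list is simply the set discussed above.

```python
import shutil

# Tools referenced in this guide; adjust the lists to your setup.
required_tools = ["git", "docker", "wsl"]
optional_tools = ["conda"]

for tool in required_tools:
    path = shutil.which(tool)
    status = path if path else "NOT FOUND - install it before continuing"
    print(f"{tool:>6}: {status}")

for tool in optional_tools:
    path = shutil.which(tool)
    print(f"{tool:>6}: {path if path else 'not found (optional)'}")
```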
🌟 2. Windows Environment Setup
Properly configuring Windows, WSL2, Nvidia drivers, and Docker Desktop is critical for enabling GPU acceleration within the Axolotl container.
⚡ 2.1 Install and Configure WSL2
1. Enable WSL Feature: Open PowerShell as Administrator and run:
```powershell
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
```
2. Restart: Reboot your computer when prompted.
3. Set WSL2 as Default: Open PowerShell as Administrator and run:
```powershell
wsl --set-default-version 2
```
4. Install a Linux Distribution: Open the Microsoft Store, search for a distribution (e.g., “Ubuntu 22.04 LTS”), and install it.
5. Launch and Initialize: Launch the installed distribution (e.g., Ubuntu) from the Start Menu. It will perform a one-time initialization and prompt you to create a username and password for the Linux environment. Remember these credentials.
6. Update Distribution: Once inside the WSL terminal, update the package lists and upgrade packages:
```bash
sudo apt update && sudo apt upgrade -y
```
7. Verify WSL Version: In PowerShell, run wsl -l -v. Ensure your distribution is listed with VERSION set to 2. If not, run wsl --set-version <DistroName> 2.
⚡ 2.2 Install Nvidia Drivers for CUDA on WSL2
Crucially, the Nvidia driver installed on the Windows host provides CUDA support within WSL2.
1. Download Driver: Go to the Nvidia Driver Downloads page (https://www.nvidia.com/Download/index.aspx). Select your GPU (GeForce RTX 40 Series -> RTX 4060), OS (Windows 10/11), and choose either the latest Game Ready Driver (GRD) or Studio Driver (SD). Studio drivers are often preferred for stability in compute workloads, but Game Ready drivers typically work fine. Ensure the driver version supports the CUDA toolkit version used by the Axolotl Docker image (check the Axolotl docs/Dockerfile).
2. Clean Installation Recommended: Download the driver installer. During installation, select "Custom (Advanced)" installation options and check the box for "Perform a clean installation." This removes previous driver profiles and can prevent conflicts.
3. Install: Complete the installation process and restart your computer if prompted.
4. Verify Driver in Windows: Right-click the desktop -> Nvidia Control Panel -> System Information (bottom left). Check the driver version and CUDA version listed.
5. Verify Driver in WSL2: Open your WSL2 terminal (e.g., Ubuntu) and run:
```bash
nvidia-smi
```
This command should execute successfully and display your GPU details (RTX 4060), driver version, and CUDA version. If it fails, revisit the driver installation steps or troubleshoot the WSL2 setup. Common issues include needing a system reboot after driver installation or an outdated WSL kernel (run wsl --update in PowerShell).
⚡ 2.3 Install and Configure Docker Desktop for Windows
Docker Desktop acts as the bridge, allowing containers running within the WSL2 environment to access the host’s GPU.
1. Download Docker Desktop: Get the installer from https://www.docker.com/products/docker-desktop/.
2. Install Docker Desktop: Run the installer. Ensure the option "Use WSL 2 instead of Hyper-V" (or similar wording) is checked during installation or initial setup. This is the default and required setting.
3. Configure the WSL2 Backend: After installation, Docker Desktop should start automatically. Go to Settings (gear icon):
   - General: Ensure "Use the WSL 2 based engine" is checked.
   - Resources > WSL Integration: Ensure "Enable integration with my default WSL distro" is checked. Also explicitly enable integration for the specific Linux distribution you installed (e.g., Ubuntu-22.04) by toggling its switch to 'on'. Apply & Restart Docker Desktop if changes were made.
4. Allocate Resources (Optional but Recommended): Under Resources > Advanced, consider adjusting the allocated CPUs, Memory, and Swap space for Docker if needed, though default settings are often sufficient for single-GPU fine-tuning. Ensure enough memory is allocated (e.g., 8GB+).
⚡ 2.4 Verify GPU Access within WSL2/Docker Environment
Confirm that Docker containers running via the WSL2 backend can correctly detect and utilize the Nvidia GPU.
1. Open a WSL2 Terminal: Launch your Linux distribution (e.g., Ubuntu).
2. Run a CUDA Test Container: Execute the following Docker command:
```bash
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
```
   - Replace nvidia/cuda:12.1.1-base-ubuntu22.04 with a CUDA image version compatible with your installed driver if necessary.
   - --gpus all tells Docker to provide GPU access to the container.
   - --rm automatically removes the container when it exits.
3. Check the Output: This command should pull the CUDA image (if not present locally) and then execute nvidia-smi inside the container. The output should mirror the nvidia-smi output seen directly in WSL2, showing your RTX 4060. If the command fails with errors related to GPU devices or drivers, revisit the Nvidia driver installation (Section 2.2) and the Docker Desktop WSL2 integration settings (Section 2.3), and ensure Docker Desktop is running. Successful execution confirms that the entire stack (Windows driver -> WSL2 -> Docker Desktop -> container) is correctly configured for GPU-accelerated workloads.
🌟 3. Axolotl Setup
With the environment prepared, the next step is to obtain the Axolotl code and set up its Docker environment.
⚡ 3.1 Clone the Axolotl GitHub Repository
1. Navigate to a Project Directory: Open your WSL2 terminal and navigate to a suitable location where you want to store your LLM fine-tuning projects (e.g., cd ~ or mkdir ~/projects && cd ~/projects).
2. Clone the Repository: Use Git to clone the official Axolotl repository:
```bash
git clone https://github.com/OpenAccess-AI-Collective/axolotl.git
```
3. Navigate into the Directory: Change into the cloned directory:
```bash
cd axolotl
```
⚡ 3.2 Set up the Docker Environment for Axolotl
Axolotl provides Dockerfiles to create containerized environments with all necessary dependencies. You can either pull a pre-built image or build one locally.
- Option 1: Pulling the Pre-built Image (Recommended)
  - This is generally faster and less prone to local build issues. Axolotl maintainers typically provide pre-built images on Docker Hub.
  - Command:
  ```bash
  docker pull winglian/axolotl:main-latest
  # Or specify a version tag if needed, e.g., winglian/axolotl:0.4.0-cuda11.8-py3.10
  ```
  - > ⚠️ Note: Check the Axolotl documentation or GitHub repository for the currently recommended image tag. main-latest often points to the latest development build using a recent CUDA version (e.g., 12.1). Ensure this CUDA version is compatible with your installed Nvidia driver (Section 2.2). Using a specific version tag corresponding to your setup might be more reliable.
  - Advantage: Quicker setup; avoids potential local environment conflicts during the build.
- Option 2: Building the Image Locally
  - Necessary if you need specific customizations within the Docker environment or if a suitable pre-built image isn't available. This requires more time, disk space, and a stable internet connection.
  - Command (run from within the cloned axolotl directory):
  ```bash
  docker build -t axolotl-custom . -f docker/Dockerfile.main
  # Or use a different Dockerfile if specified (e.g., Dockerfile.cuda118)
  # Use '-t axolotl-custom' to name your local image distinctly
  ```
  - Considerations: The build process can take a significant amount of time (30 minutes or more). Ensure Docker Desktop has sufficient resources allocated (RAM, CPU). Build failures can occur due to transient network issues or changes in upstream dependencies.
- Verification: After pulling or building, run docker images. You should see the winglian/axolotl image (if pulled) or your custom-named image (e.g., axolotl-custom) listed.
⚡ 3.3 Recommended Directory Structure
Organizing your configuration files, datasets, and model outputs is crucial for managing fine-tuning experiments effectively. A structured approach prevents confusion and aids reproducibility. Consider the following structure, created alongside (or outside) the cloned axolotl directory:
```
/home/your_wsl_user/llm_finetuning/        # Main directory in WSL
├── axolotl/                               # The cloned Axolotl repository
├── project_gemma_instruct/
│   ├── config_gemma_qlora.yml             # Axolotl YAML config for this project
│   ├── data/                              # Directory for training/validation data
│   │   └── instructions_dataset.jsonl
│   └── output/                            # Directory for Axolotl outputs
│       └── ...                            # Checkpoints, adapters, logs will appear here
├── project_phi3_chat/
│   ├── config_phi3_mini_qlora.yml
│   ├── data/
│   │   └── chat_conversations.json
│   └── output/
└── base_models/                           # (Optional) Central location for downloaded base models
    └── meta-llama/Llama-2-7b-hf/          # Example downloaded model
```
⚡ Explanation:
- Keep the axolotl repository clone separate from your specific project files.
- Create a dedicated directory for each fine-tuning experiment or project (e.g., project_gemma_instruct).
- Inside each project directory:
  - Store the specific Axolotl YAML configuration file(s) used for that project. Naming them descriptively (e.g., including the model and method) is helpful.
  - Create a data subdirectory to hold the training and validation datasets.
  - Create an output subdirectory. This is where Axolotl will save checkpoints, adapter weights, logs, and potentially the final merged model, as specified by the output_dir parameter in the YAML configuration.
- Optionally, maintain a central base_models directory outside individual projects if you frequently reuse base models. This avoids redundant downloads but requires careful path management when mounting volumes into the Docker container and referencing paths in the YAML.

This organization isolates the components for each fine-tuning task, making it easier to manage configurations, track datasets, locate results, and rerun experiments.
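If you prefer to script the setup, a short Python sketch can create the skeleton above. The root directory and project names are just the examples from the tree; adjust them to your own experiments.

```python
from pathlib import Path

# Adjust to your WSL home directory and project names.
root = Path.home() / "llm_finetuning"
projects = ["project_gemma_instruct", "project_phi3_chat"]

for project in projects:
    for sub in ("data", "output"):
        # parents=True creates the project directory as well; exist_ok makes it idempotent.
        (root / project / sub).mkdir(parents=True, exist_ok=True)

# Optional central location for downloaded base models
(root / "base_models").mkdir(parents=True, exist_ok=True)

print(f"Created project skeleton under {root}")
```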
🌟 4. Axolotl Configuration (YAML Explained)
Axolotl uses YAML (.yml) files to define every aspect of the fine-tuning process. These files are human-readable and specify the base model, dataset(s), training parameters, quantization methods, PEFT strategies, and more. The configuration file is the primary way to control the fine-tuning run.
⚡ 4.1 Overview
The YAML configuration file should be placed within your project directory (as per the recommended structure in Section 3.3). When launching the fine-tuning process via Docker, this file will be mounted into the container and passed to the Axolotl training script.
⚡ 4.2 Sample YAML Structure for RTX 4060 (QLoRA focus)
The following is a heavily commented sample YAML configuration, specifically tailored for fine-tuning a ~7B parameter model on an RTX 4060 (8GB VRAM) using QLoRA. Parameters critical for memory management are called out in the comments.

```yaml
# Sample Axolotl Config for RTX 4060 (8GB VRAM) - QLoRA Example for a ~7B Model

# Base model identifier from Hugging Face Hub or local path
base_model: meta-llama/Llama-2-7b-hf
# Type of model architecture (refer to the model's config.json on Hugging Face)
model_type: LlamaForCausalLM
# Type of tokenizer (usually AutoTokenizer works)
tokenizer_type: AutoTokenizer

# === Quantization (Crucial for 8GB VRAM) ===
# Enable 4-bit quantization for the base model (QLoRA)
load_in_4bit: true
# Quantization type (nf4 recommended for NormalFloat4)
bnb_4bit_quant_type: nf4
# Compute dtype during 4-bit ops (bfloat16 recommended for Ampere and newer GPUs)
bnb_4bit_compute_dtype: bfloat16
# Use nested quantization for minor memory savings
bnb_4bit_use_double_quant: true
# Allows loading models with slight architecture mismatches (use with caution)
strict: false

# === Dataset Configuration ===
datasets:
  - path: /workspace/data/your_dataset.jsonl  # Path *inside* the Docker container
    type: alpaca  # Specify dataset format (e.g., alpaca, sharegpt, jsonl)
    # Optional: Define roles for chat formats if needed
    # Optional: Define prompt templates if using custom formats
# Optional: Path to save/load preprocessed data (speeds up subsequent runs)
dataset_prepared_path: last_run_prepared
# Percentage of the dataset to use for validation (e.g., 0.05 = 5%)
val_set_size: 0.05

# === PEFT Configuration (QLoRA) ===
# Use the QLoRA adapter type
adapter: qlora
# LoRA rank (dimension of adapter matrices). Lower = less memory / fewer params.
lora_r: 16
# LoRA alpha (scaling factor, often 2*r).
lora_alpha: 32
# Dropout probability for LoRA layers (regularization).
lora_dropout: 0.05
# Target modules for LoRA injection (VERY model-specific!).
# Find these in model documentation or Axolotl examples for the specific model.
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
# Add other linear layers often targeted in Llama-style models if needed:
#   - gate_proj
#   - up_proj
#   - down_proj
# Sometimes needed depending on library versions and model implementation
# lora_fan_in_fan_out: false

# === Training Hyperparameters ===
# Maximum sequence length (tokens). Major VRAM factor! Adjust carefully.
sequence_len: 1024
# Pack multiple short sequences into one sample (improves throughput)
sample_packing: true
# Pad sequences to sequence_len (needed for sample_packing)
pad_to_sequence_len: true
# Effective Batch Size = micro_batch_size * gradient_accumulation_steps * num_gpus
# Batch size per GPU. Keep low (often 1) for 8GB VRAM.
micro_batch_size: 1
# Accumulate gradients over N steps to simulate a larger batch size.
gradient_accumulation_steps: 8  # Effective batch size = 1 * 8 * 1 = 8
# Number of training epochs (passes over the dataset).
num_epochs: 1
# Alternatively, specify max_steps instead of num_epochs
# max_steps: -1
# Optimizer choice. adamw_bnb_8bit is memory-efficient.
optimizer: adamw_bnb_8bit
# Learning rate. Often lower for fine-tuning (e.g., 1e-5 to 5e-5).
learning_rate: 2e-5
# Learning rate scheduler type.
lr_scheduler: cosine
# Number of warmup steps for the scheduler.
warmup_steps: 10
# Whether to train on the input/instruction parts of prompts (usually false for instruction tuning)
train_on_inputs: false
# Group sequences by length for potentially faster training (can interact with sample_packing)
group_by_length: false

# === Precision and Memory Optimization ===
# Use BrainFloat16 mixed precision (requires an Ampere or newer GPU, such as the RTX 4060).
# Note: If load_in_4bit=true, bnb_4bit_compute_dtype handles compute precision;
# bf16/fp16 here primarily affect non-quantized weights (e.g., adapter weights).
bf16: true   # Preferred over fp16 on Ampere and newer GPUs if supported
# Use Float16 mixed precision (wider compatibility).
fp16: false  # Set only one of bf16/fp16 to true, or neither for FP32 (uses the most memory)
# Enable gradient checkpointing (essential memory saver!).
gradient_checkpointing: true
# Recommended setting for newer PyTorch/Transformers versions
gradient_checkpointing_kwargs: { use_reentrant: false }

# === Logging and Saving ===
# Log metrics every N steps.
logging_steps: 10
# Save a model checkpoint (adapter weights) every N steps.
save_steps: 100
# Evaluate on the validation set every N steps (requires val_set_size > 0).
eval_steps: 100
# Output directory *inside* the Docker container.
output_dir: /workspace/output

# === Advanced Distributed Training (Usually not needed for a single RTX 4060) ===
# DeepSpeed configuration path (optional, adds complexity).
# deepspeed: deepspeed_configs/zero0.json
# Fully Sharded Data Parallel configuration (optional, adds complexity).
# fsdp: ...
# fsdp_config: ...

# === Monitoring Integration (Optional) ===
# Weights & Biases configuration
# wandb_project: axolotl-llama2-7b-qlora
# wandb_entity: your_wandb_username   # Your WandB account username
# wandb_watch: gradients              # options: false, gradients, parameters, all
# wandb_log_model: false              # Set 'true' or 'checkpoint' to upload adapters to WandB (consumes storage)

# === Other Options ===
# Use the Flash Attention 2 implementation if available/supported (can save memory and speed up training)
# flash_attention: true   # Requires compatible model, libraries, and hardware
```
⚡ 4.3 Key Parameter Breakdown (RTX 4060 Focus)
Understanding these parameters and their impact on VRAM is key to successful tuning on the RTX 4060:
- base_model, model_type, tokenizer_type: Specify the starting LLM. Find identifiers on the Hugging Face Hub (e.g., meta-llama/Llama-2-7b-hf). The model_type should match the architecture class in the model's config.json. AutoTokenizer usually works.
- load_in_4bit / load_in_8bit: Essential for 8GB VRAM. load_in_4bit: true enables QLoRA, drastically reducing the base model's memory footprint. load_in_8bit: true uses more memory but might be viable for smaller models (< 7B). The associated bnb_4bit_* parameters fine-tune 4-bit loading: nf4 is standard, a bfloat16 compute dtype is a good fit for the RTX 4060, and use_double_quant offers minor extra savings.
- strict: false: Can sometimes help load models with custom code or minor configuration differences, but use it cautiously.
- datasets: Defines the training data. Requires path (the location inside the Docker container, e.g., /workspace/data/my_data.jsonl) and type (e.g., alpaca, sharegpt, jsonl, completion). The path mapping happens during the docker run command (Section 6.1).
- sequence_len: A major VRAM determinant; memory usage grows rapidly with sequence length (attention scales quadratically). Start conservatively (e.g., 512 or 1024 for 7B models) and increase only if VRAM allows. sample_packing: true is highly recommended, as it fills sequences efficiently, reducing wasted computation and memory on padding, especially with variable-length data.
- adapter: Specifies the PEFT method. qlora is the standard choice for 4-bit base models on memory-constrained hardware. lora is used if load_in_4bit and load_in_8bit are false (requires more VRAM).
- lora_r, lora_alpha, lora_dropout, lora_target_modules: Parameters controlling the LoRA adapters.
  - lora_r: Rank of the adapter matrices. Higher values mean more trainable parameters and potentially better adaptation, but increased VRAM usage. Common values are 8, 16, 32, 64. Start low (e.g., 16) for 8GB VRAM.
  - lora_alpha: Scaling factor for LoRA, often set to 2 * lora_r.
  - lora_dropout: Dropout rate for regularization within LoRA layers.
  - lora_target_modules: Critically important and model-specific. This list tells Axolotl which linear layers (typically within the attention mechanism) to inject the LoRA adapters into. Incorrect or incomplete lists will lead to poor fine-tuning results or errors. Consult Axolotl examples for your specific base model, or inspect the model's architecture (e.g., with print(model) in Python; see the sketch after this list) to identify potential target layers (e.g., q_proj, v_proj, query_key_value, fc1, fc2).
- gradient_accumulation_steps: Simulates a larger batch size by accumulating gradients over multiple smaller forward/backward passes before performing an optimizer step. If micro_batch_size must be 1 due to VRAM limits, increase gradient_accumulation_steps (e.g., to 4, 8, 16) to achieve a reasonable effective batch size (e.g., 8 or 16). This increases training time per epoch but significantly reduces peak VRAM usage.
- micro_batch_size: The actual number of samples processed by the GPU in a single forward/backward pass. For 7B models with QLoRA on 8GB VRAM, this often must be set to 1.
- num_epochs / max_steps: Control how long the training runs. Use one or the other. num_epochs depends on dataset size, while max_steps is an absolute number of training steps.
- optimizer: Selects the optimization algorithm. adamw_bnb_8bit uses 8-bit optimizer states, saving considerable VRAM compared to the standard 32-bit adamw_torch. paged_adamw_8bit might offer further savings by paging optimizer states between GPU and CPU RAM.
- learning_rate, lr_scheduler, warmup_steps: Standard training controls. Fine-tuning often benefits from smaller learning rates (e.g., 1e-5 to 5e-5) compared to pre-training. A cosine scheduler with a small number of warmup_steps (e.g., 10-100) is common.
- fp16 / bf16: Enable mixed-precision training. On the RTX 4060 (Ada Lovelace architecture, which supports BF16), bf16: true is generally preferred for numerical stability if load_in_4bit is false. If load_in_4bit: true, the compute precision is primarily determined by bnb_4bit_compute_dtype; enabling bf16 or fp16 mainly affects the precision of non-quantized weights (such as the LoRA adapters themselves) and gradient computations, reducing memory compared to full FP32 precision.
- gradient_checkpointing: true: An essential memory-saving technique. It avoids storing intermediate activations for the entire model during the forward pass, recomputing them during the backward pass instead. This significantly reduces VRAM usage at the cost of increased computation time (typically ~20-30% slower training). The gradient_checkpointing_kwargs: { use_reentrant: false } setting is often recommended with recent library versions for compatibility.
- output_dir: Specifies the directory inside the container where Axolotl saves checkpoints, adapters, and logs. This path must correspond to a mounted volume (Section 6.1) for outputs to persist on the host machine.
- logging_steps, save_steps, eval_steps: Control the frequency (in training steps) of logging metrics, saving checkpoints, and performing evaluation on the validation set. More frequent saving allows resuming closer to a point of failure but consumes more disk space.
- deepspeed / fsdp: Configuration for distributed training frameworks. While powerful for multi-GPU setups, they typically add unnecessary complexity for single-GPU fine-tuning on an RTX 4060. The combination of QLoRA, gradient checkpointing, and gradient accumulation is usually sufficient and simpler to manage. DeepSpeed ZeRO Stage 0 might offer minor benefits or overhead but is generally not required.

The interplay between these parameters defines a trade-off space. For instance, increasing sequence_len consumes more VRAM, potentially requiring a decrease in lora_r or an increase in gradient_accumulation_steps (which slows down training) to compensate. Finding the optimal configuration for the RTX 4060 involves balancing adaptation quality (influenced by lora_r, training data, and duration), training speed, and the hard 8GB VRAM limit.
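Because lora_target_modules is model-specific, it helps to list the linear layers of the architecture before writing the config. The sketch below builds the model skeleton without downloading or allocating weights (via accelerate's init_empty_weights) and prints candidate module names. It assumes transformers and accelerate are installed; the model ID is the example used throughout this guide, and gated models may require a prior huggingface-cli login.

```python
from collections import Counter

import torch.nn as nn
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # example; swap in your base model

config = AutoConfig.from_pretrained(model_id)
# Instantiate the module tree with empty (meta) weights: nothing is downloaded or allocated.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Count linear layers by their short name (the part LoRA configs refer to).
linear_names = Counter(
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
)

print("Candidate lora_target_modules (linear layers and how often they appear):")
for short_name, count in linear_names.most_common():
    print(f"  {short_name}: {count}")
```

For a Llama-style model this typically surfaces names such as q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj; other architectures will print their own naming scheme.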
⚡ 4.4 Specific Model Examples/Recommendations (RTX 4060)
Adapting the base configuration for specific popular models:
- Gemma (e.g., google/gemma-7b-it or google/gemma-2b-it)
- model_type: GemmaForCausalLM
- 7B Model: load_in_4bit: true is almost certainly required. Use bnb_4bit_compute_dtype: bfloat16 as Gemma was trained with BFloat16. Start with sequence_len: 1024 or 2048 and adjust based on memory.
- 2B Model: Might fit with load_in_8bit: true or even FP16/BF16 LoRA (load_in_4bit: false, load_in_8bit: false, adapter: lora, bf16: true). Allows for potentially longer sequence_len.
- lora_target_modules: Typically includes attention (q_proj, k_proj, v_proj, o_proj) and potentially feed-forward layers (gate_proj, up_proj, down_proj). Verify against Gemma examples in the Axolotl repository or documentation.
- Phi-3 (e.g., microsoft/phi-3-mini-4k-instruct - 3.8B)
- model_type: Phi3ForCausalLM (Verify from model’s config.json)
- load_in_4bit: true is recommended for the 3.8B model on 8GB VRAM, although load_in_8bit: true might be feasible with careful tuning.
- lora_target_modules: Phi-3 uses different layer naming conventions. Check Axolotl examples or the Phi-3 documentation. Common targets might be ["qkv_proj", "o_proj", "gate_up_proj", "down_proj"] or similar variations like fc1, fc2 in older Phi models. Accuracy here is critical.
- sequence_len: Phi-3 Mini supports up to 4k or even longer contexts in some variants. However, on 8GB VRAM, start much lower (e.g., 2048) and increase cautiously only if memory permits.
- Consider adding flash_attention: true to the YAML if using a compatible version of transformers and Axolotl, as Phi-3 is optimized for Flash Attention 2, which can save memory and improve speed.
- Granite Code (e.g., ibm-granite/granite-3b-code-instruct - 3B)
- model_type: Check the config.json. Often based on GPTNeoXForCausalLM or similar architectures.
- load_in_4bit: true is recommended for safety on 8GB, though 8-bit or BF16 LoRA might work.
- lora_target_modules: For GPT-NeoX based models, targets are often ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]. Verify for the specific Granite model.
- sequence_len: Start conservatively, e.g., 1024 or 2048.
General RTX 4060 Strategy: Begin experiments with QLoRA (load_in_4bit: true), micro_batch_size: 1, gradient_checkpointing: true, adamw_bnb_8bit optimizer, a moderate gradient_accumulation_steps (e.g., 8), a conservative sequence_len (e.g., 1024), and lora_r: 16. Monitor VRAM usage closely during the initial training steps (using nvidia-smi or WandB).
- If you encounter OOM errors: Reduce sequence_len first, then consider increasing gradient_accumulation_steps further, or decreasing lora_r.
- If significant VRAM headroom exists: Cautiously increase sequence_len, decrease gradient_accumulation_steps, or increase lora_r or micro_batch_size (if > 1 is feasible) to potentially improve training speed or adaptation quality.
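Since sequence_len is the biggest VRAM lever, it is worth measuring how long your training examples actually are before settling on a value. The sketch below tokenizes a JSONL file and prints simple length statistics; the file path, field name, and model ID are placeholders to replace with your own data and base model.

```python
import json
import statistics

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # your base model
dataset_path = "data/instructions_dataset.jsonl"                       # your dataset
text_field = "text"  # or build the full prompt string from instruction/input/output fields

lengths = []
with open(dataset_path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Fall back to concatenating all fields if there is no single text field.
        text = record.get(text_field) or " ".join(str(v) for v in record.values())
        lengths.append(len(tokenizer(text)["input_ids"]))

lengths.sort()
print(f"examples         : {len(lengths)}")
print(f"mean tokens      : {statistics.mean(lengths):.0f}")
print(f"median tokens    : {lengths[len(lengths) // 2]}")
print(f"95th percentile  : {lengths[int(len(lengths) * 0.95)]}")
print(f"max tokens       : {lengths[-1]}")
```

If, say, 95% of examples fit under 1024 tokens, a sequence_len of 1024 with sample_packing enabled wastes little data while keeping VRAM usage predictable.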
🌟 5. Dataset Preparation
The quality and format of your training data are paramount for successful LLM fine-tuning. Axolotl expects data to be structured in specific ways.
⚡ 5.1 Overview
LLMs learn by example. Fine-tuning adapts a pre-trained model to better perform specific tasks (like following instructions, adopting a persona, or understanding a specific domain) based on the examples provided in the dataset. Axolotl supports several common data formats, simplifying the process, but requires the data to adhere strictly to the chosen format’s structure.
⚡ 5.2 Common Axolotl Dataset Formats
Axolotl can handle various dataset structures, specified using the type parameter within the datasets list in the YAML configuration. Some common formats include:
- JSON Lines (JSONL): Each line in the file is a separate, valid JSON object. This is a flexible base format.
  - Example for the completion type (where the model learns to complete the text):
  ```json
  {"text": "Instruction: Translate to French.\nInput: Hello world\nOutput: Bonjour le monde"}
  {"text": "Instruction: Summarize the following text.\nInput: [Long text here]\nOutput:"}
  ```
  - YAML: type: completion (or potentially text_completion depending on the Axolotl version/config). The text field is concatenated and used for training.
- Alpaca Format: Designed for instruction-following tasks. Typically a JSON list of objects, or a JSONL file where each line is such an object. Axolotl uses the fields to construct a prompt.
  - Example (JSON list; values are illustrative):
  ```json
  [
    {"instruction": "Translate to French.", "input": "Hello world", "output": "Bonjour le monde"},
    {"instruction": "Explain black holes.", "input": "", "output": "A black hole is a region of spacetime..."}
  ]
  ```
  - YAML: type: alpaca. Axolotl automatically formats this into a prompt like: "Below is an instruction... \n### Instruction:\n{instruction}\n### Input:\n{input}\n### Response:\n{output}". Variations like alpaca_chat, alpaca_simple, or custom prompt templates might also be available.
- ShareGPT Format: Suitable for multi-turn conversation data. Usually a JSON list where each object contains a list of turns.
  - Example (JSON list; values are illustrative):
  ```json
  [
    {
      "conversations": [
        {"from": "human", "value": "What is fine-tuning?"},
        {"from": "gpt", "value": "Fine-tuning adapts a pre-trained model to a specific task or dataset."}
      ]
    }
    // ... more conversation objects
  ]
  ```
  - YAML: type: sharegpt. Axolotl processes the conversations list, often applying specific chat templates based on roles (human, gpt, system).
- Conversational Format: A generic term for chat data. Axolotl might support specific structures with fields like role and content, or system, prompt, completion. Consult the Axolotl documentation for currently supported generic conversational types and their expected JSON structures.
- Axolotl inst_tune.html Format: This refers to a specific internal format Axolotl might use or have used previously. If required, its exact structure (likely HTML tags or a JSON representation derived from them) should be detailed in the official Axolotl documentation. > ⚠️ Note: Standard formats like Alpaca, ShareGPT, and JSONL are more commonly used and recommended unless specific documentation points to this format.
- Hugging Face Hub Datasets: Axolotl can often load datasets directly from the Hugging Face Hub. Instead of a local path, you provide the Hub dataset identifier (e.g., databricks/databricks-dolly-15k) and potentially specify splits (train, test), column remappings, etc., within the datasets YAML section. Check the Axolotl documentation for examples.
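Whatever format you choose, a quick way to confirm that a local file loads cleanly is to open it with the Hugging Face datasets library and inspect a record. The file name below is a placeholder for your own dataset.

```python
from datasets import load_dataset

# Load a local JSONL file with the generic "json" builder.
ds = load_dataset("json", data_files="data/instructions_dataset.jsonl", split="train")

print(ds)               # number of rows and column names
print(ds.column_names)  # e.g., ['instruction', 'input', 'output'] for Alpaca-style data
print(ds[0])            # inspect the first record to verify keys and values
```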
⚡ 5.3 Formatting Custom Datasets
If your data is not already in one of these formats, you need to convert it.
1. Choose a Target Format: Select the Axolotl format that best matches your fine-tuning goal (e.g., Alpaca for instruction tasks, ShareGPT for chatbots, JSONL completion for general text modeling).
2. Structure the Data: Ensure each data record precisely matches the chosen format's structure (correct JSON keys, nesting, data types). Consistency is crucial.
3. Encoding: Save text files using UTF-8 encoding.
4. Conversion Script (Example): A simple Python script can help convert data (e.g., from CSV or Python dictionaries) into the required JSONL format.
```python
import json
# import csv  # Or pandas for more complex data loading

# Assume the input data is a list of dictionaries
# (e.g., loaded from a CSV or database)
custom_data = [
    {"user_query": "Explain black holes.", "model_answer": "A black hole is a region of spacetime..."},
    {"user_query": "Write Python code to sort a list.", "model_answer": "my_list.sort()"},
]

output_jsonl_file = "formatted_dataset.jsonl"

# Example: Convert to Alpaca-style JSONL for instruction tuning
try:
    with open(output_jsonl_file, "w", encoding="utf-8") as outfile:
        for entry in custom_data:
            # Adapt the keys based on your source data and target format
            alpaca_record = {
                "instruction": entry["user_query"],
                "input": "",  # Add context here if applicable
                "output": entry["model_answer"],
            }
            # Write each record as a JSON object on a new line
            outfile.write(json.dumps(alpaca_record) + "\n")
    print(f"Successfully converted data to {output_jsonl_file}")
except Exception as e:
    print(f"Error during conversion: {e}")
```
⚡ 5.4 Tips for Data Cleaning and Validation
High-quality data is essential for effective fine-tuning. Low-quality data leads to poorly performing models.
- Remove Duplicates: Identify and remove identical or near-identical examples.
- Handle Malformed Entries: Check for and fix or remove entries with missing required fields (e.g., a missing output in Alpaca format) or invalid JSON syntax. Use a JSON validator or linter.
- Normalize Text: Ensure consistent whitespace usage, handle or remove problematic special characters, and potentially normalize case if appropriate for the task.
- Validate Structure: Write a small script or use tools to verify that every entry in your dataset conforms to the expected structure (e.g., all records in Alpaca JSON have instruction, input, and output keys); a sketch follows this list.
- Content Quality: Review the data for relevance, accuracy, and alignment with the desired model behavior. Remove examples that are nonsensical, harmful, or contradictory to the fine-tuning objective.
- Use Libraries: The Hugging Face datasets library can be helpful for loading various formats, performing basic cleaning operations, and validating data.

Dataset preparation can be time-consuming but is a high-leverage activity. Errors in format often lead to cryptic loading failures within Axolotl, while errors in quality lead to suboptimal model performance. Investing time here prevents significant headaches later.
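As referenced in the list above, here is a minimal validation sketch for an Alpaca-style JSONL file: it checks that every line parses, that the required keys are present, and reports exact duplicates. The path and key names are assumptions to adapt to your own dataset.

```python
import json

dataset_path = "data/instructions_dataset.jsonl"    # adjust
required_keys = {"instruction", "input", "output"}  # Alpaca-style records

seen = set()
problems, duplicates, valid = 0, 0, 0

with open(dataset_path, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            print(f"line {line_no}: invalid JSON ({e})")
            problems += 1
            continue
        missing = required_keys - record.keys()
        if missing:
            print(f"line {line_no}: missing keys {sorted(missing)}")
            problems += 1
            continue
        # Use a canonical serialization as a duplicate fingerprint.
        fingerprint = json.dumps(record, sort_keys=True)
        if fingerprint in seen:
            duplicates += 1
            continue
        seen.add(fingerprint)
        valid += 1

print(f"valid: {valid}, problems: {problems}, exact duplicates: {duplicates}")
```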
🌟 6. Running the Fine-tune
Once the environment is set up, the Axolotl configuration YAML is prepared, and the dataset is formatted correctly, you can launch the fine-tuning process using Docker.
⚡ 6.1 Launching the Docker Container and Training
The core idea is to run the Axolotl Docker image as a container, providing it access to the GPU and mounting local directories (containing the config file, data, and output location) into the container’s filesystem.
⚡ Key Concept: Volume Mounts (-v flag)
The -v host_path:container_path flag in the docker run command maps a directory from your host system (specifically, from the WSL2 filesystem accessible at paths like /mnt/c/Users/… or within your WSL user’s home directory ~) to a specific path inside the running container. This mechanism is essential:
1. It allows the Axolotl process inside the container to read your config.yml file.
2. It allows Axolotl to read your dataset files from the specified data directory.
3. It allows Axolotl to write outputs (checkpoints, logs, adapters) to the specified output directory, making them persistent on your host machine even after the container exits.
⚡ Command Structure:
Execute the following command from your WSL2 terminal, ensuring you are inside the specific project directory (e.g., ~/llm_finetuning/project_gemma_instruct) that contains your config.yml, data/, and output/ subdirectories:
```bash
# Ensure you are in your project directory containing config.yml, data/, output/
# Example: cd ~/llm_finetuning/project_gemma_instruct
#
# Optionally add a mount for a central base-models directory, e.g.:
#   -v /mnt/c/Users/YourUser/Documents/llm_base_models:/base_models \
# The image line specifies the Axolotl tag you pulled or built; the final line is the
# command executed inside the container.

docker run \
  --gpus all \
  --rm \
  -it \
  -v $(pwd)/config_gemma_qlora.yml:/workspace/config.yml \
  -v $(pwd)/data:/workspace/data \
  -v $(pwd)/output:/workspace/output \
  winglian/axolotl:main-latest \
  accelerate launch -m axolotl.cli.train /workspace/config.yml
```
⚡ Explanation of Flags and Arguments:
- docker run: Command to create and start a new container.
- --gpus all: Grants the container access to all available Nvidia GPUs. Essential for hardware acceleration.
- --rm: Automatically removes the container's filesystem when the container exits. Since outputs are saved via volume mounts, this keeps the system clean. Omit this if you need to inspect the container's filesystem after it stops. Use -d instead of -it to run in detached (background) mode.
- -it: Allocates a pseudo-TTY and keeps stdin open, allowing interactive access to the container's terminal output.
- -v $(pwd)/config_gemma_qlora.yml:/workspace/config.yml: Mounts the specific YAML file from your current host directory ($(pwd) resolves to the current path) to /workspace/config.yml inside the container. Adjust the host path (config_gemma_qlora.yml) to match your actual config file name.
- -v $(pwd)/data:/workspace/data: Mounts the data subdirectory from your host project folder to /workspace/data inside the container. Crucially, the path specified under datasets in your YAML file must refer to this container path (e.g., /workspace/data/instructions_dataset.jsonl).
- -v $(pwd)/output:/workspace/output: Mounts the output subdirectory from your host project folder to /workspace/output inside the container. The output_dir parameter in your YAML file must be set to this container path (e.g., /workspace/output).
- winglian/axolotl:main-latest: Specifies the Docker image to use. Replace main-latest with the specific tag you pulled (Section 3.2) or the name you gave your locally built image (e.g., axolotl-custom).
- accelerate launch -m axolotl.cli.train /workspace/config.yml: The command executed inside the container once it starts. accelerate launch is part of the Hugging Face Accelerate library, used by Axolotl to handle device placement and distributed training setups (even for a single GPU). axolotl.cli.train is the main training script, and it takes the path to the configuration file inside the container (/workspace/config.yml) as its argument.

Getting the volume mount paths (-v flags) and the corresponding paths inside the YAML configuration (datasets.path, output_dir) exactly right is critical. Mismatches are a common source of "file not found" errors during dataset loading or when Axolotl tries to save checkpoints.
⚡ 6.2 Passing the Configuration File
As highlighted above, the final argument to the command inside the container (accelerate launch…) must be the path within the container where the YAML file has been mounted. In the example command, the host file $(pwd)/config_gemma_qlora.yml is mounted to /workspace/config.yml inside the container, and /workspace/config.yml is then passed to the training script.
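Because path mismatches between the -v mounts and the YAML are such a common failure mode, it can help to generate the docker run command from the project directory instead of typing it by hand. The following is a small convenience sketch; the image tag and file names are the examples used in this guide and should be adjusted to your setup.

```python
from pathlib import Path

project_dir = Path.cwd()                  # run from your project directory
config_file = "config_gemma_qlora.yml"    # your Axolotl config
image = "winglian/axolotl:main-latest"    # or your locally built tag

command = " \\\n  ".join([
    "docker run",
    "--gpus all --rm -it",
    f"-v {project_dir / config_file}:/workspace/config.yml",
    f"-v {project_dir / 'data'}:/workspace/data",
    f"-v {project_dir / 'output'}:/workspace/output",
    image,
    "accelerate launch -m axolotl.cli.train /workspace/config.yml",
])

# Warn early if the expected files/directories are missing on the host.
for required in (config_file, "data", "output"):
    if not (project_dir / required).exists():
        print(f"WARNING: {project_dir / required} does not exist")

print(command)
```

Copy the printed command into your WSL2 terminal; if any WARNING lines appear, fix the project layout (Section 3.3) before launching.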
🌟 7. Monitoring the Training
Observing the training process is essential for understanding progress, diagnosing issues, and deciding when to stop. Axolotl provides feedback via terminal output and integrates with monitoring tools like TensorBoard and Weights & Biases (WandB).
⚡ 7.1 Terminal Output
The most direct way to monitor is by watching the terminal output from the docker run command. Key information includes:
- Initialization: Logs related to loading the base model, tokenizer, processing the dataset, and setting up training. Pay attention to any warnings or errors during this phase.
- Progress: Hugging Face transformers typically displays progress bars showing the current training step, epoch, and estimated time remaining.
- Loss Metrics: Regularly printed lines show the training loss (e.g., loss: 1.2345). This value should generally decrease as training progresses, indicating the model is learning from the data. If an evaluation dataset is configured (val_set_size > 0 and eval_steps set), you will also see evaluation loss (eval_loss).
- Learning Rate: The current learning rate used by the optimizer might be logged, showing the effect of the learning rate scheduler.
- Throughput: Steps per second or samples per second might be displayed, giving an idea of training speed.
- VRAM Usage: While not always explicitly logged by Axolotl itself, underlying libraries might occasionally print GPU memory usage summaries. It's often useful to monitor VRAM independently (see below).
- Saving Checkpoints: Notifications when checkpoints (adapter weights if using PEFT) are saved to the specified output_dir (e.g., Saving checkpoint to /workspace/output/checkpoint-100).
⚡ 7.2 TensorBoard Integration
TensorBoard provides a web-based interface for visualizing metrics logged during training, such as loss curves over time.
- Configuration: Axolotl typically logs TensorBoard data automatically if the necessary libraries are installed in the Docker image (which official images usually have). The logs are saved within a subdirectory (often named runs or similar) inside the output_dir specified in your YAML.
- Launching TensorBoard:
  1. While the docker run command is executing the training, open a new WSL2 terminal window.
  2. Navigate to the host's output directory for your project: cd ~/llm_finetuning/project_gemma_instruct/output
  3. Launch TensorBoard, pointing it to the directory containing the event files: tensorboard --logdir . (or tensorboard --logdir ./runs if the logs are in a runs subdirectory).
  4. TensorBoard will output a URL, usually http://localhost:6006. Open this URL in your web browser on Windows to access the dashboard and view the training/evaluation loss curves and other logged metrics.
⚡ 7.3 Weights & Biases (WandB) Integration
Weights & Biases is a popular cloud-based platform for experiment tracking, offering more extensive logging and visualization features than TensorBoard, including system resource monitoring (GPU VRAM, temperature, utilization). It requires a free account (https://wandb.ai/).
- Configuration:
  1. Enable WandB logging by setting parameters in your Axolotl YAML file:
  ```yaml
  wandb_project: your_project_name    # e.g., axolotl-gemma-finetune
  wandb_entity: your_wandb_username   # Your WandB account username (optional, uses default)
  wandb_run_id: gemma-qlora-run1      # Optional: specific ID for this run
  wandb_watch: gradients              # Optional: log gradients/parameters ('gradients', 'parameters', 'all', 'false')
  wandb_log_model: false              # Set 'true' or 'checkpoint' to upload model artifacts (consumes WandB storage)
  ```
  2. Authentication: The container needs your WandB API key.
     - Method 1 (Login inside the container): Before the main docker run command, you might run an interactive shell in the image (docker run -it --rm winglian/axolotl:main-latest bash), run wandb login, paste your API key (from your WandB settings page), exit, and then run the training command. The login might persist for subsequent runs depending on Docker's volume/layer caching.
     - Method 2 (Environment Variable): Pass your API key as an environment variable in the docker run command (more secure and reproducible):
  ```bash
  docker run \
    --gpus all --rm -it \
    -v ... \
    -e WANDB_API_KEY="YOUR_ACTUAL_API_KEY_HERE" \
    winglian/axolotl:main-latest \
    accelerate launch -m axolotl.cli.train /workspace/config.yml
  # Replace "-v ..." with your volume mounts from Section 6.1.
  ```
- Monitoring: Once training starts, Axolotl will automatically log metrics, the configuration YAML, and potentially system stats to your specified WandB project. You can access the live dashboard by logging into the WandB website. Monitoring VRAM usage is particularly important on the RTX 4060. WandB provides real-time graphs of GPU memory usage, which can help anticipate OOM errors. Alternatively, you can manually check VRAM usage by opening another WSL2 terminal and running docker exec <container_id_or_name> nvidia-smi, where <container_id_or_name> is the ID or name of your running Axolotl container (find it using docker ps).
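If you prefer not to use WandB, a simple alternative is to poll nvidia-smi from a second WSL2 terminal while training runs. The sketch below does that from Python at a fixed interval; it only relies on standard nvidia-smi query flags and is purely an optional convenience.

```python
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader",
]

try:
    while True:
        # Print one CSV line per GPU: timestamp, used VRAM, total VRAM, utilization.
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        print(result.stdout.strip())
        time.sleep(10)  # poll every 10 seconds
except KeyboardInterrupt:
    print("Stopped monitoring.")
```

Watching memory.used climb toward 8192 MiB during the first few hundred steps is the earliest warning that an OOM is likely and that sequence_len or batch settings need adjusting.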
🌟 8. Evaluating and Using the Model
After the fine-tuning process completes (or is stopped), the next steps involve evaluating the resulting model, potentially merging adapter weights, and converting it for use in other applications like Ollama.
⚡ 8.1 Merging LoRA Adapters (if applicable)
If you used a PEFT method like LoRA or QLoRA (where adapter: was set in the YAML), the output_dir will contain the trained adapter weights, not a complete fine-tuned model. For standalone inference or conversion to formats like GGUF, these adapters usually need to be merged back into the original base model’s weights.
- Axolotl Merge Script: Axolotl often includes a utility script for this purpose. Check the documentation or use python -m axolotl.cli.merge_lora --help inside the container environment to find the exact command and options. A typical command might look like this (run inside the Axolotl Docker container or a compatible Python environment with axolotl, torch, transformers, and peft installed):
  ```bash
  # Example command - verify options with the Axolotl documentation
  python -m axolotl.cli.merge_lora \
    --base_model_name_or_path meta-llama/Llama-2-7b-hf \
    --lora_model_name_or_path /workspace/output \
    --output_dir /workspace/merged_model \
    --load_in_8bit=false \
    --load_in_4bit=false   # Merge into full precision (FP16/BF16), as is usual
  # --lora_model_name_or_path may need to point to a specific checkpoint-XXX dir if multiple exist
  ```
  - --base_model_name_or_path: Identifier or path to the original base model used for fine-tuning.
  - --lora_model_name_or_path: Path to the Axolotl output directory containing the adapter_model.bin and adapter_config.json files (often the main output_dir or a specific checkpoint-XXX subdirectory).
  - --output_dir: Path where the merged model (in Hugging Face format) will be saved. Ensure this path is mapped via a volume mount if running in Docker.
  - Quantization flags (--load_in_4bit, --load_in_8bit) during merging are typically set to false to produce a merged model in standard precision (FP16 or BF16), which is usually required for subsequent GGUF conversion. Merging directly into a quantized model might be possible but less common and may have specific library requirements.
- Manual Merging with transformers and peft: Alternatively, you can merge using a Python script:
  ```python
  import os

  import torch
  from peft import PeftModel
  from transformers import AutoModelForCausalLM, AutoTokenizer

  # --- Configuration ---
  base_model_id = "meta-llama/Llama-2-7b-hf"  # HF Hub ID or path to the downloaded base model
  # Path to the directory containing adapter_model.bin, adapter_config.json, etc.
  # This should be the path on your HOST system after training completes.
  adapter_host_path = "/home/your_wsl_user/llm_finetuning/project_gemma_instruct/output"  # ADJUST THIS PATH
  # Path on your HOST system where the merged model will be saved
  merged_model_host_path = "/home/your_wsl_user/llm_finetuning/project_gemma_instruct/merged_model"  # ADJUST THIS PATH
  # --- End Configuration ---

  print(f"Loading base model: {base_model_id}")
  # Load the base model in a higher precision for merging (e.g., float16).
  # Use device_map='auto' if you have enough VRAM/RAM, otherwise 'cpu'.
  base_model = AutoModelForCausalLM.from_pretrained(
      base_model_id,
      torch_dtype=torch.float16,  # Or torch.bfloat16 if preferred/compatible
      device_map="auto",          # Use 'cpu' if the merge fails due to OOM
  )
  tokenizer = AutoTokenizer.from_pretrained(base_model_id)

  print(f"Loading adapter: {adapter_host_path}")
  # Load the LoRA adapter onto the base model
  model_to_merge = PeftModel.from_pretrained(base_model, adapter_host_path)

  print("Merging adapter...")
  # Merge the adapter weights into the base model
  merged_model = model_to_merge.merge_and_unload()
  print("Merge complete.")

  print(f"Saving merged model to: {merged_model_host_path}")
  os.makedirs(merged_model_host_path, exist_ok=True)
  merged_model.save_pretrained(merged_model_host_path)
  tokenizer.save_pretrained(merged_model_host_path)
  print("Merged model and tokenizer saved.")
  ```
- > ⚠️ Note: This script requires significant RAM or VRAM, especially for larger models. Using device_map="cpu" might be necessary if GPU OOM occurs during merging. Ensure the necessary libraries (transformers, peft, torch, accelerate) are installed in the Python environment where you run this script (e.g., your Conda environment).
⚡ 8.2 Basic Inference/Testing
Perform a quick test to see if the fine-tuned model generates coherent outputs related to the fine-tuning task.
- Using the Hugging Face pipeline: A simple way to test inference (run this in an environment with the necessary libraries installed, potentially the Axolotl container or your Conda env):
  ```python
  import torch
  from peft import PeftModel  # Only needed if loading the adapter separately
  from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

  # --- Configuration ---
  # Option 1: Path to the MERGED model directory (on the host or an accessible path)
  model_load_path = "/home/your_wsl_user/llm_finetuning/project_gemma_instruct/merged_model"  # ADJUST THIS PATH
  load_merged = True

  # Option 2: Paths for loading base + adapter separately (if not merged)
  # base_model_id = "meta-llama/Llama-2-7b-hf"
  # adapter_host_path = "/home/your_wsl_user/llm_finetuning/project_gemma_instruct/output"  # ADJUST THIS PATH
  # model_load_path = base_model_id
  # load_merged = False
  # --- End Configuration ---

  print(f"Loading tokenizer from: {model_load_path if load_merged else base_model_id}")
  tokenizer = AutoTokenizer.from_pretrained(model_load_path if load_merged else base_model_id)

  print(f"Loading model from: {model_load_path}")
  # Load the model - adjust dtype and device_map for your inference needs/hardware
  model = AutoModelForCausalLM.from_pretrained(
      model_load_path,
      torch_dtype=torch.float16,  # Or bfloat16, or "auto"
      device_map="auto",          # Use the GPU for faster inference if possible
  )

  # If loading the adapter separately (Option 2)
  if not load_merged:
      print(f"Loading adapter from: {adapter_host_path}")
      model = PeftModel.from_pretrained(model, adapter_host_path)

  # Set the model to evaluation mode (disables dropout, etc.)
  model.eval()
  print("Model loaded successfully.")

  # Create a text generation pipeline
  pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

  # --- Test Prompt ---
  # Adapt this prompt based on how your model was fine-tuned.
  # Example for an Alpaca-style instruction model:
  prompt = "Instruction: Explain the concept of quantum entanglement in simple terms.\nInput:\nOutput:"
  # --- End Test Prompt ---

  print(f"\nGenerating text for prompt: '{prompt}'")
  # Adjust generation parameters as needed
  results = pipe(prompt, max_new_tokens=150, num_return_sequences=1, do_sample=True, temperature=0.7, top_p=0.9)
  print("\nGenerated Text:")
  print(results[0]["generated_text"])
  ```
- Axolotl Inference Server: Check the Axolotl documentation – it might offer built-in scripts or commands to run a simple Gradio or FastAPI-based inference server for testing.
⚡ 8.3 GGUF Conversion for Ollama
To use your fine-tuned model with local inference tools like Ollama or llama.cpp, you need to convert the merged model into the GGUF format. This format uses various quantization techniques optimized for efficient CPU and GPU inference.
- Tool: The primary tool for conversion is the llama.cpp library.
- Prerequisites:
  1. Merged Model: You need the fine-tuned model with adapters merged, saved in the standard Hugging Face format (containing pytorch_model.bin* files, config.json, tokenizer.model or tokenizer.json, etc.). Ensure this merged model exists on your host system (e.g., at the merged_model_host_path from Section 8.1).
  2. llama.cpp Repository: Clone the repository if you haven't already. Open a WSL2 terminal:
  ```bash
  git clone https://github.com/ggerganov/llama.cpp.git
  cd llama.cpp
  ```
  3. Python Environment: Create and activate a dedicated Python environment (using Conda as recommended in Section 1.4, or venv) and install the required dependencies for the conversion scripts:
  ```bash
  # Using Conda
  conda create -n llama_cpp python=3.10 -y
  conda activate llama_cpp
  pip install -r requirements.txt

  # OR using venv
  # python -m venv .venv
  # source .venv/bin/activate
  # pip install -r requirements.txt
  ```
  Ensure pip install -r requirements.txt completes successfully. You might need build tools like build-essential or cmake installed in your WSL distribution (sudo apt install build-essential cmake).
- Conversion Steps:
  1. Navigate and Activate: Ensure you are in the llama.cpp directory in your WSL2 terminal, with the correct Python environment activated (conda activate llama_cpp or source .venv/bin/activate).
  2. Run convert.py: Execute the main conversion script, pointing it to your merged model directory and specifying an output file path and quantization type.
  ```bash
  python convert.py /home/your_wsl_user/llm_finetuning/project_gemma_instruct/merged_model \
    --outfile /home/your_wsl_user/llm_finetuning/project_gemma_instruct/gguf_exports/GemmaInstruct-Finetuned-Q5_K_M.gguf \
    --outtype q5_k_m
  ```
  3. Explanation of Arguments:
     - First argument: Path to the directory containing the merged Hugging Face model files (e.g., /home/your_wsl_user/…/merged_model).
     - --outfile: Full path where the resulting GGUF file will be saved. Create the output directory (gguf_exports in the example) beforehand (mkdir gguf_exports). It's good practice to include the model name and quantization type in the filename.
     - --outtype: Crucial parameter specifying the quantization method applied during GGUF conversion. This determines the final file size, inference speed, and potential quality loss. Common options include:
       - f32: 32-bit float (largest, highest precision, rarely used).
       - f16: 16-bit float (good quality, large size, best for GPU inference).
       - q8_0: 8-bit quantization (good balance of size, speed, quality).
       - q5_k_m: 5-bit K-quants mix (popular balance, good quality, smaller size).
       - q4_k_m: 4-bit K-quants mix (very popular, small size, fast CPU inference, good quality).
       - q4_0: Older 4-bit quantization.
       - q3_k_m, q2_k: Smaller sizes, more quality loss.
       - (Check python convert.py --help for the full list of supported outtype values.)
     - Recommendation: Generate a few different quantizations (e.g., f16, q5_k_m, q4_k_m) to test which provides the best trade-off for your specific needs and inference hardware when using Ollama.
- Using the GGUF with Ollama:
  1. Create Ollama Modelfile: Create a text file named Modelfile (no extension) in a convenient location. This file tells Ollama how to use your GGUF model:

```
# Modelfile for the fine-tuned Gemma Instruct model (Q5_K_M quant)

# Specify the path to your generated GGUF file.
# Use the path accessible by the Ollama service (usually a host path).
FROM /home/your_wsl_user/llm_finetuning/project_gemma_instruct/gguf_exports/GemmaInstruct-Finetuned-Q5_K_M.gguf

# Optional: Set default inference parameters
PARAMETER temperature 0.7
# Set context window size (ensure the model supports it)
PARAMETER num_ctx 4096

# Optional: Define a system prompt
# SYSTEM """You are a helpful assistant fine-tuned on custom data."""

# Optional: Define the prompt template if needed.
# This should match the template used during fine-tuning or the base model's default.
# Example for ChatML-style models:
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""

# Example for Gemma instruct models (check the model card for the exact format):
# TEMPLATE """<start_of_turn>user
# {{ .Prompt }}<end_of_turn>
# <start_of_turn>model
# {{ .Response }}<end_of_turn>"""
```
  - Adjust the FROM path to point to your actual GGUF file.
  - Modify the PARAMETER and TEMPLATE sections as needed for your model and desired behavior. Getting the TEMPLATE right is important for models fine-tuned with specific chat/instruction structures.

  2. Create the Ollama Model: Open a terminal where the ollama command is available (usually PowerShell or Command Prompt on Windows, or your WSL terminal if Ollama is installed there) and run:

```bash
ollama create my-custom-gemma -f /path/to/your/Modelfile
```

  - Replace my-custom-gemma with the name you want to give your model in Ollama.
  - Replace /path/to/your/Modelfile with the actual path to the Modelfile you created.

  3. Run the Model: You can now interact with your fine-tuned model using Ollama:

```bash
ollama run my-custom-gemma "Your prompt here..."
```
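Once the model is created, a quick sanity check can confirm it is registered and responding. This is a minimal sketch, assuming the model was named my-custom-gemma as above and that Ollama is serving its HTTP API on the default local port 11434; the prompts are placeholders:

```bash
# Confirm the model appears in Ollama's local model list.
ollama list

# One-off generation from the command line.
ollama run my-custom-gemma "Summarize what QLoRA fine-tuning does in one sentence."

# Or call the local Ollama HTTP API directly (default port 11434).
curl http://localhost:11434/api/generate -d '{
  "model": "my-custom-gemma",
  "prompt": "Your prompt here...",
  "stream": false
}'
```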
The path from a completed Axolotl fine-tune (especially with PEFT) to a running Ollama model involves several distinct steps: merging adapters, converting the merged model using llama.cpp, choosing appropriate GGUF quantization, and configuring Ollama via a Modelfile. Each stage uses different tools and requires careful attention to paths and configurations.
🌟 9. Troubleshooting Common Issues
Fine-tuning LLMs in a multi-layered environment like Windows/WSL2/Docker can present unique challenges. This section addresses common problems and potential solutions.
⚡ Table 2: Common Troubleshooting Scenarios
Symptom / Error Message | Likely Cause(s) | Solution(s) | Relevant Sections |
---|---|---|---|
nvidia-smi fails in WSL2; docker run --gpus all fails ("could not select device", "driver mismatch"). | Incorrect/corrupt Nvidia driver; WSL2 not installed/running; Docker not using WSL2 backend; WSL integration disabled for distro. | 1. Verify WSL2: wsl -l -v (ensure v2). Restart: wsl --shutdown. 2. Clean reinstall latest Nvidia driver (Game Ready/Studio). Reboot. 3. Verify nvidia-smi in WSL2. 4. Check Docker Desktop settings (WSL2 engine, WSL integration enabled for distro). Restart Docker. 5. Test with docker run --rm --gpus all nvidia/cuda:XX.X… nvidia-smi. | |
CUDA out of memory (OOM) during training. | micro_batch_size too high; sequence_len too long; gradient_accumulation_steps too low; lora_r too high; memory optimizations (load_in_4bit, gradient_checkpointing) disabled/misconfigured. | 1. Set micro_batch_size: 1 in YAML. 2. Increase gradient_accumulation_steps (e.g., 4 -> 8 -> 16). 3. Reduce sequence_len (e.g., 2048 -> 1024 -> 512). 4. Ensure load_in_4bit: true (for QLoRA) and gradient_checkpointing: true. 5. Lower lora_r (e.g., 16 -> 8). 6. Try flash_attention: true if applicable. 7. Use the adamw_bnb_8bit optimizer. 8. Consider a smaller base model. | |
Axolotl fails immediately: YAML parse error, "unrecognized parameter", "expected bool but got string". | Typo in YAML parameter name; incorrect indentation (YAML is whitespace-sensitive); incorrect value type (e.g., True instead of true); missing required parameter; path errors (config, model). | 1. Validate YAML syntax (online validator, IDE). Check indentation (usually 2 spaces). 2. Use lowercase true/false. 3. Double-check parameter names against Axolotl docs/examples. 4. Verify all paths (base_model, datasets.path, output_dir) are correct within the container context. 5. Start from a known-good example YAML and modify incrementally. | |
Error during data loading/preprocessing: "FileNotFoundError", "JSONDecodeError", Hugging Face datasets library error. | datasets.path in YAML doesn't match the container path from the -v mount; dataset file missing on host; incorrect datasets.type specified; malformed data file (invalid JSON/structure). | 1. Verify the -v $(pwd)/data:/workspace/data mount in docker run. 2. Ensure datasets.path in YAML is /workspace/data/your_file.jsonl (or a similar container path). 3. Check that the host path $(pwd)/data contains the file. 4. Verify datasets.type matches the actual file format. 5. Validate data file syntax (JSON lint) and structure against the format spec. Test with a small subset of data. | |
Errors mentioning transformers, torch, bitsandbytes, peft, or accelerate version conflicts, missing attributes/functions. | Outdated/incompatible Axolotl Docker image; manual package installs conflicting; base model requires different library versions; CUDA version mismatch (driver vs. image). | 1. Pull the latest recommended Axolotl image (docker pull winglian/axolotl:tag). 2. Avoid pip install inside a running container; rebuild the image if customization is needed. 3. Check Axolotl GitHub issues for known compatibility problems. 4. Ensure the host Nvidia driver's CUDA version is compatible with the image's CUDA toolkit version. | |

Debugging issues in this stack often requires systematically checking each layer:
1. Host/WSL/Driver: Does nvidia-smi work correctly in the WSL2 terminal?
2. Docker Integration: Does docker run --gpus all … nvidia-smi work? Are the Docker settings correct?
3. Volume Mounts & Paths: Are the -v mounts in docker run correct? Do the paths inside the YAML (datasets.path, output_dir) match the container-side paths from the mounts?
4. YAML Configuration: Is the YAML syntax valid? Are the parameters correct and appropriate for the hardware/model (especially the memory-related ones)?
5. Dataset: Is the dataset file present and correctly formatted according to the specified type?
6. Axolotl/Libraries: Are you using a compatible/recent Axolotl image? Are there known issues with the specific model or libraries being used?
Carefully reading the error messages and considering where in this stack the error might originate is key to efficient troubleshooting.
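The first three layers can be spot-checked from a WSL2 terminal in under a minute. The following is a minimal sketch; the nvidia/cuda image tag is only an example (pick one compatible with your driver), the alpine image is an arbitrary lightweight choice for the mount check, and $(pwd)/data assumes the project layout used earlier in this guide:

```bash
# Layer 1: host/WSL/driver - the GPU should be visible inside WSL2.
nvidia-smi

# Layer 2: Docker GPU integration - the same GPU should be visible from a container.
# Substitute a CUDA tag supported by your driver (see the nvidia/cuda tags on Docker Hub).
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

# Layer 3: volume mounts - confirm the dataset is visible at the container-side path
# referenced by datasets.path in the Axolotl YAML.
docker run --rm -v "$(pwd)/data:/workspace/data" alpine ls -lh /workspace/data
```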
🌟 10. Conclusion
This manual has provided a detailed walkthrough for setting up a Windows 10/11 environment with WSL2, Docker Desktop, and Nvidia drivers to fine-tune Large Language Models using Axolotl on an Nvidia RTX 4060 GPU. We covered the prerequisites, environment configuration, Axolotl setup, YAML configuration specifics (emphasizing QLoRA and memory optimization for the 8GB VRAM constraint), dataset preparation, launching and monitoring training runs, and post-training steps including adapter merging and GGUF conversion for use with Ollama.

Fine-tuning models up to the ~7B parameter class is demonstrably feasible on an RTX 4060 by leveraging techniques like 4-bit quantization (QLoRA), gradient checkpointing, gradient accumulation, and careful management of sequence lengths and batch sizes. The Axolotl framework provides a powerful and flexible tool for orchestrating this process.

Success requires meticulous attention to detail, particularly in environment setup, YAML configuration, and data formatting. The troubleshooting guide provides starting points for addressing the inevitable issues that arise in such a complex software stack.

Users are encouraged to experiment with different hyperparameters, explore various base models suitable for the 8GB VRAM budget, and curate high-quality datasets tailored to their specific goals. The field of LLM fine-tuning is rapidly evolving, and continuous learning is essential. For further information and community support, consult the following resources:
- Official Axolotl Documentation: https://docs.axolotl.ai/
- Axolotl GitHub Repository (Code, Issues, Discussions): https://github.com/OpenAccess-AI-Collective/axolotl
- Hugging Face Hub (Models, Datasets, Documentation): https://huggingface.co/
- llama.cpp GitHub Repository: https://github.com/ggerganov/llama.cpp
- Ollama Website: https://ollama.com/
- Local LLM Communities (e.g., r/LocalLLaMA on Reddit): Valuable sources of practical tips, troubleshooting help, and community knowledge sharing.

With careful configuration and systematic experimentation, the setup described herein empowers users to effectively fine-tune powerful language models locally on consumer-grade hardware.