🌌 Comprehensive Guide to Local LLM Fine-Tuning using Pre-built Docker Images
🌟 1. Introduction
🚀 Welcome to this comprehensive guide! This section will give you the foundational knowledge you need.
⚡ 1.1. Purpose and Scope
The proliferation of Large Language Models (LLMs) has opened new frontiers in artificial intelligence applications. However, general-purpose models often require adaptation to perform optimally on specialized tasks or within specific domains. Fine-tuning allows tailoring these powerful models, but the process involves managing complex software environments, including specific versions of machine learning libraries, CUDA toolkits, and system dependencies. This guide addresses these challenges by focusing on the use of pre-built Docker images for local LLM fine-tuning. Docker provides a robust solution for encapsulating application dependencies, creating isolated, portable, and reproducible environments. By leveraging Docker, developers and researchers can significantly simplify the setup process, ensuring that the fine-tuning environment is consistent and reliable. The scope of this report centers on fine-tuning open-source LLMs, with a particular focus on models such as Phi4, Gemma3, and Granite3.2, alongside other popular architectures like Llama and Mistral. It provides practical, in-depth instructions for two specific fine-tuning objectives:
1. Structured JSON Output: Training LLMs to generate responses adhering to a predefined JSON schema, crucial for tasks like automated document creation, data extraction, and API integration.
2. Knowledge Base Question Answering (KB QA): Fine-tuning LLMs on custom datasets (e.g., internal documentation, research papers) to enable accurate one-shot or few-shot question answering about that specific information corpus.
⚡ 1.2. Target Audience and Benefits
This guide is intended for technical professionals, including DevOps Engineers, MLOps Engineers, Machine Learning Researchers, and Software Developers, who possess foundational knowledge of command-line interfaces and Docker concepts. The primary benefit for this audience is a practical roadmap to setting up and executing local LLM fine-tuning tasks efficiently. The core value proposition lies in simplifying the often-arduous task of environment configuration. General-purpose LLMs may struggle with highly specific formatting requirements or lack knowledge of niche domains. Fine-tuning addresses this but necessitates a precise stack of tools: specific versions of PyTorch or TensorFlow, libraries like Hugging Face Transformers, PEFT (Parameter-Efficient Fine-Tuning), bitsandbytes (for quantization), Accelerate (for distributed training), and compatible NVIDIA CUDA drivers and toolkits. Managing these dependencies manually on a local machine is notoriously difficult, prone to version conflicts, and sensitive to the host system’s configuration, especially concerning GPU drivers.
🌟 2. Understanding the Fine-Tuning Objectives
Fine-tuning adapts a pre-trained LLM to better suit a specific downstream task by further training it on a relevant dataset. The nature of this dataset and the evaluation criteria differ significantly based on the target objective.
⚡ 2.1. Fine-tuning for Structured JSON Output
Concept: The goal is to train an LLM to generate outputs that consistently conform to a specified JSON structure or schema. This is invaluable for applications requiring predictable, machine-readable output, such as populating templates, extracting structured information from unstructured text, generating configuration files, or producing consistent API responses.
Data Formatting: The success of this task hinges critically on the format of the training data. The dataset must explicitly demonstrate the desired behavior. Typically, this involves using an instruction-following format where the input prompt clearly requests information in JSON format, and the corresponding output field contains a valid JSON string matching the target schema. For example, a data entry might look like:
JSON
{
  "instruction": "Extract the name, position, and company from the following text and provide it as a JSON object with keys 'name', 'position', and 'company'. Text: John Doe is a Senior Software Engineer at Acme Corp.",
  "input": "", // Optional, can be part of the instruction
  "output": "{\"name\": \"John Doe\", \"position\": \"Senior Software Engineer\", \"company\": \"Acme Corp.\"}"
}
Common dataset formats like the Alpaca format (using instruction, input, output keys) or simple JSON Lines (JSONL), where each line is a JSON object representing a training example, are often used. It is crucial that the JSON strings in the output field are syntactically correct and adhere to the desired schema.
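Before training, it is worth linting such a dataset programmatically. The short sketch below is one way to do this; it assumes an Alpaca-style JSONL file (the data_json.jsonl name is illustrative) and simply verifies that every line parses and that each output field is itself valid JSON.
Python
import json
import sys

DATASET_PATH = "data_json.jsonl"  # illustrative path; adjust to your mounted data directory

bad = 0
with open(DATASET_PATH, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        try:
            record = json.loads(line)      # each line must be one JSON object
            json.loads(record["output"])   # the output field must itself be valid JSON
        except (json.JSONDecodeError, KeyError) as err:
            bad += 1
            print(f"line {line_no}: {err}", file=sys.stderr)

print(f"invalid examples: {bad}")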
Techniques and Challenges: Instruction fine-tuning is the primary technique. During training and evaluation, it’s beneficial to incorporate validation steps to check if the model’s generated output is valid JSON and, ideally, if it conforms to the specific target schema. Tools or libraries for JSON schema validation might be integrated into the evaluation loop.
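As an illustration, a validation step along these lines could be run over model generations on the validation set. It uses the jsonschema package and a hypothetical schema matching the extraction example above; adapt the schema to your own target format.
Python
import json
from jsonschema import Draft7Validator  # pip install jsonschema

# Hypothetical target schema for the extraction example above.
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "position": {"type": "string"},
        "company": {"type": "string"},
    },
    "required": ["name", "position", "company"],
    "additionalProperties": False,
}
validator = Draft7Validator(SCHEMA)

def check_output(generated_text: str) -> tuple[bool, bool]:
    """Return (is_valid_json, matches_schema) for one model generation."""
    try:
        obj = json.loads(generated_text)
    except json.JSONDecodeError:
        return False, False
    return True, not list(validator.iter_errors(obj))

# Example usage on a single generation:
print(check_output('{"name": "John Doe", "position": "Senior Software Engineer", "company": "Acme Corp."}'))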
⚡ 2.2. Fine-tuning for Knowledge Base Question Answering (One-Shot QA)
Concept: This objective aims to imbue an LLM with specialized knowledge from a specific corpus (e.g., technical manuals, project documentation, legal texts) so that it can answer questions about that information accurately, ideally with the question being the only input provided at inference time (one-shot).
This differs from Retrieval-Augmented Generation (RAG), where knowledge is retrieved from an external database at inference time.
Data Formatting: Preparing a dataset for KB QA fine-tuning involves transforming the source knowledge corpus into a format suitable for training. Common strategies include:
1. Text Chunking: Dividing the large body of text into smaller, manageable segments that fit within the model’s context window.
2. QA Pair Generation: Creating question-answer pairs where the answer is directly supported by the text within a specific chunk. This can be done manually, semi-automatically (e.g., using another LLM to generate initial pairs), or automatically using specialized techniques.
3. Instruction Formatting: Structuring these QA pairs into an instruction-following format. For example:
JSON
{
  "instruction": "Based on the provided context, answer the following question.",
  "input": "Context: <relevant text chunk>\nQuestion: What is the function of the frobnicator?",
  "output": "The frobnicator is responsible for calibrating the widget flux."
}
Or, for a more direct knowledge injection approach:
JSON
{
  "instruction": "What is the function of the frobnicator?",
  "input": "", // No context provided here, relying on fine-tuned knowledge
  "output": "The frobnicator is responsible for calibrating the widget flux."
}
The dataset should aim to cover the breadth and depth of the knowledge base adequately.
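To make the chunking and formatting steps concrete, the sketch below shows one simple approach: naive character-based chunking with overlap and an Alpaca-style record writer. The chunk size, overlap, file names, and the stubbed QA-generation step are all assumptions, not prescriptions.
Python
import json

def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Naive character-based chunking with overlap; swap in sentence-aware splitting as needed."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def to_alpaca_record(question: str, answer: str, context: str | None = None) -> dict:
    """Format one QA pair as an Alpaca-style training example, with or without context."""
    if context is not None:
        return {
            "instruction": "Based on the provided context, answer the following question.",
            "input": f"Context: {context}\nQuestion: {question}",
            "output": answer,
        }
    return {"instruction": question, "input": "", "output": answer}

# The knowledge-base file and the QA pair below are placeholders; in practice the pairs
# come from manual annotation or an LLM-assisted generation step applied to each chunk.
corpus = open("kb.txt", encoding="utf-8").read()
chunks = chunk_text(corpus)
print(f"{len(chunks)} chunks ready for QA-pair generation")

qa_pairs = [("What is the function of the frobnicator?",
             "The frobnicator is responsible for calibrating the widget flux.")]

with open("data_kbqa.jsonl", "w", encoding="utf-8") as f:
    for question, answer in qa_pairs:
        f.write(json.dumps(to_alpaca_record(question, answer)) + "\n")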
Techniques and Trade-offs: Instruction fine-tuning using the generated QA pairs is the standard approach. It’s crucial to understand the trade-offs compared to RAG. Fine-tuning can lead to faster inference (no retrieval step) and potentially better integration of nuanced knowledge. However, it requires significant computational resources for training, updating the knowledge requires retraining the model, and there’s a risk of “catastrophic forgetting” (losing general capabilities) or hallucinating information not present in the training data if not done carefully.
The fundamental difference between these two fine-tuning tasks lies in the objective being learned. JSON output tuning teaches the model a structural format constraint, evaluated by schema validity and content accuracy, requiring data with explicit JSON examples. KB QA tuning teaches the model specific factual information, evaluated by answer correctness against the knowledge base, requiring representative QA pairs derived from that base. While a single fine-tuning framework might execute both types of training runs, the data preparation pipelines and evaluation methodologies surrounding the core training step must be tailored specifically to the task.
🌟 3. Docker for Local LLM Fine-Tuning: Advantages and Considerations
Using Docker offers significant advantages for managing the complexities of LLM fine-tuning environments, but it also introduces specific considerations that users must address.
⚡ 3.1. Advantages
- Environment Consistency: Docker containers encapsulate the operating system libraries, Python packages, specific CUDA toolkit versions, and other dependencies. This ensures that the fine-tuning environment is identical wherever the container is run – on a developer’s local machine, a shared testing server, or potentially even in a staging environment. This eliminates the common “it works on my machine” problem, which is particularly prevalent in complex ML workflows.
- Dependency Management: LLM fine-tuning relies on a precise stack of libraries (e.g., PyTorch, Transformers, PEFT, bitsandbytes, Accelerate). These libraries often have strict version requirements and interdependencies, which can conflict with other software installed on the host system. Docker isolates the fine-tuning environment, preventing such conflicts and simplifying dependency management to defining them within a Dockerfile or using a pre-built image.
- Reproducibility: A Docker image, defined by a Dockerfile or identified by a specific tag, captures the exact state of the environment. This makes it straightforward to rebuild or share the environment, ensuring that fine-tuning experiments can be reliably reproduced by others or by the same user at a later time. This is crucial for scientific rigor and collaborative development.
- Isolation: Docker allows multiple containers to run concurrently on the same host without interfering with each other. This enables running multiple fine-tuning experiments simultaneously, perhaps with different models, datasets, or library versions, without worrying about cross-contamination of environments.
⚡ 3.2. Challenges and Considerations
- Performance Overhead: While generally minimal for GPU-intensive tasks like LLM training, there can be a slight performance overhead associated with containerization compared to running directly on the bare-metal host system. This is typically related to I/O or network operations, but usually negligible for the compute-bound nature of fine-tuning.
- GPU/Hardware Access Configuration: This is often the most significant hurdle when using Docker for GPU-accelerated tasks. The host system must have the NVIDIA drivers installed, and Docker needs to be configured with the NVIDIA Container Toolkit (formerly nvidia-docker2) to allow containers to access the host’s GPUs. Running the container requires specific flags like --gpus all. Debugging issues related to driver mismatches between the host and the container, incorrect toolkit setup, or Docker runtime configurations can be complex.
- Image Size: Docker images for LLM fine-tuning can be substantial, often ranging from several gigabytes to tens of gigabytes. This is due to the inclusion of the base OS, CUDA libraries, large Python packages (like PyTorch), and potentially model weights if bundled. This requires significant disk space on the host and can lead to long download times, especially on slower network connections.
- Debugging: Troubleshooting problems inside a running container can be more challenging than debugging on the host OS. Techniques involve using docker exec to run commands inside the container, attaching an interactive shell (docker exec -it <container_id> /bin/bash), carefully inspecting container logs (docker logs), or using debugging tools compatible with containerized environments.
- Data Management: Datasets, configuration files, and model checkpoints typically reside on the host filesystem. These need to be made accessible to the container using Docker volumes (-v or --volume flag) or bind mounts. Managing file permissions between the host user and the container user can sometimes require careful handling (e.g., using user/group ID mapping flags in the docker run command).
In essence, Docker significantly simplifies the management of software dependencies inherent in LLM fine-tuning. However, this simplification comes at the cost of introducing a new layer of complexity: configuring the interaction between the container and the host’s hardware, particularly the GPU. Mastering the setup of the NVIDIA Container Toolkit and the appropriate docker run commands becomes crucial for successful execution.
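As a sanity check before launching a long training run, a quick PyTorch probe run inside the container can confirm that the GPU is actually visible; the fine-tuning images discussed later ship with PyTorch, so this is a generic check rather than anything image-specific.
Python
# Run inside the container (e.g., after `docker run --gpus all ... bash`)
# to confirm the NVIDIA runtime is wired up before starting training.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device 0:", torch.cuda.get_device_name(0))
    print("Compiled CUDA version:", torch.version.cuda)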
🌟 4. Survey of Pre-built Docker Images & Frameworks
Identifying suitable pre-built Docker images requires searching across platforms like Docker Hub, GitHub repositories, and referencing documentation from popular fine-tuning frameworks. The selection criteria for this guide prioritize images that:
- Explicitly mention support for open-source models, including variants of Phi, Gemma, and Granite where possible.
- Are well-suited for the target fine-tuning tasks (structured JSON output and KB QA via instruction tuning).
- Are based on popular, well-maintained fine-tuning frameworks like Axolotl, LLaMA Factory, or Hugging Face TRL.
- Are readily available on Docker Hub or provide clear instructions for building from source (e.g., via a Dockerfile on GitHub).
- Have reasonable community support, documentation, or examples available.
Based on these criteria, several key candidates emerge, primarily centered around established fine-tuning frameworks:
- Axolotl: A highly popular YAML-configured framework built on Hugging Face libraries (Transformers, PEFT, TRL, Datasets). It supports a wide range of models and parameter-efficient fine-tuning methods like LoRA and QLoRA, along with full fine-tuning and DeepSpeed integration. Its flexibility makes it suitable for both JSON and KB QA tasks.
- LLaMA Factory: Another comprehensive framework supporting numerous models (including Phi and Gemma) and fine-tuning techniques (LoRA, QLoRA, Freeze-tuning, Full). It offers both a command-line interface (CLI) and an optional user-friendly web UI built with Gradio, making it potentially more accessible for users less comfortable with YAML configurations. Official Docker images are available, often hosted on GitHub Container Registry (ghcr.io), simplifying setup.
- Hugging Face TRL (Transformer Reinforcement Learning) & Base Images: While TRL is a library rather than a full framework like Axolotl or LLaMA Factory, it provides core components like the SFTTrainer for supervised fine-tuning (instruction tuning). One could use base PyTorch Docker images (e.g., pytorch/pytorch) and install TRL, PEFT, Transformers, etc., or find community images that bundle these.
- Other Potential Candidates: Frameworks like Predibase’s LoRAX focus more on serving multiple LoRA adapters efficiently but might offer fine-tuning capabilities. Some model providers might release specific fine-tuning containers, although these are less common for general open-source models. Academic projects sometimes release Docker images, but these may lack long-term maintenance.
The LLM fine-tuning ecosystem demonstrates a strong reliance on abstraction layers. Core libraries like Hugging Face Transformers, PEFT, and TRL provide the fundamental building blocks. Frameworks such as Axolotl and LLaMA Factory build upon these, offering simplified configuration interfaces (YAML, CLI, Web UI), pre-configured support for numerous models and methods, and streamlined workflows that handle data loading, model patching (for PEFT), training loops, and checkpointing. Constructing and maintaining Docker images that correctly bundle all necessary dependencies (including specific CUDA versions compatible with libraries like bitsandbytes) for these frameworks is a non-trivial task.
🌟 5. In-Depth Guide: Fine-Tuning with Selected Docker Images
This section provides detailed instructions for using Docker images based on the Axolotl and LLaMA Factory frameworks.
⚡ 5.1. Framework: Axolotl
Axolotl is favored for its flexibility and comprehensive YAML-based configuration.
- Image Identifier & Source:
- Common Docker Hub Repositories: winglian/axolotl, openaccessaicollective/axolotl.
- Tags usually specify CUDA version (e.g., -cuda11.8, -cuda12.1) and potentially library versions. Example: winglian/axolotl:main-py3.10-cu118-2.0.1
- Check the Axolotl GitHub repository for the latest recommended images or Dockerfiles: https://github.com/OpenAccess-AI-Collective/axolotl.
- Docker Hub links: https://hub.docker.com/r/winglian/axolotl, https://hub.docker.com/r/openaccessaicollective/axolotl.
- Supported LLMs & Methods:
- Extensive support for models based on Llama, Mistral, Falcon, MPT, and others. Support for Phi and Gemma models is typically available; check the Axolotl documentation or GitHub issues for specific variant compatibility (e.g., Phi-3). Granite support might be less common but potentially achievable if compatible with underlying Hugging Face architecture classes.
- Methods: LoRA, QLoRA (4-bit and 8-bit via bitsandbytes), Full fine-tuning, ReLoRA, DeepSpeed (ZeRO stages 1, 2, 3), FSDP, Flash Attention 2 (hardware dependent).
- System Requirements:
- Hardware:
- GPU: NVIDIA GPU strongly recommended. VRAM is the primary constraint.
- QLoRA (4-bit) on 7B models: ~10-16GB+ VRAM.
- QLoRA on larger models (e.g., 13B, 34B) or smaller models with larger batch sizes: 24GB+ VRAM.
- Full fine-tuning or larger models: 40GB, 80GB, or multi-GPU setups often required.
- CPU: Modern multi-core CPU (e.g., 8+ cores).
- RAM: 32GB+ recommended, more for larger datasets or models. Data loading and preprocessing can be RAM-intensive.
- Disk Space: 50GB+ free space recommended for the Docker image, datasets, model downloads, and checkpoints. Base models and checkpoints can consume significant space.
- Software:
- OS: Linux is highly recommended. WSL2 on Windows can work but may introduce complexities. macOS support for GPU acceleration is limited/experimental.
- Docker Engine: Latest stable version.
- NVIDIA Container Toolkit: Required for GPU access within Docker. Installation instructions vary by Linux distribution.
- NVIDIA Driver: A compatible driver version is crucial. Check the requirements of the specific CUDA version used in the Docker image tag (e.g., CUDA 11.8 often requires drivers >= 450.80.02, CUDA 12.1 requires >= 530.30.02). Driver/CUDA mismatches are a common source of errors.
- Setup:
1. Install Docker Engine and NVIDIA Container Toolkit on the host machine. Verify toolkit installation (e.g., docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi).
2. Pull the desired Axolotl image: docker pull winglian/axolotl:main-py3.10-cu118-2.0.1 (replace tag as needed).
3. Prepare host directories for configuration, data, and output.
4. Structure the basic docker run command:
Bash
docker run --gpus all --rm -it \
  -v /path/to/your/config:/workspace/config \
  -v /path/to/your/data:/workspace/data \
  -v /path/to/your/output:/workspace/output \
  -v /path/to/huggingface/cache:/root/.cache/huggingface \
  --shm-size=16g \
  winglian/axolotl:main-py3.10-cu118-2.0.1 \
  bash
- --gpus all: Grants access to all host GPUs.
- --rm: Removes the container filesystem on exit.
- -it: Runs in interactive mode with a pseudo-TTY.
- -v /host/path:/container/path: Mounts host directories into the container. Mounting the Hugging Face cache (~/.cache/huggingface) avoids re-downloading models inside the container. Adjust /root/.cache/huggingface if the container user is different.
- --shm-size=16g: Increases shared memory size, often needed for multi-processing operations in PyTorch. Adjust size based on system RAM.
- The final bash command starts an interactive shell inside the container. Training commands will be run from this shell.
- Configuration Files (YAML): Axolotl uses a single YAML file to define the entire fine-tuning job (a minimal example is sketched after this parameter list). Key parameters include:
- base_model: Hugging Face identifier of the model to fine-tune (e.g., microsoft/phi-2, google/gemma-7b).
- model_type: Corresponding AutoModelForCausalLM class (e.g., PhiForCausalLM, GemmaForCausalLM).
- tokenizer_type: Corresponding AutoTokenizer class.
- load_in_4bit / load_in_8bit: Set to true for QLoRA.
- adapter: Set to lora or qlora.
- lora_r: Rank of the LoRA matrices (e.g., 8, 16, 32, 64).
- lora_alpha: LoRA scaling factor (often 2 * lora_r).
- lora_dropout: Dropout probability for LoRA layers.
- lora_target_modules: List of module names within the model to apply LoRA to (e.g., ["q_proj", "v_proj"], ["gate_proj", "up_proj", "down_proj"]). Crucial for effectiveness.
- datasets: List of datasets, each with path (to data file/directory inside container, e.g., /workspace/data/my_data.jsonl), type (e.g., alpaca, sharegpt, json, custom script).
- dataset_prepared_path: Location to save processed/tokenized data.
- val_set_size: Fraction or number of examples for validation.
- sequence_len: Maximum sequence length for truncation/padding.
- learning_rate: Peak learning rate (e.g., 2e-5, 1e-4).
- per_device_train_batch_size: Batch size per GPU.
- gradient_accumulation_steps: Number of steps to accumulate gradients before optimizer step (effective batch size = batch_size * num_gpus * grad_accum_steps).
- num_train_epochs: Number of training epochs.
- optimizer: e.g., adamw_torch, paged_adamw_8bit.
- lr_scheduler: e.g., cosine, linear.
- warmup_steps or warmup_ratio: Steps/ratio for learning rate warmup.
- fp16 or bf16: Set to true for mixed-precision training (BF16 recommended on Ampere+ GPUs).
- gradient_checkpointing: Set to true to trade compute for VRAM.
- output_dir: Path inside container for saving checkpoints/adapter (e.g., /workspace/output/my_finetune).
- logging_steps, save_steps, eval_steps: Frequency for logging, saving checkpoints, and evaluation.
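To make these parameters concrete, the following sketch assembles a minimal QLoRA-style configuration as a Python dictionary and writes it to YAML. The key names mirror the list above, but exact names, defaults, and sensible values vary between Axolotl versions, so treat this as an illustration and verify it against the examples/ directory in the Axolotl repository.
Python
import yaml  # PyYAML

# Minimal QLoRA-style config; values are illustrative, not tuned recommendations.
config = {
    "base_model": "microsoft/phi-2",
    "load_in_4bit": True,
    "adapter": "qlora",
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "datasets": [{"path": "/workspace/data/data_json.jsonl", "type": "alpaca"}],
    "val_set_size": 0.05,
    "sequence_len": 2048,
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,   # effective batch size = 2 * 1 GPU * 8 = 16
    "num_train_epochs": 3,
    "optimizer": "paged_adamw_8bit",
    "lr_scheduler": "cosine",
    "warmup_ratio": 0.03,
    "bf16": True,
    "gradient_checkpointing": True,
    "output_dir": "/workspace/output/phi2_json_qlora",
    "logging_steps": 10,
    "save_steps": 200,
    "eval_steps": 200,
}

with open("config_json.yml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, sort_keys=False)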
- Workflow: JSON Output Fine-tuning:
  1. Data Prep: Create a dataset (e.g., data_json.jsonl) in Alpaca format where instructions ask for JSON output and the output field contains valid JSON strings. Place it in the mounted data directory.
  2. Config Prep: Create config_json.yml specifying base_model, QLoRA/LoRA parameters, datasets pointing to data_json.jsonl with type alpaca (or similar), output_dir, etc.
  3. Execution: Inside the running Docker container (started via docker run… bash):
  Bash
  accelerate launch -m axolotl.cli.train /workspace/config/config_json.yml
  4. Considerations: Ensure training data JSON is rigorously validated. Consider adding a custom evaluation step that checks JSON validity and schema adherence of model outputs on the validation set.
- Workflow: Knowledge Base QA Fine-tuning:
  1. Data Prep: Process your knowledge base into QA pairs. Format them into an instruction-following dataset (e.g., data_kbqa.jsonl). Place it in the mounted data directory.
  2. Config Prep: Create config_kbqa.yml. The main difference from the JSON config will be the datasets section pointing to data_kbqa.jsonl. Other parameters (model, LoRA settings, learning rate) might need tuning based on the task and data size.
  3. Execution: Inside the running Docker container:
  Bash
  accelerate launch -m axolotl.cli.train /workspace/config/config_kbqa.yml
  4. Considerations: Experiment with text chunking strategies for context. The number and quality of QA pairs are critical. Evaluate using QA metrics (EM, F1) on a held-out set of questions derived from the knowledge base. Monitor for potential knowledge hallucination.
- Usage Best Practices (Axolotl specific):
- Start with QLoRA (load_in_4bit: true) for lower VRAM usage.
- Use gradient_checkpointing: true to further save VRAM at the cost of slightly slower training.
- Monitor training progress via logs. Axolotl supports Weights & Biases (wandb) integration (wandb_project, wandb_entity in YAML) for better visualization.
- Debug configuration and data loading using small dataset subsets and few training steps (max_steps).
- Carefully select lora_target_modules. Consult model-specific recommendations or experiment; targeting attention blocks (q_proj, k_proj, v_proj, o_proj) and sometimes feed-forward layers (gate_proj, up_proj, down_proj) is common.
- Save checkpoints regularly (save_steps) to avoid losing progress. The final output is typically a LoRA adapter, not a full model merge, though merging is possible post-training (a merge sketch follows the links below).
- Relevant Links:
- GitHub: https://github.com/OpenAccess-AI-Collective/axolotl
- Example Configs: Check the examples/ directory in the GitHub repo.
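For reference, merging a trained LoRA adapter back into the base model can be done with the PEFT library along the lines sketched below; the model identifier and paths are illustrative and should match your own output_dir.
Python
# Merge a trained LoRA adapter into its base model (run after training, e.g., in the same container).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "microsoft/phi-2"
adapter_dir = "/workspace/output/my_finetune"        # Axolotl output_dir containing the adapter
merged_dir = "/workspace/output/my_finetune_merged"  # where to save the merged model

base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()                    # folds LoRA weights into the base model

merged.save_pretrained(merged_dir)
AutoTokenizer.from_pretrained(base_model_id).save_pretrained(merged_dir)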
⚡ 5.2. Framework: LLaMA Factory
LLaMA Factory offers a more structured approach with CLI and optional Web UI options.
- Image Identifier & Source:
- Official images often hosted on GitHub Container Registry (ghcr.io). Example: ghcr.io/hiyouga/llama-factory:latest or version-specific tags.
- Check the LLaMA Factory GitHub repository for current image details and build instructions: https://github.com/hiyouga/LLaMA-Factory.
- Supported LLMs & Methods:
- Broad support for many architectures, including Llama, Mistral, Phi (Phi-2, Phi-3), Gemma. Check documentation for specific Granite support.
- Methods: LoRA, QLoRA (via bitsandbytes or auto-gptq), Freeze-tuning, Full fine-tuning. Uses Hugging Face TRL’s SFTTrainer internally.
- System Requirements:
- Hardware: Similar to Axolotl. VRAM needs depend heavily on model size and fine-tuning method. QLoRA on 7B models generally requires >=12-16GB VRAM.
- Software: Linux/WSL2, Docker Engine, NVIDIA Container Toolkit, compatible NVIDIA drivers.
- Setup:
  1. Install Docker, NVIDIA Container Toolkit, and verify GPU access.
  2. Pull the image: docker pull ghcr.io/hiyouga/llama-factory:latest (or a specific tag).
  3. Prepare host directories for data, output, and cache.
  4. Basic docker run command (for CLI usage):
  Bash
  docker run --gpus all --rm -it \
    -v /path/to/your/data:/app/data \
    -v /path/to/your/output:/app/output \
    -v /path/to/huggingface/cache:/root/.cache/huggingface \
    --shm-size=16g \
    ghcr.io/hiyouga/llama-factory:latest \
    bash
  > ⚠️ Note: Container paths like /app/data might differ; check the image documentation. Adjust the cache path if needed.
  5. For the Web UI, expose the Gradio port (default 7860):
  Bash
  docker run --gpus all --rm -it -p 7860:7860 \
    -v … (mount volumes as above) … \
    ghcr.io/hiyouga/llama-factory:latest \
    python src/train_web.py
  Then access http://localhost:7860 in your browser.
- Configuration (CLI Arguments / JSON): Fine-tuning is typically launched via the llamafactory-cli train command, taking arguments directly or via a JSON configuration file (--config_file). Key parameters:
- --model_name_or_path: Base model identifier (e.g., google/gemma-7b).
- --dataset: Comma-separated list of dataset names defined in data/dataset_info.json or paths to custom data files (e.g., /app/data/my_data.jsonl).
- --template: Name of the prompt template (e.g., alpaca, vicuna, default). Important for formatting instructions correctly.
- --finetuning_type: lora, freeze, full.
- --lora_rank: LoRA rank (e.g., 8, 16).
- --lora_alpha: LoRA alpha.
- --lora_target: Target modules (e.g., q_proj,v_proj or all). all automatically targets common layers.
- --output_dir: Path for saving results (e.g., /app/output/my_finetune).
- --per_device_train_batch_size: Batch size per GPU.
- --gradient_accumulation_steps: Gradient accumulation.
- --learning_rate: Learning rate.
- --num_train_epochs: Number of epochs.
- --fp16 / --bf16: Enable mixed precision.
- --quantization_bit: Set to 4 or 8 for QLoRA.
- --gradient_checkpointing: Enable gradient checkpointing.
- --logging_steps, --save_steps, --eval_steps: Frequencies.
- Workflow: JSON Output Fine-tuning:
  1. Data Prep: Create data_json.jsonl (Alpaca format recommended) with JSON instructions/outputs. Place it in the mounted data directory.
  2. Execution (CLI Example): Inside the running Docker container:
  Bash
  llamafactory-cli train \
    --model_name_or_path google/gemma-7b \
    --dataset /app/data/data_json.jsonl \
    --template default \
    --finetuning_type lora \
    --lora_target all \
    --lora_rank 16 \
    --quantization_bit 4 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 1e-4 \
    --num_train_epochs 3 \
    --output_dir /app/output/gemma_json_lora \
    --fp16 \
    --gradient_checkpointing
  (Adjust parameters as needed.)
  3. Considerations: Use an appropriate --template. Ensure JSON validity in training data. Evaluate outputs for JSON correctness.
- Workflow: Knowledge Base QA Fine-tuning:
  1. Data Prep: Create data_kbqa.jsonl with QA pairs formatted using a suitable instruction template. Place it in the mounted data directory.
  2. Execution (CLI Example): Inside the running Docker container:
  Bash
  llamafactory-cli train \
    --model_name_or_path microsoft/phi-2 \
    --dataset /app/data/data_kbqa.jsonl \
    --template default \
    --finetuning_type lora \
    --lora_target all \
    --lora_rank 8 \
    --quantization_bit 4 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --num_train_epochs 5 \
    --output_dir /app/output/phi2_kbqa_lora \
    --fp16 \
    --gradient_checkpointing
  3. Considerations: Choose a template that matches your QA data format. Evaluate using QA metrics.
- Usage Best Practices (LLaMA Factory specific):
- The Web UI can be helpful for exploring available models, datasets, and parameters interactively before running longer jobs via CLI.
- Leverage predefined dataset formats and templates (--template) for simplicity when applicable.
- Check the data/dataset_info.json file in the repository to understand built-in dataset formats and how to add custom ones.
- Using --lora_target all can be a reasonable starting point, but fine-tuning specific modules might yield better results for some models/tasks.
- Consult the GitHub repository for the most up-to-date list of supported models and features.
- Relevant Links:
- GitHub: https://github.com/hiyouga/LLaMA-Factory
- Documentation: Found within the GitHub repository (e.g., README, docs folder).
The choice between Axolotl and LLaMA Factory often comes down to user preference and specific needs. Both frameworks offer robust implementations of core fine-tuning techniques like LoRA and QLoRA. Axolotl’s comprehensive YAML configuration appeals to users who prefer declarative setups and easy version control of experiment parameters. LLaMA Factory’s CLI and Web UI provide a potentially gentler learning curve and facilitate interactive exploration. While core performance for standard methods should be comparable, the availability of cutting-edge features, specific model integrations (e.g., optimized kernels like Flash Attention or integrations like Unsloth), or support for the very latest model releases might differ slightly between the two due to their independent development cycles.
🌟 6. Comparative Analysis and Recommendations
Choosing the right Docker image and underlying framework depends on various factors, including the specific model, hardware constraints, user expertise, and preferred workflow.
⚡ 6.1. Comparative Overview
The following table summarizes key aspects of the discussed Dockerized frameworks:
| Feature | Axolotl-based Images (winglian/axolotl, etc.) | LLaMA Factory-based Images (ghcr.io/hiyouga/llama-factory) |
|---|---|---|
| Ease of Use | Moderate (YAML expertise helpful) | Beginner-friendly to Moderate (CLI / Web UI) |
| Configuration Method | YAML | CLI Arguments, JSON file, Web UI |
| Model Support | Broad (Llama, Mistral, Phi, Gemma, etc.) | Broad (Llama, Mistral, Phi, Gemma, etc.) |
| Phi4 Support | Likely (check latest commits/docs) | Likely (check latest commits/docs) |
| Gemma3 Support | Likely (check latest commits/docs) | Likely (check latest commits/docs) |
| Granite3.2 Support | Less common, may require custom config | Less common, may require custom config |
| Fine-tuning Methods | LoRA, QLoRA, Full, DeepSpeed, FSDP | LoRA, QLoRA, Freeze, Full |
| Suitability: JSON Output | High (flexible data handling via YAML) | High (standard dataset formats well-supported) |
| Suitability: KB QA | High (flexible data handling via YAML) | High (standard dataset formats well-supported) |
| Typical VRAM (QLoRA 7B) | ~10-16GB+ | ~12-16GB+ |
| Community / Docs | Active community, good docs/examples | Active community, good docs/examples |
| Flexibility | Very High (detailed YAML control) | High (good balance of options and ease of use) |
This table serves as a quick reference, enabling users to align their requirements with the strengths of each framework. For instance, a user prioritizing ease of setup for a Gemma model might lean towards LLaMA Factory, while someone needing fine-grained control over DeepSpeed settings for a complex KB QA task might prefer Axolotl’s YAML interface.
⚡ 6.2. Guidance on Selection
- Model Compatibility: Always verify the latest documentation or repository commits for explicit support of the target model variant (e.g., Phi-3-mini-4k-instruct, gemma-2-9b, specific Granite models). If support is experimental or missing, one framework might be ahead of the other, or custom configuration might be needed.
- Task Requirements: Both frameworks are well-suited for instruction fine-tuning required for JSON output and KB QA tasks. Axolotl’s flexible dataset configuration might offer a slight edge if dealing with highly non-standard data formats that require custom processing logic defined within the config.
- Hardware Constraints: QLoRA is the most accessible method for local fine-tuning on consumer GPUs. VRAM estimates are broadly similar, but minor implementation differences could exist. If VRAM is extremely limited (<12-16GB for 7B models), fine-tuning might only be feasible with smaller models (e.g., Phi-3-mini), aggressive quantization, or require cloud resources. Hardware limitations often dictate feasibility regardless of the chosen framework.
- User Experience Preference: This is a significant factor. Users comfortable with detailed configuration files and version control will likely prefer Axolotl’s YAML approach. Those who prefer interactive exploration via a GUI or more straightforward command-line arguments may find LLaMA Factory more intuitive.
- Bleeding-Edge Features: If requiring specific optimizations (e.g., Flash Attention 2, specific kernels) or experimental PEFT methods, check which framework has integrated them most recently.
⚡ 6.3. Recommendation Synthesis
There isn’t a single “best” Docker image for all scenarios; the optimal choice depends on the specific context. However, based on the analysis:
- For users prioritizing flexibility, extensive configuration options via YAML, and potentially needing advanced features like DeepSpeed: The Axolotl-based Docker images (e.g., winglian/axolotl, openaccessaicollective/axolotl) are highly recommended. Ensure the image tag matches your system’s CUDA version.
- For users seeking a more user-friendly interface (Web UI or simpler CLI), broad model support out-of-the-box, and a well-structured workflow: The LLaMA Factory Docker image (ghcr.io/hiyouga/llama-factory) is an excellent choice.
Both options provide robust platforms for fine-tuning models like Phi and Gemma for both JSON output and KB QA tasks using QLoRA on typical local GPU setups. Support for Granite models should be verified in their respective documentation, as it might be less standard. It is crucial to consult the specific framework’s documentation and repository for the latest updates on model compatibility and features before starting a project.
🌟 7. Universal Best Practices for Local Fine-Tuning in Docker
While Docker simplifies environment setup, successful fine-tuning still requires adherence to MLOps best practices, particularly concerning data, resource management, and evaluation.
- Data Preparation:
- JSON Output: Rigorously validate the syntax and schema of JSON examples in the training data. Use linters or programmatic checks (e.g., Python’s json module, libraries like pydantic). Ensure the instruction prompt clearly and consistently asks for the desired JSON format. Ambiguity here leads to poor results.
- Knowledge Base QA: Implement effective text chunking strategies (e.g., sentence splitting, overlapping chunks) to ensure context coherence and fit within model limits. Focus on generating high-quality QA pairs where the answer is directly supported by the context. Avoid questions that are too trivial or unanswerable from the provided text. Consider the diversity of question types (what, why, how, etc.).
- General: Use standard dataset formats (JSONL, Alpaca-style JSON) when possible for better compatibility with frameworks. Always split data into training and validation sets to monitor for overfitting. Store datasets on the host machine and mount them into the container using Docker volumes (-v) for persistence and easier management. Ensure file permissions allow the container user to read the data.
- Hardware Resource Management:
- Continuously monitor GPU VRAM usage during training using nvidia-smi (can be run on the host or inside the container if the toolkit is set up correctly). OOM errors are common.
- If hitting VRAM limits: reduce per_device_train_batch_size, increase gradient_accumulation_steps (maintains effective batch size but uses less VRAM), enable gradient_checkpointing, use QLoRA (load_in_4bit or --quantization_bit 4), or choose a smaller base model.
- Monitor CPU and system RAM usage (htop, docker stats), especially during data loading and preprocessing stages, which can be bottlenecks. Ensure sufficient shared memory (--shm-size) for Docker.
- Effective Training Monitoring:
- Pay close attention to the training and validation loss curves reported in the framework’s logs. A decreasing training loss but stagnant or increasing validation loss indicates overfitting.
- Utilize integrated experiment tracking tools like Weights & Biases (W&B) or TensorBoard if supported by the framework. Axolotl and LLaMA Factory often have flags or configuration options to enable these, providing valuable visualizations of metrics over time.
- Evaluation Techniques:
- JSON Output: Evaluation must go beyond standard loss metrics. Implement checks on a held-out test set to measure:
  1. Percentage of outputs that are syntactically valid JSON.
  2. Percentage of valid JSON outputs that adhere to the target schema (requires programmatic validation).
  3. Semantic correctness of the generated content within the JSON structure (may require human review or automated checks against ground truth).
- Knowledge Base QA: Evaluate on a held-out set of QA pairs derived from the knowledge base but not seen during training. Standard metrics include:
  1. Exact Match (EM): Percentage of generated answers identical to the ground truth.
  2. F1 Score: Harmonic mean of precision and recall at the token level, allowing partial credit.
  3. ROUGE Scores: Measure overlap of n-grams between generated and ground truth answers.
  4. Semantic Similarity: Use embedding models (e.g., sentence-transformers) to compare the meaning of generated and ground truth answers, providing a more nuanced view than lexical overlap.
  Human evaluation is often crucial for assessing factual accuracy and relevance. A minimal sketch of the JSON-validity and EM/F1 checks appears at the end of this section.
- Managing Docker Images and Containers:
- Regularly clean up unused Docker images, build caches, and stopped containers using commands like docker system prune -a to reclaim disk space, which can accumulate quickly with large LLM images.
- For reproducibility, always use specific image tags (e.g., winglian/axolotl:main-py3.10-cu118-2.0.1) instead of potentially ambiguous tags like latest.
- Document the exact docker run commands, volume mounts, and configuration files used for each experiment to ensure traceability and reproducibility. Consider using shell scripts or tools like Docker Compose to manage complex run commands.
Ultimately, Docker provides the controlled environment, but the success of the fine-tuning process itself remains dependent on the quality of the data engineering, the rigor of the training and resource management, and the appropriateness of the evaluation strategy for the specific task. These core MLOps practices are essential regardless of the containerization technology used.
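As a minimal illustration of the task-specific checks above, the sketch below computes a JSON-validity rate for structured-output generations and SQuAD-style Exact Match / F1 for KB QA answers; the normalization rules follow common QA-evaluation conventions and may need adapting to your data.
Python
import json
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, articles, and extra whitespace (SQuAD-style normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def json_validity_rate(generations: list[str]) -> float:
    """Fraction of generations that parse as JSON (schema checks can be layered on top)."""
    def parses(text: str) -> bool:
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return sum(1 for g in generations if parses(g)) / max(len(generations), 1)

# Example usage on a single held-out pair (predictions come from the fine-tuned model):
pred = "The frobnicator calibrates the widget flux."
ref = "The frobnicator is responsible for calibrating the widget flux."
print("EM:", exact_match(pred, ref), "F1:", round(f1_score(pred, ref), 3))
print("JSON validity:", json_validity_rate(['{"name": "John Doe"}', "not json"]))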
🌟 8. Conclusion
⚡ 8.1. Summary of Findings
Docker presents a highly effective solution for mitigating the environmental complexities associated with local LLM fine-tuning. By encapsulating dependencies within pre-built images, it ensures reproducibility, simplifies setup, and promotes consistency across different systems. Frameworks like Axolotl and LLaMA Factory, available through Docker images, provide robust and user-friendly platforms for fine-tuning open-source models such as Phi, Gemma, and potentially Granite. These frameworks readily support parameter-efficient techniques like QLoRA, making local fine-tuning feasible on consumer-grade GPUs for specific tasks like generating structured JSON output and building knowledge base question-answering capabilities.
⚡ 8.2. Final Recommendations
For practitioners embarking on local LLM fine-tuning using Docker:
1. Select a Framework Image:
   - Consider Axolotl-based images (e.g., winglian/axolotl) for maximum flexibility via YAML configuration, especially if advanced features or highly custom datasets are involved.
   - Opt for the LLaMA Factory image (ghcr.io/hiyouga/llama-factory) for a more guided experience via its CLI or Web UI, particularly suitable for standard instruction fine-tuning tasks.
2. Verify Model Support: Always consult the chosen framework’s latest documentation or GitHub repository to confirm explicit support for the specific target model variant (Phi4, Gemma3, Granite3.2, etc.).
3. Prioritize QLoRA: Start with QLoRA (4-bit quantization) to minimize VRAM requirements, making local fine-tuning more accessible.
4. Master Docker GPU Setup: Ensure Docker Engine, NVIDIA drivers, and the NVIDIA Container Toolkit are correctly installed and configured on the host system. Use the --gpus all flag and manage shared memory (--shm-size).
5. Focus on Data and Evaluation: Recognize that Docker solves the environment problem, but success hinges on meticulous data preparation tailored to the task (JSON validation, quality KB QA pairs) and rigorous, task-specific evaluation metrics.
6. Iterate and Monitor: Start with small experiments, monitor resource usage (nvidia-smi, logs, W&B), and iteratively refine configurations and hyperparameters.
By leveraging the power of containerization with robust fine-tuning frameworks and adhering to sound MLOps practices, developers and researchers can effectively harness the potential of open-source LLMs for specialized tasks within their local environments. The continued development of these frameworks and the growing availability of pre-built Docker images promise to further democratize access to custom LLM development.