Technical Documentation

Fine-Tuning Large Language Models with Axolotl on Debian 12 and Nvidia RTX 4090: A Comprehensive Manual


👤
Author
Cosmic Lounge AI Team
📅
Updated
6/1/2025
⏱️
Read Time
14 min
Topics
#llm #ai #model #fine-tuning #training #gpu #cuda #pytorch #introduction #design


🌌 Fine-Tuning Large Language Models with Axolotl on Debian 12 and Nvidia RTX 4090: A Comprehensive Manual



🌟 1. Prerequisites

Before embarking on the fine-tuning process, it is essential to ensure the system meets the necessary hardware and software requirements and that foundational tools are installed.

⚡ 1.1. Hardware and Software Requirements

  • Operating System: Debian 12 (“Bookworm”). This manual specifically targets this version.

  • GPU: Nvidia RTX 4090 with 24GB VRAM. The capabilities of this GPU allow for fine-tuning larger models and using larger batch sizes compared to lower-VRAM cards. Ensure the GPU is correctly installed and recognized by the system at the hardware level.

  • CPU: A reasonably modern multi-core CPU is recommended for data processing and overall system responsiveness.

  • RAM: Minimum 32GB RAM recommended, 64GB or more is preferable, especially if handling large datasets or exploring advanced techniques like DeepSpeed ZeRO offloading.

  • Storage: Sufficient fast storage (NVMe SSD recommended) for the OS, Docker images, base models (can be tens of GBs), datasets, and training outputs (checkpoints/adapters, which can also grow large). Plan for at least 200-500GB of free space, depending on the scale of experiments.

  • Internet Connection: Required for downloading software packages, Docker images, base models, and datasets.

⚡ 1.2. Necessary Installations (Host System)

While Axolotl runs within Docker, certain tools are required on the host Debian 12 system.

  • Git: Essential for cloning the Axolotl repository and potentially other required software like llama.cpp.
  • Installation via APT (recommended): sudo apt update && sudo apt install git -y. Verify the installation with git --version.
  • Initial Git configuration (required for committing, though not strictly for cloning): git config --global user.name "Your Name" and git config --global user.email "youremail@example.com".

  • Conda/Miniconda (Optional): While not strictly necessary for the Docker-based workflow, Conda (or its minimal installer, Miniconda) can be useful for managing Python environments outside of Docker, for example, during dataset preparation or model conversion/evaluation if performed separately.
  • Download the latest Miniconda installer script for Linux from the official Anaconda repository.
  • Verify the installer integrity using sha256sum against the official hash.

```bash
# Example download command (check for the latest version)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Example verification (replace filename)
sha256sum Miniconda3-latest-Linux-x86_64.sh
```

  • Run the installer script: bash Miniconda3-latest-Linux-x86_64.sh. Follow the prompts, accepting the license agreement and choosing an installation location (typically ~/miniconda3). Allow the installer to initialize Conda in your shell profile (.bashrc).
  • Reload shell configuration: source ~/.bashrc.
  • Verify installation: conda --version.
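
If Conda is installed, a dedicated environment for dataset preparation outside Docker might look like the following sketch; the environment name and package list are illustrative, not requirements.

```bash
# Hypothetical environment for dataset conversion/inspection work outside the Axolotl container
conda create -n llm-data python=3.11 -y
conda activate llm-data
pip install pandas datasets   # libraries commonly used for dataset preparation
```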


🌟 Table 1: System Prerequisites Summary

| Component | Requirement | Installation Command (Debian 12) / Notes |
|---|---|---|
| OS | Debian 12 ("Bookworm") | - |
| GPU | Nvidia RTX 4090 (24GB VRAM) | Hardware installation required. |
| CPU | Modern multi-core | - |
| RAM | 32GB+ (64GB+ recommended) | - |
| Storage | 200GB+ fast SSD (NVMe recommended) | - |
| Internet | Required | - |
| **Host Software** | | |
| Git | Required | sudo apt update && sudo apt install git -y |
| Conda (optional) | Recommended for external environment management | Download installer, run bash Miniconda3-latest-Linux-x86_64.sh, source ~/.bashrc |
| Nvidia Drivers | Required (see Section 2.1) | Installation detailed below. |
| Docker Engine | Required (see Section 2.2) | Installation detailed below. |
| Docker Compose | Required (installed with Engine) | Installation detailed below. |
| Nvidia Container Toolkit | Required (see Section 2.4) | Installation detailed below. |


🌟 2. Debian Environment Setup for GPU-Accelerated Docker

Correctly configuring the Debian 12 host system to allow Docker containers access to the Nvidia RTX 4090 GPU is paramount. This involves installing the appropriate Nvidia drivers, Docker Engine, and the Nvidia Container Toolkit.

⚡ 2.1. Install Official Nvidia Drivers

Using the drivers packaged by Debian is the recommended approach for stability and integration with the system’s package management. Avoid using the .run file from Nvidia directly unless necessary, as it can conflict with system packages.

  • 2.1.1. Update System: Ensure the system is up-to-date: sudo apt update && sudo apt upgrade -y.

  • 2.1.2. Enable contrib and non-free Repositories: Nvidia drivers are proprietary and reside in the non-free repository; some dependencies might be in contrib. Edit the APT sources list with sudo nano /etc/apt/sources.list and ensure the bookworm lines end with main contrib non-free non-free-firmware. Example:

    deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware
    deb http://deb.debian.org/debian bookworm-updates main contrib non-free non-free-firmware
    deb http://security.debian.org/debian-security bookworm-security main contrib non-free non-free-firmware

    (> ⚠️ Note: non-free-firmware was split out as a separate component in Debian 12; include it as well.) Save the file (Ctrl+O, Enter in nano), exit (Ctrl+X), and update the package list again: sudo apt update.

  • 2.1.3. Install Driver Detection Tool: The nvidia-detect utility helps identify the recommended driver package. sudo apt install nvidia-detect -y


  • 2.1.4. Detect Recommended Driver: Run the tool: nvidia-detect. It will output information about the detected GPU (RTX 4090) and recommend a package, typically nvidia-driver for current hardware.

  • 2.1.5. Install the Recommended Driver: Install the package suggested by nvidia-detect. This process also installs necessary dependencies like nvidia-kernel-dkms (which builds the kernel module) and handles blacklisting the open-source Nouveau driver. It also installs firmware-misc-nonfree if needed. sudo apt install nvidia-driver firmware-misc-nonfree -y (If nvidia-detect recommended a different package, use that name instead).

  • 2.1.6. Handle Secure Boot (If Enabled): If Secure Boot is enabled on the system, the nvidia-kernel-dkms package will build kernel modules that need to be signed. During installation or on the next reboot, the system might prompt for Machine Owner Key (MOK) enrollment. Follow the on-screen instructions, which typically involve creating a password, rebooting, entering the MOK management utility (often called mokutil or similar during boot), and enrolling the key using the password created earlier. This step is crucial for the Nvidia kernel modules to load correctly under Secure Boot.

  • 2.1.7. Reboot: A reboot is required to load the newly installed Nvidia driver and kernel modules. sudo reboot


  • 2.1.8. Verify Driver Installation: After rebooting, verify the driver is loaded and functional by running nvidia-smi. This command should output detailed information about the RTX 4090, including the driver version, CUDA version, VRAM usage (should be low initially), and temperature. Successful execution confirms the driver installation.

⚡ 2.2. Install Docker Engine and Docker Compose

Install Docker Engine and the Docker Compose plugin directly from the official Docker repository to ensure the latest stable versions are used.

  • 2.2.1. Install Dependency Packages: sudo apt update && sudo apt install ca-certificates curl gnupg -y.

  • 2.2.2. Add Docker’s Official GPG Key:

    sudo install -m 0755 -d /etc/apt/keyrings
    sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
    sudo chmod a+r /etc/apt/keyrings/docker.asc

    (> ⚠️ Note: Some guides use /usr/share/keyrings, while official docs often use /etc/apt/keyrings. Both work, but /etc/apt/keyrings is becoming more standard.)

  • 2.2.3. Add Docker Repository:

    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
      $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
      sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

  • 2.2.4. Install Docker Packages: Update the package list and install Docker Engine, CLI, containerd, and the Compose plugin:

    sudo apt update
    sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y

    (> ⚠️ Note: This single step installs both the Engine and the Compose v2 plugin, simplifying previous multi-step processes.)

  • 2.2.5. Verify Docker Installation: Run the hello-world container to check that Docker Engine is working correctly: sudo docker run hello-world. A confirmation message indicates success.

⚡ 2.3. Manage Docker as a Non-Root User (Post-installation)

To avoid prefixing every docker command with sudo, add the current user to the docker group.

  • 2.3.1. Add User to Group: sudo usermod -aG docker $USER


  • 2.3.2. Apply Group Changes: For the group changes to take effect in the current terminal session, either log out and log back in, or run: newgrp docker. Subsequent terminal sessions will automatically have the correct permissions.
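
To confirm the group change took effect, Docker commands should now work without sudo; for example:

```bash
# Both commands should succeed without "permission denied" errors
docker run --rm hello-world
docker ps
```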

⚡ 2.4. Install Nvidia Container Toolkit

The Nvidia Container Toolkit allows Docker containers to access the host’s Nvidia GPU. It replaces the older nvidia-docker2 package.

  • 2.4.1. Add the Nvidia Repository: Add Nvidia’s official APT repository and GPG key for the Container Toolkit, following Nvidia’s installation documentation (example commands are shown after this list). (> ⚠️ Note: The repository setup correctly identifies the distribution, e.g., debian12, and adds the GPG key and repository source with proper signing configuration.) Then update the package list: sudo apt-get update.

  • 2.4.2. Install Toolkit Package: sudo apt-get install -y nvidia-container-toolkit. This command installs the toolkit and its dependencies.
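
For reference, the repository setup mentioned in step 2.4.1 typically looks like the sketch below, based on Nvidia’s published Container Toolkit installation instructions at the time of writing; verify against the current official documentation before use.

```bash
# Add the GPG key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

# Add the signed repository source
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Refresh the package index before installing the toolkit
sudo apt-get update
```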

⚡ 2.5. Configure Docker Daemon for Nvidia Runtime

Docker needs to be configured to use the Nvidia runtime provided by the toolkit. Using the nvidia-ctk command is the recommended way to manage this configuration, as it automatically modifies the Docker daemon configuration file (/etc/docker/daemon.json) correctly, reducing the risk of manual JSON syntax errors.

  • 2.5.1. Configure Docker using nvidia-ctk: sudo nvidia-ctk runtime configure --runtime=docker. This command registers the nvidia runtime with Docker (an example of the resulting daemon.json appears at the end of this subsection). While manually editing /etc/docker/daemon.json to add the runtime definition or set nvidia as the default runtime is possible, using the nvidia-ctk tool is less error-prone and ensures compatibility. Relying on the --gpus flag during docker run (or the equivalent in Docker Compose) is generally sufficient and more explicit for selecting GPU access per container.

  • 2.5.2. Restart Docker Daemon: This step is mandatory to apply the configuration changes: sudo systemctl restart docker.
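
After running nvidia-ctk runtime configure, /etc/docker/daemon.json typically contains an entry similar to the following; this is shown only for reference, since the tool manages the file and manual edits are rarely needed.

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```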

⚡ 2.6. Verify GPU Access from Docker

This is the final and most critical verification step. It confirms that the Nvidia driver, Docker, the Nvidia Container Toolkit, and the Docker daemon configuration are all working together correctly.

  • 2.6.1. Run nvidia-smi inside a CUDA container: Execute nvidia-smi within a container, explicitly requesting GPU access using the --gpus all flag. A base CUDA image is suitable for this test: docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

    (> ⚠️ Note: The CUDA image tag 12.1.1-base-ubuntu22.04 is an example; choose a tag compatible with your installed driver version if necessary. The base image is sufficient and smaller than the runtime/devel images.)

  • 2.6.2. Check Output: The output displayed in the terminal should be identical or very similar to the output of running nvidia-smi directly on the Debian host. It should clearly show the Nvidia RTX 4090 GPU, driver version, and CUDA version recognized by the driver. Success here indicates the environment is fully prepared for GPU-accelerated Docker workloads like Axolotl.


🌟 3. Axolotl Setup

With the host environment prepared, the next step is to set up the Axolotl project itself using Docker.

⚡ 3.1. Clone Axolotl Repository

Navigate to a suitable directory on the host system where projects are stored (e.g., ~/projects or a dedicated ~/llm-finetuning directory) and clone the official Axolotl repository from GitHub.

```bash
# Example: create a directory for LLM work
mkdir ~/axolotl-experiments
cd ~/axolotl-experiments

# Clone the repository
git clone https://github.com/OpenAccess-AI-Collective/axolotl.git

# Navigate into the cloned directory
cd axolotl
```

⚡ 3.2. Setting up the Docker Environment

Axolotl provides a Dockerfile that defines an environment containing all necessary libraries and dependencies (like PyTorch, Transformers, PEFT, bitsandbytes, etc.). Building this image locally is generally recommended.

  • Option A: Build the Docker Image Locally (Recommended): Building the image from the cloned repository ensures that the container environment precisely matches the version of the Axolotl code being used. This is particularly important when working with the latest developments from the main branch or specific commits, enhancing reproducibility.

```bash
# Ensure you are in the root directory of the cloned axolotl repository
docker build -t axolotl .
```

This process might take a significant amount of time, especially on the first run, as it downloads the base CUDA/Python image and installs numerous Python packages.

  • Option B: Pull Pre-built Image (Alternative): Occasionally, the Axolotl project might offer pre-built Docker images on platforms like Docker Hub or GitHub Container Registry. Check the project’s documentation or releases page for available image tags.

```bash
# Example command (replace with the actual repository/tag if found)
# docker pull openaccessaicollective/axolotl:latest-gpu
```

Using pre-built images can save build time but might lag behind the absolute latest code updates or lack specific customizations potentially present in the local Dockerfile. For consistent results aligned with the cloned code, local building is preferred.
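
Whichever option is chosen, it can be worth confirming that the image is present locally before launching a run; for example:

```bash
# List local images and confirm the axolotl image exists
docker images | grep axolotl
```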

⚡ 3.3. Directory Structure Recommendations

Organizing configuration files, datasets, and model outputs on the host filesystem is crucial for managing experiments effectively. A structured approach simplifies Docker volume mounting and keeps projects tidy. Consider the following structure within the main project directory (e.g., ~/axolotl-experiments/):

```text
~/axolotl-experiments/
├── axolotl/                  # The cloned Axolotl repository itself
├── configs/                  # Store all custom Axolotl YAML configuration files here
│   ├── gemma-7b-qlora.yml
│   └── phi-3-mini-lora.yml
├── data/                     # Place all training and validation datasets here
│   ├── alpaca_data.jsonl
│   └── custom_chat_data.json
└── outputs/                  # Axolotl will save checkpoints and adapters here
    ├── gemma-7b-qlora-output/
    └── phi-3-mini-lora-output/
```

  • axolotl/: Contains the Axolotl source code and Dockerfile.

  • configs/: Holds user-created YAML files defining fine-tuning parameters.

  • data/: Stores datasets in formats compatible with Axolotl.

  • outputs/: Serves as the target directory for model checkpoints, adapters, and logs generated during training. This separation makes it easy to manage different experiments, reuse datasets across configurations, and mount the necessary directories into the Docker container using -v flags, preventing clutter and potential errors.
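
As a convenience, the layout above can be created in one step; a minimal sketch, assuming the base path used throughout this manual:

```bash
# One-time scaffold matching the recommended layout (adjust the base path to taste)
mkdir -p ~/axolotl-experiments/{configs,data,outputs}
cd ~/axolotl-experiments
# The axolotl/ sub-directory is created by the git clone in Section 3.1
```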



🌟 4. Axolotl Configuration (YAML Explained)

The core of controlling an Axolotl fine-tuning run lies in the YAML configuration file. This file specifies everything from the base model and dataset to intricate training parameters and hardware optimizations tailored for the target GPU.

⚡ 4.1. Introduction to the YAML File

The YAML file acts as the central control panel for Axolotl. It uses a human-readable format to define all aspects of the fine-tuning job. Axolotl’s repository typically includes an examples/ directory containing various sample configurations for different models and techniques.

⚡ 4.2. Key Parameters Breakdown (Tailored for RTX 4090)

The following parameters are crucial for configuring a fine-tuning run, with specific considerations for leveraging the 24GB VRAM of the RTX 4090:

  • Model & Tokenizer:
  • base_model: (String) Hugging Face model identifier (e.g., google/gemma-1.1-7b-it, microsoft/Phi-3-mini-4k-instruct) or local path to the pre-trained model.
  • base_model_config: (String, Optional) Path/identifier for the model’s configuration file. Often the same as base_model.
  • model_type: (String, Optional) The Hugging Face transformers model class (e.g., LlamaForCausalLM, GemmaForCausalLM, AutoModelForCausalLM). Axolotl usually infers this.
  • tokenizer_type: (String, Optional) The transformers tokenizer class (e.g., LlamaTokenizer, AutoTokenizer). Usually inferred.
  • trust_remote_code: (Boolean, Default: false) Set to true if the model requires custom code execution from its Hugging Face repository. Use with caution.
  • Quantization & Precision (VRAM Management):
  • load_in_8bit: (Boolean, Default: false) Loads the base model using 8-bit quantization (via bitsandbytes). Significantly reduces VRAM usage, allowing larger models or batch sizes on the RTX 4090.
  • load_in_4bit: (Boolean, Default: false) Loads the base model using 4-bit quantization (via bitsandbytes). Offers maximum VRAM savings, crucial for fine-tuning models larger than 13B on the RTX 4090. Requires setting bnb_4bit_compute_dtype and bnb_4bit_quant_type.
  • bnb_4bit_compute_dtype: (String, e.g., bfloat16, float16) Data type for computations within 4-bit layers. bfloat16 is recommended for RTX 4090 (Ampere architecture and newer) due to better numerical stability and performance.
  • bnb_4bit_quant_type: (String, e.g., nf4, fp4) Quantization type for 4-bit. nf4 (NormalFloat 4) is a common and effective choice.
  • fp16: (Boolean, Default: false) Enables mixed-precision training using 16-bit floating-point numbers (FP16). Reduces VRAM and speeds up computation but can sometimes lead to numerical instability (underflow/overflow).
  • bf16: (Boolean, Default: false) Enables mixed-precision training using Bfloat16. Preferred over fp16 on RTX 4090 and newer GPUs due to its wider dynamic range, offering better stability while still providing VRAM and speed benefits. Set bf16: true for optimal performance on the 4090.
  • strict: (Boolean, Default: false) If true, strictly enforces that all keys in the model checkpoint match the model definition. Usually set to false to ignore minor mismatches.
  • Parameter-Efficient Fine-Tuning (PEFT):
  • adapter: (String, e.g., lora, qlora) Specifies the PEFT method. lora applies Low-Rank Adaptation. qlora combines 4-bit quantization (load_in_4bit: true) with LoRA, enabling fine-tuning of very large models.
  • lora_r: (Integer) The rank of the LoRA decomposition matrices. Higher values mean more trainable parameters, potentially capturing finer details but increasing VRAM usage. Common values are 8, 16, 32, 64. The RTX 4090 can often accommodate higher ranks like 64, 128, or even 256, potentially leading to better results compared to lower-VRAM cards limited to smaller ranks.
  • lora_alpha: (Integer) LoRA scaling factor. A common practice is to set lora_alpha = 2 * lora_r.
  • lora_dropout: (Float) Dropout probability applied to LoRA layers to prevent overfitting.
  • lora_target_modules: (List of Strings, Optional) Explicitly lists the names of the modules within the base model where LoRA matrices should be applied (e.g., [q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj]). Consult the model architecture or examples.
  • lora_target_linear: (Boolean, Default: false) If true, automatically applies LoRA to all linear layers in the model, simplifying configuration compared to manually listing modules.
  • Dataset Configuration:
  • datasets: (List of Dictionaries) Defines the training dataset(s).
  • path: (String) Path to the dataset file or directory, relative to where it’s mounted inside the Docker container (e.g., /workspace/data/my_dataset.jsonl).

  • type: (String) Specifies the format of the dataset (e.g., alpaca, sharegpt, json, completion, text). Must match the actual data structure (See Section 5).

  • shards: (Integer, Optional) Number of shards for large datasets.

  • val_set_size: (Float or Integer, Default: 0.05) Fraction (if < 1.0) or absolute number of samples from the training set to use for validation during training.
  • Training Arguments (Control the Training Loop):
  • output_dir: (String) Path where checkpoints, adapters, and logs will be saved, relative to the mount point inside Docker (e.g., /workspace/outputs/my_finetune_run).
  • sequence_len: (Integer) Maximum sequence length (number of tokens) the model will process. A critical factor for VRAM consumption. Longer sequences require significantly more memory. The RTX 4090 allows for longer sequences (e.g., 2048, 4096, or more depending on the model) compared to lower-VRAM cards. Adjust based on model capability and available VRAM after setting batch size and quantization.
  • per_device_train_batch_size: (Integer) Number of training samples processed by the GPU in a single forward/backward pass. Directly impacts VRAM usage. Higher values generally lead to more stable training gradients but require more memory. With 24GB VRAM, the RTX 4090 can often handle batch sizes of 4, 8, 16, or potentially higher, depending heavily on the model size, sequence length, and quantization method used.
  • gradient_accumulation_steps: (Integer, Default: 1) Number of batches to process sequentially before performing an optimizer step and clearing gradients. The effective batch size becomes num_gpus * per_device_train_batch_size * gradient_accumulation_steps. This technique allows simulating larger batch sizes without a proportional increase in VRAM usage, trading compute time for memory. Use this to achieve a larger effective batch size (e.g., 64, 128, 256) if per_device_train_batch_size is limited by VRAM.
  • num_train_epochs: (Integer) Number of complete passes through the entire training dataset.
  • learning_rate: (Float) The initial learning rate for the optimizer (e.g., 2e-5, 1e-4, 5e-5). A crucial hyperparameter affecting convergence.
  • lr_scheduler_type: (String, e.g., cosine, linear, constant) Strategy for adjusting the learning rate during training. cosine is a popular choice.
  • warmup_steps: (Integer) Number of initial training steps during which the learning rate increases linearly from 0 to the specified learning_rate. Helps stabilize training early on. Can also be a ratio (float < 1.0) of total steps.
  • optimizer: (String, e.g., adamw_torch, adamw_bnb_8bit, paged_adamw_8bit, paged_adamw_32bit) The optimization algorithm. adamw_bnb_8bit is common for 8-bit training. paged_adamw_8bit or paged_adamw_32bit are often recommended with QLoRA for improved memory efficiency via CPU RAM paging.
  • gradient_checkpointing: (Boolean, Default: false) Another technique to save VRAM by discarding intermediate activations during the forward pass and recomputing them during the backward pass. This significantly reduces memory usage at the cost of increased computation time (typically 20-30% slower training). On the RTX 4090, this might not be necessary for smaller models (<= 7B) with moderate settings, allowing for faster training. However, it becomes essential when pushing VRAM limits with larger models (> 13B), longer sequences, or larger batch sizes.
  • logging_steps: (Integer) Frequency (in training steps) at which to log metrics like loss to the console and monitoring tools.
  • save_steps: (Integer) Frequency (in training steps) at which to save model checkpoints/adapters to the output_dir. Can also be a ratio (float < 1.0) of total steps.
  • eval_steps: (Integer) Frequency (in training steps) at which to run evaluation on the validation set (val_set_size).
  • Advanced Parallelism/Optimization (Less common for single 4090 but relevant):
  • deepspeed: (String or Dictionary, Optional) Path to a DeepSpeed configuration JSON file or a dictionary containing DeepSpeed settings. Enables distributed training and ZeRO memory optimization techniques (offloading optimizer states, gradients, parameters to CPU RAM or NVMe). Can be useful even on a single GPU (ZeRO Stage 1, 2, or 3) to fine-tune models that would otherwise exceed VRAM, but adds complexity.
  • fsdp: (String or List, Optional) Path to a Fully Sharded Data Parallel (FSDP) configuration file or a list of FSDP options. An alternative framework to DeepSpeed integrated into PyTorch for distributed training and memory optimization. Less commonly used than DeepSpeed for single-GPU optimization in the Axolotl context but available.

⚡ 4.3. RTX 4090 Specific Recommendations & Model Examples

The 24GB VRAM of the RTX 4090 provides significant flexibility compared to cards with less memory.

  • General Strategy:
  • Prioritize bf16: true: Leverage the hardware’s native support for better stability and speed.
  • Larger Batch Sizes: Start with higher per_device_train_batch_size (e.g., 4-16) than possible on lower-VRAM cards, adjusting based on model/sequence length/quantization.
  • Higher LoRA Rank: Experiment with lora_r values like 64, 128, or 256 if using LoRA/QLoRA, as the VRAM can accommodate the increased parameter count.
  • Strategic Gradient Checkpointing: For models <= 7B and moderate sequence lengths (<= 2048), try gradient_checkpointing: false first to maximize training speed. Enable it (true) only if OOM errors occur or when fine-tuning larger models (> 13B), using very long sequences (> 4096), or maximizing batch size.
  • QLoRA for Large Models: Use 4-bit quantization (load_in_4bit: true, adapter: qlora) as the default approach for models larger than 7-13B to fit them into VRAM.
  • Paged Optimizers: Use paged_adamw_8bit or paged_adamw_32bit with QLoRA to potentially further reduce memory spikes.
  • Example Snippet (Conceptual - Gemma 7B QLoRA on RTX 4090):

```yaml
# ~/axolotl-experiments/configs/gemma-7b-qlora-4090.yml
base_model: google/gemma-1.1-7b-it
model_type: GemmaForCausalLM
tokenizer_type: AutoTokenizer      # Usually sufficient
trust_remote_code: true            # Gemma might require this

load_in_4bit: true
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: bfloat16   # Optimal for 4090

adapter: qlora
lora_r: 128                        # Higher rank feasible on 4090
lora_alpha: 256                    # Typically 2*r
lora_dropout: 0.05
lora_target_linear: true           # Easier than listing modules

sequence_len: 2048                 # Adjust based on task/data
sample_packing: true               # Efficiently packs shorter sequences

datasets:
  - path: /workspace/data/your_dataset.jsonl   # Ensure path matches mount
    type: alpaca                               # Or sharegpt, etc.
val_set_size: 0.01                 # Use 1% for validation

output_dir: /workspace/outputs/gemma-7b-qlora-4090-run
per_device_train_batch_size: 8     # Start relatively high for 7B QLoRA
gradient_accumulation_steps: 4     # Effective batch size = 1 * 8 * 4 = 32
num_train_epochs: 3
learning_rate: 1e-4                # Common for QLoRA
lr_scheduler_type: cosine
warmup_steps: 100

bf16: true                         # Use bfloat16 on 4090
optimizer: paged_adamw_8bit        # Memory efficient for QLoRA
gradient_checkpointing: false      # Attempt without first for speed on 7B

logging_steps: 10
save_steps: 0.1                    # Save every 10% of total steps
eval_steps: 0.1                    # Evaluate every 10% of total steps
```

  • Example Snippet (Conceptual - Phi-3-Mini LoRA on RTX 4090):

```yaml
# ~/axolotl-experiments/configs/phi-3-mini-lora-4090.yml
base_model: microsoft/Phi-3-mini-4k-instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true            # Often needed for newer models

# Not using 4-bit for standard LoRA
load_in_8bit: false                # Optional: could use 8-bit if needed

adapter: lora
lora_r: 64                         # Good starting point for LoRA
lora_alpha: 128
lora_dropout: 0.1
lora_target_linear: true

sequence_len: 4096                 # Leverage Phi-3's context length
sample_packing: true

datasets:
  - path: /workspace/data/your_chat_dataset.json
    type: sharegpt
val_set_size: 500                  # Fixed number of validation samples

output_dir: /workspace/outputs/phi-3-mini-lora-4090-run
per_device_train_batch_size: 2     # Lower due to long sequence length
gradient_accumulation_steps: 16    # Effective batch size = 1 * 2 * 16 = 32
num_train_epochs: 2
learning_rate: 2e-5                # Often lower for full LoRA vs QLoRA
lr_scheduler_type: linear
warmup_steps: 50

bf16: true
optimizer: adamw_torch             # Standard AdamW
gradient_checkpointing: true       # Likely required for 4096 sequence length

logging_steps: 5
save_steps: 100
eval_steps: 100
```

  • IBM Granite Models: These models (e.g., ibm/granite-8b-code-instruct) can also be fine-tuned. Check their Hugging Face model cards for recommended model_type, tokenizer_type, and potentially lora_target_modules. Apply the same RTX 4090 optimization principles (QLoRA likely needed for 8B+, bf16, adjust batch size/gradient checkpointing).

  • Feasibility of Larger Models (13B, 30B+):

  • 13B Models: Fine-tuning models in the 13B parameter range (e.g., Llama 2 13B, CodeLlama 13B) is highly feasible on an RTX 4090 using QLoRA (load_in_4bit: true). Batch sizes might need slight reduction compared to 7B models, and gradient_checkpointing: true might become beneficial or necessary depending on sequence length.
  • 30B+ Models: Tackling models in the 30B-40B range (or even larger models like Llama 3 70B if heavily quantized base models are available) becomes challenging but potentially achievable with QLoRA. Expect to use gradient_checkpointing: true, a very small per_device_train_batch_size (likely 1, maybe 2), significant gradient_accumulation_steps, potentially shorter sequence_len, and memory-efficient optimizers (paged_adamw_8bit).

Success is not guaranteed and depends heavily on the specific model and configuration.

⚡ Table 2: Axolotl YAML Parameters - RTX 4090 Starting Points

| Parameter | Notes / RTX 4090 Recommendation | Example (7B QLoRA) | Example (13B QLoRA) |
|---|---|---|---|
| load_in_4bit | true for QLoRA (essential for >7-13B) | true | true |
| bnb_4bit_compute_dtype | bfloat16 (preferred on 4090) | bfloat16 | bfloat16 |
| adapter | qlora (if load_in_4bit: true), lora otherwise | qlora | qlora |
| lora_r | Higher values feasible (64-256) | 128 | 64 |
| lora_alpha | 2 * lora_r | 256 | 128 |
| bf16 | true (preferred over fp16) | true | true |
| sequence_len | Model/VRAM dependent (2048-4096+) | 2048 | 2048 |
| per_device_train_batch_size | Higher values feasible (4-16+), adjust down for larger models/seq len | 8 | 4 |
| gradient_accumulation_steps | Adjust to reach effective batch size (e.g., 32-128) | 4 | 8 |
| gradient_checkpointing | false initially for smaller models/seq len, true if OOM or for larger models/seq len | false (try first) | true |
| optimizer | paged_adamw_8bit (good for QLoRA), adamw_torch (standard) | paged_adamw_8bit | paged_adamw_8bit |

⚡ 4.4. DeepSpeed / FSDP Considerations

DeepSpeed and PyTorch’s FSDP are frameworks primarily designed for distributed training across multiple GPUs or nodes. However, their ZeRO (Zero Redundancy Optimizer) memory optimization techniques can sometimes be beneficial even on a single high-VRAM GPU like the RTX 4090.

  • ZeRO Offloading: ZeRO stages allow offloading parts of the model’s training state (optimizer states, gradients, even parameters) from GPU VRAM to CPU RAM or NVMe storage.

  • Single 4090 Use Case: While quantization (load_in_4bit) and gradient checkpointing are the primary tools for single-GPU memory saving, DeepSpeed ZeRO Stage 2 (offloads optimizer states and gradients) or Stage 3 (offloads parameters as well) could potentially enable fine-tuning models that just exceed the 24GB VRAM limit, provided sufficient CPU RAM is available.

  • Complexity: Configuring DeepSpeed or FSDP adds complexity compared to standard Axolotl parameters. It involves creating separate JSON configuration files specifying the desired stage and options.

  • Recommendation: For most single RTX 4090 scenarios, focus on optimizing batch size, sequence length, quantization, and gradient checkpointing first. Explore DeepSpeed/FSDP only if these methods prove insufficient for the target model size. Consult the official Axolotl, DeepSpeed, and FSDP documentation for detailed configuration instructions if needed.
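
If DeepSpeed is explored, a minimal ZeRO Stage 2 configuration with CPU optimizer offload might look like the sketch below. The key names follow DeepSpeed's documented JSON schema, the "auto" values defer to the training arguments, and the filename and mount path are assumptions; the file would be referenced from the Axolotl YAML via the deepspeed parameter (e.g., deepspeed: /workspace/configs/zero2.json).

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": { "enabled": true },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}
```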



🌟 5. Dataset Preparation

The quality and format of the training data are critical for successful fine-tuning. Axolotl supports various dataset formats, offering flexibility in how data is structured.

⚡ 5.1. Supported Formats

Axolotl can handle several common data formats. The type parameter in the datasets section of the YAML configuration tells Axolotl how to interpret the data file(s).

  • Alpaca Format (type: alpaca): Ideal for instruction-following tasks. Consists of a JSON list, where each element is a dictionary containing instruction, output, and optionally input keys. Example (alpaca_data.jsonl - each line is a JSON object): {"instruction": "Explain the theory of relativity.", "input": "", "output": "Albert Einstein's theory of relativity…"} {"instruction": "Translate the following sentence to French.", "input": "Hello, world!", "output": "Bonjour le monde!"}

  • ShareGPT Format (type: sharegpt): Suitable for multi-turn conversational data. Consists of a JSON list, where each element is a dictionary with a conversations key. The value is a list of turns, each turn having from (human, gpt, or other identifiers) and value keys. Example (sharegpt_data.json): [ { "conversations": [ {"from": "human", "value": "Hi, can you tell me about Large Language Models?"}, {"from": "gpt", "value": "Certainly! Large Language Models (LLMs) are…"} ] }, { "conversations": [...] } ]

  • JSON/JSONL Format (type: json or inferred): Flexible format where each line (JSONL) or the entire file (JSON list) contains JSON objects. The structure required depends on the specific task and how Axolotl’s prompters are configured. Can often mimic Alpaca or ShareGPT structures. When using type: completion, expects a text field per JSON object.

  • Completion Format (type: completion or text): For training models to complete text prompts. Can be a plain text file where each line is a sample, or a JSONL file where each JSON object has a text key containing the full text sample. Example (completion_data.txt): Once upon a time, in a land far, far away… The quick brown fox jumps over the lazy dog.

  • Instruction Format (inst_tune.html): A reference to inst_tune.html is unusual for standard LLM datasets; it likely refers to a specific project's internal format or is simply a typo. Standard instruction tuning typically uses formats like Alpaca. A custom HTML-based format would require a custom data processing script or Axolotl configuration not covered here, so focus on standard formats like Alpaca or ShareGPT for instruction/chat tuning.

⚡ Table 3: Common Axolotl Dataset Formats

| Format Name | type in YAML | Structure Description | Use Case Example |
|---|---|---|---|
| Alpaca | alpaca | JSON list/JSONL of {"instruction": …, "input": …, "output": …} objects | Instruction following |
| ShareGPT | sharegpt | JSON list of {"conversations": [{"from": …, "value": …}, …]} objects | Multi-turn chat / dialogue |
| Completion | completion | JSONL of {"text": …} objects, or plain text file (one sample per line) | Text generation / completion |
| JSON/JSONL | json | Generic JSON/JSONL; structure depends on prompter configuration (can mimic others) | Flexible / custom tasks |

⚡ 5.2. Formatting Custom Datasets

If the source data is not already in a supported format (e.g., it’s in CSV, XML, or a database), it must be converted. Choose a target format that best represents the task (e.g., Alpaca for instructions, ShareGPT for chat). Python scripts using libraries like json and pandas (if dealing with tabular data) are commonly used for this conversion.

Conceptual Python Snippet (CSV to Alpaca JSONL):

```python
import csv
import json

def convert_csv_to_alpaca_jsonl(csv_filepath, jsonl_filepath):
    """Converts a CSV with 'instruction', 'input', 'output' columns to Alpaca JSONL."""
    try:
        with open(csv_filepath, 'r', encoding='utf-8') as csvfile, \
             open(jsonl_filepath, 'w', encoding='utf-8') as jsonlfile:

            reader = csv.DictReader(csvfile)

            # Ensure required columns exist (adjust column names if needed)
            if not all(col in reader.fieldnames for col in ['instruction', 'output']):
                print("Error: CSV must contain 'instruction' and 'output' columns.")
                return

            for row in reader:
                # Handle optional 'input' column
                input_text = row.get('input', '')  # Use empty string if 'input' column is missing

                alpaca_record = {
                    "instruction": row['instruction'],
                    "input": input_text,
                    "output": row['output']
                }

                # Write each record as a JSON object on a new line
                jsonlfile.write(json.dumps(alpaca_record) + '\n')

        print(f"Successfully converted {csv_filepath} to {jsonl_filepath}")

    except FileNotFoundError:
        print(f"Error: File not found - {csv_filepath}")
    except Exception as e:
        print(f"An error occurred: {e}")


# --- Example Usage ---
# Assuming you have 'my_custom_data.csv' in the data directory
# convert_csv_to_alpaca_jsonl('data/my_custom_data.csv', 'data/converted_alpaca_data.jsonl')
```

This script reads a CSV, assumes ‘instruction’ and ‘output’ columns exist (handling an optional ‘input’ column), and writes each row as a JSON object to a .jsonl file, suitable for use with type: alpaca in Axolotl. Adapt the column names and logic based on the actual structure of the custom data.

⚡ 5.3. Data Cleaning and Validation

The principle of “garbage in, garbage out” strongly applies to LLM fine-tuning. High-quality data is essential for achieving good results. Before starting a training run:

  • Remove Duplicates: Identical or near-identical entries can bias the model.

  • Filter Low-Quality Data: Remove examples that are nonsensical, incomplete, factually incorrect (if applicable), offensive, or irrelevant to the target task.

  • Check Formatting: Ensure the data strictly adheres to the chosen format (e.g., valid JSON structure for Alpaca/ShareGPT/JSONL). Use linters or validation scripts.

  • Ensure Consistency: Maintain a consistent style, tone, and level of detail, especially in instruction/response pairs. Inconsistent data can confuse the model.

  • Handle Encoding: Ensure files are saved with consistent encoding, typically UTF-8. Basic Python scripts, data analysis libraries like Pandas, or even manual review (for smaller datasets) can be employed for these cleaning and validation steps. Investing time here significantly improves the chances of a successful fine-tuning outcome.
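
As an illustration of such a validation script, the following minimal sketch checks that every line of an Alpaca-style JSONL file parses as JSON, contains the expected keys, and is not an exact duplicate; the filename and required keys are assumptions to adapt to the dataset at hand.

```python
import json

def validate_jsonl(path, required_keys=("instruction", "output")):
    """Report lines that are invalid JSON, missing required keys, or exact duplicates."""
    seen = set()
    problems = 0
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                print(f"Line {lineno}: invalid JSON ({err})")
                problems += 1
                continue
            if not all(key in record for key in required_keys):
                print(f"Line {lineno}: missing one of {required_keys}")
                problems += 1
            fingerprint = json.dumps(record, sort_keys=True)
            if fingerprint in seen:
                print(f"Line {lineno}: exact duplicate of an earlier record")
                problems += 1
            seen.add(fingerprint)
    print(f"Checked {path}: {problems} problem(s) found.")

# validate_jsonl("data/converted_alpaca_data.jsonl")
```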



🌟 6. Running the Fine-tune

Once the environment is set up, Axolotl is cloned, the configuration YAML is prepared, and the dataset is ready, the fine-tuning process can be launched using Docker.

⚡ 6.1. Docker Command Structure

The docker run command initiates the Axolotl container, mounts the necessary host directories, grants GPU access, and executes the training script. Key components of the command:

  • docker run: The fundamental Docker command to run a container from an image.

  • --rm: Automatically removes the container filesystem when the container exits. Keeps the system clean.

  • -it: Allocates a pseudo-TTY and keeps STDIN open, allowing interactive output viewing (and potential interaction, though usually not needed for Axolotl training).

  • --gpus all: Crucial. Grants the container access to all Nvidia GPUs detected on the host system via the Nvidia Container Toolkit. Without this, the training will run on CPU only (if at all).

  • --shm-size=: Sets the size of /dev/shm (shared memory) available to the container (e.g., --shm-size=16g). Some parallel processing operations, especially within PyTorch or DeepSpeed, benefit from larger shared memory. A value like 8g or 16g is a safe starting point to avoid potential bottlenecks or errors, especially when dealing with large models or data parallelism.

  • -v $(pwd)/configs:/workspace/configs: Mounts the configs directory from the current host working directory ($(pwd)) to /workspace/configs inside the container. Adjust the host path if necessary. /workspace is often the working directory inside the Axolotl container.

  • -v $(pwd)/data:/workspace/data: Mounts the data directory.

  • -v $(pwd)/outputs:/workspace/outputs: Mounts the outputs directory. This ensures that generated checkpoints and adapters are saved directly to the host filesystem and persist after the container exits.

  • axolotl: The name given to the Docker image built or pulled earlier (e.g., axolotl if built with docker build -t axolotl . as shown above).

  • accelerate launch -m axolotl.cli.train /path/to/config_in_container.yml: The command executed inside the container. accelerate launch is part of the Hugging Face accelerate library, used by Axolotl to handle device placement (GPU) and potentially distributed training setup. -m axolotl.cli.train invokes Axolotl’s training script, followed by the path inside the container to the YAML configuration file (e.g., /workspace/configs/your_config.yml).

⚡ 6.2. Example Command

Assuming the recommended directory structure (~/axolotl-experiments/ containing axolotl/, configs/, data/, outputs/) and that the current working directory is ~/axolotl-experiments, the command to launch training using a configuration file named gemma-7b-qlora-4090.yml would be:

```bash
# Ensure you are in the '~/axolotl-experiments' directory
docker run --rm -it --gpus all --shm-size=16g \
  -v $(pwd)/configs:/workspace/configs \
  -v $(pwd)/data:/workspace/data \
  -v $(pwd)/outputs:/workspace/outputs \
  axolotl \
  accelerate launch -m axolotl.cli.train /workspace/configs/gemma-7b-qlora-4090.yml
```

⚡ 6.3. Using Docker Compose (Alternative)

For managing more complex configurations or for repeated runs, Docker Compose provides a declarative way to define the service, volumes, and resource requirements, including GPU access. Create a docker-compose.yml file in the ~/axolotl-experiments directory:

```yaml
# ~/axolotl-experiments/docker-compose.yml
services:
  axolotl-train:
    image: axolotl                   # Use the image name built/pulled earlier
    # Pass the config file path as an argument to the entrypoint/command if needed,
    # or override the command directly as shown here.
    # The config file path needs to be specified here.
    command: accelerate launch -m axolotl.cli.train /workspace/configs/gemma-7b-qlora-4090.yml  # Replace with your config file
    working_dir: /workspace          # Set working directory inside container
    volumes:
      - ./configs:/workspace/configs:rw   # Read-write access recommended
      - ./data:/workspace/data:ro         # Read-only for data is safer
      - ./outputs:/workspace/outputs:rw   # Need write access for outputs
    shm_size: '16gb'                 # Compose syntax for shared memory
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all             # Request all available GPUs
              capabilities: [gpu]    # Specify GPU capability requirement
    # Optional: keep container running if needed for inspection after an error
    # stdin_open: true
    # tty: true
```
To launch the training using this file (assuming the default config file is specified in the command):

```bash
# Run interactively, remove container on exit
docker compose run --rm axolotl-train

# Or, to run potentially in detached mode (logs via 'docker compose logs')
# docker compose up axolotl-train
```

Docker Compose handles the volume mounting and GPU resource allocation based on the YAML definition, offering a cleaner alternative to long docker run commands.



🌟 7. Monitoring the Training

Observing the training process is crucial for understanding progress, diagnosing issues, and deciding when to stop.

⚡ 7.1. Terminal Output

The primary source of information during training is the terminal output from the docker run or docker compose command. Key information includes:

  • Initialization: Logs related to loading the model, tokenizer, and dataset.

  • Progress: Updates showing the current epoch, step number, and percentage completion.

  • Metrics:

  • loss: The training loss, indicating how well the model fits the training data. It should generally decrease over time.
  • val_loss: The loss calculated on the validation set (if configured via val_set_size and eval_steps). Monitoring validation loss helps detect overfitting (when training loss keeps decreasing but validation loss starts increasing).
  • learning_rate: The current learning rate used by the optimizer.
  • Checkpoints: Messages indicating when model checkpoints/adapters are being saved to the output_dir. Pay close attention to the loss trends. A steadily decreasing training loss is expected. Sudden spikes, stagnation, or NaN (Not a Number) values often indicate problems like an unstable learning rate, bad data, or numerical issues.

⚡ 7.2. GPU Utilization (Host)

While the training runs in the container, monitor the GPU’s status on the Debian host using nvidia-smi in a separate terminal. Using watch provides continuous updates:

watch -n 1 nvidia-smi

Look for:

  • GPU-Util: Percentage of GPU processing cores being used. Should be high (ideally >80-90%) during active computation steps (forward/backward passes). Low utilization might indicate bottlenecks elsewhere (CPU, data loading, I/O).

  • Memory-Usage: VRAM consumption. Should be high but remain below the total capacity (e.g., < 24000MiB for a 24GB RTX 4090). If it hits the maximum and stays there, or if training crashes with an OOM error, it confirms memory pressure.

  • Temp: GPU temperature. Ensure it stays within safe operating limits (typically below 85-90°C) to avoid thermal throttling, which reduces performance.

  • Pwr:Usage/Cap: Power consumption relative to the card’s limit. High power draw is normal during intense training.

nvidia-smi provides essential real-time feedback on whether the hardware is being utilized effectively and helps diagnose performance issues or memory limits.
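
As an alternative to the full watch output, nvidia-smi's query mode can continuously log just the metrics discussed above; for example:

```bash
# Compact, continuously refreshing view of utilization, VRAM, temperature, and power (1-second interval)
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw \
           --format=csv -l 1
```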

⚡ 7.3. TensorBoard / Weights & Biases (WandB)

For more sophisticated and visual monitoring, Axolotl integrates with TensorBoard and Weights & Biases (WandB).

  • Configuration: Enable in the YAML file:
  • WandB: Set wandb_project: “YourProjectName”, wandb_entity: “YourWandBUsername”, and potentially wandb_run_id. Requires a WandB account. Logs are automatically synced to the WandB cloud platform.
  • TensorBoard: Set use_tensorboard: true. Logs are saved locally within the output_dir.
  • Accessing Logs:
  • WandB: Access the project dashboard through the WandB website to view live plots of loss, learning rate, system metrics (GPU utilization, VRAM), and potentially custom metrics.
  • TensorBoard: Run the TensorBoard server from the host, pointing it to the output directory:

```bash
# Install tensorboard if needed: pip install tensorboard
# Run from the ~/axolotl-experiments directory
tensorboard --logdir outputs/your_output_dir_name
```

Access the web interface, usually at http://localhost:6006, to view plots. These tools provide a much richer view of training dynamics compared to terminal logs alone, making it easier to compare experiments and analyze trends over time.
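
Putting the logging options together, the relevant YAML fragment might look like the following sketch; the project and entity values are placeholders, and the exact key names should be confirmed against the Axolotl documentation for the installed version.

```yaml
# Weights & Biases (requires `wandb login` or WANDB_API_KEY inside the container)
wandb_project: "axolotl-4090-experiments"   # placeholder project name
wandb_entity: "your-wandb-username"         # placeholder account/team name

# Local TensorBoard logs written under output_dir
use_tensorboard: true
```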



🌟 8. Evaluating and Using the Model

After the fine-tuning process completes or is stopped, the resulting artifacts (usually LoRA/QLoRA adapters) need to be evaluated and potentially prepared for deployment or use in other applications like Ollama.

⚡ 8.1. Merging LoRA Adapters (If Used)

PEFT methods like LoRA and QLoRA generate adapter weights, which are small sets of parameters modifying the behavior of the original base model. For standalone inference or conversion to formats like GGUF, these adapters typically need to be merged back into the base model’s weights to create a single, self-contained fine-tuned model.

  • Process: This usually involves loading the original base model, loading the trained adapters from the Axolotl output_dir, applying the adapter weights to the base model layers, and then saving the resulting merged model to a new directory.

  • Tools:

  • Axolotl Scripts: Check if Axolotl provides a dedicated CLI script for merging adapters (e.g., potentially via python -m axolotl.cli.merge_lora). Consult the Axolotl documentation or --help output for available commands.
  • Hugging Face transformers & peft: Standard libraries can perform the merge. A typical workflow involves using PeftModel.from_pretrained() to load the base model and adapter, followed by model.merge_and_unload() to combine them, and then model.save_pretrained() to save the merged model.
  • Considerations: Merging creates a full-size model, significantly increasing storage requirements compared to just storing the lightweight adapters. However, it simplifies deployment as only one model directory needs to be handled. Unmerged adapters require loading both the base model and the adapters at inference time (which peft handles).
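
A minimal sketch of the transformers/peft merge workflow described above might look like this; the model identifier and directory paths are illustrative and should match the actual base model and Axolotl output_dir.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative paths; adjust to your base model and Axolotl output directory.
base_model_path = "google/gemma-1.1-7b-it"
adapter_path = "outputs/gemma-7b-qlora-4090-run"               # directory containing the trained adapter
merged_path = "outputs/gemma-7b-qlora-4090-run/merged_model"

# Load the base model in half precision (not 4-bit) so the merged weights can be saved cleanly.
base = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_path)

merged = model.merge_and_unload()          # folds the LoRA weights into the base layers
merged.save_pretrained(merged_path, safe_serialization=True)

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
tokenizer.save_pretrained(merged_path)     # keep the tokenizer alongside the merged weights
```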

⚡ 8.2. Basic Inference/Testing

Perform quick tests to evaluate the fine-tuned model’s behavior using the Axolotl Docker environment or a compatible Python environment with necessary libraries (transformers, peft, torch, bitsandbytes if using quantization).

Conceptual Python Inference Snippet:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# If loading adapter separately:
# from peft import PeftModel

# --- Configuration ---
# Path to the MERGED model directory (after running the merge script)
merged_model_path = "/workspace/outputs/gemma-7b-qlora-4090-run/merged_model/"
# OR, if loading base model and adapter separately:
# base_model_path = "google/gemma-1.1-7b-it"  # Or path to downloaded base model
# adapter_path = "/workspace/outputs/gemma-7b-qlora-4090-run/"  # Path to the adapter checkpoint dir

device = "cuda"  # Use the GPU
# Determine dtype based on training/merging (bf16 often good for 4090)
model_dtype = torch.bfloat16
# --- End Configuration ---

print(f"Loading tokenizer from {merged_model_path}...")
tokenizer = AutoTokenizer.from_pretrained(merged_model_path)  # Load from merged path
# Or load from base if using separate adapter:
# tokenizer = AutoTokenizer.from_pretrained(base_model_path)

print(f"Loading model from {merged_model_path}...")
# Load MERGED model
model = AutoModelForCausalLM.from_pretrained(
    merged_model_path,
    torch_dtype=model_dtype,
    device_map="auto"  # Automatically distribute across available GPUs (or use single GPU)
)

# --- OR ---
# Load base model and apply adapter (if not merged)
# print(f"Loading base model from {base_model_path}...")
# model = AutoModelForCausalLM.from_pretrained(
#     base_model_path,
#     torch_dtype=model_dtype,  # Or load_in_4bit=True, etc., matching the original config
#     device_map="auto"
# )
# print(f"Loading adapter from {adapter_path}...")
# model = PeftModel.from_pretrained(model, adapter_path)
# Optional: merge in memory if desired for inference speed
# print("Merging adapter...")
# model = model.merge_and_unload()
# --- End Alternative Loading ---

model.eval()  # Set model to evaluation mode

# --- Inference Example ---
prompt = "Instruction: Write a short story about a robot learning to dream.\nOutput:"  # Adapt prompt format
print(f"\nPrompt: {prompt}")

inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():  # Disable gradient calculations for inference
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\nResponse:\n{response}")
```

This snippet demonstrates loading either a merged model or a base model with adapters, preparing a prompt, generating a response, and decoding it. Adjust paths, model loading parameters (torch_dtype, quantization if applicable), and generation parameters (max_new_tokens, temperature, etc.) as needed. Run this script within the Axolotl container or a similarly configured environment.

⚡ 8.3. Exporting to GGUF for Ollama

A key goal for many users is to run fine-tuned models locally using engines like Ollama, which primarily use the GGUF format developed by the llama.cpp project. This requires converting the fine-tuned model (in Hugging Face format, typically after merging adapters) to GGUF.

  • 1. Obtain llama.cpp: Clone the repository and build the necessary tools. This should ideally be done on the host system or within a dedicated environment.

```bash
# Navigate to a suitable directory (e.g., ~/tools)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build the conversion tools (may require build-essential, cmake, python3-dev)
# Check llama.cpp documentation for specific build prerequisites on Debian
make

# Install Python requirements for the conversion script
# Consider using a virtual environment (python -m venv .venv && source .venv/bin/activate)
pip install -r requirements.txt
```

  • 2. Prepare the Model: Ensure the fine-tuned model is available in the standard Hugging Face format (containing config.json, pytorch_model.bin or model.safetensors, tokenizer.model, tokenizer_config.json, etc.). This should be the output of the adapter merging step (Section 8.1). Let’s assume the merged model is in ~/axolotl-experiments/outputs/gemma-7b-qlora-4090-run/merged_model/.

  • 3. Run the Conversion Script: Use the convert.py (or potentially convert-hf-to-gguf.py - check the current script name in llama.cpp) script within the llama.cpp directory.

```bash
# Ensure you are in the llama.cpp directory or provide full paths
# Activate the virtual environment if used: source .venv/bin/activate

python convert.py \
  ~/axolotl-experiments/outputs/gemma-7b-qlora-4090-run/merged_model/ \
  --outfile ~/axolotl-experiments/outputs/gemma-7b-qlora-4090-run/gemma-7b-ft.q4_K_M.gguf \
  --outtype q4_K_M

# Deactivate the virtual environment if used: deactivate
```

  • Explanation of Arguments:
  • First argument: Path to the directory containing the Hugging Face format model to convert.
  • --outfile: Path where the resulting GGUF file will be saved. Include the desired quantization in the filename for clarity.
  • --outtype: Specifies the quantization type to apply during conversion. This determines the trade-off between model size/performance and inference speed/resource usage. Common options include:
  • f16: Float16, no quantization (largest size, highest potential quality).

  • q4_0, q4_1: Basic 4-bit quantization.

  • q4_K_M, q4_K_S: Recommended 4-bit K-Quant methods (good balance).

  • q5_0, q5_1: 5-bit quantization.

  • q5_K_M, q5_K_S: Recommended 5-bit K-Quant methods (better quality than 4-bit, larger size).

  • q8_0: 8-bit quantization (good quality, significantly larger than 4/5-bit).

  • q6_K: 6-bit K-Quant. Consult the llama.cpp documentation for the latest list and recommendations. q4_K_M or q5_K_M are often good starting points.

  • 4. Use with Ollama: Once the .gguf file is created, it can be used with Ollama by defining a Modelfile or potentially using Ollama’s import features (refer to Ollama documentation).
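
As a rough sketch, a minimal Ollama Modelfile for the GGUF produced above might look like the following; the model name and parameter value are placeholders, and a TEMPLATE matching the chat format used during fine-tuning may also be needed (see the Ollama documentation).

```text
# Modelfile (placed next to the .gguf file)
FROM ./gemma-7b-ft.q4_K_M.gguf
PARAMETER temperature 0.7
```

```bash
# Register the model with Ollama and run it locally
ollama create gemma-7b-ft -f Modelfile
ollama run gemma-7b-ft "Write a haiku about GPUs."
```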

This conversion process itself can be RAM-intensive, especially for larger models or lower quantization levels (like f16). The GGUF file contains the model weights and necessary metadata for llama.cpp-based engines.



🌟 9. Troubleshooting Common Issues

Encountering issues during setup or training is common. This section addresses potential problems specific to the Debian 12 / RTX 4090 / Docker / Axolotl stack.

⚡ Table 4: Common Troubleshooting Issues and Solutions

| Issue Symptom | Potential Causes | Solutions & Verification Steps |
|---|---|---|
| **Nvidia Driver Problems (Host)** | | |
| nvidia-smi fails on host. | Driver not loaded/installed correctly; Secure Boot MOK issue; Nouveau conflict; kernel header mismatch. | 1. Verify the driver package is installed (dpkg -l \| grep nvidia-driver). 2. Check Secure Boot status (mokutil --sb-state); re-enroll the MOK during boot if needed. 3. Ensure Nouveau is blacklisted (check /etc/modprobe.d/). 4. Install matching kernel headers (sudo apt install linux-headers-$(uname -r)). 5. Check DKMS status (dkms status). 6. Reboot and re-run nvidia-smi. |
| Black screen after reboot post-driver install. | Driver conflict (possibly Nouveau); graphics session failed to start. | 1. Boot into recovery mode or a text console (Ctrl+Alt+F2-F6). 2. Purge the Nvidia drivers (sudo apt remove --purge 'nvidia*'). 3. Reinstall carefully, ensuring Nouveau is handled. 4. Check the Xorg logs (/var/log/Xorg.0.log). |
| **Docker GPU Access Problems** | | |
| docker run --gpus all … fails (runtime error, device error). | nvidia-container-toolkit not installed; incorrect Docker daemon config (/etc/docker/daemon.json); Docker daemon not restarted after config change. | 1. Verify the toolkit is installed (dpkg -l \| grep nvidia-container-toolkit). 2. Re-run sudo nvidia-ctk runtime configure --runtime=docker. 3. Validate the /etc/docker/daemon.json syntax. 4. Crucially: restart Docker with sudo systemctl restart docker. 5. Run the verification: docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi. |
| nvidia-smi works on host but fails inside container. | Docker daemon config issue; Docker daemon not restarted. | See the previous row; focus on nvidia-ctk runtime configure, daemon.json, and systemctl restart docker. |
| **Docker Permissions** | | |
| docker: permission denied… without sudo. | User not in the docker group; session not updated after adding user to the group. | 1. Add the user: sudo usermod -aG docker $USER. 2. Log out and log back in, OR run newgrp docker in the current shell. |
| **Axolotl Training Issues** | | |
| CUDA out of memory (OOM) error during training. | Batch size too large; sequence length too long; model too large for VRAM (even 24GB); gradient checkpointing disabled; insufficient quantization. | (RTX 4090 specific order): 1. Reduce per_device_train_batch_size. 2. Increase gradient_accumulation_steps. 3. Reduce sequence_len. 4. Enable gradient_checkpointing: true. 5. Use/increase quantization (load_in_8bit: true or load_in_4bit: true). 6. Use the paged_adamw_8bit optimizer. 7. Reduce lora_r. 8. (Advanced) Explore DeepSpeed ZeRO offloading. |
| Axolotl fails immediately with YAML errors. | Incorrect YAML syntax (indentation!); misspelled parameter names; invalid parameter values; missing required parameters. | 1. Validate YAML syntax using an online linter or IDE plugin. 2. Carefully check parameter names and values against Axolotl documentation/examples. 3. Ensure correct data types (boolean, int, float, string, list). |
| Dataset loading errors (file not found, format error). | Incorrect path in YAML (relative to container mount); incorrect type specified; malformed data file (invalid JSON, etc.); incorrect volume mount in docker run/compose. | 1. Verify the -v volume mounts in the docker run command or the volumes in Compose. 2. Ensure the path in YAML uses the container's path (e.g., /workspace/data/…). 3. Confirm the dataset type matches the actual file structure. 4. Validate the data file format (e.g., jsonlint for JSON/JSONL). |
| Python package errors (e.g., ImportError, version conflicts). | Incompatible library versions within the container; CUDA/PyTorch/driver mismatch (unlikely with the Axolotl Docker image). | 1. Ensure the correct Axolotl Docker image is used (rebuild if needed: docker build -t axolotl . in the repo root). 2. Check Axolotl GitHub issues for known compatibility problems with specific library versions. 3. Consider using a specific, known-stable commit/tag of Axolotl. |

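For the dataset-loading row, malformed JSONL is easiest to catch before launching a run by parsing every line up front. A minimal sketch; the path `data/train.jsonl` is a placeholder for whatever file your YAML references.

```bash
# Parse every line of a JSONL file and stop at the first malformed one.
python3 -c '
import json, sys

path = "data/train.jsonl"   # hypothetical path: use the file your YAML points at
for i, line in enumerate(open(path), 1):
    if not line.strip():
        continue            # ignore blank lines
    try:
        json.loads(line)
    except json.JSONDecodeError as err:
        sys.exit(f"{path}: line {i}: {err}")
print("every line parsed as valid JSON")
'
```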

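The memory-saving order from the OOM row maps onto a small set of YAML keys. The sketch below writes an illustrative override snippet; the filename and values are assumptions (conservative starting points for 24GB of VRAM, not tuned recommendations), and parameter names should be checked against the example configs shipped with your Axolotl version, which typically call the per-device batch size `micro_batch_size`.

```bash
# Illustrative only: lora-24gb-overrides.yml is a hypothetical filename.
cat > lora-24gb-overrides.yml <<'EOF'
load_in_4bit: true               # QLoRA-style 4-bit base model
adapter: qlora
lora_r: 16                       # lower rank reduces adapter VRAM
sequence_len: 2048               # shorten further if OOM persists
micro_batch_size: 1              # per-device batch size (per_device_train_batch_size in HF Trainer terms)
gradient_accumulation_steps: 8   # recovers the effective batch size
gradient_checkpointing: true
optimizer: paged_adamw_8bit
bf16: true
EOF
```

Merge these keys into the training YAML you already use rather than keeping them as a separate file, since Axolotl is launched against a single config.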

🌟 Conclusion

This manual provides a comprehensive guide for setting up a Debian 12 system with an Nvidia RTX 4090 GPU to fine-tune Large Language Models using Docker and the Axolotl framework. By following the detailed steps for environment configuration (Nvidia driver installation, Docker setup, and Nvidia Container Toolkit integration), a robust platform for GPU-accelerated container workloads can be established.

The configuration of Axolotl via its YAML file is central to the fine-tuning process. Understanding the key parameters for model selection, quantization (load_in_4bit, load_in_8bit), precision (bf16), PEFT methods (lora, qlora, lora_r), dataset specification, and training arguments (per_device_train_batch_size, gradient_accumulation_steps, sequence_len, gradient_checkpointing) allows effective use of the RTX 4090's 24GB of VRAM. Dataset preparation, proper execution via Docker commands or Compose, and diligent monitoring using terminal output, nvidia-smi, and tools like WandB or TensorBoard are crucial for successful training runs. The steps outlined for merging LoRA adapters and, critically, converting the final model to the GGUF format using llama.cpp enable practical use of the fine-tuned model in popular local inference engines such as Ollama.

While the process is detailed, troubleshooting issues related to drivers, Docker configuration, permissions, memory limits, and configuration errors is often necessary; the troubleshooting section above provides targeted solutions for these common roadblocks. By carefully following this guide, technically competent users can harness a Debian 12 / RTX 4090 system to explore the rapidly evolving field of LLM fine-tuning with Axolotl and create customized models for various downstream tasks. Continued reference to the official documentation for Axolotl, Nvidia, Docker, and Hugging Face, along with engagement in relevant communities (such as r/LocalLLaMA), is encouraged for staying up to date and resolving more complex challenges.
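As a compact recap of the merge-and-convert step summarized above, the path from a merged Hugging Face checkpoint to a model served by Ollama looks roughly as follows. This is a sketch, not the manual's exact commands: script and binary names vary between llama.cpp releases (older checkouts ship convert.py and quantize instead of convert_hf_to_gguf.py and llama-quantize), and the paths and model name are placeholders.

```bash
# Placeholders: ./merged-model is the merged (base + LoRA) Hugging Face checkpoint,
# my-finetune is an arbitrary model name. Run from the llama.cpp checkout; exact
# script and binary names depend on its version.
python convert_hf_to_gguf.py ./merged-model --outfile my-finetune-f16.gguf --outtype f16
./llama-quantize my-finetune-f16.gguf my-finetune-q4_k_m.gguf Q4_K_M

# Minimal Ollama Modelfile pointing at the quantized GGUF, then register and test it
cat > Modelfile <<'EOF'
FROM ./my-finetune-q4_k_m.gguf
EOF
ollama create my-finetune -f Modelfile
ollama run my-finetune "Hello!"
```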

🔧 Works cited

1. Full Guide: How To Install Git on Debian 12 and Use it - SSD Nodes, https://www.ssdnodes.com/blog/install-git-on-debian-12/
2. Setting Up Git Version Control on Debian 12 - Reintech, https://reintech.io/blog/setting-up-git-version-control-debian-12
3. Install Git | Atlassian Git Tutorial, https://www.atlassian.com/git/tutorials/install-git
4. How to Install Git on Debian 12, 11, or 10 - LinuxCapable, https://linuxcapable.com/how-to-install-git-on-debian-linux/
5. Installing Miniconda - Anaconda, https://www.anaconda.com/docs/getting-started/miniconda/install
6. Miniconda - Anaconda, https://www.anaconda.com/docs/getting-started/miniconda/main
7. How to Install Anaconda on Debian 12 | Vultr Docs, https://docs.vultr.com/how-to-install-anaconda-on-debian-12
8. NvidiaGraphicsDrivers - Debian Wiki, https://wiki.debian.org/NvidiaGraphicsDrivers
9. How to Install Nvidia Graphics Drivers on Debian 12 - Tecmint, https://www.tecmint.com/install-nvidia-drivers-debian/
10. NVIDIA install guide - Linux.org, https://www.linux.org/threads/nvidia-install-guide.48421/
11. How to Install Docker and Docker Compose on Debian 12 'Bookworm' - xTom, https://xtom.com/blog/how-to-install-docker-and-docker-compose-on-debian-12-bookworm/
12. Installing Nvidia Graphics Drivers on Debian 12 "Bookworm" and enrolling machine owner's key (MOK) to use DKMS modules - Unix & Linux Stack Exchange, https://unix.stackexchange.com/questions/790879/installing-nvidia-graphics-drivers-on-debian-12-bookworm-and-enrolling-machine
13. How to Configure the NVIDIA vGPU Drivers, CUDA Toolkit and Container Toolkit on Debian 12 - The Virtual Horizon, https://thevirtualhorizon.com/2024/05/31/how-to-configure-the-nvidia-vgpu-drivers-cuda-toolkit-and-container-toolkit-on-debian-12/
14. Install Docker Engine on Debian - Docker Docs, https://docs.docker.com/engine/install/debian/
15. Install Docker Engine on Debian 12 - G RBE, https://gorbe.io/posts/docker/install/
16. How to Install Docker on Debian 12 - Vultr Docs, https://docs.vultr.com/how-to-install-docker-on-debian-12
17. How to Install Docker in Debian 12 Server Using Docker Apt Repository - Web Shanks, https://webshanks.com/how-to-install-docker-in-debian-12-server-using-docker-apt-repository/
18. Install the Docker Compose plugin - Docker Docs, https://docs.docker.com/compose/install/linux/
19. Setting Up Docker and Docker Compose on Debian 12 - Reintech, https://reintech.io/blog/setting-up-docker-docker-compose-debian-12
20. Docker won't install on debian! - Reddit, https://www.reddit.com/r/debian/comments/1hvn278/docker_wont_install_on_debian/
21. [HowTo] Installing Docker and NVIDIA runtime - My experience and HowTo - Manjaro Forum, https://forum.manjaro.org/t/howto-installing-docker-and-nvidia-runtime-my-experience-and-howto/97017
22. Installation Guide, container-toolkit 1.7.0 documentation - NVIDIA Docs, https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.7.0/install-guide.html
23. Installation Guide, container-toolkit 1.12.1 documentation - NVIDIA Docs, https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.12.1/install-guide.html
24. Installing the NVIDIA Container Toolkit - NVIDIA Docs, https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
25. Debian 12 Bookworm: NVIDIA Container Toolkit: Install - Server World, https://www.server-world.info/en/note?os=Debian_12&p=nvidia&f=2
26. How do I specify nvidia runtime from docker-compose.yml? - Stack Overflow, https://stackoverflow.com/questions/47465696/how-do-i-specify-nvidia-runtime-from-docker-compose-yml
27. docker build with nvidia runtime - Stack Overflow, https://stackoverflow.com/questions/59691207/docker-build-with-nvidia-runtime
28. I am trying to run a Docker container using nvidia/cuda:11.8.0-base-ubuntu22.04 as the base image, with PyTorch and CUDA-enabled dependencies to execute a FastAPI application - Docker Community Forums, https://forums.docker.com/t/i-am-trying-to-run-a-docker-container-using-nvidia-cuda-11-8-0-base-ubuntu22-04-as-the-base-image-with-pytorch-and-cuda-enabled-dependencies-to-execute-a-fastapi-application-the-application-works-perfectly-on-my-local-machine-and-correctly-detects-cuda/145160
29. Guide to setup NVIDIA drivers and Docker for GPU pass-through to a container - Reddit, https://www.reddit.com/r/PleX/comments/18gnmmx/guide_to_setup_nvidia_drivers_and_docker_for_gpu/
30. How to get Docker to recognize NVIDIA drivers? - Stack Overflow, https://stackoverflow.com/questions/57066162/how-to-get-docker-to-recognize-nvidia-drivers
31. Hard time setting up Nvidia GPU to Docker container under Debian - Reddit, https://www.reddit.com/r/linuxquestions/comments/1aldy8r/hard_time_setting_up_nvidia_gpu_to_docker/
32.