🌌 Fine-Tuning Google Gemma 3 Locally: A Comprehensive Guide for Debian 12 and NVIDIA RTX 4090
🌟 1. Introduction
🚀 Welcome to this comprehensive guide! This section will give you the foundational knowledge you need.
⚡ 1.1. The Rise of Gemma 3
Google’s Gemma family represents a significant contribution to the open-source Large Language Model (LLM) landscape, stemming from the same research and technology underpinning the powerful Gemini models.2 The latest iteration, Gemma 3, marks a substantial advancement over its predecessors, incorporating highly requested features and pushing the boundaries of open model capabilities.4 Key enhancements include multimodality (supporting interleaved image and text input for text output), significantly longer context windows (up to 128K tokens), expanded multilingual support (over 140 languages), and improved reasoning, coding, and chat functionalities, including function calling.4 Gemma 3 is available in a range of sizes—1B (text-only), 4B, 12B, and 27B parameters—offering both pre-trained (PT) checkpoints suitable for further fine-tuning and instruction-tuned (IT) variants optimized for direct use in conversational or instruction-following tasks.5 This open nature, combined with its state-of-the-art performance, makes Gemma 3 an attractive candidate for local fine-tuning and development.6
⚡ 1.2. Why Fine-Tune?
While pre-trained LLMs like Gemma 3 possess vast general knowledge, fine-tuning allows for tailoring these models to specific needs and domains, often yielding superior performance compared to relying solely on prompt engineering or even Retrieval-Augmented Generation (RAG) for certain applications.9 The primary benefits of fine-tuning include:
- Task Specialization: Adapting the model to excel at specific downstream tasks, such as question-answering (QA) based on provided context or interacting with a large knowledge base (KB).10
- Domain Adaptation: Infusing the model with domain-specific terminology, knowledge, and nuances not present or sufficiently represented in the general pre-training data.9
- Improved Accuracy & Relevance: Enhancing the precision and relevance of model outputs for specific contexts or user groups.10
- Output Control: Gaining better control over the style, tone, and format of the model's generations, ensuring consistency with brand voice or application requirements.9
- Reduced Hallucinations: Mitigating the tendency of LLMs to generate factually incorrect or nonsensical information, which is crucial for critical applications.9
- Potential Latency Optimization: Creating smaller, specialized models that might offer faster inference times compared to larger, general-purpose models for the specific target task.9
Fine-tuning differs from RAG, where external knowledge is retrieved and injected into the prompt at inference time.18 RAG excels at incorporating rapidly changing information without retraining 11, while fine-tuning embeds knowledge more deeply into the model’s parameters, potentially leading to better integration and reasoning over the learned information, albeit with the risk of “catastrophic forgetting” of general knowledge.12 Fine-tuning can also be combined with RAG for potentially superior results.11
⚡ 1.3. Tutorial Scope and Target Audience
This tutorial provides a comprehensive, step-by-step guide to fine-tuning the Google Gemma 3 model locally on a Debian 12 system equipped with an NVIDIA RTX 4090 graphics card. The focus is specifically on adapting Gemma 3 for question-answering (QA) tasks and tasks involving interaction with large knowledge bases (KB).
The tutorial covers the entire workflow:
1. Environment Setup: Configuring Debian 12, NVIDIA drivers, CUDA, Python, and essential ML libraries.
2. Data Preparation: Sourcing, cleaning, formatting, and tokenizing data specifically for QA (SQuAD-like format) and large KB tasks (including chunking strategies).
3. Fine-Tuning Configuration: Understanding hyperparameters, comparing full fine-tuning vs. PEFT (LoRA/QLoRA), and selecting optimal settings.
4. Execution: Running the fine-tuning process using Hugging Face libraries (transformers, peft, trl) with runnable PyTorch code examples.
5. Evaluation: Assessing the fine-tuned model's performance using relevant metrics (EM, F1, ROUGE, BLEU) and code examples.
6. Troubleshooting: Identifying and mitigating common issues like overfitting, catastrophic forgetting, Out-of-Memory (OOM) errors, and training instability.

This guide is intended for practitioners with intermediate machine learning experience who are comfortable working within a Linux environment (specifically Debian) and have a working knowledge of Python and core deep learning concepts.
⚡ 1.4. Hardware Context (RTX 4090)
The NVIDIA RTX 4090, with its 24GB of VRAM, is a powerful consumer-grade GPU. However, fine-tuning large language models like Gemma 3, even the smaller variants, pushes the limits of this hardware.24 Full fine-tuning of models like Gemma 3 4B or 12B is generally infeasible on a single RTX 4090 due to the immense memory requirements for storing model weights, gradients, and optimizer states.5 Therefore, this tutorial heavily relies on Parameter-Efficient Fine-Tuning (PEFT) methods, particularly QLoRA (Quantized Low-Rank Adaptation), which drastically reduce memory consumption by quantizing the base model and training only a small number of adapter parameters.28 Understanding these hardware limitations and the necessity of memory optimization techniques is crucial for successfully fine-tuning Gemma 3 locally.30
🌟 2. Chapter 1: Environment Setup on Debian 12 for RTX 4090
Setting up a stable and correctly configured environment is paramount for successful LLM fine-tuning. This chapter details the steps to prepare a Debian 12 system with an NVIDIA RTX 4090 GPU.
⚡ 2.1. System Preparation
First, ensure your Debian 12 system is up-to-date. Open a terminal and run:
```bash
sudo apt update && sudo apt full-upgrade -y
```
Next, install essential packages required for building software, managing kernel modules, and interacting with repositories:
```bash
sudo apt install build-essential gcc dirmngr ca-certificates software-properties-common apt-transport-https dkms curl git -y
```
- build-essential: Provides compilers (like gcc) and other tools needed for compiling software.
- dkms: Dynamic Kernel Module Support allows kernel modules (like the NVIDIA driver) to be automatically rebuilt when the kernel is updated, preventing driver issues after system updates.31
- Other packages (dirmngr, ca-certificates, software-properties-common, apt-transport-https, curl, git) facilitate repository management, secure connections, and code retrieval.33
⚡ 2.2. NVIDIA Driver Installation
While NVIDIA provides .run installers 34, the recommended method for Debian systems is to use the drivers available in the official Debian repositories (non-free component) for better system integration, easier updates, and reduced risk of conflicts.32
1. Enable contrib, non-free, and non-free-firmware Repositories: Edit the APT sources list file:
```bash
sudo nano /etc/apt/sources.list
```
Ensure the bookworm lines include contrib non-free non-free-firmware. Typical entries look like this (add or modify as needed):
```
deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
deb http://deb.debian.org/debian/ bookworm-updates main contrib non-free non-free-firmware
deb http://security.debian.org/debian-security bookworm-security main contrib non-free non-free-firmware
```
Save the file (Ctrl+O in nano, then Enter) and exit (Ctrl+X).
2. Update Package List: Refresh the package list to include packages from the newly enabled components:
```bash
sudo apt update
```
3. Install Kernel Headers: The NVIDIA driver needs kernel headers matching your currently running kernel to build its module via DKMS.32 Install them using:
```bash
sudo apt install linux-headers-$(uname -r)
```
Alternatively, for the standard AMD64 kernel:
```bash
sudo apt install linux-headers-amd64
```
4. Blacklist the Nouveau Driver: The open-source Nouveau driver conflicts with the proprietary NVIDIA driver and must be disabled.31 Create a configuration file to blacklist it:
```bash
sudo nano /etc/modprobe.d/nvidia-blacklist.conf
```
Add the following lines:
```
blacklist nouveau
options nouveau modeset=0
```
Save and exit. You may also need to update the initramfs: `sudo update-initramfs -u`.
5. Install NVIDIA Driver and Firmware: Install the driver package and necessary firmware blobs 32:
```bash
sudo apt install nvidia-driver firmware-misc-nonfree
```
This command installs the appropriate nvidia-driver package (e.g., version 525 or newer for RTX 4090 support on Bookworm) and triggers DKMS to build the kernel module.
6. Reboot: A reboot is required to load the new driver and ensure Nouveau is fully unloaded:
```bash
sudo reboot
```
7. Verify Installation: After rebooting, open a terminal and run:
```bash
nvidia-smi
```
This command should display details about your RTX 4090 and the loaded driver version, confirming the installation was successful.
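For a more compact, script-friendly check, nvidia-smi also offers a query mode; the fields below are just one reasonable selection:
```bash
# Print only the GPU name, driver version, and total VRAM as CSV
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
```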
⚡ 2.3. CUDA Toolkit Installation
With the driver installed, proceed to install the CUDA Toolkit. Using the official NVIDIA repository is recommended for obtaining specific versions compatible with deep learning libraries.33
1. Prerequisites Check:
   - CUDA-Capable GPU: RTX 4090 confirmed.37
   - Supported OS: Debian 12 confirmed.38
   - GCC installed: Done in step 2.1.38
   - NVIDIA Driver installed: Done in step 2.2.38
2. Add NVIDIA CUDA Repository: Download and install the CUDA keyring package to add the repository and its GPG key 34:
```bash
# Adjust version/URL if needed based on NVIDIA's latest instructions for Debian 12
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Clean up the downloaded file
rm cuda-keyring_1.1-1_all.deb
```
3. Update Package List:
```bash
sudo apt update
```
4. Install CUDA Toolkit: Install a specific version of the CUDA toolkit. Compatibility is key here: while the latest CUDA version might be available, PyTorch often has specific CUDA version requirements.39 Check the PyTorch installation instructions (https://pytorch.org/get-started/locally/) for the recommended CUDA version compatible with the desired PyTorch build (Stable or Nightly).
```bash
# Example for CUDA 12.1
sudo apt-get -y install cuda-toolkit-12-1

# Or install the latest generally available version (verify PyTorch compatibility first)
# sudo apt-get -y install cuda-toolkit
```
> ⚠️ Note: Installing cuda-toolkit-X-Y usually avoids installing a bundled driver, which is preferred since the driver was installed separately via Debian packages. Avoid installing the generic cuda metapackage, which might pull in conflicting drivers.33
5. Set Environment Variables: Add the CUDA paths to your shell configuration file (e.g., ~/.bashrc or ~/.zshrc) to make CUDA tools available.35 Open the file (e.g., nano ~/.bashrc) and add these lines at the end (adjust the version number if you installed a different CUDA version):
```bash
export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```
Save the file and apply the changes to your current session: `source ~/.bashrc` (or `source ~/.zshrc`).
6. Verification: Check the installed CUDA compiler version:
```bash
nvcc --version
```
The output should match the version you installed. For a more thorough check, consider compiling and running samples like deviceQuery from the official CUDA samples repository.35
7. Potential Issues: If you encounter display issues after installing CUDA, it might indicate a conflict with the driver installation method.42 Ensure you installed the driver via Debian packages before installing the CUDA toolkit package (e.g., cuda-toolkit-12-1, not the full cuda metapackage).
⚡ 2.4. Python Environment Setup
Using a dedicated Python virtual environment is crucial for managing dependencies and avoiding conflicts between projects.39 Miniconda provides a robust way to manage environments and complex dependencies like PyTorch with CUDA.
1. Install Miniconda:
   - Download the latest Linux x86_64 installer script from the Miniconda repository (https://repo.anaconda.com/miniconda/).44
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```
   - Verify the installer's integrity using its SHA256 checksum. Find the official hash on the Miniconda repo page and compare 44:
```bash
sha256sum Miniconda3-latest-Linux-x86_64.sh
# Compare the output with the official hash
```
   - Run the installer script:
```bash
bash Miniconda3-latest-Linux-x86_64.sh
```
Follow the prompts: accept the license, confirm the installation location (the default is usually fine), and agree to run conda init to initialize Conda in your shell.44
   - Close and reopen your terminal, or reload your shell configuration:
```bash
source ~/.bashrc  # or ~/.zshrc
```
Your prompt should now be prefixed with (base), indicating the base Conda environment is active.44
2. Create and Activate Environment: Create a dedicated environment for Gemma 3 fine-tuning. Using Python 3.10 or 3.11 is generally recommended for compatibility with recent ML libraries.35
```bash
conda create --name gemma3_env python=3.10 -y
conda activate gemma3_env
```
Your prompt should now change to (gemma3_env). All subsequent package installations should be done within this activated environment.

(Alternative: Using venv)

If you prefer Python's built-in venv:
```bash
python3 -m venv gemma3_env
source gemma3_env/bin/activate
```
While functional, Conda often handles complex dependencies like CUDA-enabled PyTorch more smoothly.39
⚡ 2.5. Installing Core ML Libraries
Inside your activated gemma3_env:
1. Install PyTorch with CUDA Support: The most reliable way is to use the command generator on the official PyTorch website (https://pytorch.org/get-started/locally/).39 Select Linux, Conda (or Pip if using venv), Python, and crucially, the CUDA version matching the one you installed (e.g., 12.1).39
- Example command for Conda and CUDA 12.1:
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```
- Example command for Pip and CUDA 12.1:
```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
- Verification: Run a quick Python check:
```python
import torch

print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"Device Name: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available. Check installation.")
```
This should output the PyTorch version, True for CUDA availability, the CUDA version PyTorch was compiled with, and your "NVIDIA GeForce RTX 4090".35
2. Install Hugging Face Libraries: Install the necessary libraries from the Hugging Face ecosystem 48:
```bash
pip install --upgrade transformers datasets accelerate evaluate  # Core libraries
pip install --upgrade peft bitsandbytes trl                      # PEFT, quantization, SFT trainer
```
- transformers: Core library for models and tokenizers.
- datasets: For loading and processing datasets.
- accelerate: Simplifies distributed training and handles device placement.49
- evaluate: For model evaluation metrics.50
- peft: Parameter-Efficient Fine-Tuning library (LoRA, QLoRA).52
- bitsandbytes: Required for 4-bit/8-bit quantization (QLoRA).29
- trl: Library containing the SFTTrainer for supervised fine-tuning.9

> ⚠️ Gemma 3 Compatibility Note: Initially, Gemma 3 support might require installing transformers directly from a specific commit or branch on GitHub.7 Check the latest Gemma 3 documentation or the Hugging Face model card. If needed:
```bash
# Example: Replace with the correct commit/branch if needed
# pip install git+https://github.com/huggingface/transformers@main
```
However, aim to use the latest stable `pip install transformers` release once Gemma 3 support is fully integrated.
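A quick way to confirm which transformers build is actually active in the environment (the minimum version required for Gemma 3 should be taken from the model card, not from this snippet):
```python
import transformers

# Compare against the minimum version listed on the Gemma 3 model card
print(f"transformers version: {transformers.__version__}")
```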
⚡ 2.6. Environment Setup Summary & Insights
Successfully setting up the environment is often the most challenging part of local LLM fine-tuning. The stability and performance of the entire process hinge on the correct installation and compatibility of NVIDIA drivers, the CUDA Toolkit, and the specific PyTorch build.40 Using the recommended installation paths—Debian’s package manager for drivers 32 and NVIDIA’s official repository for a specific CUDA Toolkit version 33—mitigates many common pitfalls.
⚡ Table 1: Recommended Software Versions (Example)
Component | Recommended Version/Source | Notes |
---|---|---|
Operating System | Debian 12 (Bookworm) | Stable base for setup. |
Kernel | Linux 6.1+ (Default Debian 12) | Ensure linux-headers match uname -r. |
NVIDIA Driver | >= 525 (via Debian non-free) | Check nvidia-smi. Supports RTX 4090. Package manager preferred. |
CUDA Toolkit | 12.1 (via NVIDIA repo) | Verify compatibility with target PyTorch version.39 11.8 also viable. |
Python | 3.10 / 3.11 (via Miniconda) | Good compatibility with ML ecosystem. |
Miniconda | Latest | Robust environment management. |
PyTorch | Latest Stable (matching CUDA) | Use official command generator. Check torch.cuda.is_available(). |
Transformers | Latest Stable (or specific commit) | Check Gemma 3 compatibility notes.7 |
PEFT, TRL, Accel. | Latest Stable | Install via pip within Conda env. |
bitsandbytes | Latest Stable | Required for QLoRA. |
> ⚠️ Note: These versions are illustrative. Always verify the latest compatibility information from official sources (NVIDIA, PyTorch, Hugging Face) at the time of setup.
🌟 3. Chapter 2: Understanding Gemma 3
Before fine-tuning, it’s essential to understand the specifics of the Gemma 3 model architecture, its capabilities, and how to access it.
⚡ 3.1. Model Overview
Gemma 3 is the latest generation in Google’s family of open-weight language models, building upon the research behind Gemini.2 It’s designed as a text-to-text, decoder-only transformer model.57 Key architectural advancements over previous Gemma versions include a novel attention mechanism that alternates between local sliding window attention and global self-attention layers, support for a much longer context length, a new tokenizer optimized for multilingual text, and, for larger models, an integrated SigLIP vision encoder enabling multimodal input.4
Gemma 3 models excel at a variety of tasks, including text generation, summarization, reasoning, question answering 2, and even function calling (allowing interaction with external tools or APIs).4 A significant leap is the expanded language support, covering over 140 languages in the 4B, 12B, and 27B parameter variants.4 Furthermore, these larger models are multimodal, capable of processing interleaved image and text inputs to generate text outputs.2
⚡ 3.2. Key Technical Specifications
- Context Length: Gemma 3 offers a remarkable 128,000-token context window for the 4B, 12B, and 27B models, a 16-fold increase over previous generations.2 The 1B model supports a 32K-token context window.2 This large context is particularly beneficial for tasks involving long documents, extensive knowledge bases, or complex multi-turn interactions.
- Tokenizer: Gemma 3 employs a SentencePiece tokenizer with a large vocabulary of approximately 262,000 tokens (a short check follows after this list).4 This large vocabulary, shared with Gemini 2.0, enhances its multilingual capabilities.7 It uses byte-level encoding, allowing it to handle characters from virtually any language.61 This contrasts with simpler word-based or character-based tokenizers, aiming for a balance between vocabulary size and the ability to represent rare words or subword units effectively.62
- Available Sizes and Variants: Gemma 3 is released in four sizes: 1B, 4B, 12B, and 27B parameters.5 The 1B model is text-only, while the 4B, 12B, and 27B models are multimodal.5 Each size is available in two variants:
  - Pre-trained (PT): Base models trained on large datasets, suitable for further fine-tuning (SFT) on specific tasks.4
  - Instruction-tuned (IT): Models that have undergone additional supervised fine-tuning and potentially reinforcement learning (RLHF/RLMF) to follow instructions and engage in dialogue effectively.4 IT models are generally preferred for direct application in chat or instruction-following scenarios.
- Quantization Options: To cater to different hardware constraints, Gemma 3 models are offered with various levels of quantization, reducing memory footprint and potentially accelerating inference at the cost of some precision.5 Available options include full 32-bit float (FP32), BFloat16 (BF16), SFP8 (8-bit float), Q4_0 (4-bit custom quantization), and INT4 (4-bit integer).5
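As referenced above, here is a short, hedged sanity check of the tokenizer specifications; it assumes you have already accepted the Gemma license and logged in to the Hugging Face Hub:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
print(f"Vocabulary size: {tokenizer.vocab_size}")          # Expected to be roughly 262K entries
print(f"Special tokens:  {tokenizer.special_tokens_map}")  # BOS/EOS/PAD and related markers
```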
⚡ 3.3. Performance and Benchmarks
Gemma 3 models demonstrate state-of-the-art performance among open models. The Gemma 3 27B IT variant, for instance, achieved a high Elo score on the LMSys Chatbot Arena leaderboard, ranking competitively even against leading closed-source models like Gemini 1.5 Pro.7 The models show strong results across a range of benchmarks measuring reasoning (MATH), coding (LiveCodeBench), knowledge (MMLU-Pro), and multimodal understanding (MMMU).7 Notably, even smaller variants show significant improvements, with Gemma 3 4B IT outperforming the previous generation’s Gemma 2 27B IT model.7
⚡ 3.4. Accessing Gemma 3 via Hugging Face
Gemma 3 models are readily accessible through the Hugging Face Hub:
- Model IDs: Models are typically named following the pattern google/gemma-3-{size}-{pt/it}. Examples include google/gemma-3-4b-it 2 and the base pre-trained version google/gemma-3-1b-pt.2
- License Agreement: Accessing Gemma models requires users to be logged into their Hugging Face account and explicitly agree to Google's usage terms and license on the model's repository page.2
- Basic Usage (transformers): The Hugging Face transformers library provides seamless integration. Ensure you have a compatible version installed (check the model card requirements, potentially v4.49+ or newer).2 A simple inference example using the pipeline API for an IT model:
```python
from transformers import pipeline
import torch

# Ensure you have accepted the license on Hugging Face Hub.
# Login using: huggingface-cli login  (or: from huggingface_hub import login; login())

pipe = pipeline(
    "text-generation",
    model="google/gemma-3-4b-it",
    device="cuda",               # Use "cpu" if no GPU is available
    torch_dtype=torch.bfloat16,  # bfloat16 gives better performance on compatible GPUs
)

messages = [
    {"role": "user", "content": "Explain the concept of Large Language Models in simple terms."},
]

# The pipeline automatically applies the chat template for IT models
output = pipe(messages, max_new_tokens=150)
print(output[0]["generated_text"][-1]["content"])  # Access the assistant's response
```
⚡ 3.5. Gemma 3 Insights for Fine-Tuning
The characteristics of Gemma 3 present both opportunities and challenges for local fine-tuning. The large 128K context window is advantageous for tasks involving extensive text, but processing such long sequences demands significant memory.5 The availability of multiple model sizes allows for selecting a base model appropriate for the hardware, but even the smaller 4B and 12B variants are substantial.5 Given the 24GB VRAM limitation of the RTX 4090, fine-tuning these models necessitates aggressive memory optimization.24 Techniques like QLoRA, which quantizes the large base model to 4-bits while training small adapters, become essential rather than optional.29 The 4B model is a feasible starting point, while the 12B model likely requires QLoRA combined with other techniques like gradient checkpointing to fit within the 24GB budget.5 The model’s large vocabulary (~262K tokens) might also influence fine-tuning dynamics compared to models with smaller vocabularies, potentially requiring more data or longer training to adapt effectively.7
⚡ Table 2: Gemma 3 Model Sizes and Approximate Inference Memory Requirements
Parameters | Full 32bit | BF16 (16-bit) | SFP8 (8-bit) | Q4_0 (4-bit) | INT4 (4-bit) | Notes |
---|---|---|---|---|---|---|
1B (Text) | ~4 GB | ~1.5 GB | ~1.1 GB | ~892 MB | ~861 MB | Text-only, 32K context |
4B | ~16 GB | ~6.4 GB | ~4.4 GB | ~3.4 GB | ~3.2 GB | Multimodal, 128K context, Fits RTX 4090 |
12B | ~48 GB | ~20 GB | ~12.2 GB | ~8.7 GB | ~8.2 GB | Multimodal, 128K context, Needs QLoRA on 4090 |
27B | ~108 GB | ~46.4 GB | ~29.1 GB | ~21 GB | ~19.9 GB | Multimodal, 128K context, Infeasible on 4090 |
Source: Adapted from.5 Values are approximate GPU/TPU memory needed to load the model for *inference*; memory consumption increases with prompt/context length. Fine-tuning requires significantly more memory due to gradients, optimizer states, and activations.
This table underscores the need for quantization (like Q4_0 or INT4 via QLoRA) to even load the 12B model for inference on an RTX 4090, reinforcing the necessity of these techniques for fine-tuning.
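A rough way to sanity-check figures like these is parameters × bytes-per-parameter; the helper below is a deliberate simplification that ignores embeddings, quantization constants, activations, and the KV cache, so it will not match the table exactly:
```python
def approx_weight_memory_gib(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, ignoring all runtime overhead."""
    total_bytes = params_billions * 1e9 * (bits_per_param / 8)
    return total_bytes / (1024 ** 3)

# Example: a 12B-parameter model in 16-bit vs. 4-bit precision
print(f"12B @ 16-bit: ~{approx_weight_memory_gib(12, 16):.1f} GiB")  # ~22 GiB before overhead
print(f"12B @ 4-bit:  ~{approx_weight_memory_gib(12, 4):.1f} GiB")   # ~5.6 GiB before overhead
```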
🌟 4. Chapter 3: Data Preparation for Fine-Tuning
The quality and format of the fine-tuning data are critical determinants of the final model’s performance and behavior. This chapter covers preparing datasets for both question-answering and large knowledge base tasks, tailored for Gemma 3.
⚡ 4.1. General Principles
- Data Quality: The foundation of successful fine-tuning is high-quality, relevant, and diverse data; garbage in, garbage out applies strongly.10 The dataset should accurately represent the target task and domain.
- Supervised Fine-Tuning (SFT): This tutorial focuses on SFT, which involves training the model on labeled examples, typically structured as prompt-response pairs.11 This allows for explicit instruction following and task adaptation.
- Dataset Splitting: Always split your dataset into training, validation, and test sets. The training set is used to update model weights, the validation set monitors performance during training (e.g., for early stopping and hyperparameter tuning), and the test set provides an unbiased final evaluation.14 Common splits are 80/10/10 or 90/5/5; a minimal splitting sketch follows this list.
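As a minimal sketch of the splitting step with the datasets library (the file name, format, and 80/10/10 ratios are illustrative):
```python
from datasets import load_dataset

# Load your data (replace with your own files or a Hub dataset)
dataset = load_dataset("json", data_files="qa_data.json")["train"]

# 80/10/10 split: carve off 20% for validation+test, then split that half-and-half
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))
```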
⚡ 4.2. Preparing Data for Question-Answering (QA)
- Task Definition: QA tasks can be extractive, where the answer is a direct span of text within a given context, or abstractive, where the model generates an answer based on the context but not necessarily verbatim.67 This section focuses on preparing data for extractive QA, similar to the popular SQuAD dataset format.
- Dataset Format (SQuAD-like): The standard format includes fields for a unique identifier, the context passage, the question, and the answer(s). Crucially for extractive QA, the answer format typically includes the answer text and its starting character position within the context.67
  - Example JSON structure:
```json
{
  "id": "unique_example_id_123",
  "title": "Example Document Title",
  "context": "The quick brown fox jumps over the lazy dog. This happened in 1986.",
  "question": "What year did the event occur?",
  "answers": {
    "text": ["1986"],
    "answer_start": [62]
  }
}
```
- Loading Data: Use the datasets library to load standard datasets like SQuAD (load_dataset("squad")) 67 or load custom data from files (CSV, JSON). If loading from a CSV/Pandas DataFrame, ensure the 'answers' column is structured correctly as a dictionary containing lists for 'text' and 'answer_start'.75
```python
import pandas as pd
from datasets import Dataset

# Assuming df has columns: 'id', 'context', 'question', 'answer_text', 'answer_start_char'
# df = pd.read_csv("your_qa_data.csv")

# Format the answers column correctly
df['answers'] = df.apply(
    lambda row: {'text': [row['answer_text']], 'answer_start': [row['answer_start_char']]},
    axis=1,
)

# Select relevant columns
df_formatted = df[['id', 'context', 'question', 'answers']]

# Convert to a Hugging Face Dataset object
hf_dataset = Dataset.from_pandas(df_formatted)
```
- Preprocessing & Tokenization for Extractive QA: This is the most complex part. The goal is to convert the character-based answer_start into token-based start and end positions that the model can predict:
  1. Tokenize Question and Context: Concatenate the question and context and pass them to the tokenizer.
  2. Handle Long Contexts: If the combined length exceeds the chosen maximum sequence length (e.g., the max_length passed to the tokenizer, bounded by the tokenizer's model_max_length), truncate the context only, using truncation="only_second".67
  3. Get Offset Mapping: Request the mapping between tokens and original character positions using return_offsets_mapping=True.67
  4. Identify Context Tokens: Use the sequence_ids() method on the tokenized output to distinguish between tokens belonging to the question (usually sequence ID 0 or None) and tokens belonging to the context (usually sequence ID 1).67
  5. Map Answer Characters to Tokens:
     - Get the character start (start_char) and end (end_char) positions from the answers field.
     - Find the token index corresponding to the start of the context (context_start) and the end of the context (context_end) using sequence_ids.
     - Check whether the answer span lies completely within the context part of the tokenized input. If start_char is before the first context token's start offset, or end_char is after the last context token's end offset, the answer is outside the valid range (due to truncation or the answer being in the question); in this case, label the example with start_position = 0 and end_position = 0.67
     - If the answer is within the context, iterate through the context entries of offset_mapping to find the token whose start offset covers start_char and the token whose end offset covers end_char. These become the start_position and end_position labels.67
- Apply Preprocessing: Use dataset.map(preprocess_function, batched=True, remove_columns=…) to apply this logic efficiently across the dataset.67
```python
# Example preprocessing function [67, 69]
from transformers import AutoTokenizer

# Assume the tokenizer is loaded, e.g.:
# tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
max_length = 512  # Maximum combined length of question + context

def preprocess_qa_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",    # Truncate the context, not the question
        return_offsets_mapping=True,
        padding="max_length",        # Pad to max_length
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while idx < len(sequence_ids) and sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, the label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise find the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

# tokenized_dataset = hf_dataset.map(
#     preprocess_qa_function, batched=True, remove_columns=hf_dataset.column_names
# )
```
⚡ 4.3. Preparing Large Knowledge Bases (KB) for Fine-Tuning
Fine-tuning on large KBs aims to adapt the model to a specific domain’s language, style, or factual knowledge, embedding this information more deeply than RAG allows.11 However, this comes at the cost of training time and the risk of catastrophic forgetting.12 The first critical step is breaking down large documents into sizes manageable by the LLM’s context window (e.g., Gemma 3’s 128K, though practically, much smaller chunks are often used for fine-tuning due to memory limits).77
- Chunking Strategies: The choice of chunking strategy impacts how information is presented to the model and can affect fine-tuning effectiveness.78
- Fixed-Size Chunking: Splits text into chunks of a fixed number of characters or tokens, often with overlap to maintain some context between chunks.77
  - Pros: Simple to implement.77 Computationally cheap.80
  - Cons: Can arbitrarily break sentences or semantic units, potentially harming coherence and retrieval relevance if used for RAG later.77
  - Implementation: Use Python's textwrap 77 or LangChain's CharacterTextSplitter 80 or TokenTextSplitter.86
```python
# Example using LangChain CharacterTextSplitter
from langchain_text_splitters import CharacterTextSplitter

# text = "Your long document text here..."

text_splitter = CharacterTextSplitter(
    separator="\n\n",   # Attempt to split on paragraphs first
    chunk_size=1000,    # Target chunk size in characters
    chunk_overlap=100,  # Overlap between chunks
)

# chunks = text_splitter.split_text(text)
# print(f"Split into {len(chunks)} chunks.")
```
- Recursive Character Chunking: Attempts to split text hierarchically using a list of separators (e.g., ["\n\n", "\n", ". ", " ", ""]). It tries the first separator, then the second if chunks are still too large, and so on. This helps preserve natural boundaries like paragraphs and sentences.82
  - Pros: Better preservation of document structure than fixed-size splitting. Often the recommended starting point for general text.88
  - Cons: Still primarily based on separators; might not capture purely semantic breaks.
  - Implementation: LangChain's RecursiveCharacterTextSplitter.82
```python
# Example using LangChain RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# text = "Your long document text here..."

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Target chunk size in characters
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],  # Default separators
)

# chunks = text_splitter.split_text(text)
# print(f"Split into {len(chunks)} chunks.")
```
- Semantic Chunking: Groups text segments (often sentences) based on semantic similarity, typically measured by the distance between their embeddings.77 Aims to create chunks that are topically coherent.
  - Pros: Produces more contextually meaningful and coherent chunks.77
  - Cons: Computationally more intensive due to embedding calculation and comparison.77 May require careful tuning of similarity thresholds.92
  - Implementation: LangChain's SemanticChunker 92 (requires an embedding model) or LlamaIndex's SemanticSplitterNodeParser.94
```python
# Example using LangChain SemanticChunker (requires an embedding model, e.g., OpenAI)
# from langchain_experimental.text_splitter import SemanticChunker
# from langchain_openai import OpenAIEmbeddings  # Or other embeddings like HuggingFaceEmbeddings

# text = "Your long document text here..."

# Assuming OPENAI_API_KEY is set
# text_splitter = SemanticChunker(
#     OpenAIEmbeddings(),
#     breakpoint_threshold_type="percentile",  # Other options: "standard_deviation", "interquartile"
# )
# docs = text_splitter.create_documents([text])  # Returns LangChain Document objects
# print(f"Split into {len(docs)} semantic chunks.")
```
- Content-Aware Chunking (Markdown, Code, etc.): Leverages the inherent structure of the content, like Markdown headers or code syntax, to guide splitting.80
  - Pros: Preserves logical structure specific to the content type.
  - Cons: Only applicable to specific formats.
  - Implementation: LangChain offers MarkdownHeaderTextSplitter 88 and language-specific splitters (e.g., PythonCodeTextSplitter).88
⚡ Table 3: Comparison of Text Chunking Strategies
Strategy | Description | Pros | Cons | Example Libraries/Classes |
---|---|---|---|---|
Fixed-Size | Splits text by fixed character/token count with overlap. | Simple, fast, computationally cheap.77 | Can break semantic units, ignores structure.77 | CharacterTextSplitter, TokenTextSplitter (LangChain) |
Recursive Character | Splits hierarchically using separators (e.g., \n\n, \n, .). | Better structure preservation, good default.88 | Still relies on separators, not purely semantic. | RecursiveCharacterTextSplitter (LangChain) 82 |
Semantic | Groups text based on embedding similarity. | High semantic coherence.77 | Computationally expensive, needs embedding model, requires tuning.77 | SemanticChunker (LangChain) 92, SemanticSplitterNodeParser (LlamaIndex) 95 |
Content-Aware (Markdown/Code) | Uses structural elements (headers, code blocks) for splitting. | Preserves logical structure specific to content.88 | Format-specific. | MarkdownHeaderTextSplitter, Language-specific splitters (LangChain) 88 |
- Formatting Chunked Data for Fine-Tuning: Once documents are chunked, they need to be formatted for the SFT process. Two main approaches exist:
  1. Unsupervised (Continued Pre-training): Feed the raw text chunks directly to the model. The objective is standard next-token prediction. This implicitly adapts the model to the domain's vocabulary, style, and knowledge patterns.18 The dataset format would simply contain the text chunks.
  2. Supervised (Instruction Tuning): Create instruction-response pairs from the chunks. This requires generating relevant questions or instructions for each chunk and formatting the chunk's content (or a summary) as the desired response.11 This allows for more targeted knowledge injection or teaching specific behaviors based on the KB content. The dataset format must follow a specific template, ideally using the model's chat template (see Section 4.4).
The choice between unsupervised and supervised formatting depends on the goal. For general domain adaptation and style learning, unsupervised might suffice. For building a QA system or instruction-following agent based on the KB, the supervised approach is necessary.
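To make the two options concrete, here is a minimal, hedged sketch of how chunked text might be packaged for each approach; the instruction-generation step in the supervised case is left as a stub, since in practice it is often done with an LLM and depends on your KB:
```python
from datasets import Dataset

chunks = ["First chunk of domain text...", "Second chunk of domain text..."]  # output of your splitter

# Option 1: unsupervised / continued pre-training -- raw text, next-token objective
unsupervised_ds = Dataset.from_dict({"text": chunks})

# Option 2: supervised instruction tuning -- pair each chunk with an instruction
def make_pair(chunk: str) -> dict:
    instruction = "Summarize the following passage from our knowledge base."  # stub; often LLM-generated
    return {"instruction": instruction, "response": chunk}

supervised_ds = Dataset.from_list([make_pair(c) for c in chunks])
print(unsupervised_ds)
print(supervised_ds)
```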
⚡ 4.4. Using Gemma Chat Templates
Instruction-tuned (IT) models like gemma-3-4b-it are trained to follow specific conversational formats. Failing to adhere to the correct template during fine-tuning or inference can significantly degrade performance or cause the model to ignore instructions.57
- Gemma Template Structure: Gemma IT models use a specific turn-based structure marked by special tokens 57:
```
<start_of_turn>user
User's message content<end_of_turn>
<start_of_turn>model
Assistant's response content<end_of_turn>
```
Some Gemma versions might also require a beginning-of-sequence (`<bos>`) token at the very start of the formatted prompt; the Gemma tokenizer typically adds this automatically during tokenization.
- tokenizer.apply_chat_template(): This Hugging Face tokenizer method is the standard way to correctly format conversational data.57
  - Input: A list of dictionaries, where each dictionary has a role ("system", "user", or "assistant") and a content key,96 for example: messages = [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}].
  - Output Format:
    - tokenize=False: Returns the fully formatted string with special tokens inserted according to the model's template.96 This is useful for creating the text field in your dataset for SFTTrainer.
    - tokenize=True: Directly returns the token IDs (input_ids) and attention_mask.96
  - add_generation_prompt=True: This flag appends the tokens that signal the start of the assistant's turn (e.g., <start_of_turn>model\n).96 Use it when building an inference prompt that the model should complete. When formatting SFT examples where the target response is already included as an assistant message, the template wraps that response in the model-turn markers itself, so add_generation_prompt should generally be left False to avoid appending an extra, empty assistant turn at the end.
- Code Example (Dataset Formatting):
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assume the tokenizer is loaded for an IT model, e.g.:
# tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

# Add a pad token if necessary (often reuses the eos token)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Assume the dataset is loaded with 'instruction' and 'response' columns
# dataset = load_dataset("your_instruction_dataset")

def format_dataset_with_template(example):
    # Structure the messages for the template
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},  # Include the target response
    ]

    # Apply the template, returning the formatted string.
    # The assistant's response is already included above, so no extra
    # generation prompt needs to be appended.
    formatted_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )

    # Ensure an EOS token if it is not added by the template/response
    if not formatted_text.strip().endswith(tokenizer.eos_token):
        formatted_text += tokenizer.eos_token

    return {"text": formatted_text}  # SFTTrainer expects a 'text' column by default

# formatted_dataset = dataset.map(format_dataset_with_template)
# print(formatted_dataset["train"][0]["text"])
```
Always verify the exact template structure and special token requirements from the specific Gemma model’s documentation or tokenizer configuration on Hugging Face Hub.
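Before mapping the whole dataset, it can help to print the exact string the template produces for a small sample (assuming the tokenizer loaded earlier):
```python
# Quick sanity check: inspect the formatted prompt for a single user turn
sample = [{"role": "user", "content": "What is QLoRA?"}]
print(tokenizer.apply_chat_template(sample, tokenize=False, add_generation_prompt=True))
```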
⚡ 4.5. Data Augmentation
Data augmentation techniques create synthetic training data from existing examples, which can be beneficial for improving model robustness, increasing dataset size (especially for low-resource tasks), and reducing overfitting.17
- Common NLP Techniques:
- Easy Data Augmentation (EDA): Simple text manipulations like synonym replacement, random insertion, random swap, and random deletion of words.99
- Back-Translation (BT): Translating text to an intermediate language and then back to the original language to generate paraphrases.99
- LLM-based Augmentation: Using another powerful LLM (like GPT-4 or even a larger Gemma model) to paraphrase existing examples or generate new examples based on prompts (zero-shot or few-shot).99
- Relevance for this Tutorial: While powerful, extensive data augmentation might be less critical when fine-tuning on large, well-chunked knowledge bases where the goal is broad domain adaptation. However, for QA tasks with limited initial data, augmentation could significantly improve performance and generalization. Techniques like paraphrasing questions or using an LLM to generate variations of question-context pairs could be explored.
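As a small illustration of the EDA-style augmentation mentioned above, the sketch below performs naive synonym replacement with WordNet; it assumes nltk is installed with the wordnet corpus downloaded, and real pipelines usually filter the outputs for fluency and label preservation:
```python
import random

from nltk.corpus import wordnet  # requires: pip install nltk && python -m nltk.downloader wordnet

def synonym_replace(sentence: str, n_replacements: int = 1) -> str:
    """Replace up to n words that have WordNet synonyms with a random synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n_replacements]:
        synonyms = {l.name().replace("_", " ") for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        synonyms.discard(words[i])
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
    return " ".join(words)

print(synonym_replace("What year did the event occur?"))
```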
⚡ 4.6. Data Preparation Insights
The process of preparing data for fine-tuning is far more than just loading files; it fundamentally dictates what the model learns and how effectively it adapts. For QA tasks, the intricate mapping of character-level answers to token positions is essential for the model to learn the extractive task correctly.67 For large KBs, the choice of chunking strategy represents a critical trade-off between preserving local context within a chunk and ensuring the chunk is small enough for processing and precise enough for potential retrieval later.78 Ignoring the specific chat template required by instruction-tuned models like Gemma 3 IT is a common pitfall that leads to suboptimal results, as the model expects input formatted in a particular conversational structure.57
Furthermore, the decision on how to format chunked KB data—either as raw text for continued pre-training or as instruction-response pairs for supervised tuning—depends entirely on the fine-tuning objective.11 Unsupervised fine-tuning aims for general adaptation to the domain’s language patterns and implicit knowledge, while supervised fine-tuning targets specific instruction-following or QA capabilities based on the knowledge within the chunks. This formatting choice directly influences the model’s resulting behavior.
🌟 5. Chapter 4: Fine-Tuning Configuration
Configuring the fine-tuning process involves selecting an appropriate strategy (full vs. parameter-efficient), choosing specific techniques (like LoRA/QLoRA), and setting hyperparameters. These choices are heavily influenced by the target task, dataset size, available hardware (especially VRAM), and desired outcome.
⚡ 5.1. Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning (PEFT)
- Full Fine-Tuning: This traditional approach involves updating all parameters of the pre-trained model using the task-specific dataset.17
- Pros: Can potentially achieve the highest performance if sufficient data and compute are available.
- Cons: Extremely resource-intensive, requiring vast amounts of GPU memory (often hundreds of GBs for models like Gemma 3) and significant computation time.17 Highly susceptible to catastrophic forgetting, where the model loses its general capabilities learned during pre-training.15 Prone to overfitting on smaller fine-tuning datasets.17 Saving the full fine-tuned model requires storing all parameters (many GBs).
- Feasibility: Generally infeasible for models like Gemma 3 4B/12B on a single RTX 4090 (24GB VRAM).
- Parameter-Efficient Fine-Tuning (PEFT): A collection of techniques that modify only a small fraction of the model’s parameters or add a small number of new parameters (adapters), keeping the bulk of the original model frozen.28
- Benefits:
  - Reduced Memory/Compute: Drastically lowers VRAM requirements for training, making it possible to fine-tune large models on consumer hardware.28
  - Faster Training: Fewer parameters to update leads to quicker training iterations.
  - Reduced Storage: Only the small set of modified/added parameters (adapters) needs to be saved, typically a few MB instead of many GB.52
  - Mitigation of Catastrophic Forgetting: Since the base model remains largely unchanged, PEFT methods are less prone to forgetting pre-trained knowledge.105
  - Portability/Modularity: Small adapters can be easily shared and loaded on top of the base model for different tasks; multiple adapters can potentially be used together.52
  - Comparable Performance: State-of-the-art PEFT techniques often achieve performance comparable to full fine-tuning on downstream tasks.28
- Recommendation: Given the Gemma 3 model sizes (4B/12B) and the target hardware (RTX 4090), PEFT is the necessary and recommended approach for this tutorial.24
⚡ 5.2. LoRA and QLoRA Deep Dive
LoRA and its quantized variant QLoRA are among the most popular and effective PEFT methods.
- LoRA (Low-Rank Adaptation):
- Concept: Instead of directly updating a large weight matrix W (e.g., in attention or MLP layers), LoRA approximates the update ΔW using the product of two much smaller, low-rank matrices, B and A (ΔW = BA). Only B and A are trained, while the original W remains frozen. The updated layer computes h = Wx + BAx.53
- LoraConfig Parameters: The Hugging Face peft library uses the LoraConfig class to specify LoRA parameters 108:
  - r (rank): An integer defining the inner dimension of matrices A and B. A crucial hyperparameter controlling the capacity (number of trainable parameters) of the LoRA adapter. Smaller r means fewer parameters but potentially less adaptation capability. Common values range from 4 to 64 or higher.53
  - lora_alpha: An integer scaling factor. The LoRA output BAx is typically scaled by lora_alpha / r. This acts as a form of learning rate for the adapter. Often set to r or 2*r.53
  - target_modules: A list of strings specifying the names of the modules within the base model to which LoRA adapters should be applied (e.g., ["q_proj", "k_proj", "v_proj", "o_proj"] for attention layers, or potentially including gate_proj, up_proj, down_proj for MLP layers in models like Gemma).53 Using "all-linear" attempts to target all linear layers, similar to QLoRA's default.112 Identifying the correct module names for Gemma 3 might require inspecting the model architecture.
  - lora_dropout: A float specifying the dropout probability to apply within the LoRA layers (after the first matrix multiplication by A, before the second multiplication by B).54 Helps prevent overfitting the adapters.
  - bias: A string ('none', 'all', or 'lora_only') indicating whether bias parameters should be trained.1 'none' keeps all original biases frozen and adds no new trainable biases.
  - task_type: Specifies the task type for appropriate model configuration, e.g., TaskType.CAUSAL_LM for language modeling.53
  - Other options: modules_to_save (train specific non-LoRA modules like classification heads), init_lora_weights (initialization strategy; by default A uses Kaiming-uniform initialization and B is zero-initialized), use_rslora (rank-stabilized scaling), use_dora (weight-decomposed adaptation).108
- Code Example (LoraConfig):
```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,           # Rank
    lora_alpha=32,  # Scaling factor (often 2*r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Attention projection layers
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
```
- QLoRA (Quantized LoRA):
- Concept: QLoRA enhances LoRA’s efficiency by quantizing the frozen base model weights to a lower precision (typically 4-bit) while still performing the LoRA adaptation. Gradients are backpropagated through the quantized weights into the LoRA adapters, which are usually kept in a higher precision format (e.g., BF16).28 This dramatically reduces the memory required to hold the base model.
- Key Components 29:
  - 4-bit NormalFloat (NF4): An optimized 4-bit data type designed for the typical normal distribution of LLM weights, offering better precision than standard 4-bit integer or float types.
  - Double Quantization (DQ): Further reduces memory by quantizing the quantization constants themselves (the scaling factors used for each block of weights in the NF4 quantization). This saves roughly an additional 0.4 bits per parameter.
  - Paged Optimizers: Uses NVIDIA unified memory to automatically page optimizer states (which can be large) between GPU VRAM and CPU RAM, preventing OOM errors during memory spikes, particularly when using gradient checkpointing.
- Implementation (bitsandbytes + transformers): QLoRA is integrated into Hugging Face libraries. It’s enabled by passing a BitsAndBytesConfig object to the quantization_config argument and setting load_in_4bit=True (or load_in_8bit=True) when loading the model with from_pretrained.2
- Code Example (BitsAndBytesConfig and Model Loading):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-3-12b-it"  # Example: 12B model

# QLoRA configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Use 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,         # Enable Double Quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute dtype during the forward pass
)

# Load the base model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically distribute across GPUs if available
    # attn_implementation="flash_attention_2",  # Optional: use Flash Attention if supported
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Add a padding token if needed
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
```
- LoftQ Initialization: For potentially better QLoRA results, consider using LoftQ initialization by setting init_lora_weights=“loftq” in LoraConfig and providing a LoftQConfig.108 This initializes LoRA weights specifically to minimize the error introduced by quantizing the base model.
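A minimal sketch of how LoftQ initialization is wired up, assuming a recent peft release that ships LoftQConfig; note that the base model is loaded without bitsandbytes quantization here, since LoftQ performs the quantization-aware initialization itself:
```python
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

# Base model loaded in full/bf16 precision (no BitsAndBytesConfig at this stage)
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it")

loftq_config = LoftQConfig(loftq_bits=4)  # simulate 4-bit quantization during initialization

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
```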
⚡ 5.3. Hyperparameter Tuning
Finding the right hyperparameters is crucial for effective fine-tuning, balancing adaptation to the new task with stability and generalization.13
- Key Hyperparameters for (Q)LoRA Fine-Tuning 13:
- learning_rate: Typically smaller than pre-training rates. Values like 5e-5, 1e-4, 2e-4 are common starting points. Too high can cause instability or forgetting; too low can lead to slow convergence.15
- per_device_train_batch_size: Highly constrained by VRAM, especially with QLoRA on a 24GB card. Often set to 1, 2, or 4.1
- gradient_accumulation_steps: Used to compensate for small per-device batch sizes. Set such that per_device_train_batch_size * num_gpus * gradient_accumulation_steps equals a desired effective batch size (e.g., 16, 32, 64).1
- num_train_epochs: Usually small for fine-tuning (e.g., 1 to 3) to prevent overfitting the smaller fine-tuning dataset.1
- lr_scheduler_type: Controls learning rate decay. ‘linear’ or ‘cosine’ with a warmup period (warmup_steps or warmup_ratio) are common choices.55
- weight_decay: L2 regularization parameter to penalize large weights and prevent overfitting. Small values like 0.01 or 0.1 are typical.13
- optim: Optimizer. For QLoRA, memory-efficient paged optimizers like ‘paged_adamw_32bit’ or ‘paged_adamw_8bit’ are recommended.18
- lora_r, lora_alpha, lora_dropout: Specific to LoRA, controlling adapter capacity, scaling, and regularization. Experimentation is often needed.
- Tuning Strategies: Systematically exploring hyperparameter combinations (grid search, random search) is ideal but computationally expensive.14 A practical approach is to start with values reported in similar fine-tuning studies or examples (like those in this tutorial), then iteratively adjust one or two parameters at a time based on validation performance.13
⚡ 5.4. Memory Optimization Techniques
These techniques are often used in combination, especially with QLoRA, to fit training into available VRAM.
- QLoRA: Quantizes the base model to 4 bits.29 This provides the largest memory saving for the model weights.
- Gradient Accumulation: Increases the effective batch size without linearly increasing memory usage for activations.1 Essential when per_device_train_batch_size is small.
- Gradient Checkpointing: Saves significant memory by storing only a subset of activations and recomputing the rest during the backward pass, at the cost of increased computation time (roughly 20-30% slower per step).120 Enabled via gradient_checkpointing=True in TrainingArguments.120
- Mixed Precision Training (FP16/BF16): Uses 16-bit floating-point numbers for weights and activations, halving the memory footprint compared to 32-bit and potentially speeding up computation on supported hardware (Tensor Cores).120 Enabled via fp16=True or bf16=True in TrainingArguments.120 BF16 (supported on Ampere and newer architectures, including the Ada-based RTX 4090) is generally preferred over FP16 due to its larger dynamic range, reducing the risk of underflow/overflow issues.122
- Memory-Efficient Optimizers: AdamW can consume significant memory storing optimizer states (roughly 2x the trainable parameters for the moment estimates, plus gradients). 8-bit optimizers (adamw_8bit) or paged optimizers (paged_adamw_32bit) significantly reduce this overhead.29 A sketch combining these settings in a single TrainingArguments object follows this list.
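The following is a minimal sketch (not a complete training script) of how these memory-saving options are typically combined in a single TrainingArguments object; the specific values are illustrative and mirror the ranges discussed in this chapter:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,   # Small per-device batch to fit in 24 GB VRAM
    gradient_accumulation_steps=8,   # Effective batch size = 2 * 8 = 16
    gradient_checkpointing=True,     # Recompute activations to save memory
    bf16=True,                       # Mixed precision; preferred over fp16 on the RTX 4090
    optim="paged_adamw_32bit",       # Paged optimizer recommended for QLoRA
    learning_rate=1e-4,
    num_train_epochs=1,
    logging_steps=25,
)
```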
⚡ 5.5. Configuration Insights
Configuring fine-tuning for Gemma 3 (especially 12B) on an RTX 4090 is fundamentally an optimization problem constrained by the 24GB VRAM limit. It’s less about finding the absolute best theoretical hyperparameters and more about finding the best achievable hyperparameters within the memory budget. QLoRA is the cornerstone, making the base model memory manageable.29 However, activations and optimizer states still pose challenges. Therefore, techniques like gradient checkpointing and gradient accumulation are not just optional optimizations but often necessary components to make training feasible at all, especially with reasonable effective batch sizes.24 Hyperparameter choices like learning rate, LoRA rank (r), and number of epochs must be considered in conjunction with these memory-saving techniques.
⚡ Table 4: Example Hyperparameter Starting Ranges for Gemma 3 (4B/12B) QLoRA on RTX 4090
Hyperparameter | Starting Range / Value | Notes |
---|---|---|
learning_rate | 5e-5 to 2e-4 | Lower end might be safer initially. |
per_device_train_batch_size | 1, 2 | Maximize based on VRAM with other settings. |
gradient_accumulation_steps | 4 to 16 | Adjust to reach effective batch size (e.g., 16-64). |
num_train_epochs | 1 to 3 | Monitor validation loss closely to avoid overfitting. |
lr_scheduler_type | ’cosine’ or ‘linear’ | Use with warmup_steps (e.g., 10-50) or warmup_ratio (e.g., 0.03). |
weight_decay | 0.0 to 0.1 | Regularization. |
optim | ’paged_adamw_32bit’ | Recommended for QLoRA.29 ‘paged_adamw_8bit’ saves more memory but check stability. |
lora_r | 8, 16, 32, 64 | Start lower (e.g., 16), increase if needed and memory allows. |
lora_alpha | r or 2*r | Common practice, e.g., 16/32 or 32/64. |
lora_dropout | 0.05 to 0.1 | Adapter regularization. |
gradient_checkpointing | True | Recommended for memory saving.120 |
bf16 | True | Recommended over fp16 on RTX 4090 for stability and performance.120 |
bnb_4bit_quant_type | ”nf4” | QLoRA’s optimized 4-bit type.29 |
bnb_4bit_use_double_quant | True | Further memory saving.29 |
bnb_4bit_compute_dtype | torch.bfloat16 | Compute precision during QLoRA forward pass.1 |
> ⚠️ Note: These are starting points. Optimal values depend heavily on the specific dataset and task.
🌟 6. Chapter 5: Executing the Fine-Tuning Process
This chapter provides a practical, runnable code example integrating the concepts and configurations discussed previously, utilizing the Hugging Face trl library’s SFTTrainer for supervised fine-tuning.
⚡ 6.1. Leveraging Hugging Face TRL and PEFT
The trl (Transformer Reinforcement Learning) library from Hugging Face offers high-level abstractions for training transformers, including the SFTTrainer specifically designed for Supervised Fine-Tuning.9 SFTTrainer simplifies the training loop by integrating seamlessly with:
-
transformers: For loading base models and tokenizers.
-
peft: For applying PEFT techniques like LoRA/QLoRA via a PeftConfig.
-
datasets: For handling data loading and processing.
-
accelerate: For handling device placement and distributed training automatically. By providing the configured model, tokenizer, dataset, PEFT config, and training arguments, SFTTrainer manages the complexities of the training loop, gradient updates, logging, and saving.1
⚡ 6.2. Runnable Code Example (QLoRA Fine-Tuning Gemma 3)
This script demonstrates fine-tuning google/gemma-3-4b-it on a hypothetical instruction dataset using QLoRA. Adapt base_model_id, dataset_name, the formatting function, and the hyperparameters as needed. Python
import torch
import os
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    logging,
)
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from huggingface_hub import login

# --- 1. Configuration ---

# Model and Tokenizer
base_model_id = "google/gemma-3-4b-it"                # Or "google/gemma-3-12b-it" if resources allow
new_model_name = "gemma-3-4b-it-finetuned-example"    # Your desired output model name

# Dataset (Replace with your actual dataset)
# Example: Using a subset of Dolly for demonstration
dataset_name = "databricks/databricks-dolly-15k"
dataset_split = "train[:1000]"                        # Use a small subset for quick testing

# QLoRA Configuration
use_4bit = True
bnb_4bit_compute_dtype = "bfloat16"                   # Use "float16" if bf16 is not supported
bnb_4bit_quant_type = "nf4"                           # NormalFloat4
use_nested_quant = True                               # Enable Double Quantization

# LoRA Configuration
lora_r = 16          # LoRA rank
lora_alpha = 32      # LoRA alpha
lora_dropout = 0.05  # LoRA dropout

# Training Arguments
output_dir = "./results"
num_train_epochs = 1              # Fine-tune for 1-3 epochs usually
per_device_train_batch_size = 2   # Adjust based on VRAM
gradient_accumulation_steps = 8   # Adjust to reach effective batch size (e.g., 2*8=16)
gradient_checkpointing = True     # Enable gradient checkpointing
max_grad_norm = 0.3               # Gradient clipping
learning_rate = 1e-4              # Initial learning rate (e.g., 1e-4 or 2e-4)
weight_decay = 0.01               # Regularization
optim = "paged_adamw_32bit"       # Paged optimizer for QLoRA
lr_scheduler_type = "cosine"      # Learning rate scheduler
max_steps = -1                    # Set to -1 to train for num_train_epochs
warmup_ratio = 0.03               # Warmup ratio
group_by_length = True            # Group sequences of similar length for efficiency
save_strategy = "epoch"           # Save checkpoints every epoch
logging_steps = 25                # Log every 25 steps
max_seq_length = 1024             # Max sequence length for tokenizer/model
packing = False                   # Disable packing for simplicity here

# --- 2. Hugging Face Login (if pushing to Hub) ---
# from huggingface_hub import login
# login()  # Or login(token="YOUR_HF_TOKEN")

# --- 3. Load Tokenizer ---
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
# Set padding token if necessary (Gemma often uses eos_token as pad_token)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # SFTTrainer default; can sometimes be 'left' for generation
print("Tokenizer loaded.")

# --- 4. Load Dataset and Format ---
dataset = load_dataset(dataset_name, split=dataset_split)

# Formatting function using the Gemma chat template
# Adapts the Dolly format (instruction, context, response) to Gemma messages
def format_dolly_for_gemma(example):
    # Construct the prompt based on whether context is present
    if example.get("context"):
        user_content = f"Instruction:\n{example['instruction']}\n\nInput:\n{example['context']}"
    else:
        user_content = f"Instruction:\n{example['instruction']}"
    messages = [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["response"]},
    ]
    # Apply chat template
    formatted_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    # Ensure EOS token if not added by the template
    if not formatted_text.strip().endswith(tokenizer.eos_token):
        formatted_text += tokenizer.eos_token
    return {"text": formatted_text}

dataset = dataset.map(format_dolly_for_gemma, remove_columns=list(dataset.features))
print("Dataset loaded and formatted.")
# print(dataset[0]["text"])  # Optional: view a formatted example

# --- 5. Load Base Model with QLoRA ---
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically uses available GPUs
    # attn_implementation="flash_attention_2",  # Optional: requires the flash-attn package
)
model.config.use_cache = False   # Required for gradient checkpointing
model.config.pretraining_tp = 1  # Set if pretraining_tp > 1 is used

# Prepare model for k-bit training (important for QLoRA)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)
print("Base model loaded with QLoRA configuration.")

# --- 6. PEFT Configuration (LoRA) ---
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[  # Example targets, verify for Gemma 3
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # Show the number of trainable parameters
print("PEFT model created.")

# --- 7. Training Arguments ---
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_strategy=save_strategy,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=False,  # bf16 recommended if supported
    bf16=True if torch.cuda.is_bf16_supported() else False,  # Enable BF16 if supported
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=gradient_checkpointing,
    report_to="tensorboard",  # Log to TensorBoard
    # deepspeed=ds_config,    # Optional: if using a DeepSpeed config file
)
print("Training arguments set.")

# --- 8. Initialize SFTTrainer ---
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # Column with formatted text
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
print("SFTTrainer initialized.")

# --- 9. Start Training ---
print("Starting training...")
train_result = trainer.train()
print("Training finished.")

# --- 10. Save Trained Model and Tokenizer ---
print(f"Saving model and tokenizer to {output_dir}")
trainer.save_model(output_dir)  # Saves the LoRA adapter weights
tokenizer.save_pretrained(output_dir)

# Optional: Log metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
print("Model and tokenizer saved.")

# --- 11. (Optional) Merge Adapter and Save Full Model ---
# Free up memory first
# del model
# del trainer
# torch.cuda.empty_cache()

# Load base model and adapter
# base_model = AutoModelForCausalLM.from_pretrained(
#     base_model_id,
#     low_cpu_mem_usage=True,
#     return_dict=True,
#     torch_dtype=torch.bfloat16,  # Use same dtype as training
#     device_map="auto",
# )
# merged_model = PeftModel.from_pretrained(base_model, output_dir)
# merged_model = merged_model.merge_and_unload()  # Merge adapter into base model

# Save the merged model
# merged_model_path = f"./{new_model_name}-merged"
# merged_model.save_pretrained(merged_model_path, safe_serialization=True)
# tokenizer.save_pretrained(merged_model_path)
# print(f"Merged model saved to {merged_model_path}")

# --- 12. (Optional) Push to Hub ---
# Make sure you are logged in
# try:
#     trainer.push_to_hub()
#     print(f"Model pushed to Hugging Face Hub: {trainer.args.hub_model_id or new_model_name}")
# except Exception as e:
#     print(f"Failed to push to hub: {e}")
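Once the adapter has been saved, a quick smoke test is to reload the quantized base model, attach the adapter with peft, and generate a completion. The following is a minimal sketch that reuses the output_dir and base_model_id from the script above; it is only a sanity check, not a serving setup. Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

output_dir = "./results"                 # directory where the adapter and tokenizer were saved above
base_model_id = "google/gemma-3-4b-it"

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(output_dir)
base = AutoModelForCausalLM.from_pretrained(base_model_id,
                                            quantization_config=bnb_config,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, output_dir)  # attach the trained LoRA adapter
model.eval()

messages = [{"role": "user", "content": "Instruction:\nExplain what LoRA is in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))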
⚡ 6.3. Monitoring Training Progress
Monitoring the training process is crucial for diagnosing issues and understanding model behavior.
-
Console Logs: The SFTTrainer (via transformers.Trainer) prints logs to the console at intervals specified by logging_steps in TrainingArguments. These logs typically include the current step, epoch, learning rate, and training loss.117 A steadily decreasing loss indicates learning is occurring. Spikes or NaN values signal instability.115
-
TensorBoard/Weights & Biases: For more visual and interactive monitoring, configure report_to=["tensorboard"] or report_to=["wandb"] in TrainingArguments.9 This allows plotting the loss curve over time. Observing the training loss curve alongside a validation loss curve (if an eval_dataset is provided to the Trainer) is the standard way to detect overfitting: training loss continues to decrease while validation loss starts to increase.66
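If a held-out eval_dataset is available, validation-loss tracking (and, optionally, early stopping) can be wired in through TrainingArguments. The sketch below is minimal; argument names reflect recent transformers releases (older versions use evaluation_strategy instead of eval_strategy), and the thresholds are illustrative. Python
from transformers import TrainingArguments, EarlyStoppingCallback

training_arguments = TrainingArguments(
    output_dir="./results",
    eval_strategy="steps",            # run evaluation periodically ("evaluation_strategy" in older releases)
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    logging_steps=25,
    load_best_model_at_end=True,      # restore the checkpoint with the best eval_loss at the end
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to=["tensorboard"],        # or ["wandb"]
)

# trainer = SFTTrainer(..., args=training_arguments,
#                      train_dataset=train_dataset, eval_dataset=eval_dataset,
#                      callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
# Training and validation loss curves then appear side by side in TensorBoard.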
⚡ 6.4. Checkpointing and Saving
Regularly saving model checkpoints is vital for resilience against interruptions and for selecting the best performing model state.
- Configuration: The TrainingArguments control checkpointing behavior 117:
- output_dir: Specifies the root directory where checkpoints and the final model are saved.
- save_strategy: Determines when to save. Options are “steps” or “epoch”.
- save_steps: If save_strategy=“steps”, saves a checkpoint every save_steps.
- save_total_limit: Limits the total number of checkpoints kept, deleting older ones to save disk space.
- What is Saved (PEFT): When using PEFT (LoRA), trainer.save_model() saves the adapter configuration (adapter_config.json) and the trained adapter weights (adapter_model.bin) within the specified checkpoint or output directory.1 It does not save the entire base model, leading to significant storage savings.54 The tokenizer files are saved separately using tokenizer.save_pretrained().
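Two practical consequences follow. An interrupted run can usually be resumed from the newest checkpoint in output_dir, and any individual checkpoint's adapter can be loaded onto the base model for inspection. A minimal sketch, with an illustrative checkpoint path: Python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Resume an interrupted run: the Trainer restores model, optimizer, and scheduler state
# from the newest checkpoint folder inside output_dir.
# trainer.train(resume_from_checkpoint=True)

# Load one specific adapter checkpoint on top of the base model (the path is illustrative).
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-4b-it", device_map="auto")
model = PeftModel.from_pretrained(base_model, "./results/checkpoint-500")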
⚡ 6.5. Distributed Training (Brief Note)
The SFTTrainer leverages Hugging Face Accelerate internally to handle device placement and distribution.49 While this tutorial focuses on a single RTX 4090, scaling to multiple GPUs on a single node (or multiple nodes) typically involves:
1. Configuring Accelerate using accelerate config or a YAML configuration file.
2. Launching the training script using accelerate launch your_script.py --args…
Alternatively, frameworks like DeepSpeed can be integrated via TrainingArguments (--deepspeed ds_config.json) for more advanced distributed training strategies like ZeRO optimization.49 This provides a path for scaling beyond the single-GPU setup described here.
⚡ 6.6. Execution Insights
The SFTTrainer provides a high level of abstraction, making the fine-tuning process appear straightforward with just a trainer.train() call. However, this simplicity relies heavily on the correctness of the preceding steps: model loading (with proper QLoRA setup), tokenizer configuration (padding, chat templates), dataset formatting, PEFT configuration, and TrainingArguments.1 Errors or suboptimal performance during training often trace back to misconfigurations in these earlier stages. Furthermore, monitoring the training loss is not merely passive observation; it’s an active diagnostic tool.66 A loss that decreases too rapidly might indicate an overly high learning rate leading to instability, while a loss that plateaus too early might suggest the learning rate is too low or the model lacks capacity (e.g., LoRA rank r is too small). NaN (Not a Number) losses are a common sign of numerical instability, often linked to mixed precision settings (especially FP16), excessively high learning rates, or improperly formatted input data causing gradient explosion.115 Careful monitoring helps catch these issues early and guide adjustments to hyperparameters or data processing.
🌟 7. Chapter 6: Evaluating the Fine-Tuned Model
Evaluation is a critical step to quantify the fine-tuned model’s performance, compare it against baselines, diagnose issues like overfitting, and ensure it meets the requirements of the target application.13
⚡ 7.1. Why Evaluate?
-
Measure Improvement: Quantify how much fine-tuning improved performance on the specific task compared to the base model.
-
Task-Specific Performance: Assess how well the model performs the intended QA or KB interaction task.
-
Generalization Check: Evaluate on a held-out test set to ensure the model generalizes beyond the training data and isn’t overfitting.14
-
Comparison: Compare different fine-tuning runs (e.g., different hyperparameters, LoRA ranks) or different models.128
-
Safety and Alignment: Assess whether the model exhibits unwanted biases or generates harmful content (though this often requires specialized benchmarks or human evaluation beyond the scope of standard metrics).128
⚡ 7.2. Evaluation Strategies
-
Held-out Test Set: The most common approach is to evaluate the model on a portion of the dataset that was not used during training or validation.14 This provides an unbiased estimate of performance on unseen data.
-
Standard Benchmarks: For common tasks like QA or general language understanding, standardized benchmarks (e.g., SQuAD, GLUE, SuperGLUE) exist.51 Running these can provide comparable scores but might be computationally intensive locally.
-
Human Evaluation: Often considered the gold standard, especially for generative tasks, but it is subjective, time-consuming, and expensive.128 Methods include Likert scales for fluency/relevance or A/B testing.130 This tutorial focuses on automated metrics.
⚡ 7.3. Metrics for Question-Answering (QA)
For extractive QA tasks like SQuAD:
-
Exact Match (EM): Calculates the percentage of predictions where the predicted answer string exactly matches one of the ground truth answer strings.51 It’s a strict, all-or-nothing metric.
-
F1 Score: Treats prediction and ground truth as bags of tokens and computes the harmonic mean of precision and recall at the token level.51 It allows for partial credit if the prediction overlaps significantly with the ground truth, making it more robust than EM, especially when minor variations (e.g., punctuation, articles) exist.
-
SQuAD Metric: The Hugging Face evaluate library often provides a combined “squad” metric that calculates both EM and F1.51
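To make the token-level F1 calculation concrete, here is a simplified sketch; the official squad metric additionally lower-cases the strings and strips articles and punctuation before comparing tokens. Python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Denver Broncos", "Denver Broncos"))  # precision 2/3, recall 1.0 -> F1 = 0.8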
⚡ 7.4. Metrics for Knowledge Base / Text Generation Tasks
Evaluating models fine-tuned on knowledge bases often involves assessing the quality of generated text based on the learned knowledge or style.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures the overlap between the generated text and one or more reference texts. Commonly used for summarization evaluation.51
- ROUGE-N: Measures overlap of n-grams (e.g., ROUGE-1 for unigrams, ROUGE-2 for bigrams). Focuses on recall (how many n-grams from the reference appear in the prediction).
- ROUGE-L: Measures the longest common subsequence (LCS), considering sentence-level structure similarity.
- ROUGE-Lsum: Calculates ROUGE-L per sentence and aggregates, suitable for extractive summaries.141
-
BLEU (Bilingual Evaluation Understudy): Originally designed for machine translation, BLEU measures the precision of n-grams in the generated text compared to references, with a penalty for brevity.51 Higher scores indicate greater similarity. It can sometimes penalize valid paraphrases or creative outputs that differ lexically from the references.142
-
Perplexity (PPL): An intrinsic measure of how well a language model predicts a given text sample. Lower perplexity means the model is less “surprised” by the text, indicating a better fit to the language distribution.129 While useful for assessing language modeling capability, it doesn’t always directly correlate with performance on specific downstream tasks like QA or instruction following.
-
BERTScore: Computes similarity between generated and reference texts based on contextual embeddings from models like BERT. It correlates better with human judgment than n-gram based metrics like BLEU/ROUGE because it captures semantic similarity rather than just lexical overlap.130 However, it is more computationally expensive.
⚡ 7.5. Using the Hugging Face evaluate Library
The evaluate library provides a standardized way to compute various metrics.50
-
Installation: Ensure evaluate and any metric-specific dependencies (like rouge_score, nltk, bert_score, sacrebleu, seqeval) are installed.143 Bash
pip install evaluate rouge_score nltk sacrebleu seqeval  # Add others as needed
-
Loading Metrics: Use evaluate.load("metric_name"). The name corresponds to the metric ID on the Hugging Face Hub (e.g., "accuracy", "f1", "precision", "recall", "squad", "rouge", "bleu", "perplexity", "bertscore").143
-
Computing Metrics: Use the .compute() method, passing predictions and references in the expected format.71
- accuracy, f1, precision, recall: Typically expect lists of predicted labels and reference labels.
- squad: Expects lists of dictionaries, each containing id and prediction_text for predictions, and id and answers (dict with text list and answer_start list) for references.51
- rouge, bleu: Expect lists of prediction strings and lists of reference strings (or lists of lists for multiple references per prediction for BLEU).143
- perplexity: Expects a list of input text strings as predictions, together with a model_id specifying the model used to score them.
-
Incremental Evaluation: Use .add(prediction=…, reference=…) or .add_batch(predictions=…, references=…) to accumulate results before calling .compute().143
-
Combining Metrics: Use evaluate.combine([“metric1”, “metric2”,…]) to compute multiple metrics simultaneously.143
-
Code Examples: Python
import evaluate

# --- Example: Classification Metrics ---
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
# Illustrative labels (the original example lists were lost in formatting);
# these values reproduce the output shown below.
predictions_clf = [0, 1, 1, 0, 1]
references_clf = [0, 1, 0, 1, 1]
results_clf = clf_metrics.compute(predictions=predictions_clf, references=references_clf)
print(f"Classification Metrics: {results_clf}")
# Output: {'accuracy': 0.6, 'f1': 0.666..., 'precision': 0.666..., 'recall': 0.666...}

# --- Example: SQuAD (QA) Metrics ---
squad_metric = evaluate.load("squad")
# Illustrative predictions/references: the first answer matches exactly,
# the second only partially overlaps the reference.
predictions_qa = [
    {"id": "id1", "prediction_text": "1976"},
    {"id": "id2", "prediction_text": "Broncos"},
]
references_qa = [
    {"id": "id1", "answers": {"answer_start": [3], "text": ["1976"]}},
    {"id": "id2", "answers": {"answer_start": [0], "text": ["Denver Broncos"]}},
]
results_qa = squad_metric.compute(predictions=predictions_qa, references=references_qa)
print(f"SQuAD Metrics: {results_qa}")
# Output: {'exact_match': 50.0, 'f1': ...} - the second example earns partial F1 credit via token overlap

# --- Example: ROUGE (Summarization/Generation) ---
rouge_metric = evaluate.load("rouge")
predictions_gen = ["the cat sat on the mat"]
references_gen = ["the cat was on the mat"]
results_rouge = rouge_metric.compute(predictions=predictions_gen, references=references_gen)
print(f"ROUGE Metrics: {results_rouge}")
# Output: {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}

# --- Example: BLEU (Translation/Generation) ---
bleu_metric = evaluate.load("bleu")
predictions_bleu = ["the cat sat on the mat"]
references_bleu = [["the cat was on the mat", "there was a cat on the mat"]]  # List of lists for references
results_bleu = bleu_metric.compute(predictions=predictions_bleu, references=references_bleu)
print(f"BLEU Metric: {results_bleu}")
# Output: {'bleu': ..., 'precisions': [...], ...}
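The perplexity and BERTScore metrics mentioned above follow the same compute() pattern. A short sketch; the model_id and texts are placeholders (the merged fine-tuned model directory could be passed as model_id instead), and evaluate downloads the scoring models on first use. Python
import evaluate

# --- Example: Perplexity ---
perplexity_metric = evaluate.load("perplexity", module_type="metric")
texts = ["Gemma 3 supports context windows of up to 128K tokens."]
results_ppl = perplexity_metric.compute(predictions=texts, model_id="gpt2")  # placeholder model_id
print(f"Mean perplexity: {results_ppl['mean_perplexity']}")

# --- Example: BERTScore ---
bertscore_metric = evaluate.load("bertscore")
results_bs = bertscore_metric.compute(predictions=["the cat sat on the mat"],
                                      references=["the cat was on the mat"],
                                      lang="en")
print(f"BERTScore F1: {results_bs['f1']}")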
⚡ 7.6. Evaluating Knowledge Base Fine-tuning
Evaluating models fine-tuned on large KBs requires assessing both fluency/coherence and knowledge retention/application.
-
Generation Metrics: Use ROUGE and BLEU to compare generated text (e.g., summaries of KB content, answers based on KB) against reference texts.51
-
Perplexity: Calculate perplexity on a held-out portion of the KB text or domain-specific text. A lower perplexity compared to the base model suggests better adaptation to the domain’s language patterns.131
-
Domain-Specific QA: If possible, create or use an existing QA dataset relevant to the KB domain. Evaluate the fine-tuned model using EM/F1 scores on this dataset.
-
General Capability Benchmarks: To assess catastrophic forgetting, consider evaluating the fine-tuned model on broad benchmarks like GLUE or SuperGLUE.129 A significant drop in performance compared to the base model indicates forgetting of general language understanding abilities. However, running these benchmarks locally can be resource-intensive. Qualitative checks on simple, general knowledge questions can provide a basic indicator.
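As a concrete illustration of the domain-specific QA check, the sketch below generates answers with the fine-tuned model and scores them with the squad metric. It assumes model and tokenizer are already loaded as in Chapter 5 and that eval_examples is an iterable of dicts with id, question, context, and answers fields; the prompt wording is likewise an assumption to adapt to your own evaluation set. Python
import torch
import evaluate

squad_metric = evaluate.load("squad")
predictions, references = [], []

# eval_examples: an iterable of dicts with "id", "question", "context", "answers" (assumed format)
for ex in eval_examples:
    messages = [{"role": "user",
                 "content": f"Answer using the context.\n\nContext:\n{ex['context']}\n\nQuestion:\n{ex['question']}"}]
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                           return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(inputs, max_new_tokens=64)
    answer = tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True).strip()
    predictions.append({"id": ex["id"], "prediction_text": answer})
    references.append({"id": ex["id"], "answers": ex["answers"]})

print(squad_metric.compute(predictions=predictions, references=references))  # {'exact_match': ..., 'f1': ...}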
⚡ 7.7. Analyzing Results
-
Compare to Baseline: Always compare the fine-tuned model’s scores against the base model’s performance on the same evaluation set.
-
Track Hyperparameter Effects: Analyze how changes in learning rate, LoRA rank, epochs, etc., impact the evaluation metrics.
-
Identify Error Patterns: Don’t just rely on scores. Manually inspect model outputs where it performs poorly to understand the types of errors (e.g., factual inaccuracies, poor reasoning, stylistic issues).
-
Assess Trade-offs: Evaluate the balance between improvement on the target task and potential degradation on general tasks (catastrophic forgetting).
⚡ 7.8. Evaluation Insights
Selecting the right evaluation metric is task-dependent, and no single metric provides a complete picture of performance.132 For QA tasks, F1 offers a more nuanced assessment than the strict EM score.70 For generative tasks based on knowledge bases, metrics like ROUGE (emphasizing recall of content) and BLEU (emphasizing precision of phrasing) provide complementary views.141 While perplexity measures the model’s fluency and fit to the data distribution, it doesn’t directly guarantee correctness or usefulness for the specific task.131 Therefore, a combination of relevant automated metrics, supplemented by qualitative analysis, is generally required for a thorough evaluation. Furthermore, evaluating a fine-tuned model solely on its target task can be misleading. It’s crucial to consider the potential impact on the model’s general capabilities due to catastrophic forgetting.15 While PEFT methods like LoRA mitigate this compared to full fine-tuning 105, some degradation can still occur.104 Assessing this trade-off, perhaps through spot-checks on general knowledge or, if feasible, evaluation on broader benchmarks like GLUE/SuperGLUE 132, is essential to understand the full implications of fine-tuning for a given application.
⚡ Table 5: Key Evaluation Metrics for QA and Knowledge Tasks
Metric | Description | Primary Use Case(s) | Hugging Face evaluate ID |
---|---|---|---|
Exact Match | % of predictions exactly matching the reference answer string. | Extractive QA | exact_match (or in squad) |
F1 Score | Harmonic mean of token-level precision and recall vs. reference answer. | Extractive QA | f1 (or in squad) |
SQuAD | Combines Exact Match and F1 score for SQuAD-formatted data. | Extractive QA | squad |
ROUGE | N-gram/LCS overlap (recall-oriented) between prediction and reference. | Summarization, Gen. QA | rouge |
BLEU | N-gram overlap (precision-oriented) between prediction and reference(s). | Translation, Generation | bleu |
Perplexity | Measure of how well the model predicts the test sequence (lower is better). | Language Modeling Fit | perplexity |
BERTScore | Semantic similarity based on contextual embeddings. | Generation Quality | bertscore |
Accuracy | % of correct predictions (e.g., for classification-based eval). | Classification | accuracy |
Precision | TP / (TP + FP) | Classification | precision |
Recall | TP / (TP + FN) | Classification | recall |
🌟 8. Chapter 7: Troubleshooting and Best Practices
Fine-tuning large models locally, even with PEFT, often involves navigating various challenges. This chapter outlines common problems, their mitigation strategies, and best practices for a smoother experience.
⚡ 8.1. Common Fine-Tuning Problems and Mitigation
- Out-of-Memory (OOM) Errors:
- Symptoms: Training crashes with CUDA OOM errors.
- Causes: Model size, activations, gradients, or optimizer states exceed available GPU VRAM.24
- Mitigation:
-
Use QLoRA: Quantize the base model to 4-bits (load_in_4bit=True, BitsAndBytesConfig).29 This is the most impactful step.
-
Reduce Batch Size: Lower per_device_train_batch_size (e.g., to 1 or 2).1
-
Increase Gradient Accumulation: Raise gradient_accumulation_steps to maintain effective batch size.1
-
Enable Gradient Checkpointing: Set gradient_checkpointing=True in TrainingArguments to trade compute for memory.120
-
Use Paged Optimizers: Employ optim=“paged_adamw_32bit” (or _8bit) to prevent OOM during optimizer updates, especially with gradient checkpointing.29
-
Use Mixed Precision: Enable bf16=True (preferred on RTX 4090) or fp16=True in TrainingArguments.120
-
Reduce max_seq_length: Shorter sequences consume less memory for activations.
-
Choose Smaller Model: If feasible, start with Gemma 3 4B before attempting 12B.
- Overfitting:
- Symptoms: Training loss decreases, but validation loss stagnates or increases. Poor performance on unseen data.15
- Causes: Model learns training data too well, including noise; insufficient or non-diverse training data; training for too long. Common with small fine-tuning datasets.15
- Mitigation:
-
Reduce Training Time: Decrease num_train_epochs.13
-
Early Stopping: Monitor validation loss (requires eval_dataset and evaluation_strategy) and stop training when it stops improving (use load_best_model_at_end=True in TrainingArguments).19
-
Regularization: Add weight_decay (e.g., 0.01) to TrainingArguments.19
-
LoRA Dropout: Increase lora_dropout in LoraConfig.1
-
Data Augmentation: Increase the diversity and size of the training set if it’s small.17
-
Reduce Model Capacity: Use a smaller LoRA rank (r).
- Catastrophic Forgetting:
- Symptoms: Model performs well on the fine-tuning task but loses its general knowledge and performs poorly on tasks it could handle before fine-tuning.15
- Causes: Fine-tuning updates weights optimized for the new task, overwriting knowledge learned during pre-training. More severe with full fine-tuning.
- Mitigation:
-
Use PEFT (LoRA/QLoRA): This is the primary defense, as the base model weights are frozen.105
-
Limit Training: Use fewer num_train_epochs and a smaller learning_rate.
-
Rehearsal: Mix some general pre-training data or data from previous tasks with the current fine-tuning data (can be complex to implement).64
-
Incremental/Continual Learning Techniques: More advanced methods like Elastic Weight Consolidation (EWC) or specific adapter strategies aim to preserve old knowledge while learning new tasks (beyond the scope of this basic tutorial).64
- Training Instability (NaN Loss, Slow/No Convergence):
- Symptoms: Loss becomes NaN (Not a Number), loss explodes, or loss fails to decrease meaningfully.
- Causes: Learning rate too high; numerical issues with mixed precision (especially FP16); incorrect data formatting/tokenization; gradient explosion.
- Mitigation:
-
Lower Learning Rate: Reduce learning_rate significantly.
-
Use Gradient Clipping: Set max_grad_norm in TrainingArguments (e.g., 1.0 or 0.3) to prevent exploding gradients.115
-
Check Data: Verify data loading, preprocessing, and tokenization steps. Ensure correct formatting and no unexpected values.
-
Use BF16: If using mixed precision, prefer bf16=True over fp16=True on compatible hardware (RTX 4090 supports BF16) as it’s more stable.122
-
Try Different Optimizer: Experiment with adamw_torch vs. paged variants.
-
Warmup: Ensure a sufficient number of warmup_steps or adequate warmup_ratio is used.
- Hardware/Driver/CUDA Issues:
- Symptoms: Errors during library import, model loading, or training related to CUDA, cuDNN, NCCL, or driver mismatches. System instability.40
- Mitigation:
-
Verify Environment: Double-check all installation steps from Chapter 1. Ensure driver version, CUDA toolkit version, and PyTorch CUDA version are compatible and correctly detected (nvidia-smi, nvcc --version, torch.cuda.is_available()); a quick check script is sketched after this list.
-
Reinstall Libraries: Sometimes reinstalling PyTorch or other CUDA-dependent libraries within the correct environment can resolve issues.
-
Check System Resources: Ensure sufficient disk space and RAM (besides VRAM).
-
Consult Forums: Search developer forums (NVIDIA, PyTorch, Hugging Face) and GitHub issues for error messages specific to your hardware (RTX 4090), OS (Debian 12), and library versions.25
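A quick way to run the environment check from the "Verify Environment" item above, inside the conda environment, is a few lines of plain PyTorch (nothing Gemma-specific): Python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version PyTorch was built with:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("BF16 supported:", torch.cuda.is_bf16_supported())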
⚡ 8.2. Best Practices for Gemma 3 Fine-Tuning on RTX 4090/Debian
1. Start Small: Debug your entire pipeline (data loading, formatting, training loop, evaluation) with a small model variant (Gemma 3 1B if suitable, or 4B) and a tiny subset of your data (e.g., 100-1000 examples) before attempting large-scale runs.67
2. Embrace QLoRA: For 4B/12B models on 24GB VRAM, QLoRA is essential. Configure BitsAndBytesConfig correctly (NF4, Double Quantization, BF16 compute dtype).29
3. Use Gradient Checkpointing: Enable gradient_checkpointing=True in TrainingArguments to conserve VRAM, accepting the trade-off in training speed.120
4. Leverage Gradient Accumulation: Use gradient_accumulation_steps to simulate larger batch sizes that wouldn't fit directly in memory.1
5. Employ Paged Optimizers: Use optim="paged_adamw_32bit" when using QLoRA and gradient checkpointing to avoid OOM errors during optimizer updates.29
6. Monitor VRAM: Keep an eye on GPU memory usage using watch -n 1 nvidia-smi during training setup and initial steps to ensure you are within limits (see also the in-script sketch after this list).
7. Apply Chat Templates Correctly: If fine-tuning an instruction-tuned (IT) Gemma model, meticulously format your data using tokenizer.apply_chat_template with the correct roles and add_generation_prompt setting.96
8. Tune Hyperparameters Systematically: Start with established defaults or recommendations (Table 4). Adjust learning rate, LoRA rank (r), and epochs iteratively, monitoring validation performance.13
9. Evaluate Rigorously: Use appropriate metrics for your task (Table 5). Evaluate on a held-out test set. Check for catastrophic forgetting if general capabilities are important [Chapter 6].
10. Version Everything: Use Git for your code. Consider tools like DVC or MLflow to track datasets, experiments, hyperparameters, and results for reproducibility.
11. Utilize Community Knowledge: Before reporting issues, search Hugging Face forums, relevant GitHub repositories (Transformers, PEFT, TRL, Gemma cookbooks), and communities like Reddit (r/LocalLLaMA) for solutions related to Gemma, QLoRA, RTX 4090, or Debian setups.24
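Complementing watch -n 1 nvidia-smi from item 6, peak usage can also be logged from inside the training script, for example after model loading and after the first few steps. A small sketch using PyTorch's built-in counters: Python
import torch

def log_vram(tag: str) -> None:
    """Print current and peak GPU memory for device 0 in GiB."""
    gib = 1024 ** 3
    print(f"[{tag}] allocated: {torch.cuda.memory_allocated() / gib:.2f} GiB | "
          f"reserved: {torch.cuda.memory_reserved() / gib:.2f} GiB | "
          f"peak: {torch.cuda.max_memory_allocated() / gib:.2f} GiB")

# e.g. call log_vram("after model load") and log_vram("after first step")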
⚡ 8.3. Troubleshooting Insights
Fine-tuning models like Gemma 3 locally on consumer hardware like the RTX 4090 transforms the task into a significant exercise in resource management. Out-of-Memory (OOM) errors are arguably the most common and frustrating hurdle developers face.24 Successfully fitting even the 12B parameter Gemma 3 model within the 24GB VRAM requires a combination of techniques. While PEFT methods like LoRA significantly reduce the risk of catastrophic forgetting compared to full fine-tuning 105, they don’t eliminate it entirely.104 If preserving the model’s broad pre-trained knowledge is critical alongside specialization, developers must remain vigilant. This involves careful hyperparameter tuning (lower learning rates, fewer epochs) and potentially exploring more advanced continual learning strategies or data mixing (rehearsal) if significant forgetting is observed during evaluation on general benchmarks or tasks.64 The choice depends on whether the application demands a highly specialized model or one that retains more general versatility.
⚡ Table 6: Common Fine-Tuning Issues and Mitigation
Issue | Symptoms | Potential Causes | Mitigation Strategies |
---|---|---|---|
Out-of-Memory (OOM) | Training crashes (CUDA OOM error). | Batch size too large, sequence length too long, model size, optimizer states. | Use QLoRA (4-bit), reduce per_device_train_batch_size, increase gradient_accumulation_steps, enable gradient_checkpointing, use paged_adamw, use bf16, reduce max_seq_length. |
Overfitting | Training loss ↓, Validation loss ↑/stagnates. Poor test perf. | Training too long, learning rate too high, small/non-diverse dataset, high model capacity. | Reduce num_train_epochs, use Early Stopping, add weight_decay, increase lora_dropout, data augmentation, decrease lora_r. |
Catastrophic Forgetting | Good task performance, poor general knowledge/performance. | Overwriting pre-trained weights/knowledge during fine-tuning. | Use PEFT (LoRA/QLoRA), lower learning_rate, reduce num_train_epochs, Rehearsal (mix data), Continual Learning techniques (advanced). |
Instability / NaN Loss | Loss becomes NaN or explodes, slow/no convergence. | Learning rate too high, numerical issues (FP16), data format errors, gradient explosion. | Lower learning_rate, use bf16 instead of fp16, check data preprocessing/tokenization, use gradient clipping (max_grad_norm), try different optimizer, ensure sufficient warmup. |
Setup/Compatibility | Errors during library import, model load, or runtime. | Incorrect driver/CUDA/PyTorch versions, environment issues. | Verify all installations (Chapter 1), check compatibility, reinstall libraries in conda env, consult forums/GitHub issues for specific hardware/OS/library errors.40 |
🌟 9. Conclusion
⚡ 9.1. Summary of the Process
This guide has provided a comprehensive walkthrough for fine-tuning Google’s Gemma 3 models locally on a Debian 12 system equipped with an NVIDIA RTX 4090 GPU. We covered the essential stages:
1. Environment Setup: Configuring the operating system, NVIDIA drivers, CUDA Toolkit, Python environment (Miniconda), PyTorch, and the Hugging Face ecosystem libraries (transformers, datasets, accelerate, evaluate, peft, bitsandbytes, trl).
2. Understanding Gemma 3: Detailing its architecture, capabilities (multimodality, context length, multilingualism), available sizes, and access via Hugging Face.
3. Data Preparation: Discussing specific formatting requirements for Question-Answering (SQuAD-like) and strategies for handling large Knowledge Bases (chunking techniques), emphasizing the critical role of chat templates for instruction-tuned models.
4. Fine-Tuning Configuration: Explaining the necessity of Parameter-Efficient Fine-Tuning (PEFT), diving deep into LoRA and QLoRA, tuning key hyperparameters, and applying crucial memory optimization techniques (gradient checkpointing, accumulation, mixed precision, paged optimizers).
5. Execution: Providing a runnable Python script utilizing the SFTTrainer from trl to execute the QLoRA fine-tuning process, including monitoring and checkpointing.
6. Evaluation: Defining relevant metrics for QA (EM, F1) and KB/Generation tasks (ROUGE, BLEU, Perplexity) and demonstrating their computation using the evaluate library.
7. Troubleshooting & Best Practices: Addressing common issues like OOM errors, overfitting, catastrophic forgetting, and instability, offering mitigation strategies and summarizing best practices for this specific setup.
⚡ 9.2. Key Takeaways
-
Feasibility: Fine-tuning powerful models like Gemma 3 (specifically 4B and potentially 12B variants) locally on consumer hardware like the RTX 4090 is achievable, but it heavily relies on advanced optimization techniques.
-
QLoRA is Essential: Due to the 24GB VRAM constraint, QLoRA (4-bit quantization combined with LoRA) is practically mandatory for handling Gemma 3 models beyond the smallest sizes.
-
Memory Management is Key: Success hinges on effectively managing GPU memory using a combination of QLoRA, gradient checkpointing, gradient accumulation, paged optimizers, and potentially mixed precision (BF16).
-
Configuration Matters: Careful configuration of the environment (driver/CUDA/PyTorch versions), data preparation (chunking, chat templates), PEFT parameters (LoraConfig), and training arguments (TrainingArguments) is critical to avoid errors and achieve good results.
-
Iterative Process: Fine-tuning is often iterative, involving experimentation with hyperparameters, data formatting, and evaluation to find the optimal balance between task performance, generalization, and resource constraints.
⚡ 9.3. Potential Next Steps
Building upon this foundation, practitioners can explore several avenues:
-
Experiment with Model Sizes: Compare the performance and resource usage of fine-tuning Gemma 3 4B versus 12B (if feasible with further optimization).
-
Advanced PEFT Methods: Investigate other PEFT techniques available in the peft library, such as DoRA (Weight-Decomposed Low-Rank Adaptation) 112 or different initialization strategies like LoftQ.108
-
Combine with RAG: Explore hybrid approaches where a fine-tuned Gemma 3 model is used in conjunction with a Retrieval-Augmented Generation system for tasks requiring both deep domain adaptation and access to external, up-to-date information.11
-
Deployment: For deploying the fine-tuned model, consider merging the LoRA adapters into the base model weights for standalone inference using peft’s merge_and_unload() functionality.108 Explore serving frameworks optimized for LLMs.
-
Deeper Evaluation: Conduct more extensive evaluations, including human assessment, analysis of specific failure modes, and testing against broader benchmarks to better understand model capabilities and limitations.
⚡ 9.4. Further Resources
-
Gemma Official Documentation: https://ai.google.dev/gemma/docs 5
-
Gemma on Hugging Face: https://huggingface.co/google (Browse for specific model cards) 2
-
Hugging Face Transformers: https://huggingface.co/docs/transformers
-
Hugging Face PEFT: https://huggingface.co/docs/peft 52
-
Hugging Face TRL: https://huggingface.co/docs/trl 48
-
Hugging Face Accelerate: https://huggingface.co/docs/accelerate
🔧 Works cited
1. Fine-Tune Gemma 3: A Step-by-Step Guide With Financial Q&A …, accessed on April 16, 2025, https://www.datacamp.com/tutorial/fine-tune-gemma-3 2. google/gemma-3-1b-it · Hugging Face, accessed on April 16, 2025, https://huggingface.co/google/gemma-3-1b-it 3. How to fine-tune Google Gemma with ChatML and Hugging Face TRL - Philschmid, accessed on April 16, 2025, https://www.philschmid.de/fine-tune-google-gemma 4. Introducing Gemma 3: The Developer Guide, accessed on April 16, 2025, https://developers.googleblog.com/en/introducing-gemma3/ 5. Gemma 3 model overview | Google AI for Developers, accessed on April 16, 2025, https://ai.google.dev/gemma/docs/core 6. Gemma 3 for Beginners: An Introduction to Google’s Open-Source AI - Hugging Face, accessed on April 16, 2025, https://huggingface.co/blog/proflead/gemma-3-tutorial 7. Welcome Gemma 3: Google’s all new multimodal, multilingual, long …, accessed on April 16, 2025, https://huggingface.co/blog/gemma3 8. Use Gemma open models | Generative AI on Vertex AI - Google Cloud, accessed on April 16, 2025, https://cloud.google.com/vertex-ai/generative-ai/docs/open-models/use-gemma 9. How to fine-tune open LLMs in 2025 with Hugging Face - Philschmid, accessed on April 16, 2025, https://www.philschmid.de/fine-tune-llms-in-2025 10. LLM Fine Tuning Best Practices - Codoid, accessed on April 16, 2025, https://codoid.com/ai/llm-fine-tuning-best-practices/ 11. Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately - arXiv, accessed on April 16, 2025, https://arxiv.org/pdf/2402.01722 12. Best way to add knowledge to a llm : r/LocalLLaMA - Reddit, accessed on April 16, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1ao2bzu/best_way_to_add_knowledge_to_a_llm/ 13. Fine-tuning large language models (LLMs) in 2025 - SuperAnnotate, accessed on April 16, 2025, https://www.superannotate.com/blog/llm-fine-tuning 14. LLM Fine-Tuning: Guide to HITL & Best Practices, accessed on April 16, 2025, https://llmmodels.org/blog/llm-fine-tuning-guide-to-hitl-and-best-practices/ 15. Guide to Fine Tuning LLMs: Methods & Best Practices - Ema, accessed on April 16, 2025, https://www.ema.co/additional-blogs/addition-blogs/guide-to-fine-tuning-llms-methods-and-best-practices 16. Fine-Tuning Large Language Models - Analytics Vidhya, accessed on April 16, 2025, https://www.analyticsvidhya.com/blog/2023/08/fine-tuning-large-language-models/ 17. Introduction to Fine-Tuning Theory in Large Language Models (LLMs) - Medium, accessed on April 16, 2025, https://medium.com/aidetic/introduction-to-fine-tuning-theory-in-large-language-models-llms-e45eab78f659 18. The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities (Version 1.0) - arXiv, accessed on April 16, 2025, https://arxiv.org/html/2408.13296v1 19. Fine-Tuning LLMs: Expert Guide to Task-Specific AI Models - Rapid Innovation, accessed on April 16, 2025, https://www.rapidinnovation.io/post/for-developers-step-by-step-guide-to-fine-tuning-llms-for-specific-tasks 20. Tuning LLMs by RAG Principles: Towards LLM-native Memory - arXiv, accessed on April 16, 2025, https://arxiv.org/html/2503.16071 21. Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs - ACL Anthology, accessed on April 16, 2025, https://aclanthology.org/2024.emnlp-main.15.pdf 22. 
Retrieval-Augmented Generation for Large Language Models: A Survey - arXiv, accessed on April 16, 2025, https://arxiv.org/pdf/2312.10997 23. Improving Retrieval for RAG based Question Answering Models on Financial Documents - arXiv, accessed on April 16, 2025, https://arxiv.org/pdf/2404.07221 24. How much memory for finetuning · Issue #155 · octo-models/octo - GitHub, accessed on April 16, 2025, https://github.com/octo-models/octo/issues/155 25. Performance Issue with RTX 4090 and all SD/Diffusers versions #952 - GitHub, accessed on April 16, 2025, https://github.com/huggingface/diffusers/issues/952 26. Mistral NeMo - Hacker News, accessed on April 16, 2025, https://news.ycombinator.com/item?id=40996058 27. Mac Studio with 192GB still the best option for a local LLM <$10k? : r/LocalLLaMA - Reddit, accessed on April 16, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1ckoyn4/mac_studio_with_192gb_still_the_best_option_for_a/ 28. A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques, accessed on April 16, 2025, https://arxiv.org/html/2406.04879v1 29. arxiv.org, accessed on April 16, 2025, https://arxiv.org/pdf/2305.14314 30. My experience on starting with fine tuning LLMs with custom data : r/LocalLLaMA - Reddit, accessed on April 16, 2025, https://www.reddit.com/r/LocalLLaMA/comments/14vnfh2/my_experience_on_starting_with_fine_tuning_llms/ 31. NVIDIA install guide - Linux.org, accessed on April 16, 2025, https://www.linux.org/threads/nvidia-install-guide.48421/ 32. NvidiaGraphicsDrivers - Debian Wiki, accessed on April 16, 2025, https://wiki.debian.org/NvidiaGraphicsDrivers 33. How to Install CUDA Toolkit on Debian 12, 11, or 10 - LinuxCapable, accessed on April 16, 2025, https://linuxcapable.com/how-to-install-cuda-on-debian-linux/ 34. How to Configure the NVIDIA vGPU Drivers, CUDA Toolkit and Container Toolkit on Debian 12 | The Virtual Horizon, accessed on April 16, 2025, https://thevirtualhorizon.com/2024/05/31/how-to-configure-the-nvidia-vgpu-drivers-cuda-toolkit-and-container-toolkit-on-debian-12/ 35. Simple Debian, CUDA & Pytorch setup - LocalLLaMA - Reddit, accessed on April 16, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1jvkr9r/simple_debian_cuda_pytorch_setup/ 36. 1. Introduction — NVIDIA Driver Installation Guide r570 documentation, accessed on April 16, 2025, https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html 37. How to install CUDA and run Pytorch on Linux - GitHub Gist, accessed on April 16, 2025, https://gist.github.com/tranctan/7136955aaf2a1457301b68ed2b2ea4d4 38. 1. Introduction — Installation Guide for Linux 12.8 documentation, accessed on April 16, 2025, https://docs.nvidia.com/cuda/cuda-installation-guide-linux/ 39. Start Locally | PyTorch, accessed on April 16, 2025, https://pytorch.org/get-started/locally/ 40. Compatibility 4090 + Cuda + Pytorch - Reddit, accessed on April 16, 2025, https://www.reddit.com/r/CUDA/comments/1bzs18i/compatibility_4090_cuda_pytorch/ 41. Pytorch not working with Nvidia 4090, accessed on April 16, 2025, https://discuss.pytorch.org/t/pytorch-not-working-with-nvidia-4090/173054 42. Debian 12 - Installing Nvidia Cuda Toolkit following outlined instructions has made external displays stop working after reboot, accessed on April 16, 2025, https://forums.developer.nvidia.com/t/debian-12-installing-nvidia-cuda-toolkit-following-outlined-instructions-has-made-external-displays-stop-working-after-reboot/285468 43. 
How to Install PyTorch on Debian 12 - Shapehost, accessed on April 16, 2025, https://shape.host/resources/how-to-install-pytorch-on-debian-12 44. How to Install Anaconda on Debian 12 | Vultr Docs, accessed on April 16, 2025, https://docs.vultr.com/how-to-install-anaconda-on-debian-12 45. Installing Miniconda - Anaconda, accessed on April 16, 2025, https://www.anaconda.com/docs/getting-started/miniconda/install 46. Miniconda - Anaconda, accessed on April 16, 2025, https://www.anaconda.com/docs/getting-started/miniconda/main 47. Anaconda Documentation - Anaconda, accessed on April 16, 2025, https://docs.anaconda.com/free/miniconda/miniconda-install/ 48. Installation - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/installation 49. Accelerate - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/accelerate 50. Evaluate Metric - Hugging Face, accessed on April 16, 2025, https://huggingface.co/evaluate-metric 51. Choosing a metric for your task - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/evaluate/choosing_a_metric 52. Load adapters with PEFT - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/v4.44.0/peft 53. PEFT - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/peft 54. huggingface/peft: PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. - GitHub, accessed on April 16, 2025, https://github.com/huggingface/peft 55. Fine-tuning Gemma 3 on a Custom Dataset With Firecrawl and …, accessed on April 16, 2025, https://www.firecrawl.dev/blog/gemma-3-fine-tuning-firecrawl-unsloth 56. How to Fine-Tune LLMs in 2024 with Hugging Face - Philschmid, accessed on April 16, 2025, https://www.philschmid.de/fine-tune-llms-in-2024-with-trl 57. google/gemma-2-2b-it - Hugging Face, accessed on April 16, 2025, https://huggingface.co/google/gemma-2-2b-it 58. google/gemma-2b-it - Hugging Face, accessed on April 16, 2025, https://huggingface.co/google/gemma-2b-it 59. Gemma 3 - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/main/model_doc/gemma3 60. unsloth/gemma-3-1b-it - Hugging Face, accessed on April 16, 2025, https://huggingface.co/unsloth/gemma-3-1b-it 61. gemma-cookbook/Workshops/Workshop_How_to_Fine_tuning_Gemma.ipynb at main, accessed on April 16, 2025, https://github.com/google-gemini/gemma-cookbook/blob/main/Workshops/Workshop_How_to_Fine_tuning_Gemma.ipynb 62. Tokenizers - Hugging Face LLM Course, accessed on April 16, 2025, https://huggingface.co/learn/llm-course/chapter2/4 63. Summary of the tokenizers - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/tokenizer_summary 64. All You Need to Know About LLM Fine-Tuning (Part 2) | Akaike Ai, accessed on April 16, 2025, https://www.akaike.ai/resources/all-you-need-to-know-about-llm-fine-tuning-part-2 65. Supervised Fine-Tuning - Hugging Face LLM Course, accessed on April 16, 2025, https://huggingface.co/learn/nlp-course/en/chapter11/1 66. llm-research-summaries/training/ultimate-guide-fine-tuning-llm_parthasarathy-2408.13296.md at main - GitHub, accessed on April 16, 2025, https://github.com/cognitivetech/llm-research-summaries/blob/main/training/ultimate-guide-fine-tuning-llm_parthasarathy-2408.13296.md 67. Question answering - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/tasks/question_answering 68. 
Introduction to Fine-tuning Large Language Models - Stephen Diehl, accessed on April 16, 2025, https://www.stephendiehl.com/posts/training_llms/ 69. Question answering - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/v4.22.2/en/tasks/question_answering 70. How to Evaluate LLMs - KDnuggets, accessed on April 16, 2025, https://www.kdnuggets.com/how-to-evaluate-llms 71. SQuAD - a Hugging Face Space by evaluate-metric, accessed on April 16, 2025, https://huggingface.co/spaces/evaluate-metric/squad 72. README.md · rajpurkar/squad_v2 at main - Hugging Face, accessed on April 16, 2025, https://huggingface.co/datasets/rajpurkar/squad_v2/blob/main/README.md 73. Question answering - Hugging Face NLP Course, accessed on April 16, 2025, https://huggingface.co/learn/nlp-course/chapter7/7 74. What’s the data format of the QA json file in official scripts - Hugging Face Forums, accessed on April 16, 2025, https://discuss.huggingface.co/t/whats-the-data-format-of-the-qa-json-file-in-official-scripts/32079 75. huggingface datasets - Formatting question/answer data for Hugging Face - Stack Overflow, accessed on April 16, 2025, https://stackoverflow.com/questions/77776402/formatting-question-answer-data-for-hugging-face 76. Fine-Tuning LLMs: A Guide With Examples - DataCamp, accessed on April 16, 2025, https://www.datacamp.com/tutorial/fine-tuning-large-language-models 77. Text Chunking In Python Techniques | Restackio, accessed on April 16, 2025, https://www.restack.io/p/text-chunking-answer-python-techniques-cat-ai 78. Mastering RAG: Advanced Chunking Techniques for LLM Applications - Galileo AI, accessed on April 16, 2025, https://www.galileo.ai/blog/mastering-rag-advanced-chunking-techniques-for-llm-applications 79. Chunking Technique For Feeding LLM Long Text - Unlocked LLM, accessed on April 16, 2025, https://www.unlockedllm.com/publication/chunking-technique-for-feeding-llm-long-text/ 80. Chunking Strategies for LLM Applications - Pinecone, accessed on April 16, 2025, https://www.pinecone.io/learn/chunking-strategies/ 81. Breaking up is hard to do: Chunking in RAG applications - Stack Overflow, accessed on April 16, 2025, https://stackoverflow.blog/2024/12/27/breaking-up-is-hard-to-do-chunking-in-rag-applications/ 82. The Ultimate Guide to Chunking Strategies for RAG Applications with Databricks, accessed on April 16, 2025, https://community.databricks.com/t5/technical-blog/the-ultimate-guide-to-chunking-strategies-for-rag-applications/ba-p/113089 83. ibmdotcom-tutorials/generative-ai/rag-chunking-strategies.ipynb at main - GitHub, accessed on April 16, 2025, https://github.com/IBM/ibmdotcom-tutorials/blob/main/generative-ai/rag-chunking-strategies.ipynb 84. Chunking strategies for RAG tutorial using Granite - IBM, accessed on April 16, 2025, https://www.ibm.com/think/tutorials/chunking-strategies-for-rag-with-langchain-watsonx-ai 85. Chunking techniques with Langchain and LlamaIndex - LanceDB Blog, accessed on April 16, 2025, https://blog.lancedb.com/chunking-techniques-with-langchain-and-llamaindex/ 86. How to Choose the Right Chunking Strategy for Your LLM Application | MongoDB, accessed on April 16, 2025, https://www.mongodb.com/developer/products/atlas/choosing-chunking-strategy-rag/ 87. Chunking strategies for RAG applications - Amazon Bedrock Recipes - GitHub Pages, accessed on April 16, 2025, https://aws-samples.github.io/amazon-bedrock-samples/rag/open-source/chunking/rag_chunking_strategies_langchain_bedrock/ 88. 
Text Splitters | 🦜️ LangChain, accessed on April 16, 2025, https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/
89. How to Chunk Text in JavaScript for Your RAG Application | DataStax, accessed on April 16, 2025, https://www.datastax.com/blog/how-to-chunk-text-in-javascript-for-rag-applications
90. azure-search-vector-samples/demo-python/code/data-chunking/langchain-data-chunking-example.ipynb at main - GitHub, accessed on April 16, 2025, https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/data-chunking/langchain-data-chunking-example.ipynb
91. LangChain Text Splitters for Beginners: Easy Text Chunking Tutorial - YouTube, accessed on April 16, 2025, https://www.youtube.com/watch?v=6PqlvaXbTTE
92. How to split text based on semantic similarity | 🦜️ LangChain, accessed on April 16, 2025, https://python.langchain.com/docs/how_to/semantic-chunker/
93. Chunking Idea: Summarize Chunks for better retrieval : r/LangChain - Reddit, accessed on April 16, 2025, https://www.reddit.com/r/LangChain/comments/1bbdgpj/chunking_idea_summarize_chunks_for_better/
94. [Question]: How to chunk text for structured LLM data extraction · Issue #18240 · run-llama/llama_index - GitHub, accessed on April 16, 2025, https://github.com/run-llama/llama_index/issues/18240
95. Semantic Chunker - LlamaIndex, accessed on April 16, 2025, https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/
96. Templates - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/main/chat_templating
97. SFT Trainer and chat templates - Beginners - Hugging Face Forums, accessed on April 16, 2025, https://discuss.huggingface.co/t/sft-trainer-and-chat-templates/147205
98. google/gemma-7b-it · Fix chat template does not compatible with ConversationalPipeline, accessed on April 16, 2025, https://huggingface.co/google/gemma-7b-it/discussions/42
99. Data Augmentation is Dead, Long Live Data Augmentation - arXiv, accessed on April 16, 2025, https://arxiv.org/html/2402.14895v1
100. Neural Data-to-Text Generation Based on Small Datasets: Comparing the Added Value of Two Semi-Supervised Learning Approaches on Top of a Large Language Model - MIT Press Direct, accessed on April 16, 2025, https://direct.mit.edu/coli/article/49/3/555/116158/Neural-Data-to-Text-Generation-Based-on-Small
101. Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance - Hugging Face, accessed on April 16, 2025, https://huggingface.co/blog/chakravarthik27/boost-nlp-models-with-automated-data-augmentation
102. Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations - ACL Anthology, accessed on April 16, 2025, https://aclanthology.org/2023.emnlp-main.647.pdf
103. Revealing Trends in Datasets from the 2022 ACL and EMNLP Conferences - arXiv, accessed on April 16, 2025, https://arxiv.org/html/2404.08666v1
104. Fine-Tuning LLMs: Overcoming Catastrophic Forgetting - Yurts AI, accessed on April 16, 2025, https://www.yurts.ai/blog/navigating-the-challenges-of-fine-tuning-and-catastrophic-forgetting
105. CURLoRA: Stable LLM Continual Fine-Tuning and Catastrophic Forgetting Mitigation - arXiv, accessed on April 16, 2025, https://arxiv.org/html/2408.14572v1
106. Investigating the Catastrophic Forgetting in Multimodal Large Language Models - arXiv, accessed on April 16, 2025, https://arxiv.org/pdf/2309.10313
107. [D] LLMs are known for catastrophic forgetting during continual fine-tuning - Reddit, accessed on April 16, 2025, https://www.reddit.com/r/MachineLearning/comments/1akd287/d_llms_are_known_for_catastrophic_forgetting/
108. LoRA - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/peft/main/conceptual_guides/lora
109. PEFT: Parameter-Efficient Fine-Tuning Methods for LLMs - Hugging Face, accessed on April 16, 2025, https://huggingface.co/blog/samuellimabraz/peft-methods
110. LoRA - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/peft/main/developer_guides/lora
111. LoRA vs Full Fine-tuning: An Illusion of Equivalence - arXiv, accessed on April 16, 2025, https://arxiv.org/html/2410.21228v1
112. LoRA - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/peft/v0.9.0/developer_guides/lora
113. LoRA - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/peft/package_reference/lora
114. How to vision fine-tune the Gemma3 using custom data collator on …, accessed on April 16, 2025, https://github.com/unslothai/unsloth/issues/2122
115. LLM Fine Tuning Parameters - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/autotrain/llm_finetuning_params
116. Prompt Tuning vs. Fine-Tuning—Differences, Best Practices, and Use Cases | Nexla, accessed on April 16, 2025, https://nexla.com/ai-infrastructure/prompt-tuning-vs-fine-tuning/
117. Dataset Tokenization to Fine-Tune Gemma 3 1B - Beginners - Hugging Face Forums, accessed on April 16, 2025, https://discuss.huggingface.co/t/dataset-tokenization-to-fine-tune-gemma-3-1b/148877
118. Gemma3 minimal fine tuning example? · Issue #36714 · huggingface/transformers - GitHub, accessed on April 16, 2025, https://github.com/huggingface/transformers/issues/36714
119. Challenges of multi-task learning in LLM fine-tuning - IoT Tech News, accessed on April 16, 2025, https://iottechnews.com/news/challenges-of-multi-task-learning-in-llm-fine-tuning/
120. GPU - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/v4.42.0/perf_train_gpu_one
121. Messing around with fine-tuning LLMs, part 9 — gradient checkpointing - Giles' blog, accessed on April 16, 2025, https://www.gilesthomas.com/2024/09/fine-tuning-9
122. Performance and Scalability: How To Fit a Bigger Model and Train It Faster - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/v4.19.4/en/performance
123. mlabonne/llm-course: Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks. - GitHub, accessed on April 16, 2025, https://github.com/mlabonne/llm-course
124. Trainer and Accelerate - Transformers - Hugging Face Forums, accessed on April 16, 2025, https://discuss.huggingface.co/t/trainer-and-accelerate/26382
125. How to run single-node, multi-GPU training with HF Trainer and deepspeed?, accessed on April 16, 2025, https://discuss.huggingface.co/t/how-to-run-single-node-multi-gpu-training-with-hf-trainer-and-deepspeed/70342
126. Efficient Training on Multiple GPUs - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/transformers/en/perf_train_gpu_many
127. Multiple GPU in SFTTrainer - Beginners - Hugging Face Forums, accessed on April 16, 2025, https://discuss.huggingface.co/t/multiple-gpu-in-sfttrainer/91899
128. LLM evaluation: Why testing AI models matters - IBM, accessed on April 16, 2025, https://www.ibm.com/think/insights/llm-evaluation
129. LLM Evaluation: Top 10 Metrics and Benchmarks - Kolena, accessed on April 16, 2025, https://www.kolena.com/guides/llm-evaluation-top-10-metrics-and-benchmarks/
130. LLM Evaluation: Metrics, Frameworks, and Best Practices | SuperAnnotate, accessed on April 16, 2025, https://www.superannotate.com/blog/llm-evaluation-guide
131. LLM Evaluation: Metrics, Methodologies, Best Practices - DataCamp, accessed on April 16, 2025, https://www.datacamp.com/blog/llm-evaluation
132. LLM Benchmarks: Understanding Language Model Performance - Humanloop, accessed on April 16, 2025, https://humanloop.com/blog/llm-benchmarks
133. LLM Benchmarks for Comprehensive Model Evaluation - Data Science Dojo, accessed on April 16, 2025, https://datasciencedojo.com/blog/llm-benchmarks-for-evaluation/
134. Technical Approaches to LLM Evaluation for AI Applications | Adaline, accessed on April 16, 2025, https://www.adaline.ai/blog/technical-approaches-to-llm-evaluation-for-ai-applications
135. LLM evaluation benchmarks—a concise guide - Fabrity, accessed on April 16, 2025, https://fabrity.com/blog/llm-evaluation-benchmarks-a-concise-guide/
136. Guide to Evaluating Large Language Models: Metrics and Best Practices - Composio, accessed on April 16, 2025, https://composio.dev/blog/llm-evaluation-guide/
137. LLM Evaluation: Best Metrics & Tools - UBIAI, accessed on April 16, 2025, https://ubiai.tools/llm-evaluation-best-metrics-tools/
138. Exact Match - a Hugging Face Space by evaluate-metric, accessed on April 16, 2025, https://huggingface.co/spaces/evaluate-metric/exact_match
139. Define your evaluation metrics | Generative AI on Vertex AI - Google Cloud, accessed on April 16, 2025, https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval
140. The AI community building the future. - Hugging Face, accessed on April 16, 2025, https://huggingface.co/metrics
141. Mastering ROUGE Matrix: Your Guide to Large Language Model Evaluation for Summarization with Examples - DEV Community, accessed on April 16, 2025, https://dev.to/aws-builders/mastering-rouge-matrix-your-guide-to-large-language-model-evaluation-for-summarization-with-examples-jjg
142. LLM evaluations: Metrics, frameworks, and best practices | genai-research - Wandb, accessed on April 16, 2025, https://wandb.ai/onlineinference/genai-research/reports/LLM-evaluations-Metrics-frameworks-and-best-practices--VmlldzoxMTMxNjQ4NA
143. How to Evaluate LLMs Using Hugging Face Evaluate - Analytics …, accessed on April 16, 2025, https://www.analyticsvidhya.com/blog/2025/04/hugging-face-evaluate/
144. How to Use Hugging Face's New Evaluate Library - Vennify.ai, accessed on April 16, 2025, https://www.vennify.ai/hugging-face-evaluate-library/
145. A quick tour - Hugging Face, accessed on April 16, 2025, https://huggingface.co/docs/evaluate/a_quick_tour
146. How to specify additional parameters when using HuggingFace Evaluate's evaluate.combine() method? - Stack Overflow, accessed on April 16, 2025, https://stackoverflow.com/questions/78058655/how-to-specify-additional-parameters-when-using-huggingface-evaluates-evaluate
147. FLAN T5-XL Fine Tuning: A Comprehensive Guide - BytePlus, accessed on April 16, 2025, https://www.byteplus.com/en/topic/500592
148. A Primer on Fine-Tuning PaliGemma and VLMs - Datature, accessed on April 16, 2025, https://www.datature.io/blog/a-primer-on-fine-tuning-paligemma-and-vlms
149. Why does Nvidia (4090) and Debian not play nicely together? - Reddit, accessed on April 16, 2025, https://www.reddit.com/r/debian/comments/1fmizx3/why_does_nvidia_4090_and_debian_not_play_nicely/
150. How to finetune small LLMs to optimise them for RAG? : r/LocalLLaMA - Reddit, accessed on April 16, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1baod6t/how_to_finetune_small_llms_to_optimise_them_for/
151. Fine-tuning Gemma and/or CodeGemma for multi-turn, agentic workloads. #132 - GitHub, accessed on April 16, 2025, https://github.com/google-gemini/gemma-cookbook/issues/132
152. Multi-gpu huggingface training using trl - Transformers - Hugging Face Forums, accessed on April 16, 2025, https://discuss.huggingface.co/t/multi-gpu-huggingface-training-using-trl/113338
153. How to use Huggingface Trainer with multiple GPUs? - Stack Overflow, accessed on April 16, 2025, https://stackoverflow.com/questions/75814047/how-to-use-huggingface-trainer-with-multiple-gpus
154. How to train my model on multiple GPU - Transformers - Hugging Face Forums, accessed on April 16, 2025, https://discuss.huggingface.co/t/how-to-train-my-model-on-multiple-gpu/75775
155. finetuning with PEFT int-8bit + LoRA on single node multiGPU was working, now doesn't any more · Issue #1840 · huggingface/accelerate - GitHub, accessed on April 16, 2025, https://github.com/huggingface/accelerate/issues/1840
156. Fine-tuning Gemma using JAX and Flax | Google AI for Developers - Gemini API, accessed on April 16, 2025, https://ai.google.dev/gemma/docs/jax_finetune