
👤 Author: Cosmic Lounge AI Team
📅 Updated: 6/1/2025
⏱️ Read Time: 13 min
Topics: #llm #ai #model #fine-tuning #training #gpu #api #setup #introduction #design


🌌 Fine-Tuning Large Language Models with TRL and SFTTrainer on an RTX 3060

This comprehensive guide provides a step-by-step walkthrough for fine-tuning large language models (LLMs) using the Transformer Reinforcement Learning (TRL) library and the SFTTrainer on an RTX 3060 with 12GB of VRAM. Fine-tuning empowers you to tailor powerful LLMs like gemma3 and phi4 to your specific needs and datasets, even on a resource-constrained GPU. We’ll delve into environment setup, code implementation, dataset preparation, and the fine-tuning process itself.



🌟 Introduction

Large language models have revolutionized how we interact with and generate text, but their true potential is unlocked when they are fine-tuned for specific tasks or domains. TRL is a library that simplifies this process, offering tools for a range of fine-tuning methods [1]. SFTTrainer, a component of TRL, streamlines supervised fine-tuning, enabling efficient training with minimal code.



🌟 Setting Up the Environment

Before we begin, ensure you have the necessary tools and libraries installed.

1. Install Required Libraries: Use pip to install the following:

```bash
pip install transformers datasets accelerate bitsandbytes peft trl
```

2. Hugging Face Hub: Create an account on the Hugging Face Hub and obtain an access token. This will allow you to download and use pre-trained models and datasets. You can log in using the following code [3]:

```python
from huggingface_hub import login

login(token="YOUR_HUGGING_FACE_TOKEN")
```

3. Hardware Requirements: An RTX 3060 with 12GB of VRAM is sufficient for fine-tuning smaller models with QLoRA. Larger models might require additional memory optimization techniques, which we will discuss later.

4. Flash Attention: If your GPU supports it (Ampere architecture or newer), install flash attention for faster training and reduced memory usage:

```bash
pip install ninja packaging
MAX_JOBS=4 pip install flash-attn --no-build-isolation
```

5. Unsloth Library: For further optimization and enhanced performance on limited VRAM, consider using the Unsloth library. It offers specialized tools and techniques for fine-tuning large models efficiently [4].

6. Mixed Precision Training: Mixed precision training allows you to use both FP16 (half-precision) and FP32 (full-precision) data types during training. This can significantly reduce memory usage and speed up training without a major impact on accuracy [5]. A minimal sketch showing how to enable flash attention and mixed precision appears after this list.
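As a quick illustration of items 4 and 6, here is a minimal sketch of loading a model with flash attention enabled and turning on mixed precision for training. The `attn_implementation` argument and the `bf16` flag are standard transformers options; the model id is just a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Load the model in bfloat16 and ask transformers to use the FlashAttention 2 kernels
# (requires an Ampere-or-newer GPU and the flash-attn package installed above).
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-pt",                # placeholder; any causal LM checkpoint works
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Mixed precision is enabled with a single flag on the training arguments.
args = TrainingArguments(
    output_dir="./out",
    bf16=True,   # use fp16=True instead on GPUs without bfloat16 support
)
```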



🌟 Understanding TRL, SFTTrainer, and QLoRA

⚡ TRL (Transformer Reinforcement Learning)

TRL is a comprehensive library for training transformer language models with reinforcement learning. It provides a set of tools for various training methods, including supervised fine-tuning (SFT), reward modeling, and proximal policy optimization (PPO) [1]. TRL integrates seamlessly with the Hugging Face transformers library, making it easy to use with existing models and datasets.

⚡ SFTTrainer

SFTTrainer is a specialized trainer within TRL designed for supervised fine-tuning of LLMs. It simplifies the process of fine-tuning by providing a high-level API that handles data loading, preprocessing, and training. SFTTrainer supports various dataset formats and offers features like packing, which combines multiple training samples into a single sequence for more efficient training [6].
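Packing can be turned on with a single flag. This is a minimal sketch, assuming a TRL version where `SFTConfig` exposes the `packing` option and where SFTTrainer accepts a model id string; the model name and toy dataset are placeholders.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Tiny toy dataset with a "text" column; in practice use your own corpus.
train_dataset = Dataset.from_dict(
    {"text": ["Example one.", "Example two.", "Example three."]}
)

# With packing=True, SFTTrainer concatenates several short samples into one
# fixed-length sequence, so fewer tokens are wasted on padding.
config = SFTConfig(output_dir="./packed-run", packing=True)

trainer = SFTTrainer(
    model="google/gemma-3-4b-pt",  # placeholder; any causal LM checkpoint works
    train_dataset=train_dataset,
    args=config,
)
trainer.train()
```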

⚡ QLoRA (Quantized Low-Rank Adaptation)

QLoRA is a parameter-efficient fine-tuning technique that combines quantization and low-rank adaptation (LoRA) [7]. It reduces the memory footprint of large models by quantizing the model weights to 4-bit precision while keeping a small set of parameters in higher precision for fine-tuning. This allows for efficient fine-tuning on GPUs with limited VRAM, like the RTX 3060, without significant performance degradation.
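To make the idea concrete, the sketch below loads a model in 4-bit with bitsandbytes and attaches LoRA adapters with PEFT, then prints how few parameters actually receive gradients. The hyperparameter values are illustrative, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "google/gemma-3-4b-pt"  # placeholder checkpoint

# 4-bit NF4 quantization keeps the frozen base weights small in VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Only the small LoRA matrices stay in higher precision and are trained.
model = prepare_model_for_kbit_training(model)
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically a fraction of a percent of all weights
```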



🌟 Dataset Preparation and Formatting

The quality and format of your dataset are crucial for successful fine-tuning.

⚡ Dataset Selection

Choose a dataset that is relevant to your task and domain. Ensure it is diverse, comprehensive, and free of errors. High-quality datasets like OpenAssistant, which contains extensive and diverse annotations, are ideal for conversational AI tasks [8].

⚡ Dataset Formatting

TRL supports various dataset formats, including JSON, CSV, and plain text files. Ensure your dataset follows the correct structure for the chosen format [9]; a minimal loading sketch is shown below.
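As an illustration, a JSON Lines file can be loaded directly with the datasets library. The file name and field names here are hypothetical.

```python
from datasets import load_dataset

# train.jsonl: one JSON object per line, e.g. {"instruction": "...", "response": "..."}
dataset = load_dataset("json", data_files="train.jsonl", split="train")
print(dataset[0])
```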

⚡ Instruction Fine-tuning

For instruction fine-tuning, structure your dataset with clear instructions and corresponding desired outputs. You can use templates like Alpaca or ChatML to format your data. For example, the Alpaca template includes “Instruction,” “Input,” and “Response” sections to guide the model’s learning [10].
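Below is a minimal sketch of applying an Alpaca-style template with `datasets.map`. The field names (instruction, input, output) follow the common Alpaca convention and are assumptions about your data, not requirements of TRL.

```python
from datasets import Dataset

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def to_alpaca(example):
    # Collapse the structured fields into a single "text" column for SFTTrainer.
    example["text"] = ALPACA_TEMPLATE.format(**example)
    return example

# Toy example; replace with your own instruction dataset.
dataset = Dataset.from_dict({
    "instruction": ["Summarize the sentence."],
    "input": ["The RTX 3060 has 12GB of VRAM."],
    "output": ["A 12GB consumer GPU."],
})
dataset = dataset.map(to_alpaca)
print(dataset[0]["text"])
```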

⚡ Data Preprocessing

Clean and preprocess your dataset to remove noise, inconsistencies, and irrelevant information. This might involve tokenization, normalization, and formatting adjustments.

⚡ Data Augmentation

If you have a limited dataset, consider using data augmentation techniques to improve its quality and quantity. These techniques include paraphrasing, back-translation, and synthetic data generation [8].



🌟 Fine-tuning gemma3 with QLoRA

gemma3 is a powerful LLM capable of handling both text and image inputs. Here’s how to fine-tune it using QLoRA and SFTTrainer:

1. Load the Model and Tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-3-4b-pt"  # choose the desired gemma3 variant

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

Before fine-tuning, ensure you have accepted the terms of use for the chosen gemma3 model on Hugging Face.

2. Prepare the Dataset: Load the dataset and preprocess it so that prompts are tokenized and images are converted to pixel values:

```python
from datasets import load_dataset
from transformers import AutoProcessor

dataset = load_dataset("philschmid/amazon-product-descriptions-vlm", split="train")

# The processor handles image preprocessing for the multimodal gemma3 checkpoints.
processor = AutoProcessor.from_pretrained(model_id)

def preprocess_function(examples):
    # Assumes the dataset stores PIL images in an "image" column and text in "prompt".
    images = examples["image"]
    inputs = tokenizer(
        examples["prompt"],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
    )
    inputs["pixel_values"] = processor(images=images, return_tensors="pt").pixel_values
    return inputs

dataset = dataset.map(preprocess_function, batched=True)
```

3. Split the Dataset: Divide the dataset into training and evaluation sets to assess the performance of the fine-tuned model [11].

```python
# Use the built-in splitter so both halves remain datasets.Dataset objects.
split = dataset.train_test_split(test_size=0.2)
train_dataset, eval_dataset = split["train"], split["test"]
```

4. Configure and Train with SFTTrainer:

```python
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="gemma-product-description",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=args,
)

trainer.train()
```



🌟 Fine-tuning phi4 with QLoRA

phi4 is another powerful LLM that can be efficiently fine-tuned using QLoRA and SFTTrainer. Follow these steps:

1. Load the Model and Tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/phi-4"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

2. Address the Tokenizer Bug: The phi4 tokenizer has a bug where it uses the <|endoftext|> token for the beginning-of-sequence (BOS), end-of-sequence (EOS), and padding tokens. To fix this, ensure the EOS token is set to <|im_end|> [4]; a minimal sketch of this fix follows.
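This is a minimal sketch of that adjustment, assuming <|im_end|> is present in the phi4 vocabulary as one of its ChatML-style special tokens; verify the resulting token ids on your tokenizer version before training.

```python
# Point EOS at the ChatML end-of-turn token so generation stops at the end of a reply.
tokenizer.eos_token = "<|im_end|>"
model.config.eos_token_id = tokenizer.eos_token_id

# Keep padding distinct from EOS; <|endoftext|> can still be reused purely as padding.
tokenizer.pad_token = "<|endoftext|>"
model.config.pad_token_id = tokenizer.pad_token_id
```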

3. Prepare the Dataset: Load and format your dataset according to the phi4 chat template:

```python
from datasets import load_dataset

# ultrachat_200k exposes "train_sft"/"test_sft" splits rather than a plain "train" split.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def formatting_prompts_func(example):
    text = f"""<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{example['messages'][0]['content']}<|im_end|>
<|im_start|>assistant
{example['messages'][1]['content']}<|im_end|>"""
    example["text"] = text
    return example

# The formatting function works on one example at a time, so map without batching.
dataset = dataset.map(formatting_prompts_func)
```

4. Split the Dataset: Divide the dataset into training and evaluation sets to assess the performance of the fine-tuned model [11].

```python
# Use the built-in splitter so both halves remain datasets.Dataset objects.
split = dataset.train_test_split(test_size=0.2)
train_dataset, eval_dataset = split["train"], split["test"]
```

5. Configure and Train with SFTTrainer:

```python
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    optim="paged_adamw_8bit",
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()
```



🌟 Memory Optimization Techniques

When working with limited VRAM, it’s essential to employ memory optimization techniques to avoid out-of-memory errors during training. Here are two common techniques:

⚡ Gradient Checkpointing

Gradient checkpointing reduces memory usage by storing only a subset of activations during the forward pass. During backpropagation, the missing activations are recomputed as needed [12]. This can significantly reduce VRAM consumption but may increase training time.

⚡ Gradient Accumulation

Gradient accumulation simulates larger batch sizes without increasing memory usage. Instead of updating model weights after each batch, gradients are accumulated over multiple smaller batches [12]. This reduces VRAM requirements but also increases training time.
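Both techniques are enabled through standard TrainingArguments (or SFTConfig) flags. The sketch below is illustrative; the numbers are placeholders rather than tuned values.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./low-vram-run",
    per_device_train_batch_size=1,   # small micro-batch to fit in 12GB
    gradient_accumulation_steps=8,   # effective batch size = 1 * 8 = 8
    gradient_checkpointing=True,     # recompute activations during the backward pass
    fp16=True,
)
```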



🌟 Estimating VRAM Requirements

To estimate the VRAM needed for inference and fine-tuning, you can use the following formulas:

⚡ Inference (batch size 1):

VRAM (GB) = (P * 4) / (32 / Q) * 1.2

where:

  • P = number of parameters in the model (in billions)

  • Q = bit precision used for loading the model (e.g., 16, 8, or 4)

  • 1.2 = overhead factor

⚡ Fine-tuning:

For fine-tuning, estimate VRAM by considering the memory used for model parameters, optimizer states, gradients, and activations. This typically requires 3 to 4 times more memory than inference at the same precision [12].
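As a worked example, the sketch below evaluates the inference formula for a hypothetical 4B-parameter model loaded in 4-bit, then applies the 3 to 4 times rule of thumb for fine-tuning; the numbers are illustrative only.

```python
def inference_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    # (P * 4 bytes) scaled down from 32-bit to the chosen precision, plus ~20% overhead.
    return (params_billion * 4) / (32 / bits) * overhead

p, q = 4, 4                      # 4B parameters, 4-bit quantization
inference = inference_vram_gb(p, q)
print(f"inference:   ~{inference:.1f} GB")                          # ~2.4 GB
print(f"fine-tuning: ~{inference * 3:.1f}-{inference * 4:.1f} GB")  # rough 3-4x rule
```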



🌟 Evaluating Fine-tuned LLMs

Evaluating the performance of your fine-tuned LLMs is crucial to ensure they meet your requirements.

⚡ Quantitative Metrics

Use metrics like perplexity, BLEU, and ROUGE to measure the model’s fluency, coherence, and accuracy [13]; a minimal sketch of computing two of these follows the list below.

  • Perplexity: Measures how well the model predicts the next word in a sequence. Lower perplexity indicates better performance.

  • BLEU (Bilingual Evaluation Understudy): Measures the overlap of n-grams between the generated text and a reference text. Higher scores indicate better quality.

  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures the overlap of various sequences between the generated text and reference texts.
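Here is a minimal sketch of computing two of these metrics, assuming the Hugging Face evaluate package is installed and that perplexity is derived from the trainer’s evaluation loss; the example strings are placeholders.

```python
import math
import evaluate

# ROUGE between model outputs and references (BLEU works the same way via evaluate.load("bleu")).
rouge = evaluate.load("rouge")
predictions = ["The GPU has 12GB of memory."]          # placeholder model outputs
references = ["The RTX 3060 ships with 12GB of VRAM."]  # placeholder ground truth
print(rouge.compute(predictions=predictions, references=references))

# Perplexity can be recovered from the cross-entropy loss reported by trainer.evaluate().
eval_loss = 1.9  # placeholder: use trainer.evaluate()["eval_loss"] on your eval split
print(f"perplexity: {math.exp(eval_loss):.2f}")
```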

⚡ Qualitative Assessment

Conduct human evaluations to assess the model’s ability to generate human-like text, understand nuances, and follow instructions [13]. This can involve having humans rate the quality of model outputs or compare them to human-generated text.

⚡ Benchmark Datasets

Evaluate your model on benchmark datasets like MMLU, GSM8K, and TruthfulQA to compare its performance against established baselines [14]. These datasets provide standardized tests for various language tasks, allowing you to assess your model’s capabilities in different areas.



🌟 Comparing gemma3 and phi4

Here’s a table summarizing the key differences between gemma3 and phi4 [4]:

| Model | Size | Modality | Context Length | Special Features |
| --- | --- | --- | --- | --- |
| gemma3-4b-pt | 4B | Text & Image | 8192 | Multilingual support, vision understanding |
| phi-4 | 14B | Text | 2048 | Advanced reasoning and instruction-following skills |



🌟 Conclusion

This guide has provided a comprehensive overview of fine-tuning LLMs with TRL and SFTTrainer on an RTX 3060. By following these instructions and adapting them to your specific needs, you can efficiently fine-tune gemma3 and phi4 models for various tasks, even with limited VRAM. Remember to prioritize dataset quality, experiment with different hyperparameters, and evaluate your models thoroughly to achieve optimal results.

🔧 Works cited

1. TRL - Transformer Reinforcement Learning - Hugging Face, accessed on March 12, 2025, https://huggingface.co/docs/trl/main/en/index

2. Enhancing LLM Accessibility: A Deep Dive into QLoRA Through Fine-tuning Llama 2 on a single AMD GPU — ROCm Blogs, accessed on March 12, 2025, https://rocm.blogs.amd.com/artificial-intelligence/llama2-Qlora/README.html

3. Fine-Tune Gemma using Hugging Face Transformers and QloRA | Google AI for Developers, accessed on March 12, 2025, https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora

4. Phi-4 Finetuning + Bug Fixes by Unsloth, accessed on March 12, 2025, https://unsloth.ai/blog/phi4

5. How much VRAM do I need for LLM model fine-tuning? | Modal Blog, accessed on March 12, 2025, https://modal.com/blog/how-much-vram-need-fine-tuning

6. How to Fine-tune an LLM Part 3: The HuggingFace Trainer | alpaca_ft - Wandb, accessed on March 12, 2025, https://wandb.ai/capecape/alpaca_ft/reports/How-to-Fine-tune-an-LLM-Part-3-The-HuggingFace-Trainer—Vmlldzo1OTEyNjMy

7. QLoRA: Fine-Tuning Large Language Models (LLM’s) - Medium, accessed on March 12, 2025, https://medium.com/@dillipprasad60/qlora-explained-a-deep-dive-into-parametric-efficient-fine-tuning-in-large-language-models-llms-c1a4794b1766

8. Best Practices for Fine-Tuning Large Language Models with LoRA and QLoRA - Medium, accessed on March 12, 2025, https://medium.com/@jsmith0475/best-practices-for-fine-tuning-large-language-models-with-lora-and-qlora-998312c82aad

9. What are Instruction Datasets for Fine-Tuning LLMs? - Hopsworks, accessed on March 12, 2025, https://www.hopsworks.ai/dictionary/instruction-datasets-for-fine-tuning-llms

10. Structuring Datasets for Fine-Tuning an LLM | by William Caban | Shift Zone, accessed on March 12, 2025, https://shift.zone/structuring-datasets-for-fine-tuning-an-llm-8ca15062dd5c

11. Fine Tune PaliGemma with QLoRA for Visual Question Answering - PyImageSearch, accessed on March 12, 2025, https://pyimagesearch.com/2024/12/02/fine-tune-paligemma-with-qlora-for-visual-question-answering/

12. How Much VRAM Do You Need for LLMs? - Hyperstack, accessed on March 12, 2025, https://www.hyperstack.cloud/blog/case-study/how-much-vram-do-you-need-for-llms

13. Best practices when evaluating fine-tuned LLMs. - Medium, accessed on March 12, 2025, https://medium.com/@arazvant/best-practices-when-evaluating-fine-tuned-llms-47f02f5164c2

14. Evaluation of fine-tuned LLM using MonsterAPI | by Avikumar Talaviya | Medium, accessed on March 12, 2025, https://medium.com/@avikumart_/evaluation-of-fine-tuned-llm-using-monsterapi-a67a7714a65b

15. Fine-tune Gemma 3 with Unsloth, accessed on March 12, 2025, https://unsloth.ai/blog/gemma3