Technical Documentation

Local Fine-Tuning of Large Language Models with 12GB VRAM in 2025

Technical guide covering **local fine-tuning of large language models with 12GB VRAM in 2025**.

👤
Author
Cosmic Lounge AI Team
📅
Updated
6/1/2025
⏱️
Read Time
8 min
Topics
#llm #ai #model #fine-tuning #training #gpu #cuda #configuration #design


🌌 Local Fine-Tuning of Large Language Models with 12GB VRAM in 2025

The proliferation of large language models (LLMs) has democratized access to sophisticated natural language processing capabilities. As these models become increasingly integrated into various applications, the need for their customization to specific tasks and domains has grown significantly. However, fine-tuning these expansive models often presents a considerable computational challenge, particularly for individuals working with consumer-grade hardware that may have limitations in resources such as video memory (VRAM).

In 2025, while advancements in hardware and software continue, effectively fine-tuning LLMs on systems with constrained VRAM, such as a 12GB video card, requires a nuanced understanding of parameter-efficient techniques, appropriate software tools, and optimized workflows. This report delves into best practices for locally fine-tuning several prominent LLMs – Phi4, Gemma3, Llama3.2, and Granite3 – within this VRAM constraint, as well as the procedures for converting the fine-tuned models to the GGUF format for efficient local inference. The focus is on providing a detailed, actionable guide for technically proficient individuals seeking to customize these models despite hardware limitations.

Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a transformative approach to adapting large models, shifting away from updating all model parameters and instead injecting small trainable components or adjusting a subset of weights 1. This strategy drastically reduces the computational and memory costs associated with fine-tuning, making it a viable option for resource-constrained environments. Furthermore, fine-tuning remains a powerful tool in 2025 for achieving higher quality results than simple prompting, reducing operational costs by creating smaller, more specialized models, and ensuring reliability for specific use cases 2.



🌟 Understanding VRAM Constraints and Their Impact on LLM Fine-Tuning

Video RAM (VRAM) serves as dedicated high-speed memory for the graphics processing unit (GPU), crucial for storing and processing the vast amounts of data required by modern applications, including the training and inference of large language models. Insufficient VRAM can severely impede these tasks, leading to sluggish operation, system crashes, or the inability to load the model altogether 4.

Several factors contribute to the amount of VRAM consumed during LLM fine-tuning. The sheer size of the model, measured by the number of its parameters, is a primary determinant. Each parameter must be stored in memory, and the precision at which the parameters are represented significantly impacts memory usage 4. For instance, 32-bit floating-point (FP32) precision requires 4 bytes per parameter, lower precisions like 16-bit (FP16/BF16) use half that amount, and 4-bit integer (INT4) quantization uses only 0.5 bytes per parameter 4. Beyond the model parameters themselves, the optimizer states, gradients calculated during backpropagation, and activations generated during the forward pass also consume substantial VRAM 4.

The size of the input data, specifically the batch size and sequence length, further influences VRAM usage. A general guideline suggests that roughly 8,000 tokens can be accommodated within VRAM when spread across the batch 5, so larger batch sizes and longer input sequences necessitate more VRAM. It is also important to recognize that even if the base model appears to fit within the available VRAM, the prompts provided as input and the text generated as output by the LLM also utilize this memory 6; a model with a size close to the VRAM capacity might therefore not run smoothly in practice.

Several techniques can mitigate VRAM constraints during LLM fine-tuning. Freezing the weights of the pre-trained model and training only a small subset of newly introduced parameters, as done in PEFT, is a highly effective strategy 5. Utilizing 4-bit quantization for the model weights can also lead to significant reductions in VRAM consumption 5, and memory-efficient attention mechanisms like flash attention reduce overall usage further 5. Techniques such as gradient accumulation, where gradients are accumulated over several smaller batches to simulate a larger effective batch size, and gradient checkpointing, which trades computation for memory by recalculating activations when needed, also help in managing VRAM usage 4.

While 12GB of VRAM might be sufficient for gaming at 1440p resolution with reasonable settings in 2025 7, fine-tuning large language models presents a different set of challenges due to the memory demands of model parameters, gradients, and activations.
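
To make this arithmetic concrete, the sketch below turns the rules of thumb above (bytes per parameter at different precisions, plus optimizer states, gradients, and an allowance for activations) into a rough estimate. The default values are illustrative assumptions rather than measurements, and actual usage depends on batch size, sequence length, and the framework's own overhead.

```python
def estimate_finetune_vram_gb(
    n_params_billion: float,
    bytes_per_weight: float = 0.5,            # 4-bit quantized base model (INT4/NF4)
    n_trainable_billion: float = 0.05,        # e.g. LoRA adapters: a small fraction of the model
    optimizer_bytes_per_trainable: int = 8,   # AdamW keeps two FP32 states per trainable weight
    gradient_bytes_per_trainable: int = 2,    # FP16/BF16 gradients for the trainable weights
    activation_overhead_gb: float = 1.5,      # rough allowance; grows with batch size and sequence length
) -> float:
    """Back-of-the-envelope VRAM estimate for PEFT fine-tuning (a sketch, not a guarantee)."""
    weight_bytes = n_params_billion * 1e9 * bytes_per_weight
    trainable_bytes = n_trainable_billion * 1e9 * (
        2 + optimizer_bytes_per_trainable + gradient_bytes_per_trainable  # adapter weights + optimizer + grads
    )
    return (weight_bytes + trainable_bytes) / 1e9 + activation_overhead_gb

# Example: a 7B model loaded in 4-bit with roughly 50M trainable LoRA parameters
print(f"~{estimate_finetune_vram_gb(7):.1f} GB estimated")  # about 5.6 GB, leaving headroom on a 12GB card
```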



🌟 Key Software Tools and Libraries for Local LLM Fine-Tuning in 2025

The landscape of software tools and libraries for local LLM fine-tuning in 2025 is rich and continues to evolve, offering various options for researchers and practitioners. The Hugging Face Transformers library stands as a cornerstone, providing access to a vast repository of pre-trained models and functionalities for their manipulation and training 1. Closely integrated with Transformers is the PEFT (Parameter-Efficient Fine-Tuning) library, which implements popular techniques like LoRA, QLoRA, and adapter layers, enabling efficient adaptation of large models on resource-constrained hardware 1. For users seeking a more streamlined experience, Axolotl provides a framework for configuring and running fine-tuning processes, often using YAML configuration files 12.

Unsloth has emerged as a library focused on optimizing the fine-tuning process for speed and reduced memory usage, potentially offering advantages when working with limited VRAM 13. The TRL (Transformer Reinforcement Learning) library, while initially focused on reinforcement learning techniques, includes the SFTTrainer, a widely used tool for supervised fine-tuning, often employed in conjunction with PEFT methods 2. For converting fine-tuned models into the GGUF format, llama.cpp is an indispensable tool. It provides scripts for converting models from various formats, including those from Hugging Face, to GGUF, and also offers functionalities for quantizing these models 17. Finally, for downloading and running LLMs locally, tools like LM Studio and Ollama can be valuable for testing the performance of fine-tuned models 6. The availability of these specialized libraries indicates a mature and well-supported ecosystem for local LLM fine-tuning in 2025.



🌟 Parameter-Efficient Fine-Tuning (PEFT) Techniques for 12GB VRAM

Given the 12GB VRAM constraint, parameter-efficient fine-tuning (PEFT) techniques are crucial for making local fine-tuning of large language models feasible in 2025. These methods allow for the adaptation of pre-trained models to specific tasks while significantly reducing the number of trainable parameters and the associated memory footprint.

Low-Rank Adaptation (LoRA) is a widely adopted PEFT technique that addresses the memory demands of full fine-tuning by freezing the original weights of the LLM and introducing small, low-rank weight matrices (adapters) into its layers 1. Only these newly added, low-rank matrices are trained, drastically reducing the number of parameters that need to be updated and stored during fine-tuning. This approach not only lowers the computational cost but also minimizes the storage requirements for the fine-tuned model.

A key concept behind LoRA is the hypothesis that the intrinsic dimensionality of the weight changes needed to adapt a pre-trained model to a new task is much smaller than the full dimensionality of the model’s tensors 24. By approximating these weight changes with lower-dimensional matrices, LoRA achieves significant efficiency gains. The rank of these low-rank matrices, often denoted as ‘k’ or ‘r’, serves as a control parameter: a higher rank allows for more complex adaptations but also increases the number of trainable parameters 23. Additionally, a scaling factor, ‘alpha’, is often used to control the magnitude by which the new adapter weights influence the original model’s behavior 4.

A notable advantage of LoRA is its ability to stack multiple fine-tunings on top of the same base model. Since the original model weights remain frozen, different LoRA adapters can be trained for various tasks and then swapped in as needed, offering a modular approach to model customization 1.
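
As an illustration of how little LoRA actually trains, the sketch below attaches low-rank adapters to a small causal language model with the Hugging Face PEFT library. The base model and target module names are placeholders chosen for the example; in practice they are replaced by the model being fine-tuned and its own projection layer names.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder base model used only for illustration; swap in the checkpoint you are fine-tuning
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection for GPT-2; names differ per architecture
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the trainable vs. total parameter counts
```

Printing the trainable parameters typically shows well under one percent of the base model’s weights being updated, which is precisely what makes LoRA viable on 12GB of VRAM.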

Quantized LoRA (QLoRA) builds upon the foundation of LoRA by introducing an additional layer of efficiency through quantization 1. QLoRA begins by quantizing the pre-trained language model to 4-bit precision, significantly reducing its memory footprint even before the addition of LoRA adapters. This makes it possible to fine-tune extremely large models that would otherwise be infeasible on consumer-grade GPUs; for instance, QLoRA has been used to fine-tune a 65 billion parameter LLaMA model to near-ChatGPT performance on a single 48GB GPU 1. During fine-tuning with QLoRA, the 4-bit quantized base model remains frozen, and backpropagation occurs through this frozen model into the LoRA adapters. To handle potential memory spikes during training, QLoRA incorporates techniques like double quantization and paged optimizers.

Beyond LoRA and QLoRA, Adapter Layers represent another category of parameter-efficient fine-tuning techniques 3. These methods insert small, lightweight neural network modules between the existing layers of the pre-trained model. During fine-tuning, the weights of the original model are typically frozen, and only the parameters within the newly added adapter modules are updated. This allows the model to adapt its behavior for the target task without extensive retraining of the base model’s parameters, and adapter layers are often designed to be computationally efficient, making them a practical solution for large transformer-based architectures like GPT and BERT.

Other PEFT techniques, such as Prompt Tuning, Prefix Tuning, and P-Tuning, focus on optimizing the input prompts or adding trainable prefixes to the input to guide the model’s behavior 1. These methods also keep the base model frozen and train only a small number of additional parameters; for example, Prompt Tuning learns an embedding for a prompt vector, while Prefix Tuning prepends learned tokens to each input. While LoRA and QLoRA are particularly well-suited to scenarios with strict VRAM limitations, due to their direct impact on the number of trainable parameters and the base model’s memory footprint, these other techniques can be effective for specific tasks and model architectures.
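
Returning to QLoRA specifically, a minimal sketch of the 4-bit loading step with bitsandbytes and PEFT might look like the following. The model identifier and target module names are placeholders, and the compute dtype should be dropped to float16 on GPUs without bfloat16 support.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bfloat16 support
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

base_model = "your-org/your-base-model"  # placeholder identifier
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Cast norm layers and enable input gradients so k-bit training behaves correctly
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # adjust to the projection layer names of your architecture
)
model = get_peft_model(model, lora_config)
```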



🌟 Fine-tuning Phi4 Locally with 12GB VRAM in 2025

For fine-tuning the Phi4 model locally with a 12GB VRAM card in 2025, parameter-efficient fine-tuning techniques, particularly LoRA and QLoRA, are highly recommended due to their memory efficiency. The Unsloth library appears to be a promising tool for this purpose, as it offers optimized fine-tuning capabilities and supports Phi-4 (14B) 14. While a direct code example for fine-tuning Phi-4 with LoRA on 12GB VRAM was not explicitly found in the provided materials, a relevant example for fine-tuning Phi-3.5 with LoRA using Unsloth is available 14. The provided code snippet 14 demonstrates the use of 4-bit quantization (load_in_4bit=True), which is crucial for fitting larger models like Phi-4 within the 12GB VRAM limit. The FastLanguageModel.from_pretrained function from Unsloth is used to load the pre-quantized model. The LoRA configuration within the code specifies the target_modules, which are the layers in the model where LoRA adapters will be added. For Phi models, common target modules include "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", and "down_proj".

The rank (r) of the LoRA adapter, set to 16 in the example, can be adjusted; a lower rank reduces memory usage but may also limit the model’s adaptability. The training arguments, defined within the SFTTrainer, include parameters like per_device_train_batch_size and gradient_accumulation_steps, which play a significant role in managing VRAM usage: a smaller batch size combined with more gradient accumulation steps reduces the amount of VRAM needed at any given time.

To fine-tune Phi-4 with LoRA on a 12GB VRAM card, one could start with a modified version of the Phi-3.5 fine-tuning script 14. The key modification is to ensure that the model_name passed to FastLanguageModel.from_pretrained points to the appropriate pre-quantized 4-bit version of Phi-4, if available within Unsloth or on the Unsloth Hugging Face page. It is advisable to begin with a small per_device_train_batch_size, such as 1 or 2, and potentially increase gradient_accumulation_steps to compensate for the small batch size. Experimentation with the LoRA rank (r) may also be necessary to find a balance between performance and memory usage, and monitoring VRAM during training is essential to ensure the process stays within the 12GB limit. Resources like the Unsloth documentation and the Kaggle notebooks mentioned for Phi-4 (14B) 14 may provide further guidance and potentially optimized configurations.
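
Adapting the Phi-3.5 recipe along those lines, a hedged Unsloth sketch for Phi-4 could look like the following. The checkpoint name is an assumption to verify against Unsloth’s Hugging Face page, and the rank, sequence length, and other values are starting points rather than recommendations.

```python
from unsloth import FastLanguageModel

# Checkpoint name is an assumption; confirm the exact 4-bit Phi-4 upload on Unsloth's Hugging Face page
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,  # essential for fitting a 14B model within 12GB VRAM
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # lower this if VRAM is tight; raise it if adaptation quality is lacking
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving checkpointing mode
)
```

From here, the supervised fine-tuning step follows the same SFTTrainer pattern discussed in the Llama 3.2 section: start with per_device_train_batch_size=1 and raise gradient_accumulation_steps rather than the batch size.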



🌟 Fine-tuning Gemma3 Locally with 12GB VRAM in 2025

Fine-tuning the Gemma3 model locally with a 12GB VRAM card in 2025 appears feasible, particularly by leveraging the optimizations offered by the Unsloth library. Unsloth has demonstrated significant efficiency gains for Gemma 3, claiming to make fine-tuning the 12B parameter model 1.6 times faster while using 60% less VRAM 13. Detailed VRAM requirements for Gemma 3 indicate that the 12B model requires 27.6GB in full precision for text-to-text tasks but only 6.9GB when using 4-bit quantization 25. This strongly suggests that fine-tuning the Gemma 3 12B model is achievable within a 12GB VRAM budget by employing quantization.

Unsloth provides free Google Colab notebooks specifically designed for fine-tuning Gemma 3 13, which likely contain optimized code examples and configurations well-suited to resource-constrained environments. It is recommended to start with the 4-bit quantized versions of Gemma 3 that Unsloth has made available on Hugging Face 13. One could begin with the Gemma 3 4B model, which has even lower VRAM requirements (2.3GB for 4-bit quantized text-to-text tasks) 25, or attempt the 12B model directly, carefully monitoring VRAM usage and adjusting the batch size as needed. The specific task for which Gemma3 is being fine-tuned (text-only vs. vision+text) also influences the VRAM requirements, with multimodal tasks generally requiring more memory 25. Unsloth’s optimizations, such as keeping intermediate activations in bfloat16 format and performing matrix multiplies in float16 with tensor cores 13, contribute to the reduced memory footprint.
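
Because the right Gemma 3 size depends on how much memory is actually free at run time, a small pre-flight check such as the one below can guide the choice. The checkpoint names are assumptions based on Unsloth’s 4-bit uploads, and the footprint figures are the text-to-text estimates cited above plus a rough allowance for training overhead.

```python
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()  # free and total VRAM on the current device
free_gb = free_bytes / 1e9

# Approximate 4-bit text-to-text footprints from the figures cited above (checkpoint names assumed)
candidates = [
    ("unsloth/gemma-3-12b-it-bnb-4bit", 6.9),
    ("unsloth/gemma-3-4b-it-bnb-4bit", 2.3),
]

for name, base_gb in candidates:
    # Leave several GB of headroom for adapters, optimizer states, activations, and the input batch
    if base_gb + 4.0 <= free_gb:
        print(f"Selected {name} (base ~{base_gb}GB, {free_gb:.1f}GB free)")
        break
else:
    print("Consider a smaller variant or more aggressive memory optimizations")
```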



🌟 Fine-tuning Llama3.2 Locally with 12GB VRAM in 2025

Fine-tuning the Llama3.2 model locally with a 12GB VRAM card in 2025 is also a viable prospect, particularly for the smaller variants of the Llama 3 family. Unsloth has reported making Llama 3.2 (3B) fine-tuning significantly faster while using 60% less memory compared to standard methods 16, which suggests that the 3B version of Llama3.2 should comfortably fit within a 12GB VRAM limit. Furthermore, the Llama Cookbook mentions that Llama 2-13B can be fine-tuned using LoRA or QLoRA on a single 24GB GPU, with QLoRA requiring even less memory 26.

A relevant code example for fine-tuning Llama 3.2 (3B) using QLoRA is available 9. This example, designed to run on Kaggle with free GPUs, uses BitsAndBytesConfig to load the model in 4-bit quantization and includes the necessary imports from the transformers, peft, trl, and datasets libraries. The bnb_config is set up with load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16 (or torch.float16 if bfloat16 is not supported), and bnb_4bit_use_double_quant=True. The model and tokenizer are then loaded using AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained, respectively, with the specified quantization configuration and device mapping. To adapt this for a local environment with 12GB VRAM, one would need to set the base_model variable to the specific Llama3.2 variant being used.

The SFTTrainer from the trl library is then employed to handle the supervised fine-tuning process. Training arguments such as per_device_train_batch_size should be chosen carefully; starting with a batch size of 1 and monitoring VRAM usage is recommended. Libraries like Unsloth also provide optimized methods for fine-tuning Llama 3.2, potentially offering further memory reductions 16. Another library to consider is TorchTune, which also supports fine-tuning Llama 3.2 27. While the example in 27 used a 24GB GPU for the Llama 3.2 3B instruct model, it suggests that reducing the batch size could enable fine-tuning on a 12GB card.
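
Building on that QLoRA loading pattern, the training stage might be wired up roughly as follows. The dataset identifier and hyperparameters are placeholders, and the exact SFTTrainer/SFTConfig argument names can vary between trl releases, so this should be read as a sketch rather than a drop-in script.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# `model` is assumed to be the 4-bit Llama 3.2 model loaded as in the QLoRA example above
dataset = load_dataset("your-dataset", split="train")  # placeholder dataset identifier

training_args = SFTConfig(
    output_dir="llama32-qlora",
    per_device_train_batch_size=1,   # keep the micro-batch tiny on 12GB VRAM
    gradient_accumulation_steps=8,   # effective batch size of 8
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    gradient_checkpointing=True,     # trade compute for memory
    bf16=True,                       # use fp16=True instead if bfloat16 is unsupported
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=training_args,
)
trainer.train()
```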



🌟 Fine-tuning Granite3 Locally with 12GB VRAM in 2025

Information regarding the specific fine-tuning of the Granite3 model with 12GB VRAM in 2025 is less readily available in the research materials consulted; the link to the IBM Research blog about Granite was not functional 28 and might otherwise have contained valuable details. Given the VRAM constraint, techniques like LoRA and QLoRA are likely the most suitable starting points for fine-tuning Granite3. One could begin by exploring the availability of pre-trained Granite3 models on the Hugging Face Hub and then attempting to fine-tune a smaller variant (if multiple sizes exist) using QLoRA with 4-bit quantization. The process would involve loading the model and tokenizer using the Transformers library and then configuring a LoRA adapter using the PEFT library. Key parameters to adjust for memory management include the rank of the LoRA adapter (r), the batch size, and potentially the use of gradient accumulation. It is advisable to start with a small learning rate, a low LoRA rank, and a batch size of 1, then gradually increase these parameters while closely monitoring VRAM usage to avoid out-of-memory errors. Official documentation or community resources related to fine-tuning Granite3 may yield more specific recommendations or examples.
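
In the absence of a published recipe, one practical first step is to inspect the model’s module names to decide where to attach LoRA adapters. The sketch below does this generically; the Granite checkpoint identifier is an assumption to confirm on the Hugging Face Hub before use.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Assumed checkpoint name; verify the exact Granite 3 identifier on the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.0-2b-instruct")

# Collect the short names of all linear layers; attention and MLP projections are the usual LoRA targets
linear_names = sorted({
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
})
print(linear_names)  # feed a suitable subset of these into LoraConfig(target_modules=...)
```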



🌟 Converting Fine-tuned LLMs to GGUF Format in 2025

Converting fine-tuned large language models to the GGUF format is a crucial step for enabling efficient local inference, particularly with tools like llama.cpp and Ollama. The standard procedure often involves the convert.py script located within the llama.cpp repository 20, which requires cloning the llama.cpp repository and installing its Python dependencies. The basic command for converting a Hugging Face model to GGUF is python llama.cpp/convert.py <path_to_hf_model> --outfile <output_gguf_file> 20.

For users seeking a more automated approach, the LLM-GGUF-Auto-Converter Jupyter notebook provides a solution for batch converting models to GGUF format with various quantization options, built upon llama.cpp and integrated with Hugging Face 18. This tool can streamline the conversion process and offers features like automatic CUDA detection and Hugging Face upload functionality. Another practical guide demonstrates converting a Hugging Face model to GGUF using Google Colab, which involves cloning the llama.cpp repository and executing the conversion commands within the Colab environment 19.

Interestingly, the Unsloth library, which is often used for efficient fine-tuning, also offers a direct method for saving LoRA fine-tuned models to the GGUF format 29. This can be particularly convenient for users who have fine-tuned their models using Unsloth, as it bypasses the need for separate conversion scripts; the call model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m") illustrates this functionality 29. Regardless of the method used, it is generally necessary to first merge the LoRA adapters with the base model before converting to GGUF, unless the conversion tool handles this automatically, as Unsloth does 21. The llama.cpp conversion process also often requires the model to be in FP16 format initially 22.

⚡ Code Example using llama.cpp:

Assuming you have a fine-tuned model saved in Hugging Face format in a directory named fine_tuned_model:

1. Clone llama.cpp and install its Python dependencies:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
```

2. Convert the Hugging Face model to an FP16 GGUF file, then quantize it (e.g., to Q4_K_M):

```bash
python convert.py ../fine_tuned_model --outfile fine_tuned_model-f16.gguf --outtype f16
# The quantize tool must be built first (see the llama.cpp build instructions);
# older builds name the binary ./quantize instead of ./llama-quantize.
./llama-quantize fine_tuned_model-f16.gguf fine_tuned_model-q4_k_m.gguf Q4_K_M
```

> ⚠️ Note: Adjust the quantization preset (e.g., Q4_K_M, Q5_K_M, Q8_0) to match your desired trade-off between model size and quality, and check the script and binary names against the llama.cpp version you have checked out, as they have changed over time.

⚡ Code Example using Unsloth (if the model was fine-tuned with Unsloth):

```python
from unsloth import FastLanguageModel

# Assuming your LoRA fine-tuned model is saved in "lora_model"
model, tokenizer = FastLanguageModel.from_pretrained("lora_model")

# Save to GGUF with Q4_K_M quantization
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "q4_k_m")
```

These examples provide a starting point for converting fine-tuned models to the GGUF format, with the specific steps potentially varying depending on the chosen tools and the format of the fine-tuned model.
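
As a quick sanity check of the converted file, the GGUF model can be loaded for local inference, for example via the llama-cpp-python bindings (one option alongside Ollama and LM Studio). The file name, prompt format, and parameter values below are illustrative assumptions.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="gguf_model/unsloth.Q4_K_M.gguf",  # adjust to the actual output filename
    n_ctx=2048,       # context window for inference
    n_gpu_layers=-1,  # offload all layers to the GPU if they fit; lower this on tight VRAM
)

output = llm("### Instruction:\nSummarize what LoRA does.\n\n### Response:\n", max_tokens=128)
print(output["choices"][0]["text"])
```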



🌟 Optimization Strategies for Fine-tuning with 12GB VRAM

To effectively fine-tune large language models with a 12GB VRAM card, several optimization strategies can be employed to minimize memory usage while maintaining reasonable training speed and model performance.

Quantization Techniques are paramount in reducing the memory footprint of LLMs 1. By converting the model weights from higher-precision formats (like FP32 or FP16) to lower-precision formats (such as 4-bit or 8-bit integers), the amount of memory required to store the model can be significantly decreased. Quantized LoRA (QLoRA) specifically leverages 4-bit quantization to enable fine-tuning of very large models on limited hardware 2.

Gradient Accumulation is another valuable technique that allows for training with an effectively larger batch size than would otherwise fit into memory 4. This is achieved by processing the data in smaller micro-batches and accumulating the gradients over multiple steps before performing a weight update. This method reduces the peak VRAM usage during the backward pass, making it possible to train with larger effective batch sizes on limited hardware.

Choosing an Efficient Batch Size is a critical balancing act 27. While larger batch sizes generally lead to more stable and faster training, they also require more VRAM 5. The optimal batch size is the largest one that can fit into the available VRAM without causing out-of-memory errors, and experimentation is often necessary to find this sweet spot.
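
Trainer classes expose this through a gradient_accumulation_steps argument, but the mechanism itself is straightforward. The plain PyTorch loop below shows the idea, assuming a model, optimizer, and dataloader already exist.

```python
# Assumes `model`, `optimizer`, and `dataloader` are already set up
accumulation_steps = 8  # micro-batches per optimizer step -> effective batch = micro_batch_size * 8

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs.loss / accumulation_steps  # scale so summed gradients match a full-batch update
    loss.backward()                           # gradients accumulate in .grad across micro-batches

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # one weight update per `accumulation_steps` micro-batches
        optimizer.zero_grad()
```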

Gradient Checkpointing is a technique that reduces memory usage during training by saving only a subset of the activations computed in the forward pass 4. The remaining activations are recomputed on the fly during the backward pass. This trade-off reduces the memory footprint at the cost of increased computation time.

Employing Flash Attention or other memory-efficient attention mechanisms can also contribute to lower VRAM usage during training 4. These optimized algorithms perform the attention calculations more efficiently, reducing both the memory required and the computational time. Unsloth, for example, claims to offer performance advantages over standard Flash Attention 2 in certain scenarios 13.

Finally, in extreme cases of VRAM limitation, CPU Offloading can be considered 4. This involves moving parts of the model or the training process to the system’s RAM when they are not actively being used by the GPU. While this can help to fit larger models into limited VRAM, it often comes with a significant performance penalty due to the slower data transfer between RAM and the GPU.
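
Several of these options are exposed directly through the Transformers loading and training APIs; the hedged sketch below combines them. The model identifier is a placeholder, Flash Attention 2 requires the separate flash-attn package and a compatible GPU, and the memory caps are illustrative values for a 12GB card.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-base-model",               # placeholder identifier
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # memory-efficient attention; needs flash-attn installed
    device_map="auto",
    max_memory={0: "11GiB", "cpu": "30GiB"},  # cap GPU usage and spill the remainder to system RAM
)

# Trade compute for memory: recompute activations during the backward pass
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with gradient checkpointing during training
```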



🌟 Potential Challenges and Workarounds for Local LLM Fine-tuning on 12GB VRAM

Fine-tuning large language models locally on a 12GB VRAM card in 2025 presents several potential challenges. The limited VRAM capacity may restrict the size and complexity of the models that can be effectively fine-tuned 6. Training times might be longer due to the necessity of using smaller batch sizes and memory-saving techniques like gradient accumulation and gradient checkpointing. Aggressive quantization, while crucial for fitting models into memory, could potentially lead to a slight degradation in performance 20. Compatibility issues between specific models and certain PEFT techniques or libraries might also arise, requiring careful selection of the appropriate tools. Furthermore, achieving optimal performance within the VRAM constraint often necessitates meticulous hyperparameter tuning to strike a balance between model accuracy and memory usage. It is important to remember that even if a model’s parameters fit within 12GB of VRAM, the dynamic memory usage during runtime, including the processing of prompts and generation of output, must also be considered 6.

Several workarounds can help mitigate these challenges. Starting with smaller variants of the target models (e.g., Llama3.2 3B instead of 8B) is a prudent first step. Experimenting with different PEFT techniques and their configurations can reveal the most memory-efficient and performant approach for a given model and task. Utilizing libraries like Unsloth, which are specifically optimized for memory-efficient training, can also be beneficial. For more resource-intensive fine-tuning tasks that exceed the capabilities of the local hardware, leveraging cloud-based platforms like Google Colab or Kaggle, which offer access to more powerful GPUs, can be a viable alternative. Finally, if the performance of directly fine-tuned models on 12GB VRAM is unsatisfactory, exploring knowledge distillation techniques to train smaller, more efficient models that mimic the behavior of larger ones might be a worthwhile consideration.
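
Because most of these workarounds ultimately come down to watching memory headroom while experimenting, a small helper like the one below (plain PyTorch, no additional dependencies) can be called between training steps or from a trainer callback.

```python
import torch

def report_vram(tag: str = "") -> None:
    """Print current and peak allocated VRAM plus free device memory."""
    allocated = torch.cuda.memory_allocated() / 1e9
    peak = torch.cuda.max_memory_allocated() / 1e9
    free, total = (x / 1e9 for x in torch.cuda.mem_get_info())
    print(f"[{tag}] allocated={allocated:.2f}GB peak={peak:.2f}GB free={free:.2f}/{total:.2f}GB")

# Example: call before and after a training step to spot where memory spikes occur
report_vram("before step")
```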



🌟 Conclusion: Empowering Local LLM Customization in 2025

In 2025, local fine-tuning of large language models with a 12GB VRAM video card is a challenging yet achievable endeavor, particularly for models like Phi4, Gemma3, and Llama3.2. Parameter-efficient fine-tuning (PEFT) techniques, most notably LoRA and QLoRA, are indispensable for adapting these models within the given memory constraints. Libraries such as Hugging Face Transformers, PEFT, Unsloth, and TRL provide the necessary tools and functionalities for implementing these techniques. The process of converting fine-tuned models to the GGUF format, facilitated by llama.cpp and tools like the LLM-GGUF-Auto-Converter, enables efficient local deployment. Optimization strategies such as quantization, gradient accumulation, and careful batch size selection are crucial for maximizing the utilization of the 12GB VRAM. While challenges such as model size limitations and longer training times may arise, workarounds like starting with smaller model variants, experimenting with different PEFT configurations, and utilizing optimized libraries can help overcome these hurdles.

🔧 Works cited

1. The Fine-Tuning Landscape in 2025: A Comprehensive Analysis | by Pradeep Das, accessed on March 21, 2025, https://medium.com/@pradeepdas/the-fine-tuning-landscape-in-2025-a-comprehensive-analysis-d650d24bed97
2. How to fine-tune open LLMs in 2025 with Hugging Face - Philschmid, accessed on March 21, 2025, https://www.philschmid.de/fine-tune-llms-in-2025
3. Fine Tuning Series 03:- Why Parameter Efficient Fine Tuning is always preferred over full Fine… - Medium, accessed on March 21, 2025, https://medium.com/@yashwanths_29644/fine-tuning-series-03-why-parameter-efficient-fine-tuning-is-always-preferred-over-full-fine-93ff5f36aadd
4. How Much VRAM Do You Need for LLMs? (Training/Fine-Tuning/Inference) - Medium, accessed on March 21, 2025, https://medium.com/@saehwanpark/how-much-vram-do-you-need-for-llms-training-fine-tuning-inference-2fb75666cea8
5. vRAM Requirements for LLM Fine-Tuning | by Dzmitry Ashkinadze | Medium, accessed on March 21, 2025, https://medium.com/@dzmitry.ashkinadze/vram-requirements-for-llm-fine-tuning-ec35e42240d8
6. Optimizing VRAM Settings for Using Local LLM on macOS (Fine-tuning: 1) | Peddals Blog, accessed on March 21, 2025, https://blog.peddals.com/en/fine-tune-vram-size-of-mac-for-llm/
7. Is 12GB of VRAM enough in 2025? : r/buildapc - Reddit, accessed on March 21, 2025, https://www.reddit.com/r/buildapc/comments/1hy4w1m/is_12gb_of_vram_enough_in_2025/
8. 12GB VRAM enough for 2-3 years? :: Hardware and Operating Systems - Steam Community, accessed on March 21, 2025, https://steamcommunity.com/discussions/forum/11/4133808904483266774/?l=koreana
9. Fine-tuning Llama 3.2 and Using It Locally: A Step-by-Step Guide | DataCamp, accessed on March 21, 2025, https://www.datacamp.com/tutorial/fine-tuning-llama-3-2
10. Fine-Tune Gemma for Vision Tasks using Hugging Face Transformers and QLoRA, accessed on March 21, 2025, https://ai.google.dev/gemma/docs/core/huggingface_vision_finetune_qlora
11. PEFT: Parameter-Efficient Fine-Tuning Methods for LLMs - Hugging Face, accessed on March 21, 2025, https://huggingface.co/blog/samuellimabraz/peft-methods
12. Fine-tuning LLMs for text generation | LocalAI documentation, accessed on March 21, 2025, https://localai.io/docs/advanced/fine-tuning/
13. Fine-tune Gemma 3 with Unsloth, accessed on March 21, 2025, https://www.unsloth.ai/blog/gemma3
14. unslothai/unsloth: Finetune Llama 3.3, DeepSeek-R1 … - GitHub, accessed on March 21, 2025, https://github.com/unslothai/unsloth
15. Gemma 3 Fine-tuning now in Unsloth - 1.6x faster with 60% less VRAM - Reddit, accessed on March 21, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1jba8c1/gemma_3_finetuning_now_in_unsloth_16x_faster_with/
16. Fine-tune Llama 3.2 Vision with Unsloth, accessed on March 21, 2025, https://www.unsloth.ai/blog/llama3-2
17. How to Convert Models to GGUF Format? - Analytics Vidhya, accessed on March 21, 2025, https://www.analyticsvidhya.com/blog/2024/10/convert-models-to-gguf-format/
18. dwain-barnes/LLM-GGUF-Auto-Converter: Automated Jupyter notebook solution for batch converting Large Language Models to GGUF format with multiple quantization options. Built on llama.cpp with HuggingFace integration. - GitHub, accessed on March 21, 2025, https://github.com/dwain-barnes/LLM-GGUF-Auto-Converter
19. Converting a Hugging Face LLM Model to GGUF Format Using Google Colab, accessed on March 21, 2025, https://ruslanmv.com/blog/How-to-convert-Models-to-GGUF-in-Google-Colab
20. Tutorial: How to convert HuggingFace model to GGUF format · ggml-org llama.cpp · Discussion #2948 - GitHub, accessed on March 21, 2025, https://github.com/ggml-org/llama.cpp/discussions/2948
21. Bring your own fine-tuned model to MAX pipelines - Modular, accessed on March 21, 2025, https://docs.modular.com/stable/max/tutorials/max-pipeline-bring-your-own-model/
22. Fine-Tune Any LLM, Convert to GGUF, And Deploy Using Ollama - YouTube, accessed on March 21, 2025, https://www.youtube.com/watch?v=sK7yqqrK2fE
23. LoRA Fine-Tuning and Llama 3 | Lecture 16 | LLM 2025 - YouTube, accessed on March 21, 2025, https://www.youtube.com/watch?v=2I_Tx7Xtx8E
24. Fine-Tuning Llama 3 with LoRA: Step-by-Step Guide - Neptune.ai, accessed on March 21, 2025, https://neptune.ai/blog/fine-tuning-llama-3-with-lora
25. GPU System Requirements Guide for Gemma 3 Multimodal - ApX Machine Learning, accessed on March 21, 2025, https://apxml.com/posts/gemma-3-gpu-requirements
26. Fine-tuning | How-to guides - Llama, accessed on March 21, 2025, https://www.llama.com/docs/how-to-guides/fine-tuning/
27. Fine-tuning Llama 3.2 on Your Data with a single GPU | Training LLM for Sentiment Analysis, accessed on March 21, 2025, https://www.youtube.com/watch?v=9wp0Gd9-pfE
28. IBM Research: Granite foundation models (link not functional; access date unavailable), https://www.ibm.com/blogs/research/granite-foundation-models/
29. How to convert my fine-tuned model to .gguf ? : r/LocalLLaMA - Reddit, accessed on March 21, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1amjx77/how_to_convert_my_finetuned_model_to_gguf/