Technical Documentation

Converting Unsloth Fine-Tuned LoRA Models to GGUF Format

Technical guide covering **converting Unsloth fine-tuned LoRA models to GGUF format**

👤
Author
Cosmic Lounge AI Team
📅
Updated
6/1/2025
⏱️
Read Time
13 min
Topics
#llm #ai #model #fine-tuning #training #gpu #cuda #api #introduction #design



🌌 Converting Unsloth Fine-tuned LoRA Models to GGUF Format



🌟 1. Introduction: Bridging the Gap from Fine-tuned LoRA to Efficient GGUF Inference

🚀 Welcome to this comprehensive guide! This section will give you the foundational knowledge you need. Unsloth has emerged as a prominent framework for the efficient fine-tuning of large language models, often employing techniques such as QLoRA to minimize memory consumption during the training phase 1. A key aspect of deploying these fine-tuned models is their conversion into formats suitable for inference. LoRA, being a parameter-efficient adaptation method, results in a set of adapter weights that modify the behavior of a pre-trained base model 4. To facilitate efficient inference, particularly on consumer-grade hardware, the GGUF (GPT-Generated Unified Format) has become increasingly popular 1. This binary format is designed for rapid loading and optimized performance when using inference libraries like llama.cpp 10. The user in this context has successfully fine-tuned a model using Unsloth and now seeks a detailed understanding of how to convert the resulting LoRA weights into the GGUF format, taking into account the capabilities of their 12GB VRAM video card. The need for this conversion arises from the desire to run the fine-tuned model locally for inference, a task for which GGUF and its associated tools are well-suited 10.



🌟 2. Leveraging Unsloth’s Streamlined save_pretrained_gguf Function

Recognizing the importance of deploying fine-tuned models in efficient formats, Unsloth provides a direct and user-friendly method for converting models, including those fine-tuned using LoRA, to the GGUF format through the save_pretrained_gguf function 4. This integration within the Unsloth framework suggests a design choice aimed at simplifying the deployment pipeline for users. The save_pretrained_gguf function accepts several parameters that control the conversion process 8. The “dir” parameter specifies the local directory where the output GGUF file will be stored. The tokenizer parameter requires the tokenizer object associated with the fine-tuned model, which is essential for proper text processing during inference. Critically, the quantization_method parameter dictates the level of quantization applied when saving the model to the GGUF format 4. Different quantization methods offer varying trade-offs between model size, potential accuracy, and inference speed. The simplicity of using this function is evident in the following code examples. To save a LoRA model in GGUF format with Q4_K_M quantization, a user can execute the following Python code:

Python

from unsloth import FastLanguageModel

# Load the fine-tuned LoRA model and its tokenizer from the local adapter directory.
model, tokenizer = FastLanguageModel.from_pretrained("your_lora_model_directory")

# Merge the LoRA weights into the base model and export to GGUF with Q4_K_M quantization.
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")

This concise code snippet highlights the ease with which a fine-tuned LoRA model can be converted to GGUF using Unsloth. Similarly, to save the model with higher precision using F16 format, the quantization_method parameter can be changed accordingly:

Python

model.save_pretrained_gguf("gguf_model_f16", tokenizer, quantization_method="f16")

For a middle-ground approach, Q8_0 quantization can be used:

Python

model.save_pretrained_gguf("gguf_model_q8", tokenizer, quantization_method="q8_0")

The availability of these diverse quantization options directly within the save_pretrained_gguf function empowers users to tailor the resulting GGUF model to their specific hardware capabilities and desired performance characteristics. A significant advantage of using Unsloth’s save_pretrained_gguf function is its automatic handling of LoRA weight merging with the base model before the conversion to GGUF 6. As explicitly stated in community discussions, “Unsloth automatically merges your LoRA weights and makes a 16bit model, then converts to GGUF directly” 6. This automatic merging streamlines the deployment process for users, eliminating the need for a separate, manual merging step.
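Once the export finishes, it can be useful to sanity-check the resulting file before deploying it. The sketch below assumes the llama-cpp-python package (see Section 5) is installed and that the export produced a file such as gguf_model/unsloth.Q4_K_M.gguf; the actual filename may differ, so check the output directory first.

Python

from llama_cpp import Llama

# Path is an assumption; inspect the output directory for the real GGUF filename.
llm = Llama(model_path="gguf_model/unsloth.Q4_K_M.gguf", n_ctx=2048)

# Run a short completion to confirm the model loads and generates text.
result = llm("The quick brown fox", max_tokens=16)
print(result["choices"][0]["text"])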



🌟 3. The Manual Route: Converting LoRA to GGUF using llama.cpp

While Unsloth’s integrated function offers a convenient solution, understanding the manual conversion process using the llama.cpp toolkit provides a deeper level of control and becomes particularly valuable in scenarios where troubleshooting is required or when working with models that might not be directly supported by Unsloth’s function 4. This manual approach involves several steps, beginning with ensuring that llama.cpp is properly built and configured. The first prerequisite is to clone the llama.cpp repository using the command git clone --recursive https://github.com/ggerganov/llama.cpp. After cloning, navigate into the repository directory using cd llama.cpp. Next, install the necessary Python dependencies by running pip install -r requirements.txt. Finally, build the llama.cpp library using the command make clean && make all -j. For users with NVIDIA GPUs, it is highly recommended to build with CUDA support to potentially accelerate the conversion process.

Once llama.cpp is built, the manual conversion of a LoRA model typically involves a few key stages. If the user only possesses the LoRA adapter weights and the base model in a standard Hugging Face format, the base model might need to be converted to the GGUF format first. This can be done using the convert-hf-to-gguf.py script located within the llama.cpp directory. An example command for this conversion is: python convert-hf-to-gguf.py /path/to/your/base_model --outfile /path/to/output/base_model.f16.gguf --outtype f16 4. This step ensures that the base model is in a format that llama.cpp can utilize for subsequent merging with the LoRA adapters. Following the base model conversion (if necessary), the LoRA adapter weights themselves need to be converted into a format compatible with llama.cpp. The toolkit provides a dedicated script, convert-lora-to-ggml.py, for this purpose. The command python convert-lora-to-ggml.py /path/to/your/lora_adapters will typically generate a ggml-adapter-model.bin file containing the converted LoRA weights 4.

The next critical step in the manual process is merging the converted base model (in GGUF format) with the converted LoRA adapter weights. The llama.cpp toolkit includes an export-lora tool for this purpose. An example command for merging is: export-lora --model-base /path/to/base_model.f16.gguf --model-out /path/to/output/merged_model.gguf --lora /path/to/lora_adapters/ggml-adapter-model.bin 4. This command effectively combines the knowledge embedded in the base model with the task-specific adaptations learned through LoRA fine-tuning, resulting in a single, merged model in GGUF format. Finally, to optimize the size and performance of the merged model for inference, it is highly recommended to quantize it using the quantize tool provided by llama.cpp. The command for quantization typically looks like this: quantize /path/to/merged_model.gguf /path/to/output/quantized_model.Q8_0.gguf Q8_0 4. The Q8_0 in this command specifies the desired quantization level; users can choose different quantization levels based on their specific needs and hardware constraints. A consolidated sketch of this workflow is shown below.
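For reference, the commands described above can be strung together into a single shell session. The paths are placeholders, and the helper script and binary names have changed between llama.cpp releases (for example, the quantize and export-lora tools are prefixed with llama- in newer builds), so treat this as a sketch of the workflow rather than a copy-and-paste recipe.

Bash

# 1. Obtain and build llama.cpp (see its docs for enabling CUDA support).
git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
make clean && make all -j

# 2. Convert the Hugging Face base model to GGUF at F16 precision.
python convert-hf-to-gguf.py /path/to/your/base_model \
    --outfile /path/to/output/base_model.f16.gguf --outtype f16

# 3. Convert the LoRA adapter weights to a llama.cpp-compatible file.
python convert-lora-to-ggml.py /path/to/your/lora_adapters

# 4. Merge the base GGUF with the converted adapter.
./export-lora --model-base /path/to/output/base_model.f16.gguf \
    --model-out /path/to/output/merged_model.gguf \
    --lora /path/to/your/lora_adapters/ggml-adapter-model.bin

# 5. Quantize the merged model for inference.
./quantize /path/to/output/merged_model.gguf /path/to/output/quantized_model.Q8_0.gguf Q8_0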



🌟 4. Explicitly Merging LoRA with the Base Model Before GGUF Conversion

In certain situations, particularly when encountering difficulties with direct LoRA to GGUF conversion or when employing tools outside of Unsloth’s built-in functionalities, an alternative strategy involves explicitly merging the LoRA weights with the base model using libraries such as peft (Parameter-Efficient Fine-Tuning) and transformers before proceeding with the GGUF conversion 6. This approach creates an intermediate, full-weight model that might exhibit greater compatibility with various GGUF conversion tools. To perform this explicit merging, one can utilize the peft and transformers libraries in Python. First, the base model and the LoRA adapter need to be loaded:

Python

# Load the base model and attach the LoRA adapter on top of it.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "your_base_model_name"
adapter_model_name = "your_lora_model_directory"

model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, adapter_model_name)

Here, AutoModelForCausalLM from transformers is used to load the base language model, and PeftModel.from_pretrained from peft is then used to load the LoRA adapter weights and attach them to the base model. Next, the LoRA weights are merged into the base model’s weights using the merge_and_unload() method:

Python

merged_model = model.merge_and_unload()

The merge_and_unload() function efficiently combines the weights of the LoRA adapter with the base model, effectively creating a new model where the fine-tuning is directly incorporated into the original weights. The unload() part of the function name indicates that the LoRA adapter is no longer needed separately, which can help in managing memory usage, especially on systems with limited GPU memory. The merged model and its tokenizer can then be saved to disk:

Python

# Write the merged model weights and configuration to a local directory.
merged_model.save_pretrained("merged_adapters")

# Save the matching tokenizer alongside it (use your LoRA tokenizer directory if it differs).
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.save_pretrained("merged_adapters")

Saving the merged model in this manner creates a directory containing the model’s configuration and weights in a widely recognized format. Once the LoRA weights are explicitly merged with the base model and saved, the resulting “merged_adapters” directory can be treated as a standard pre-trained model. Consequently, the convert-hf-to-gguf.py script from the llama.cpp toolkit can be used to convert this merged model to the GGUF format, following the same steps outlined in Section 3 for converting a base model.
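Concretely, that final conversion mirrors the base-model step from Section 3; here is a short sketch, assuming the commands are run from inside the llama.cpp directory and that the paths are adjusted to your layout.

Bash

# Convert the merged full-weight model to GGUF at F16 precision.
python convert-hf-to-gguf.py ./merged_adapters \
    --outfile merged_adapters.f16.gguf --outtype f16

# Optionally quantize the result, e.g. to Q4_K_M for a 12GB-class machine.
./quantize merged_adapters.f16.gguf merged_adapters.Q4_K_M.gguf Q4_K_M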



🌟 5. Software Dependencies and Installation Guide

Successfully converting Unsloth fine-tuned LoRA models to the GGUF format necessitates the installation of several software dependencies 10. The primary programming language involved is Python (version 3.8 or higher is recommended), as Unsloth and many related tools are Python-based 10. The core library for fine-tuning, Unsloth, needs to be installed using the pip package manager with the command pip install unsloth 1. The conversion to GGUF often involves the use of llama.cpp, a powerful library for efficient inference. This library typically requires cloning and building from its source repository, as detailed in Section 3 4. Additionally, the Python packages gguf and protobuf are usually required as dependencies for llama.cpp and can be installed using pip install gguf protobuf 4. For certain functionalities, particularly if you intend to interact with llama.cpp from Python, the llama-cpp-python package might be necessary and can be installed via pip install llama-cpp-python.

In summary, the installation process generally involves the following steps: First, ensure Python is installed on your system. Then, install the Unsloth library using pip. Next, clone the llama.cpp repository from GitHub. Navigate to the llama.cpp directory and install its Python dependencies using the requirements.txt file. Build the llama.cpp library using the make command, potentially with CUDA enabled if you have an NVIDIA GPU. Finally, if you plan to use the explicit merging method, install the peft and transformers libraries using pip. It is worth noting that Unsloth is primarily developed and tested on Linux environments, and while it might function on Windows through the Windows Subsystem for Linux (WSL), Linux generally offers better performance and compatibility for machine learning workflows 11.
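The full dependency setup described above can be condensed into the following shell commands; package names reflect the steps in this section, and the appropriate CUDA build option should be checked against the llama.cpp documentation for your version.

Bash

# Core fine-tuning library
pip install unsloth

# llama.cpp and its Python requirements
git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
make clean && make all -j   # see the llama.cpp docs for the flag that enables CUDA builds

# Conversion helpers and optional extras
pip install gguf protobuf
pip install peft transformers    # only needed for the explicit merging route
pip install llama-cpp-python     # optional, for loading GGUF files from Python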



🌟 6. Analyzing the Role and Limitations of Your 12GB VRAM

The 12GB VRAM of the user’s video card plays a crucial role in various stages of the fine-tuning and conversion process 2. During the initial fine-tuning phase, VRAM is the most critical resource. Unsloth is specifically designed to be memory-efficient, often employing techniques like QLoRA, which allows for the fine-tuning of large language models even on GPUs with limited VRAM, such as 12GB 2. When it comes to explicitly merging the LoRA weights with the base model using libraries like peft and transformers, the VRAM requirements can vary significantly depending on the size of the base model 13. For very large base models, such as those with 70 billion parameters or more, merging the weights in full precision can be quite VRAM-intensive, and 12GB might pose a limitation. For smaller models, roughly up to the 13-billion-parameter range, the merge is generally feasible on such a system, particularly because the merging step can also be performed on the CPU using system RAM rather than the GPU. The GGUF conversion process itself, particularly when using the llama.cpp tools, tends to be more reliant on CPU and system RAM rather than being solely limited by VRAM 7. While the initial loading of the model into memory might utilize some VRAM, especially if a GPU-enabled build of llama.cpp is used, the primary operations during conversion involve processing the model weights in system memory and writing them to disk in the GGUF format. Therefore, the 12GB VRAM is less likely to be a bottleneck during the pure GGUF conversion step using llama.cpp. When using Unsloth’s save_pretrained_gguf function, which likely handles the merging and conversion in a more optimized manner, VRAM usage during the conversion process is also expected to be relatively low. In conclusion, for the LoRA to GGUF conversion process, especially when utilizing Unsloth’s built-in function, a 12GB VRAM card should generally be adequate for most common model sizes and LoRA configurations. However, if the user is working with exceptionally large base models and opts for the explicit merging route, they might encounter VRAM limitations.
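If you do take the explicit merging route on this hardware, one practical mitigation is to perform the merge in 16-bit precision on the CPU so that system RAM, rather than the 12GB of VRAM, is the limiting factor. The following is a minimal sketch under that assumption; the model names are placeholders, the device_map argument requires the accelerate package, and actual memory behavior depends on the model size.

Python

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model in float16 entirely on the CPU (requires the accelerate package).
base = AutoModelForCausalLM.from_pretrained(
    "your_base_model_name",            # placeholder
    torch_dtype=torch.float16,
    device_map={"": "cpu"},
    low_cpu_mem_usage=True,
)

# Attach the LoRA adapter, fold its weights into the base model, and save the result.
model = PeftModel.from_pretrained(base, "your_lora_model_directory")  # placeholder
merged = model.merge_and_unload()
merged.save_pretrained("merged_adapters")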



🌟 7. Key Unsloth Parameters and Configurations for GGUF Conversion

When using Unsloth for GGUF conversion, several key parameters and configurations play a significant role in determining the characteristics of the output file 4. The quantization_method parameter within the model.save_pretrained_gguf function is paramount as it dictates the quantization level applied to the model during the conversion to the GGUF format. The choice of quantization method has a direct impact on the final GGUF file’s size and the inference performance of the model. For instance, selecting “q4_k_m” will result in a significantly smaller model size, making it suitable for resource-constrained environments, but it might come with a slight reduction in accuracy compared to higher-precision formats. On the other hand, choosing “f16” will retain the model’s original 16-bit floating-point precision (although not true quantization), leading to the largest file size but potentially the highest accuracy and slowest inference speed. Options like “q8_0” offer a compromise, providing a good balance between model size, accuracy, and inference speed. If the user decides to first merge the LoRA weights using Unsloth’s save_pretrained_merged function, there are also relevant parameters to consider. For example, the save_method parameter in save_pretrained_merged allows specifying the precision of the intermediate merged model. A common practice is to save the merged model in “merged_16bit” format, which uses 16-bit floating-point precision. Another potentially relevant parameter in Unsloth, although primarily used when saving models in other formats, is maximum_memory_usage. This parameter allows the user to control the maximum amount of GPU memory that Unsloth will utilize during the saving process. While it might not directly affect the save_pretrained_gguf function in all cases, if an intermediate saving step is involved (e.g., saving the merged model), adjusting this parameter could be beneficial for users with limited VRAM to prevent out-of-memory errors.
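Putting these parameters together, a minimal sketch of the two-step Unsloth path might look like the following; the parameter names follow the Unsloth documentation cited above, but the API has evolved between releases, so verify them against your installed version.

Python

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("your_lora_model_directory")

# Optional intermediate step: save a 16-bit merged checkpoint first.
model.save_pretrained_merged("merged_16bit_model", tokenizer, save_method="merged_16bit")

# Export to GGUF with an explicit quantization level.
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")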



🌟 8. Exploring Community Insights and Alternative Conversion Workflows

Beyond the official documentation and the primary methods, the Unsloth and broader machine learning communities often share valuable insights and alternative workflows for converting LoRA models to GGUF 6. One interesting point mentioned in community discussions is the existence of a script named convert_lora_to_gguf.py, part of the llama.cpp repository 18. This script is intended to directly convert PEFT-compatible LoRA adapters to the GGUF format without requiring an explicit merging step with the base model first. While this approach could potentially be more direct and faster, it has been noted that this particular code path might not be as actively maintained as Unsloth’s primary save_pretrained_gguf function. Another alternative workflow involves leveraging online resources, such as the Hugging Face Space mentioned in reference 12 (https://huggingface.co/spaces/ggml-org/gguf-my-lora).

This space provides a web-based interface that allows users to upload their PEFT LoRA adapters and convert them to the GGUF format. This can be a convenient option for users who prefer a graphical interface or who want to avoid the complexities of setting up and using command-line tools like those in llama.cpp locally. The community also mentions tools like mergekit 12. Mergekit is a utility designed for merging different language model weights, including extracting and merging PEFT-compatible LoRA adapters. Finally, online forums, particularly Reddit communities like r/LocalLLaMA 6, serve as invaluable resources for discovering community-tested workflows, practical tips, and solutions to specific problems encountered during the conversion process. Users often share their experiences, including alternative methods they have found successful, such as the explicit merging approach using peft and transformers discussed earlier.
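Returning to the convert_lora_to_gguf.py script mentioned at the start of this section, a hedged sketch of a typical invocation is shown below. The flag names (--base, --outfile, --outtype) are based on recent llama.cpp versions and may differ in yours, so check the script's --help output before running it.

Bash

# Convert a PEFT LoRA adapter directly to a GGUF adapter file (no merge with the base model).
python convert_lora_to_gguf.py /path/to/your/lora_adapters \
    --base /path/to/your/base_model \
    --outfile lora_adapter.gguf \
    --outtype f16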



🌟 9. Troubleshooting Common Issues and Their Solutions

During the process of converting Unsloth LoRA models to GGUF, users might encounter certain common issues. One frequently reported error is “RuntimeError: Unsloth: Quantization failed!” 19. This error typically indicates that the llama.cpp library, which Unsloth often relies on for the GGUF conversion and quantization steps, needs to be compiled manually on the user’s system. The typical fix is to build llama.cpp from source:

Bash

git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && make all -j

After these commands complete successfully, the user should then retry the quantization step in their Unsloth script. This manual compilation ensures that the llama.cpp binaries are compatible with the user’s specific system configuration. Another potential issue, as highlighted in the research, is that some very large models (e.g., Qwen or Llama 70B) might encounter errors during the LoRA extraction or conversion process 12. This could be due to the specific architectural characteristics of these models or limitations within the conversion tools themselves. Users working with such large models might need to explore model-specific conversion strategies or consult community forums for workarounds. In some instances, file path issues can also lead to errors during the conversion or quantization process 19. It has been reported that using absolute file paths instead of relative paths in the relevant commands can sometimes resolve these issues. While not specific to Unsloth, permission issues within the operating system can occasionally interfere with the execution of conversion scripts 21. In rare cases, running the conversion commands with elevated privileges (e.g., as root) might be necessary, although this should be done with caution and only when other solutions have been exhausted. Finally, even with Unsloth’s memory optimizations, users working with very large models on systems with limited VRAM might still encounter out-of-memory errors during the saving process. In such situations, attempting to reduce the maximum_memory_usage parameter in Unsloth’s saving functions might help to alleviate the issue 4.
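For that last out-of-memory scenario, a minimal sketch of the suggested mitigation is shown below, assuming your Unsloth version exposes the maximum_memory_usage parameter on its saving functions (the Unsloth wiki reports a default of 0.75).

Python

# Cap the share of GPU memory the save step may use (assumed parameter; check your Unsloth version).
model.save_pretrained_gguf(
    "gguf_model",
    tokenizer,
    quantization_method="q4_k_m",
    maximum_memory_usage=0.5,
)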



🌟 10. Conclusion and Recommendations for Your Workflow

In summary, the process of converting Unsloth fine-tuned LoRA models to the GGUF format offers several pathways, each with its own advantages and considerations. Users can leverage Unsloth’s built-in save_pretrained_gguf function for a direct and often efficient conversion, which automatically handles the merging of LoRA weights. Alternatively, a more manual approach using the llama.cpp toolkit provides greater control over each stage of the conversion, including base model conversion, LoRA adapter conversion, merging, and quantization. Given your setup with a 12GB VRAM video card, the primary recommendation is to begin by utilizing Unsloth’s save_pretrained_gguf function for the conversion. This method is likely the most streamlined and memory-efficient for Unsloth users. When using this function, carefully consider the quantization_method parameter. For a good balance between model size and performance on your hardware, options like “q4_k_m” or “q8_0” are generally recommended. If you encounter any issues with the direct conversion or require more granular control, the secondary recommendation would be to explore the explicit merging method using peft and transformers, followed by conversion using llama.cpp. If you choose this route, be mindful of potential VRAM limitations during the merging step, especially if you are working with a very large base model. Should you encounter “Quantization failed” errors, ensure that you have correctly cloned and built the llama.cpp repository from source, as this is a common prerequisite for successful GGUF conversion. Lastly, do not hesitate to consult the active Unsloth and LocalLLaMA communities online for additional tips, troubleshooting advice, and alternative workflows that other users have found effective.

⚡ Table 1: Comparison of GGUF Quantization Methods

| Quantization Method | Description | Model Size | Accuracy | Inference Speed | Use Cases |
|---|---|---|---|---|---|
| f16 | 16-bit floating point (no true quantization) | Largest | Highest | Slowest | When maximum accuracy is critical and size is less of a concern. |
| q8_0 | 8-bit integer quantization | Medium | Good | Medium | Good balance of size, speed, and accuracy. |
| q4_k_m | 4-bit quantization with k-matrices (multiple types) | Smallest | Lower | Fastest | For resource-constrained environments. |
| q4_k_s | Another 4-bit quantization with k-matrices | Smaller than q8_0 | Slightly lower than q8_0 | Faster than q8_0 | Good balance in limited resources. |
| q5_k_m | 5-bit quantization with k-matrices | Smaller than q8_0, larger than q4_k_m | Better than q4_k_m, slightly lower than q8_0 | Faster than q8_0, slower than q4_k_m | A compromise between q4 and q8. |

🔧 Works cited

1. Fine-tuning Llama 3.2 Using Unsloth - KDnuggets, accessed on March 20, 2025, https://www.kdnuggets.com/fine-tuning-llama-using-unsloth
2. Unsloth Guide: Optimize and Speed Up LLM Fine-Tuning - DataCamp, accessed on March 20, 2025, https://www.datacamp.com/tutorial/unsloth-guide-optimize-and-speed-up-llm-fine-tuning
3. Finetuning with Unsloth: The Game-Changer in LLM Fine-tuning | by Sridevi Panneerselvam, accessed on March 20, 2025, https://medium.com/@sridevi17j/finetuning-with-unsloth-the-game-changer-in-llm-fine-tuning-e32262701195
4. Home · unslothai/unsloth Wiki - GitHub, accessed on March 20, 2025, https://github.com/unslothai/unsloth/wiki
5. Home · unslothai/unsloth Wiki - GitHub, accessed on March 20, 2025, https://github.com/unslothai/unsloth/wiki/Home/8a17212986ff4366426c859f46c3dab881f0274f
6. How to convert my fine-tuned model to .gguf ? : r/LocalLLaMA - Reddit, accessed on March 20, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1amjx77/how_to_convert_my_finetuned_model_to_gguf/
7. A Rapid Tutorial on Unsloth - Stephen Diehl, accessed on March 20, 2025, https://www.stephendiehl.com/posts/unsloth/
8. Saving to GGUF - Unsloth Documentation, accessed on March 20, 2025, https://docs.unsloth.ai/basics/running-and-saving-models/saving-to-gguf
9. Running & Saving Models - Unsloth Documentation, accessed on March 20, 2025, https://docs.unsloth.ai/basics/running-and-saving-models
10. How to Convert Models to GGUF Format? - Analytics Vidhya, accessed on March 20, 2025, https://www.analyticsvidhya.com/blog/2024/10/convert-models-to-gguf-format/
11. Fine-tuning Gemma 2 2B for custom data extraction, using Local GPU, Unsloth and your own synthetic… | by Vasudevan Vijayaragavan - Medium, accessed on March 20, 2025, https://medium.com/@vasudevan.vijay/fine-tuning-gemma-2-2b-for-custom-data-extraction-using-local-gpu-unsloth-and-your-own-synthetic-6ac4fb8064e8
12. @ngxson on Hugging Face: “Check out my collection of pre-made GGUF LoRA adapters! This allow you to use…”, accessed on March 20, 2025, https://huggingface.co/posts/ngxson/853577992234802
13. Releases · unslothai/unsloth - GitHub, accessed on March 20, 2025, https://github.com/unslothai/unsloth/releases
14. Unsloth - Browse /2025-03 at SourceForge.net, accessed on March 20, 2025, https://sourceforge.net/projects/unsloth.mirror/files/2025-03/
15. Alpaca + Llama-3 8b Unsloth 2x faster finetuning.ipynb - Google Colab, accessed on March 20, 2025, https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing
16. Gemma finetuning 243% faster, uses 58% less VRAM : r/LocalLLaMA - Reddit, accessed on March 20, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1b0kht9/gemma_finetuning_243_faster_uses_58_less_vram/
17. Quantization (q4_k_m gguf) failed for Phi-3 · Issue #413 · unslothai/unsloth - GitHub, accessed on March 20, 2025, https://github.com/unslothai/unsloth/issues/413
18. Feature request: export to GGUF LoRA (not merging) · Issue #1546 … - GitHub, accessed on March 20, 2025, https://github.com/unslothai/unsloth/issues/1546
19. RuntimeError: Unsloth: Quantization failed! You might have to compile llama.cpp yourself, then run this again. #1781 - GitHub, accessed on March 20, 2025, https://github.com/unslothai/unsloth/issues/1781
20. shenzhi-wang/Llama3-8B-Chinese-Chat · there is no tokenizer.model file - Hugging Face, accessed on March 20, 2025, https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat/discussions/35
21. unsloth/DeepSeek-V3-GGUF · Getting error with Q3-K-M - Hugging Face, accessed on March 20, 2025, https://huggingface.co/unsloth/DeepSeek-V3-GGUF/discussions/2