Converting your own finetuned language model to GGUF
In the fast-paced world of artificial intelligence and machine learning, efficiency and performance are key. Let’s say you have finetuned your own version of a language model for a particular downstream task using the 🤗 Hugging Face TRL framework. The finetuned version is working as expected, but the resulting model is still too big for your particular requirements. Moreover, inference is somewhat slow.
That’s where tools like llama.cpp and the GGUF format come into play. This post is a short guide to converting your models to the GGUF format using llama.cpp.
What Are llama.cpp and the GGUF Format?
Llama.cpp started as a C++ rewrite of the inference engine of the original Llama model (released by Meta in February 2023). The original goal was to make it possible to run Llama on Apple CPUs. It has since become a complete ecosystem for running and testing local language models. Llama.cpp focuses on providing an efficient, high-performance foundation for running these models. Beyond the many performance tricks implemented in the llama.cpp code, the most relevant feature is its quantization algorithms.
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types, such as 4-bit (int4) or 8-bit (int8) integers, instead of the usual 32-bit floating point (float32).
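As a rough back-of-the-envelope illustration (ignoring the small per-block metadata that practical quantization schemes add): a 7-billion-parameter model stored in float32 needs about 7B × 4 bytes ≈ 28 GB just for its weights, while the same weights in int4 need about 7B × 0.5 bytes ≈ 3.5 GB.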
On the other hand, GGUF is a compact, efficient format designed by the team behind llama.cpp for storing and transferring LLMs, particularly suited for deployment in inference tasks.
GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
Why Convert Models to GGUF?
Converting models to GGUF can significantly reduce the model size while largely preserving quality, making it ideal for deployment in environments where resources are limited. The GGUF format is also optimized for fast CPU inference (GPU offloading is also supported), ensuring that your models run as efficiently as possible. See here for a more complete list of the benefits of using GGUF.
Step-by-Step Conversion to GGUF Using llama.cpp
Converting your models to the GGUF format involves a few steps, but fret not: the process is straightforward. Here’s how to do it:
1. Install llama.cpp
Before you begin, you’ll need to have llama.cpp installed on your system. This setup ensures that you have all the necessary libraries and dependencies to convert and run your models.
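A minimal sketch of that setup, assuming git, a C/C++ toolchain, and Python are available on your machine (the repository layout and build steps may differ across llama.cpp versions):
# clone the llama.cpp repository, which ships convert.py and the quantize tool
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# install the Python dependencies used by convert.py
pip install -r requirements.txt
# build the native tools (quantize, main, etc.)
make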
2. Convert Your Model
Next, you’ll convert your existing model to a GGUF-compatible format. This is done via a Python script, convert.py. The convert.py tool is mostly just for converting models in other formats (like Hugging Face) into one that the other llama.cpp tools can deal with.
The command looks something like this:
python convert.py model_dir/ --outtype f32 --outfile "model_dir/model.bin"
This command specifies the input model directory, the output type (in this case, f32 for 32-bit floating point), and the output file name and location. Other possible values for --outtype are f16 and q8. Notice that here you are not applying any quantization algorithm; you are just converting from the Hugging Face format to the GGUF format.
If needed, you can quantize down to Q8 (i.e., 8 bits) using the convert.py script. However, if you want to use lower-bit quantization approaches, you will need to process the model.bin file with another tool.
In my experience, the better approach is to use the convert script with no quantization (i.e., using --outtype f32) and then use the quantize command to do the actual quantization. Some people have reported a minor performance loss if you first apply some level of quantization, like Q8, in the convert.py script and then quantize again using the quantize command. In my case I observed that behavior even with f16, so I prefer to always use --outtype f32.
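Putting that recommendation together, the full two-step flow looks like this (the quantize command is covered in the next section; the paths are just illustrative):
python convert.py model_dir/ --outtype f32 --outfile "model_dir/model.bin"
./quantize model_dir/model.bin model_dir/model.Q5_K_M.gguf q5_k_m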
3. Lower Quantization
To quantize to lower bit widths, you will need to use the quantize command. For that, you will need to compile llama.cpp from source. The quantize command allows you to quantize your model to different sizes. The list of supported types is large, but usually the best choices are q5_k_m (5-bit quantization) and q4_k_m (4-bit quantization); the final choice is yours, of course.
TheBloke at Hugging Face usually provides quantized models at different sizes, together with a table indicating the recommended size. Notice that the recommendation is made considering the trade-off between memory usage and overall performance of the model.
The command for quantization looks like this:
./quantize model_dir/model.bin model_dir/model.Q5_K_M.gguf q5_k_m
This step converts your .bin file into a .gguf file, which is ready for deployment.
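As a quick sanity check, you can load the quantized file and run a short prompt with the main example binary. This is just a sketch, assuming you built the binaries with make; in recent llama.cpp releases this binary is called llama-cli, and the prompt and token count below are only illustrative:
./main -m model_dir/model.Q5_K_M.gguf -p "Once upon a time" -n 64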
Some final comments
Converting your models to the GGUF format can dramatically improve inference performance, especially in resource-constrained environments. However, as expected, quantization can degrade the quality of the model’s outputs. The final trade-off between inference speed and output quality will depend on you and the goal of your downstream task.
On the other hand, it is clear that this recipe can change over time. The llama.cpp ecosystem is very active, and most of the tools are provided as-is without clear documentation. The convert.py script is an example where not all the parameters are clearly explained. For instance, the --vocab-type parameter supports different strategies like spm, bpe, and hfft, but no example of how to use them is provided. I’m personally interested in using bpe but not sure how to do it. I will investigate and improve this post in the future.