
Registering a Llama2 model in llama.cpp gguf format as an MLflow Model and using it

Llama2 models in the llama.cpp gguf format can be registered as MLflow Models and used on the local machine.

Create conda env

First, we create a conda environment suitable for llama-cpp-python.

conda create -n l6 python=3.9
conda activate l6
pip install transformers[torch]
pip install "mlflow>=2.6.0" numpy scipy pandas scikit-learn cloudpickle sentencepiece infinstor_mlflow_plugin
FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
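To confirm that the environment is set up correctly, you can check that the two key packages import cleanly. A minimal sanity check:

# verify that llama-cpp-python and mlflow are importable in the l6 conda env
import llama_cpp
import mlflow

print("llama-cpp-python:", llama_cpp.__version__)
print("mlflow:", mlflow.__version__)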

Llama2 license

Llama2 is available for commercial use, but you must obtain a free license from Meta. Go to the Meta website and sign up for access to the Llama2 models.

Download HF Llama2 model

Now, download the Hugging Face Llama2 model:

git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Note that you must log in with your Hugging Face username and access token in order to complete the above command. Note also that this same Hugging Face account must be authorized to download Llama2 models.
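If you prefer to stay in Python, the same files can be fetched with the huggingface_hub library instead of git. A minimal sketch, assuming you have already authenticated (for example with huggingface-cli login):

# download the gated Llama2 chat model into a local directory
# (requires a Hugging Face account that has been granted Llama2 access)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="Llama-2-7b-chat-hf",
)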

Convert Llama2 to gguf format

First, check out the llama.cpp source code from GitHub:

git clone https://github.com/ggerganov/llama.cpp.git

Next, use the convert.py utility included in llama.cpp to convert the downloaded model to gguf format

(cd Llama-2-7b-chat-hf; python ../llama.cpp/convert.py --outtype q8_0 .)

The output should look similar to the following:

-rw-rw-r-- 1 jagane jagane  7161089696 Sep 10 21:53 ggml-model-q8_0.gguf
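Before logging the model, it is worth loading the gguf file directly with llama-cpp-python to confirm that the conversion succeeded. A minimal sketch:

# quick sanity check: load the converted gguf file and run a short completion
from llama_cpp import Llama

llm = Llama(model_path="Llama-2-7b-chat-hf/ggml-model-q8_0.gguf")
out = llm("Q: What is the capital of California? A:", max_tokens=32)
print(out["choices"][0]["text"])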

Log Model to MLflow for Chat

Next, we use logmodel to log the Meta Llama2 model, in llama.cpp gguf format, as an MLflow model. This example is for the chat task.

git clone https://github.com/jagane-infinstor/logmodel.git
(cd logmodel/llama2-gguf; python log.py --data_path ../../Llama-2-7b-chat-hf/ggml-model-q8_0.gguf --task chat)

The above command logs the llama.cpp gguf format model as an MLflow model for the chat task. You can now go to the MLflow GUI and select the specific experiment/run; the model will be displayed in the artifacts pane for that run. Go ahead and register the model, for example as llama2-gguf-chat.
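For reference, the actual logging logic lives in log.py in the logmodel repository; the sketch below only illustrates the general pattern of wrapping a gguf file as an MLflow pyfunc model. The class name, input format, and generation parameters here are illustrative assumptions, not the ones log.py necessarily uses.

# illustrative sketch of logging a llama.cpp gguf file as an MLflow pyfunc model;
# the real implementation is log.py in the logmodel repository
import mlflow
import mlflow.pyfunc

class LlamaCppChat(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        from llama_cpp import Llama
        # the gguf file is stored as an MLflow artifact and resolved at load time
        self.llm = Llama(model_path=context.artifacts["gguf"])

    def predict(self, context, model_input):
        # the expected input format depends on how the wrapper is written;
        # a "prompt" column is assumed here for illustration
        return [self.llm(p, max_tokens=256)["choices"][0]["text"]
                for p in model_input["prompt"]]

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=LlamaCppChat(),
        artifacts={"gguf": "../../Llama-2-7b-chat-hf/ggml-model-q8_0.gguf"},
    )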

Test the logged Chat model

The program chat.py included in the logmodel GitHub tree is useful for testing the logged model:

python chat.py --model models:/llama2-gguf-chat/1

The output should look something like this:

> What is the capital of California?

llama_print_timings:        load time =  3552.53 ms
llama_print_timings:      sample time =    51.67 ms /    78 runs   (    0.66 ms per token,  1509.58 tokens per second)
llama_print_timings: prompt eval time =  3552.49 ms /    94 tokens (   37.79 ms per token,    26.46 tokens per second)
llama_print_timings:        eval time = 20680.57 ms /    77 runs   (  268.58 ms per token,     3.72 tokens per second)
llama_print_timings:       total time = 24444.20 ms
assistant>   Thank you for asking! The capital of California is Sacramento. It is located in the northern part of the state, along the Sacramento River. Sacramento has a rich history and culture, and it is home to many important government buildings and institutions, including the California State Capitol. I hope that helps! Let me know if you have any other questions.
> How far is it from San Francisco?
Llama.generate: prefix-match hit

llama_print_timings:        load time =  3552.53 ms
llama_print_timings:      sample time =    63.20 ms /   101 runs   (    0.63 ms per token,  1598.08 tokens per second)
llama_print_timings: prompt eval time =  3425.73 ms /    96 tokens (   35.68 ms per token,    28.02 tokens per second)
llama_print_timings:        eval time = 25718.24 ms /   100 runs   (  257.18 ms per token,     3.89 tokens per second)
llama_print_timings:       total time = 29382.87 ms
assistant>   Great question! Sacramento is located approximately 150 miles (241 kilometers) northeast of San Francisco. The drive from San Francisco to Sacramento typically takes about 2-3 hours, depending on traffic and the route you take. There are also public transportation options available, such as buses and trains, which can take a bit longer but offer a convenient alternative to driving. I hope that helps! Let me know if you have any other questions.
> 
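Instead of the interactive chat.py program, the registered model can also be loaded and queried directly through the MLflow pyfunc API. A minimal sketch; the input column name below is an assumption, since the exact schema is defined by the logged wrapper, so check chat.py for the real format:

# load the registered chat model from the MLflow model registry and query it
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model("models:/llama2-gguf-chat/1")
# "prompt" is an assumed column name; the logged wrapper defines the real schema
result = model.predict(pd.DataFrame({"prompt": ["What is the capital of California?"]}))
print(result)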

Log Model to MLflow for Embedding

Next, we use logmodel to log the Meta Llama2 model, in llama.cpp gguf format, as an MLflow model. This example is for the embedding generation task.

git clone https://github.com/jagane-infinstor/logmodel.git
(cd logmodel/llama2-gguf; python log.py --data_path ../../Llama-2-7b-chat-hf/ggml-model-q8_0.gguf --task embedding-generation)

The above command logs the llama.cpp gguf format model as an MLflow model for the embedding generation task. You can now go to the MLflow GUI and select the specific experiment/run; the model will be displayed in the artifacts pane for that run. Go ahead and register the model, for example as llama2-gguf-embedding.

Test the logged Embeddings model

The program embeddings.py included in the logmodel GitHub tree is useful for testing the logged model:

python embeddings.py --model models:/llama2-gguf-embedding/1

Now, when you type in a sentence at the > prompt, the program will print out the embeddings generated by the model.
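The embeddings model can likewise be called through the MLflow pyfunc API rather than interactively. A minimal sketch, with the same caveat that the input column name is an assumption and the real schema is defined by the logged wrapper:

# load the registered embedding model and generate embeddings for a sentence
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model("models:/llama2-gguf-embedding/1")
# "input" is an assumed column name; see embeddings.py for the real format
embeddings = model.predict(pd.DataFrame({"input": ["What is the capital of California?"]}))
print(embeddings)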

That's all folks!