Registering a Llama2 model in llama.cpp format as an MLflow Model and using it¶
Llama2 models in the llama.cpp gguf format can be registered as MLflow Models and used on the local machine.
Create conda env¶
First, we create a conda environment suitable for llama-cpp-python
conda create -n l6 python=3.9
conda activate l6
pip install transformers[torch]
pip install "mlflow>=2.6.0" numpy scipy pandas scikit-learn cloudpickle sentencepiece infinstor_mlflow_plugin
FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
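To confirm that llama-cpp-python built and installed correctly, a quick import check can be run. This is only a sanity check, and it assumes the llama_cpp package exposes a __version__ attribute:

# Sanity check: verify that llama-cpp-python imports and report its version
import llama_cpp
print("llama-cpp-python version:", llama_cpp.__version__)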
Llama2 license¶
Llama2 is available for commercial use, but you must obtain a free license from Meta. Go to the following Meta website and sign up for access to the Llama2 models: here
Download HF Llama2 model¶
Now, download the Hugging Face Llama2 model
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Note that you must log in using your Hugging Face username and access token in order to complete the above command. Note also that this same Hugging Face account must be authorized to download Llama2 models.
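If you prefer not to use git, an alternative is to download the repository with the huggingface_hub package (which transformers pulls in as a dependency). This is a minimal sketch, not part of the required steps; it assumes your account has already been granted Llama2 access:

# Hypothetical alternative to `git clone`: download the model repo with huggingface_hub
from huggingface_hub import login, snapshot_download

login()  # prompts for your Hugging Face access token
snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="Llama-2-7b-chat-hf",
)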
Convert Llama2 to gguf format¶
First, check out the llama.cpp source code from GitHub
git clone https://github.com/ggerganov/llama.cpp.git
Next, use the convert.py utility included in llama.cpp to convert the downloaded model to gguf format
(cd Llama-2-7b-chat-hf; python ../llama.cpp/convert.py --outtype q8_0 .)
The output should look similar to the following:
-rw-rw-r-- 1 jagane jagane 7161089696 Sep 10 21:53 ggml-model-q8_0.gguf
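Before logging the model to MLflow, you can optionally verify that the converted gguf file loads and generates text with llama-cpp-python. A minimal sketch; the prompt and parameters here are arbitrary:

# Optional sanity check: load the converted gguf file and run a short completion
from llama_cpp import Llama

llm = Llama(model_path="Llama-2-7b-chat-hf/ggml-model-q8_0.gguf")
out = llm("Q: What is the capital of California? A:", max_tokens=32)
print(out["choices"][0]["text"])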
Log Model to MLflow for Chat¶
Next, we use logmodel to log the Meta Llama2 model in llama.cpp gguf format and turn it into an MLflow model. This example is for the chat task.
git clone https://github.com/jagane-infinstor/logmodel.git
(cd logmodel/llama2-gguf; python log.py --data_path ../../Llama-2-7b-chat-hf/ggml-model-q8_0.gguf --task chat)
The above command causes the llama.cpp gguf format model to be logged as an MLflow model for the chat task. You can now go to the MLflow GUI and select the specific experiment/run. The model will be displayed in the artifacts pane for that run. Go ahead and register the model, for example, as llama2-gguf-chat.
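If you are curious what log.py does internally, the general pattern is to wrap llama-cpp-python in an MLflow pyfunc model and log the gguf file as an artifact. The following is a minimal sketch of that pattern, not the actual contents of log.py; the class name Llama2GGUFChat and the "prompt" input column are made up for illustration:

# Minimal sketch of logging a llama.cpp gguf model as an MLflow pyfunc model
# Illustrative only; the real log.py in the logmodel repo may differ
import mlflow
from llama_cpp import Llama

class Llama2GGUFChat(mlflow.pyfunc.PythonModel):  # hypothetical wrapper class
    def load_context(self, context):
        # MLflow stages the gguf artifact with the run; load it here
        self.llm = Llama(model_path=context.artifacts["gguf"])

    def predict(self, context, model_input):
        # Treat each input row as a user message and return the assistant reply
        replies = []
        for prompt in model_input["prompt"].tolist():
            resp = self.llm.create_chat_completion(
                messages=[{"role": "user", "content": prompt}]
            )
            replies.append(resp["choices"][0]["message"]["content"])
        return replies

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=Llama2GGUFChat(),
        artifacts={"gguf": "Llama-2-7b-chat-hf/ggml-model-q8_0.gguf"},
    )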
Test the logged Chat model¶
The program chat.py included in the logmodel GitHub tree is useful for testing the logged model
python chat.py --model models:/llama2-gguf-chat/1
The output should look something like this:
> What is the capital of California?
llama_print_timings: load time = 3552.53 ms
llama_print_timings: sample time = 51.67 ms / 78 runs ( 0.66 ms per token, 1509.58 tokens per second)
llama_print_timings: prompt eval time = 3552.49 ms / 94 tokens ( 37.79 ms per token, 26.46 tokens per second)
llama_print_timings: eval time = 20680.57 ms / 77 runs ( 268.58 ms per token, 3.72 tokens per second)
llama_print_timings: total time = 24444.20 ms
assistant> Thank you for asking! The capital of California is Sacramento. It is located in the northern part of the state, along the Sacramento River. Sacramento has a rich history and culture, and it is home to many important government buildings and institutions, including the California State Capitol. I hope that helps! Let me know if you have any other questions.
> How far is it from San Francisco?
Llama.generate: prefix-match hit
llama_print_timings: load time = 3552.53 ms
llama_print_timings: sample time = 63.20 ms / 101 runs ( 0.63 ms per token, 1598.08 tokens per second)
llama_print_timings: prompt eval time = 3425.73 ms / 96 tokens ( 35.68 ms per token, 28.02 tokens per second)
llama_print_timings: eval time = 25718.24 ms / 100 runs ( 257.18 ms per token, 3.89 tokens per second)
llama_print_timings: total time = 29382.87 ms
assistant> Great question! Sacramento is located approximately 150 miles (241 kilometers) northeast of San Francisco. The drive from San Francisco to Sacramento typically takes about 2-3 hours, depending on traffic and the route you take. There are also public transportation options available, such as buses and trains, which can take a bit longer but offer a convenient alternative to driving. I hope that helps! Let me know if you have any other questions.
>
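Under the hood, a script like chat.py only needs to resolve the models:/ URI with MLflow and pass each prompt to the loaded model. A minimal sketch of that loop follows; the "prompt" input column is an assumption for illustration and may not match what the actual chat.py expects:

# Minimal sketch of a chat loop against the registered MLflow model
import mlflow
import pandas as pd

model = mlflow.pyfunc.load_model("models:/llama2-gguf-chat/1")

while True:
    prompt = input("> ")
    if not prompt:
        break
    reply = model.predict(pd.DataFrame({"prompt": [prompt]}))
    print("assistant>", reply[0])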
Log Model to MLflow for Embedding¶
Next, we use logmodel to log the Meta Llama2 model in llama.cpp gguf format and turn it into an MLflow model. This example is for the embedding generation task.
git clone https://github.com/jagane-infinstor/logmodel.git
(cd logmodel/llama2-gguf; python log.py --data_path ../../Llama-2-7b-chat-hf/ggml-model-q8_0.gguf --task embedding-generation)
The above command causes the llama.cpp gguf format model to be logged as an MLflow model for the embedding generation task. You can now go to the MLflow GUI and select the specific experiment/run. The model will be displayed in the artifacts pane for that run. Go ahead and register the model, for example, as llama2-gguf-embedding.
Test the logged Embeddings model¶
The program embeddings.py included in the logmodel GitHub tree is useful for testing the logged model
python embeddings.py --model models:/llama2-gguf-embedding/1
Now, when you type in a sentence at the > prompt, the program will print out the embeddings generated by the model.
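The embedding-generation task ultimately calls llama-cpp-python's embedding API. The following minimal sketch shows that underlying call directly against the converted gguf file; it is illustrative only and is not the code path of embeddings.py:

# Minimal sketch: generating embeddings directly with llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-2-7b-chat-hf/ggml-model-q8_0.gguf",
    embedding=True,  # enable embedding mode
)
resp = llm.create_embedding("What is the capital of California?")
vec = resp["data"][0]["embedding"]
print(len(vec), vec[:5])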
That's all folks!