I’ve been hearing more and more about running chat bots locally, but most of what people write is about how to tweak one. I’ve yet to see a place that explains how to install one from scratch, so here it is.

I’m going to use llama.cpp for the interface, and open_llama_3b for the model. Also, this tutorial covers Linux only.

Prerequisites

  • Basic knowledge of Linux and the command line

  • No powerful GPU needed (I’m using an onboard graphics chip for this myself)

  • Have Python 3 installed

    Check with

    python -V    # on some distributions the command is python3 -V
    
  • Have Git LFS installed

    Check with

    git lfs version
    

    If the command is not found, install the git-lfs package with your distribution’s package manager (for example, sudo apt install git-lfs on Debian/Ubuntu), then set up the Git hooks with

    git lfs install
    

Install

First, clone the llama.cpp repository

git clone https://github.com/ggerganov/llama.cpp

While that is running, also clone the open_llama_3b repository

# Clone the repository, skipping the large model files for now
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/openlm-research/open_llama_3b

# After that finishes, download the large files with this
cd open_llama_3b
git lfs pull
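
To confirm the weights were actually fetched (an optional check), you can list the LFS-tracked files; in the output, an asterisk next to a file means its content has been downloaded locally

# Show LFS-tracked files and whether their content is present locally
git lfs ls-files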

Building

Next, build the llama.cpp project

cd ../llama.cpp   # assuming both repositories were cloned side by side
make              # optionally add -j$(nproc) to build in parallel

This will generate executables in the project root directory (llama.cpp/), notably main and quantize
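
As an optional sanity check that the build worked, you can ask the main binary for its usage text

# Should print the list of supported flags
./main --help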

Converting the model

In order to use open_llama_3b with llama.cpp, you first have to convert it to the GGUF FP16 format

First, create a Python virtual environment (run this from inside llama.cpp/)

python -m venv .venv

Then activate it

source .venv/bin/activate
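
To verify the environment is active (optional), check which interpreter is on your PATH; it should resolve to a path inside llama.cpp/.venv

# Should print something like /path/to/llama.cpp/.venv/bin/python
which python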

Install dependencies

python -m pip install -r requirements.txt

Then convert the open_llama_3b model

python convert.py ../path/to/open_llama_3b
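
If the conversion succeeds, it should write ggml-model-f16.gguf next to the original weights; assuming the same placeholder path as above, you can confirm it exists with

ls -lh ../path/to/open_llama_3b/ggml-model-f16.gguf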

After that, move the converted file ggml-model-f16.gguf into llama.cpp/models/3B

mkdir -p ./models/3B
mv ../path/to/open_llama_3b/ggml-model-f16.gguf ./models/3B/

Quantization

Before running the model, quantize it. Quantizing to q4_0 shrinks the weights to roughly 4-bit precision, so the model needs far less memory and runs faster on a CPU

./quantize ./models/3B/ggml-model-f16.gguf ./models/3B/ggml-model-q4_0.gguf q4_0
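
Since q4_0 stores weights in roughly 4 bits instead of 16, the quantized file should come out at a fraction (around a quarter) of the FP16 size; you can compare the two with an optional check

# Compare the FP16 and q4_0 file sizes
ls -lh ./models/3B/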

Run the model

Finally, run the quantized model

# Run the model in interactive chat mode
./main -m ./models/3B/ggml-model-q4_0.gguf -n 128 --repeat-penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

Here -m selects the model file, -n caps how many tokens are generated per response, -i enables interactive mode, -r "User:" hands control back to you whenever the model emits that string, and -f seeds the conversation with an example prompt that ships with llama.cpp.
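
If you just want a one-off completion instead of an interactive chat, main also accepts an inline prompt via -p; for example (the prompt text here is only an illustration)

# Non-interactive run with an inline prompt
./main -m ./models/3B/ggml-model-q4_0.gguf -n 64 -p "Building a website can be done in 10 simple steps:"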

Notes

All credit goes to the authors of llama.cpp and open_llama_3b. Any mistakes or errors are solely mine.