
Objective
Run llama.cpp on a Windows PC with GPU acceleration.
Prerequisites
First, you have to install a ton of stuff if you don’t have it already:
- Git
- Python
- C++ compiler and toolchain. From the Visual Studio Downloads page, scroll down until you see Tools for Visual Studio under the All Downloads section and select the download for Build Tools for Visual Studio 2022.
- CMake
- OpenCL SDK. Install this at c:\vcpkg\packages\opencl_x64-windows (one way to do that with vcpkg is sketched right after this list).
- CLBlast. I used the 1.6.1 Windows x64 release, extracted to C:\Program Files\CLBlast-1.6.1-windows-x64 (the same path the build step below points at).
- clinfo.exe. Don't sweat it if you can't install this one; it's only used later to inspect your OpenCL platforms and devices.
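The vcpkg path above hints at how I installed the OpenCL SDK. A minimal sketch of that route, in case you don't already have vcpkg (cloning to c:\vcpkg is an assumption chosen to match the install path above):

git clone https://github.com/microsoft/vcpkg c:\vcpkg
cd c:\vcpkg
:: One-time bootstrap of the vcpkg tool itself
.\bootstrap-vcpkg.bat
:: Builds and installs the OpenCL headers and import library; files land under
:: c:\vcpkg\packages\opencl_x64-windows
.\vcpkg install opencl:x64-windows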
Add the lib, bin, and include folders of both OpenCL and CLBlast to the PATH environment variable. For reference, mine ends up looking like this:
echo %PATH%
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.38.33130\bin\HostX64\x64;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\VC\VCPackages;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\TestWindow;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Current\bin\Roslyn;C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\\x64;C:\Program Files (x86)\Windows Kits\10\bin\\x64;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\\MSBuild\Current\Bin\amd64;C:\Windows\Microsoft.NET\Framework64\v4.0.30319;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\Tools\;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\apache-maven-3.9.6\bin;C:\Program Files\jdk-21.0.2\bin;C:\Program Files\Git\cmd;C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\bin;C:\Program Files (x86)\GnuWin32\bin;C:\Program Files\HDF_Group\HDF5\1.14.1\bin\;C:\Program Files (x86)\WiX Toolset v3.11\bin;C:\Program Files\CMake\bin;C:\Users\xxx\AppData\Local\Microsoft\WindowsApps;C:\Users\xxx\.dotnet\tools;;C:\Users\xxx\AppData\Local\Programs\Microsoft VS Code\bin;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\CMake\Ninja;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\VC\Linux\bin\ConnectionManagerExe
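If you only want the extra entries for the current console session, something like this works (the folders shown are illustrative; point them at wherever you actually installed the OpenCL SDK and CLBlast):

:: Append OpenCL and CLBlast folders to PATH for this session only
set PATH=%PATH%;c:\vcpkg\packages\opencl_x64-windows\bin;C:\Program Files\CLBlast-1.6.1-windows-x64\bin;C:\Program Files\CLBlast-1.6.1-windows-x64\lib

For a permanent change, edit PATH under System Properties > Environment Variables instead.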
Download Source Code and Build
Download Source Code
git clone git@github.com:ggerganov/llama.cpp.git
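If you don't have SSH keys set up with GitHub, cloning over HTTPS works just as well:

git clone https://github.com/ggerganov/llama.cpp.git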
In my case I am synced to:
commit d2f650cb5b04ee2726663e79b47da5efe196ce00 (HEAD -> master, tag: b1999, origin/master, origin/HEAD)
Author: Paul Tsochantaris <ptsochantaris@icloud.com>
Date: Sun Jan 28 19:50:16 2024 +0000
metal : free metal objects (#5161)
* Releasing MTLFunction references after Metal pipeline construction
* Keeping the `ggml_metal_kernel` structure
* Spacing fix
* Whitespace fix
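If you want to build the exact same revision, you can check it out via its tag:

:: b1999 is the release tag pointing at commit d2f650cb (see the log above)
git checkout b1999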
Build Instructions
From the root of the repository:
set CL_BLAST_CMAKE_PKG="C:/Program Files/CLBlast-1.6.1-windows-x64/lib/cmake/CLBlast"
mkdir build
cd build
cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_CLBLAST=ON -DCMAKE_PREFIX_PATH=%CL_BLAST_CMAKE_PKG% -G "Visual Studio 17 2022" -A x64
cmake --build . --config Release
cmake --install . --prefix C:/LlamaCPP
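Before running anything, you can sanity-check that CMake actually picked up CLBlast. This is just my quick heuristic, not an official check: if the -DLLAMA_CLBLAST=ON configuration succeeded, the CMake cache should contain CLBlast entries.

:: Run from the build directory; should print cached variables such as CLBlast_DIR
findstr /i clblast CMakeCache.txt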
Verify
c:\Users\xxx\code\llama.cpp\build\bin\Release>ls -al
total 30856
drwxr-xr-x 1 xxx 197609 0 Jan 28 16:13 .
drwxr-xr-x 1 xxx 197609 0 Jan 28 15:45 ..
-rwxr-xr-x 1 xxx 197609 505344 Jan 28 15:45 baby-llama.exe
-rwxr-xr-x 1 xxx 197609 772096 Jan 28 15:45 batched-bench.exe
-rwxr-xr-x 1 xxx 197609 791552 Jan 28 15:45 batched.exe
-rwxr-xr-x 1 xxx 197609 794112 Jan 28 15:45 beam-search.exe
-rwxr-xr-x 1 xxx 197609 422912 Jan 28 15:45 benchmark.exe
-rwxr-xr-x 1 xxx 197609 391680 Jan 28 15:45 convert-llama2c-to-ggml.exe
-rwxr-xr-x 1 xxx 197609 912384 Jan 28 15:45 embedding.exe
-rwxr-xr-x 1 xxx 197609 386048 Jan 28 15:45 export-lora.exe
-rwxr-xr-x 1 xxx 197609 908800 Jan 28 15:46 finetune.exe
-rwxr-xr-x 1 xxx 197609 912896 Jan 28 15:46 imatrix.exe
-rwxr-xr-x 1 xxx 197609 1051648 Jan 28 15:46 infill.exe
-rwxr-xr-x 1 xxx 197609 870912 Jan 28 15:46 llama-bench.exe
-rwxr-xr-x 1 xxx 197609 1079808 Jan 28 15:46 llava-cli.exe
-rwxr-xr-x 1 xxx 197609 987136 Jan 28 15:46 lookahead.exe
-rwxr-xr-x 1 xxx 197609 986624 Jan 28 15:46 lookup.exe
-rwxr-xr-x 1 xxx 197609 1078784 Jan 28 15:46 main.exe
-rw-r--r-- 1 xxx 197609 211380 Jan 28 16:47 main.log
-rwxr-xr-x 1 xxx 197609 999936 Jan 28 15:46 parallel.exe
-rwxr-xr-x 1 xxx 197609 786432 Jan 28 15:46 passkey.exe
-rwxr-xr-x 1 xxx 197609 1023488 Jan 28 15:46 perplexity.exe
-rwxr-xr-x 1 xxx 197609 349696 Jan 28 15:46 q8dot.exe
-rwxr-xr-x 1 xxx 197609 793088 Jan 28 15:46 quantize-stats.exe
-rwxr-xr-x 1 xxx 197609 610304 Jan 28 15:46 quantize.exe
-rwxr-xr-x 1 xxx 197609 917504 Jan 28 15:46 save-load-state.exe
-rwxr-xr-x 1 xxx 197609 1623552 Jan 28 15:46 server.exe
-rwxr-xr-x 1 xxx 197609 774144 Jan 28 15:46 simple.exe
-rwxr-xr-x 1 xxx 197609 991232 Jan 28 15:46 speculative.exe
-rwxr-xr-x 1 xxx 197609 734720 Jan 28 15:46 test-autorelease.exe
-rwxr-xr-x 1 xxx 197609 569344 Jan 28 15:46 test-backend-ops.exe
-rwxr-xr-x 1 xxx 197609 9728 Jan 28 15:46 test-c.exe
-rwxr-xr-x 1 xxx 197609 411648 Jan 28 15:46 test-grad0.exe
-rwxr-xr-x 1 xxx 197609 43008 Jan 28 15:46 test-grammar-parser.exe
-rwxr-xr-x 1 xxx 197609 448000 Jan 28 15:47 test-llama-grammar.exe
-rwxr-xr-x 1 xxx 197609 615424 Jan 28 15:47 test-model-load-cancel.exe
-rwxr-xr-x 1 xxx 197609 349696 Jan 28 15:47 test-quantize-fns.exe
-rwxr-xr-x 1 xxx 197609 360960 Jan 28 15:47 test-quantize-perf.exe
-rwxr-xr-x 1 xxx 197609 352768 Jan 28 15:47 test-rope.exe
-rwxr-xr-x 1 xxx 197609 457216 Jan 28 15:47 test-sampling.exe
-rwxr-xr-x 1 xxx 197609 772096 Jan 28 15:47 test-tokenizer-0-falcon.exe
-rwxr-xr-x 1 xxx 197609 772608 Jan 28 15:47 test-tokenizer-0-llama.exe
-rwxr-xr-x 1 xxx 197609 792064 Jan 28 15:47 test-tokenizer-1-bpe.exe
-rwxr-xr-x 1 xxx 197609 790528 Jan 28 15:47 test-tokenizer-1-llama.exe
-rwxr-xr-x 1 xxx 197609 741376 Jan 28 15:47 tokenize.exe
-rwxr-xr-x 1 xxx 197609 882688 Jan 28 15:47 train-text-from-scratch.exe
-rwxr-xr-x 1 xxx 197609 353280 Jan 28 15:47 vdot.exe
Download Llama 2
Now we need the LLM itself. In my case I am using Llama 2. To get the model weights, I used the following steps from a clean directory:
- create a virtual environment
python -m venv venv
- activate the virtual environment
venv\Scripts\activate
- install huggingface-cli
pip install -U "huggingface_hub[cli]"
- Get the model
huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
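The file is about 4 GB, so the download can take a while. Once it finishes, confirm it landed in the current directory:

dir llama-2-7b-chat.Q4_K_M.gguf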
Run
Use clinfo to list the available OpenCL platforms and devices:
c:\Users\xxx\Downloads\clinfo.exe -l
Platform #0: Intel(R) OpenCL HD Graphics
`-- Device #0: Intel(R) Iris(R) Plus Graphics 655
Platform #1: OpenCLOn12
+-- Device #0: Intel(R) Iris(R) Plus Graphics 655
`-- Device #1: Microsoft Basic Render Driver
cd to llama.cpp\build\bin\Release and from there run (replace paths as necessary):
set GGML_OPENCL_PLATFORM=0
set GGML_OPENCL_DEVICE=0
main.exe -m c:\Users\xxx\code\models\llama-2-7b-chat.Q4_K_M.gguf -i --gpu-layers 43 -c 2048 -n 400 -r "User:" --in-prefix " " -e --file ..\..\..\prompts\chat-with-bob.txt --color
This is what I get:
Log start
main: build = 1999 (d2f650cb)
main: built with MSVC 19.38.33134.0 for x64
main: seed = 1706487793
ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) Iris(R) Plus Graphics 655'
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from c:\Users\xxx\code\models\llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: OpenCL buffer size = 3820.93 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU input buffer size = 12.01 MiB
llama_new_context_with_model: CPU compute buffer size = 167.20 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
Input prefix: ' '
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 2048, n_batch = 512, n_predict = 400, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:
Now start chatting!
User: who is sherlock holmes?
Bob: Sherlock Holmes is a fictional character created by Sir Arthur Conan Doyle. He is a consulting detective who is known for his intelligence, wit, and ability to solve complex crimes.
User: who is dr. watson?
Bob: Dr. John Watson is Sherlock Holmes' trusted friend and partner in crime-solving. He is a medical doctor and serves as the narrator of many of the stories featuring Holmes.
User: how are sherlock holmes and dr. watson related?
Bob: Sherlock Holmes and Dr. John Watson are fictional characters created by Sir Arthur Conan Doyle. They are the main protagonists of the Sherlock Holmes stories, with Holmes being the detective and Watson serving as his narrator and partner in crime-solving.
User: where does sherlock holmes live?
Bob: Sherlock Holmes is a fictional character, so he doesn't actually live anywhere. However, the stories often describe his London townhouse at 221B Baker Street as his residence.
User: how old is sherlock holmes?
Bob: Sherlock Holmes is a fictional character, so he doesn't have an actual age. However, in the stories, he is often described as being in his early 30s or late 20s.
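By the way, the build also produced server.exe, so if you'd rather talk to the model over HTTP than in a console, something along these lines should work (the flags mirror the ones used with main.exe; the port is just an example):

set GGML_OPENCL_PLATFORM=0
set GGML_OPENCL_DEVICE=0
:: Serves an HTTP API and a simple web UI on the given port
server.exe -m c:\Users\xxx\code\models\llama-2-7b-chat.Q4_K_M.gguf -ngl 43 -c 2048 --port 8080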
Let me know how it goes!