
Objective
Run llama.cpp on a Windows PC with GPU acceleration.
Prerequisites
First, you have to install a ton of stuff if you don’t have it already:
- Git
- Python
- C++ compiler and toolchain. From the Visual Studio Downloads page, scroll down until you see Tools for Visual Studio under the All Downloads section and select the download for Build Tools for Visual Studio 2022.
- CMake
- OpenCL SDK. Install this at c:\vcpkg\packages\opencl_x64-windows (one way to do that with vcpkg is sketched right after this list).
- CLBlast. I used the 1.6.1 Windows x64 release, extracted to C:\Program Files\CLBlast-1.6.1-windows-x64 (the same path the build step below points at).
- clinfo.exe. Don't sweat it if you can't install this one; it's only used later to inspect your OpenCL platforms and devices.
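The vcpkg path above hints at how I installed the OpenCL SDK. A minimal sketch of that route, in case you don't already have vcpkg (cloning to c:\vcpkg is an assumption chosen to match the install path above):

git clone https://github.com/microsoft/vcpkg c:\vcpkg
cd c:\vcpkg
:: One-time bootstrap of the vcpkg tool itself
.\bootstrap-vcpkg.bat
:: Builds and installs the OpenCL headers and import library; files land under
:: c:\vcpkg\packages\opencl_x64-windows
.\vcpkg install opencl:x64-windows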
Add the lib, bin, and include folders of both OpenCL and CLBlast to the PATH environment variable. For reference, mine ends up looking like this:
echo %PATH%
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.38.33130\bin\HostX64\x64;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\VC\VCPackages;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\TestWindow;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Current\bin\Roslyn;C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\\x64;C:\Program Files (x86)\Windows Kits\10\bin\\x64;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\\MSBuild\Current\Bin\amd64;C:\Windows\Microsoft.NET\Framework64\v4.0.30319;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\Tools\;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\apache-maven-3.9.6\bin;C:\Program Files\jdk-21.0.2\bin;C:\Program Files\Git\cmd;C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\bin;C:\Program Files (x86)\GnuWin32\bin;C:\Program Files\HDF_Group\HDF5\1.14.1\bin\;C:\Program Files (x86)\WiX Toolset v3.11\bin;C:\Program Files\CMake\bin;C:\Users\xxx\AppData\Local\Microsoft\WindowsApps;C:\Users\xxx\.dotnet\tools;;C:\Users\xxx\AppData\Local\Programs\Microsoft VS Code\bin;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\CommonExtensions\Microsoft\CMake\Ninja;C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\Common7\IDE\VC\Linux\bin\ConnectionManagerExe
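If you only want the extra entries for the current console session, something like this works (the folders shown are illustrative; point them at wherever you actually installed the OpenCL SDK and CLBlast):

:: Append OpenCL and CLBlast folders to PATH for this session only
set PATH=%PATH%;c:\vcpkg\packages\opencl_x64-windows\bin;C:\Program Files\CLBlast-1.6.1-windows-x64\bin;C:\Program Files\CLBlast-1.6.1-windows-x64\lib

For a permanent change, edit PATH under System Properties > Environment Variables instead.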
Download Source Code and Build
Download Source Code
git clone git@github.com:ggerganov/llama.cpp.git
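If you don't have SSH keys set up with GitHub, cloning over HTTPS works just as well:

git clone https://github.com/ggerganov/llama.cpp.git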
In my case I am synced to:
commit d2f650cb5b04ee2726663e79b47da5efe196ce00 (HEAD -> master, tag: b1999, origin/master, origin/HEAD)
Author: Paul Tsochantaris <ptsochantaris@icloud.com>
Date: Sun Jan 28 19:50:16 2024 +0000
metal : free metal objects (#5161)
* Releasing MTLFunction references after Metal pipeline construction
* Keeping the `ggml_metal_kernel` structure
* Spacing fix
* Whitespace fix
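If you want to build the exact same revision, you can check it out via its tag:

:: b1999 is the release tag pointing at commit d2f650cb (see the log above)
git checkout b1999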
Build Instructions
From the root of the repository:
set CL_BLAST_CMAKE_PKG="C:/Program Files/CLBlast-1.6.1-windows-x64/lib/cmake/CLBlast"
mkdir build
cd build
cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_CLBLAST=ON -DCMAKE_PREFIX_PATH=%CL_BLAST_CMAKE_PKG% -G "Visual Studio 17 2022" -A x64
cmake --build . --config Release
cmake --install . --prefix C:/LlamaCPP
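Before running anything, you can sanity-check that CMake actually picked up CLBlast. This is just my quick heuristic, not an official check: if the -DLLAMA_CLBLAST=ON configuration succeeded, the CMake cache should contain CLBlast entries.

:: Run from the build directory; should print cached variables such as CLBlast_DIR
findstr /i clblast CMakeCache.txt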
Verify
c:\Users\xxx\code\llama.cpp\build\bin\Release>ls -al
total 30856
drwxr-xr-x 1 xxx 197609 0 Jan 28 16:13 .
drwxr-xr-x 1 xxx 197609 0 Jan 28 15:45 ..
-rwxr-xr-x 1 xxx 197609 505344 Jan 28 15:45 baby-llama.exe
-rwxr-xr-x 1 xxx 197609 772096 Jan 28 15:45 batched-bench.exe
-rwxr-xr-x 1 xxx 197609 791552 Jan 28 15:45 batched.exe
-rwxr-xr-x 1 xxx 197609 794112 Jan 28 15:45 beam-search.exe
-rwxr-xr-x 1 xxx 197609 422912 Jan 28 15:45 benchmark.exe
-rwxr-xr-x 1 xxx 197609 391680 Jan 28 15:45 convert-llama2c-to-ggml.exe
-rwxr-xr-x 1 xxx 197609 912384 Jan 28 15:45 embedding.exe
-rwxr-xr-x 1 xxx 197609 386048 Jan 28 15:45 export-lora.exe
-rwxr-xr-x 1 xxx 197609 908800 Jan 28 15:46 finetune.exe
-rwxr-xr-x 1 xxx 197609 912896 Jan 28 15:46 imatrix.exe
-rwxr-xr-x 1 xxx 197609 1051648 Jan 28 15:46 infill.exe
-rwxr-xr-x 1 xxx 197609 870912 Jan 28 15:46 llama-bench.exe
-rwxr-xr-x 1 xxx 197609 1079808 Jan 28 15:46 llava-cli.exe
-rwxr-xr-x 1 xxx 197609 987136 Jan 28 15:46 lookahead.exe
-rwxr-xr-x 1 xxx 197609 986624 Jan 28 15:46 lookup.exe
-rwxr-xr-x 1 xxx 197609 1078784 Jan 28 15:46 main.exe
-rw-r--r-- 1 xxx 197609 211380 Jan 28 16:47 main.log
-rwxr-xr-x 1 xxx 197609 999936 Jan 28 15:46 parallel.exe
-rwxr-xr-x 1 xxx 197609 786432 Jan 28 15:46 passkey.exe
-rwxr-xr-x 1 xxx 197609 1023488 Jan 28 15:46 perplexity.exe
-rwxr-xr-x 1 xxx 197609 349696 Jan 28 15:46 q8dot.exe
-rwxr-xr-x 1 xxx 197609 793088 Jan 28 15:46 quantize-stats.exe
-rwxr-xr-x 1 xxx 197609 610304 Jan 28 15:46 quantize.exe
-rwxr-xr-x 1 xxx 197609 917504 Jan 28 15:46 save-load-state.exe
-rwxr-xr-x 1 xxx 197609 1623552 Jan 28 15:46 server.exe
-rwxr-xr-x 1 xxx 197609 774144 Jan 28 15:46 simple.exe
-rwxr-xr-x 1 xxx 197609 991232 Jan 28 15:46 speculative.exe
-rwxr-xr-x 1 xxx 197609 734720 Jan 28 15:46 test-autorelease.exe
-rwxr-xr-x 1 xxx 197609 569344 Jan 28 15:46 test-backend-ops.exe
-rwxr-xr-x 1 xxx 197609 9728 Jan 28 15:46 test-c.exe
-rwxr-xr-x 1 xxx 197609 411648 Jan 28 15:46 test-grad0.exe
-rwxr-xr-x 1 xxx 197609 43008 Jan 28 15:46 test-grammar-parser.exe
-rwxr-xr-x 1 xxx 197609 448000 Jan 28 15:47 test-llama-grammar.exe
-rwxr-xr-x 1 xxx 197609 615424 Jan 28 15:47 test-model-load-cancel.exe
-rwxr-xr-x 1 xxx 197609 349696 Jan 28 15:47 test-quantize-fns.exe
-rwxr-xr-x 1 xxx 197609 360960 Jan 28 15:47 test-quantize-perf.exe
-rwxr-xr-x 1 xxx 197609 352768 Jan 28 15:47 test-rope.exe
-rwxr-xr-x 1 xxx 197609 457216 Jan 28 15:47 test-sampling.exe
-rwxr-xr-x 1 xxx 197609 772096 Jan 28 15:47 test-tokenizer-0-falcon.exe
-rwxr-xr-x 1 xxx 197609 772608 Jan 28 15:47 test-tokenizer-0-llama.exe
-rwxr-xr-x 1 xxx 197609 792064 Jan 28 15:47 test-tokenizer-1-bpe.exe
-rwxr-xr-x 1 xxx 197609 790528 Jan 28 15:47 test-tokenizer-1-llama.exe
-rwxr-xr-x 1 xxx 197609 741376 Jan 28 15:47 tokenize.exe
-rwxr-xr-x 1 xxx 197609 882688 Jan 28 15:47 train-text-from-scratch.exe
-rwxr-xr-x 1 xxx 197609 353280 Jan 28 15:47 vdot.exe
Download Llama 2
Now we need the LLM itself. In my case I am using Llama 2. To get the model weights, I used the following steps from a clean directory:
- create a virtual environment
python -m venv venv
- activate the virtual environment
venv\Scripts\activate
- install huggingface-cli
pip install -U "huggingface_hub[cli]"
- Get the model
huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
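The file is about 4 GB, so the download can take a while. Once it finishes, confirm it landed in the current directory:

dir llama-2-7b-chat.Q4_K_M.gguf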
Run
Use clinfo to list the available OpenCL platforms and devices:
c:\Users\xxx\Downloads\clinfo.exe -l
Platform #0: Intel(R) OpenCL HD Graphics
`-- Device #0: Intel(R) Iris(R) Plus Graphics 655
Platform #1: OpenCLOn12
+-- Device #0: Intel(R) Iris(R) Plus Graphics 655
`-- Device #1: Microsoft Basic Render Driver
cd to llama.cpp\build\bin\Release and from there run (replace paths as necessary):
set GGML_OPENCL_PLATFORM=0
set GGML_OPENCL_DEVICE=0
main.exe -m c:\Users\xxx\code\models\llama-2-7b-chat.Q4_K_M.gguf -i --gpu-layers 43 -c 2048 -n 400 -r "User:" --in-prefix " " -e --file ..\..\..\prompts\chat-with-bob.txt --color
This is what I get:
Log start
main: build = 1999 (d2f650cb)
main: built with MSVC 19.38.33134.0 for x64
main: seed = 1706487793
ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) Iris(R) Plus Graphics 655'
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from c:\Users\xxx\code\models\llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70.31 MiB
llm_load_tensors: OpenCL buffer size = 3820.93 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU input buffer size = 12.01 MiB
llama_new_context_with_model: CPU compute buffer size = 167.20 MiB
llama_new_context_with_model: graph splits (measure): 1
system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
Input prefix: ' '
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 2048, n_batch = 512, n_predict = 400, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.
User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:
Now start chatting!
User: who is sherlock holmes?
Bob: Sherlock Holmes is a fictional character created by Sir Arthur Conan Doyle. He is a consulting detective who is known for his intelligence, wit, and ability to solve complex crimes.
User: who is dr. watson?
Bob: Dr. John Watson is Sherlock Holmes' trusted friend and partner in crime-solving. He is a medical doctor and serves as the narrator of many of the stories featuring Holmes.
User: how are sherlock holmes and dr. watson related?
Bob: Sherlock Holmes and Dr. John Watson are fictional characters created by Sir Arthur Conan Doyle. They are the main protagonists of the Sherlock Holmes stories, with Holmes being the detective and Watson serving as his narrator and partner in crime-solving.
User: where does sherlock holmes live?
Bob: Sherlock Holmes is a fictional character, so he doesn't actually live anywhere. However, the stories often describe his London townhouse at 221B Baker Street as his residence.
User: how old is sherlock holmes?
Bob: Sherlock Holmes is a fictional character, so he doesn't have an actual age. However, in the stories, he is often described as being in his early 30s or late 20s.
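By the way, the build also produced server.exe, so if you'd rather talk to the model over HTTP than in a console, something along these lines should work (the flags mirror the ones used with main.exe; the port is just an example):

set GGML_OPENCL_PLATFORM=0
set GGML_OPENCL_DEVICE=0
:: Serves an HTTP API and a simple web UI on the given port
server.exe -m c:\Users\xxx\code\models\llama-2-7b-chat.Q4_K_M.gguf -ngl 43 -c 2048 --port 8080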
Let me know how it goes!