Making notes for my understanding.
Summary
- llama.cpp can only be used to do inference. It cannot be used to do training [1].
- LLMs have this thing called maximum context size or window (measured in number of tokens). AFAIU, the entire conversation cannot exceed this size. In the case of llama.cpp you specify this size using the n_ctx parameter. However, note carefully that this is not something you can set to an arbitrary value. You should always set it equal to the context window of the model you are using [2]. For vicuna it is 2048 [3].
- If you try to go past the context size, this ugly code kicks in. Per my understanding, it creates a new session that keeps n_keep tokens from the start of the previous session and the last (n_ctx - n_keep) / 2 tokens of the previous session. The new session can go on for n_ctx - n_keep - (n_ctx - n_keep) / 2 tokens before being reset again. Since this is a hack and a poor man's way of keeping the conversation going, the conversation will start to lose coherence depending on how good the kept tokens are.
- The context size has a quadratic impact on model performance [4].
- When the program starts, it does a forward pass on the prompt given to it in the command line arguments.

To get the output above, I modified the source code and inserted printf statements like this:
print_embd(embd, ctx);
printf("\nDEBUG: running forward pass with n_eval = %d, n_past = %d\n", n_eval, n_past);
if (llama_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)) {
    fprintf(stderr, "%s : failed to eval\n", __func__);
    return 1;
}
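print_embd is a helper of this kind, not something that ships with llama.cpp. A minimal sketch of what such a helper could look like inside main.cpp (a guess, assuming the old C API's llama_token_to_str; the exact implementation doesn't matter for the discussion):

#include <cstdio>
#include <vector>
#include "llama.h"

// Hypothetical debug helper: dump the token ids in embd together with their text.
static void print_embd(const std::vector<llama_token> & embd, llama_context * ctx) {
    printf("DEBUG: embd = [");
    for (size_t i = 0; i < embd.size(); ++i) {
        printf("%d ('%s')", embd[i], llama_token_to_str(ctx, embd[i]));
        if (i + 1 < embd.size()) {
            printf(", ");
        }
    }
    printf("]\n");
}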
llama_eval is doing the forward pass on the neural network. n_past tracks our position in the context window. It is equal to the length of the conversation so far. After this there is some code in the program that will sample the model outputs and output tokens one by one (this gives the animation in the console where you can see words printed one by one). Here is how it works:
- model generates a token
- this token is printed to console and n_past is incremented by 1
- now this token is fed back to llama_eval to generate the next token. This is the auto-regressive nature of the model. You can see this in action over here. Note at this step embd contains only 1 token in it (the token from step 2) when llama_eval is called.
The loop above will keep repeating ad infinitum until the model generates an antiprompt token (it looks like more than one antiprompt can be given to the program), at which point it will exit the loop and hand control over to the user to enter some input. Then the process repeats all over. embd will now contain the user's tokenized input.
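To make the loop concrete, here is a small self-contained toy program. It does not use the real llama.cpp API: fake_next_token is a made-up stand-in for llama_eval plus sampling, and the token ids are invented. It only mimics the structure: evaluate the prompt, then generate one token at a time, bump n_past after each step, and stop when an antiprompt token shows up.

#include <cstdio>
#include <vector>

// Toy stand-in for the model: derives the "next" token id from the previous one.
// In the real program this is llama_eval() followed by sampling the logits.
static int fake_next_token(int last_token) {
    return (last_token * 7 + 3) % 100; // arbitrary deterministic rule
}

int main() {
    const int n_ctx      = 16;            // pretend context window
    const int antiprompt = 42;            // pretend "User:" token
    std::vector<int> embd = {1, 5, 97};   // pretend tokenized prompt

    int n_past = 0;

    // 1) forward pass over the whole prompt
    n_past += (int) embd.size();

    // 2) auto-regressive loop: generate a token, print it, feed it back
    int last = embd.back();
    while (n_past < n_ctx) {
        int tok = fake_next_token(last);  // model generates a token
        printf("generated token %d (n_past = %d)\n", tok, n_past);
        n_past += 1;                      // one more slot of the context window is used

        if (tok == antiprompt) {          // hand control back to the user
            printf("antiprompt seen, waiting for user input...\n");
            break;
        }
        last = tok;                       // embd now effectively holds just this one token
    }
    return 0;
}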
So in a nutshell this is how it works AFAIU. Important variables and what they do:
- embd_inp: stores tokenized user input.
- embd: stores the token(s) (token ids, really) on which the model runs the forward pass of the NN. It is equal to the user input just before it is the model's turn to generate text. While the model is generating text, it holds one token at a time (length of embd = 1 during this phase).
- last_n_tokens: a ring buffer whose size equals the context length or window. It starts out filled with zeros. Items are inserted at the back of the buffer and each insert is accompanied by a corresponding pop where the first item in the buffer is discarded (thrown out).
- is_interacting: true while the model is waiting for the user to input text, false while the model is generating text.
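If I read the code right, last_n_tokens is just a plain std::vector treated as a ring buffer with an erase-front/push-back pattern. A minimal sketch of that behaviour (toy token ids, not the real program):

#include <cstdio>
#include <vector>

int main() {
    const int n_ctx = 4;                            // pretend context window
    std::vector<int> last_n_tokens(n_ctx, 0);       // starts out as all zeros

    // pretend these token ids arrive one by one (prompt, user input, model output)
    for (int id : {11, 22, 33, 44, 55}) {
        last_n_tokens.erase(last_n_tokens.begin()); // pop the oldest entry at the front
        last_n_tokens.push_back(id);                // push the newest entry at the back

        printf("last_n_tokens = [ ");
        for (int t : last_n_tokens) {
            printf("%d ", t);
        }
        printf("]\n");
    }
    return 0;
}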
A closer look at the llama_eval function and resetting the context
The llama_eval function is called like this in the code:
llama_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)
It makes a call to:
// evaluate the transformer
//
// - lctx: llama context
// - tokens: new batch of tokens to process
// - n_past: the context size so far
// - n_threads: number of threads to use
// - cgraph_fname: filename of the exported computation graph
//
static bool llama_eval_internal(
This is a long function and I did not study it in detail. I am also not a C++ programmer, so it's difficult for me to understand C++ code. But anyway, this is what I think llama_eval does:
- take n_eval tokens from embd starting at position i.
- run the forward pass (aka inference) on the data in step 1. To run the forward pass, the model also uses its context, which is the history of the entire conversation. This history is presumably stored in an internal buffer (call it buf) which is of size n_ctx. n_past is used to tell the model to take buf[0:n_past-1] of the buffer as the context. The ctx object is thus stateful.
- before the function returns, the model also adds all the tokens in step 1 to its context – the buf – starting at position n_past.
The code in main.cpp subsequently increments n_past right after the function is called:
n_past += n_eval;
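Here is a toy model of that mental picture. It is not the real KV cache or llama.cpp internals, just the bookkeeping described above: eval() pretends to attend to buf[0:n_past-1], stores the new batch at position n_past, and the caller bumps n_past afterwards, just like main.cpp does.

#include <cstdio>
#include <vector>

// Toy version of the stateful context described above (not the real KV cache).
struct ToyCtx {
    std::vector<int> buf;                     // stands in for the internal context buffer
    explicit ToyCtx(int n_ctx) : buf(n_ctx, -1) {}

    // "Attend" to buf[0 .. n_past-1], then append the new tokens at position n_past.
    void eval(const std::vector<int> & tokens, int n_past) {
        printf("eval: attending to %d past tokens, appending %zu new ones at %d\n",
               n_past, tokens.size(), n_past);
        for (size_t i = 0; i < tokens.size(); ++i) {
            buf[n_past + (int) i] = tokens[i];
        }
    }
};

int main() {
    ToyCtx ctx(8);
    int n_past = 0;

    std::vector<int> prompt = {7, 8, 9};
    ctx.eval(prompt, n_past);                 // forward pass over the prompt
    n_past += (int) prompt.size();            // main.cpp does: n_past += n_eval;

    ctx.eval({42}, n_past);                   // one generated token fed back in
    n_past += 1;
    return 0;
}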
Now let’s take a look at the code that resets the context:
// infinite text generation via context swapping
// if we run out of context:
// - take the n_keep first tokens from the original prompt (via n_past)
// - take half of the last (n_ctx - n_keep) tokens and recompute the logits in batches
if (n_past + (int) embd.size() > n_ctx) {
    const int n_left = n_past - params.n_keep;
    // always keep the first token - BOS
    n_past = std::max(1, params.n_keep);
    // insert n_left/2 tokens at the start of embd from last_n_tokens
    embd.insert(embd.begin(), last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(), last_n_tokens.end() - embd.size());
By resetting n_past to max(1, params.n_keep), we mark the portion of the internal buffer buf from n_past to n_ctx - 1 as ineligible. We are effectively deleting the model's accumulated context (i.e., our conversation history) from n_past to n_ctx - 1. After that, we take the last n_left/2 tokens of the conversation, add them to embd, and let the model run from there. So it's as if a brand new conversation was started with these initial conditions.
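To sanity-check the arithmetic, here is a small standalone program that plugs in hypothetical numbers (n_ctx = 2048, n_keep = 0, a context that is already full) and reproduces the calculation the snippet above performs:

#include <algorithm>
#include <cstdio>

int main() {
    // Hypothetical numbers: a 2048-token window, n_keep = 0 (the default used
    // in the experiment below), and a context that is already full.
    const int n_ctx   = 2048;
    const int n_keep  = 0;
    int       n_past  = 2048;   // the window is full
    const int embd_sz = 1;      // one new token is waiting to be evaluated

    if (n_past + embd_sz > n_ctx) {
        const int n_left = n_past - n_keep;  // 2048 tokens are up for grabs
        n_past = std::max(1, n_keep);        // keep BOS (plus the first n_keep prompt tokens)
        const int reused = n_left / 2;       // the last 1024 tokens get re-inserted into embd

        printf("n_left = %d, new n_past = %d, tokens re-fed to the model = %d\n",
               n_left, n_past, reused);
        printf("roughly %d tokens of room before the next swap\n",
               n_ctx - n_past - reused);
    }
    return 0;
}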
To test this hypothesis, I did the following experiment. Below is a screenshot from a conversation when the context gets reset (with n_keep=0, which is the default setting):

Compare it to the result when I run a new instance of the program and initialize it with the text from the point where the model got reset in the previous run:

The two outputs are not quite the same. I don't know if it's because of the temperature or something else (maybe my understanding of what happens internally when the context is reset is not fully correct).
What happens if you run the model with a context size that is greater than the context size on which the model was trained?

How I got it running under a debugger
The only way to understand something is to run and step through the code in a debugger. Below are the steps to debug. Also see this:
Create a debug build (in the build/debug directory) by running the steps below in that dir:
cmake -DLLAMA_METAL=ON -DCMAKE_BUILD_TYPE=Debug ../..
cmake --build . --config Debug
-DCMAKE_BUILD_TYPE=Debug is the crucial setting. Without it, it did not work [1].
I got the tip from here.
After that you can cd to the directory of the main executable and from there run:
╰─⠠⠵ lldb main
(lldb) target create "main"
Current executable set to '/llm/llama.cpp/build/debug/bin/main' (x86_64).
(lldb) b /llm/llama.cpp/examples/main/main.cpp:54
Breakpoint 1: where = main`main + 57 at main.cpp:54:26, address = 0x0000000100005799
(lldb) run -m /llm/llama.cpp/models/gpt4-x-vicuna-13B.ggmlv3.q5_1.bin --threads 6 -c 2048 --repeat_penalty 1.0 --color -i -r "User:" -f ../../chat-with-bob.txt --in-prefix " " --verbose-prompt
Screenshots:

Search for "lldb cheat sheet" for a cheat sheet. Basic lldb commands:
- b: set breakpoint
- c: continue
- p: print
- exit: exit
- run: run the program