In an earlier article we covered the internals of llama.cpp. What if you want to use this library from a Java application? Luckily you can, thanks to java-llama.cpp. In this article we cover some of the internals of java-llama.cpp so you can understand how it works.
You can use java-llama.cpp by adding the following to your pom.xml:
<dependency>
    <groupId>de.kherud</groupId>
    <artifactId>llama</artifactId>
    <version>2.2.1</version>
</dependency>
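With the dependency in place, using the library looks roughly like the sketch below. The class names come from the jar contents shown later in this article; the exact constructor and method signatures are assumptions and may differ between versions:

import de.kherud.llama.LlamaModel;
import de.kherud.llama.ModelParameters;

public class Example {

    public static void main(String[] args) throws Exception {
        // Assumption: a builder-style ModelParameters and a LlamaModel constructor
        // taking a GGUF model path; check the version you use for exact signatures.
        ModelParameters params = new ModelParameters.Builder().build();
        try (LlamaModel model = new LlamaModel("/path/to/model.gguf", params)) {
            // generate() streams output pieces as the model produces them.
            for (LlamaModel.Output output : model.generate("Tell me a joke.")) {
                System.out.print(output);
            }
        }
    }
}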
How java-llama.cpp works
If you unpack the jar file that comes with the Maven artifact, you will see the following files [1]:
> tar -xvf ~/.m2/repository/de/kherud/llama/2.2.1/llama-2.2.1.jar
x META-INF/
x META-INF/MANIFEST.MF
x de/
x de/kherud/
x de/kherud/llama/
x de/kherud/llama/Windows/
x de/kherud/llama/Windows/x86_64/
x de/kherud/llama/Linux/
x de/kherud/llama/Linux/aarch64/
x de/kherud/llama/Linux/x86_64/
x de/kherud/llama/Mac/
x de/kherud/llama/Mac/aarch64/
x de/kherud/llama/Mac/x86_64/
x de/kherud/llama/LlamaModel$Output.class
x de/kherud/llama/OSInfo.class
x de/kherud/llama/Windows/x86_64/jllama.dll
x de/kherud/llama/Windows/x86_64/llama.dll
x de/kherud/llama/LlamaModel$1.class
x de/kherud/llama/LlamaModel$LlamaIterator.class
x de/kherud/llama/InferenceParameters$1.class
x de/kherud/llama/InferenceParameters$MiroStat.class
x de/kherud/llama/LogLevel.class
x de/kherud/llama/ProcessRunner.class
x de/kherud/llama/ModelParameters$Builder.class
x de/kherud/llama/Linux/aarch64/libllama.so
x de/kherud/llama/Linux/x86_64/libllama.so
x de/kherud/llama/Linux/x86_64/libjllama.so
x de/kherud/llama/InferenceParameters.class
x de/kherud/llama/LlamaLoader.class
x de/kherud/llama/LlamaModel.class
x de/kherud/llama/InferenceParameters$Builder.class
x de/kherud/llama/ModelParameters$1.class
x de/kherud/llama/ModelParameters.class
x de/kherud/llama/LlamaException.class
x de/kherud/llama/Mac/aarch64/libllama.dylib
x de/kherud/llama/Mac/aarch64/libjllama.dylib
x de/kherud/llama/Mac/aarch64/ggml-metal.metal
x de/kherud/llama/Mac/x86_64/libllama.dylib
x de/kherud/llama/Mac/x86_64/libjllama.dylib
x de/kherud/llama/Mac/x86_64/ggml-metal.metal
x META-INF/maven/
x META-INF/maven/de.kherud/
x META-INF/maven/de.kherud/llama/
x META-INF/maven/de.kherud/llama/pom.xml
x META-INF/maven/de.kherud/llama/pom.properties
We can see that two key pre-built dlls ship with the library (I will use the term dll, short for dynamic link library, irrespective of the platform: Windows, Mac or Linux). For example, for MacOS they are:
- libllama.dylib: This dll is built by compiling llama.cpp source code. The source code has to be compiled differently for different platforms, hence you have separate dlls for each platform. This is the dll which does the heavy lifting.
- libjllama.dylib: This dll is built by compiling the C++ code in src/main/cpp.
The purpose of the code in src/main/cpp is to provide a JNI wrapper, or shim,
through which the C++ code in llama.cpp (in libllama.dylib, to be more accurate) can be called from Java. That, in a nutshell, is how the library works.
How are the dlls built?
Both dlls are built by this file, and you have to run cmake to build them since we are compiling C++
code, not Java. The dlls are saved under
${CMAKE_SOURCE_DIR}/src/main/resources/de/kherud/llama/${OS_NAME}/${OS_ARCH}.
Most of the code in this file has to do with building
src/main/cpp. So where is the code that builds llama.cpp?
It is this line:
include(build-args.cmake)
which ends up including this file.
Here you can verify that by default
set(LLAMA_METAL_DEFAULT ON)
so the pre-built dll for Mac has GPU acceleration built into it, but the one for Windows does not.
Note that when you are using java-llama.cpp you are using a version of llama.cpp built using a CMake file that is different from the original.
The original CMake file can be found here. You can compare the two for differences
and this will come in handy when debugging any issues caused by differences in behavior between the official llama.cpp and java-llama.cpp.
How are the dlls loaded?
The dlls are loaded at runtime by this code:
static synchronized void initialize() throws UnsatisfiedLinkError {
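The loader has to pick the right dll for the current OS and architecture from the classpath and hand it to the JVM. Below is a simplified sketch of the general technique such loaders use; it is not the library's exact code, and the real LlamaLoader is more involved (among other things it honors the de.kherud.llama.lib.path system property described later in this article and makes sure the llama dll that jllama depends on can be found).

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Simplified sketch of the general technique, not java-llama.cpp's exact code.
final class NativeLoaderSketch {

    static void loadFromClasspath(String resourcePath) throws Exception {
        // A native library cannot be loaded directly from inside a jar, so the dll is
        // first copied out of the classpath resource into a temporary file on disk...
        String suffix = resourcePath.substring(resourcePath.lastIndexOf('.'));
        Path tmp = Files.createTempFile("native", suffix);
        try (InputStream in = NativeLoaderSketch.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new UnsatisfiedLinkError("No native library bundled at " + resourcePath);
            }
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        // ...and then handed to the JVM by its absolute path.
        System.load(tmp.toAbsolutePath().toString());
    }

    public static void main(String[] args) throws Exception {
        // The resource path is chosen from os.name and os.arch, matching the folder
        // layout we saw inside the jar, e.g. de/kherud/llama/Mac/aarch64/libjllama.dylib.
        String os = System.getProperty("os.name").toLowerCase().contains("mac") ? "Mac" : "Linux";
        String arch = System.getProperty("os.arch").contains("aarch64") ? "aarch64" : "x86_64";
        String lib = os.equals("Mac") ? "libjllama.dylib" : "libjllama.so";
        loadFromClasspath("/de/kherud/llama/" + os + "/" + arch + "/" + lib);
    }
}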
There are native methods declared here:
private native void loadModel(String filePath, ModelParameters parameters) throws LlamaException;
The native keyword is used to declare a method that is implemented in platform-dependent, non-Java code, typically written in another programming language such as C or C++.
Minimal example here.
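To make the idea concrete, here is a self-contained sketch; the HelloNative class and the hello library are made up for illustration and are not part of java-llama.cpp:

public class HelloNative {

    static {
        // Loads libhello.so / libhello.dylib / hello.dll from java.library.path.
        System.loadLibrary("hello");
    }

    // No Java body: the JVM resolves this against a C/C++ function exported by the
    // loaded library, named Java_HelloNative_add by the JNI naming convention.
    public static native int add(int a, int b);

    public static void main(String[] args) {
        System.out.println(add(2, 3));
    }
}

On the C++ side, the implementation would export a function named Java_HelloNative_add following the JNI naming convention; this is exactly the role libjllama plays for the native methods declared in LlamaModel.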
Getting GPU acceleration
By default this library will not provide GPU acceleration [1]:
We support CPU inference for the following platforms out of the box
If none of the above listed platforms matches yours, currently you have to compile the library yourself (also if you want GPU acceleration, see below)
To get GPU acceleration, you have to build two custom dlls:
- Linux: libllama.so, libjllama.so
- MacOS: libllama.dylib, libjllama.dylib, ggml-metal.metal
- Windows: llama.dll, jllama.dll
[lib]llama.dll|dylib|so is the dll corresponding to llama.cpp, which does the actual heavy work.
[lib]jllama.dll|dylib|so is a wrapper that provides interop between Java and C++ code.
You also need to set the de.kherud.llama.lib.path system property to the directory containing these dlls when you run your Java application, for example: -Dde.kherud.llama.lib.path=/directory/containing/lib.
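If you prefer to set the property from code rather than on the command line, something like the sketch below should work, assuming the property is set before the first de.kherud.llama class is loaded (which is when the loader looks for the dlls):

public class Main {

    public static void main(String[] args) throws Exception {
        // Assumption: this must run before any de.kherud.llama class is touched,
        // because the native libraries are located when those classes load.
        System.setProperty("de.kherud.llama.lib.path", "/directory/containing/lib");

        // From here on it should be safe to create a de.kherud.llama.LlamaModel, etc.
    }
}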
In addition, the system should be able to load the OpenCL and CLBlast dlls at runtime (we use OpenCL and CLBlast on Windows), so the paths to those dlls need to be added to the PATH environment variable [1].
Common Steps for Mac and Windows
- Clone java-llama.cpp.
- Check out a tag so you are compiling against a well-known release:
git checkout v2.2.1
- Download the llama.cpp source code:
git submodule update --init --recursive
In later versions of java-llama.cpp (e.g., 2.3.4) you do not have to run the above command. Instead, the CMake file has been modified to do it for you:
FetchContent_Declare(
    llama.cpp
    GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
    GIT_TAG b1645
)
FetchContent_MakeAvailable(llama.cpp)
Refer to this for how it works.
Below we cover individual steps for Mac and Windows.
Windows
- Build. When in doubt, refer to the build instructions of llama.cpp [1]. Below we are building against OpenCL to get GPU acceleration on an Intel GPU.
set CL_BLAST_CMAKE_PKG="C:/Program Files/CLBlast-1.6.1-windows-x64/lib/cmake/CLBlast"
mkdir build
cd build
cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_CLBLAST=ON -DCMAKE_PREFIX_PATH=%CL_BLAST_CMAKE_PKG% -G "Visual Studio 17 2022" -A x64
cmake --build . --config Release
The commands above will change for Mac, where you will perform a Metal build instead; that is the only difference between Windows and Mac. Refer to the section on Mac below.
Verify:
jllama.vcxproj -> C:\Users\siddj\code\java-llama.cpp\src\main\resources\de\kherud\llama\Windows\x86_64\jllama.dll
c:\Users\siddj\code\java-llama.cpp\build>ls ..\src\main\resources\de\kherud\llama\Windows\x86_64
jllama.dll llama.dll
Mac
As mentioned before, GPU acceleration is enabled by default for Mac, so you don't need to do anything, but we cover the steps for completeness.
In the latest code:
On MacOS, Metal is enabled by default.
debug build:
cmake -DLLAMA_METAL=ON -DCMAKE_BUILD_TYPE=Debug ../..
cmake --build . --config Debug
release build:
cmake -DLLAMA_METAL=ON ../..
cmake --build . --config Release