In an earlier article we covered the internals of llama.cpp. What if you want to use this library from a Java application? Luckily you can, thanks to java-llama.cpp. In this article we cover some of the internals of java-llama.cpp so you can understand how it works.
You can use java-llama.cpp by adding the following to your pom.xml:
<dependency>
    <groupId>de.kherud</groupId>
    <artifactId>llama</artifactId>
    <version>2.2.1</version>
</dependency>
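With the dependency in place, using the library looks roughly like the sketch below. The class names come from the jar contents shown later in this article; the exact constructor and method signatures are assumptions and may differ between versions:

import de.kherud.llama.LlamaModel;
import de.kherud.llama.ModelParameters;

public class Example {

    public static void main(String[] args) throws Exception {
        // Assumption: a builder-style ModelParameters and a LlamaModel constructor
        // taking a GGUF model path; check the version you use for exact signatures.
        ModelParameters params = new ModelParameters.Builder().build();
        try (LlamaModel model = new LlamaModel("/path/to/model.gguf", params)) {
            // generate() streams output pieces as the model produces them.
            for (LlamaModel.Output output : model.generate("Tell me a joke.")) {
                System.out.print(output);
            }
        }
    }
}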
How java-llama.cpp works
If you unpack the jar file that comes with the Maven artifact, you will see the following files [1]:
> tar -xvf ~/.m2/repository/de/kherud/llama/2.2.1/llama-2.2.1.jar
x META-INF/
x META-INF/MANIFEST.MF
x de/
x de/kherud/
x de/kherud/llama/
x de/kherud/llama/Windows/
x de/kherud/llama/Windows/x86_64/
x de/kherud/llama/Linux/
x de/kherud/llama/Linux/aarch64/
x de/kherud/llama/Linux/x86_64/
x de/kherud/llama/Mac/
x de/kherud/llama/Mac/aarch64/
x de/kherud/llama/Mac/x86_64/
x de/kherud/llama/LlamaModel$Output.class
x de/kherud/llama/OSInfo.class
x de/kherud/llama/Windows/x86_64/jllama.dll
x de/kherud/llama/Windows/x86_64/llama.dll
x de/kherud/llama/LlamaModel$1.class
x de/kherud/llama/LlamaModel$LlamaIterator.class
x de/kherud/llama/InferenceParameters$1.class
x de/kherud/llama/InferenceParameters$MiroStat.class
x de/kherud/llama/LogLevel.class
x de/kherud/llama/ProcessRunner.class
x de/kherud/llama/ModelParameters$Builder.class
x de/kherud/llama/Linux/aarch64/libllama.so
x de/kherud/llama/Linux/x86_64/libllama.so
x de/kherud/llama/Linux/x86_64/libjllama.so
x de/kherud/llama/InferenceParameters.class
x de/kherud/llama/LlamaLoader.class
x de/kherud/llama/LlamaModel.class
x de/kherud/llama/InferenceParameters$Builder.class
x de/kherud/llama/ModelParameters$1.class
x de/kherud/llama/ModelParameters.class
x de/kherud/llama/LlamaException.class
x de/kherud/llama/Mac/aarch64/libllama.dylib
x de/kherud/llama/Mac/aarch64/libjllama.dylib
x de/kherud/llama/Mac/aarch64/ggml-metal.metal
x de/kherud/llama/Mac/x86_64/libllama.dylib
x de/kherud/llama/Mac/x86_64/libjllama.dylib
x de/kherud/llama/Mac/x86_64/ggml-metal.metal
x META-INF/maven/
x META-INF/maven/de.kherud/
x META-INF/maven/de.kherud/llama/
x META-INF/maven/de.kherud/llama/pom.xml
x META-INF/maven/de.kherud/llama/pom.properties
We can see that two key pre-built dlls ship with the library (I will use the term dll, short for dynamic link library, irrespective of the platform: Windows, Mac or Linux). For example, for MacOS they are:
- libllama.dylib: This dll is built by compiling llama.cpp source code. The source code has to be compiled differently for different platforms, hence you have separate dlls for each platform. This is the dll which does the heavy lifting.
- libjllama.dylib: This dll is built by compiling the C++ code in src/main/cpp.
The purpose of the code in src/main/cpp is to provide a JNI wrapper, or shim,
through which the C++ code in llama.cpp (in libllama.dylib, to be more accurate) can be called from Java. That, in a nutshell, is how the library works.
How are the dlls built?
Both dlls are built by this file, and you have to run cmake to build them since we are compiling C++
code, not Java. The dlls are saved under
${CMAKE_SOURCE_DIR}/src/main/resources/de/kherud/llama/${OS_NAME}/${OS_ARCH}.
Most of the code in this file has to do with building
src/main/cpp. So where is the code that builds llama.cpp?
It is this line:
include(build-args.cmake)
which ends up including this file.
Here you can verify that by default
set(LLAMA_METAL_DEFAULT ON)
so the pre-built dll for Mac has GPU acceleration built into it, but the one for Windows does not.
Note that when you are using java-llama.cpp you are using a version of llama.cpp built using a CMake file that is different from the original.
The original CMake file can be found here. You can compare the two for differences
and this will come in handy when debugging any issues caused by differences in behavior between the official llama.cpp and java-llama.cpp.
How are the dlls loaded?
The dlls are loaded at runtime by this code:
static synchronized void initialize() throws UnsatisfiedLinkError {
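The loader has to pick the right dll for the current OS and architecture from the classpath and hand it to the JVM. Below is a simplified sketch of the general technique such loaders use; it is not the library's exact code, and the real LlamaLoader is more involved (among other things it honors the de.kherud.llama.lib.path system property described later in this article and makes sure the llama dll that jllama depends on can be found).

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Simplified sketch of the general technique, not java-llama.cpp's exact code.
final class NativeLoaderSketch {

    static void loadFromClasspath(String resourcePath) throws Exception {
        // A native library cannot be loaded directly from inside a jar, so the dll is
        // first copied out of the classpath resource into a temporary file on disk...
        String suffix = resourcePath.substring(resourcePath.lastIndexOf('.'));
        Path tmp = Files.createTempFile("native", suffix);
        try (InputStream in = NativeLoaderSketch.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new UnsatisfiedLinkError("No native library bundled at " + resourcePath);
            }
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        // ...and then handed to the JVM by its absolute path.
        System.load(tmp.toAbsolutePath().toString());
    }

    public static void main(String[] args) throws Exception {
        // The resource path is chosen from os.name and os.arch, matching the folder
        // layout we saw inside the jar, e.g. de/kherud/llama/Mac/aarch64/libjllama.dylib.
        String os = System.getProperty("os.name").toLowerCase().contains("mac") ? "Mac" : "Linux";
        String arch = System.getProperty("os.arch").contains("aarch64") ? "aarch64" : "x86_64";
        String lib = os.equals("Mac") ? "libjllama.dylib" : "libjllama.so";
        loadFromClasspath("/de/kherud/llama/" + os + "/" + arch + "/" + lib);
    }
}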
There are native methods declared here:
private native void loadModel(String filePath, ModelParameters parameters) throws LlamaException;
The native keyword is used to declare a method that is implemented in platform-dependent, non-Java code, typically written in another programming language such as C or C++.
Minimal example here.
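To make the idea concrete, here is a self-contained sketch; the HelloNative class and the hello library are made up for illustration and are not part of java-llama.cpp:

public class HelloNative {

    static {
        // Loads libhello.so / libhello.dylib / hello.dll from java.library.path.
        System.loadLibrary("hello");
    }

    // No Java body: the JVM resolves this against a C/C++ function exported by the
    // loaded library, named Java_HelloNative_add by the JNI naming convention.
    public static native int add(int a, int b);

    public static void main(String[] args) {
        System.out.println(add(2, 3));
    }
}

On the C++ side, the implementation would export a function named Java_HelloNative_add following the JNI naming convention; this is exactly the role libjllama plays for the native methods declared in LlamaModel.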
Getting GPU acceleration
By default this library will not provide GPU acceleration [1]:
We support CPU inference for the following platforms out of the box
If none of the above listed platforms matches yours, currently you have to compile the library yourself (also if you want GPU acceleration, see below)
To get GPU acceleration, you have to build two custom dlls:
- Linux: libllama.so, libjllama.so
- MacOS: libllama.dylib, libjllama.dylib, ggml-metal.metal
- Windows: llama.dll, jllama.dll
[lib]llama.dll|dylib|so is the dll corresponding to llama.cpp, which does the actual heavy work.
[lib]jllama.dll|dylib|so is a wrapper that provides interop between Java and C++ code.
You also need to set the de.kherud.llama.lib.path system property to the directory containing these dlls when you run your Java application, for example: -Dde.kherud.llama.lib.path=/directory/containing/lib.
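If you prefer to set the property from code rather than on the command line, something like the sketch below should work, assuming the property is set before the first de.kherud.llama class is loaded (which is when the loader looks for the dlls):

public class Main {

    public static void main(String[] args) throws Exception {
        // Assumption: this must run before any de.kherud.llama class is touched,
        // because the native libraries are located when those classes load.
        System.setProperty("de.kherud.llama.lib.path", "/directory/containing/lib");

        // From here on it should be safe to create a de.kherud.llama.LlamaModel, etc.
    }
}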
In addition, the system should be able to load the OpenCL and CLBlast dlls at runtime (we use OpenCL and CLBlast on Windows), so the paths to those dlls need to be added to the PATH environment variable [1].
Common Steps for Mac and Windows
- Clone java-llama.cpp.
- Check out a tag so you are compiling against a well-known release:
git checkout v2.2.1
- Download the llama.cpp source code:
git submodule update --init --recursive
In later versions of java-llama.cpp (e.g., 2.3.4) you do not have to run the above command. Instead, the CMake file has been modified to do it for you:
FetchContent_Declare(
    llama.cpp
    GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
    GIT_TAG b1645
)
FetchContent_MakeAvailable(llama.cpp)
Refer to this for how it works.
Below we cover individual steps for Mac and Windows.
Windows
- Build. When in doubt, refer to the build instructions of llama.cpp [1]. Below we are building against OpenCL to get GPU acceleration on an Intel GPU.
set CL_BLAST_CMAKE_PKG="C:/Program Files/CLBlast-1.6.1-windows-x64/lib/cmake/CLBlast"
mkdir build
cd build
cmake .. -DBUILD_SHARED_LIBS=OFF -DLLAMA_CLBLAST=ON -DCMAKE_PREFIX_PATH=%CL_BLAST_CMAKE_PKG% -G "Visual Studio 17 2022" -A x64
cmake --build . --config Release
The commands above will change for Mac, where you will perform a Metal build instead; that is the only difference between Windows and Mac. Refer to the section on Mac below.
Verify:
jllama.vcxproj -> C:\Users\siddj\code\java-llama.cpp\src\main\resources\de\kherud\llama\Windows\x86_64\jllama.dll
c:\Users\siddj\code\java-llama.cpp\build>ls ..\src\main\resources\de\kherud\llama\Windows\x86_64
jllama.dll llama.dll
Mac
As mentioned before, GPU acceleration is enabled by default for Mac, so you don't need to do anything, but we cover the steps for completeness.
In the latest code:
On MacOS, Metal is enabled by default.
debug build:
cmake -DLLAMA_METAL=ON -DCMAKE_BUILD_TYPE=Debug ../..
cmake --build . --config Debug
release build:
cmake -DLLAMA_METAL=ON ../..
cmake --build . --config Release