Setting up Deep Learning VM in GCP – Part 3

Earlier I had set up a deep learning VM in GCP from scratch, but it gave me a few issues:

  • Not able to run Vicuna with the NVIDIA P100 GPU.
  • I had installed CUDA 12.1 (the latest at the time of writing), which was incompatible with GPTQ-for-LLaMa / AutoGPTQ. The maximum CUDA version supported by GPTQ-for-LLaMa / AutoGPTQ is 11.8 (at the time of writing). See the version check below.
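
For reference, a quick way to catch this kind of mismatch up front is to compare the CUDA toolkit on the box with the CUDA version the installed PyTorch wheel was built against (a minimal check, assuming PyTorch is already installed; nvcc and torch.version.cuda are standard interfaces):

$ /usr/local/cuda/bin/nvcc --version | grep release
$ python3 -c "import torch; print(torch.version.cuda)"

The two should agree on at least the major version before compiling GPU kernels such as the GPTQ ones.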

So I tried provisioning another VM, making notes to help me out next time. This time I noted down the shell command, which looks like this:

#!/bin/bash
#https://cloud.google.com/compute/docs/gpus#a100-gpus
gcloud compute instances create deep-learning-a100 \
    --project=xxx \
    --zone=us-central1-f \
    --machine-type=a2-highgpu-1g \
    --network-interface=stack-type=IPV4_ONLY,subnet=subnet-a,no-address \
    --maintenance-policy=TERMINATE \
    --provisioning-model=STANDARD \
    --service-account=xxx-compute@developer.gserviceaccount.com \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --accelerator=count=1,type=nvidia-tesla-a100 \
    --create-disk=auto-delete=yes,boot=yes,device-name=deep-learning-a100-boot-disk,image=projects/ml-images/global/images/c0-deeplearning-common-cu113-v20230615-debian-11-py310,mode=rw,size=100,type=projects/xxx/zones/us-central1-f/diskTypes/pd-balanced \
    --create-disk=device-name=deep-learning-a100-data-disk,kms-key=projects/xxx/locations/us-central1/keyRings/compute_keyring_uscentral_1/cryptoKeys/compute_cmek_symmetric_hsm_uscentral-1,mode=rw,name=deep-learning-a100-data-disk,size=1024,type=projects/xxx/zones/us-central1-f/diskTypes/pd-balanced \
    --no-shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring \
    --labels=goog-ec-src=vm_add-gcloud \
    --reservation-affinity=any

What it does: provision an a2-highgpu-1g VM, which comes with an NVIDIA A100 40GB GPU and 85GB of RAM. The VM is provisioned in us-central1-f, and I also attach a data disk of size 1024GB. The data disk is encrypted with a customer-managed encryption key that I created separately. I do not assign a public IP to the VM and use an internal IP instead. The OS image used to bootstrap the VM is c0-deeplearning-common-cu113-v20230615-debian-11-py310, which gave me Debian 11 with Python 3.10, git, virtualenv and CUDA 11.3 pre-installed. I have also given the https://www.googleapis.com/auth/cloud-platform scope to the VM, which allows applications running on it to access any Google Cloud API.
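
Once the command returns, a couple of gcloud checks confirm that the instance is up and actually got the GPU (a sketch reusing the instance name and zone from above):

$ gcloud compute instances describe deep-learning-a100 --zone=us-central1-f --format="value(status,guestAccelerators)"
$ gcloud compute accelerator-types list --filter="zone:us-central1-f"

The first should print RUNNING along with the attached accelerator; the second lists which GPU types the zone offers at all.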

When I ssh-ed to the VM for the first time I saw this message, which was misleading:

This VM requires Nvidia drivers to function correctly.   Installation takes ~1 minute.
Would you like to install the Nvidia driver? [y/n] n

I did not have to install any drivers. The drivers were already pre-installed, along with the CUDA toolkit in /usr/local/cuda. The nvcc compiler is at /usr/local/cuda/bin/nvcc and the nvidia-smi program is at /usr/bin/nvidia-smi.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
$ nvidia-smi
Fri Jun 23 17:14:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    58W / 400W |      0MiB / 40960MiB |     27%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Even with no processes consuming the GPU, nvidia-smi shows 27% utilization, which is a bummer. This did not happen with the P100 GPU. Is it because Jupyter is running behind the scenes?
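
To rule out a transient reading, utilization can be sampled once a second with nvidia-smi's query mode (stock nvidia-smi flags; Ctrl-C to stop):

$ nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1

On a truly idle GPU this should settle at 0 %.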

I think the driver-installation prompt shows up because, even though the drivers are installed, they are not on the PATH. To add them to the PATH, I edited ~/.profile like so:

# Store Hugging Face transformers / datasets caches on the big data disk
# (/app is a symlink to the data disk mount; see below)
export TRANSFORMERS_CACHE=/app/.cache
export HF_DATASETS_CACHE=/app/.cache

# https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#environment-setup
# Put the CUDA toolkit binaries (nvcc) and libraries on the search paths
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
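
After saving the file, the change can be picked up and sanity-checked in the current shell:

$ source ~/.profile
$ which nvcc        # should now print /usr/local/cuda/bin/nvcc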

In the above I am also overriding the default locations where Hugging Face transformers and datasets store their assets. This brings me to /app. The data disk I created gets mounted at /home/jupyter automatically, which I saw when I ran:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             42G     0   42G   0% /dev
tmpfs           8.4G  448K  8.4G   1% /run
/dev/sda1        99G   39G   56G  41% /
tmpfs            42G     0   42G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      124M   11M  114M   9% /boot/efi
/dev/sdb       1007G   18M 1007G   1% /home/jupyter
tmpfs           8.4G     0  8.4G   0% /run/user/1001

so I created /app as a symlink to /home/jupyter:

$ sudo ln -s /home/jupyter /app
$ sudo chown me /app
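
Since chown follows the symlink, that last command actually changes the ownership of /home/jupyter itself. To confirm that writes to /app land on the 1TB data disk, df can be pointed at the symlink; it resolves to the mount underneath (matching the df output above):

$ df -h /app
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb       1007G   18M 1007G   1% /home/jupyter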

I have to say this was a much smoother experience for me. After this, I was able to do deep learning programming as usual. Lessons learnt:

  • Do NOT install a version of CUDA that is more recent than the one your PyTorch build was compiled against. You can see the latest CUDA version supported by PyTorch on the installation page at https://pytorch.org/get-started/locally/.
  • The OS image c0-deeplearning-common-cu113-v20230615-debian-11-py310 took a lot of the pain out compared to the previous time. It gave me Debian 11 with Python 3.10, git, virtualenv and CUDA 11.3 pre-installed. The only thing you need to do is add the CUDA paths to your ~/.profile.

The only suboptimal thing was that the additional disk gets automatically mounted at the weird location /home/jupyter. I guess I can live with that.
