Understanding transformers Internals

transformers is the go-to library for LLM development. In this blog post I capture some tips and tricks about transformers, in no particular order. The focus is on understanding the internals of the library – what happens underneath when you call a function, and how you can debug it. This is based on my belief that:

All coding is debugging

Understanding Tokenizers

Broadly speaking, a tokenizer serves two purposes:

  1. defines a vocabulary
  2. defines a method to take a string as input and tokenize it into tokens from the vocabulary (and vice-versa)

Takeaway: When fine-tuning a pre-trained model, you must use the pre-trained model’s tokenizer, otherwise you will get nonsensical results.

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> tokenizer
GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True)

We can see the context size of gpt2 is 1024 tokens (model_max_length=1024).

bos = beginning of sequence. eos = end of sequence, a special token that tells the model to stop generating text.

Fast tokenizers are implemented in Rust.

lmsys/vicuna-13b-v1.3 tokenizer

>>> tokenizer
LlamaTokenizerFast(name_or_path='lmsys/vicuna-13b-v1.3', vocab_size=32000, model_max_length=2048, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False)

We can see vicuna has a context length of 2048 tokens.

Understanding the difference between the tokenize and __call__ functions

>>> tokenizer.tokenize(tokenizer.pad_token)
['<unk>']
>>> tokenizer(tokenizer.pad_token)
{'input_ids': [1, 0], 'attention_mask': [1, 1]}

Why did it return 2 input_ids? It’s because the vicuna tokenizer by default adds a bos token (input_id=1) to every input. Verify:

>>> tokenizer.convert_ids_to_tokens([1, 0])
['<s>', '<unk>']
>>> tokenizer.pad_token
'<unk>'
>>> tokenizer.bos_token
'<s>'
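
As a sanity check, here is a minimal sketch (assuming the same vicuna tokenizer): passing add_special_tokens=False should suppress the automatic bos token.

# A hedged sketch: with add_special_tokens=False the leading bos (id=1) should
# no longer be prepended, leaving only the <unk>/pad token id (0).
out = tokenizer(tokenizer.pad_token, add_special_tokens=False)
print(out["input_ids"])   # expected: [0]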

How to wrap the tokens so they don’t overflow the model’s context length?

If you try to tokenize a very long piece of text, it may overflow the model’s context window. You will also see a warning. For purposes of causal language modelling, we would like to simply wrap the overflowing tokens around into a new line. This can be done with the following function call:

tokenizer_output = tokenizer(rows, max_length=tokenizer.model_max_length, truncation=True, stride=4, return_overflowing_tokens=True)

Try it out; rows contains the examples from the original dataset. This will wrap overflowing rows around into new lines. The stride dictates the number of overlapping tokens when a row is wrapped around.
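
Here is a minimal sketch (the long text and all lengths are made up) of what return_overflowing_tokens does:

# Build a string that is longer than the model's context window (made-up content).
long_text = " ".join(["hello world"] * 2000)

out = tokenizer(
    long_text,
    max_length=tokenizer.model_max_length,
    truncation=True,
    stride=4,
    return_overflowing_tokens=True,
)

# The overflow has been wrapped into additional rows, each within the context window.
print(len(out["input_ids"]))                     # more than 1 row
print([len(ids) for ids in out["input_ids"]])    # every row <= model_max_length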

Understanding Datasets

Use the Dataset.from_dict method to create a dataset from your own data that is not published on the Hugging Face Hub. You will frequently encounter the dataset.map method. It takes as input a function that is applied to tokenize the data. The function can do anything – it’s not necessary for it to do tokenization. When things aren’t working the way you expect, stick a breakpoint on this line /datasets/arrow_dataset.py:3344:

processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)

This is the point where the function is applied, and it will help you debug any issues. See Issue 5997 for a bug in datasets when you wrap around overflowing tokens.
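
For example, a minimal sketch (the column name and rows are made up) of Dataset.from_dict followed by map:

from datasets import Dataset

# Build a tiny dataset from local data (not on the Hub).
ds = Dataset.from_dict({"text": ["first example", "second example"]})

def tokenize_fn(batch):
    # The mapped function can do anything; here it simply tokenizes the "text" column.
    return tokenizer(batch["text"])

ds = ds.map(tokenize_fn, batched=True, remove_columns=["text"])
print(ds)   # now has input_ids and attention_mask columns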

Understanding DataCollators

The purpose of a data collator is to turn jagged arrays of token ids into arrays of uniform length so they can be passed to the trainer.

The meat of a data collator can be found here:

def torch_call(self, examples: List[Union[List[int], Any, Dict[str, Any]]]) -> Dict[str, Any]:
    # Handle dict or lists with proper padding and conversion to tensor.
    if isinstance(examples[0], Mapping):
        batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
    else:
        batch = {
            "input_ids": _torch_collate_batch(examples, self.tokenizer, pad_to_multiple_of=self.pad_to_multiple_of)
        }

    # If special token mask has been preprocessed, pop it from the dict.
    special_tokens_mask = batch.pop("special_tokens_mask", None)
    if self.mlm:
        batch["input_ids"], batch["labels"] = self.torch_mask_tokens(
            batch["input_ids"], special_tokens_mask=special_tokens_mask
        )
    else:
        labels = batch["input_ids"].clone()
        if self.tokenizer.pad_token_id is not None:
            labels[labels == self.tokenizer.pad_token_id] = -100 ## NOTE
        batch["labels"] = labels
    return batch

Stick a breakpoint in this function to debug. This function is basically doing 2 things:

  1. Convert jagged arrays of token ids into arrays of uniform length. This is done by the call to pad. Make sure tokenizer.pad_token is defined before this function is called.
  2. Replace labels holding pad token ids with -100. Why? Refer to the Transformers book, p. 161, or elsewhere. The function that computes the training loss will ignore labels marked as -100.

For purposes of causal language modelling, you will set mlm=False. mlm stands for masked language modelling. If you set mlm=True, the data collator will randomly mask a few labels according to a probability set by the user. The model then has to predict or fill in the masked labels.

Here is a notebook showing the data collator in action. Notice how it turns the jagged arrays of input_ids into arrays of uniform length.
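
If you just want to poke at it locally, here is a minimal sketch (the token ids are made up; it assumes tokenizer.pad_token is set, as it is for vicuna):

from transformers import DataCollatorForLanguageModeling

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Two jagged rows of (made-up) token ids.
examples = [{"input_ids": [1, 42, 7]}, {"input_ids": [1, 42, 7, 99, 13]}]
batch = collator(examples)

print(batch["input_ids"].shape)   # torch.Size([2, 5]): padded to the longest row
print(batch["labels"])            # pad positions are set to -100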

Where are we defining the expected outputs?

This is all well and good, but where are the expected outputs defined? The input_ids define the input to the model; they contain the token ids of the tokenized text, and the labels we created above are simply a copy of them. So where are we defining the expected outputs? It turns out we don’t do that explicitly. By convention, causal language models do this implicitly in their forward pass. See, e.g., here:

# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous() # expected outputs of the causal LM

You can pick any causal LM you like under models. You will see code similar to the above in the model’s forward function. To calculate the training loss, we compute the cross-entropy between the model’s predictions and the expected outputs. See this:

loss = loss_fct(shift_logits, shift_labels)

The logits variable contains the un-normalized probabilities of the model’s predictions. For each position this is a vector of length = vocab size, with f(i) = the (un-normalized) probability of the i-th token appearing next in the output. A maximum likelihood estimator will thus select the token with the largest probability (argmax of f). shift_labels contains the ground truth (a scalar token id per position). The cross-entropy function takes the logits vector and calculates its cross-entropy w.r.t. the ground truth. rank(shift_labels) = rank(shift_logits) - 1, since shift_logits contains a logit for every token in the vocabulary whereas shift_labels has an unambiguous ground truth (a scalar).
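
Here is a toy sketch of that computation (tiny made-up vocabulary and labels), matching the shift-and-flatten pattern used in the models’ forward functions:

import torch
import torch.nn as nn

vocab_size = 5
logits = torch.randn(1, 4, vocab_size)    # (batch, seq_len, vocab_size) model outputs
labels = torch.tensor([[2, 4, 1, 3]])     # (batch, seq_len) ground-truth token ids

# Shift so that tokens < n predict n.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()

# CrossEntropyLoss ignores targets equal to -100 by default (ignore_index=-100).
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(shift_logits.view(-1, vocab_size), shift_labels.view(-1))
print(loss)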

The DataLoader

The Trainer class uses the training dataset and data collator to create a DataLoader as seen here:

def get_train_dataloader(self) -> DataLoader:
    """
    Returns the training [`~torch.utils.data.DataLoader`].

    Will use no sampler if `train_dataset` does not implement `__len__`, a random sampler (adapted to distributed
    training if necessary) otherwise.

    Subclass and override this method if you want to inject some custom behavior.
    """
    ...

    return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))

What is the purpose of the DataLoader? The training loop uses it to get the next batch of samples on each iteration of the loop. This happens in the following lines of _inner_training_loop:

for epoch in range(epochs_trained, num_train_epochs):
    epoch_iterator = train_dataloader
...
    for step, inputs in enumerate(epoch_iterator):
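
To see this wiring outside the Trainer, here is a minimal sketch reusing the ds and collator sketched in the earlier sections:

from torch.utils.data import DataLoader

loader = DataLoader(ds, batch_size=2, collate_fn=collator, shuffle=True)
for step, batch in enumerate(loader):
    # Each batch is already padded to uniform length by the collator.
    print(step, batch["input_ids"].shape, batch["labels"].shape)
    break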

Understanding TrainingArguments

The TrainingArguments class is defined here. It takes a large number of arguments as input. Let us explain some of them.

per_device_train_batch_size

Tells the trainer how many examples (data points) to select at a time from the training dataset for gradient optimization (model fitting).

gradient_accumulation_steps

A trick to feed large batches to the model while keeping per_device_train_batch_size small. The computer will quickly run into out-of-memory (OOM) errors as you increase per_device_train_batch_size, but feeding very small batches to the model leads to poor training and optimization. The effective batch size = per_device_train_batch_size * gradient_accumulation_steps. You should first keep gradient_accumulation_steps = 1 and increase per_device_train_batch_size to the point where you hit OOM. After that, increase gradient_accumulation_steps as necessary.
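
A hedged sketch (output_dir and the numbers are arbitrary) giving an effective batch size of 4 * 8 = 32:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,   # examples per forward/backward pass
    gradient_accumulation_steps=8,   # accumulate gradients over 8 passes before updating
)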

max_steps vs. num_train_epochs

We need to tell the trainer when it can stop. We can do it in two ways: either set max_steps (default = -1) or num_train_epochs. A step = an update of the model weights; usually this corresponds to one forward and backward pass, but if you set gradient_accumulation_steps > 1 the update happens after that many passes. An epoch = as many steps as required for the entire dataset to have been fed to the optimizer. If the user sets both, the library uses max_steps (the two are mutually exclusive) to decide when to stop the training; num_train_epochs will be reset accordingly (see the tip later in this section).

number of steps = (number of examples in training dataset) / (per_device_train_batch_size * gradient_accumulation_steps) * number of epochs
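
For example, a quick back-of-the-envelope calculation with made-up numbers:

import math

num_examples = 10_000                 # size of the training dataset (made up)
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_train_epochs = 3

effective_batch = per_device_train_batch_size * gradient_accumulation_steps   # 32
steps_per_epoch = math.ceil(num_examples / effective_batch)                   # 313
total_steps = steps_per_epoch * num_train_epochs                              # 939
print(total_steps)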

Why don’t we have a criterion like: stop the training when the incremental (i.e., successive delta) loss between iterations becomes lower than a specified number, like 0.01%? That, to me, is the proper way to identify when the optimization has converged, and it is what is taught in textbooks, rather than forcing the optimizer to perform a fixed number of iterations.

Tip: Below is the code that overrides num_train_epochs when max_steps > 0 [1]:

if args.max_steps > 0:
    max_steps = args.max_steps
    num_train_epochs = args.max_steps // num_update_steps_per_epoch + int(
        args.max_steps % num_update_steps_per_epoch > 0
    )

gradient_checkpointing

Again, it’s a technique to cope with OOM exceptions. It reduces the memory footprint at the expense of increased computation. It does not affect the output of the optimizer.

no_cuda

Set no_cuda=True if you do not want to use the GPU even if it’s available.

QLoRA

QLoRA is not some separate library. QLoRA adds quantization on top of LoRA. In the transformers ecosystem, the quantization is provided by the bitsandbytes library (so you could think of bitsandbytes as the QLoRA library if you want). bitsandbytes requires a GPU (specifically CUDA), so it won’t run on a Mac. You don’t have to make explicit calls to bitsandbytes in your code. The calls happen implicitly from transformers modules once you load a model with a BitsAndBytesConfig (which is defined in transformers, not bitsandbytes).
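
A hedged sketch of what that looks like (the model name is just an example; this needs a CUDA GPU and the bitsandbytes package installed):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # QLoRA-style 4-bit quantization
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# bitsandbytes is invoked implicitly inside transformers when the config is passed.
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.3",
    quantization_config=bnb_config,
    device_map="auto",
)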

Using the accelerate library

You don’t have to make explicit calls to accelerate; the HF Trainer does it for you. That is one reason to use the HF Trainer instead of writing a plain PyTorch training loop.

Where is the code that decides which optimizer class to instantiate?

Here:

def get_optimizer_cls_and_kwargs(args: TrainingArguments) -> Tuple[Any, Any]:
    """
    Returns the optimizer class and optimizer parameters based on the training arguments.

    Args:
        args (`transformers.training_args.TrainingArguments`):
            The training arguments for the training session.
    """

All the optimizer names are defined here:

class OptimizerNames(ExplicitEnum):
    """
    Stores the acceptable string identifiers for optimizers.
    """

    ADAMW_HF = "adamw_hf"
    ADAMW_TORCH = "adamw_torch"
    ADAMW_TORCH_FUSED = "adamw_torch_fused"
    ADAMW_TORCH_XLA = "adamw_torch_xla"
    ADAMW_APEX_FUSED = "adamw_apex_fused"
    ADAFACTOR = "adafactor"
    ADAMW_ANYPRECISION = "adamw_anyprecision"
    SGD = "sgd"
    ADAGRAD = "adagrad"
    ADAMW_BNB = "adamw_bnb_8bit"
    ADAMW_8BIT = "adamw_8bit"  # just an alias for adamw_bnb_8bit
    LION_8BIT = "lion_8bit"
    LION = "lion_32bit"
    PAGED_ADAMW = "paged_adamw_32bit"
    PAGED_ADAMW_8BIT = "paged_adamw_8bit"
    PAGED_LION = "paged_lion_32bit"
    PAGED_LION_8BIT = "paged_lion_8bit"
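
These string identifiers are what you pass as the optim argument of TrainingArguments. A minimal sketch:

from transformers import TrainingArguments

# Select the paged 8-bit AdamW optimizer (provided by bitsandbytes) by name.
args = TrainingArguments(output_dir="out", optim="paged_adamw_8bit")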

target_modules

To use LoRA with transformers, we need to specify the target_modules in LoraConfig. You can leave it empty, in which case the library tries to automatically populate it for you based on a pre-defined mapping. This mapping can be found here:

TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING = {
    "t5": ["q", "v"],
    "mt5": ["q", "v"],
    "bart": ["q_proj", "v_proj"],
    "gpt2": ["c_attn"],
    "bloom": ["query_key_value"],
    "blip-2": ["q", "v", "q_proj", "v_proj"],
    "opt": ["q_proj", "v_proj"],
    "gptj": ["q_proj", "v_proj"],
    "gpt_neox": ["query_key_value"],
    "gpt_neo": ["q_proj", "v_proj"],
    "bert": ["query", "value"],
    "roberta": ["query", "value"],
    "xlm-roberta": ["query", "value"],
    "electra": ["query", "value"],
    "deberta-v2": ["query_proj", "value_proj"],
    "deberta": ["in_proj"],
    "layoutlm": ["query", "value"],
    "llama": ["q_proj", "v_proj"],
    "chatglm": ["query_key_value"],
    "gpt_bigcode": ["c_attn"],
    "mpt": ["Wqkv"],
}
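
Conversely, a hedged sketch of specifying target_modules explicitly (the module names are taken from the llama entry above; the other hyperparameters are arbitrary, and model refers to the quantized model loaded in the QLoRA sketch):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # llama-family attention projections
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()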
