Fine-tuning LLMs on your own tasks is expensive, and if you do it naively, a big chunk of that compute is simply burned on padding. LLMs train on batches of token sequences, and every sequence in a batch must have the same length. For instance, if a batch holds four sequences of 4, 12, 13, and 5,000 tokens, the three short ones are padded with dummy tokens up to 5,000.
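To put a number on that example, here is a minimal sketch that counts how many slots in a batch hold real tokens and how many hold padding. It assumes the batch is padded to the length of its longest sequence; `padding_stats` is a hypothetical helper written for illustration, not part of any library.

```python
def padding_stats(lengths):
    """Return (real_tokens, total_slots, wasted_fraction) for one batch.

    Assumes every sequence is padded up to the longest one in the batch.
    """
    max_len = max(lengths)
    real_tokens = sum(lengths)
    total_slots = max_len * len(lengths)
    wasted_fraction = (total_slots - real_tokens) / total_slots
    return real_tokens, total_slots, wasted_fraction


# The batch from the example above: three short sequences and one long one.
real, total, wasted = padding_stats([4, 12, 13, 5000])
print(f"{real} real tokens in {total} slots, {wasted:.1%} padding")
# prints: 5029 real tokens in 20000 slots, 74.9% padding
```

In this batch roughly three out of every four token slots are padding, all of it caused by a single long sequence.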