[bugfix] tokenizers respect padding: true with non-null max_length #1284

Open · wants to merge 1 commit into base: main
Conversation

dwisdom0

This commit changes the behavior of tokenizers to fix a bug when a user passes both padding: true and max_length: <some number>.

The expected behavior is described in the docs of the Python library here
https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.__call__.padding

padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:

  • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  • 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
  • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).

And in the Transformers.js docs here
https://huggingface.co/docs/transformers.js/api/tokenizers#pretrainedtokenizercalltext-options--code-batchencoding-code

[options.padding] boolean | 'max_length' (default: false): Whether to pad the input sequences.
[options.max_length] number: Maximum length of the returned list and optionally padding length.
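
To make the intended semantics concrete, the effective padding length can be expressed as a small helper. This is only an illustrative sketch of the documented behavior; resolvePadLength is a hypothetical name, not the library's actual code.

// Illustrative sketch only; resolvePadLength is a hypothetical helper,
// not part of the Transformers.js API.
function resolvePadLength(padding, maxLength, longestInBatch) {
  if (padding === false || padding === 'do_not_pad') {
    return null; // no padding
  }
  if (padding === 'max_length') {
    // Pad to max_length (or the model's maximum input length if unset).
    return maxLength;
  }
  // padding === true or 'longest': pad to the longest sequence in the
  // batch, capped at max_length when one is provided.
  return maxLength == null ? longestInBatch : Math.min(longestInBatch, maxLength);
}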

Before this commit, passing

{
  padding: true,
  max_length: 512
}

or

{
  padding: 'max_length',
  max_length: 512
}

would both always pad all outputs to 512 tokens, even if the longest encoding in the batch was shorter than 512 tokens. This is the correct behavior for padding: 'max_length', but it's incorrect for padding: true.

After this change,

{
  padding: true,
  max_length: 512
}

will now pad the outputs to the length of the longest encoding in the batch or to max_length, whichever is shorter.

This commit also adds a test to prevent regressions.
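
For example, after this change a batch call like the following should produce input_ids padded to the longest encoding in the batch rather than to 512. This is a minimal usage sketch; the model id is just an example.

import { AutoTokenizer } from '@huggingface/transformers';

// Any fast tokenizer with a pad token works here; this model id is an example.
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

const { input_ids } = tokenizer(['a short input', 'a slightly longer input sequence'], {
  padding: true,
  max_length: 512,
});

// With padding: true, the second dimension should equal the longest
// encoding in the batch (well under 512), not max_length.
console.log(input_ids.dims);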
