[bugfix] tokenizers respect padding: true with non-null max_length #1284

Open · wants to merge 1 commit into base: main
Conversation

dwisdom0

This commit changes the behavior of tokenizers to fix a bug when a user passes both padding: true and max_length: <some number>.

The expected behavior is described in the docs of the Python library here
https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.__call__.padding

padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:

  • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  • 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
  • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).

And in the Transformers.js docs here
https://huggingface.co/docs/transformers.js/api/tokenizers#pretrainedtokenizercalltext-options--code-batchencoding-code

[options.padding] boolean | 'max_length' (default: false): Whether to pad the input sequences.
[options.max_length] number: Maximum length of the returned list and optionally padding length.
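
To make the intended semantics concrete, the effective padding length can be expressed as a small helper. This is only an illustrative sketch of the documented behavior; resolvePadLength is a hypothetical name, not the library's actual code.

// Illustrative sketch only; resolvePadLength is a hypothetical helper,
// not part of the Transformers.js API.
function resolvePadLength(padding, maxLength, longestInBatch) {
  if (padding === false || padding === 'do_not_pad') {
    return null; // no padding
  }
  if (padding === 'max_length') {
    // Pad to max_length (or the model's maximum input length if unset).
    return maxLength;
  }
  // padding === true or 'longest': pad to the longest sequence in the
  // batch, capped at max_length when one is provided.
  return maxLength == null ? longestInBatch : Math.min(longestInBatch, maxLength);
}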

Before this commit, passing

{
  padding: true,
  max_length: 512
}

or

{
  padding: 'max_length',
  max_length: 512
}

would both always pad all outputs to 512 tokens, even if the longest encoding in the batch was shorter than 512 tokens. This is the correct behavior for padding: 'max_length', but it's incorrect for padding: true.

After this change,

{
  padding: true,
  max_length: 512
}

will now pad the outputs to the length of the longest encoding in the batch or to max_length, whichever is shorter.

This commit also adds a test to prevent regressions.
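
For example, after this change a batch call like the following should produce input_ids padded to the longest encoding in the batch rather than to 512. This is a minimal usage sketch; the model id is just an example.

import { AutoTokenizer } from '@huggingface/transformers';

// Any fast tokenizer with a pad token works here; this model id is an example.
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

const { input_ids } = tokenizer(['a short input', 'a slightly longer input sequence'], {
  padding: true,
  max_length: 512,
});

// With padding: true, the second dimension should equal the longest
// encoding in the batch (well under 512), not max_length.
console.log(input_ids.dims);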
