[bugfix] tokenizers respect padding: true with non-null max_length #1284
This commit changes the behavior of tokenizers to fix a bug when a user passes both `padding: true` and `max_length: <some number>`.

The expected behavior is described in the Python library docs here:
https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.__call__.padding
and in the Transformers.js docs here:
https://huggingface.co/docs/transformers.js/api/tokenizers#pretrainedtokenizercalltext-options--code-batchencoding-code

Before this commit, passing `{ padding: true, max_length: 512 }` or `{ padding: 'max_length', max_length: 512 }` would both always pad all outputs to 512 tokens, even if the longest encoding in the batch was shorter than 512 tokens. That is the correct behavior for `padding: 'max_length'`, but it is incorrect for `padding: true`. After this change, `{ padding: true, max_length: 512 }` pads the outputs to match the longest encoding in the batch or `max_length`, whichever is shorter.

This commit also adds a test to prevent regressions.