Skip to content

[Qwen2.5-VL] Fix empty string input crash in processor #38421

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

Flink-ddd
Copy link

This PR fixes #38417.
When passing an empty string to the Qwen2.5 tokenizer with return_tensors="pt", the original output was a float32 tensor.
This patch ensures a consistent torch.long dtype by returning torch.empty((1, 0), dtype=torch.long) for empty input.
A test is included to validate the fix.

This is my first contribution to 🤗 Transformers. Happy to help and open to feedback!

@Flink-ddd
Copy link
Author

test(Qwen2Tokenizer): Add regression test for empty string input dtype (#38417)

This commit introduces a regression test within Qwen2_5_VLProcessorTest
to verify the dtype of input_ids returned by Qwen2Tokenizer
when processing an empty string with return_tensors="pt".

The test uses the "Qwen/Qwen2-0.5B" model and asserts that the
output dtype should be torch.long (int64). Currently, this test
is expected to FAIL as it correctly reproduces the behavior
reported in issue #38417, where torch.float32 is returned instead.

This test serves to prevent future regressions and will pass once
the underlying issue in Qwen2Tokenizer is resolved.

@Flink-ddd
Copy link
Author

Hi maintainers,

I've pushed an update to this PR. Here's the current status of the CI checks:

  1. Both the tests_processors and run_tests jobs are marked as failing.
  2. These failures are both due to the same single reason: the newly added regression test method, test_qwen2_tokenizer_empty_string_regression, is working as expected and failing with AssertionError: torch.float32 != torch.int64. This confirms it successfully reproduces the bug described in issue Tokenizer returns float32 tensor for empty string input instead of long dtype #38417 when using the "Qwen/Qwen2-0.5B" model.
  3. All other unrelated CI checks (e.g., code formatting, quality) should now be passing.

This PR, which aims to add this regression test, is exhibiting the expected behavior for its core test (i.e., identifying the existing bug). It should now be ready for review.

Thanks!

@Rocketknight1
Copy link
Member

Thank you for the PR, but unfortunately it's a duplicate of #36555! It's unfortunate - your code looks good and I hope this doesn't discourage you from contributing in future

@Flink-ddd
Copy link
Author

Hi @Rocketknight1, thank you for the quick review and the kind words!

I understand that this PR is a duplicate of #36555. I appreciate the feedback and the opportunity to go through the contribution process. It was a great learning experience for me.

I'll continue exploring other open issues and look forward to contributing more in the future!

Thanks again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tokenizer returns float32 tensor for empty string input instead of long dtype
2 participants