
Performance and Behavioral Differences Between Versions of model.safetensors.index.json #98


Open
mingkai-zheng opened this issue Jan 22, 2025 · 3 comments


@mingkai-zheng
Copy link

Hello,

I recently noticed that the model.safetensors.index.json file was updated to include two additional keys: vision_tower.post_layernorm.bias and vision_tower.post_layernorm.weight. However, the model built from the current GitHub codebase does not have a vision_tower.post_layernorm layer.

After testing the image cat.png with the demo example (with do_sample=False and no temperature applied), I observed slight differences in the output between the two versions.

Could you clarify:

  1. Are these differences expected?
  2. Do the changes introduce any performance or behavioral implications?
  3. Is there anything I might have overlooked regarding this update?

Thank you for your assistance!

@xffxff
Collaborator

xffxff commented Jan 22, 2025

@mingkai-zheng Hi

Great observation! You're absolutely right -- the vision tower architecture used in Aria doesn't actually include a post_layernorm layer. The changes to the checkpoint weights, including the added "vision_tower.post_layernorm" keys, were introduced by the transformers team when they integrated Aria into their repo.

I haven't checked whether the two implementations (ours and transformers') are 100% identical, but I don't think the differences you're seeing are caused by the added post_layernorm weights. From what I understand, the transformers team added the post_layernorm weights only so they could reuse the existing Idefics3VisionTransformer without creating a new variant that removes the layer. The transformers implementation doesn't actually use post_layernorm: it sets output_hidden_states=True and fetches the hidden state at vision_feature_layer, which is the output right before post_layernorm. So even though the weights exist, the layer itself doesn't affect the final inference outputs.

https://github.com/huggingface/transformers/blob/f4f33a20a23aa90f3510280e34592b2784d48ebe/src/transformers/models/aria/modeling_aria.py#L1417-L1425
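As a toy illustration (pure NumPy, not the actual Aria or transformers code; the class and variable names are made up), here's why weights that live only in an unused post-layernorm cannot change a feature that is taken from the hidden states recorded *before* that norm:

```python
import numpy as np

class ToyVisionTower:
    """Hypothetical two-layer 'vision tower' with a trailing post-layernorm.
    Hidden states are recorded before the norm, mimicking output_hidden_states=True."""
    def __init__(self, post_ln_scale):
        self.w1 = np.eye(4) * 2.0
        self.w2 = np.eye(4) * 0.5
        self.post_ln_scale = post_ln_scale  # plays the role of post_layernorm.weight

    def forward(self, x):
        hidden_states = [x]
        x = x @ self.w1
        hidden_states.append(x)
        x = x @ self.w2
        hidden_states.append(x)  # <- the feature at a pre-norm vision_feature_layer
        normed = self.post_ln_scale * (x - x.mean()) / (x.std() + 1e-5)
        return normed, hidden_states

x = np.arange(4.0)
_, hs_a = ToyVisionTower(post_ln_scale=1.0).forward(x)
_, hs_b = ToyVisionTower(post_ln_scale=123.0).forward(x)
# The pre-norm hidden state is identical even though the norm weights differ:
assert np.allclose(hs_a[-1], hs_b[-1])
```

Since the extracted feature is the hidden state recorded before the norm is applied, any values stored in the norm's weights are simply dead parameters for inference.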

@mingkai-zheng
Author

mingkai-zheng commented Jan 22, 2025

Hi @xffxff

Thank you so much for providing the information! Both versions of the model actually give quite reasonable responses for the provided cat.png; I'm just a bit confused about where the discrepancy comes from and will investigate further on my end.
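In case it helps the investigation, a trivial helper like the following (purely illustrative, with made-up token ids) can locate the first position at which two greedy generations diverge:

```python
def first_divergence(tokens_a, tokens_b):
    """Return the index of the first differing token between two id
    sequences, or None if the compared prefix is identical."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    return None

# Example with made-up token ids:
print(first_divergence([1, 5, 9, 4], [1, 5, 7, 4]))  # -> 2
```

Running both checkpoints with do_sample=False and diffing the token ids this way pinpoints exactly where the outputs split, which is usually more informative than eyeballing the decoded text.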

Currently, I’m trying to reproduce Aria's performance on the MMMU (val) benchmark using both versions, based on the lmms-eval codebase. In my experiments, both versions achieve quite similar results, around 45 (input resolution fixed at 980x980). However, this is significantly lower than the 54.9 reported in Table 1 of your paper.

Could you clarify how your team evaluates performance on MMMU? Specifically:

  1. Are you using a codebase other than lmms-eval?
  2. Are there variations in the prompts or evaluation setup that might explain the discrepancy?

BTW, I believe this question might also be related to the other issue I created yesterday (#90).

Thank you so much for your help!

@xffxff
Collaborator

xffxff commented Jan 22, 2025

@mingkai-zheng

Thanks for your question!

I’m not directly involved in the training and evaluation of the Aria model, so I don’t have all the details about the MMMU evaluation. What I can share is that we use an internal evaluation framework, which may differ from lmms-eval.

@LiJunnan1992 @dxli94 @teowu may know more details
