docs: Update LayoutLMv3 model card with standardized format and impro… #37155
base: main
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the **Ready for review** button (at the bottom of the PR page).
Thanks, this is a good start! Please refer to the Gemma 3 docs to see how to standardize this doc 🤗
@@ -14,24 +14,150 @@ rendered properly in your Markdown viewer.

-->

[](https://pytorch.org/get-started/locally/)
Please style these with the `<div>` tags. You can copy it from one of the existing updated model cards on `main` like Gemma 3.
# LayoutLMv3

## Overview
LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.
Suggested change:
- LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.
+ [LayoutLMv3](https://huggingface.co/papers/2204.08387) is a multimodal transformer model designed specifically for Document AI tasks. It unites the pretraining objective for text and images, masked language and masked image modeling, and also includes a word-patch alignment objective for even stronger text and image alignment. The model architecture is also unified and uses a more streamlined approach with patch embeddings (similar to [ViT](./vit)) instead of a CNN backbone.
Not fully resolved yet, missing link to the model
<Tip>
Click on the right sidebar for more examples of how to use the model for different tasks!
</Tip>
Suggested change:
- <Tip>
- Click on the right sidebar for more examples of how to use the model for different tasks!
- </Tip>
+ > [!TIP]
+ > Click on the LayoutLMv3 models in the right sidebar for more examples of how to apply LayoutLMv3 to different vision and language tasks.
Not resolved yet either
outputs = model(**encoding)
```

## Using transformers-cli
We can remove this since `transformers-cli` doesn't support image inputs.
Unresolved
## Quantization

For large models, you can use quantization to reduce memory usage:
Update the code example below accordingly.

Suggested change:
- ## Quantization
- For large models, you can use quantization to reduce memory usage:
+ Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview) overview for more available quantization backends.
+ The example below uses [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) to only quantize the weights to int4.
Not resolved yet. You only need to show quantization for either 8 or 4-bits instead of both. Also the code for quantizing the model is incorrect.
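For reference, a minimal sketch of what such an int4 torchao example could look like (illustrative only, not the PR's code; the string-based `TorchAoConfig` API, the `microsoft/layoutlmv3-large` checkpoint, and the token classification head are assumptions here):

```py
import torch
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification, TorchAoConfig

# Quantize only the weights to int4 with the torchao backend
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-large", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-large",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
```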
@@ -74,81 +200,66 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
**Document question answering**
- [Document question answering task guide](../tasks/document_question_answering)

## LayoutLMv3Config
## Model Details
The rest of these changes should be reverted
Not resolved yet, the `## Model Details` is still there, as are the changes to the header levels of the LayoutLMv3 classes.
819c757 to 5b92ea6 (compare)
9368ed6 to b0aeeec (compare)
b15eb3d to 294e6e9 (compare)
@stevhliu, have a look at this.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hey, it is still a bit off! For example:
I suggest taking a look at the Gemma 3 model card again and trying to align your model card with it as much as possible!
7836f29 to 61f22d5 (compare)
There are a lot of unresolved changes, so please don't mark them as resolved 😅
# LayoutLMv3

## Overview
LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.
Not fully resolved yet, missing link to the model
*Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*
[Paper](https://arxiv.org/abs/2204.08387) | [Official Checkpoints](https://huggingface.co/microsoft/layoutlmv3-base)
Suggested change:
- [Paper](https://arxiv.org/abs/2204.08387) | [Official Checkpoints](https://huggingface.co/microsoft/layoutlmv3-base)
+ You can find all the original LayoutLMv3 checkpoints under the [LayoutLM](https://huggingface.co/collections/microsoft/layoutlm-6564539601de72cb631d0902) collection.
<Tip>
Click on the right sidebar for more examples of how to use the model for different tasks!
</Tip>
Not resolved yet either
@@ -74,81 +200,66 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
**Document question answering**
- [Document question answering task guide](../tasks/document_question_answering)

## LayoutLMv3Config
## Model Details
Not resolved yet, the `## Model Details` is still there, as are the changes to the header levels of the LayoutLMv3 classes.
## Quantization

For large models, you can use quantization to reduce memory usage:
Not resolved yet. You only need to show quantization for either 8 or 4-bits instead of both. Also the code for quantizing the model is incorrect.
outputs = model(**encoding)
```

## Using transformers-cli
Unresolved
result = token_classifier("form.jpg")

# For question answering
qa = pipeline("document-question-answering", model="microsoft/layoutlmv3-base")
Unresolved
## Using the Pipeline

The easiest way to use LayoutLMv3 is through the pipeline API:
Unresolved as there are still other examples here besides question answering
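If the section is trimmed down to question answering only, a minimal sketch could look like the following (illustrative; it reuses the `microsoft/layoutlmv3-base` checkpoint from the diff, although a checkpoint fine-tuned for document QA would give meaningful answers, and the pipeline needs an OCR backend such as pytesseract installed):

```py
from transformers import pipeline

# Document question answering with a LayoutLMv3 checkpoint
# (pytesseract is required to OCR the input image)
qa = pipeline("document-question-answering", model="microsoft/layoutlmv3-base")

result = qa(image="invoice.png", question="What is the total amount?")
print(result)
```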
## Quick Start

Here's a quick example of how to use LayoutLMv3 for document understanding:
Unresolved
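As a point of comparison, a single consolidated quick-start example might look roughly like this (a sketch only; the `microsoft/layoutlmv3-base` checkpoint, the local `document.png` file, and the pre-extracted words with boxes normalized to the 0-1000 range are all assumptions made so the processor can skip OCR):

```py
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3Model

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base")

image = Image.open("document.png").convert("RGB")
# Words and their bounding boxes, normalized to a 0-1000 scale
words = ["Invoice", "Total", "$1,200"]
boxes = [[82, 60, 180, 82], [82, 110, 140, 132], [150, 110, 230, 132]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)
```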
We'll need to update the badges to include FlashAttention and the code examples to include SDPA once #35469 is merged!
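When that happens, the attention backend would presumably be selected at load time; a hedged sketch (it assumes #35469 adds SDPA support to LayoutLMv3 and uses the standard `attn_implementation` argument):

```py
import torch
from transformers import LayoutLMv3Model

# Load with the SDPA attention backend (assumes LayoutLMv3 gains SDPA support)
model = LayoutLMv3Model.from_pretrained(
    "microsoft/layoutlmv3-base",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```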
Update LayoutLMv3 Model Card Documentation
This PR updates the LayoutLMv3 model card documentation to follow the standardized format as requested in #36979. The changes improve the documentation's clarity and usability while maintaining consistency with other model cards in the repository.
What does this PR do?
This PR enhances the LayoutLMv3 model card documentation by:
The changes make the documentation more accessible and provide ready-to-use examples for different use cases, following the standardized format used in other model cards like Gemma 3, PaliGemma, and ViT.
Fixes #36979
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Since this is a documentation update for a vision-language model, I would suggest tagging: