docs: Update LayoutLMv3 model card with standardized format and impro… #37155
base: main
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the **Ready for review** button (at the bottom of the PR page).
Thanks, this is a good start! Please refer to the Gemma 3 docs to see how to standardize this doc 🤗
@@ -14,24 +14,150 @@ rendered properly in your Markdown viewer.

-->

[](https://pytorch.org/get-started/locally/)
Please style these with the `<div>` tags. You can copy it from one of the existing updated model cards on `main` like Gemma 3.
# LayoutLMv3

## Overview
LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.
Suggested change:
- LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.
+ [LayoutLMv3](https://huggingface.co/papers/2204.08387) is a multimodal transformer model designed specifically for Document AI tasks. It unites the pretraining objective for text and images, masked language and masked image modeling, and also includes a word-patch alignment objective for even stronger text and image alignment. The model architecture is also unified and uses a more streamlined approach with patch embeddings (similar to [ViT](./vit)) instead of a CNN backbone.
Not fully resolved yet, missing link to the model
<Tip>
Click on the right sidebar for more examples of how to use the model for different tasks!
</Tip>
Suggested change:
- <Tip>
- Click on the right sidebar for more examples of how to use the model for different tasks!
- </Tip>
+ > [!TIP]
+ > Click on the LayoutLMv3 models in the right sidebar for more examples of how to apply LayoutLMv3 to different vision and language tasks.
Not resolved yet either
outputs = model(**encoding)
```

## Using transformers-cli
We can remove this since `transformers-cli` doesn't support image inputs.
Unresolved
## Quantization

For large models, you can use quantization to reduce memory usage:
Update the code example below accordingly.

Suggested change:
- ## Quantization
- For large models, you can use quantization to reduce memory usage:
+ Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview) overview for more available quantization backends.
+ The example below uses [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) to only quantize the weights to int4.
Not resolved yet. You only need to show quantization for either 8 or 4-bits instead of both. Also the code for quantizing the model is incorrect.
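For reference, a minimal sketch of what such an int4 torchao example could look like (illustrative only, not the PR's code; the string-based `TorchAoConfig` API, the `microsoft/layoutlmv3-large` checkpoint, and the token classification head are assumptions here):

```py
import torch
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification, TorchAoConfig

# Quantize only the weights to int4 with the torchao backend
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-large", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-large",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
```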
@@ -74,81 +200,66 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
**Document question answering**
- [Document question answering task guide](../tasks/document_question_answering)

## LayoutLMv3Config
## Model Details
The rest of these changes should be reverted
Not resolved yet, the `## Model Details` is still there, as are the changes to the header levels of the LayoutLMv3 classes.
819c757 to 5b92ea6 (compare)
9368ed6 to b0aeeec (compare)
b15eb3d to 294e6e9 (compare)
@stevhliu, have a look at this.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hey, it is still a bit off! For example:
I suggest taking a look at the Gemma 3 model card again and trying to align your model card with it as much as possible!
7836f29 to 61f22d5 (compare)
There are a lot of unresolved changes, so please don't mark them as resolved 😅
# LayoutLMv3

## Overview
LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.
Not fully resolved yet, missing link to the model
*Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*
[Paper](https://arxiv.org/abs/2204.08387) | [Official Checkpoints](https://huggingface.co/microsoft/layoutlmv3-base)
Suggested change:
- [Paper](https://arxiv.org/abs/2204.08387) | [Official Checkpoints](https://huggingface.co/microsoft/layoutlmv3-base)
+ You can find all the original LayoutLMv3 checkpoints under the [LayoutLM](https://huggingface.co/collections/microsoft/layoutlm-6564539601de72cb631d0902) collection.
<Tip>
Click on the right sidebar for more examples of how to use the model for different tasks!
</Tip>
Not resolved yet either
@@ -74,81 +200,66 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
**Document question answering**
- [Document question answering task guide](../tasks/document_question_answering)

## LayoutLMv3Config
## Model Details
Not resolved yet, the `## Model Details` is still there, as are the changes to the header levels of the LayoutLMv3 classes.
## Quantization

For large models, you can use quantization to reduce memory usage:
Not resolved yet. You only need to show quantization for either 8 or 4-bits instead of both. Also the code for quantizing the model is incorrect.
outputs = model(**encoding)
```

## Using transformers-cli
Unresolved
result = token_classifier("form.jpg")

# For question answering
qa = pipeline("document-question-answering", model="microsoft/layoutlmv3-base")
Unresolved
## Using the Pipeline

The easiest way to use LayoutLMv3 is through the pipeline API:
Unresolved as there are still other examples here besides question answering
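If the section is trimmed down to question answering only, a minimal sketch could look like the following (illustrative; it reuses the `microsoft/layoutlmv3-base` checkpoint from the diff, although a checkpoint fine-tuned for document QA would give meaningful answers, and the pipeline needs an OCR backend such as pytesseract installed):

```py
from transformers import pipeline

# Document question answering with a LayoutLMv3 checkpoint
# (pytesseract is required to OCR the input image)
qa = pipeline("document-question-answering", model="microsoft/layoutlmv3-base")

result = qa(image="invoice.png", question="What is the total amount?")
print(result)
```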
## Quick Start

Here's a quick example of how to use LayoutLMv3 for document understanding:
Unresolved
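As a point of comparison, a single consolidated quick-start example might look roughly like this (a sketch only; the `microsoft/layoutlmv3-base` checkpoint, the local `document.png` file, and the pre-extracted words with boxes normalized to the 0-1000 range are all assumptions made so the processor can skip OCR):

```py
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3Model

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base")

image = Image.open("document.png").convert("RGB")
# Words and their bounding boxes, normalized to a 0-1000 scale
words = ["Invoice", "Total", "$1,200"]
boxes = [[82, 60, 180, 82], [82, 110, 140, 132], [150, 110, 230, 132]]

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
print(outputs.last_hidden_state.shape)
```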
We'll need to update the badges to include FlashAttention and the code examples to include SDPA once #35469 is merged!
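When that happens, the attention backend would presumably be selected at load time; a hedged sketch (it assumes #35469 adds SDPA support to LayoutLMv3 and uses the standard `attn_implementation` argument):

```py
import torch
from transformers import LayoutLMv3Model

# Load with the SDPA attention backend (assumes LayoutLMv3 gains SDPA support)
model = LayoutLMv3Model.from_pretrained(
    "microsoft/layoutlmv3-base",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```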
Update LayoutLMv3 Model Card Documentation
This PR updates the LayoutLMv3 model card documentation to follow the standardized format as requested in #36979. The changes improve the documentation's clarity and usability while maintaining consistency with other model cards in the repository.
What does this PR do?
This PR enhances the LayoutLMv3 model card documentation by:
The changes make the documentation more accessible and provide ready-to-use examples for different use cases, following the standardized format used in other model cards like Gemma 3, PaliGemma, and ViT.
Fixes #36979
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Since this is a documentation update for a vision-language model, I would suggest tagging: