
docs: Update LayoutLMv3 model card with standardized format and impro… #37155

Open
wants to merge 8 commits into main

Conversation

carrycooldude

Update LayoutLMv3 Model Card Documentation

This PR updates the LayoutLMv3 model card documentation to follow the standardized format as requested in #36979. The changes improve the documentation's clarity and usability while maintaining consistency with other model cards in the repository.

What does this PR do?

This PR enhances the LayoutLMv3 model card documentation by:

  • Adding badges for framework support (PyTorch, TensorFlow, Flax) and optimizations (Flash Attention, SDPA)
  • Reorganizing code examples into clear sections:
    • Quick Start (basic usage)
    • Pipeline API examples
    • AutoModel examples
    • transformers-cli examples
  • Adding quantization examples for large models (8-bit and 4-bit)
  • Adding attention visualization examples using AttentionMaskVisualizer
  • Maintaining existing functionality while improving documentation structure

The changes make the documentation more accessible and provide ready-to-use examples for different use cases, following the standardized format used in other model cards like Gemma 3, PaliGemma, and ViT.
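For illustration, an AutoModel-style usage section along these lines might look roughly like the sketch below (the words and bounding boxes are toy placeholders standing in for real OCR output, and the merged card's example may differ):

```python
# Minimal sketch of AutoModel usage for LayoutLMv3 (illustrative inputs, not the exact snippet from this PR).
from PIL import Image
from transformers import AutoProcessor, AutoModelForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModelForTokenClassification.from_pretrained("microsoft/layoutlmv3-base")

image = Image.open("form.jpg").convert("RGB")
words = ["Invoice", "Number:", "12345"]                              # toy OCR words
boxes = [[10, 10, 80, 30], [90, 10, 160, 30], [170, 10, 230, 30]]    # toy boxes on the 0-1000 scale LayoutLMv3 expects

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
```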

Fixes #36979

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Since this is a documentation update for a vision-language model, I would suggest tagging:

@github-actions github-actions bot marked this pull request as draft March 31, 2025 18:33

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@carrycooldude carrycooldude marked this pull request as ready for review March 31, 2025 18:39
@github-actions github-actions bot requested a review from stevhliu March 31, 2025 18:39

@stevhliu stevhliu left a comment


Thanks, this is a good start! Please refer to the Gemma 3 docs to see how to standardize this doc 🤗

@@ -14,24 +14,150 @@ rendered properly in your Markdown viewer.

-->

[![PyTorch](https://img.shields.io/badge/PyTorch-1.12+-blue.svg)](https://pytorch.org/get-started/locally/)

Please style these with the <div> tags. You can copy it from one of the existing updated model cards on main like Gemma 3.

# LayoutLMv3

## Overview
LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.

Suggested change
LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.
[LayoutLMv3](https://huggingface.co/papers/2204.08387) is a multimodal transformer model designed specifically for Document AI tasks. It unites the pretraining objective for text and images, masked language and masked image modeling, and also includes a word-patch alignment objective for even stronger text and image alignment. The model architecture is also unified and uses a more streamlined approach with patch embeddings (similar to [ViT](./vit)) instead of a CNN backbone.


Not fully resolved yet, missing link to the model

Comment on lines 37 to 39
<Tip>
Click on the right sidebar for more examples of how to use the model for different tasks!
</Tip>

Suggested change
<Tip>
Click on the right sidebar for more examples of how to use the model for different tasks!
</Tip>
> [!TIP]
> Click on the LayoutLMv3 models in the right sidebar for more examples of how to apply LayoutLMv3 to different vision and language tasks.


Not resolved yet either

outputs = model(**encoding)
```

## Using transformers-cli

We can remove this since transformers-cli doesn't support image inputs


Unresolved

Comment on lines 110 to 112
## Quantization

For large models, you can use quantization to reduce memory usage:

Update the code example below accordingly

Suggested change
## Quantization
For large models, you can use quantization to reduce memory usage:
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview) overview for more available quantization backends.
The example below uses [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) to only quantize the weights to int4.
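A minimal sketch of such a torchao example, assuming int4 weight-only quantization applies cleanly to LayoutLMv3 and using only API that the linked torchao docs describe, might be:

```python
# Sketch: int4 weight-only quantization via torchao (requires the torchao package; illustrative only).
import torch
from transformers import AutoModelForTokenClassification, TorchAoConfig

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
```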


Not resolved yet. You only need to show quantization for either 8 or 4-bits instead of both. Also the code for quantizing the model is incorrect.

@@ -74,81 +200,66 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
**Document question answering**
- [Document question answering task guide](../tasks/document_question_answering)

## LayoutLMv3Config
## Model Details

The rest of these changes should be reverted


Not resolved yet, the ## Model Details is still there as are the changes to the header levels of the LayoutLMv3 classes

@carrycooldude carrycooldude force-pushed the feature/update-layoutlmv3-doc branch from 819c757 to 5b92ea6 Compare April 3, 2025 07:47
@carrycooldude carrycooldude force-pushed the feature/update-layoutlmv3-doc branch from 9368ed6 to b0aeeec Compare April 3, 2025 08:23
@carrycooldude carrycooldude force-pushed the feature/update-layoutlmv3-doc branch from b15eb3d to 294e6e9 Compare April 3, 2025 08:31
@carrycooldude

@stevhliu, have a look at this

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stevhliu

stevhliu commented Apr 3, 2025

Hey, it is still a bit off! For example:

  • the badges look different and should be aligned on the right
  • the code examples should be inside <hfoption> blocks so users can easily toggle between Pipeline and AutoModel
  • the Resources section should be removed

I suggest taking a look at the Gemma 3 model card again and trying to align your model card with it as much as possible!

@carrycooldude carrycooldude force-pushed the feature/update-layoutlmv3-doc branch from 7836f29 to 61f22d5 Compare April 3, 2025 18:41

@stevhliu stevhliu left a comment


There are a lot of unresolved changes, so please don't mark them as resolved 😅

# LayoutLMv3

## Overview
LayoutLMv3 is a powerful multimodal transformer model designed specifically for Document AI tasks. What makes it unique is its unified approach to handling both text and images in documents, using a simple yet effective architecture that combines patch embeddings with transformer layers. Unlike its predecessor LayoutLMv2, it uses a more streamlined approach with patch embeddings (similar to ViT) instead of a CNN backbone.

Not fully resolved yet, missing link to the model


*Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*
[Paper](https://arxiv.org/abs/2204.08387) | [Official Checkpoints](https://huggingface.co/microsoft/layoutlmv3-base)

Suggested change
[Paper](https://arxiv.org/abs/2204.08387) | [Official Checkpoints](https://huggingface.co/microsoft/layoutlmv3-base)
You can find all the original LayoutLMv3 checkpoints under the [LayoutLM](https://huggingface.co/collections/microsoft/layoutlm-6564539601de72cb631d0902) collection.

Comment on lines 37 to 39
<Tip>
Click on the right sidebar for more examples of how to use the model for different tasks!
</Tip>

Not resolved yet either

@@ -74,81 +200,66 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
**Document question answering**
- [Document question answering task guide](../tasks/document_question_answering)

## LayoutLMv3Config
## Model Details

Not resolved yet, the ## Model Details is still there as are the changes to the header levels of the LayoutLMv3 classes

Comment on lines 110 to 112
## Quantization

For large models, you can use quantization to reduce memory usage:

Not resolved yet. You only need to show quantization for either 8 or 4-bits instead of both. Also the code for quantizing the model is incorrect.

outputs = model(**encoding)
```

## Using transformers-cli

Unresolved

result = token_classifier("form.jpg")

# For question answering
qa = pipeline("document-question-answering", model="microsoft/layoutlmv3-base")

Unresolved

Comment on lines 63 to 65
## Using the Pipeline

The easiest way to use LayoutLMv3 is through the pipeline API:

Unresolved as there are still other examples here besides question answering
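For reference, a question-answering-only Pipeline example along the lines the reviewer asks for might look roughly like this sketch (the checkpoint is illustrative, the base model is not fine-tuned for document QA, and the pipeline needs an OCR backend such as pytesseract when word boxes are not supplied):

```python
# Sketch of a document question answering pipeline (checkpoint and inputs are illustrative).
from transformers import pipeline

qa = pipeline("document-question-answering", model="microsoft/layoutlmv3-base")
result = qa(image="invoice.png", question="What is the invoice number?")
print(result)
```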


## Quick Start

Here's a quick example of how to use LayoutLMv3 for document understanding:

Unresolved

@stevhliu

stevhliu commented Apr 8, 2025

We'll need to update the badges to include FlashAttention and the code examples to include SDPA once #35469 is merged!
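Once that PR lands, the relevant code example might select the attention implementation explicitly, roughly as in this sketch (valid only if SDPA support for LayoutLMv3 is actually merged):

```python
# Sketch: opting into the SDPA attention implementation (assumes LayoutLMv3 gains SDPA support).
import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```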


Successfully merging this pull request may close these issues.

[Community contributions] Model cards
3 participants