From 19a8a8062cfdffd2f113d4779792f9fb852ab4ff Mon Sep 17 00:00:00 2001
From: "ramanlog.logesh@gmail.com"
<74873758+Logeswaran7@users.noreply.github.com>
Date: Fri, 4 Apr 2025 21:55:27 +0530
Subject: [PATCH 01/11] Updated documentation for Donut model
---
docs/source/en/model_doc/donut.md | 238 +++++++++++-------------------
1 file changed, 87 insertions(+), 151 deletions(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index 6e5cfe648d09..508d95f6ac39 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -13,180 +13,116 @@ rendered properly in your Markdown viewer.
specific language governing permissions and limitations under the License. -->
-# Donut
+
+
+

+
+
-## Overview
+# Donut
-The Donut model was proposed in [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by
-Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park.
-Donut consists of an image Transformer encoder and an autoregressive text Transformer decoder to perform document understanding
-tasks such as document image classification, form understanding and visual question answering.
+The [Donut (Document Understanding Transformer)](https://arxiv.org/abs/2111.15664) is a cutting-edge model designed for visual document understanding without relying on Optical Character Recognition (OCR). Unlike traditional approaches that extract text using OCR before processing, Donut employs an end-to-end Transformer-based architecture to directly analyze document images. This eliminates OCR-related inefficiencies making it more accurate and adaptable to diverse languages and formats.
-The abstract from the paper is the following:
+Donut features a multimodal sequence-to-sequence design, combining a Swin Transformer as its vision encoder and BART as its text decoder. The vision encoder converts document images into embeddings, which the decoder then processes into meaningful text sequences. This architecture enhances efficiency in tasks such as document classification, form understanding, and visual question answering.
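Donut checkpoints load as regular [`VisionEncoderDecoderModel`] instances, so this encoder/decoder split can be inspected directly. The snippet below is a minimal illustration and assumes the base `naver-clova-ix/donut-base` checkpoint.

```py
# Sketch: a Donut checkpoint pairs a Swin-style vision encoder with a BART-style text decoder.
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

print(type(model.encoder).__name__)  # expected: DonutSwinModel
print(type(model.decoder).__name__)  # expected: MBartForCausalLM
```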
-*Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performances on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains.*
+You can find all the original DONUT checkpoints under the [DONUT](https://huggingface.co/models?other=donut) collection in Modelhub.
-
+> [!TIP]
+> Click on the DONUT models in the right sidebar for more examples of how to apply DONUT to different language and vision tasks.
- Donut high-level overview. Taken from the original paper.
+The examples below demonstrate how to perform document understanding tasks using Donut with [`Pipeline`] and [`AutoModel`].
-This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
-[here](https://github.com/clovaai/donut).
+<hfoptions id="usage">
+<hfoption id="Pipeline">
-## Usage tips
+```py
+import torch
+from transformers import pipeline
+from PIL import Image
+
+pipeline = pipeline(
+ task="document-question-answering",
+ model="naver-clova-ix/donut-base-finetuned-docvqa",
+ device=0,
+ torch_dtype=torch.float16
+)
+
+pipeline(
+ image=Image.open("path/to/document.png"),
+ question="What is the purchase amount?"
+)
+```
-- The quickest way to get started with Donut is by checking the [tutorial
- notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut), which show how to use the model
- at inference time as well as fine-tuning on custom data.
-- Donut is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework.
+</hfoption>
+<hfoption id="AutoModel">
-## Inference examples
+```py
+import torch
+from transformers import DonutProcessor, AutoModelForVision2Seq
+from PIL import Image
-Donut's [`VisionEncoderDecoder`] model accepts images as input and makes use of
-[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
+processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
+model = AutoModelForVision2Seq.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
-The [`DonutImageProcessor`] class is responsible for preprocessing the input image and
-[`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`] decodes the generated target tokens to the target string. The
-[`DonutProcessor`] wraps [`DonutImageProcessor`] and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`]
-into a single instance to both extract the input features and decode the predicted token ids.
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
-- Step-by-step Document Image Classification
+image = Image.open("path/to/document.png")
+question = "What is the purchase amount?"
-```py
->>> import re
-
->>> from transformers import DonutProcessor, VisionEncoderDecoderModel
->>> from datasets import load_dataset
->>> import torch
-
->>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
->>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
-
->>> device = "cuda" if torch.cuda.is_available() else "cpu"
->>> model.to(device) # doctest: +IGNORE_RESULT
-
->>> # load document image
->>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
->>> image = dataset[1]["image"]
-
->>> # prepare decoder inputs
->>> task_prompt = ""
->>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
-
->>> pixel_values = processor(image, return_tensors="pt").pixel_values
-
->>> outputs = model.generate(
-... pixel_values.to(device),
-... decoder_input_ids=decoder_input_ids.to(device),
-... max_length=model.decoder.config.max_position_embeddings,
-... pad_token_id=processor.tokenizer.pad_token_id,
-... eos_token_id=processor.tokenizer.eos_token_id,
-... use_cache=True,
-... bad_words_ids=[[processor.tokenizer.unk_token_id]],
-... return_dict_in_generate=True,
-... )
-
->>> sequence = processor.batch_decode(outputs.sequences)[0]
->>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
->>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip() # remove first task start token
->>> print(processor.token2json(sequence))
-{'class': 'advertisement'}
-```
+task_prompt = f"{question}"
+encoding = processor(image, task_prompt, return_tensors="pt").to(device)
-- Step-by-step Document Parsing
+outputs = model.generate(
+ input_ids=encoding.input_ids,
+ pixel_values=encoding.pixel_values,
+ max_length=512
+)
-```py
->>> import re
-
->>> from transformers import DonutProcessor, VisionEncoderDecoderModel
->>> from datasets import load_dataset
->>> import torch
-
->>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
->>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
-
->>> device = "cuda" if torch.cuda.is_available() else "cpu"
->>> model.to(device) # doctest: +IGNORE_RESULT
-
->>> # load document image
->>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
->>> image = dataset[2]["image"]
-
->>> # prepare decoder inputs
->>> task_prompt = "<s_cord-v2>"
->>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
-
->>> pixel_values = processor(image, return_tensors="pt").pixel_values
-
->>> outputs = model.generate(
-... pixel_values.to(device),
-... decoder_input_ids=decoder_input_ids.to(device),
-... max_length=model.decoder.config.max_position_embeddings,
-... pad_token_id=processor.tokenizer.pad_token_id,
-... eos_token_id=processor.tokenizer.eos_token_id,
-... use_cache=True,
-... bad_words_ids=[[processor.tokenizer.unk_token_id]],
-... return_dict_in_generate=True,
-... )
-
->>> sequence = processor.batch_decode(outputs.sequences)[0]
->>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
->>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip() # remove first task start token
->>> print(processor.token2json(sequence))
-{'menu': {'nm': 'CINNAMON SUGAR', 'unitprice': '17,000', 'cnt': '1 x', 'price': '17,000'}, 'sub_total': {'subtotal_price': '17,000'}, 'total': {'total_price': '17,000', 'cashprice': '20,000', 'changeprice': '3,000'}}
+answer = processor.decode(outputs[0], skip_special_tokens=True)
+print(answer)
```
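The [`DonutProcessor`] used above bundles an image processor (which produces `pixel_values`) and an XLM-RoBERTa tokenizer (which encodes prompts and decodes generated ids). As a quick illustration, assuming the DocVQA checkpoint, you can inspect the two components it wraps:

```py
# Illustration: DonutProcessor wraps an image processor and a tokenizer in a single object.
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

print(type(processor.image_processor).__name__)  # expected: DonutImageProcessor
print(type(processor.tokenizer).__name__)        # expected: XLMRobertaTokenizerFast (or XLMRobertaTokenizer)
```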
-- Step-by-step Document Visual Question Answering (DocVQA)
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) to only quantize the weights to int4.
```py
->>> import re
-
->>> from transformers import DonutProcessor, VisionEncoderDecoderModel
->>> from datasets import load_dataset
->>> import torch
-
->>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
->>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
-
->>> device = "cuda" if torch.cuda.is_available() else "cpu"
->>> model.to(device) # doctest: +IGNORE_RESULT
-
->>> # load document image from the DocVQA dataset
->>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
->>> image = dataset[0]["image"]
-
->>> # prepare decoder inputs
->>> task_prompt = "{user_input}"
->>> question = "When is the coffee break?"
->>> prompt = task_prompt.replace("{user_input}", question)
->>> decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids
-
->>> pixel_values = processor(image, return_tensors="pt").pixel_values
-
->>> outputs = model.generate(
-... pixel_values.to(device),
-... decoder_input_ids=decoder_input_ids.to(device),
-... max_length=model.decoder.config.max_position_embeddings,
-... pad_token_id=processor.tokenizer.pad_token_id,
-... eos_token_id=processor.tokenizer.eos_token_id,
-... use_cache=True,
-... bad_words_ids=[[processor.tokenizer.unk_token_id]],
-... return_dict_in_generate=True,
-... )
-
->>> sequence = processor.batch_decode(outputs.sequences)[0]
->>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
->>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip() # remove first task start token
->>> print(processor.token2json(sequence))
-{'question': 'When is the coffee break?', 'answer': '11-14 to 11:39 a.m.'}
+#pip install torchao
+import torch
+from transformers import DonutProcessor, VisionEncoderDecoderModel, TorchAoConfig
+from PIL import Image
+
+model_name = "naver-clova-ix/donut-base-finetuned-docvqa"
+processor = DonutProcessor.from_pretrained(model_name)
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+model = VisionEncoderDecoderModel.from_pretrained(
+ model_name,
+ torch_dtype=torch.bfloat16,
+ device_map="auto",
+ quantization_config=quantization_config
+)
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
+
+image = Image.open("path/to/document.png")
+question = "What is the purchase amount?"
+
+task_prompt = f"{question}"
+inputs = processor(image, task_prompt, return_tensors="pt").to(device)
+
+outputs = model.generate(**inputs, max_length=512)
+answer = processor.decode(outputs[0], skip_special_tokens=True)
+print(answer)
```
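To gauge how much the int4 weight quantization saves, you can compare memory footprints. A minimal check, reusing the quantized `model` from the block above:

```py
# Rough size check of the quantized model (reuses `model` from the example above).
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")
```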
-See the [model hub](https://huggingface.co/models?filter=donut) to look for Donut checkpoints.
+## Notes
-## Training
-
-We refer to the [tutorial notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut).
+- The quickest way to get started with Donut is by checking the [tutorial notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut), which show how to use the model at inference time as well as fine-tuning on custom data.
+- Donut is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework.
+- This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
+[here](https://github.com/clovaai/donut).
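When fine-tuning Donut on a custom task, as the tutorial notebooks do, the usual first step is to register a task-start token that serves as the decoder prompt. The sketch below shows that setup with a hypothetical `<s_my-task>` token; it is an illustration of the general recipe rather than the notebooks' exact code.

```py
# Sketch of the task-token setup for fine-tuning; "<s_my-task>" is a hypothetical task name.
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Register the new task-start token and resize the decoder embeddings to match.
processor.tokenizer.add_tokens(["<s_my-task>"])
model.decoder.resize_token_embeddings(len(processor.tokenizer))

# Generation starts from the task token, mirroring the task prompts in the examples above.
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids("<s_my-task>")
```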
## DonutSwinConfig
From 009a72bf6f1138bd4e3fde1839cb994bdb49c5f1 Mon Sep 17 00:00:00 2001
From: logesh R <74873758+Logeswaran7@users.noreply.github.com>
Date: Sat, 5 Apr 2025 11:16:48 +0530
Subject: [PATCH 02/11] Update docs/source/en/model_doc/donut.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/donut.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index 508d95f6ac39..1cdd3152ad01 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -21,7 +21,7 @@ specific language governing permissions and limitations under the License. -->
# Donut
-The [Donut (Document Understanding Transformer)](https://arxiv.org/abs/2111.15664) is a cutting-edge model designed for visual document understanding without relying on Optical Character Recognition (OCR). Unlike traditional approaches that extract text using OCR before processing, Donut employs an end-to-end Transformer-based architecture to directly analyze document images. This eliminates OCR-related inefficiencies making it more accurate and adaptable to diverse languages and formats.
+[Donut (Document Understanding Transformer)](https://huggingface.co/papers/2111.15664) is a visual document understanding model that doesn't require an Optical Character Recognition (OCR) engine. Unlike traditional approaches that extract text using OCR before processing, Donut employs an end-to-end Transformer-based architecture to directly analyze document images. This eliminates OCR-related inefficiencies, making it more accurate and adaptable to diverse languages and formats.
Donut features a multimodal sequence-to-sequence design, combining a Swin Transformer as its vision encoder and BART as its text decoder. The vision encoder converts document images into embeddings, which the decoder then processes into meaningful text sequences. This architecture enhances efficiency in tasks such as document classification, form understanding, and visual question answering.
From d09aeb95b012f5c3c13e84dc41562123a8f61c4e Mon Sep 17 00:00:00 2001
From: logesh R <74873758+Logeswaran7@users.noreply.github.com>
Date: Sat, 5 Apr 2025 11:28:53 +0530
Subject: [PATCH 03/11] Update docs/source/en/model_doc/donut.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/donut.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index 1cdd3152ad01..adf3d2741e09 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -23,7 +23,7 @@ specific language governing permissions and limitations under the License. -->
 [Donut (Document Understanding Transformer)](https://huggingface.co/papers/2111.15664) is a visual document understanding model that doesn't require an Optical Character Recognition (OCR) engine. Unlike traditional approaches that extract text using OCR before processing, Donut employs an end-to-end Transformer-based architecture to directly analyze document images. This eliminates OCR-related inefficiencies, making it more accurate and adaptable to diverse languages and formats.
-Donut features a multimodal sequence-to-sequence design, combining a Swin Transformer as its vision encoder and BART as its text decoder. The vision encoder converts document images into embeddings, which the decoder then processes into meaningful text sequences. This architecture enhances efficiency in tasks such as document classification, form understanding, and visual question answering.
+Donut features a vision encoder ([Swin](./swin)) and a text decoder ([BART](./bart)). Swin converts document images into embeddings and BART processes them into meaningful text sequences.
You can find all the original DONUT checkpoints under the [DONUT](https://huggingface.co/models?other=donut) collection in Modelhub.
From 6b95685332a675576e740b0e863aa1b2c0808e66 Mon Sep 17 00:00:00 2001
From: logesh R <74873758+Logeswaran7@users.noreply.github.com>
Date: Sat, 5 Apr 2025 11:29:32 +0530
Subject: [PATCH 04/11] Update docs/source/en/model_doc/donut.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/donut.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index adf3d2741e09..2c04d9e607a5 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -25,7 +25,7 @@ specific language governing permissions and limitations under the License. -->
 Donut features a vision encoder ([Swin](./swin)) and a text decoder ([BART](./bart)). Swin converts document images into embeddings and BART processes them into meaningful text sequences.
-You can find all the original DONUT checkpoints under the [DONUT](https://huggingface.co/models?other=donut) collection in Modelhub.
+You can find all the original Donut checkpoints under the [Naver Clova Information Extraction](https://huggingface.co/naver-clova-ix) organization.
> [!TIP]
> Click on the DONUT models in the right sidebar for more examples of how to apply DONUT to different language and vision tasks.
From 241b88e37583c4bbda2774d93831e843da933f89 Mon Sep 17 00:00:00 2001
From: logesh R <74873758+Logeswaran7@users.noreply.github.com>
Date: Sat, 5 Apr 2025 11:29:58 +0530
Subject: [PATCH 05/11] Update docs/source/en/model_doc/donut.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/donut.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index 2c04d9e607a5..5a1372733029 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -28,7 +28,7 @@ Donut features a vision encoder ([Swin](./swin)) and a text decoder ([BART](./bart
You can find all the original Donut checkpoints under the [Naver Clova Information Extraction](https://huggingface.co/naver-clova-ix) organization.
> [!TIP]
-> Click on the DONUT models in the right sidebar for more examples of how to apply DONUT to different language and vision tasks.
+> Click on the Donut models in the right sidebar for more examples of how to apply Donut to different language and vision tasks.
 The examples below demonstrate how to perform document understanding tasks using Donut with [`Pipeline`] and [`AutoModel`].
From 530159ba5e9acca695431261418ebb559e00c31f Mon Sep 17 00:00:00 2001
From: "ramanlog.logesh@gmail.com"
<74873758+Logeswaran7@users.noreply.github.com>
Date: Sat, 5 Apr 2025 11:52:23 +0530
Subject: [PATCH 06/11] Updated code suggestions
---
docs/source/en/model_doc/donut.md | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index 5a1372733029..ce03d0a7a6bf 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -36,6 +36,7 @@ The examples below demonstrate how to perform document understanding tasks using
```py
+# pip install datasets
+from datasets import load_dataset
import torch
from transformers import pipeline
from PIL import Image
@@ -46,39 +47,35 @@ pipeline = pipeline(
device=0,
torch_dtype=torch.float16
)
+dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+image = dataset[0]["image"]
-pipeline(
- image=Image.open("path/to/document.png"),
- question="What is the purchase amount?"
-)
+pipeline(image=image, question="What time is the coffee break?")
```
```py
+# pip install datasets
import torch
-from transformers import DonutProcessor, AutoModelForVision2Seq
-from PIL import Image
+from datasets import load_dataset
+from transformers import AutoProcessor, AutoModelForVision2Seq
-processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
+processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
model = AutoModelForVision2Seq.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-model.to(device)
-
-image = Image.open("path/to/document.png")
-question = "What is the purchase amount?"
-
+dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+image = dataset[0]["image"]
+question = "What time is the coffee break?"
task_prompt = f"{question}"
-encoding = processor(image, task_prompt, return_tensors="pt").to(device)
+inputs = processor(image, task_prompt, return_tensors="pt")
outputs = model.generate(
- input_ids=encoding.input_ids,
- pixel_values=encoding.pixel_values,
+ input_ids=inputs.input_ids,
+ pixel_values=inputs.pixel_values,
max_length=512
)
-
answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
```
From 34afb9fd7a11979a874b928fe971de0a8786c50c Mon Sep 17 00:00:00 2001
From: logesh R <74873758+Logeswaran7@users.noreply.github.com>
Date: Sat, 5 Apr 2025 11:56:52 +0530
Subject: [PATCH 07/11] Update docs/source/en/model_doc/donut.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/donut.md | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index ce03d0a7a6bf..0c66ee6abe31 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -80,9 +80,9 @@ answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
```
-Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](https://huggingface.co/docs/transformers/main/en/quantization/overview) overview for more available quantization backends.
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
-The example below uses [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) to only quantize the weights to int4.
+The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
```py
#pip install torchao
From 5c09518231e59a8c58e240184bbdde155e3dacdf Mon Sep 17 00:00:00 2001
From: "ramanlog.logesh@gmail.com"
<74873758+Logeswaran7@users.noreply.github.com>
Date: Sat, 5 Apr 2025 12:04:55 +0530
Subject: [PATCH 08/11] Updated code suggestion to Align with the AutoModel
example
---
docs/source/en/model_doc/donut.md | 33 +++++++++++++------------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index 0c66ee6abe31..ad87f2292129 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -85,31 +85,26 @@ Quantization reduces the memory burden of large models by representing the weigh
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
```py
-#pip install torchao
+# pip install datasets torchao
import torch
-from transformers import DonutProcessor, VisionEncoderDecoderModel, TorchAoConfig
-from PIL import Image
+from datasets import load_dataset
+from transformers import TorchAoConfig, AutoProcessor, AutoModelForVision2Seq
-model_name = "naver-clova-ix/donut-base-finetuned-docvqa"
-processor = DonutProcessor.from_pretrained(model_name)
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
-model = VisionEncoderDecoderModel.from_pretrained(
- model_name,
- torch_dtype=torch.bfloat16,
- device_map="auto",
- quantization_config=quantization_config
-)
-
-device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-model.to(device)
-
-image = Image.open("path/to/document.png")
-question = "What is the purchase amount?"
+processor = AutoProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
+model = AutoModelForVision2Seq.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa", quantization_config=quantization_config)
+dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+image = dataset[0]["image"]
+question = "What time is the coffee break?"
task_prompt = f"{question}"
-inputs = processor(image, task_prompt, return_tensors="pt").to(device)
+inputs = processor(image, task_prompt, return_tensors="pt")
-outputs = model.generate(**inputs, max_length=512)
+outputs = model.generate(
+ input_ids=inputs.input_ids,
+ pixel_values=inputs.pixel_values,
+ max_length=512
+)
answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
```
From 9f94ebfa1ae86780ae5218d641b82e1603b0d8df Mon Sep 17 00:00:00 2001
From: logesh R <74873758+Logeswaran7@users.noreply.github.com>
Date: Sat, 5 Apr 2025 12:09:55 +0530
Subject: [PATCH 09/11] Update docs/source/en/model_doc/donut.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
docs/source/en/model_doc/donut.md | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index ad87f2292129..1dd511ffb0b1 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -111,10 +111,10 @@ print(answer)
## Notes
-- The quickest way to get started with Donut is by checking the [tutorial notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut), which show how to use the model at inference time as well as fine-tuning on custom data.
-- Donut is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework.
-- This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found
-[here](https://github.com/clovaai/donut).
+- Use Donut for document image classification as shown below.
+
+- Use Donut for document parsing as shown below.
+
## DonutSwinConfig
From 51dda173eace4ba0fe4e58d03d9d6cbbbe7e1c1b Mon Sep 17 00:00:00 2001
From: "ramanlog.logesh@gmail.com"
<74873758+Logeswaran7@users.noreply.github.com>
Date: Sat, 5 Apr 2025 12:16:42 +0530
Subject: [PATCH 10/11] Updated notes section included code examples
---
docs/source/en/model_doc/donut.md | 85 ++++++++++++++++++++++++++++++-
1 file changed, 83 insertions(+), 2 deletions(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index 1dd511ffb0b1..515438c5a6a9 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -112,9 +112,90 @@ print(answer)
## Notes
- Use Donut for document image classification as shown below.
-
+
+```py
+>>> import re
+
+>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
+>>> from datasets import load_dataset
+>>> import torch
+
+>>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
+>>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
+
+>>> device = "cuda" if torch.cuda.is_available() else "cpu"
+>>> model.to(device) # doctest: +IGNORE_RESULT
+
+>>> # load document image
+>>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+>>> image = dataset[1]["image"]
+
+>>> # prepare decoder inputs
+>>> task_prompt = ""
+>>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
+
+>>> pixel_values = processor(image, return_tensors="pt").pixel_values
+
+>>> outputs = model.generate(
+... pixel_values.to(device),
+... decoder_input_ids=decoder_input_ids.to(device),
+... max_length=model.decoder.config.max_position_embeddings,
+... pad_token_id=processor.tokenizer.pad_token_id,
+... eos_token_id=processor.tokenizer.eos_token_id,
+... use_cache=True,
+... bad_words_ids=[[processor.tokenizer.unk_token_id]],
+... return_dict_in_generate=True,
+... )
+
+>>> sequence = processor.batch_decode(outputs.sequences)[0]
+>>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
+>>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip() # remove first task start token
+>>> print(processor.token2json(sequence))
+{'class': 'advertisement'}
+```
+
- Use Donut for document parsing as shown below.
-
+
+```py
+>>> import re
+
+>>> from transformers import DonutProcessor, VisionEncoderDecoderModel
+>>> from datasets import load_dataset
+>>> import torch
+
+>>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
+>>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
+
+>>> device = "cuda" if torch.cuda.is_available() else "cpu"
+>>> model.to(device) # doctest: +IGNORE_RESULT
+
+>>> # load document image
+>>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+>>> image = dataset[2]["image"]
+
+>>> # prepare decoder inputs
+>>> task_prompt = ""
+>>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
+
+>>> pixel_values = processor(image, return_tensors="pt").pixel_values
+
+>>> outputs = model.generate(
+... pixel_values.to(device),
+... decoder_input_ids=decoder_input_ids.to(device),
+... max_length=model.decoder.config.max_position_embeddings,
+... pad_token_id=processor.tokenizer.pad_token_id,
+... eos_token_id=processor.tokenizer.eos_token_id,
+... use_cache=True,
+... bad_words_ids=[[processor.tokenizer.unk_token_id]],
+... return_dict_in_generate=True,
+... )
+
+>>> sequence = processor.batch_decode(outputs.sequences)[0]
+>>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
+>>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip() # remove first task start token
+>>> print(processor.token2json(sequence))
+{'menu': {'nm': 'CINNAMON SUGAR', 'unitprice': '17,000', 'cnt': '1 x', 'price': '17,000'}, 'sub_total': {'subtotal_price': '17,000'}, 'total': {'total_price': '17,000', 'cashprice': '20,000', 'changeprice': '3,000'}}
+```
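`token2json` converts Donut's XML-like output tokens into nested JSON. The toy example below, reusing the `processor` from the block above with a made-up sequence, shows the idea:

```py
# Toy illustration: <s_field>...</s_field> tags become nested dictionary keys.
sequence = "<s_menu><s_nm>CINNAMON SUGAR</s_nm><s_price>17,000</s_price></s_menu>"
print(processor.token2json(sequence))
# {'menu': {'nm': 'CINNAMON SUGAR', 'price': '17,000'}}
```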
## DonutSwinConfig
From 2524c160de88285df8d767c8e5414576816a012d Mon Sep 17 00:00:00 2001
From: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Date: Mon, 7 Apr 2025 11:07:36 -0700
Subject: [PATCH 11/11] close hfoption block and indent
---
docs/source/en/model_doc/donut.md | 162 +++++++++++++++---------------
1 file changed, 82 insertions(+), 80 deletions(-)
diff --git a/docs/source/en/model_doc/donut.md b/docs/source/en/model_doc/donut.md
index 515438c5a6a9..1bc1a3bcfd0b 100644
--- a/docs/source/en/model_doc/donut.md
+++ b/docs/source/en/model_doc/donut.md
@@ -80,6 +80,9 @@ answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
```
+</hfoption>
+</hfoptions>
+
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
@@ -113,89 +116,88 @@ print(answer)
- Use Donut for document image classification as shown below.
-```py
->>> import re
-
->>> from transformers import DonutProcessor, VisionEncoderDecoderModel
->>> from datasets import load_dataset
->>> import torch
-
->>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
->>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
-
->>> device = "cuda" if torch.cuda.is_available() else "cpu"
->>> model.to(device) # doctest: +IGNORE_RESULT
-
->>> # load document image
->>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
->>> image = dataset[1]["image"]
-
->>> # prepare decoder inputs
->>> task_prompt = ""
->>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
-
->>> pixel_values = processor(image, return_tensors="pt").pixel_values
-
->>> outputs = model.generate(
-... pixel_values.to(device),
-... decoder_input_ids=decoder_input_ids.to(device),
-... max_length=model.decoder.config.max_position_embeddings,
-... pad_token_id=processor.tokenizer.pad_token_id,
-... eos_token_id=processor.tokenizer.eos_token_id,
-... use_cache=True,
-... bad_words_ids=[[processor.tokenizer.unk_token_id]],
-... return_dict_in_generate=True,
-... )
-
->>> sequence = processor.batch_decode(outputs.sequences)[0]
->>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
->>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip() # remove first task start token
->>> print(processor.token2json(sequence))
-{'class': 'advertisement'}
-```
+ ```py
+ >>> import re
+ >>> from transformers import DonutProcessor, VisionEncoderDecoderModel
+ >>> from datasets import load_dataset
+ >>> import torch
+
+ >>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
+ >>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-rvlcdip")
+
+ >>> device = "cuda" if torch.cuda.is_available() else "cpu"
+ >>> model.to(device) # doctest: +IGNORE_RESULT
+
+ >>> # load document image
+ >>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+ >>> image = dataset[1]["image"]
+
+ >>> # prepare decoder inputs
+ >>> task_prompt = ""
+ >>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
+
+ >>> pixel_values = processor(image, return_tensors="pt").pixel_values
+
+ >>> outputs = model.generate(
+ ... pixel_values.to(device),
+ ... decoder_input_ids=decoder_input_ids.to(device),
+ ... max_length=model.decoder.config.max_position_embeddings,
+ ... pad_token_id=processor.tokenizer.pad_token_id,
+ ... eos_token_id=processor.tokenizer.eos_token_id,
+ ... use_cache=True,
+ ... bad_words_ids=[[processor.tokenizer.unk_token_id]],
+ ... return_dict_in_generate=True,
+ ... )
+
+ >>> sequence = processor.batch_decode(outputs.sequences)[0]
+ >>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
+ >>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip() # remove first task start token
+ >>> print(processor.token2json(sequence))
+ {'class': 'advertisement'}
+ ```
- Use Donut for document parsing as shown below.
-```py
->>> import re
-
->>> from transformers import DonutProcessor, VisionEncoderDecoderModel
->>> from datasets import load_dataset
->>> import torch
-
->>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
->>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
-
->>> device = "cuda" if torch.cuda.is_available() else "cpu"
->>> model.to(device) # doctest: +IGNORE_RESULT
-
->>> # load document image
->>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
->>> image = dataset[2]["image"]
-
->>> # prepare decoder inputs
->>> task_prompt = ""
->>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
-
->>> pixel_values = processor(image, return_tensors="pt").pixel_values
-
->>> outputs = model.generate(
-... pixel_values.to(device),
-... decoder_input_ids=decoder_input_ids.to(device),
-... max_length=model.decoder.config.max_position_embeddings,
-... pad_token_id=processor.tokenizer.pad_token_id,
-... eos_token_id=processor.tokenizer.eos_token_id,
-... use_cache=True,
-... bad_words_ids=[[processor.tokenizer.unk_token_id]],
-... return_dict_in_generate=True,
-... )
-
->>> sequence = processor.batch_decode(outputs.sequences)[0]
->>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
->>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip() # remove first task start token
->>> print(processor.token2json(sequence))
-{'menu': {'nm': 'CINNAMON SUGAR', 'unitprice': '17,000', 'cnt': '1 x', 'price': '17,000'}, 'sub_total': {'subtotal_price': '17,000'}, 'total': {'total_price': '17,000', 'cashprice': '20,000', 'changeprice': '3,000'}}
-```
+ ```py
+ >>> import re
+ >>> from transformers import DonutProcessor, VisionEncoderDecoderModel
+ >>> from datasets import load_dataset
+ >>> import torch
+
+ >>> processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
+ >>> model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
+
+ >>> device = "cuda" if torch.cuda.is_available() else "cpu"
+ >>> model.to(device) # doctest: +IGNORE_RESULT
+
+ >>> # load document image
+ >>> dataset = load_dataset("hf-internal-testing/example-documents", split="test")
+ >>> image = dataset[2]["image"]
+
+ >>> # prepare decoder inputs
+ >>> task_prompt = ""
+ >>> decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
+
+ >>> pixel_values = processor(image, return_tensors="pt").pixel_values
+
+ >>> outputs = model.generate(
+ ... pixel_values.to(device),
+ ... decoder_input_ids=decoder_input_ids.to(device),
+ ... max_length=model.decoder.config.max_position_embeddings,
+ ... pad_token_id=processor.tokenizer.pad_token_id,
+ ... eos_token_id=processor.tokenizer.eos_token_id,
+ ... use_cache=True,
+ ... bad_words_ids=[[processor.tokenizer.unk_token_id]],
+ ... return_dict_in_generate=True,
+ ... )
+
+ >>> sequence = processor.batch_decode(outputs.sequences)[0]
+ >>> sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
+ >>> sequence = re.sub(r"<.*?>", "", sequence, count=1).strip() # remove first task start token
+ >>> print(processor.token2json(sequence))
+ {'menu': {'nm': 'CINNAMON SUGAR', 'unitprice': '17,000', 'cnt': '1 x', 'price': '17,000'}, 'sub_total': {'subtotal_price': '17,000'}, 'total': {'total_price': '17,000', 'cashprice': '20,000', 'changeprice': '3,000'}}
+ ```
## DonutSwinConfig