Commit dffd8c4

[docs] Update model-card for DINOv2
1 parent 471cf1d commit dffd8c4

1 file changed

+100
-37
lines changed

Diff for: docs/source/en/model_doc/dinov2.md

@@ -10,71 +10,134 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
1010
specific language governing permissions and limitations under the License.
1111
-->
1212

13-
# DINOv2
14-
15-
<div class="flex flex-wrap space-x-1">
16-
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
17-
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=
18-
">
19-
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
20-
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
13+
<div style="float: right;">
14+
<div class="flex flex-wrap space-x-1">
15+
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
16+
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=">
17+
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
18+
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
19+
</div>
2120
</div>
2221

23-
## Overview
2422

25-
The DINOv2 model was proposed in [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by
26-
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.
27-
DINOv2 is an upgrade of [DINO](https://arxiv.org/abs/2104.14294), a self-supervised method applied on [Vision Transformers](vit). This method enables all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning.
23+
# DINOv2
2824

29-
The abstract from the paper is the following:
25+
[DINOv2](https://huggingface.co/papers/2304.07193) is a vision foundation model that uses [ViT](./vit) as a feature extractor for multiple downstream tasks like image classification and depth estimation. It focuses on stabilizing and accelerating training through techniques like faster, memory-efficient attention, sequence packing, improved stochastic depth, Fully Sharded Data Parallel (FSDP), and model distillation.
3026

31-
*The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.*
27+
You can find all the original DINOv2 checkpoints under the [Dinov2](https://huggingface.co/collections/facebook/dinov2-6526c98554b3d2576e071ce3) collection.
3228

33-
This model was contributed by [nielsr](https://huggingface.co/nielsr).
34-
The original code can be found [here](https://github.com/facebookresearch/dinov2).
29+
> [!TIP]
30+
> Click on the DINOv2 models in the right sidebar for more examples of how to apply DINOv2 to different vision tasks.
3531
36-
## Usage tips
32+
The example below demonstrates how to classify an image with [`Pipeline`] or the [`AutoModel`] class.
3733

38-
The model can be traced using `torch.jit.trace` which leverages JIT compilation to optimize the model making it faster to run. Note this still produces some mis-matched elements and the difference between the original model and the traced model is of the order of 1e-4.
34+
<hfoptions id="usage">
35+
<hfoption id="Pipeline">
3936

40-
```python
37+
```py
4138
import torch
42-
from transformers import AutoImageProcessor, AutoModel
39+
from transformers import pipeline
40+
41+
pipe = pipeline(
42+
task="image-classification",
43+
model="facebook/dinov2-small-imagenet1k-1-layer",
44+
torch_dtype=torch.float16,
45+
device=0
46+
)
47+
48+
pipe("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
49+
```
50+
51+
</hfoption>
52+
<hfoption id="AutoModel">
53+
54+
```py
55+
import requests
import torch
56+
from transformers import AutoImageProcessor, AutoModelForImageClassification
4357
from PIL import Image
58+
59+
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
60+
image = Image.open(requests.get(url, stream=True).raw)
61+
62+
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-small-imagenet1k-1-layer")
63+
model = AutoModelForImageClassification.from_pretrained(
64+
"facebook/dinov2-small-imagenet1k-1-layer",
65+
torch_dtype=torch.float16,
66+
device_map="auto",
67+
attn_implementation="sdpa"
68+
)
69+
70+
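# Move the inputs to the model's device and dtype to avoid a float32/float16 mismatch
inputs = processor(images=image, return_tensors="pt").to(model.device, dtype=model.dtype)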
71+
logits = model(**inputs).logits
72+
predicted_class_idx = logits.argmax(-1).item()
73+
print("Predicted class:", model.config.id2label[predicted_class_idx])
74+
```
75+
76+
</hfoption>
77+
</hfoptions>
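
Both examples above reduce the logits to a single predicted class with `argmax`. As a minimal sketch of how the same logits could be turned into ranked probabilities, the snippet below applies a softmax and keeps the top `k` labels. It is plain Python for illustration only; the `softmax` and `top_k` helpers, the logits, and the label names are all hypothetical and are not part of the Transformers API.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and labels, standing in for model(**inputs).logits
# and model.config.id2label.
example_logits = [1.0, 3.0, 0.5, 2.0]
example_labels = {0: "dog", 1: "tabby cat", 2: "car", 3: "tiger cat"}

for label, prob in top_k(example_logits, example_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With a real model, the list of logits would come from something like `logits[0].float().tolist()`, and the label mapping from `model.config.id2label`.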
78+
79+
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
80+
81+
The example below uses [torchao](../quantization/torchao) to quantize only the weights to int4.
82+
83+
```py
84+
# pip install torchao
4485
import requests
import torch
86+
from transformers import TorchAoConfig, AutoImageProcessor, AutoModelForImageClassification
87+
from torchao.quantization import Int4WeightOnlyConfig
88+
from PIL import Image
4589

4690
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
4791
image = Image.open(requests.get(url, stream=True).raw)
4892

49-
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
50-
model = AutoModel.from_pretrained('facebook/dinov2-base')
93+
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-giant-imagenet1k-1-layer')
94+
95+
quant_config = Int4WeightOnlyConfig(group_size=128)
96+
quantization_config = TorchAoConfig(quant_type=quant_config)
97+
98+
model = AutoModelForImageClassification.from_pretrained(
99+
'facebook/dinov2-giant-imagenet1k-1-layer',
100+
torch_dtype=torch.bfloat16,
101+
device_map="auto",
102+
quantization_config=quantization_config
103+
)
51104

52105
inputs = processor(images=image, return_tensors="pt").to(model.device, dtype=model.dtype)
53106
outputs = model(**inputs)
54-
last_hidden_states = outputs[0]
107+
logits = outputs.logits
108+
predicted_class_idx = logits.argmax(-1).item()
109+
print("Predicted class:", model.config.id2label[predicted_class_idx])
110+
```
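
To make the idea of group-wise int4 weight-only quantization more concrete, here is a rough sketch in plain Python. It assumes a simple asymmetric min/max scheme in which each group of `group_size` weights shares one scale and one offset. This is an illustration of the general technique, not torchao's implementation, which packs the 4-bit codes and uses dedicated kernels. All function names here are hypothetical.

```python
import random


def quantize_group(weights):
    """Asymmetric 4-bit quantization of one group: returns (codes, scale, minimum)."""
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels (codes 0..15); guard against a constant group.
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [c * scale + lo for c in codes]


def quantize_weights(weights, group_size=128):
    """Split a flat weight list into groups and quantize each one independently."""
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize_weights(groups):
    """Reassemble the approximate weights from all quantized groups."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(dequantize_group(codes, scale, lo))
    return restored


# Hypothetical weights, for illustration only.
random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(256)]
groups = quantize_weights(weights, group_size=128)
restored = dequantize_weights(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.4f}")
```

In this scheme the rounding error per weight is bounded by half of the group's scale, which is why a smaller `group_size` tends to trade extra metadata for lower error.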
55111

56-
# We have to force return_dict=False for tracing
57-
model.config.return_dict = False
112+
## Notes
58113

59-
with torch.no_grad():
60-
traced_model = torch.jit.trace(model, [inputs.pixel_values])
61-
traced_outputs = traced_model(inputs.pixel_values)
114+
- Use [torch.jit.trace](https://pytorch.org/docs/stable/generated/torch.jit.trace.html) to speed up inference. Tracing produces some mismatched elements, however: the difference between the outputs of the original and traced model is on the order of 1e-4.
62115

63-
print((last_hidden_states - traced_outputs[0]).abs().max())
64-
```
116+
```py
117+
import torch
118+
from transformers import AutoImageProcessor, AutoModel
119+
from PIL import Image
120+
import requests
65121

66-
## Resources
122+
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
123+
image = Image.open(requests.get(url, stream=True).raw)
67124

68-
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DINOv2.
125+
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
126+
model = AutoModel.from_pretrained('facebook/dinov2-base')
69127

70-
- Demo notebooks for DINOv2 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DINOv2). 🌎
128+
inputs = processor(images=image, return_tensors="pt")
129+
outputs = model(**inputs)
130+
last_hidden_states = outputs[0]
71131

72-
<PipelineTag pipeline="image-classification"/>
132+
# We have to force return_dict=False for tracing
133+
model.config.return_dict = False
73134

74-
- [`Dinov2ForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
75-
- See also: [Image classification task guide](../tasks/image_classification)
135+
with torch.no_grad():
136+
traced_model = torch.jit.trace(model, [inputs.pixel_values])
137+
traced_outputs = traced_model(inputs.pixel_values)
76138

77-
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
139+
print((last_hidden_states - traced_outputs[0]).abs().max())
140+
```
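
The tracing example prints the largest absolute difference between the two outputs. One way to turn that number into a pass/fail check is sketched below in plain Python. The helper names and the `1e-3` tolerance are assumptions chosen for illustration, picked to sit above the roughly 1e-4 mismatch described in the note.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def within_tolerance(reference, candidate, atol=1e-3):
    """True when every element of candidate is within atol of reference."""
    return max_abs_diff(reference, candidate) <= atol


# Hypothetical flattened outputs from the original and traced model.
original = [0.1234, -1.5000, 2.2500]
traced = [0.1235, -1.4999, 2.2501]
print(within_tolerance(original, traced))
```

With real tensors, `last_hidden_states.flatten().tolist()` and `traced_outputs[0].flatten().tolist()` could be passed in, though a tensor-level comparison would be faster for large outputs.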
78141

79142
## Dinov2Config
80143
