Commit 15ac2b6

Update Model Card for ModernBERT (#37052)
* Modify Model Card for ModernBERT.

* Update as per code review.

Co-authored-by: Steven Liu <[email protected]>

* Update model card.

* Update model card.

---------

Co-authored-by: Steven Liu <[email protected]>
1 parent b552708 commit 15ac2b6

File tree

1 file changed: +55 -31 lines changed


Diff for: docs/source/en/model_doc/modernbert.md

@@ -14,55 +14,79 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# ModernBERT
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+<div style="float: right;">
+<div class="flex flex-wrap space-x-1">
+<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+</div>
 </div>

-## Overview
+# ModernBERT
 
-The ModernBERT model was proposed in [Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference](https://arxiv.org/abs/2412.13663) by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
+[ModernBERT](https://huggingface.co/papers/2412.13663) is a modernized version of [`BERT`] trained on 2T tokens. It brings many improvements to the original architecture such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention.
 
-It is a refresh of the traditional encoder architecture, as used in previous models such as [BERT](https://huggingface.co/docs/transformers/en/model_doc/bert) and [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta).
+You can find all the original ModernBERT checkpoints under the [ModernBERT](https://huggingface.co/collections/answerdotai/modernbert-67627ad707a4acbf33c41deb) collection.
 
-It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
-- [Rotary Positional Embeddings](https://huggingface.co/blog/designing-positional-encoding) to support sequences of up to 8192 tokens.
-- [Unpadding](https://arxiv.org/abs/2208.08124) to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
-- [GeGLU](https://arxiv.org/abs/2002.05202) layers replacing the original MLP layers, shown to improve performance.
-- [Alternating Attention](https://arxiv.org/abs/2004.05150v2), where most attention layers employ a sliding window of 128 tokens, with global attention used only every 3 layers.
-- [Flash Attention](https://github.com/Dao-AILab/flash-attention) to speed up processing.
-- A model designed following the recent [The Case for Co-Designing Model Architectures with Hardware](https://arxiv.org/abs/2401.14489) paper, ensuring maximum efficiency across inference GPUs.
-- Modern training data scales (2 trillion tokens) and mixtures (including code and math data).
+> [!TIP]
+> Click on the ModernBERT models in the right sidebar for more examples of how to apply ModernBERT to different language tasks.
 
-The abstract from the paper is the following:
+The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
 
-*Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.*
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
-The original code can be found [here](https://github.com/answerdotai/modernbert).
+```py
+import torch
+from transformers import pipeline
 
-## Resources
+pipeline = pipeline(
+    task="fill-mask",
+    model="answerdotai/ModernBERT-base",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline("Plants create [MASK] through a process known as photosynthesis.")
+```
 
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ModernBERT.
+</hfoption>
+<hfoption id="AutoModel">
 
-<PipelineTag pipeline="text-classification"/>
+```py
+import torch
+from transformers import AutoModelForMaskedLM, AutoTokenizer
 
-- A notebook on how to [finetune for General Language Understanding Evaluation (GLUE) with Transformers](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/finetune_modernbert_on_glue.ipynb), also available as a Google Colab [notebook](https://colab.research.google.com/github/AnswerDotAI/ModernBERT/blob/main/examples/finetune_modernbert_on_glue.ipynb). 🌎
+tokenizer = AutoTokenizer.from_pretrained(
+    "answerdotai/ModernBERT-base",
+)
+model = AutoModelForMaskedLM.from_pretrained(
+    "answerdotai/ModernBERT-base",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    attn_implementation="sdpa"
+)
+inputs = tokenizer("Plants create [MASK] through a process known as photosynthesis.", return_tensors="pt").to("cuda")
 
-<PipelineTag pipeline="sentence-similarity"/>
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
 
-- A script on how to [finetune for text similarity or information retrieval with Sentence Transformers](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/train_st.py). 🌎
-- A script on how to [finetune for information retrieval with PyLate](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/train_pylate.py). 🌎
+masked_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
+predicted_token_id = predictions[0, masked_index].argmax(dim=-1)
+predicted_token = tokenizer.decode(predicted_token_id)
 
-<PipelineTag pipeline="fill-mask"/>
+print(f"The predicted token is: {predicted_token}")
+```
 
-- [Masked language modeling task guide](../tasks/masked_language_modeling)
+</hfoption>
+<hfoption id="transformers-cli">
 
-<PipelineTag pipeline="question-answering"/>
+```bash
+echo -e "Plants create [MASK] through a process known as photosynthesis." | transformers-cli run --task fill-mask --model answerdotai/ModernBERT-base --device 0
+```
 
-- [`ModernBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [colab notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).
+</hfoption>
+</hfoptions>
 
 ## ModernBertConfig
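The updated overview above claims an 8192-token context window and alternating local/global attention. As a quick, non-authoritative way to check those numbers against the released checkpoint, the sketch below loads only the model configuration and prints the relevant fields. The attribute names (`max_position_embeddings`, `local_attention`, `global_attn_every_n_layers`) are assumptions based on `ModernBertConfig` in recent `transformers` releases, so they are read defensively with `getattr`.

```py
from transformers import AutoConfig

# Fetch only the configuration of the released checkpoint (no weights).
config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")

# Attribute names below are assumptions based on ModernBertConfig in recent
# transformers releases; getattr avoids a crash if a name differs in your version.
print("model_type:             ", config.model_type)
print("max_position_embeddings:", getattr(config, "max_position_embeddings", None))    # expected 8192
print("local_attention window: ", getattr(config, "local_attention", None))            # expected 128
print("global attn every N:    ", getattr(config, "global_attn_every_n_layers", None)) # expected 3
```

If the printed values line up with the card (8192, 128, and 3), the prose and the shipped configuration agree.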
