
Commit 5e85509

ricalanis and stevhliu authored
Update falcon mamba card (#37253)
* feat: edit falcon mamba card
* fix: edit statement on falconmamba arch
* Update docs/source/en/model_doc/falcon_mamba.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/falcon_mamba.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/model_doc/falcon_mamba.md
  Co-authored-by: Steven Liu <[email protected]>
* fix: add right indent for tags
* fix: remove notas

---------

Co-authored-by: Steven Liu <[email protected]>
1 parent 416b5a8 commit 5e85509

File tree

1 file changed: +68, -63 lines changed


docs/source/en/model_doc/falcon_mamba.md

Lines changed: 68 additions & 63 deletions
@@ -14,95 +14,100 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# FalconMamba
-
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
 </div>
 
-## Overview
-
-The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.
-
-The abstract from the paper is the following:
-
-*We present FalconMamba, a new base large language model based on the novel Mamba architecture. FalconMamba is trained on 5.8 trillion tokens with carefully selected data mixtures. As a pure Mamba-based model, FalconMamba surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3 8B, and Falcon2 11B. It is on par with Gemma 7B and outperforms models with different architecture designs, such as RecurrentGemma 9B. Currently, FalconMamba is the best-performing Mamba model in the literature at this scale, surpassing both existing Mamba and hybrid Mamba-Transformer models.
-Due to its architecture, FalconMamba is significantly faster at inference and requires substantially less memory for long sequence generation. Despite recent studies suggesting that hybrid Mamba-Transformer models outperform pure architecture designs, we argue and demonstrate that the pure Mamba design can achieve similar, even superior results compared to the hybrid design. We make the weights of our implementation of FalconMamba publicly available under a permissive license.*
-
-Tips:
+# FalconMamba
 
-- FalconMamba is mostly based on Mamba architecture, the same [tips and best practices](./mamba) would be relevant here.
+[FalconMamba](https://huggingface.co/papers/2410.05355) is a 7B large language model, available as pretrained and instruction-tuned variants, based on the [Mamba](./mamba). This model implements a pure Mamba design that focuses on computational efficiency while maintaining strong performance. FalconMamba is significantly faster at inference and requires substantially less memory for long sequence generation. The models are pretrained on a diverse 5.8T token dataset including [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), technical content, code, and mathematical data.
 
-The model has been trained on approximtely 6T tokens consisting a mixture of many data sources such as RefineWeb, Cosmopedia and Math data.
+You can find the official FalconMamba checkpoints in the [FalconMamba 7B](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a) collection.
 
-For more details about the training procedure and the architecture, have a look at [the technical paper of FalconMamba]() (coming soon).
+> [!TIP]
+> Click on the FalconMamba models in the right sidebar for more examples of how to apply FalconMamba to different language tasks.
 
-# Usage
+The examples below demonstrate how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
 
-Below we demonstrate how to use the model:
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
-```python
-from transformers import FalconMambaForCausalLM, AutoTokenizer
+```py
 import torch
-
-tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
-model = FalconMambaForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b")
-
-input_ids = tokenizer("Hey how are you doing?", return_tensors= "pt")["input_ids"]
-
-out = model.generate(input_ids, max_new_tokens=10)
-print(tokenizer.batch_decode(out))
+from transformers import pipeline
+
+pipeline = pipeline(
+    "text-generation",
+    model="tiiuae/falcon-mamba-7b-instruct",
+    torch_dtype=torch.bfloat16,
+    device=0
+)
+pipeline(
+    "Explain the difference between transformers and SSMs",
+    max_length=100,
+    do_sample=True,
+    temperature=0.7
+)
 ```
 
-The architecture is also compatible with `torch.compile` for faster generation:
+</hfoption>
+<hfoption id="AutoModel">
 
-```python
-from transformers import FalconMambaForCausalLM, AutoTokenizer
+```py
 import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
 
-tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
-model = FalconMambaForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", torch_dtype=torch.bfloat16).to(0)
-model = torch.compile(model)
+tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct")
+model = AutoModelForCausalLM.from_pretrained(
+    "tiiuae/falcon-mamba-7b-instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto"
+)
 
-input_ids = tokenizer("Hey how are you doing?", return_tensors= "pt")["input_ids"]
+input_ids = tokenizer("Explain the difference between transformers and SSMs", return_tensors="pt").to("cuda")
 
-out = model.generate(input_ids, max_new_tokens=10)
-print(tokenizer.batch_decode(out))
+output = model.generate(**input_ids, max_new_tokens=100, cache_implementation="static")
+print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-If you have access to a GPU that is compatible with `bitsandbytes`, you can also quantize the model in 4-bit precision:
+</hfoption>
+<hfoption id="transformers-cli">
 
-```python
-from transformers import FalconMambaForCausalLM, AutoTokenizer, BitsAndBytesConfig
-import torch
-
-tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
-quantization_config = BitsAndBytesConfig(load_in_4bit=True)
-model = FalconMambaForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", quantization_config=quantization_config)
+```bash
+transformers-cli chat --model_name_or_path tiiuae/falcon-mamba-7b-instruct --torch_dtype auto --device 0
+```
 
-input_ids = tokenizer("Hey how are you doing?", return_tensors= "pt")["input_ids"]
+</hfoption>
+</hfoptions>
 
-out = model.generate(input_ids, max_new_tokens=10)
-print(tokenizer.batch_decode(out))
-```
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
 
-You can also play with the instruction fine-tuned model:
+The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.
 
-```python
-from transformers import FalconMambaForCausalLM, AutoTokenizer
+```python
 import torch
+from transformers import AutoTokenizer, FalconMambaForCausalLM, BitsAndBytesConfig
 
-tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct")
-model = FalconMambaForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b-instruct")
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_use_double_quant=True,
+)
 
-# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
-messages = [
-    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
-]
-input_ids = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True).input_ids
-
-outputs = model.generate(input_ids)
-print(tokenizer.decode(outputs[0]))
+tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
+model = FalconMambaForCausalLM.from_pretrained(
+    "tiiuae/falcon-mamba-7b",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    quantization_config=quantization_config,
+)
+
+inputs = tokenizer("Explain the concept of state space models in simple terms", return_tensors="pt").to("cuda")
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
 ## FalconMambaConfig
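The removed section noted that the architecture is also compatible with `torch.compile` for faster generation, and that still holds for the updated card. Below is a minimal sketch based on that removed example, assuming the `tiiuae/falcon-mamba-7b` checkpoint and a CUDA device; the prompt and generation length are only illustrative.

```python
import torch
from transformers import AutoTokenizer, FalconMambaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
# Load in bfloat16 and move the model to the GPU before compiling.
model = FalconMambaForCausalLM.from_pretrained(
    "tiiuae/falcon-mamba-7b", torch_dtype=torch.bfloat16
).to("cuda")
# Compile the model for faster repeated forward passes during generation.
model = torch.compile(model)

inputs = tokenizer("Hey how are you doing?", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```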
