Update falcon mamba card #37253

Merged · 7 commits · Apr 7, 2025 · Changes from 2 commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
`docs/source/en/model_doc/falcon_mamba.md`: 130 changes (69 additions, 61 deletions)

# FalconMamba

<div class="flex flex-wrap space-x-1">
    <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

> **Review comment (Member):** Wrap this with the below to align it to the right:
>
> <div style="float: right;">
>    badges
> </div>
>
> **Reply (Contributor Author):** Oh! Forgot about that. Adding.

[FalconMamba](https://huggingface.co/papers/2410.05355) is a family of large language models based on the State Space Model (SSM) architecture, available in a 7B parameter size as pretrained and instruction-tuned variants. The model implements a pure Mamba design focused on computational efficiency while maintaining strong performance: FalconMamba is significantly faster at inference and requires substantially less memory for long sequence generation. The models are pretrained on a diverse 5.8 trillion token dataset that includes [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), technical content, code, and mathematical data.

The model was proposed by TII UAE (Technology Innovation Institute). The abstract from the paper is the following:

*We present FalconMamba, a new base large language model based on the novel Mamba architecture. FalconMamba is trained on 5.8 trillion tokens with carefully selected data mixtures. As a pure Mamba-based model, FalconMamba surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3 8B, and Falcon2 11B. It is on par with Gemma 7B and outperforms models with different architecture designs, such as RecurrentGemma 9B. Currently, FalconMamba is the best-performing Mamba model in the literature at this scale, surpassing both existing Mamba and hybrid Mamba-Transformer models. Due to its architecture, FalconMamba is significantly faster at inference and requires substantially less memory for long sequence generation. Despite recent studies suggesting that hybrid Mamba-Transformer models outperform pure architecture designs, we argue and demonstrate that the pure Mamba design can achieve similar, even superior results compared to the hybrid design. We make the weights of our implementation of FalconMamba publicly available under a permissive license.*

You can find the official FalconMamba checkpoints in the [TII UAE collection](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a).

> [!TIP]
> Click on the FalconMamba models in the right sidebar for more examples of how to apply FalconMamba to different language tasks.

## Usage

The examples below demonstrate how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipeline = pipeline(
    "text-generation",
    model="tiiuae/falcon-mamba-7b-instruct",
    torch_dtype=torch.bfloat16,
    device=0
)
pipeline(
    "Explain the difference between transformers and SSMs",
    max_length=100,
    do_sample=True,
    temperature=0.7
)
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-mamba-7b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

input_ids = tokenizer("Explain the difference between transformers and SSMs", return_tensors="pt").to("cuda")

output = model.generate(**input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

</hfoption>
<hfoption id="transformers-cli">

```bash
transformers-cli chat --model_name_or_path tiiuae/falcon-mamba-7b-instruct --torch_dtype auto --device 0
```

</hfoption>
</hfoptions>
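
The instruction-tuned checkpoint also ships a chat template. The sketch below is adapted from the chat example in an earlier version of this card; it assumes the `tiiuae/falcon-mamba-7b-instruct` tokenizer defines a chat template and uses the standard `apply_chat_template` API to format the conversation before generating.

```py
# Sketch adapted from the earlier chat example in this card; the prompt is illustrative.
import torch
from transformers import AutoTokenizer, FalconMambaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct")
model = FalconMambaForCausalLM.from_pretrained(
    "tiiuae/falcon-mamba-7b-instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
# Format the conversation with the tokenizer's chat template and append the generation prompt.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```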

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.

```python
import torch
from transformers import AutoTokenizer, FalconMambaForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
model = FalconMambaForCausalLM.from_pretrained(
    "tiiuae/falcon-mamba-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)

inputs = tokenizer("Explain the concept of state space models in simple terms", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Notes

- FalconMamba is based on the Mamba architecture. The same [tips and best practices](./mamba) for Mamba models are relevant here.
- The architecture is compatible with `torch.compile` for faster generation via `model = torch.compile(model)`, as in the sketch below.
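
A minimal sketch of the `torch.compile` pattern, based on the example in the earlier version of this card (the actual speed-up depends on your PyTorch and Transformers versions):

```py
# Sketch based on the torch.compile example from the earlier version of this card.
import torch
from transformers import AutoTokenizer, FalconMambaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
model = FalconMambaForCausalLM.from_pretrained(
    "tiiuae/falcon-mamba-7b", torch_dtype=torch.bfloat16
).to(0)

# Compile the model; the first call is slower because it triggers compilation.
model = torch.compile(model)

input_ids = tokenizer("Hey how are you doing?", return_tensors="pt")["input_ids"].to(0)
out = model.generate(input_ids, max_new_tokens=10)
print(tokenizer.batch_decode(out))
```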

> **Review comment (Member):** I think we can remove these notes as they're not that useful. The model is automatically compiled when we set `cache_implementation="static"` in `generate`.
>
> **Reply (Contributor Author):** Removed!


## FalconMambaConfig

[[autodoc]] FalconMambaConfig