From c65685fc2559556fb6bde2f48c458aec9c394707 Mon Sep 17 00:00:00 2001
From: Ricardo Alanis <ricardo.alanis@gmail.com>
Date: Thu, 3 Apr 2025 09:23:58 -0600
Subject: [PATCH 1/7] feat: edit falcon mamba card

---
 docs/source/en/model_doc/falcon_mamba.md | 130 ++++++++++++-----------
 1 file changed, 69 insertions(+), 61 deletions(-)

diff --git a/docs/source/en/model_doc/falcon_mamba.md b/docs/source/en/model_doc/falcon_mamba.md
index fb6debfef921..cd8a95950142 100644
--- a/docs/source/en/model_doc/falcon_mamba.md
+++ b/docs/source/en/model_doc/falcon_mamba.md
@@ -14,97 +14,105 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# FalconMamba
-
 <div class="flex flex-wrap space-x-1">
 <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
 </div>
 
-## Overview
-
-The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.
-
-The abstract from the paper is the following:
-
-*We present FalconMamba, a new base large language model based on the novel Mamba architecture. FalconMamba is trained on 5.8 trillion tokens with carefully selected data mixtures. As a pure Mamba-based model, FalconMamba surpasses leading open-weight models based on Transformers, such as Mistral 7B, Llama3 8B, and Falcon2 11B. It is on par with Gemma 7B and outperforms models with different architecture designs, such as RecurrentGemma 9B. Currently, FalconMamba is the best-performing Mamba model in the literature at this scale, surpassing both existing Mamba and hybrid Mamba-Transformer models.
-Due to its architecture, FalconMamba is significantly faster at inference and requires substantially less memory for long sequence generation. Despite recent studies suggesting that hybrid Mamba-Transformer models outperform pure architecture designs, we argue and demonstrate that the pure Mamba design can achieve similar, even superior results compared to the hybrid design. We make the weights of our implementation of FalconMamba publicly available under a permissive license.*
-
-Tips:
+# FalconMamba
 
-- FalconMamba is mostly based on Mamba architecture, the same [tips and best practices](./mamba) would be relevant here.
+[FalconMamba](https://huggingface.co/papers/2410.05355) is a family of large language models based on the State Space Model (SSM) architecture, available in 7B parameter size as pretrained and instruction-tuned variants. This model implements a pure Mamba design that focuses on computational efficiency while maintaining strong performance. FalconMamba uses linear-time selective state space models and rotary positional embeddings (RoPE). The models are pretrained on a diverse 5.8 trillion token dataset including [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), technical content, code, and mathematical data.
 
-The model has been trained on approximtely 6T tokens consisting a mixture of many data sources such as RefineWeb, Cosmopedia and Math data.
+You can find the official FalconMamba checkpoints in the [TII UAE collection](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a).
 
-For more details about the training procedure and the architecture, have a look at [the technical paper of FalconMamba]() (coming soon).
+> [!TIP]
+> Click on the FalconMamba models in the right sidebar for more examples of how to apply FalconMamba to different language tasks.
 
-# Usage
+The examples below demonstrate how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.
 
-Below we demonstrate how to use the model:
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
-```python 
-from transformers import FalconMambaForCausalLM, AutoTokenizer
+```py
 import torch
-
-tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
-model = FalconMambaForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b")
-
-input_ids = tokenizer("Hey how are you doing?", return_tensors= "pt")["input_ids"]
-
-out = model.generate(input_ids, max_new_tokens=10)
-print(tokenizer.batch_decode(out))
+from transformers import pipeline
+
+pipeline = pipeline(
+    "text-generation", 
+    model="tiiuae/falcon-mamba-7b-instruct",
+    torch_dtype=torch.bfloat16,
+    device=0
+)
+pipeline(
+    "Explain the difference between transformers and SSMs",
+    max_length=100,
+    do_sample=True,
+    temperature=0.7
+)
 ```
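+
+Since `tiiuae/falcon-mamba-7b-instruct` is an instruction-tuned checkpoint, the same pipeline can also take chat-style messages. A minimal sketch, assuming a transformers version where `text-generation` pipelines accept a list of message dicts:
+
+```py
+# minimal sketch: reuse the pipeline created above with chat-style input
+messages = [
+    {"role": "user", "content": "Explain the difference between transformers and SSMs"},
+]
+pipeline(messages, max_new_tokens=100, do_sample=True, temperature=0.7)
+```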
 
-The architecture is also compatible with `torch.compile` for faster generation:
+</hfoption>
+<hfoption id="AutoModel">
 
-```python 
-from transformers import FalconMambaForCausalLM, AutoTokenizer
+```py
 import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
 
-tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
-model = FalconMambaForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", torch_dtype=torch.bfloat16).to(0)
-model = torch.compile(model)
+tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct")
+model = AutoModelForCausalLM.from_pretrained(
+    "tiiuae/falcon-mamba-7b-instruct",
+    torch_dtype=torch.bfloat16,
+    device_map="auto"
+)
 
-input_ids = tokenizer("Hey how are you doing?", return_tensors= "pt")["input_ids"]
+input_ids = tokenizer("Explain the difference between transformers and SSMs", return_tensors="pt").to("cuda")
 
-out = model.generate(input_ids, max_new_tokens=10)
-print(tokenizer.batch_decode(out))
+output = model.generate(**input_ids, max_new_tokens=100)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-If you have access to a GPU that is compatible with `bitsandbytes`, you can also quantize the model in 4-bit precision:
-
-```python 
-from transformers import FalconMambaForCausalLM, AutoTokenizer, BitsAndBytesConfig
-import torch
+</hfoption>
+<hfoption id="transformers-cli">
 
-tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
-quantization_config = BitsAndBytesConfig(load_in_4bit=True)
-model = FalconMambaForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", quantization_config=quantization_config)
+```bash
+transformers-cli chat --model_name_or_path tiiuae/falcon-mamba-7b-instruct --torch_dtype auto --device 0
+```
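+
+This launches an interactive chat session with the instruction-tuned model directly in the terminal.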
 
-input_ids = tokenizer("Hey how are you doing?", return_tensors= "pt")["input_ids"]
+</hfoption>
+</hfoptions>
 
-out = model.generate(input_ids, max_new_tokens=10)
-print(tokenizer.batch_decode(out))
-```
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
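+
+As a rough estimate, the 7B checkpoint needs about 14GB for the weights alone in bfloat16 (7B parameters × 2 bytes), while 4-bit quantization brings the weights closer to 3.5GB plus some overhead.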
 
-You can also play with the instruction fine-tuned model:
+The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize the weights to 4-bits.
 
-```python 
-from transformers import FalconMambaForCausalLM, AutoTokenizer
+```python
 import torch
+from transformers import AutoTokenizer, FalconMambaForCausalLM, BitsAndBytesConfig
 
-tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-instruct")
-model = FalconMambaForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b-instruct")
-
-# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
-messages = [
-    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
-]
-input_ids = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True).input_ids
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_use_double_quant=True,
+)
 
-outputs = model.generate(input_ids)
-print(tokenizer.decode(outputs[0]))
+tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
+model = FalconMambaForCausalLM.from_pretrained(
+    "tiiuae/falcon-mamba-7b",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    quantization_config=quantization_config,
+)
+
+inputs = tokenizer("Explain the concept of state space models in simple terms", return_tensors="pt").to("cuda")
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
+## Notes
+
+- FalconMamba is based on the Mamba architecture. The same [tips and best practices](./mamba) for Mamba models are relevant here.
+- The architecture is compatible with `torch.compile` for faster generation via `model = torch.compile(model)`.
+
 ## FalconMambaConfig
 
 [[autodoc]] FalconMambaConfig

From 8209bc93a4feebdc23dc281ce648699885a791ae Mon Sep 17 00:00:00 2001
From: Ricardo Alanis <ricardo.alanis@gmail.com>
Date: Thu, 3 Apr 2025 11:06:00 -0600
Subject: [PATCH 2/7] fix: edit statement on falconmamba arch

---
 docs/source/en/model_doc/falcon_mamba.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/falcon_mamba.md b/docs/source/en/model_doc/falcon_mamba.md
index cd8a95950142..a59332f03ac6 100644
--- a/docs/source/en/model_doc/falcon_mamba.md
+++ b/docs/source/en/model_doc/falcon_mamba.md
@@ -20,7 +20,7 @@ rendered properly in your Markdown viewer.
 
 # FalconMamba
 
-[FalconMamba](https://huggingface.co/papers/2410.05355) is a family of large language models based on the State Space Model (SSM) architecture, available in 7B parameter size as pretrained and instruction-tuned variants. This model implements a pure Mamba design that focuses on computational efficiency while maintaining strong performance. FalconMamba uses linear-time selective state space models and rotary positional embeddings (RoPE). The models are pretrained on a diverse 5.8 trillion token dataset including [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), technical content, code, and mathematical data.
+[FalconMamba](https://huggingface.co/papers/2410.05355) is a family of large language models based on the State Space Model (SSM) architecture, available in 7B parameter size as pretrained and instruction-tuned variants. This model implements a pure Mamba design that focuses on computational efficiency while maintaining strong performance. FalconMamba is significantly faster at inference and requires substantially less memory for long sequence generation. The models are pretrained on a diverse 5.8 trillion token dataset including [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), technical content, code, and mathematical data.
 
 You can find the official FalconMamba checkpoints in the [TII UAE collection](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a).
 

From 470a91f589174d8498594aac53de23b62e8e3093 Mon Sep 17 00:00:00 2001
From: Ricardo Alanis <ricardo.alanis@gmail.com>
Date: Thu, 3 Apr 2025 17:06:28 -0600
Subject: [PATCH 3/7] Update docs/source/en/model_doc/falcon_mamba.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/falcon_mamba.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/falcon_mamba.md b/docs/source/en/model_doc/falcon_mamba.md
index a59332f03ac6..0769221f20f7 100644
--- a/docs/source/en/model_doc/falcon_mamba.md
+++ b/docs/source/en/model_doc/falcon_mamba.md
@@ -20,7 +20,7 @@ rendered properly in your Markdown viewer.
 
 # FalconMamba
 
-[FalconMamba](https://huggingface.co/papers/2410.05355) is a family of large language models based on the State Space Model (SSM) architecture, available in 7B parameter size as pretrained and instruction-tuned variants. This model implements a pure Mamba design that focuses on computational efficiency while maintaining strong performance. FalconMamba is significantly faster at inference and requires substantially less memory for long sequence generation. The models are pretrained on a diverse 5.8 trillion token dataset including [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), technical content, code, and mathematical data.
+[FalconMamba](https://huggingface.co/papers/2410.05355) is a 7B large language model, available as pretrained and instruction-tuned variants, based on the [Mamba](./mamba) architecture. This model implements a pure Mamba design that focuses on computational efficiency while maintaining strong performance. FalconMamba is significantly faster at inference and requires substantially less memory for long sequence generation. The models are pretrained on a diverse 5.8T token dataset including [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), technical content, code, and mathematical data.
 
 You can find the official FalconMamba checkpoints in the [TII UAE collection](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a).
 

From 2c66026e004bd9cabd690def609b108167193237 Mon Sep 17 00:00:00 2001
From: Ricardo Alanis <ricardo.alanis@gmail.com>
Date: Thu, 3 Apr 2025 17:06:41 -0600
Subject: [PATCH 4/7] Update docs/source/en/model_doc/falcon_mamba.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/falcon_mamba.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/falcon_mamba.md b/docs/source/en/model_doc/falcon_mamba.md
index 0769221f20f7..0ad5ead10c09 100644
--- a/docs/source/en/model_doc/falcon_mamba.md
+++ b/docs/source/en/model_doc/falcon_mamba.md
@@ -22,7 +22,7 @@ rendered properly in your Markdown viewer.
 
 [FalconMamba](https://huggingface.co/papers/2410.05355) is a 7B large language model, available as pretrained and instruction-tuned variants, based on the [Mamba](./mamba) architecture. This model implements a pure Mamba design that focuses on computational efficiency while maintaining strong performance. FalconMamba is significantly faster at inference and requires substantially less memory for long sequence generation. The models are pretrained on a diverse 5.8T token dataset including [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), technical content, code, and mathematical data.
 
-You can find the official FalconMamba checkpoints in the [TII UAE collection](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a).
+You can find the official FalconMamba checkpoints in the [FalconMamba 7B](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a) collection.
 
 > [!TIP]
 > Click on the FalconMamba models in the right sidebar for more examples of how to apply FalconMamba to different language tasks.

From 3aabeccefcf40a256692828cf08a8ac452aa6e54 Mon Sep 17 00:00:00 2001
From: Ricardo Alanis <ricardo.alanis@gmail.com>
Date: Thu, 3 Apr 2025 17:06:52 -0600
Subject: [PATCH 5/7] Update docs/source/en/model_doc/falcon_mamba.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/model_doc/falcon_mamba.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/model_doc/falcon_mamba.md b/docs/source/en/model_doc/falcon_mamba.md
index 0ad5ead10c09..9dfa2c02540d 100644
--- a/docs/source/en/model_doc/falcon_mamba.md
+++ b/docs/source/en/model_doc/falcon_mamba.md
@@ -66,7 +66,7 @@ model = AutoModelForCausalLM.from_pretrained(
 
 input_ids = tokenizer("Explain the difference between transformers and SSMs", return_tensors="pt").to("cuda")
 
-output = model.generate(**input_ids, max_new_tokens=100)
+output = model.generate(**input_ids, max_new_tokens=100, cache_implementation="static")
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 

From 89e818e36c87035a2c9b88d7fc4c548839ecd0dc Mon Sep 17 00:00:00 2001
From: Ricardo Alanis <ricardo.alanis@gmail.com>
Date: Thu, 3 Apr 2025 17:10:37 -0600
Subject: [PATCH 6/7] fix: add right indent for tags

---
 docs/source/en/model_doc/falcon_mamba.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/model_doc/falcon_mamba.md b/docs/source/en/model_doc/falcon_mamba.md
index 9dfa2c02540d..5bd106e38fa2 100644
--- a/docs/source/en/model_doc/falcon_mamba.md
+++ b/docs/source/en/model_doc/falcon_mamba.md
@@ -14,8 +14,10 @@ rendered properly in your Markdown viewer.
 
 -->
 
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
 </div>
 
 # FalconMamba

From 6507c768e782a7f36ee4e6b0e68c20e85a209310 Mon Sep 17 00:00:00 2001
From: Ricardo Alanis <ricardo.alanis@gmail.com>
Date: Sat, 5 Apr 2025 08:20:16 -0600
Subject: [PATCH 7/7] fix: remove notes

---
 docs/source/en/model_doc/falcon_mamba.md | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/docs/source/en/model_doc/falcon_mamba.md b/docs/source/en/model_doc/falcon_mamba.md
index 5bd106e38fa2..ef346e89892e 100644
--- a/docs/source/en/model_doc/falcon_mamba.md
+++ b/docs/source/en/model_doc/falcon_mamba.md
@@ -110,11 +110,6 @@ outputs = model.generate(**inputs, max_new_tokens=100)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
-## Notes
-
-- FalconMamba is based on the Mamba architecture. The same [tips and best practices](./mamba) for Mamba models are relevant here.
-- The architecture is compatible with `torch.compile` for faster generation via `model = torch.compile(model)`.
-
 ## FalconMambaConfig
 
 [[autodoc]] FalconMambaConfig