
Commit 30df5d1

AbhishekRP2002 authored and stevhliu committed on Apr 5, 2025
chore: Update model doc for code_llama (huggingface#37115)
* Update code_llama.md: aims to handle huggingface#36979 (comment), sub-part of huggingface#36979
* Update docs/source/en/model_doc/code_llama.md (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* Update docs/source/en/model_doc/code_llama.md (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* Update docs/source/en/model_doc/code_llama.md (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* make changes as per code review
* chore: make the function smaller for attention mask visualizer
* chore[docs]: update code_llama.md with some more suggested changes
* Update docs/source/en/model_doc/code_llama.md (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* chore[docs]: Update code_llama.md with indentation changes

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent 14da65e commit 30df5d1

File tree

1 file changed: +123 -77 lines changed
 

docs/source/en/model_doc/code_llama.md

+123 -77
@@ -14,108 +14,154 @@ rendered properly in your Markdown viewer.
-->

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAAC7lBMVEUAAADg5vYHPVgAoJH+/v76+v39/f9JbLP///9+AIgAnY3///+mcqzt8fXy9fgkXa3Ax9709fr+///9/f8qXq49qp5AaLGMwrv8/P0eW60VWawxYq8yqJzG2dytt9Wyu9elzci519Lf3O3S2efY3OrY0+Xp7PT///////+dqNCexMc6Z7AGpJeGvbenstPZ5ejQ1OfJzOLa7ejh4+/r8fT29vpccbklWK8PVa0AS6ghW63O498vYa+lsdKz1NDRt9Kw1c672tbD3tnAxt7R6OHp5vDe7OrDyuDn6vLl6/EAQKak0MgATakkppo3ZK/Bz9y8w9yzu9jey97axdvHzeG21NHH4trTwthKZrVGZLSUSpuPQJiGAI+GAI8SWKydycLL4d7f2OTi1+S9xNzL0ePT6OLGzeEAo5U0qJw/aLEAo5JFa7JBabEAp5Y4qZ2QxLyKmsm3kL2xoMOehrRNb7RIbbOZgrGre68AUqwAqZqNN5aKJ5N/lMq+qsd8kMa4pcWzh7muhLMEV69juq2kbKqgUaOTR5uMMZWLLZSGAI5VAIdEAH+ovNDHuNCnxcy3qcaYx8K8msGplrx+wLahjbYdXrV6vbMvYK9DrZ8QrZ8tqJuFms+Sos6sw8ecy8RffsNVeMCvmb43aLltv7Q4Y7EZWK4QWa1gt6meZKUdr6GOAZVeA4xPAISyveLUwtivxtKTpNJ2jcqfvcltiMiwwcfAoMVxhL+Kx7xjdrqTe60tsaNQs6KaRKACrJ6UTZwkqpqTL5pkHY4AloSgsd2ptNXPvNOOncuxxsqFl8lmg8apt8FJcr9EbryGxLqlkrkrY7dRa7ZGZLQ5t6iXUZ6PPpgVpZeJCJFKAIGareTa0+KJod3H0deY2M+esM25usmYu8d2zsJOdcBVvrCLbqcAOaaHaKQAMaScWqKBXqCXMJ2RHpiLF5NmJZAdAHN2kta11dKu1M+DkcZLdb+Mcql3TppyRJdzQ5ZtNZNlIY+DF4+voCOQAAAAZ3RSTlMABAT+MEEJ/RH+/TP+Zlv+pUo6Ifz8+fco/fz6+evr39S9nJmOilQaF/7+/f38+smmoYp6b1T+/v7++vj189zU0tDJxsGzsrKSfv34+Pf27dDOysG9t6+n/vv6+vr59uzr1tG+tZ6Qg9Ym3QAABR5JREFUSMeNlVVUG1EQhpcuxEspXqS0SKEtxQp1d3d332STTRpIQhIISQgJhODu7lAoDoUCpe7u7u7+1puGpqnCPOyZvffbOXPm/PsP9JfQgyCC+tmTABTOcbxDz/heENS7/1F+9nhvkHePG0wNDLbGWwdXL+rbLWvpmZHXD8+gMfBjTh+aSe6Gnn7lwQIOTR0c8wfX3PWgv7avbdKwf/ZoBp1Gp/PvuvXW3vw5ib7emnTW4OR+3D4jB9vjNJ/7gNvfWWeH/TO/JyYrsiKCRjVEZA3UB+96kON+DxOQ/NLE8PE5iUYgIXjFnCOlxEQMaSGVxjg4gxOnEycGz8bptuNjVx08LscIgrzH3umcn+KKtiBIyvzOO2O99aAdR8cF19oZalnCtvREUw79tCd5sow1g1UKM6kXqUx4T8wsi3sTjJ3yzDmmhenLXLpo8u45eG5y4Vvbk6kkC4LLtJMowkSQxmk4ggVJEG+7c6QpHT8vvW9X7/o7+3ELmiJi2mEzZJiz8cT6TBlanBk70cB5GGIGC1gRDdZ00yADLW1FL6gqhtvNXNG5S9gdSrk4M1qu7JAsmYshzDS4peoMrU/gT7qQdqYGZaYhxZmVbGJAm/CS/HloWyhRUlknQ9KYcExTwS80d3VNOxUZJpITYyspl0LbhArhpZCD9cRWEQuhYkNGMHToQ/2Cs6swJlb39CsllxdXX6IUKh/H5jbnSsPKjgmoaFQ1f8wRLR0UnGE/RcDEjj2jXG1WVTwUs8+zxfcrVO+vSsuOpVKxCfYZiQ0/aPKuxQbQ8lIz+DClxC8u+snlcJ7Yr1z1JPqUH0V+GDXbOwAib931Y4Imaq0NTIXPXY+N5L18GJ37SVWu+hwXff8l72Ds9XuwYIBaXPq6Shm4l+Vl/5QiOlV+uTk6YR9PxKsI9xNJny31ygK1e+nIRC1N97EGkFPI+jCpiHe5PCEy7oWqWSwRrpOvhFzcbTWMbm3ZJAOn1rUKpYIt/lDhW/5RHHteeWFN60qo98YJuoq1nK3uW5AabyspC1BcIEpOhft+SZAShYoLSvnmSfnYADUERP5jJn2h5XtsgCRuhYQqAvwTwn33+YWEKUI72HX5AtfSAZDe8F2DtPPm77afhl0EkthzuCQU0BWApgQIH9+KB0JhopMM7bJrdTRoleM2JAVNMyPF+wdoaz+XJpGoVAQ7WXUkcV7gT3oUZyi/ISIJAVKhgNp+4b4veCFhYVJw4locdSjZCp9cPUhLF9EZ3KKzURepMEtCDPP3VcWFx4UIiZIklIpFNfHpdEafIF2aRmOcrUmjohbT2WUllbmRvgfbythbQO3222fpDJoufaQPncYYuqoGtUEsCJZL6/3PR5b4syeSjZMQG/T2maGANlXT2v8S4AULWaUkCxfLyW8iW4kdka+nEMjxpL2NCwsYNBp+Q61PF43zyDg9Bm9+3NNySn78jMZUUkumqE4Gp7JmFOdP1vc8PpRrzj9+wPinCy8K1PiJ4aYbnTYpCCbDkBSbzhu2QJ1Gd82t8jI8TH51+OzvXoWbnXUOBkNW+0mWFwGcGOUVpU81/n3TOHb5oMt2FgYGjzau0Nif0Ss7Q3XB33hjjQHjHA5E5aOyIQc8CBrLdQSs3j92VG+3nNEjbkbdbBr9zm04ruvw37vh0QKOdeGIkckc80fX3KH/h7PT4BOjgCty8VZ5ux1MoO5Cf5naca2LAsEgehI+drX8o/0Nu+W0m6K/I9gGPd/dfx/EN/wN62AhsBWuAAAAAElFTkSuQmCC
">
</div>
</div>

# CodeLlama

[Code Llama](https://huggingface.co/papers/2308.12950) is a specialized family of large language models based on [Llama 2](./llama2) for coding tasks. It comes in different flavors - general code, Python-specific, and instruction-following variants - all available in 7B, 13B, 34B, and 70B parameter sizes. Code Llama models can generate, explain, and even fill in missing parts of your code (called "infilling"). It can also handle very long contexts with stable generation up to 100k tokens, even though it was trained on sequences of 16k tokens.

You can find all the original Code Llama checkpoints under the [Code Llama](https://huggingface.co/collections/meta-llama/code-llama-family-661da32d0a9d678b6f55b933) collection.

> [!TIP]
> Click on the Code Llama models in the right sidebar for more examples of how to apply Code Llama to different coding tasks.

The example below demonstrates how to generate code with [`Pipeline`], [`AutoModel`], or from the command line.

<hfoptions id="usage">
<hfoption id="Pipeline">

```py
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/CodeLlama-7b-hf",
    torch_dtype=torch.float16,
    device_map=0
)

# basic code generation
result = pipe("# Function to calculate the factorial of a number\ndef factorial(n):", max_new_tokens=256)
print(result[0]['generated_text'])

# infilling
infill_result = pipe("def remove_non_ascii(s: str) -> str:\n \"\"\" <FILL_ME>\n return result", max_new_tokens=200)
print(infill_result[0]['generated_text'])
```

</hfoption>
<hfoption id="AutoModel">

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/CodeLlama-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

# basic code generation
prompt = "# Function to calculate the factorial of a number\ndef factorial(n):"
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

output = model.generate(
    **input_ids,
    max_new_tokens=256,
    cache_implementation="static"
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# infilling
infill_prompt = "def remove_non_ascii(s: str) -> str:\n \"\"\" <FILL_ME>\n return result"
input_ids = tokenizer(infill_prompt, return_tensors="pt").to(model.device)

filled_output = model.generate(**input_ids, max_new_tokens=200)
filled_text = tokenizer.decode(filled_output[0], skip_special_tokens=True)
print(filled_text)
```

</hfoption>
<hfoption id="transformers-cli">

```bash
echo -e "# Function to calculate the factorial of a number\ndef factorial(n):" | transformers-cli run --task text-generation --model meta-llama/CodeLlama-7b-hf --device 0
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes) to quantize only the weights to 4-bits.

```py
# pip install bitsandbytes
import torch
from transformers import AutoModelForCausalLM, CodeLlamaTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-34b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/CodeLlama-34b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=bnb_config
)

prompt = "# Write a Python function to check if a string is a palindrome\ndef is_palindrome(s):"
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

output = model.generate(**input_ids, max_new_tokens=200, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Use the [AttentionMaskVisualizer](https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139) to better understand what tokens the model can and cannot attend to.

```py
from transformers.utils.attention_visualizer import AttentionMaskVisualizer

visualizer = AttentionMaskVisualizer("meta-llama/CodeLlama-7b-hf")
visualizer("""def func(a, b):
    return a + b""")
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/codellama-attn-mask.png"/>
</div>

## Notes

- Infilling is only available in the 7B and 13B base models, and not in the Python, Instruct, 34B, or 70B models.
- Use the `<FILL_ME>` token where you want your input to be filled. The tokenizer splits this token to create a formatted input string that follows the [original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself.

    ```py
    from transformers import LlamaForCausalLM, CodeLlamaTokenizer

    tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")
    model = LlamaForCausalLM.from_pretrained("meta-llama/CodeLlama-7b-hf")

    # the docstring is left for the model to fill in at the <FILL_ME> position
    PROMPT = '''def remove_non_ascii(s: str) -> str:
        """ <FILL_ME>
        return result
    '''
    input_ids = tokenizer(PROMPT, return_tensors="pt")["input_ids"]
    generated_ids = model.generate(input_ids, max_new_tokens=128)

    # decode only the newly generated tokens and splice them back into the prompt
    filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
    print(PROMPT.replace("<FILL_ME>", filling))
    ```
- Use `bfloat16` for further training or fine-tuning and `float16` for inference (see the precision sketch after this list).
- The `BOS` token is not used for infilling when encoding the prefix or suffix; it is only added at the beginning of each prompt (see the tokenizer sketch after this list).
- The tokenizer is a byte-pair encoding model based on [SentencePiece](https://github.com/google/sentencepiece). During decoding, if the first token is the start of a word (for example, "Banana"), the tokenizer doesn't prepend the prefix space to the string (see the decoding sketch after this list).
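
A minimal precision sketch for the `bfloat16`/`float16` note above. The fine-tuning loop is only a placeholder and the prompt is illustrative; neither comes from the original documentation.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")

# further training or fine-tuning: load the weights in bfloat16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/CodeLlama-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.train()
# ... placeholder for your fine-tuning loop ...

# inference: reload (or cast) the weights in float16
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/CodeLlama-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

prompt = "# Function to reverse a string\ndef reverse_string(s):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```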

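
A tokenizer sketch for the `BOS` note above. The exact special tokens in the infilling format are an implementation detail of [`CodeLlamaTokenizer`], so treat the printed layout as an inspection aid rather than a specification.

```py
from transformers import CodeLlamaTokenizer

tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")

infill_prompt = "def remove_non_ascii(s: str) -> str:\n    \"\"\" <FILL_ME>\n    return result"
ids = tokenizer(infill_prompt)["input_ids"]

# inspect how the prompt is split around <FILL_ME> into the infilling format
print(tokenizer.convert_ids_to_tokens(ids))

# BOS should show up once, at the very beginning of the prompt
print(ids.count(tokenizer.bos_token_id), ids[0] == tokenizer.bos_token_id)
```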
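
A decoding sketch for the SentencePiece note above; "Banana split" is just an illustrative input.

```py
from transformers import CodeLlamaTokenizer

tokenizer = CodeLlamaTokenizer.from_pretrained("meta-llama/CodeLlama-7b-hf")

ids = tokenizer("Banana split", add_special_tokens=False)["input_ids"]

# the first token carries SentencePiece's word-start marker "▁" internally
print(tokenizer.convert_ids_to_tokens(ids[:1]))

# decoding does not prepend that prefix space when the sequence starts a word
print(tokenizer.decode(ids))  # "Banana split", no leading space
```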
## CodeLlamaTokenizer
