Commit 2ea8e7b

Update Changelog and use gemma-2-9b-it-IQ4_XS.gguf model across all examples
1 parent 417d993 commit 2ea8e7b

4 files changed (+43, -19 lines changed)


CHANGELOG.md (+24)
@@ -4,6 +4,30 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.21.0] - 2024-08-03
+
+### Added
+- [Server] Add -help option
+- [Server] Add -chatTemplate option
+- [Server] Add human readable file size
+- [Benchmark] Add llama-bench example
+
+### Changed
+- [Build] Update torch to 2.2.1+cu121
+- [Build] Update OpenBLAS to 0.3.27
+- [Build] Update Python to 3.12
+- [Server] Default KV cache type to f16
+- [Documentation] Use gemma-2-9b-it-IQ4_XS.gguf model across all examples
+
+### Fixed
+- [Build] Fix CUDA build after renaming in upstream llama.cpp
+- [Build] Fix gguf_dump.py after renaming in upstream llama.cpp
+- [Build] Add missing tiktoken package to support GLM models
+- [Build] Fix wikitext URI
+
+### Removed
+- [Server] Remove broken chrome startup
+
 ## [1.20.0] - 2024-06-13
 
 ### Changed
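
For reference, the 1.21.0 entries above mention a new -help option for the server script and a llama-bench example. A minimal sketch of how these might be invoked, assuming -help maps to the script's comment-based help and that llama-bench sits next to the other Release binaries referenced in the README below (illustrative only, not part of this diff):

```PowerShell
# Show the server script manual (option added in 1.21.0).
.\examples\server.ps1 -help

# Benchmark the model with llama.cpp's llama-bench tool
# (binary path assumed to match the other Release binaries).
./vendor/llama.cpp/build/bin/Release/llama-bench `
    --model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
    --n-gpu-layers 33
```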

README.md (+13, -13)
@@ -84,9 +84,9 @@ To build llama.cpp binaries for a Windows environment with the best available BL
 
 ### 7. Download a large language model
 
-Download a large language model (LLM) with weights in the GGUF format into the `./vendor/llama.cpp/models` directory. You can for example download the [openchat-3.6-8b-20240522](https://huggingface.co/openchat/openchat-3.6-8b-20240522) 8B model in a quantized GGUF format:
+Download a large language model (LLM) with weights in the GGUF format into the `./vendor/llama.cpp/models` directory. You can for example download the [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it) model in a quantized GGUF format:
 
-* https://huggingface.co/bartowski/openchat-3.6-8b-20240522-GGUF/blob/main/openchat-3.6-8b-20240522-Q5_K_M.gguf
+* https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-IQ4_XS.gguf
 
 > [!TIP]
 > See the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard) for best in class open source LLMs.
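
A minimal sketch of fetching that GGUF file from PowerShell, using the URL and models directory from the hunk above (Invoke-WebRequest is just one way to download it; this command is not part of the commit):

```PowerShell
# Download the quantized gemma-2-9b-it model into llama.cpp's models directory.
Invoke-WebRequest `
    -Uri "https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-IQ4_XS.gguf" `
    -OutFile "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf"
```
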
@@ -98,7 +98,7 @@ Download a large language model (LLM) with weights in the GGUF format into the `
 You can easily chat with a specific model by using the [.\examples\server.ps1](./examples/server.ps1) script:
 
 ```PowerShell
-.\examples\server.ps1 -model ".\vendor\llama.cpp\models\openchat-3.6-8b-20240522-Q5_K_M.gguf"
+.\examples\server.ps1 -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf"
 ```
 
 > [!NOTE]
@@ -116,12 +116,12 @@ You can now chat with the model:
 
 ```PowerShell
 ./vendor/llama.cpp/build/bin/Release/llama-cli `
---model "./vendor/llama.cpp/models/openchat-3.6-8b-20240522-Q5_K_M.gguf" `
+--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
 --ctx-size 8192 `
 --threads 16 `
 --n-gpu-layers 33 `
 --reverse-prompt '[[USER_NAME]]:' `
---prompt-cache "./cache/openchat-3.6-8b-20240522-Q5_K_M.gguf.prompt" `
+--prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
 --file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
 --color `
 --interactive
@@ -133,7 +133,7 @@ You can start llama.cpp as a webserver:
 
 ```PowerShell
 ./vendor/llama.cpp/build/bin/Release/llama-server `
---model "./vendor/llama.cpp/models/openchat-3.6-8b-20240522-Q5_K_M.gguf" `
+--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
 --ctx-size 8192 `
 --threads 16 `
 --n-gpu-layers 33
@@ -160,14 +160,14 @@ To extend the context to 32k execute the following:
 
 ```PowerShell
 ./vendor/llama.cpp/build/bin/Release/llama-cli `
---model "./vendor/llama.cpp/models/openchat-3.6-8b-20240522-Q5_K_M.gguf" `
+--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
 --ctx-size 32768 `
 --rope-freq-scale 0.25 `
 --rope-freq-base 40000 `
 --threads 16 `
 --n-gpu-layers 33 `
 --reverse-prompt '[[USER_NAME]]:' `
---prompt-cache "./cache/openchat-3.6-8b-20240522-Q5_K_M.gguf.prompt" `
+--prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
 --file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
 --color `
 --interactive
@@ -179,11 +179,11 @@ You can enforce a specific grammar for the response generation. The following wi
 
 ```PowerShell
 ./vendor/llama.cpp/build/bin/Release/llama-cli `
---model "./vendor/llama.cpp/models/openchat-3.6-8b-20240522-Q5_K_M.gguf" `
+--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
 --ctx-size 8192 `
 --threads 16 `
 --n-gpu-layers 33 `
---prompt-cache "./cache/openchat-3.6-8b-20240522-Q5_K_M.gguf.prompt" `
+--prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
 --prompt "The scientific classification (Taxonomy) of a Llama: " `
 --grammar-file "./vendor/llama.cpp/grammars/json.gbnf"
 --color
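
The json.gbnf grammar referenced above ships with llama.cpp. As a sketch, a custom grammar file can be supplied the same way; the yes_no.gbnf name and rule below are illustrative assumptions, not part of this commit:

```PowerShell
# Write a tiny GBNF grammar that restricts the answer to "yes" or "no",
# then pass it to llama-cli via --grammar-file instead of json.gbnf.
Set-Content -Path ".\vendor\llama.cpp\grammars\yes_no.gbnf" -Value 'root ::= ("yes" | "no")'
```
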
@@ -195,7 +195,7 @@ Execute the following to measure the perplexity of the GGML formatted model:
 
 ```PowerShell
 ./vendor/llama.cpp/build/bin/Release/llama-perplexity `
---model "./vendor/llama.cpp/models/openchat-3.6-8b-20240522-Q5_K_M.gguf" `
+--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
 --ctx-size 8192 `
 --threads 16 `
 --n-gpu-layers 33 `
@@ -208,15 +208,15 @@ You can easily count the tokens of a prompt for a specific model by using the [.
 
 ```PowerShell
 .\examples\count_tokens.ps1 `
--model ".\vendor\llama.cpp\models\openchat-3.6-8b-20240522-Q5_K_M.gguf" `
+-model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
 -file ".\prompts\chat_with_llm.txt"
 ```
 
 To inspect the actual tokenization result you can use the `-debug` flag:
 
 ```PowerShell
 .\examples\count_tokens.ps1 `
--model ".\vendor\llama.cpp\models\openchat-3.6-8b-20240522-Q5_K_M.gguf" `
+-model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
 -prompt "Hello Word!" `
 -debug
 ```

examples/server.ps1 (+5, -5)
@@ -35,19 +35,19 @@ Increases the verbosity of the llama.cpp server.
 Shows the manual on how to use this script.
 
 .EXAMPLE
-.\server.ps1 -model "..\vendor\llama.cpp\models\openchat-3.6-8b-20240522-Q5_K_M.gguf"
+.\server.ps1 -model "..\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf"
 
 .EXAMPLE
-.\server.ps1 -model "C:\models\openchat-3.6-8b-20240522-Q5_K_M.gguf" -chatTemplate "llama3" -parallel 4
+.\server.ps1 -model "C:\models\gemma-2-9b-it-IQ4_XS.gguf" -chatTemplate "llama3" -parallel 4
 
 .EXAMPLE
-.\server.ps1 -model "C:\models\openchat-3.6-8b-20240522-Q5_K_M.gguf" -contextSize 4096 -numberOfGPULayers 10
+.\server.ps1 -model "C:\models\gemma-2-9b-it-IQ4_XS.gguf" -contextSize 4096 -numberOfGPULayers 10
 
 .EXAMPLE
-.\server.ps1 -model "C:\models\openchat-3.6-8b-20240522-Q5_K_M.gguf" -port 8081 -kvCacheDataType q8_0
+.\server.ps1 -model "C:\models\gemma-2-9b-it-IQ4_XS.gguf" -port 8081 -kvCacheDataType q8_0
 
 .EXAMPLE
-.\server.ps1 -model "..\vendor\llama.cpp\models\openchat-3.6-8b-20240522-Q5_K_M.gguf" -verbose
+.\server.ps1 -model "..\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" -verbose
 #>
 
 Param (
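
Since the examples now use a Gemma model, a hedged sketch pairing it with the -chatTemplate option documented above; the "gemma" template name refers to llama.cpp's built-in chat templates and is an assumption, not part of this diff:

```PowerShell
# Serve the Gemma model with a matching chat template (template name assumed).
.\server.ps1 -model "..\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" -chatTemplate "gemma"
```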

vendor/llama.cpp

0 commit comments