![Python](https://img.shields.io/pypi/pyversions/chatglm-cpp)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue)](LICENSE)

- C++ implementation of [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B), [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B), [ChatGLM3](https://github.com/THUDM/ChatGLM3), [GLM-4](https://github.com/THUDM/GLM-4) and more LLMs for real-time chatting on your MacBook.
+ C++ implementation of [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B), [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B), [ChatGLM3](https://github.com/THUDM/ChatGLM3) and [GLM-4](https://github.com/THUDM/GLM-4) for real-time chatting on your MacBook.

![demo](docs/demo.gif)
@@ -22,9 +22,7 @@ Highlights:
Support Matrix:

* Hardware: x86/arm CPU, NVIDIA GPU, Apple Silicon GPU
* Platforms: Linux, macOS, Windows
- * Models: [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B), [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B), [ChatGLM3](https://github.com/THUDM/ChatGLM3), [GLM-4](https://github.com/THUDM/GLM-4), [CodeGeeX2](https://github.com/THUDM/CodeGeeX2), [Baichuan-13B](https://github.com/baichuan-inc/Baichuan-13B), [Baichuan-7B](https://github.com/baichuan-inc/Baichuan-7B), [Baichuan2](https://github.com/baichuan-inc/Baichuan2), [InternLM](https://github.com/InternLM/InternLM)
-
- **NOTE**: Baichuan & InternLM model series are deprecated in favor of [llama.cpp](https://github.com/ggerganov/llama.cpp).
+ * Models: [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B), [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B), [ChatGLM3](https://github.com/THUDM/ChatGLM3), [GLM-4](https://github.com/THUDM/GLM-4), [CodeGeeX2](https://github.com/THUDM/CodeGeeX2)
## Getting Started
@@ -59,7 +57,6 @@ The original model (`-i <model_name_or_path>`) can be a Hugging Face model name
* ChatGLM3-6B: `THUDM/chatglm3-6b`
* ChatGLM4-9B: `THUDM/glm-4-9b-chat`
* CodeGeeX2: `THUDM/codegeex2-6b`, `THUDM/codegeex2-6b-int4`
- * Baichuan & Baichuan2: `baichuan-inc/Baichuan-13B-Chat`, `baichuan-inc/Baichuan2-7B-Chat`, `baichuan-inc/Baichuan2-13B-Chat`

You are free to try any of the below quantization types by specifying `-t <type>`:

* `q4_0`: 4-bit integer quantization with fp16 scales.
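
For instance, converting ChatGLM3-6B with `q4_0` quantization follows the same pattern as the conversion commands elsewhere in this document (the output path below is illustrative):

```sh
# Convert the Hugging Face checkpoint into a 4-bit GGML file (output path is illustrative)
python3 chatglm_cpp/convert.py -i THUDM/chatglm3-6b -t q4_0 -o models/chatglm3-ggml.bin
```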
@@ -212,56 +209,6 @@ print(bubble_sort([5, 4, 3, 2, 1]))
```
</details>

- <details>
- <summary>Baichuan-13B-Chat</summary>
-
- ```sh
- python3 chatglm_cpp/convert.py -i baichuan-inc/Baichuan-13B-Chat -t q4_0 -o models/baichuan-13b-chat-ggml.bin
- ./build/bin/main -m models/baichuan-13b-chat-ggml.bin -p 你好 --top_k 5 --top_p 0.85 --temp 0.3 --repeat_penalty 1.1
- # 你好!有什么我可以帮助你的吗?
- ```
- </details>
-
- <details>
- <summary>Baichuan2-7B-Chat</summary>
-
- ```sh
- python3 chatglm_cpp/convert.py -i baichuan-inc/Baichuan2-7B-Chat -t q4_0 -o models/baichuan2-7b-chat-ggml.bin
- ./build/bin/main -m models/baichuan2-7b-chat-ggml.bin -p 你好 --top_k 5 --top_p 0.85 --temp 0.3 --repeat_penalty 1.05
- # 你好!很高兴为您提供帮助。请问有什么问题我可以帮您解答?
- ```
- </details>
-
- <details>
- <summary>Baichuan2-13B-Chat</summary>
-
- ```sh
- python3 chatglm_cpp/convert.py -i baichuan-inc/Baichuan2-13B-Chat -t q4_0 -o models/baichuan2-13b-chat-ggml.bin
- ./build/bin/main -m models/baichuan2-13b-chat-ggml.bin -p 你好 --top_k 5 --top_p 0.85 --temp 0.3 --repeat_penalty 1.05
- # 你好!今天我能为您提供什么帮助?
- ```
- </details>
-
- <details>
- <summary>InternLM-Chat-7B</summary>
-
- ```sh
- python3 chatglm_cpp/convert.py -i internlm/internlm-chat-7b -t q4_0 -o models/internlm-chat-7b-ggml.bin
- ./build/bin/main -m models/internlm-chat-7b-ggml.bin -p 你好 --top_p 0.8 --temp 0.8
- # 你好,我是书生·浦语,有什么可以帮助你的吗?
- ```
- </details>
-
- <details>
- <summary>InternLM-Chat-20B</summary>
-
- ```sh
- python3 chatglm_cpp/convert.py -i internlm/internlm-chat-20b -t q4_0 -o models/internlm-chat-20b-ggml.bin
- ./build/bin/main -m models/internlm-chat-20b-ggml.bin -p 你好 --top_p 0.8 --temp 0.8
- # 你好!有什么我可以帮到你的吗?
- ```
- </details>
-
## Using BLAS

A BLAS library can be integrated to further accelerate matrix multiplication. However, in some cases using BLAS may cause performance degradation, so whether to turn it on should be decided by benchmarking.
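
One straightforward way to benchmark, using only flags already shown in this document (build-directory and model-file names below are illustrative), is to build twice and time the same prompt against both binaries:

```sh
# Build once with OpenBLAS and once without (directory names are illustrative)
cmake -B build-openblas -DGGML_OPENBLAS=ON && cmake --build build-openblas -j
cmake -B build-plain && cmake --build build-plain -j

# Run the same prompt through both builds and compare wall-clock time
time ./build-openblas/bin/main -m models/chatglm-ggml.bin -p 你好
time ./build-plain/bin/main -m models/chatglm-ggml.bin -p 你好
```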
@@ -277,17 +224,17 @@ OpenBLAS provides acceleration on CPU. Add the CMake flag `-DGGML_OPENBLAS=ON` t
cmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j
```

- **cuBLAS**
+ **CUDA**

- cuBLAS uses NVIDIA GPU to accelerate BLAS. Add the CMake flag `-DGGML_CUBLAS=ON` to enable it.
+ CUDA accelerates model inference on NVIDIA GPU. Add the CMake flag `-DGGML_CUDA=ON` to enable it.

```sh
- cmake -B build -DGGML_CUBLAS=ON && cmake --build build -j
+ cmake -B build -DGGML_CUDA=ON && cmake --build build -j
```

- By default, all kernels will be compiled for all possible CUDA architectures, which takes some time. To run on a specific type of device, you may specify `CUDA_ARCHITECTURES` to speed up the nvcc compilation. For example:
+ By default, all kernels will be compiled for all possible CUDA architectures, which takes some time. To run on a specific type of device, you may specify `CMAKE_CUDA_ARCHITECTURES` to speed up the nvcc compilation. For example:

```sh
- cmake -B build -DGGML_CUBLAS=ON -DCUDA_ARCHITECTURES="80"       # for A100
- cmake -B build -DGGML_CUBLAS=ON -DCUDA_ARCHITECTURES="70;75"    # compatible with both V100 and T4
+ cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="80"       # for A100
+ cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="70;75"    # compatible with both V100 and T4
```

To find out the CUDA architecture of your GPU device, see [Your GPU Compute Capability](https://developer.nvidia.com/cuda-gpus).
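
On a machine with a sufficiently recent driver, you may also be able to query the compute capability locally; the `compute_cap` query field is an assumption that depends on your `nvidia-smi` version:

```sh
# Prints e.g. "8.0" for A100; the compute_cap field requires a recent driver
nvidia-smi --query-gpu=name,compute_cap --format=csv
```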
@@ -310,9 +257,9 @@ Install from PyPI (recommended): will trigger compilation on your platform.
pip install -U chatglm-cpp
```

- To enable cuBLAS acceleration on NVIDIA GPU:
+ To enable CUDA on NVIDIA GPU:

```sh
- CMAKE_ARGS="-DGGML_CUBLAS=ON" pip install -U chatglm-cpp
+ CMAKE_ARGS="-DGGML_CUDA=ON" pip install -U chatglm-cpp
```

To enable Metal on Apple silicon devices:
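
A minimal sketch of the corresponding command, assuming the project exposes the standard ggml `GGML_METAL` CMake option (verify the flag name against the project's CMake options):

```sh
# Flag name is assumed from ggml's Metal backend option
CMAKE_ARGS="-DGGML_METAL=ON" pip install -U chatglm-cpp
```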
@@ -426,51 +373,6 @@ python3 web_demo.py -m ../models/codegeex2-ggml.bin --temp 0 --max_length 512 --
```
</details>

- <details>
- <summary>Baichuan-13B-Chat</summary>
-
- ```sh
- python3 cli_demo.py -m ../models/baichuan-13b-chat-ggml.bin -p 你好 --top_k 5 --top_p 0.85 --temp 0.3 --repeat_penalty 1.1   # CLI demo
- python3 web_demo.py -m ../models/baichuan-13b-chat-ggml.bin --top_k 5 --top_p 0.85 --temp 0.3 --repeat_penalty 1.1           # web demo
- ```
- </details>
-
- <details>
- <summary>Baichuan2-7B-Chat</summary>
-
- ```sh
- python3 cli_demo.py -m ../models/baichuan2-7b-chat-ggml.bin -p 你好 --top_k 5 --top_p 0.85 --temp 0.3 --repeat_penalty 1.05  # CLI demo
- python3 web_demo.py -m ../models/baichuan2-7b-chat-ggml.bin --top_k 5 --top_p 0.85 --temp 0.3 --repeat_penalty 1.05          # web demo
- ```
- </details>
-
- <details>
- <summary>Baichuan2-13B-Chat</summary>
-
- ```sh
- python3 cli_demo.py -m ../models/baichuan2-13b-chat-ggml.bin -p 你好 --top_k 5 --top_p 0.85 --temp 0.3 --repeat_penalty 1.05 # CLI demo
- python3 web_demo.py -m ../models/baichuan2-13b-chat-ggml.bin --top_k 5 --top_p 0.85 --temp 0.3 --repeat_penalty 1.05         # web demo
- ```
- </details>
-
- <details>
- <summary>InternLM-Chat-7B</summary>
-
- ```sh
- python3 cli_demo.py -m ../models/internlm-chat-7b-ggml.bin -p 你好 --top_p 0.8 --temp 0.8   # CLI demo
- python3 web_demo.py -m ../models/internlm-chat-7b-ggml.bin --top_p 0.8 --temp 0.8           # web demo
- ```
- </details>
-
- <details>
- <summary>InternLM-Chat-20B</summary>
-
- ```sh
- python3 cli_demo.py -m ../models/internlm-chat-20b-ggml.bin -p 你好 --top_p 0.8 --temp 0.8  # CLI demo
- python3 web_demo.py -m ../models/internlm-chat-20b-ggml.bin --top_p 0.8 --temp 0.8          # web demo
- ```
- </details>
-
**Converting Hugging Face LLMs at Runtime**

Sometimes it might be inconvenient to convert and save the intermediate GGML models beforehand. Here is an option to directly load from the original Hugging Face model, quantize it into GGML format within a minute, and start serving. All you need is to replace the GGML model path with the Hugging Face model name or path.
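
For example, a minimal Python sketch of this workflow (the `dtype` keyword for on-the-fly quantization is an assumption here; verify it against the `Pipeline` signature of your installed chatglm-cpp version):

```python
import chatglm_cpp

# Load the original Hugging Face model and quantize it at load time.
# "dtype" is assumed; check chatglm_cpp.Pipeline in your installed version.
pipeline = chatglm_cpp.Pipeline("THUDM/chatglm3-6b", dtype="q4_0")

messages = [chatglm_cpp.ChatMessage(role="user", content="你好")]
print(pipeline.chat(messages).content)
```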
@@ -579,7 +481,7 @@ For CUDA support, make sure [nvidia-docker](https://github.com/NVIDIA/nvidia-doc
```sh
docker build . --network=host -t chatglm.cpp-cuda \
    --build-arg BASE_IMAGE=nvidia/cuda:12.2.0-devel-ubuntu20.04 \
-     --build-arg CMAKE_ARGS="-DGGML_CUBLAS=ON -DCUDA_ARCHITECTURES=80"
+     --build-arg CMAKE_ARGS="-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80"
docker run -it --rm --gpus all -v $PWD/models:/chatglm.cpp/models chatglm.cpp-cuda \
    ./build/bin/main -m models/chatglm-ggml.bin -p "你好"
```
@@ -631,45 +533,12 @@ ChatGLM2-6B / ChatGLM3-6B / CodeGeeX2:
ChatGLM4-9B:

- |                                | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |
- |--------------------------------|-------|-------|-------|-------|-------|-------|
- | ms/token (CPU @ Platinum 8260) | 105   | 105   | 122   | 134   | 158   | 279   |
- | ms/token (CUDA @ V100 SXM2)    | 12.1  | 12.5  | 13.8  | 13.9  | 17.7  | 27.7  |
- | file size                      | 5.0G  | 5.5G  | 6.1G  | 6.6G  | 9.4G  | 18G   |
-
- Baichuan-7B / Baichuan2-7B:
-
- |                                | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |
- |--------------------------------|-------|-------|-------|-------|-------|-------|
- | ms/token (CPU @ Platinum 8260) | 85.3  | 94.8  | 103.4 | 109.6 | 136.8 | 248.5 |
- | ms/token (CUDA @ V100 SXM2)    | 8.7   | 9.2   | 10.2  | 10.3  | 13.2  | 21.0  |
- | ms/token (MPS @ M2 Ultra)      | 11.3  | 12.0  | N/A   | N/A   | 16.4  | 25.6  |
- | file size                      | 4.0G  | 4.4G  | 4.9G  | 5.3G  | 7.5G  | 14G   |
- | mem usage                      | 4.5G  | 4.9G  | 5.3G  | 5.7G  | 7.8G  | 14G   |
-
- Baichuan-13B / Baichuan2-13B:
-
- |                                | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |
- |--------------------------------|-------|-------|-------|-------|-------|-------|
- | ms/token (CPU @ Platinum 8260) | 161.7 | 175.8 | 189.9 | 192.3 | 255.6 | 459.6 |
- | ms/token (CUDA @ V100 SXM2)    | 13.7  | 15.1  | 16.3  | 16.9  | 21.9  | 36.8  |
- | ms/token (MPS @ M2 Ultra)      | 18.2  | 18.8  | N/A   | N/A   | 27.2  | 44.4  |
- | file size                      | 7.0G  | 7.8G  | 8.5G  | 9.3G  | 14G   | 25G   |
- | mem usage                      | 7.8G  | 8.8G  | 9.5G  | 10G   | 14G   | 25G   |
-
- InternLM-7B:
-
- |                                | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |
- |--------------------------------|-------|-------|-------|-------|-------|-------|
- | ms/token (CPU @ Platinum 8260) | 85.3  | 90.1  | 103.5 | 112.5 | 137.3 | 232.2 |
- | ms/token (CUDA @ V100 SXM2)    | 9.1   | 9.4   | 10.5  | 10.5  | 13.3  | 21.1  |
-
- InternLM-20B:
-
- |                                | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |
- |--------------------------------|-------|-------|-------|-------|-------|-------|
- | ms/token (CPU @ Platinum 8260) | 230.0 | 236.7 | 276.6 | 290.6 | 357.1 | N/A   |
- | ms/token (CUDA @ V100 SXM2)    | 21.6  | 23.2  | 25.0  | 25.9  | 33.4  | N/A   |
+ |                                | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F16  |
+ |--------------------------------|------|------|------|------|------|------|
+ | ms/token (CPU @ Platinum 8260) | 105  | 105  | 122  | 134  | 158  | 279  |
+ | ms/token (CUDA @ V100 SXM2)    | 12.1 | 12.5 | 13.8 | 13.9 | 17.7 | 27.7 |
+ | ms/token (MPS @ M2 Ultra)      | 14.4 | 15.3 | 19.6 | 20.1 | 20.7 | 32.4 |
+ | file size                      | 5.0G | 5.5G | 6.1G | 6.6G | 9.4G | 18G  |
## Model Quality