Commit d00be76

Merge pull request #1082 from zigabrencic/docs/open-llm-suport
Support for Open LLMs
2 parents 2bda71b + 4e7b072 commit d00be76

5 files changed: +198 −6 lines changed

Diff for: docs/examples/open_llms/README.md

+56
@@ -0,0 +1,56 @@
# Test that the Open LLM is running

First, start the server using the CPU only:

```bash
export model_path="TheBloke/CodeLlama-13B-GGUF/codellama-13b.Q8_0.gguf"
python -m llama_cpp.server --model $model_path
```

Or with GPU support (recommended):

```bash
python -m llama_cpp.server --model TheBloke/CodeLlama-13B-GGUF/codellama-13b.Q8_0.gguf --n_gpu_layers 1
```

If you have more GPU layers available, set `--n_gpu_layers` to a higher number.

To find the number of available layers, run the above command and look for `llm_load_tensors: offloaded 1/41 layers to GPU` in the output.
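
For example, if the log reports 41 layers in total, you could try offloading all of them (a sketch; the actual layer count and how many fit depend on your model and GPU memory):

```bash
# Sketch: offload every layer reported by llm_load_tensors (adjust 41 to your model / VRAM)
python -m llama_cpp.server --model $model_path --n_gpu_layers 41
```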

## Test API call

Set the environment variables:

```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="sk-xxx"
export MODEL_NAME="CodeLlama"
```

Then ping the model via `python` using the `OpenAI` API:

```bash
python examples/open_llms/openai_api_interface.py
```

If you're not using `CodeLlama`, make sure to change the `MODEL_NAME` parameter.

Or using `curl`:

```bash
curl --request POST \
  --url http://localhost:8000/v1/chat/completions \
  --header "Content-Type: application/json" \
  --data '{ "model": "CodeLlama", "messages": [{"role": "user", "content": "Who are you?"}], "max_tokens": 60}'
```

If this works, also make sure that the `langchain` interface works, since that's how `gpte` interacts with LLMs.

## Langchain test

```bash
export MODEL_NAME="CodeLlama"
python examples/open_llms/langchain_interface.py
```

That's it 🤓 time to go back [to the guide](/docs/open_models.md#running-the-example) and give `gpte` a try.

Diff for: docs/examples/open_llms/langchain_interface.py

+17
@@ -0,0 +1,17 @@
import os

from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_openai import ChatOpenAI

# The ChatOpenAI client reads MODEL_NAME here and the connection settings
# (OPENAI_API_BASE, OPENAI_API_KEY) from the environment variables set earlier,
# so it talks to the local llama-cpp-python server and streams the reply to stdout.
model = ChatOpenAI(
    model=os.getenv("MODEL_NAME"),
    temperature=0.1,
    callbacks=[StreamingStdOutCallbackHandler()],
    streaming=True,
)

prompt = (
    "Provide me with only the code for a simple python function that sums two numbers."
)

model.invoke(prompt)

Diff for: docs/examples/open_llms/openai_api_interface.py

+21
@@ -0,0 +1,21 @@
import os

from openai import OpenAI

# Point the OpenAI client at the local llama-cpp-python server using the
# OPENAI_API_BASE and OPENAI_API_KEY environment variables set earlier.
client = OpenAI(
    base_url=os.getenv("OPENAI_API_BASE"), api_key=os.getenv("OPENAI_API_KEY")
)

# Ask the locally served model (MODEL_NAME) for a short completion.
response = client.chat.completions.create(
    model=os.getenv("MODEL_NAME"),
    messages=[
        {
            "role": "user",
            "content": "Provide me with only the code for a simple python function that sums two numbers.",
        },
    ],
    temperature=0.7,
    max_tokens=200,
)

print(response.choices[0].message.content)

Diff for: docs/open_models.md

+101 −6
@@ -1,24 +1,119 @@
Using with open/local models
============================

-You can integrate `gpt-engineer` with open-source models by leveraging an OpenAI-compatible API. One such API is provided by the [text-generator-ui _extension_ openai](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/README.md).

**Use `gpte` first with OpenAI models to get a feel for the `gpte` tool.**

**Then go play with the experimental Open LLMs 🐉 support and try not to get 🔥!!**

At the moment, the best option for coding is still the `gpt-4` models provided by OpenAI. But open models are catching up and are a good free and privacy-oriented alternative if you possess the proper hardware.

You can integrate `gpt-engineer` with open-source models by leveraging an OpenAI-compatible API.

We provide the minimal and cleanest solution below. It is not the only way to use open/local models, but it is the one we tested and would recommend to most users.

More details on why this solution is recommended can be found in [this blog post](https://zigabrencic.com/blog/2024-02-21).

Setup
-----

-To get started, first set up the API with the Runpod template, as per the [instructions](https://github.com/oobabooga/text-generation-webui/blob/main/extensions/openai/README.md).

As the inference engine, we recommend [llama.cpp](https://github.com/ggerganov/llama.cpp) with its `python` bindings, `llama-cpp-python`.

We choose `llama.cpp` because:

1. It supports the largest number of hardware acceleration backends.
2. It supports a diverse set of open LLMs.
3. `llama-cpp-python` is written in `python` and sits directly on top of the `llama.cpp` inference engine.
4. It supports the `OpenAI` API and the `langchain` interface.

To install `llama-cpp-python`, follow the official [installation docs](https://llama-cpp-python.readthedocs.io/en/latest/), and [these docs](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/) for macOS with Metal support.

If you want to benefit from proper hardware acceleration on your machine, make sure to set the proper compiler flags before installing the package:

- `linux`: `CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"`
- `macos` with Metal support: `CMAKE_ARGS="-DLLAMA_METAL=on"`
- `windows`: `$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"`

This enables the `pip` installer to compile `llama.cpp` with the proper hardware acceleration backend.

Then run:

```bash
pip install llama-cpp-python
```
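
For example, on macOS with Metal support the compiler flags and the install step can be combined into a single command (a sketch; substitute the `CMAKE_ARGS` value for your platform from the list above):

```bash
# Sketch: compile llama.cpp with Metal acceleration while installing the python bindings
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```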

For our use case we also need the web server that the `llama-cpp-python` library provides. To install it:

```bash
pip install 'llama-cpp-python[server]'
```

For detailed usage, consult the [`llama-cpp-python` server docs](https://llama-cpp-python.readthedocs.io/en/latest/server/).

Before we proceed, we need to obtain the model weights in the `gguf` format. That should be a single file on your disk.

In case you have weights in another format, check the `llama-cpp-python` docs for conversion to the `gguf` format.

Models in other formats (`ggml`, `.safetensors`, etc.) won't work with the solution described below without prior conversion to the `gguf` file format!
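
As a rough illustration only (script names and flags differ between `llama.cpp` releases, so treat this as an assumption and double-check the conversion docs), converting and quantising a Hugging Face checkpoint could look like:

```bash
# Hypothetical sketch: convert a Hugging Face checkpoint directory to gguf with llama.cpp's
# convert script, then quantise it; exact script names and flags depend on your llama.cpp version
python convert.py /path/to/hf-model --outfile codellama-13b.f16.gguf
./quantize codellama-13b.f16.gguf codellama-13b.Q6_K.gguf Q6_K
```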

Which open model to use?
==================

Your best choice would be:

- CodeLlama 70B
- Mixtral 8x7B

We are still testing this part, but the larger the model you can run, the better. The responses might be slower in terms of tokens/s, but the code quality will be higher.

For testing that the open LLM `gpte` setup works, we recommend starting with a smaller model. You can download the weights of [CodeLlama-13B-GGUF by `TheBloke`](https://huggingface.co/TheBloke/CodeLlama-13B-GGUF); choose the largest model version you can run (for example `Q6_K`), since quantisation degrades LLM performance.

Feel free to try out larger models on your hardware and see what happens.
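
As an illustration (assuming you download with the `huggingface_hub` CLI; any other download method works just as well), fetching a single quantised file could look like this:

```bash
# Hypothetical sketch: grab one gguf file from the TheBloke/CodeLlama-13B-GGUF repo
pip install huggingface_hub
huggingface-cli download TheBloke/CodeLlama-13B-GGUF codellama-13b.Q6_K.gguf --local-dir ./models
```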

Running the Example
==================

To see that your setup works, check the [test open LLM setup](examples/open_llms/README.md).

If the above tests work, proceed 😉

To check that `gpte` works with `CodeLlama`, we recommend creating a project with the following `prompt` file content:

```
Write a python script that sums up two numbers. Provide only the `sum_two_numbers` function and nothing else.

Provide two tests:

assert(sum_two_numbers(100, 10) == 110)
assert(sum_two_numbers(10.1, 10) == 20.1)
```
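
For instance (a sketch; the directory name is arbitrary), a `gpte` project is just a directory containing a file named `prompt` with the text above:

```bash
# Sketch: create a project directory with a `prompt` file for gpte
mkdir -p sum_project
cat > sum_project/prompt << 'EOF'
Write a python script that sums up two numbers. Provide only the `sum_two_numbers` function and nothing else.

Provide two tests:

assert(sum_two_numbers(100, 10) == 110)
assert(sum_two_numbers(10.1, 10) == 20.1)
EOF
```

The `sum_project` directory is then the `<project_dir>` passed to `gpte` below.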

Now run the LLM in a separate terminal:

```bash
python -m llama_cpp.server --model $model_path --n_batch 256 --n_gpu_layers 30
```

-Once the API is set up, you can find the host and the exposed TCP port by checking your Runpod dashboard.
-Then, you can use the port and host to run the following example using WizardCoder-Python-34B hosted on Runpod:
-OPENAI_API_BASE=http://<host>:<port>/v1 python -m gpt_engineer.cli.main benchmark/pomodoro_timer --steps benchmark TheBloke_WizardCoder-Python-34B-V1.0-GPTQ

Then, in another terminal window, set the following environment variables:

```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="sk-xxx"
export MODEL_NAME="CodeLlama"
export LOCAL_MODEL=true
```

And run `gpt-engineer` with the following command:

```bash
gpte <project_dir> $MODEL_NAME --lite --temperature 0.1
```

The `--lite` mode is needed for now, since open models currently behave worse when given too many instructions. The temperature is set to `0.1` to get the most consistent results.

That's it.

*If something doesn't work as expected, or you figure out how to improve the open LLM support, please let us know.*

Using Azure models
==================

Diff for: gpt_engineer/applications/cli/main.py

+3
@@ -76,6 +76,7 @@ def load_env_if_needed():
    load_dotenv()
    if os.getenv("OPENAI_API_KEY") is None:
        load_dotenv(dotenv_path=os.path.join(os.getcwd(), ".env"))
+
    openai.api_key = os.getenv("OPENAI_API_KEY")

    if os.getenv("ANTHROPIC_API_KEY") is None:
@@ -480,6 +481,8 @@ def main(

    if ai.token_usage_log.is_openai_model():
        print("Total api cost: $ ", ai.token_usage_log.usage_cost())
+   elif os.getenv("LOCAL_MODEL"):
+       print("Total api cost: $ 0.0 since we are using local LLM.")
    else:
        print("Total tokens used: ", ai.token_usage_log.total_tokens())
