Using with open/local models
============================

**Use `gpte` first with OpenAI models to get a feel for the `gpte` tool.**

**Then go play with experimental Open LLMs 🐉 support and try not to get 🔥!!**

At the moment the best option for coding is still the use of `gpt-4` models provided by OpenAI. But open models are catching up and are a good free and privacy-oriented alternative if you possess the proper hardware.

You can integrate `gpt-engineer` with open-source models by leveraging an OpenAI-compatible API.

We provide the minimal and cleanest solution below. What is described is not the only way to use open/local models, but it is the one we tested and would recommend to most users.

More details on why we recommend this solution can be found in [this blog post](https://zigabrencic.com/blog/2024-02-21).

Setup
-----

For the inference engine we recommend using [llama.cpp](https://github.com/ggerganov/llama.cpp) through its Python bindings, `llama-cpp-python`.

We choose `llama.cpp` with the `llama-cpp-python` bindings because:

1. `llama.cpp` supports the largest number of hardware acceleration backends.
2. It supports a diverse set of open LLMs.
3. `llama-cpp-python` is written in Python, directly on top of the `llama.cpp` inference engine.
4. It exposes an OpenAI-compatible API and a `langchain` interface.

To install `llama-cpp-python` follow the official [installation docs](https://llama-cpp-python.readthedocs.io/en/latest/) and [these docs](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/) for macOS with Metal support.

If you want to benefit from proper hardware acceleration on your machine, make sure to set the proper compiler flags before installing the package:

- `linux`: `CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"`
- `macos` with Metal support: `CMAKE_ARGS="-DLLAMA_METAL=on"`
- `windows`: `$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"`

This enables the `pip` installer to compile `llama.cpp` with the proper hardware acceleration backend.

Then run:

```bash
pip install llama-cpp-python
```
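
For example, on macOS with Metal (just one possible setup; swap in the `CMAKE_ARGS` value for your platform from the list above), the flag can be passed inline so that `pip` compiles the backend in one step:

```bash
# Example for macOS with Metal; use the CMAKE_ARGS value for your platform instead
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```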

For our use case we also need to set up the web server that the `llama-cpp-python` library provides. To install it:

```bash
pip install 'llama-cpp-python[server]'
```

For detailed usage consult the [`llama-cpp-python` docs](https://llama-cpp-python.readthedocs.io/en/latest/server/).
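
As a quick sanity check that the server component installed correctly, you can print its command line options; this should list flags such as `--model`:

```bash
# Should print the server's available options if the install succeeded
python -m llama_cpp.server --help
```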

Before we proceed we need to obtain the model weights in the `gguf` format. That should be a single file on your disk.

In case you have weights in other formats, check the `llama-cpp-python` docs for how to convert them to the `gguf` format.

Models in other formats (`ggml`, `.safetensors`, etc.) won't work with the solution described below without prior conversion to the `gguf` file format!
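
If you are not sure whether a file is already `gguf`, one quick optional check is that a valid `gguf` file starts with the four ASCII bytes `GGUF`; the path below is only a placeholder:

```bash
# Should print "GGUF" for a valid gguf file (replace the path with your weights file)
head -c 4 ./models/your-model.gguf; echo
```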

Which open model to use?
========================

Your best choices are:

- CodeLlama 70B
- Mixtral 8x7B

We are still testing this part, but the larger the model you can run, the better. The responses might be slower in terms of tokens per second, but the code quality will be higher.

For testing that the open LLM `gpte` setup works, we recommend starting with a smaller model. You can download the weights of [CodeLlama-13B-GGUF by `TheBloke`](https://huggingface.co/TheBloke/CodeLlama-13B-GGUF); choose the largest model version you can run (for example `Q6_K`), since quantisation will degrade LLM performance.
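
One way to fetch the weights is with the `huggingface-cli` tool from the `huggingface_hub` package; the exact file name below is an assumption, so pick whichever quantisation you settled on from the model page:

```bash
# Assumes `pip install huggingface_hub`; the file name is an example quantisation
huggingface-cli download TheBloke/CodeLlama-13B-GGUF codellama-13b.Q6_K.gguf --local-dir ./models
```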

Feel free to try out larger models on your hardware and see what happens.

Running the Example
===================

To see that your setup works, check [test open LLM setup](examples/test_open_llm/README.md).

If the above tests work, proceed 😉

To check that `gpte` works with CodeLlama, we recommend creating a project whose `prompt` file contains:

```
Write a python script that sums up two numbers. Provide only the `sum_two_numbers` function and nothing else.

Provide two tests:

assert(sum_two_numbers(100, 10) == 110)
assert(sum_two_numbers(10.1, 10) == 20.1)
```

Now run the LLM in a separate terminal:

```bash
python -m llama_cpp.server --model $model_path --n_batch 256 --n_gpu_layers 30
```
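
Before moving on, you can optionally verify that the server is up by querying its OpenAI-compatible models endpoint (assuming the default `localhost:8000` address used below):

```bash
# Should return a small JSON document listing the served model
curl http://localhost:8000/v1/models
```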

Then in another terminal window set the following environment variables:

```bash
export OPENAI_API_BASE="http://localhost:8000/v1"
export OPENAI_API_KEY="sk-xxx"
export MODEL_NAME="CodeLLama"
export LOCAL_MODEL=true
```

And run `gpt-engineer` with the following command:

```bash
gpte <project_dir> $MODEL_NAME --lite --temperature 0.1
```

The `--lite` mode is needed for now, since open models currently tend to behave worse when given too many instructions. The temperature is set to `0.1` to get the most consistent results possible.

That's it.

*If something doesn't work as expected, or you figure out how to improve the open LLM support, please let us know.*

Using Azure models
==================