# vLLM

[vLLM](https://docs.vllm.ai/en/stable/) is a fast and user-friendly library for LLM inference and serving. It provides an OpenAI-compatible server interface, allowing the use of the OpenAI kinds for chat and embedding, while offering a specialized interface for completions.

Important requirements for all model types:

- `model_name` must exactly match the one used to run vLLM
- `api_endpoint` should follow the format `http://host:port/v1`
- `api_key` should be identical to the one used to run vLLM

Please note that models differ in their capabilities for completion or chat. Some models can serve both purposes. For detailed information, please refer to the [Model Registry](../../models/index.mdx).
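
If you want to double-check that `api_endpoint`, `api_key`, and `model_name` match the running server before editing `config.toml`, you can list the models vLLM exposes. The snippet below is a minimal sketch using the `openai` Python client (not part of Tabby itself); the endpoint, key, and model name are the same placeholders used in the configuration examples below.

```python
from openai import OpenAI

# Placeholder values; replace with the endpoint and key used to launch vLLM.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-api-key")

# `model_name` in config.toml must match one of the IDs printed here.
for model in client.models.list():
    print(model.id)
```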

## Chat model

vLLM provides an OpenAI-compatible chat API interface.

```toml title="~/.tabby/config.toml"
[model.chat.http]
kind = "openai/chat"
model_name = "your_model" # Please make sure to use a chat model
api_endpoint = "http://localhost:8000/v1"
api_key = "your-api-key"
```
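
Before pointing Tabby at the server, it can help to confirm that the model actually supports chat. Below is a minimal check with the `openai` Python client, assuming the same placeholder endpoint, key, and model name as in the configuration above.

```python
from openai import OpenAI

# Same placeholder values as in config.toml above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-api-key")

# A chat-capable model should return a sensible assistant message here.
response = client.chat.completions.create(
    model="your_model",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(response.choices[0].message.content)
```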

## Completion model

Due to implementation differences, vLLM uses its own completion API interface that requires a specific prompt template based on the model being used.

```toml title="~/.tabby/config.toml"
[model.completion.http]
kind = "vllm/completion"
model_name = "your_model" # Please make sure to use a completion model
api_endpoint = "http://localhost:8000/v1"
api_key = "your-api-key"
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>" # Example prompt template for the CodeLlama model series
```
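
The `prompt_template` describes how a fill-in-the-middle prompt is assembled from the code before (`{prefix}`) and after (`{suffix}`) the cursor, and the rendered prompt is then sent to vLLM's completion endpoint. The sketch below illustrates this with the `openai` Python client and the CodeLlama-style template from the example above; it is only an approximation, and the exact request Tabby builds may differ.

```python
from openai import OpenAI

# Same placeholder values as in config.toml above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-api-key")

# Code on either side of the cursor; the template fills {prefix} and {suffix}.
prefix = "def add(a, b):\n    return "
suffix = "\n"
prompt = "<PRE> {prefix} <SUF>{suffix} <MID>".format(prefix=prefix, suffix=suffix)

# A completion-capable model should fill in the middle, e.g. "a + b".
response = client.completions.create(
    model="your_model",
    prompt=prompt,
    max_tokens=32,
)
print(response.choices[0].text)
```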

## Embeddings model

vLLM provides an OpenAI-compatible embeddings API interface.

```toml title="~/.tabby/config.toml"
[model.embedding.http]
kind = "openai/embedding"
model_name = "your_model"
api_endpoint = "http://localhost:8000/v1"
api_key = "your-api-key"
```
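
As with chat, you can verify that the model served by vLLM actually produces embeddings. Here is a minimal check with the `openai` Python client, using the same placeholders as the configuration above.

```python
from openai import OpenAI

# Same placeholder values as in config.toml above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-api-key")

# An embedding model should return a fixed-length float vector.
response = client.embeddings.create(model="your_model", input="def fib(n):")
print(len(response.data[0].embedding))
```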