---
title: KubeRay
---
[](){ #deployment-kuberay }

[KubeRay](https://github.com/ray-project/kuberay) provides a Kubernetes-native way to run vLLM workloads on Ray clusters.
A Ray cluster can be declared in YAML, and the operator then handles pod scheduling, networking configuration, restarts, and rolling upgrades, all while preserving the familiar Kubernetes experience.

---

## Why KubeRay instead of manual scripts?

| Feature | Manual scripts | KubeRay |
|---------|----------------|---------|
| Cluster bootstrap | Manually SSH into every node and run a script | One command to create or update the whole cluster: `kubectl apply -f cluster.yaml` |
| Fault tolerance | Nodes must be restarted by hand | Pods are automatically rescheduled; head-node failover is supported |
| Autoscaling | Unsupported | Native horizontal **and** vertical autoscaling via the Ray Autoscaler & Kubernetes HPA |
| Upgrades | Tear down & re-create manually | Rolling updates handled by the operator |
| Monitoring | Ad hoc | Distributed observability with the Ray Dashboard |
| Declarative config | Bash flags & environment variables | Git-ops-friendly YAML CRDs (RayCluster/RayService) |

Using KubeRay reduces the operational burden and simplifies integration of Ray + vLLM with existing Kubernetes workflows (CI/CD, secrets, storage classes, etc.).

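As a taste of the declarative model, the sketch below shows the rough shape of a `RayService` manifest for a vLLM deployment. It is illustrative only: the name, container images, Serve application, and resource counts are placeholders rather than values from an official sample (the quick start below applies a maintained sample manifest instead).

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: vllm-example                     # placeholder name
spec:
  # Ray Serve application(s); import_path and runtime_env are placeholders
  # for your own vLLM serving script.
  serveConfigV2: |
    applications:
      - name: llm
        import_path: serve_script:app
        runtime_env:
          pip: ["vllm"]
  rayClusterConfig:
    enableInTreeAutoscaling: true        # optional: let the Ray autoscaler add/remove workers
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:latest        # placeholder image
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 1
        minReplicas: 1
        maxReplicas: 4
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:latest      # placeholder image
                resources:
                  limits:
                    nvidia.com/gpu: 1
```
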
---

## Quick start

1. Install the KubeRay operator (via Helm or `kubectl apply`); a Helm-based install is sketched below.
2. Create a `RayService` that runs vLLM.

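For step 1, one common route is the KubeRay Helm chart. The commands below are a sketch (assuming Helm is installed and your kubeconfig points at the target cluster); the chart version shown is only an example and should be pinned to a current KubeRay release:

```bash
# Add the KubeRay Helm repository and install the operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Example version only; use a current KubeRay release
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1

# The kuberay-operator pod should reach the Running state
kubectl get pods
```

With the operator running, step 2 is a single `kubectl apply` of a `RayService` manifest, such as the sample referenced below:
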
```bash
# FIXME create this yaml before merging PR
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/refs/heads/master/ray-operator/config/samples/vllm/ray-service.vllm.yaml
```

The YAML above spins up a Ray cluster and a Ray Serve application that serves the
`meta-llama/Meta-Llama-3-8B-Instruct` model using vLLM. Wait until the
`RayService` reports **RUNNING**, then port-forward and query the model:

```bash
kubectl port-forward svc/llama-3-8b-serve-svc 8000 &

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Provide a brief sentence describing the Ray open-source project."}
    ],
    "temperature": 0.7
  }'
```
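
If the request fails or hangs, the Ray cluster behind the `RayService` may still be starting up. Its status and the generated pods and services can be inspected with standard `kubectl` commands (resource names depend on the manifest you applied):

```bash
# Status reported by the KubeRay operator for each RayService
kubectl get rayservice
kubectl describe rayservice <name>   # <name> is the manifest's metadata.name

# Ray head/worker pods and the Serve service created for the application
kubectl get pods
kubectl get svc
```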

---

## Learn more

* ["Serve a Large Language Model with vLLM on Kubernetes"](https://docs.ray.io/en/latest/cluster/kubernetes/examples/vllm-rayservice.html):
  End-to-end walkthrough for deploying Llama-3 8B with `RayService`.
* [KubeRay documentation](https://docs.ray.io/en/latest/cluster/kubernetes/index.html)