
Dev Mode

Dev mode runs the full stack against lightweight fake backends, so no GPU is required. It is the recommended starting point for local development and testing. Note that dev mode disables some load-balancing functionality that relies on LLM performance metrics (e.g. least KV cache).

What Runs

Instead of real vLLM instances, dev mode starts two simple HTTP servers that return a fixed JSON response. Everything else — the router, Prometheus, and Grafana — runs identically to production.

Starting the Stack

make local-up router=roundrobin

Replace roundrobin with any valid strategy: consistanthashing | leastqueue | least-kvcache.
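
Because the strategy is passed as a free-form string, a typo only surfaces once the stack starts. As a minimal sketch, a small wrapper can reject unknown names first. The strategy names are copied verbatim from the list above; `is_valid_strategy` and `start_stack` are hypothetical helpers, not part of the project.

```shell
# Strategy names exactly as listed on this page; adjust if the Makefile
# accepts a different set.
is_valid_strategy() {
  case "$1" in
    roundrobin|consistanthashing|leastqueue|least-kvcache) return 0 ;;
    *) return 1 ;;
  esac
}

# start_stack STRATEGY: hypothetical wrapper that validates the name
# before handing it to the make target shown above.
start_stack() {
  if ! is_valid_strategy "$1"; then
    echo "unknown strategy: $1" >&2
    return 2
  fi
  make local-up router="$1"
}
```

For example, `start_stack leastqueue` would run `make local-up router=leastqueue`, while `start_stack random` prints an error and returns 2 without invoking make.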

Example request

Running one request

curl localhost:7999

Running 1000 requests

for i in {1..1000}; do
  curl -s localhost:7999 &
done
# Block until every background request has finished.
wait
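
When sending many requests it can help to cap concurrency and summarize the results instead of printing 1000 response bodies. The following is a rough sketch, assuming the router answers on port 7999 as above. `fire_requests` and `tally_codes` are hypothetical helper names, and the default concurrency of 20 is an arbitrary choice.

```shell
# tally_codes: read one HTTP status code per line on stdin and print
# "code count" pairs, sorted by code.
tally_codes() {
  awk '{ n[$1]++ } END { for (c in n) print c, n[c] }' | sort
}

# fire_requests N [CONCURRENCY]: send N requests to the router, at most
# CONCURRENCY at a time (default 20), printing one status code per request.
# curl prints 000 when it cannot connect at all.
fire_requests() {
  seq "$1" | xargs -P "${2:-20}" -I{} \
    curl -s -o /dev/null -w '%{http_code}\n' localhost:7999
}
```

A run such as `fire_requests 1000 | tally_codes` prints one line per distinct status code with its count. How requests were spread across the two backends is not visible from the status codes, so that is still best read from the Grafana dashboard.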

Grafana Access

Once the stack is running, open Grafana in a browser at localhost:8000 (the Grafana port listed under Ports).

In the sidebar, click "Dashboards" and open "Load Balancer & Go Runtime Metrics".

Stopping the Stack

make local-down

Ports

| Service | Port |
| --- | --- |
| Backend 1 | 7777 |
| Backend 2 | 7778 |
| Router | 7999 |
| Prometheus | 7998 |
| Grafana | 8000 |

This checks that all services are reachable and responding.
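
A reachability check of this kind can be sketched as follows, assuming every service answers plain HTTP on localhost at the ports listed above. `probe` and `check_all` are hypothetical helpers rather than project commands, and the short service labels are illustrative only.

```shell
# name:port pairs taken from the Ports table; the labels are illustrative.
SERVICES="backend1:7777 backend2:7778 router:7999 prometheus:7998 grafana:8000"

# probe PORT: succeeds if anything answers HTTP on localhost:PORT within
# 2 seconds. Without --fail, curl exits 0 on any HTTP response, including
# redirects and 4xx, so this tests reachability rather than correctness.
probe() {
  curl -s -o /dev/null --max-time 2 "http://localhost:$1"
}

# check_all: print "name port up" or "name port DOWN" for each service and
# return non-zero if at least one service did not respond.
check_all() {
  failed=0
  for entry in $SERVICES; do
    name=${entry%%:*}
    port=${entry##*:}
    if probe "$port"; then
      echo "$name $port up"
    else
      echo "$name $port DOWN"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_all` after `make local-up` prints one status line per service, and its exit code can gate a script, for example `check_all && echo "stack ready"`.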