SGLang Router Requests under PD Disaggregation

gogongxt 2026-04-22 2026-04-22

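SGLang Router: Request Forwarding in PD Disaggregation Mode

This document explains in detail how SGLang Router, in Prefill-Decode (PD) disaggregation mode, transforms a client request and dispatches it to the prefill and decode servers.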

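1. Overview

In PD mode, the Router performs a dual dispatch for every request it receives:

- It sends the request, with bootstrap fields added, to both the prefill server and the decode server.
- The two run concurrently: prefill fills the KV cache, and decode pulls the KV cache from prefill and then decodes incrementally.
- The Router returns only the decode server's response to the client; the prefill response is used only to merge in metadata such as logprobs and prompt_tokens_details.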

Client --> Router ----+----> Prefill Server (runs prefill, stores KV cache)
                      |
                      +----> Decode Server  (pulls KV cache from prefill, runs decode)
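
The dual dispatch above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Router's actual code: `send_to_prefill`, `send_to_decode`, and `dual_dispatch` are hypothetical names, and the two servers are simulated in-process instead of being called over HTTP.

```python
import asyncio


async def send_to_prefill(payload: dict) -> dict:
    # Hypothetical stand-in for the HTTP call to the prefill server.
    await asyncio.sleep(0)
    return {"server": "prefill", "room": payload["bootstrap_room"]}


async def send_to_decode(payload: dict) -> dict:
    # Hypothetical stand-in for the HTTP call to the decode server.
    await asyncio.sleep(0)
    return {"server": "decode", "room": payload["bootstrap_room"], "text": "Hi!"}


async def dual_dispatch(payload: dict) -> dict:
    # Both requests are issued concurrently and share the same bootstrap fields.
    prefill_resp, decode_resp = await asyncio.gather(
        send_to_prefill(payload), send_to_decode(payload)
    )
    # Only the decode response is returned to the client. The prefill response
    # would be consulted solely for metadata (e.g. logprobs); omitted here.
    return decode_resp


response = asyncio.run(dual_dispatch({"bootstrap_room": 42}))
print(response["server"])  # prints "decode"
```

The shared `bootstrap_room` value is what lets the decode side find the KV cache produced for the same request on the prefill side.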

2. Complete Request Transformation Flow

2.0 Router Launch Command

The examples in this document are based on the following launch command:

sglang-router \
    --pd-disaggregation \
    --policy random \
    --prefill http://<prefill_ip>:$PREFILL_PORT none \
    --decode http://<decode_ip>:$DECODE_PORT \
    --host 0.0.0.0 \
    --port 58887 \
    --log-level debug

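Note: the second argument to --prefill is the bootstrap port. With none, bootstrap_port is null in the forwarded requests; with a specific port number (for example 9001), bootstrap_port is that port number.

The same port must also be set when the prefill node is started. In short, the bootstrap port configured on the prefill node and the one given to the Router must match, because decode needs to know prefill's port in order to establish a connection.

If it is set in neither place, both sides use the default port, and decode can still reach prefill through that default.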
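
The mapping from the second --prefill argument to the forwarded bootstrap_port value can be sketched like this. `parse_bootstrap_port` is a hypothetical helper written for illustration; it is not the Router's argument parser, and it assumes the literal lowercase string none as shown in the launch command.

```python
import json
from typing import Optional


def parse_bootstrap_port(arg: str) -> Optional[int]:
    # "none" means no explicit bootstrap port; it is forwarded as JSON null.
    if arg == "none":
        return None
    # Anything else is expected to be a port number such as "9001".
    return int(arg)


print(json.dumps({"bootstrap_port": parse_bootstrap_port("none")}))
# prints {"bootstrap_port": null}
print(json.dumps({"bootstrap_port": parse_bootstrap_port("9001")}))
# prints {"bootstrap_port": 9001}
```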

2.1 Original User Request

POST http://<ip>:<port>/v1/chat/completions
Content-Type: application/json
Authorization: Bearer sk-xxx
X-Trace-Id: trace-abc-123

{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Hello, how are you?"}],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 100
}

2.2 Request Sent to Prefill

POST http://<prefill_ip>:30001/v1/chat/completions
content-type: application/json
authorization: Bearer sk-test-key

{
  "bootstrap_host": "<prefill_ip>",
  "bootstrap_port": null,
  "bootstrap_room": 7073812129466405690,
  "continue_final_message": false,
  "ignore_eos": false,
  "logprobs": false,
  "max_tokens": 100,
  "messages": [
    {
      "content": "Hello, how are you?",
      "role": "user"
    }
  ],
  "model": "gpt-4",
  "no_stop_trim": false,
  "return_hidden_states": false,
  "rid": "chatcmpl-kVx856nuu99xNa4fFtGXSlQ1-10.191.16.16",
  "separate_reasoning": true,
  "skip_special_tokens": true,
  "stream": false,
  "stream_reasoning": true,
  "temperature": 0.7
}

2.3 Request Sent to Decode

POST http://<decode_ip>:30002/v1/chat/completions
content-type: application/json
authorization: Bearer sk-test-key

{
  "bootstrap_host": "<prefill_ip>",
  "bootstrap_port": null,
  "bootstrap_room": 7073812129466405690,
  "continue_final_message": false,
  "data_parallel_rank": null,
  "data_parallel_rank_decode": null,
  "ignore_eos": false,
  "logprobs": false,
  "max_tokens": 100,
  "messages": [
    {
      "content": "Hello, how are you?",
      "role": "user"
    }
  ],
  "model": "gpt-4",
  "no_stop_trim": false,
  "return_hidden_states": false,
  "rid": "chatcmpl-kVx856nuu99xNa4fFtGXSlQ1-10.191.16.16",
  "separate_reasoning": true,
  "skip_special_tokens": true,
  "stream": false,
  "stream_reasoning": true,
  "temperature": 0.7
}

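2.4 Differences Between the Prefill and Decode Requests

Field	Prefill request	Decode request	Notes
Target URL	Prefill worker's base_url + route	Decode worker's base_url + route	Differs
`rid`	Present (same value)	Present (same value)	Request ID generated by the Router
`bootstrap_host`	Present (same value)	Present (same value)	Prefill worker's hostname
`bootstrap_port`	Present (same value)	Present (same value)	Prefill worker's bootstrap port
`bootstrap_room`	Present (same value)	Present (same value)	Random correlation ID
`data_parallel_rank`	Only when DP-aware	Always present (`null` without DP)	Prefill's DP rank
`data_parallel_rank_decode`	Absent	Always present (`null` without DP)	Decode's DP rank

Key difference: the decode request always carries both data_parallel_rank and data_parallel_rank_decode, whereas the prefill request carries data_parallel_rank only when DP-aware and never carries data_parallel_rank_decode. Without DP, both values are null.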
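
As a rough sketch of the table above, the two payloads could be derived from one shared base payload as follows. `build_pd_payloads` and its parameters are hypothetical names chosen for illustration; the real Router may structure this differently.

```python
import copy


def build_pd_payloads(base: dict, dp_aware: bool = False,
                      prefill_dp_rank=None, decode_dp_rank=None):
    # Both payloads start from the same rid and bootstrap_* fields.
    prefill = copy.deepcopy(base)
    decode = copy.deepcopy(base)
    # The prefill request carries its DP rank only when DP-aware.
    if dp_aware:
        prefill["data_parallel_rank"] = prefill_dp_rank
    # The decode request always carries both fields (None becomes JSON null).
    decode["data_parallel_rank"] = prefill_dp_rank
    decode["data_parallel_rank_decode"] = decode_dp_rank
    return prefill, decode


base = {
    "rid": "chatcmpl-kVx856nuu99xNa4fFtGXSlQ1-10.191.16.16",
    "bootstrap_host": "<prefill_ip>",
    "bootstrap_port": None,
    "bootstrap_room": 7073812129466405690,
}
prefill, decode = build_pd_payloads(base)
print(sorted(set(decode) - set(prefill)))
# prints ['data_parallel_rank', 'data_parallel_rank_decode']
```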

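After a client request is deserialized into the Router's ChatCompletionRequest, serde fills in default values for fields that were not set. The following fields appear in the forwarded requests even though the client did not send them:

PD-mode-specific injected fields:

Field	Default value	Source	Notes
`bootstrap_host`	Prefill worker hostname	`inject_bootstrap_into_value`	Prefill's hostname
`bootstrap_port`	Prefill bootstrap port	`inject_bootstrap_into_value`	`null` if the launch command says `none`
`bootstrap_room`	Random in `[0, 2^63-1]`	`inject_bootstrap_into_value`	KV cache correlation ID
`data_parallel_rank`	`null` / DP rank	Decode request only	Prefill's DP rank
`data_parallel_rank_decode`	`null` / DP rank	Decode request only	Decode's DP rank

Default-value fields filled in by serde deserialization:

Field	Default value	Notes
`stream`	`false`	Default when the client does not specify it
`logprobs`	`false`	Default when the client does not specify it
`skip_special_tokens`	`true`	Special tokens are skipped by default
`separate_reasoning`	`true`	SGLang extension; separates reasoning content
`stream_reasoning`	`true`	SGLang extension; streams reasoning content
`no_stop_trim`	`false`	Whether to leave the stop token untrimmed
`ignore_eos`	`false`	Whether to ignore EOS
`continue_final_message`	`false`	Whether to continue the final message
`return_hidden_states`	`false`	Whether to return hidden states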
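
The effect of the serde defaults can be imitated with a plain dictionary merge. This is only a sketch of the observable behavior: `SERDE_DEFAULTS` and `apply_defaults` are hypothetical names, and the real defaults live on the Router's ChatCompletionRequest type rather than in a lookup table.

```python
# Default values taken from the table above.
SERDE_DEFAULTS = {
    "stream": False,
    "logprobs": False,
    "skip_special_tokens": True,
    "separate_reasoning": True,
    "stream_reasoning": True,
    "no_stop_trim": False,
    "ignore_eos": False,
    "continue_final_message": False,
    "return_hidden_states": False,
}


def apply_defaults(client_request: dict) -> dict:
    # Values sent by the client win; unset fields fall back to the defaults.
    return {**SERDE_DEFAULTS, **client_request}


client_request = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "stream": False,
    "temperature": 0.7,
    "max_tokens": 100,
}
forwarded = apply_defaults(client_request)
print(forwarded["separate_reasoning"], forwarded["temperature"])
# prints True 0.7
```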

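3. Router Processing Details

Step 1: Generate the request ID (rid)

The Router generates a unique ID for each request, in the format {prefix}{24 random characters}-{local IP suffix}.

The prefix depends on the route path:
- /v1/chat/completions → chatcmpl-
- /v1/completions → cmpl-
- /generate → gnt-
- anything else → req-

The random part consists of 24 characters drawn from [A-Za-z0-9].
The IP suffix is the machine's outbound IP, obtained by connecting a UDP socket to 8.8.8.8:80.

Example: chatcmpl-kVx856nuu99xNa4fFtGXSlQ1-10.191.16.16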
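
A minimal Python sketch of this ID scheme is shown below. It is an illustration, not the Router's code: `generate_rid` and `local_ip` are hypothetical names, exact-match lookup of the route is an assumption, the use of `secrets` as the random source is my choice, and the `127.0.0.1` fallback for machines without a route is not taken from the source.

```python
import secrets
import socket
import string
from typing import Optional

PREFIXES = {
    "/v1/chat/completions": "chatcmpl-",
    "/v1/completions": "cmpl-",
    "/generate": "gnt-",
}
ALPHABET = string.ascii_letters + string.digits  # [A-Za-z0-9]


def local_ip(fallback: str = "127.0.0.1") -> str:
    # connect() on a UDP socket sends no packet; it only makes the OS choose
    # the outbound interface, whose address getsockname() then reports.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("8.8.8.8", 80))
            return s.getsockname()[0]
    except OSError:
        return fallback  # assumption: fallback behavior is not from the source


def generate_rid(route: str, ip: Optional[str] = None) -> str:
    prefix = PREFIXES.get(route, "req-")  # any other route gets "req-"
    rand = "".join(secrets.choice(ALPHABET) for _ in range(24))
    return f"{prefix}{rand}-{ip if ip is not None else local_ip()}"


print(generate_rid("/v1/chat/completions", ip="10.191.16.16"))
```

Passing `ip` explicitly keeps the function testable on machines where the UDP lookup is unavailable.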

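Step 2: Inject the bootstrap fields

Three fields are injected, using information taken from the prefill worker's configuration:

Field	Value	Notes
`bootstrap_host`	Prefill worker's hostname	Extracted from the prefill URL, e.g. `http://<prefill_ip>:30001` → `<prefill_ip>`
`bootstrap_port`	Prefill worker's bootstrap port	The port given in the launch command; `null` if it is `none`
`bootstrap_room`	Random integer	Range `[0, 2^63-1]`; correlates the KV cache between prefill and decode

Hostname extraction: strip the http:// or https:// prefix, then split on : and take the first part.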
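
The extraction rule just described translates into a few lines of Python. `extract_hostname` is a hypothetical name used for this sketch; it follows the stated rule literally rather than using a URL parser.

```python
def extract_hostname(url: str) -> str:
    # Strip a leading http:// or https:// if present.
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    # Split on ":" and keep the first part, which drops the port.
    return url.split(":")[0]


print(extract_hostname("http://10.0.0.5:30001"))  # prints 10.0.0.5
```

Taken literally, this rule would not handle a bracketed IPv6 literal or a URL that has a path but no port, so the sketch shares those limits.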

bootstrap_room generation: rand::random::<u64>() & (i64::MAX as u64), which matches the range of Python's random.randint(0, 2**63-1).
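
For comparison, the same masking approach can be written in Python, together with a sketch of the injection step. `generate_bootstrap_room` and `inject_bootstrap` are hypothetical names (the latter only loosely modeled on the `inject_bootstrap_into_value` function named earlier); this is not the Router's implementation.

```python
import random

I64_MAX = 2**63 - 1


def generate_bootstrap_room() -> int:
    # Draw 64 random bits, then clear the top bit so the result always
    # fits in a signed 64-bit integer: the range is [0, 2^63-1].
    return random.getrandbits(64) & I64_MAX


def inject_bootstrap(payload: dict, hostname: str, port) -> dict:
    # Return a copy of the payload with the three bootstrap fields added.
    return {
        **payload,
        "bootstrap_host": hostname,
        "bootstrap_port": port,  # None is serialized as JSON null
        "bootstrap_room": generate_bootstrap_room(),
    }


injected = inject_bootstrap({"model": "gpt-4"}, "10.0.0.5", None)
print(sorted(injected))
# prints ['bootstrap_host', 'bootstrap_port', 'bootstrap_room', 'model']
```

Keeping the value non-negative means it survives a round trip through systems that store it as a signed 64-bit integer.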