sglang-overlap

gogongxt · 2026-03-18 · 2026-03-22

Mini-SGLang Overlap Scheduler: How It Works

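In LLM inference systems, CPU-side scheduling overhead (receiving requests, preparing batches, processing results) often drags down GPU efficiency. Mini-SGLang uses Overlap Scheduling to run CPU processing and GPU computation concurrently, hiding the CPU overhead behind the time the GPU spends computing.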

The core problem

A traditional scheduler executes like this:

sequenceDiagram
    participant CPU
    participant GPU

    CPU->>CPU: Receive requests, prepare batch
    CPU->>GPU: Submit work
    GPU->>GPU: Run inference (100ms)
    GPU-->>CPU: Return result (CPU blocks until it arrives)
    CPU->>CPU: Process result

The problem: the CPU sits idle while the GPU runs inference, and the GPU sits idle while the CPU processes results. Neither side is fully utilized.

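How the Overlap Scheduler solves it

Mini-SGLang builds a real pipeline out of two CUDA streams:

self.stream: the CPU-side processing stream (scheduling, receiving messages, processing results)
engine.stream: the GPU-side compute stream (model forward passes)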

def run_forever(self) -> NoReturn:
    data = None
    while True:
        # This round's ongoing_data becomes the next round's last_data
        data = self.overlap_loop(data)

# scheduler.py:83-106
def overlap_loop(self, last_data: ForwardData | None) -> ForwardData | None:
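    # 1. Receive new messages (CPU)
    # NOTE: `blocking` is not defined in this excerpt; presumably it is
    # computed in lines of the original function that are omitted here.
    for msg in self.receive_msg(blocking=blocking):
        self._process_one_msg(msg)

    # 2. Schedule the next batch (CPU-side preparation)
    forward_input = self._schedule_next_batch()

    # 3. Launch the current batch (runs asynchronously on the GPU)
    ongoing_data = None
    if forward_input is not None:
        with self.engine_stream_ctx:
            self.engine.stream.wait_stream(self.stream)
            ongoing_data = (forward_input, self._forward(forward_input))

    # 4. Process the previous batch's results (waits for the previous forward to finish on the GPU)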
    self._process_last_data(last_data)

    return ongoing_data
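
To make the deferred hand-off concrete, here is a minimal, self-contained sketch of the same loop shape. It is not Mini-SGLang code: the GPU is replaced by a single worker thread, and `fake_forward`, `gpu`, and `trace` are hypothetical names invented for this illustration. Message receiving and scheduling are left out.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional, Tuple

# Hypothetical stand-in for the GPU: one worker thread that runs submitted
# jobs in order, similar to how a CUDA stream runs kernels in order.
gpu = ThreadPoolExecutor(max_workers=1)
trace: List[str] = []


def fake_forward(batch_id: int) -> int:
    time.sleep(0.01)       # pretend this is the model forward pass
    return batch_id * 10   # pretend this is the sampled token


def overlap_loop(
    batch_id: int, last: Optional[Tuple[int, Future]]
) -> Tuple[int, Future]:
    # Step 3: launch the current batch without waiting for it.
    ongoing = (batch_id, gpu.submit(fake_forward, batch_id))
    trace.append(f"launch {batch_id}")
    # Step 4: only now block on the previous round's result.
    if last is not None:
        last_id, future = last
        trace.append(f"process {last_id} -> {future.result()}")
    return ongoing


data = None
for batch_id in range(3):
    data = overlap_loop(batch_id, data)

# Drain the last in-flight batch, which no later round will pick up.
trace.append(f"process {data[0]} -> {data[1].result()}")
gpu.shutdown()
print(trace)
```

The trace shows `launch 1` before `process 0`: every round launches its own batch first and only then blocks on the batch launched one round earlier.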

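Key design: deferred processing

The key to understanding the Overlap Scheduler: each round always processes the results of the GPU computation launched in the previous round.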

sequenceDiagram
    participant RoundN as Round N
    participant RoundN_1 as Round N+1
    participant GPU
    participant RoundN_2 as Round N+2

    RoundN->>GPU: Launch forward_N (async, no wait)
    RoundN-->>RoundN_1: Return ongoing_data_N

    RoundN_1->>GPU: Launch forward_N+1 (async)
    RoundN_1->>RoundN_1: Process last_data = ongoing_data_N<br/>(calls synchronize to wait for the GPU)
    RoundN_1-->>RoundN_2: Return ongoing_data_N+1

Full timeline of rounds N and N+1

Sequential logic diagram:

sequenceDiagram
    participant CPU
    participant GPU
    autonumber

    Note over CPU,GPU: Round N starts, last_data comes from Round N-1

    CPU->>CPU: 1. receive_msg() receives new requests
    CPU->>CPU: 2. _schedule_next_batch() schedules a batch
    CPU->>GPU: 3. _forward() launches GPU computation

    par CPU and GPU run in parallel
        Note over CPU: CPU is free to do other work
        GPU->>GPU: Run forward_N (say 100ms)
    end

    CPU->>CPU: 4. _process_last_data(last_data)
    Note over CPU: ⚠️ Sync point: copy_done.synchronize()<br/>waits for the previous round's GPU work

    alt Request finished
        CPU->>CPU: Free resources, send DetokenizeMsg
    else Request continues
        CPU->>CPU: Cache prefix in RadixCache
    end

    CPU-->>CPU: Return ongoing_data_N to Round N+1

Overlap Scheduler execution timeline, with illustrative durations (GPU 100% busy):

gantt
    title Overlap Scheduler execution timeline (GPU 100% busy)
    dateFormat x
    axisFormat %L ms

    section GPU
    Forward N (100ms) :gpu1, 0, 100ms
    Forward N+1 (100ms) :gpu2, 100, 100ms
    Forward N+2 (100ms) :gpu3, 200, 100ms

    section CPU
    Proc N-1 (30ms) :proc0, 0, 30ms
    Prep N+1 (20ms) :prep1, 30, 20ms
    sync N (50ms) :wait1, 50, 50ms
    Proc N (30ms) :proc1, 100, 30ms
    Prep N+2 (20ms) :prep2, 130, 20ms
    sync N+1 (50ms) :wait2, 150, 50ms
    Proc N+1 (30ms) :proc2, 200, 30ms
    Prep N+3 (20ms) :prep3, 230, 20ms
    sync N+2 (50ms) :wait3, 250, 50ms
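
The arithmetic behind this timeline can be checked with a small back-of-the-envelope model. This is a sketch under simplifying assumptions (fixed durations, steady state, no launch overhead), and the function names are made up for the illustration.

```python
def serial_round_ms(gpu_ms: float, cpu_ms: float) -> float:
    # Without overlap, each round pays for CPU work and GPU work back to back.
    return gpu_ms + cpu_ms


def overlapped_round_ms(gpu_ms: float, cpu_ms: float) -> float:
    # With overlap, the slower side sets the steady-state pace.
    return max(gpu_ms, cpu_ms)


gpu_ms = 100.0        # one forward pass, as in the timeline
cpu_ms = 30.0 + 20.0  # process previous results + prepare the next batch

serial = serial_round_ms(gpu_ms, cpu_ms)
overlapped = overlapped_round_ms(gpu_ms, cpu_ms)

gpu_util_serial = gpu_ms / serial          # share of each round the GPU is busy
gpu_util_overlapped = gpu_ms / overlapped
cpu_wait_ms = overlapped - cpu_ms          # the "sync" bars in the timeline

print(serial, overlapped, round(gpu_util_serial, 3), gpu_util_overlapped, cpu_wait_ms)
```

With these numbers a serial round takes 150 ms and keeps the GPU busy about two thirds of the time, while an overlapped round takes 100 ms with the GPU always busy and the CPU waiting 50 ms at the sync point.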

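Data flow in detail

Variable	Meaning	Lifetime
`last_data`	GPU results from the previous round (N-1)	Discarded once the current round has processed it
`forward_input_N`	Input the current round (N) prepares for the GPU	Sent to the GPU
`forward_output_N`	GPU results of the current round (N)	Handed to the next round (N+1) as part of last_data
`ongoing_data`	The (forward_input, forward_output) tuple	Passed from one round to the next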
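
As a rough illustration of how these values fit together, the sketch below models them with `NamedTuple`s. The shapes are inferred from how the excerpts use them (`last_data[0].batch` and a three-element output). The field name `next_tokens_gpu` and the exact class definitions are assumptions, not the real Mini-SGLang types.

```python
from typing import Any, NamedTuple, Tuple


class ForwardInput(NamedTuple):
    batch: Any             # the scheduled batch


class ForwardOutput(NamedTuple):
    next_tokens_gpu: Any   # assumed name: sampled tokens still on the GPU
    next_tokens_cpu: Any   # CPU-side copy of the sampled tokens
    copy_done: Any         # event-like object, signalled when the copy finishes


# ongoing_data / last_data: the pair that travels from one round to the next
ForwardData = Tuple[ForwardInput, ForwardOutput]

# Round N builds the pair when it launches the forward pass.
fwd_in = ForwardInput(batch=["req-0", "req-1"])
fwd_out = ForwardOutput(next_tokens_gpu=None, next_tokens_cpu=[11, 42], copy_done=None)
ongoing_data: ForwardData = (fwd_in, fwd_out)

# Round N+1 receives the same pair as last_data and unpacks it.
last_data = ongoing_data
batch, (_, next_tokens_cpu, copy_done) = last_data[0].batch, last_data[1]
print(batch, next_tokens_cpu)
```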

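Synchronization points

The code has two key synchronization points:

# scheduler.py:101-103
with self.engine_stream_ctx:  # switch to the engine's stream
    self.engine.stream.wait_stream(self.stream)  # sync point 1: engine stream waits for work queued on self.stream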
    ongoing_data = (forward_input, self._forward(forward_input))

# scheduler.py:138-143
def _process_last_data(self, last_data: ForwardData | None) -> None:
    if last_data is None:
        return

    batch, (_, next_tokens_cpu, copy_done) = last_data[0].batch, last_data[1]
    copy_done.synchronize()  # sync point 2: wait for the GPU computation to finish
    # ... process results

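Sync point	Purpose
wait_stream	Makes the engine stream wait for the work already queued on self.stream, so everything the new forward needs is in place before it runs
synchronize	Ensures the previous round's GPU computation has finished, so the CPU can safely read the results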
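
The two sync points behave differently: `wait_stream` orders work between streams without blocking the CPU, while `synchronize` blocks the calling CPU thread. The toy model below imitates that with plain Python threads. It is an analogy, not CUDA or PyTorch code, and the `Stream` and `Event` classes are hypothetical stand-ins written for this sketch.

```python
import queue
import threading
from typing import Callable, List


class Event:
    """Toy analogue of a CUDA event: set once the recording stream reaches it."""

    def __init__(self) -> None:
        self._flag = threading.Event()

    def synchronize(self) -> None:
        self._flag.wait()  # blocks whichever thread calls it


class Stream:
    """Toy analogue of a CUDA stream: a FIFO queue drained by one worker."""

    def __init__(self) -> None:
        self._jobs: "queue.Queue[Callable[[], None]]" = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        while True:
            self._jobs.get()()  # run jobs strictly in submission order

    def submit(self, fn: Callable[[], None]) -> None:
        self._jobs.put(fn)  # returns immediately, like a kernel launch

    def record_event(self) -> Event:
        ev = Event()
        self._jobs.put(ev._flag.set)  # fires after all earlier jobs
        return ev

    def wait_stream(self, other: "Stream") -> None:
        # Later jobs on this stream wait for everything queued on `other`
        # so far. The caller itself does not block.
        ev = other.record_event()
        self._jobs.put(ev.synchronize)


log: List[str] = []
scheduler_stream, engine_stream = Stream(), Stream()

scheduler_stream.submit(lambda: log.append("prepare inputs"))
engine_stream.wait_stream(scheduler_stream)  # sync point 1
engine_stream.submit(lambda: log.append("forward"))
engine_stream.submit(lambda: log.append("copy result to CPU"))
copy_done = engine_stream.record_event()

copy_done.synchronize()                      # sync point 2
log.append("read result on CPU")
print(log)
```

In this model only the last call, `copy_done.synchronize()`, makes the main thread wait. Everything before it is enqueued and returns immediately, which is what lets the CPU keep working while the forward pass runs.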

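Completion check

_process_last_data decides whether a request has finished:

# scheduler.py:153-156
next_token = int(next_token.item())
finished = not req.can_decode  # cannot decode any further
if not req.sampling_params.ignore_eos:
    finished |= next_token == self.eos_token_id  # hit EOS

A request counts as finished when either condition holds:

req.can_decode is False: the request cannot decode any further
next_token == eos_token_id: the generated token is the end-of-sequence token (checked only when ignore_eos is not set)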
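
The same rule can be written as a standalone function to make the role of `ignore_eos` explicit. This is a sketch that mirrors the excerpt above. The function name `is_finished` and the EOS id `2` are placeholders chosen for the example.

```python
def is_finished(
    can_decode: bool, next_token: int, eos_token_id: int, ignore_eos: bool
) -> bool:
    finished = not can_decode  # the request cannot decode any further
    if not ignore_eos:
        finished |= next_token == eos_token_id  # the model emitted EOS
    return finished


EOS = 2  # placeholder token id for this example

print(is_finished(can_decode=True, next_token=7, eos_token_id=EOS, ignore_eos=False))
print(is_finished(can_decode=True, next_token=EOS, eos_token_id=EOS, ignore_eos=False))
print(is_finished(can_decode=True, next_token=EOS, eos_token_id=EOS, ignore_eos=True))
print(is_finished(can_decode=False, next_token=7, eos_token_id=EOS, ignore_eos=True))
```

The third call shows that with `ignore_eos` set, an EOS token alone does not end the request; only `can_decode` turning false does.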

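Comparison with the normal scheduler

flowchart LR
    subgraph Normal["Normal scheduling (normal_loop)"]
        N1[Receive messages] --> N2[Schedule]
        N2 --> N3[Run on GPU]
        N3 --> N4[Process results]
    end

    subgraph Overlap["Overlap scheduling (overlap_loop)"]
        O1[Receive messages] --> O2[Schedule]
        O2 --> O3[Launch on GPU]
        O3 --> O4[Process previous-round results]
    end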

    Normal --- Overlap

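The difference:

Normal: serial execution [receive → schedule → execute → process results]
Overlap: interleaved execution; while the GPU runs the current batch, the CPU processes the previous round's results and prepares the next batch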

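Performance benefit

This way, the CPU-side scheduling overhead is hidden behind GPU inference time:

The CPU is always processing the "previous round's" results while the GPU is always executing the "current round's" computation, so both sides stay busy. As long as the per-round CPU work is shorter than the forward pass, the GPU never has to wait for the CPU.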

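Summary

The core tricks of the Overlap Scheduler:

Two streams: CPU-side processing and GPU computation use different CUDA streams
Deferred processing: each round processes the previous round's GPU results and hands its own GPU results to the next round
Explicit synchronization: synchronize() guarantees the results are safe to read
Pipelining: execution forms a CPU-GPU-CPU-GPU pipeline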