2. Qwen3-Next: Piecewise Graph Support

gogongxt · 2026-01-19

Qwen3-Next Piecewise CUDA Graph Support: Implementation Notes

Document Information

Item	Description
Goal	Detailed analysis of how commit d64bf6c adds Piecewise CUDA Graph support to the Qwen3-Next model
Commit	d64bf6c6ce703389cbeaaa44fe5ee3c699397d0d
PR	#13081
Date	2025-11-25

1. Overview

1.1 Goal of the Commit

Add Piecewise CUDA Graph support to the Qwen3-Next model, so that the model can run with Piecewise CUDA Graph enabled and benefit from its performance gains.

1.2 Modified Files

File	Change	Lines changed
`python/sglang/srt/compilation/backend.py`	Add `gdn_with_output` to SPLIT_OPS	+1
`python/sglang/srt/layers/attention/fla/chunk_o.py`	Fix uninitialized memory	-1 +1
`python/sglang/srt/mem_cache/memory_pool.py`	Fix dataclass memory accounting	-3 +6
`python/sglang/srt/model_executor/model_runner.py`	Add hybrid attention layer support	+5
`python/sglang/srt/models/qwen3_next.py`	Implement the GDN split op pattern	+63
`test/srt/models/test_qwen3_next_models.py`	Add a piecewise graph test	+38

1.3 Key Technical Points

The _with_output pattern: analogous to unified_attention_with_output
Custom op registration: done through direct_register_custom_op
Forward context passing: get_forward_context() is used to reach the actual layer
Bug fixes: potential issues in chunk_o.py and memory_pool.py are fixed

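2. Background: Qwen3-Next Architecture

2.1 Qwen3-Next Model Architecture

Qwen3-Next is a hybrid-architecture model whose decoder layers use one of two attention mechanisms:

Qwen3-Next Layer
├── Attention (standard attention)
└── Linear Attention (GDN, Gated Delta Network, a Mamba-like linear RNN)

2.2 GDN (Gated Delta Network) in Detail

GDN is one of the core components of Qwen3-Next. It is a Mamba-like linear RNN architecture.

Code location: python/sglang/srt/models/qwen3_next.py:Qwen3GatedDeltaNet

Main characteristics:

Maintains a recurrent state, in the style of a State Space Model (SSM)
Supports efficient inference on long sequences
Relies on fairly complex Triton kernels
Uses a dual stream optimization (when not running in piecewise graph mode)

Architecture sketch:

Input → QKVZ Projection → Conv1D → SSM → Output
         ↑               ↑
      dual_stream optimization (non-piecewise mode)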

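2.3 Compatibility Problems with Piecewise Graph

Problem 1: a complex computation flow

GDN issues several Triton kernel calls
It uses stream synchronization as an optimization
Both are at odds with the static execution model of a CUDA Graph

Problem 2: dynamic control flow

Different kernel fusion strategies are chosen depending on sequence length
Conditional branches conflict with CUDA Graph

Problem 3: memory allocation

Intermediate tensors are created internally with torch.empty_like()
A CUDA Graph requires a static memory layout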

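3. Implementation Challenges

3.1 Why Special Handling Is Needed

Comparison with layer types that were already supported:

Layer Type	Piecewise Graph support	Implementation
Standard Attention	✓	Directly inside the graph
Unified Attention	✓	`unified_attention_with_output` split op
MoE	✓	`moe_forward_piecewise_cuda_graph_impl` split op
GDN (Qwen3-Next)	✗ → ✓	`gdn_with_output` split op, added by this PR

3.2 Technical Difficulties

The GDN layer does not inherit from CustomOp
- A standard CustomOp controls its behavior through enter_torch_compile()
- GDN has its own forward logic and needs special handling
Hybrid attention layers
- Qwen3-Next decoder layers expose their attention module as attn or linear_attn
- These must be recognized and registered in model_runner.py
Memory initialization
- The output buffer of the Triton kernel chunk_fwd_kernel_o is allocated with torch.empty_like()
- Any element the kernel does not write stays uninitialized, which hurts reproducibility

3.3 Design Decisions

Decision 1: use the _with_output pattern

Consistent with unified_attention_with_output
The output tensor is preallocated by the caller
The GDN op only fills in the output

Decision 2: split op boundary

Isolate the entire GDN computation inside the split op, outside the compiled subgraphs
The graph is compiled separately before and after GDN

Decision 3: forward context passing

Use get_forward_context() to obtain the actual layer object
Call the layer's _forward() method to run the actual computation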

4. Core Implementation

4.1 Overall Architecture

┌─────────────────────────────────────────────────────────────────┐
│        Piecewise CUDA Graph execution flow (Qwen3-Next)         │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                    Pre-GDN computation (Subgraph 0)             │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  - Layernorm                                                │ │
│  │  - Other linear transformations                            │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                              ↓
                    ┌─────────────────┐
                    │   Graph Split   │  ← gdn_with_output split op
                    └─────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                    GDN computation, executed in isolation       │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  1. forward() detects piecewise mode                       │ │
│  │  2. Calls torch.ops.sglang.gdn_with_output                 │ │
│  │  3. gdn_with_output fetches the forward context            │ │
│  │  4. Calls attention_layer._forward()                       │ │
│  │  5. Result is written to the preallocated output           │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                    Post-GDN computation (Subgraph 1)            │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │  - Residual connection                                     │ │
│  │  - MoE layer (if applicable)                               │ │
│  │  - Other subsequent layers                                 │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

4.2 Key Design Patterns

4.2.1 The `_with_output` Pattern

How it differs from a conventional forward:

# Conventional style (cannot be used with piecewise graph)
def forward(self, hidden_states, forward_batch):
    # ... computation ...
    return output  # the output tensor is allocated internally

# _with_output pattern (suitable for piecewise graph)
def forward(self, hidden_states, forward_batch):
    output = torch.empty_like(hidden_states)  # preallocate
    if forward_batch.forward_mode.is_extend() and get_forward_context() is not None:
        torch.ops.sglang.gdn_with_output(
            hidden_states,
            output,      # ← pass in the preallocated output tensor
            self.layer_id,
        )
        return output
    else:
        return self._forward(hidden_states, forward_batch)

Advantages:

The output tensor has a fixed memory address
CUDA Graph can plan memory statically
Dynamic memory allocation inside the op is avoided
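Why a fixed output address matters can be illustrated with a toy model of capture and replay. The sketch below is plain Python, not CUDA: `ToyGraph` is a hypothetical stand-in that remembers the exact buffer objects seen at capture time and reuses them on every replay, which mirrors how a captured CUDA Graph keeps reading and writing the same device addresses.

```python
from typing import Callable, List


class ToyGraph:
    """Hypothetical stand-in for a captured graph: it remembers buffer objects."""

    def __init__(
        self,
        fn: Callable[[List[float], List[float]], None],
        inp: List[float],
        out: List[float],
    ) -> None:
        # "Capture": keep references to the exact buffers used at capture time.
        self._fn, self._inp, self._out = fn, inp, out

    def replay(self) -> None:
        # Replay always reads and writes the captured buffers, nothing else.
        self._fn(self._inp, self._out)


def double_with_output(inp: List[float], out: List[float]) -> None:
    # _with_output style: fill the caller-provided buffer in place.
    for i, value in enumerate(inp):
        out[i] = 2.0 * value


static_in = [1.0, 2.0, 3.0]
static_out = [0.0, 0.0, 0.0]
graph = ToyGraph(double_with_output, static_in, static_out)

# New data has to be copied into the captured input buffer before replay.
static_in[:] = [10.0, 20.0, 30.0]
graph.replay()
result = list(static_out)
print(result)
```

An op that returned a freshly allocated result would give the replay nothing stable to write into, which is the situation the `_with_output` signature avoids.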

4.2.2 Custom Op Registration Pattern

The complete registration flow (simplified):

# 1. Implement the function that does the real work
def gdn_with_output(hidden_states, output, layer_id):
    context = get_forward_context()
    forward_batch = context.forward_batch
    attention_layer = context.attention_layers[layer_id]
    ret = attention_layer._forward(hidden_states, forward_batch)
    output.copy_(ret)
    return

# 2. Implement the fake function (used by torch.compile)
def gdn_with_output_fake(hidden_states, output, layer_id):
    return  # empty implementation

# 3. Register the custom op
direct_register_custom_op(
    op_name="gdn_with_output",
    op_func=gdn_with_output,
    mutates_args=["output"],  # mark output as a mutated argument
    fake_impl=gdn_with_output_fake,
)
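The division of labor between the real function and the fake function can be sketched without PyTorch. The registry below is a hypothetical, heavily simplified stand-in for what `direct_register_custom_op` arranges; `OP_REGISTRY`, `register_op`, `call_op`, and the `scale_with_output` op are illustrative names only, not SGLang or PyTorch APIs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: op name -> (real implementation, fake implementation).
OP_REGISTRY: Dict[str, Tuple[Callable[..., None], Callable[..., None]]] = {}


def register_op(
    name: str, op_func: Callable[..., None], fake_impl: Callable[..., None]
) -> None:
    OP_REGISTRY[name] = (op_func, fake_impl)


def call_op(name: str, *args, tracing: bool = False) -> None:
    real, fake = OP_REGISTRY[name]
    # While tracing, only the fake runs, so no real computation happens.
    (fake if tracing else real)(*args)


def scale_with_output(x: List[float], output: List[float], factor: int) -> None:
    # Real implementation: mutate `output` in place and return nothing.
    for i, value in enumerate(x):
        output[i] = value * factor


def scale_with_output_fake(x: List[float], output: List[float], factor: int) -> None:
    # Fake implementation: same signature, no side effects.
    return None


register_op("scale_with_output", scale_with_output, scale_with_output_fake)

traced_out = [0.0, 0.0]
call_op("scale_with_output", [1.0, 2.0], traced_out, 3, tracing=True)

real_out = [0.0, 0.0]
call_op("scale_with_output", [1.0, 2.0], real_out, 3)
print(traced_out, real_out)
```

Because the op returns nothing and only mutates `output`, the fake has no result metadata to produce, which is why an empty body is enough in this pattern.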

4.3 Integration with the Existing Op Isolation Mechanism

Added to the global SPLIT_OPS list:

# backend.py:28-32
SPLIT_OPS = [
    "sglang.unified_attention_with_output",
    "sglang.inplace_all_reduce",
    "sglang.gdn_with_output",  # ← new
]

Graph splitting is then triggered automatically:

torch.compile (traced graph)
      ↓
  split_graph (backend.py)
      ↓
  gdn_with_output is detected
      ↓
  The graph is cut at that op
      ↓
  Subgraph 0 → gdn_with_output → Subgraph 1

5. Detailed Code Analysis

5.1 Changes to Qwen3GatedDeltaNet

5.1.1 Disabling the Dual Stream Optimization

File: python/sglang/srt/models/qwen3_next.py:348-365

def _forward_input_proj(self, hidden_states: torch.Tensor):
    # Key change: disable dual stream in piecewise graph mode
    if _is_npu or get_global_server_args().enable_piecewise_cuda_graph:
        DUAL_STREAM_TOKEN_THRESHOLD = 0  # ← dual stream disabled (seq_len < 0 is never true)
    else:
        DUAL_STREAM_TOKEN_THRESHOLD = 1024  # ← enabled outside piecewise mode

    seq_len, _ = hidden_states.shape
    if seq_len < DUAL_STREAM_TOKEN_THRESHOLD:
        # Dual stream optimization: run the two projections on two CUDA streams
        current_stream = torch.cuda.current_stream()
        self.alt_stream.wait_stream(current_stream)
        projected_states_qkvz, _ = self.in_proj_qkvz(hidden_states)
        with torch.cuda.stream(self.alt_stream):
            projected_states_ba, _ = self.in_proj_ba(hidden_states)
        current_stream.wait_stream(self.alt_stream)
    else:
        # Single-stream execution
        projected_states_qkvz, _ = self.in_proj_qkvz(hidden_states)
        projected_states_ba, _ = self.in_proj_ba(hidden_states)
    return projected_states_qkvz, projected_states_ba

Rationale:

Mode	Dual Stream	Reason
Piecewise Graph	✗	Explicit stream switching is avoided inside the compiled and captured region
NPU	✗	The dual stream path is not used on NPU
Normal Eager	✓	Parallel execution of the two projections improves performance
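The threshold logic in the table can be restated as a small pure function. This is a minimal sketch of the decision only; `dual_stream_threshold` and `uses_dual_stream` are hypothetical helper names, and the values 0 and 1024 are the ones quoted in the snippet above.

```python
def dual_stream_threshold(is_npu: bool, piecewise_cuda_graph: bool) -> int:
    """Return the token threshold below which the dual-stream path is taken."""
    # A threshold of 0 disables the path, because seq_len < 0 is never true.
    return 0 if (is_npu or piecewise_cuda_graph) else 1024


def uses_dual_stream(seq_len: int, is_npu: bool, piecewise_cuda_graph: bool) -> bool:
    return seq_len < dual_stream_threshold(is_npu, piecewise_cuda_graph)


decisions = {
    "eager, 512 tokens": uses_dual_stream(512, False, False),
    "eager, 2048 tokens": uses_dual_stream(2048, False, False),
    "piecewise, 512 tokens": uses_dual_stream(512, False, True),
    "npu, 1 token": uses_dual_stream(1, True, False),
}
for name, decision in decisions.items():
    print(f"{name}: dual stream = {decision}")
```

Setting the threshold to 0 keeps a single code path and avoids a second flag: the existing `seq_len < threshold` comparison simply never succeeds.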

5.1.2 Refactoring the Forward Method

File: python/sglang/srt/models/qwen3_next.py:367-382

def forward(
    self,
    hidden_states: torch.Tensor,
    forward_batch: ForwardBatch,
):
    # Preallocate the output tensor
    output = torch.empty_like(hidden_states)

    # Detect piecewise graph mode
    if forward_batch.forward_mode.is_extend() and get_forward_context() is not None:
        # Go through the custom op so the graph is split here
        torch.ops.sglang.gdn_with_output(
            hidden_states,
            output,
            self.layer_id,
        )
        return output
    else:
        # Regular execution path
        return self._forward(hidden_states, forward_batch)

Key points:

is_extend(): piecewise graph is only used in EXTEND mode
get_forward_context() is not None: ensures we are inside a piecewise graph context
Preallocated output: keeps the memory address stable
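The two-condition dispatch can be exercised in isolation. The sketch below uses hypothetical stand-ins (`ToyForwardMode`, a module-level `_context`) rather than SGLang's real `ForwardMode` and forward context, and only records which path would be taken.

```python
from enum import Enum, auto
from typing import Optional


class ToyForwardMode(Enum):
    """Hypothetical stand-in for SGLang's forward mode."""

    EXTEND = auto()
    DECODE = auto()

    def is_extend(self) -> bool:
        return self is ToyForwardMode.EXTEND


_context: Optional[dict] = None  # None means "not inside a piecewise context"


def choose_path(mode: ToyForwardMode) -> str:
    # Both conditions must hold before the split-op path is taken.
    if mode.is_extend() and _context is not None:
        return "custom_op"
    return "direct_forward"


outside_extend = choose_path(ToyForwardMode.EXTEND)  # no context set yet
_context = {"attention_layers": []}                  # simulate an active context
inside_extend = choose_path(ToyForwardMode.EXTEND)
inside_decode = choose_path(ToyForwardMode.DECODE)
print(outside_extend, inside_extend, inside_decode)
```

In the `forward()` shown above, the preallocated `output` is only consumed on the custom op path; on the fallback path `_forward()` returns its own tensor.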

5.1.3 The Original Computation Is Left Unchanged

File: python/sglang/srt/models/qwen3_next.py:383-440

def _forward(
    self,
    hidden_states: torch.Tensor,
    forward_batch: ForwardBatch,
):
    # The original, complete computation
    seq_len, _ = hidden_states.shape
    is_cuda_graph = forward_batch.forward_mode.is_cuda_graph()

    projected_states_qkvz, projected_states_ba = self._forward_input_proj(
        hidden_states
    )

    # ... the rest of the computation is unchanged ...

Design idea:

forward(): a thin wrapper that decides whether piecewise graph mode applies
_forward(): the actual computation, identical to what forward() did before the change

5.2 Custom Op Implementation

5.2.1 The gdn_with_output Function

File: python/sglang/srt/models/qwen3_next.py:1049-1066

def gdn_with_output(
    hidden_states: torch.Tensor,
    output: torch.Tensor,
    layer_id: int,
) -> None:
    # 1. Fetch the forward context (set by the piecewise context manager)
    context = get_forward_context()
    forward_batch = context.forward_batch
    attention_layers = context.attention_layers

    # 2. Look up the actual GDN layer object by layer_id
    attention_layer = attention_layers[layer_id]

    # 3. Run the computation through the real _forward method
    ret = attention_layer._forward(hidden_states, forward_batch)

    # 4. Write the result into the preallocated output tensor
    assert (
        output.numel() == ret.numel()
    ), f"Output tensor element mismatch: {output.numel()} != {ret.numel()}"

    output.view(ret.shape).copy_(ret)
    return

Key design points:

Design point	Explanation
Lookup by layer_id	All GDN layers share a single custom op implementation
Calling _forward	Avoids recursing back into forward()
Copy instead of return	Follows the `_with_output` pattern: the tensor passed in is modified in place

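5.2.2 Fake Implementation

File: python/sglang/srt/models/qwen3_next.py:1069-1074

def gdn_with_output_fake(
    hidden_states: torch.Tensor,
    output: torch.Tensor,
    layer_id: int,
) -> None:
    return  # empty implementation

Purpose:

Used during the tracing phase of torch.compile
Avoids running the real computation while tracing
Keeps shape inference correct (the op returns nothing and only mutates output)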

5.2.3 Registering the Custom Op

File: python/sglang/srt/models/qwen3_next.py:1077-1082

direct_register_custom_op(
    op_name="gdn_with_output",
    op_func=gdn_with_output,
    mutates_args=["output"],  # ← mark output as a mutated argument
    fake_impl=gdn_with_output_fake,
)

Parameters:

Parameter	Value	Description
`op_name`	`"gdn_with_output"`	Registered under the `torch.ops.sglang.*` namespace
`op_func`	`gdn_with_output`	The function that does the real work
`mutates_args`	`["output"]`	Marks output as an argument modified in place
`fake_impl`	`gdn_with_output_fake`	Fake implementation used at compile time

5.3 Changes to Model Runner

5.3.1 Hybrid Attention Layer Support

File: python/sglang/srt/model_executor/model_runner.py:363-370

# Existing code: standard attention layers
if hasattr(layer, "self_attn"):
    if hasattr(layer.self_attn, "attn"):
        self.attention_layers.append(layer.self_attn.attn)
    elif hasattr(layer.self_attn, "attn_mqa"):
        self.attention_layers.append(layer.self_attn.attn_mqa)

# New: hybrid attention layers (Qwen3-Next)
elif hasattr(layer, "attn"):
    self.attention_layers.append(layer.attn)
elif hasattr(layer, "linear_attn"):
    self.attention_layers.append(layer.linear_attn)

Structure of a Qwen3-Next decoder layer (simplified illustration, not the actual class definition):

class Qwen3NextDecoderLayer:
    def __init__(self):
        self.attn = Qwen3Attention()             # standard attention
        self.linear_attn = Qwen3GatedDeltaNet()  # linear attention (GDN)
        # ...

Note that with the elif chain above, a layer is registered through the first of these attributes it exposes.

Why the change is needed:

gdn_with_output has to reach the actual layer through attention_layers[layer_id]
The existing code only handled the self_attn attribute
It had to be extended to attn and linear_attn attributes that sit directly on the layer
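The registration chain can be tried out on dummy objects. The sketch below mirrors the `hasattr` chain from the snippet above; `collect_attention_layers` and the four-layer toy stack are hypothetical and only meant to show that lookup by `layer_id` works when every decoder layer contributes exactly one entry, in order.

```python
from types import SimpleNamespace
from typing import Any, List


def collect_attention_layers(layers: List[Any]) -> List[Any]:
    """Mirror of the hasattr chain: at most one entry is appended per layer."""
    collected: List[Any] = []
    for layer in layers:
        if hasattr(layer, "self_attn"):
            if hasattr(layer.self_attn, "attn"):
                collected.append(layer.self_attn.attn)
            elif hasattr(layer.self_attn, "attn_mqa"):
                collected.append(layer.self_attn.attn_mqa)
        elif hasattr(layer, "attn"):
            collected.append(layer.attn)
        elif hasattr(layer, "linear_attn"):
            collected.append(layer.linear_attn)
    return collected


# Hypothetical hybrid stack: three linear-attention layers, then one attention layer.
toy_layers = [
    SimpleNamespace(linear_attn="gdn_0"),
    SimpleNamespace(linear_attn="gdn_1"),
    SimpleNamespace(linear_attn="gdn_2"),
    SimpleNamespace(attn="attn_3"),
]
registered = collect_attention_layers(toy_layers)
print(registered)
```

If some layer matched none of the branches, every later index would shift by one and `attention_layers[layer_id]` would return the wrong module, so the chain has to cover each layer type in the model.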

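5.4 Bug Fixes

5.4.1 Uninitialized Memory in the Chunk O Kernel

File: python/sglang/srt/layers/attention/fla/chunk_o.py:146

# Before
o = torch.empty_like(v)  # ← uninitialized, contains garbage values

# After
o = torch.zeros_like(v)  # ← initialized to 0

Problem analysis:

torch.empty_like() allocates memory without initializing it
The Triton kernel may not write every output element
The output can therefore contain undefined values

Impact:

Determinism tests fail
Outputs are unstable
Behavior is inconsistent with CUDA Graph replay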
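The failure mode can be reproduced with a toy allocator in plain Python. This is an illustrative model only: `alloc_empty`, `alloc_zeros`, `free`, and `partial_kernel` are hypothetical helpers, and the recycling pool is an assumption standing in for the way a caching allocator can hand back memory that still holds old values.

```python
from typing import List

# Toy allocator: "freed" buffers are recycled without being cleared,
# which is how stale values can show up in an uninitialized allocation.
_recycled: List[List[float]] = []


def alloc_empty(n: int) -> List[float]:
    if _recycled and len(_recycled[-1]) == n:
        return _recycled.pop()  # stale contents are returned as-is
    return [0.0] * n


def alloc_zeros(n: int) -> List[float]:
    buf = alloc_empty(n)
    for i in range(n):
        buf[i] = 0.0  # explicit initialization
    return buf


def free(buf: List[float]) -> None:
    _recycled.append(buf)


def partial_kernel(out: List[float], valid: int) -> None:
    # Hypothetical kernel that writes only the first `valid` elements.
    for i in range(valid):
        out[i] = 1.0


free([7.0, 7.0, 7.0, 7.0])  # leave stale data in the pool
with_empty = alloc_empty(4)
partial_kernel(with_empty, valid=2)

free([7.0, 7.0, 7.0, 7.0])
with_zeros = alloc_zeros(4)
partial_kernel(with_zeros, valid=2)
print(with_empty, with_zeros)
```

The zero-initialized variant costs one extra fill but makes the unwritten tail well defined, which is the trade the fix accepts.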

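5.4.2 Memory Pool Dataclass Memory Accounting

File: python/sglang/srt/mem_cache/memory_pool.py:140-144

# Before
def mem_usage_bytes(self):
    return sum(get_tensor_size_bytes(t) for t in vars(self).values())
    # ↑ depends on the instance __dict__ rather than on the declared fields

# After
def mem_usage_bytes(self):
    return sum(
        get_tensor_size_bytes(getattr(self, f.name))
        for f in dataclasses.fields(self)  # ← uses dataclasses.fields()
    )

Problem analysis:

MambaPool.State is declared with @dataclass(frozen=True)
According to this analysis, vars() did not reliably cover the fields of that dataclass, whereas dataclasses.fields() enumerates the declared fields explicitly
The computed memory usage was therefore wrong

Impact:

Memory statistics are inaccurate
This can lead to OOM or to wasted memory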

6. Interaction with Other Components

6.1 Piecewise Context Manager

Lifecycle of the forward context:

# 1. Set the context (in piecewise_cuda_graph_runner.py)
with set_forward_context(
    forward_batch,
    self.attention_layers,  # ← contains all GDN layers
    self.quant_config,
    self.moe_layers,
):
    output = self.model_runner.model.forward(...)

# 2. Access it inside gdn_with_output (in qwen3_next.py)
context = get_forward_context()
attention_layer = context.attention_layers[layer_id]

# 3. The context is cleaned up automatically

Related code: python/sglang/srt/compilation/piecewise_context_manager.py:82-99

@contextmanager
def set_forward_context(
    forward_batch: ForwardBatch,
    attention_layers: List[Any],
    quant_config: Any,
    moe_layers: List[Any],
):
    global _forward_context
    _forward_context = ForwardContext()
    _forward_context.set_forward_batch(forward_batch)
    _forward_context.set_attention_layers(attention_layers)
    # ...
    try:
        yield
    finally:
        _forward_context = None  # ← automatic cleanup
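The lifecycle can be checked with a self-contained miniature of the same pattern. `set_toy_context` and `get_toy_context` below are hypothetical stand-ins that store a plain dict instead of a `ForwardContext`; the point is only that the `try`/`finally` resets the global even when the body raises.

```python
from contextlib import contextmanager
from typing import Any, Iterator, List, Optional

_toy_context: Optional[dict] = None


@contextmanager
def set_toy_context(forward_batch: Any, attention_layers: List[Any]) -> Iterator[None]:
    global _toy_context
    _toy_context = {
        "forward_batch": forward_batch,
        "attention_layers": attention_layers,
    }
    try:
        yield
    finally:
        _toy_context = None  # cleared even if the body raises


def get_toy_context() -> Optional[dict]:
    return _toy_context


before = get_toy_context()
with set_toy_context("batch-0", ["layer-0", "layer-1"]):
    during = get_toy_context()["attention_layers"][1]

try:
    with set_toy_context("batch-1", []):
        raise RuntimeError("simulated failure inside forward")
except RuntimeError:
    pass
after_error = get_toy_context()
print(before, during, after_error)
```

This is also why `get_forward_context() is not None` works as a mode check in `forward()`: outside the `with` block the global is always back to `None`.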

6.2 Graph Splitting Flow

Complete flow:

1. torch.compile entry point
   ↓
2. SGLangBackend.__call__(graph)  (backend.py:396)
   ↓
3. split_graph(graph, SPLIT_OPS)  (backend.py:424)
   ├─ Detects "sglang.gdn_with_output"
   ├─ Increments subgraph_id at that node
   └─ Calls torch.fx.split_module()
   ↓
4. Produces split_gm and piecewise_graphs
   ├─ submod_0: computation before gdn_with_output
   ├─ submod_1: gdn_with_output (is_splitting_graph=True)
   └─ submod_2: computation after gdn_with_output
   ↓
5. PiecewiseCompileInterpreter.run()  (backend.py:443)
   ├─ Compiles submod_0 (needs compilation)
   ├─ Skips submod_1 (is_splitting_graph)
   └─ Compiles submod_2 (needs compilation)
   ↓
6. A CUDAPiecewiseBackend is created for each compiled submod
   ↓
7. A CUDA Graph is captured for each capture_size
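Steps 3 to 5 can be sketched on a flat list of op names. This is a simplified model under the assumption that the graph is a linear sequence; `assign_subgraph_ids` is a hypothetical helper, and the real implementation works on torch.fx nodes through `torch.fx.split_module()`.

```python
from typing import Dict, List, Tuple

SPLIT_OPS = {"sglang.gdn_with_output"}  # illustrative subset of the real list


def assign_subgraph_ids(nodes: List[str]) -> Tuple[Dict[int, List[str]], List[int]]:
    """Group a linear op sequence into pieces; each split op gets its own piece."""
    pieces: Dict[int, List[str]] = {}
    splitting_ids: List[int] = []
    subgraph_id = 0
    for node in nodes:
        if node in SPLIT_OPS:
            subgraph_id += 1  # close the piece before the split op
            pieces.setdefault(subgraph_id, []).append(node)
            splitting_ids.append(subgraph_id)
            subgraph_id += 1  # open a new piece after it
        else:
            pieces.setdefault(subgraph_id, []).append(node)
    return pieces, splitting_ids


toy_nodes = ["layernorm", "linear", "sglang.gdn_with_output", "residual_add", "moe"]
pieces, splitting_ids = assign_subgraph_ids(toy_nodes)

# Pieces that hold a split op run as-is; all other pieces are compiled and captured.
plan = {
    f"submod_{i}": "skip (split op)" if i in splitting_ids else "compile"
    for i in sorted(pieces)
}
print(pieces)
print(plan)
```

With one split op the sequence falls into three pieces, matching submod_0, submod_1, and submod_2 in the flow above.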

6.3 Comparison with Unified Attention

The two follow a similar pattern (both snippets are simplified):

# unified_attention.py (already existed)
def forward(self, hidden_states, forward_batch):
    if forward_batch.forward_mode.is_extend() and get_forward_context() is not None:
        torch.ops.sglang.unified_attention_with_output(
            hidden_states,
            self.output_buffer,
            self.layer_id,
        )
        return self.output_buffer
    else:
        return self._forward(hidden_states, forward_batch)

# qwen3_next.py (added by this PR)
def forward(self, hidden_states, forward_batch):
    output = torch.empty_like(hidden_states)  # preallocated
    if forward_batch.forward_mode.is_extend() and get_forward_context() is not None:
        torch.ops.sglang.gdn_with_output(
            hidden_states,
            output,
            self.layer_id,
        )
        return output
    else:
        return self._forward(hidden_states, forward_batch)

Differences:

Aspect	Unified Attention	GDN (Qwen3-Next)
Output buffer	Member variable `self.output_buffer`	Local variable from `torch.empty_like()`
Layer type	Standard attention	Linear RNN (Mamba-like)
Complexity in sequence length	O(n²)	O(n)
Where it is used	All models that use attention	Qwen3-Next only

7. Testing and Validation

7.1 Test Case

File: test/srt/models/test_qwen3_next_models.py:138-176

class TestQwen3NextPiecewiseCudaGraph(CustomTestCase):

    @classmethod
    def setUpClass(cls):
        cls.model = "Qwen/Qwen3-Next-80B-A3B-Instruct"
        cls.base_url = DEFAULT_URL_FOR_TEST
        cls.process = popen_launch_server(
            cls.model,
            cls.base_url,
            timeout=DEFAULT_TIMEOUT_FOR_SERVER_LAUNCH,
            other_args=[
                "--tp", "4",
                "--enable-piecewise-cuda-graph",          # ← enable piecewise graph
                "--piecewise-cuda-graph-compiler", "eager",  # ← use the eager compiler
            ],
        )

    def test_gsm8k(self):
        args = SimpleNamespace(
            num_shots=5,
            num_questions=200,
            max_new_tokens=512,
            parallel=128,
            # ...
        )
        metrics = run_eval(args)
        self.assertGreater(metrics["accuracy"], 0.93)  # ← correctness check

What the test covers:

Correctness
- Uses the GSM8K dataset
- Expects accuracy > 93%
- Ensures the performance work does not change model quality
Configuration
- TP=4 (Tensor Parallelism)
- Eager compiler (fast compilation)
- 80B model (validation at scale)

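7.2 Expected Performance

Theoretical estimates (expectations, not measurements):

Scenario	Eager	Piecewise Graph	Expected gain
Small batch (1-4 tokens)	1.0x	1.5-2.0x	Significant
Medium batch (8-32 tokens)	1.0x	1.3-1.7x	Noticeable
Large batch (64+ tokens)	1.0x	1.1-1.3x	Slight

Key factors:

The subgraphs around GDN contain many small kernels, so they benefit from fewer kernel launches
Qwen3-Next is a hybrid architecture, so many layers can benefit
For an 80B model, communication overhead is large and the advantage of graphs is more visible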

7.3 Debugging Tips

Enable debug mode:

python -m sglang.launch_server \
    --model Qwen/Qwen3-Next-80B-A3B-Instruct \
    --tp 4 \
    --enable-piecewise-cuda-graph \
    --enable-torch-compile-debug-mode  # ← checks that input addresses stay consistent

Inspect the captured graph:

ls ~/.cache/sglang/torch_compile_cache/*/rank_0_0_backbone/
# Output:
# computation_graph_1234567890.12.py

View the graph after splitting:

# Add this in the code
from torch._dynamo.utils import lazy_format_graph_code
# lazy_format_graph_code returns a lazily formatted string, so print it explicitly
print(lazy_format_graph_code("after split", split_gm))

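8. Summary and Best Practices

8.1 Implementation Summary

Core contributions:

Added Piecewise CUDA Graph support for Qwen3-Next
- Implemented through the gdn_with_output custom op
- Consistent with the existing unified_attention_with_output pattern
Fixed two potential bugs
- Uninitialized memory in chunk_o.py
- Dataclass memory accounting in memory_pool.py
Extended hybrid layer support
- model_runner.py now recognizes attn and linear_attn

8.2 Summary of the Implementation Pattern

Steps for adding piecewise graph support to a new special layer:

1. Identify the layer that must be isolated
   ↓
2. Implement a _with_output custom op
   ├─ The function that does the real work
   ├─ The fake function
   └─ direct_register_custom_op
   ↓
3. Modify the layer's forward method
   ├─ Preallocate the output tensor
   ├─ Detect piecewise mode
   └─ Call the custom op or _forward
   ↓
4. Add the op to the SPLIT_OPS list
   ↓
5. Update model_runner.py (if needed)
   └─ Register the layer in attention_layers
   ↓
6. Write tests to verify correctness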

8.3 Best Practices

8.3.1 Custom Op Design

DO:

# ✓ Use the _with_output pattern
def forward(self, x, ...):
    output = torch.empty_like(x)
    if get_forward_context() is not None:
        torch.ops.sglang.my_op_with_output(x, output, ...)
        return output
    return self._forward(x, ...)

# ✓ Look up the actual layer by layer_id
def my_op_with_output(x, output, layer_id):
    layer = get_forward_context().attention_layers[layer_id]
    ret = layer._forward(x)
    output.copy_(ret)

DON’T:

# ✗ Do not return a newly created tensor from the custom op
def my_op(x, layer_id):
    layer = get_forward_context().attention_layers[layer_id]
    return layer._forward(x)  # ← returns a new tensor whose address is not fixed

# ✗ Do not hard-code the layer reference
def my_op_with_output(x, output):
    global my_layer  # ← avoid global variables
    ret = my_layer._forward(x)
    output.copy_(ret)

8.3.2 Performance Tuning

1. Choose capture sizes sensibly

# Suggested configuration for Qwen3-Next
--piecewise-cuda-graph-tokens 1,2,4,8,16,32,64,128,256

2. Pick the compiler for the scenario

# Development and debugging: eager (fast compilation)
--piecewise-cuda-graph-compiler eager

# Production: inductor (better performance)
--piecewise-cuda-graph-compiler inductor

3. Monitor memory usage

# On OOM, lower the static memory fraction
--mem-fraction-static 0.8

# Or reduce the maximum capture size
--piecewise-cuda-graph-max-tokens 512

8.3.3 Debugging Tips

1. Verify graph splitting

import logging
logging.basicConfig(level=logging.DEBUG)
# Inspect the subgraph information logged after the split

2. Check the forward context

context = get_forward_context()
assert context is not None, "Not in piecewise graph context"
assert context.attention_layers is not None
assert len(context.attention_layers) > layer_id

3. Performance profiling

nsys profile -o profile.qdrep python -m sglang.launch_server ...

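8.4 Future Improvements

Automatic generation of custom ops
- Mark layers that need a split through an annotation
- Generate the _with_output wrapper automatically
Smarter split strategies
- Choose split points automatically from performance profiling
- Adjust the split ops list dynamically
Support for more models
- Apply this pattern to other hybrid-architecture models
- For example Jamba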

Appendix

A. Related Documents

SGLANG_PIECEWISE_ANALYSIS.md - overall technical document on Piecewise CUDA Graph
Qwen3-Next Model Card
PR #13081

B. Code Location Index

File	Key lines	Description
`compilation/backend.py`	28-32	SPLIT_OPS definition
`models/qwen3_next.py`	348-365	Dual stream disabling logic
`models/qwen3_next.py`	367-382	GDN forward method
`models/qwen3_next.py`	383-440	GDN _forward implementation
`models/qwen3_next.py`	1049-1066	gdn_with_output function
`models/qwen3_next.py`	1077-1082	Custom op registration
`model_executor/model_runner.py`	363-370	Hybrid layer support
`layers/attention/fla/chunk_o.py`	146	Empty-to-zeros fix
`mem_cache/memory_pool.py`	140-144	Dataclass memory accounting fix

C. Configuration Quick Reference

Parameter	Suggested value (Qwen3-Next)	Description
`--enable-piecewise-cuda-graph`	True	Enable piecewise graph
`--piecewise-cuda-graph-tokens`	1,2,4,8,16,32,64,128,256	Capture sizes
`--piecewise-cuda-graph-compiler`	eager/inductor	Compiler choice
`--enable-torch-compile-debug-mode`	False (production) / True (debugging)	Debug mode
`--mem-fraction-static`	0.8-0.9	Static memory fraction
`--tp`	≥2	Tensor parallelism