EAGLE Parameter Configuration

gogongxt, 2026-01-14

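EAGLE Speculative Decoding: Parameters and Principles

This article gives a structured walkthrough of the core parameters, metrics, and overall flow of EAGLE speculative decoding, and explains implementation details by pointing to the key code locations.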

NOTE

Original EAGLE documentation: https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding

To enable EAGLE speculative decoding the following parameters are relevant:

speculative_draft_model_path: Specifies draft model. This parameter is required.
speculative_num_steps: Depth of autoregressive drafting. Increases speculation range but risks rejection cascades. Default is 5.
speculative_eagle_topk: Branching factor per step. Improves candidate diversity and leads to a higher acceptance rate, but also to higher memory/compute consumption. Default is 4.
speculative_num_draft_tokens: Maximum parallel verification capacity. Allows deeper tree evaluation but will lead to higher GPU memory usage. Default is 8.

These parameters are the same for EAGLE-2 and EAGLE-3.

You can find the best combinations of these parameters with bench_speculative.py.

In the documentation below, we set --cuda-graph-max-bs to be a small value for faster engine startup. For your own workloads, please tune the above parameters together with --cuda-graph-max-bs, --max-running-requests, --mem-fraction-static for the best performance.

I. Core Parameters in Detail

1. `speculative_num_steps` (default: 5)

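Meaning: the depth of autoregressive drafting, i.e. the number of consecutive forward steps the draft model runs in each round
How it works:
- In each verification cycle, the draft model predicts several tokens in a row
- These tokens form a speculation chain
Trade-offs:
- ✅ A larger value widens the speculation range and can, in theory, raise throughput
- ❌ Too large a value lowers the accept rate
- Once any step is rejected, all tokens after it are discarded (rejection cascade)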
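
The diminishing return from a deeper chain can be illustrated with a small model. This is a sketch under a strong simplifying assumption that is not taken from SGLang: every draft token is accepted independently with the same probability `p`. The function name `expected_accept_length` is hypothetical.

```python
def expected_accept_length(num_steps: int, p: float) -> float:
    """Expected tokens emitted per verify pass for a single draft chain.

    Assumption (illustrative only): each draft token is accepted
    independently with probability p, and a rejection discards every
    later token in the chain (the rejection cascade).
    """
    # Draft token k survives only if tokens 1..k were all accepted: p ** k.
    expected_drafts = sum(p ** k for k in range(1, num_steps + 1))
    # The target model always contributes one token of its own (bonus token).
    return expected_drafts + 1.0


# Each extra step adds p ** k, which shrinks quickly as k grows.
for steps in (1, 3, 5, 8):
    print(steps, round(expected_accept_length(steps, 0.7), 3))
```

Under this model the gain from each additional step shrinks geometrically, while the drafting cost of that step stays roughly constant, which is one way to read the trade-off listed above.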

Key code (eagle_worker.py:599–609):

for i in range(self.speculative_num_steps):
    input_ids, hidden_states, scores, tree_info = select_top_k_tokens(
        i, topk_p, topk_index, hidden_states, scores, self.topk
    )
    # ...
    if i == self.speculative_num_steps - 1:
        break  # the last step does not need a forward pass

2. `speculative_eagle_topk` (default: 4)

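Meaning: the branching factor at each step
- That is, the number of top-k candidate tokens selected from the draft model's logits
How it works:
- EAGLE uses a tree-shaped speculation structure
- topk sets the number of branches per node
Trade-offs:
- ✅ A larger topk increases candidate diversity and raises the accept rate
- ❌ It significantly increases GPU memory and compute cost (the KV cache grows in proportion to topk)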

Key code (eagle_worker.py:378):

req.kv_allocated_len += self.speculative_num_steps * self.topk
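
To get a feel for how this allocation scales, the arithmetic in the line above can be wrapped in a tiny helper. The function `draft_kv_slots` is hypothetical, and multiplying by the batch size is an assumption that every running request reserves its own draft slots.

```python
def draft_kv_slots(num_steps: int, topk: int, batch_size: int = 1) -> int:
    """Extra KV cache slots reserved for drafting.

    Mirrors `num_steps * topk` from the excerpt above; scaling by
    batch_size is an illustrative assumption, not taken from the source.
    """
    per_request = num_steps * topk
    return per_request * batch_size


# Compare a wide tree (5 steps, topk=4) with a plain chain (3 steps, topk=1).
print(draft_kv_slots(5, 4), draft_kv_slots(3, 1))
```

With the same number of steps, moving from topk = 1 to topk = 4 quadruples the reserved slots, which matches the proportional growth described above.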

3. `speculative_num_draft_tokens` (default: 8)

Meaning:
- The upper limit on the total number of draft tokens that the target model verifies in parallel in one pass

Rule of thumb:

When topk = 1, typically:

speculative_num_draft_tokens ≈ speculative_num_steps + 1

How it works:
- Determines the batch size of the verify phase
Trade-offs:
- ✅ A larger value allows a deeper/wider tree to be verified
- ❌ Higher GPU memory usage
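
The rule of thumb for topk = 1 can be turned into a quick configuration check. The helper below is hypothetical and only encodes the relation stated above; it says nothing about how SGLang itself validates these arguments.

```python
from typing import Optional


def chain_config_is_consistent(
    num_steps: int, topk: int, num_draft_tokens: int
) -> Optional[bool]:
    """Check speculative_num_draft_tokens against the topk = 1 rule of thumb.

    Returns None when topk != 1, because the rule of thumb in the text
    only covers the single-chain case.
    """
    if topk != 1:
        return None
    # A chain of num_steps draft tokens plus one more position.
    return num_draft_tokens == num_steps + 1


print(chain_config_is_consistent(3, 1, 4))
print(chain_config_is_consistent(5, 1, 8))
```

With a single chain there are no sibling branches to pick from, so a verification budget larger than the chain itself has nothing extra to verify.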

II. The Accept Length Metric in Detail

1. Formula

Code location (scheduler_metrics_mixin.py:276–278):

spec_accept_length = (
    self.spec_num_accepted_tokens / self.spec_num_forward_ct
)

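2. What Accept Length Means

Definition:
- The average number of tokens accepted per forward pass (including the bonus token)
Components:
- Numerator: total number of accepted tokens
- Denominator: number of forward passes (counted at the batch level)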

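3. Interpretation

accept_length > 1:
- Speculative decoding gives a real speedup
- Each forward pass produces more than one token on average
accept_length = 1:
- Equivalent to ordinary autoregressive decoding (no speedup)
accept_length < 1:
- Performance regression (in theory this rarely occurs)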

4. Is It a Per-Request Metric?

❌ No
- It is an aggregate metric at the batch level
✅ Per-request statistics do exist underneath:
- accept_length_per_req_cpu
- Code location: eagle_info.py:220

Per-request accept statistics (eagle_info.py:399–401):

req.spec_accepted_tokens += (
    sum(1 for idx in accept_index_row if idx != -1) - 1
)
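
The counting expression above can be tried on a toy row. This sketch assumes that `-1` marks unused padding slots in an accept-index row and that the trailing `- 1` removes the one token every verify step emits regardless of drafting; both readings are inferred from the excerpt, not confirmed by it. The function name is hypothetical.

```python
from typing import Sequence


def accepted_draft_tokens(accept_index_row: Sequence[int]) -> int:
    """Count accepted draft tokens in one request's accept-index row.

    Same expression as the excerpt: count entries that are not -1
    (assumed padding), then subtract one.
    """
    return sum(1 for idx in accept_index_row if idx != -1) - 1


# Three valid entries and two padding slots.
row = [0, 2, 5, -1, -1]
print(accepted_draft_tokens(row))
```

A row with a single valid entry yields zero, which corresponds to a verify step where no draft token was accepted.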

5. Accept Rate vs Accept Length

Accept Rate:

accepted tokens / total draft tokens

Accept Length:

accepted tokens / number of forward passes

Accept Rate code (scheduler_metrics_mixin.py:279–290):

spec_accept_rate = (
    self.spec_num_accepted_tokens / total_draft_tokens
    if total_draft_tokens > 0
    else 0
)
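
The two metrics share a numerator but answer different questions, which a side-by-side computation makes visible. This is a standalone sketch: the function `spec_metrics` and the sample counter values are made up for illustration, and the zero guard on the forward count is an addition that the excerpts above do not show.

```python
from typing import Tuple


def spec_metrics(
    num_accepted_tokens: int, num_forward_ct: int, total_draft_tokens: int
) -> Tuple[float, float]:
    """Return (accept_length, accept_rate) from aggregate counters."""
    # Accept length: accepted tokens per forward pass.
    accept_length = (
        num_accepted_tokens / num_forward_ct if num_forward_ct > 0 else 0.0
    )
    # Accept rate: accepted tokens per proposed draft token.
    accept_rate = (
        num_accepted_tokens / total_draft_tokens if total_draft_tokens > 0 else 0.0
    )
    return accept_length, accept_rate


# Hypothetical counters: 240 accepted tokens, 100 forwards, 700 draft tokens.
length, rate = spec_metrics(240, 100, 700)
print(f"accept_length={length:.2f} accept_rate={rate:.2%}")
```

A configuration can show a healthy accept length while its accept rate is low, for example when a wide tree proposes many tokens and only one path through it can ever be accepted.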

III. Core Flow of EAGLE Speculative Decoding

Reference code (eagle_worker.py:259–313)

1. Draft Phase (`draft()`)

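A lightweight draft model generates a tree of candidate tokens
At each step:
- Select the topk tokens from the logits
- Repeat the forward pass for num_steps steps
Build a tree-shaped attention mask for parallel verification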
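
A tree-shaped attention mask lets each candidate attend only to itself and its ancestors, so sibling branches do not see one another during parallel verification. The sketch below builds such a mask from a parent-index list; the representation (`parents[i]` is the parent of node `i`, `-1` for the root) and the function name are assumptions for illustration, not the layout used in eagle_worker.py.

```python
from typing import List


def build_tree_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when node i may attend to node j.

    A node attends to itself and to every ancestor on its path to the root.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up until we step past the root
            mask[i][j] = True
            j = parents[j]
    return mask


# Root 0 with children 1 and 2; node 3 is a child of node 1.
for row in build_tree_mask([-1, 0, 0, 1]):
    print(["x" if allowed else "." for allowed in row])
```

Node 3 sees nodes 0, 1, and 3 but not node 2, which is what allows every branch of the tree to be scored in a single forward pass.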

2. Verify Phase (`verify()`)

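The target model verifies all candidate tokens in parallel
Supported:
- Greedy sampling
- Rejection sampling
For the verification and acceptance logic, see eagle_info.py:191–351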
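
The greedy case can be sketched in a few lines: starting at the root, follow the child whose token equals the target model's own prediction, and stop at the first mismatch. This is a simplified illustration, not the logic in eagle_info.py; the tree layout (`tokens`, `parents`) and `target_next` (the target model's greedy next token after each node) are hypothetical inputs.

```python
from typing import Dict, List, Tuple


def greedy_verify(
    tokens: List[int], parents: List[int], target_next: List[int]
) -> Tuple[List[int], int]:
    """Return (accepted draft tokens, bonus token) for one draft tree.

    Node 0 is the root (the last already-verified token).
    """
    children: Dict[int, List[int]] = {}
    for node, parent in enumerate(parents):
        if parent >= 0:
            children.setdefault(parent, []).append(node)

    accepted: List[int] = []
    current = 0
    while True:
        # Accept the child that matches what the target would have generated.
        match = next(
            (c for c in children.get(current, []) if tokens[c] == target_next[current]),
            None,
        )
        if match is None:
            break  # everything below this point is discarded
        accepted.append(tokens[match])
        current = match
    # The target's prediction at the last accepted node comes for free.
    return accepted, target_next[current]


tokens = [10, 11, 12, 13, 14]
parents = [-1, 0, 0, 1, 2]
target_next = [11, 13, 99, 50, 0]
print(greedy_verify(tokens, parents, target_next))
```

Rejection sampling replaces the exact-match test with a probabilistic acceptance test, and the prefix-only structure of the walk stays the same.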

3. Update Phase

Release the KV cache of rejected tokens
Update the draft model's hidden states
Prepare for the next round of speculative decoding
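
The first of these steps amounts to partitioning the slots that were reserved for verification into those to keep and those to release. The helper below is a hypothetical sketch of that bookkeeping; the names `slot_ids` and `accept_index`, and the use of `-1` as padding, are assumptions rather than SGLang's actual data structures.

```python
from typing import List, Sequence, Tuple


def split_kv_slots(
    slot_ids: Sequence[int], accept_index: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Split reserved KV slots into (kept, freed).

    slot_ids[i] is the cache slot of draft position i; accept_index lists
    the accepted positions, padded with -1.
    """
    keep_positions = {idx for idx in accept_index if idx != -1}
    kept = [s for pos, s in enumerate(slot_ids) if pos in keep_positions]
    freed = [s for pos, s in enumerate(slot_ids) if pos not in keep_positions]
    return kept, freed


kept, freed = split_kv_slots([100, 101, 102, 103, 104], [0, 1, 3, -1, -1])
print("kept:", kept, "freed:", freed)
```

Returning the freed slots promptly matters because the draft phase of the next round needs to reserve a fresh set.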

IV. Automatic Parameter Selection

Code location (server_args.py:4267–4295):

def auto_choose_speculative_params(self: ServerArgs):
    if arch in ["LlamaForCausalLM"]:
        return (5, 4, 8)  # num_steps=5, topk=4, draft_tokens=8
    elif arch in ["DeepseekV2ForCausalLM", "GptOssForCausalLM", ...]:
        return (3, 1, 4)  # more conservative parameters
    else:
        return (5, 4, 8)
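
The excerpt above is abridged, so it does not run on its own. A self-contained sketch of the same decision, taking the architecture name as an explicit argument, might look as follows. The function name and the `CONSERVATIVE_ARCHS` set are hypothetical, and the set contains only the two names visible in the excerpt; the original list is elided there and is not reproduced here.

```python
from typing import Tuple

# Only the architectures named in the excerpt; the source elides the rest.
CONSERVATIVE_ARCHS = {"DeepseekV2ForCausalLM", "GptOssForCausalLM"}


def choose_speculative_params(arch: str) -> Tuple[int, int, int]:
    """Return (num_steps, topk, num_draft_tokens) for a model architecture."""
    if arch in CONSERVATIVE_ARCHS:
        return (3, 1, 4)  # single chain, shallow depth
    # LlamaForCausalLM and the fallback branch share the same defaults.
    return (5, 4, 8)


print(choose_speculative_params("LlamaForCausalLM"))
print(choose_speculative_params("DeepseekV2ForCausalLM"))
```

Note that the conservative tuple satisfies the topk = 1 rule of thumb from section I: 4 draft tokens for 3 steps.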

V. Monitoring and Tuning Suggestions

Directions to try when Accept Length is low

Lower speculative_num_steps
Raise speculative_eagle_topk
Check how well the draft model matches the target model