
gogongxt · 2026-04-20

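SGLang Mamba Radix Cache Overview

Background

Hybrid-architecture models such as Qwen3-Next contain both Linear Attention (Mamba) and Softmax Attention layers. A Linear Attention layer compresses the whole prefix into a fixed-size recurrent state instead of per-token KV entries, so prefix matching for these layers requires caching the SSM state and the Conv state. SGLang implements the Mamba Radix Cache to support this.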
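
To see why caching the state is enough for prefix reuse, here is a minimal sketch using a toy scalar recurrence. It is not SGLang code: the `scan` function and the `decay` parameter are made up for illustration, and a real layer also carries a Conv state (a short window of recent inputs) that is omitted here.

```python
from typing import List, Tuple


def scan(tokens: List[float], state: float = 0.0, decay: float = 0.5) -> Tuple[List[float], float]:
    """Toy linear recurrence: h_t = decay * h_{t-1} + x_t, with output y_t = h_t."""
    outputs = []
    for x in tokens:
        state = decay * state + x
        outputs.append(state)
    return outputs, state


prefix = [1.0, 2.0, 3.0]
suffix = [4.0, 5.0]

# Full recomputation over prefix + suffix.
full_out, _ = scan(prefix + suffix)

# Cache the state reached after the prefix, then resume on the suffix only.
_, cached_state = scan(prefix)
resumed_out, _ = scan(suffix, state=cached_state)

# Resuming from the cached state reproduces the outputs of the full run.
assert resumed_out == full_out[len(prefix):]
```

The cached value is a single state per prefix, not one entry per token, which is why a Mamba prefix cache stores state snapshots at specific positions.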

Related Commits and PRs

PR	Commit	Date	Description
#11214	a55cf5304	2025-10-13	[Feature] Support mamba radix cache v0 - initial implementation of the Mamba Radix Cache; establishes the basic `no_buffer` mode
#14792	e61dabf5e	2025-12-14	[Qwen3-next] support mamba radix cache for overlap scheduler - introduces the `extra_buffer` mode, with support for the overlap scheduler and a ping-pong buffering scheme
#15180	36fcf71ff	2025-12-16	[Qwen3-next] Add PD disaggregation support for mamba with extra_buffer - adds Prefill-Decode disaggregation support for `extra_buffer`

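Comparing the Two Scheduler Strategies

SGLang provides two Mamba scheduler strategies, selected with the `--mamba-scheduler-strategy` argument:

Strategy	Memory overhead	State copying	Best suited for
`no_buffer`	Baseline	Copies on every cache operation	Memory-constrained deployments, low hit rates
`extra_buffer`	3x	Copies only when the buffer is exhausted	High hit rates, performance-focused deployments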

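`no_buffer` mode

No extra buffer is used. Each time a state is cached, the Mamba state is copied into a new slot via `fork_from`.

Pros: low memory overhead
Cons: the state must be copied on every cache operation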
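
The copy-on-cache behavior can be sketched with a toy slot pool. This is an illustration only: `ToyMambaPool` is hypothetical and is not SGLang's `MambaPool`, and only the name `fork_from` comes from the text above; its signature here is an assumption.

```python
from typing import Dict, List, Tuple


class ToyMambaPool:
    """Hypothetical fixed-size pool of state slots (not SGLang's MambaPool)."""

    def __init__(self, num_slots: int) -> None:
        self.states: List[List[float]] = [[0.0] for _ in range(num_slots)]
        self.free: List[int] = list(range(num_slots))
        self.copies = 0  # counts how many state copies were performed

    def alloc(self) -> int:
        return self.free.pop()

    def fork_from(self, src: int) -> int:
        """Allocate a new slot and copy the state held in `src` into it."""
        dst = self.alloc()
        self.states[dst] = list(self.states[src])
        self.copies += 1
        return dst


pool = ToyMambaPool(num_slots=4)
req_slot = pool.alloc()
cache: Dict[Tuple[int, ...], int] = {}  # prefix tokens -> slot holding a snapshot

tokens = (11, 12, 13)
pool.states[req_slot] = [0.75]            # pretend this is the state after the prefix
cache[tokens] = pool.fork_from(req_slot)  # the snapshot is copied into its own slot

pool.states[req_slot][0] = 9.0            # the request keeps updating its own slot
```

In this sketch the cached snapshot is unaffected by later updates because it lives in a separate slot, at the cost of one copy per cache operation.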

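`extra_buffer` mode

This mode uses a ping-pong buffering scheme: each request is allocated 2 additional slots for tracking intermediate states. When a state is cached, ownership of the buffer is transferred directly, so no copy is needed.

Pros: avoids frequent state copies; supports the overlap scheduler
Cons: memory overhead is 3x that of `no_buffer`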
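
One possible reading of the ping-pong scheme is sketched below. Everything here is hypothetical (`ToyRequest`, `track`, `cache_latest`) and is not taken from SGLang. The sketch assumes each request holds one working slot plus two buffer slots, which would be consistent with the 3x memory figure, and that caching hands a buffer slot to the cache and replaces it with a fresh one.

```python
from typing import Dict, List, Optional, Tuple


class ToyRequest:
    """Hypothetical request: 1 working slot + 2 ping-pong buffer slots."""

    def __init__(self, free: List[int]) -> None:
        self.working: int = free.pop()
        self.buffers: List[int] = [free.pop(), free.pop()]
        self.next_buf: int = 0                   # buffer index to write next
        self.last_written: Optional[int] = None  # buffer index with the newest snapshot


def track(req: ToyRequest, states: Dict[int, List[float]]) -> None:
    """Stand-in for the forward pass writing an intermediate state into a buffer slot."""
    idx = req.next_buf
    states[req.buffers[idx]] = list(states[req.working])
    req.last_written = idx
    req.next_buf = 1 - idx  # alternate between the two buffers


def cache_latest(req: ToyRequest, cache: Dict[Tuple[str, ...], int],
                 key: Tuple[str, ...], free: List[int]) -> int:
    """Move ownership of the newest snapshot slot to the cache without copying state."""
    idx = req.last_written
    if idx is None:
        raise RuntimeError("no snapshot has been tracked yet")
    slot = req.buffers[idx]
    cache[key] = slot              # the cache now owns this slot
    req.buffers[idx] = free.pop()  # the request gets a replacement buffer slot
    req.last_written = None
    return slot


free = list(range(8))
states: Dict[int, List[float]] = {i: [0.0] for i in range(8)}
req = ToyRequest(free)
for value in (1.0, 2.0):
    states[req.working] = [value]
    track(req, states)

cache: Dict[Tuple[str, ...], int] = {}
key = ("a", "b")
slot = cache_latest(req, cache, key, free)
```

The point of the sketch is the last step: inserting into the cache only reassigns a slot index, while the state data stays where it was written.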

Key Source Files

File	Description
`python/sglang/srt/mem_cache/mamba_radix_cache.py`	Core Radix Cache logic: prefix matching, cache insertion and eviction
`python/sglang/srt/mem_cache/memory_pool.py`	`MambaPool` state storage: slot allocation, release, and copying
`python/sglang/srt/server_args.py`	Definitions of `--mamba-scheduler-strategy` and related arguments
`python/sglang/srt/managers/schedule_batch.py`	State-tracking logic (`_init_mamba_tracking`, `_mamba_prefix_cache_update`)

Configuration Parameters

Parameter	Default	Description
`--mamba-scheduler-strategy`	`auto`	Scheduler strategy: `no_buffer` or `extra_buffer`
`--mamba-track-interval`	`256`	State-tracking interval during the decode phase
`--max-mamba-cache-size`	`None`	Maximum size of the Mamba cache
`--mamba-ssm-dtype`	`float32`	Data type of the SSM state
`--mamba-full-memory-ratio`	`0.2`	Ratio of Mamba state memory to KV cache memory
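
As a rough illustration of what a Mamba-to-KV memory ratio implies, the arithmetic below splits a fixed budget so that Mamba memory equals the ratio times KV cache memory. The `split_budget` helper and the 60 GB figure are hypothetical, and treating the two pools as sharing one budget is an assumption, not a description of how SGLang applies `--mamba-full-memory-ratio`.

```python
from typing import Tuple


def split_budget(total_gb: float, mamba_to_kv_ratio: float) -> Tuple[float, float]:
    """Split `total_gb` so that mamba_gb == mamba_to_kv_ratio * kv_gb (assumed model)."""
    kv_gb = total_gb / (1.0 + mamba_to_kv_ratio)
    mamba_gb = total_gb - kv_gb
    return mamba_gb, kv_gb


# With a ratio of 0.2, a 60 GB budget gives roughly 10 GB of Mamba state
# memory and 50 GB of KV cache memory under this assumed model.
mamba_gb, kv_gb = split_budget(total_gb=60.0, mamba_to_kv_ratio=0.2)
```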

Launch Examples

# extra_buffer mode (recommended for workloads with a high prefix hit rate)
python -m sglang.launch_server \
    --model-path /path/to/Qwen3-Next \
    --mamba-scheduler-strategy extra_buffer \
    --mamba-track-interval 256

# no_buffer mode (recommended when memory is tight)
python -m sglang.launch_server \
    --model-path /path/to/Qwen3-Next \
    --mamba-scheduler-strategy no_buffer
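
If the server is launched from a script, the same commands can be assembled in Python. The `build_launch_cmd` helper below is hypothetical. It uses only the flags shown above and accepts only the two strategy names listed in this post.

```python
from typing import List, Optional

VALID_STRATEGIES = ("no_buffer", "extra_buffer")


def build_launch_cmd(model_path: str, strategy: str,
                     track_interval: Optional[int] = None) -> List[str]:
    """Build the argv list for the launch commands shown above."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--mamba-scheduler-strategy", strategy,
    ]
    if track_interval is not None:
        cmd += ["--mamba-track-interval", str(track_interval)]
    return cmd


cmd = build_launch_cmd("/path/to/Qwen3-Next", "extra_buffer", track_interval=256)
# The list can be passed to subprocess.run(cmd) to start the server.
```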
