TokenizerManager
Streaming Response Architecture Explained
Overview:
sequenceDiagram
participant User as User
participant API as FastAPI Server
participant TM as TokenizerManager
participant Router as Router Process
participant Model as Model RPC
participant Detok as Detokenizer
User->>API: POST /generate (GenerateReqInput)
API->>API: obj.post_init()
API->>TM: generate_request(obj)
TM->>TM: create handle_loop on first request
TM->>TM: tokenizer.encode(text)
TM->>TM: ...
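The first steps of the diagram (lazy creation of `handle_loop` on the first request, tokenization, then streaming results back per request) can be sketched with plain `asyncio`. This is a minimal illustration, not SGLang's code: `ToyTokenizerManager`, `toy_backend`, the two queues, and the character-level "tokenizer" are all hypothetical stand-ins for the real manager, the Router/Model/Detokenizer processes, and their IPC channels.

```python
import asyncio


class ToyTokenizerManager:
    """Hypothetical stand-in; only mirrors the flow shown in the diagram."""

    def __init__(self):
        self.to_backend = asyncio.Queue()    # requests sent to the toy backend
        self.from_backend = asyncio.Queue()  # (rid, chunk, finished) coming back
        self.rid_to_state = {}               # per-request buffer + wake-up event
        self.loop_task = None                # receive loop, created lazily

    async def handle_loop(self):
        # Dispatch every backend message to the request that is waiting for it.
        while True:
            rid, chunk, finished = await self.from_backend.get()
            state = self.rid_to_state[rid]
            state["chunks"].append((chunk, finished))
            state["event"].set()

    async def generate_request(self, rid, text):
        # The first request creates the background receive loop.
        if self.loop_task is None:
            self.loop_task = asyncio.create_task(self.handle_loop())
        state = {"chunks": [], "event": asyncio.Event()}
        self.rid_to_state[rid] = state
        token_ids = [ord(ch) for ch in text]  # toy replacement for tokenizer.encode
        await self.to_backend.put((rid, token_ids))
        finished = False
        while not finished:
            await state["event"].wait()
            state["event"].clear()
            pending, state["chunks"] = state["chunks"], []
            for chunk, finished in pending:
                yield chunk  # stream each piece to the caller as it arrives
        del self.rid_to_state[rid]


async def toy_backend(tm):
    # Stand-in for router + model + detokenizer: echoes the prompt back
    # one character at a time (assumes a non-empty prompt).
    while True:
        rid, token_ids = await tm.to_backend.get()
        for i, tok in enumerate(token_ids):
            await tm.from_backend.put((rid, chr(tok), i == len(token_ids) - 1))


async def collect(text):
    tm = ToyTokenizerManager()
    backend = asyncio.create_task(toy_backend(tm))
    chunks = [chunk async for chunk in tm.generate_request("req-0", text)]
    backend.cancel()
    tm.loop_task.cancel()
    return chunks


def run_demo(text="hi!"):
    return asyncio.run(collect(text))
```

The per-request event lets many concurrent callers share one receive loop: each caller sleeps on its own event and only wakes when a message carrying its request id arrives.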
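MLP Computation Flow
Let's first look at the classic MLP computation shown in the figure above:
The gate and up projections can be concatenated and computed in a single matmul.
A SiLU activation follows the gate projection; the activated value is multiplied element-wise with the up projection output, and these two operations are also fused into one step.
The product is fed to down_proj, whose output is the final result.
For the non-MoE MLP, Qwen2 and Qwen3 both use the same class, Qwen2MLP.
The core computations, MergedColumnParallelLinear and RowParallelLinear, are ordinary linear layers (torch.nn.functional.linear); under TP, the weight matrices are simply partitioned into blocks across ranks.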
class Qwen2MLP(nn.Module):
    def __init__(
        self,
        hidden_size: int,
        intermediate_size: int,
        hidden_act: str,
        quant_config: Optional[Qu ...
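To make the flow concrete, here is a framework-free sketch of the gated MLP in pure Python. It is not the SGLang implementation: the function names are hypothetical, weights are stored as `[in][out]` for readability (unlike `nn.Linear`), and the "all-reduce" is just a local sum. It checks two things: merging the gate and up weights gives the same result as two separate projections, and splitting the intermediate dimension across ranks (column-parallel for gate/up, row-parallel for down) followed by a sum reproduces the unsharded output.

```python
import math


def matmul(x, w):
    """Vector-matrix product: x has length `in`, w is [in][out]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def silu(v):
    return v / (1.0 + math.exp(-v))


def silu_and_mul(gate_up):
    """Fused step: SiLU on the gate half, times the up half."""
    half = len(gate_up) // 2
    return [silu(g) * u for g, u in zip(gate_up[:half], gate_up[half:])]


def mlp_unfused(x, w_gate, w_up, w_down):
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return matmul([silu(g) * u for g, u in zip(gate, up)], w_down)


def mlp_fused(x, w_gate, w_up, w_down):
    # Concatenate gate and up along the output dimension: one matmul for both.
    w_gate_up = [row_g + row_u for row_g, row_u in zip(w_gate, w_up)]
    return matmul(silu_and_mul(matmul(x, w_gate_up)), w_down)


def mlp_tensor_parallel(x, w_gate, w_up, w_down, tp_size=2):
    shard = len(w_down) // tp_size  # intermediate size per rank
    out = [0.0] * len(w_down[0])
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        # Column-parallel: each rank owns a slice of the gate/up output columns.
        w_gate_r = [row[lo:hi] for row in w_gate]
        w_up_r = [row[lo:hi] for row in w_up]
        # Row-parallel: each rank owns the matching rows of down_proj.
        w_down_r = w_down[lo:hi]
        partial = mlp_fused(x, w_gate_r, w_up_r, w_down_r)
        # Stand-in for the all-reduce that sums partial outputs across ranks.
        out = [a + b for a, b in zip(out, partial)]
    return out
```

Because SiLU and the element-wise product act independently on each intermediate channel, no communication is needed between the two projections; the only cross-rank step is the final sum.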
SGLang Source Code Analysis: Cover
Process Structure
With tp=1 there are three processes in total.
Launch command: python3 -m sglang.launch_server --model-path /tmp-data/models/llama-2-7b --port 30000 --mem-fraction-static 0.8 --tp 1
View the process tree: ps -aux --forest
luban 3049112 21.6 0.0 7700748 758940 pts/6 Sl+ 11:48 0:09 | \_ python3 -m sglang.launch_server --model-path /tmp-data/models/llama-2-7b --port 30000 --mem-fraction-static 0.8 --tp 1
luban 3052085 32.5 0.0 56482984 769368 pts/6 Sl+ 11:48 0:06 | \_ python3 -m sglang.launch_server --model-path /tm ...
{
"architectures": [
"Qwen3NextForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"decoder_sparse_step": 1,
"eos_token_id": 151645,
"full_attention_interval": 4,
"head_dim": 256,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 5120,
"linear_conv_kernel_dim": 4,
"linear_key_head_dim": 128,
"linear_num_key_heads": 16,
"line ...
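Qwen3-Next Architecture
The model has 48 layers in total, divided into 12 groups of 4 layers each.
Within each group, the first 3 layers use Gated DeltaNet linear attention, which significantly improves compute efficiency and reduces GPU memory usage.
The 4th layer uses conventional full softmax self-attention, with an additional gate applied at the output.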
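The 3:1 layout can be reproduced from two numbers: the layer count (48) and a `full_attention_interval` of 4. The helper below is a hypothetical sketch, assuming every layer whose 1-based index is a multiple of the interval is a full-attention layer and all others are linear-attention layers; the function name and the two type labels are illustrative choices.

```python
def build_layer_types(num_layers=48, full_attention_interval=4):
    """Return one attention-type label per layer (0-based list index)."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    types = build_layer_types()
    # Print the first group of 4 layers to show the repeating pattern.
    for idx, kind in enumerate(types[:4]):
        print(idx, kind)
```

With the defaults this yields 36 linear-attention layers and 12 full-attention layers, one full-attention layer closing each group of four.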
Gated DeltaNet Analysis
For the Gated DeltaNet computation, this figure gives a more detailed view:
Computation order:
Five linear matmuls
1D convolution
SiLU activation
Gated Delta Rule
Zero-Centered RMSNorm applied to the Gated Delta Rule output
Element-wise multiplication of the RMSNorm output with the result of ...
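For the Gated Delta Rule step, the recurrence can be written out token by token. The sketch below is a single-head, pure-Python illustration under stated assumptions, not the transformers or SGLang kernel (those are batched and chunked): `alpha` is a per-token decay in (0, 1], `beta` is a per-token write strength in [0, 1], and the state `S` is a `[key_dim][value_dim]` matrix. The function name and argument layout are hypothetical.

```python
def gated_delta_rule(qs, ks, vs, alphas, betas):
    """Recurrent form: decay the state, then correct what it stores for key k."""
    dk, dv = len(ks[0]), len(vs[0])
    state = [[0.0] * dv for _ in range(dk)]  # S, shape [dk][dv]
    outputs = []
    for q, k, v, alpha, beta in zip(qs, ks, vs, alphas, betas):
        # 1) Gate: decay the whole memory.
        state = [[alpha * s for s in row] for row in state]
        # 2) Read what the memory currently predicts for this key.
        pred = [sum(state[i][j] * k[i] for i in range(dk)) for j in range(dv)]
        # 3) Delta rule: write only the prediction error, scaled by beta.
        delta = [beta * (v[j] - pred[j]) for j in range(dv)]
        for i in range(dk):
            for j in range(dv):
                state[i][j] += k[i] * delta[j]
        # 4) Query the updated memory.
        outputs.append([sum(state[i][j] * q[i] for i in range(dk)) for j in range(dv)])
    return outputs
```

With a unit-norm key, `alpha = 1`, and `beta = 1`, writing a new value under the same key replaces the old one instead of accumulating, which is the property that distinguishes the delta rule from plain linear attention.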
The code below comes mainly from the transformers library, with some parts trimmed.
1. Compute the linear matmuls
self.config = config
self.h ...
The Smart Router Strategy in sgl-router
Why sgl-router is needed
flowchart LR
subgraph Client["💻 Clients"]
C1["Client 3"]
C2["Client 1"]
C3["Client 2"]
end
C1 --> RT
C2 --> RT
C3 --> RT
subgraph Router["sglang-router"]
RT["🔀 Router"]
end
subgraph PCluster["Prefill Workers (P-nodes)"]
P1["P1"]
P2["P2"]
P3["P3"]
end
subgraph DCluster["Decode Workers (D-nodes)"]
D1["D1"]
D2["D2"]
D3 ...
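Git Commit Merge Guide
Scenario
When you need to fold the currently staged changes into a specific commit in the history (rather than the latest one), the standard git commit --amend is not enough, because it can only modify the latest commit.
Step-by-Step Procedure
1. Check the current state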
# List the currently staged files
git status
# Show the commit history to identify the target commit
git log --oneline -5
Example output:
34e0fcd add click link to avatar # latest commit
f42ce7a change author description
af1ae21 add head href link # target commit
bdc4b45 fix hexo-blog-encrypt show giscus comments
fbb3532 add comment giscus
2. Roll back to the target commit
First, return to the state that contains the target commit:
# Roll back to the target commit
gi ...













