Debugging numerical precision with dump-tensor

gogongxt
Saving every layer's hidden states for precision comparison
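Goal
Compare the per-layer hidden-state outputs of two different launch configurations to find the root cause of a precision discrepancy.
Approach overview
Use SGLang's existing tensor_dump_forward_hook mechanism to save all intermediate tensors automatically, then compare the two sets of dumps with the dump_comparator tool.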
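The general idea behind a forward-hook dump can be sketched in plain PyTorch. This is a minimal illustration under assumptions, not SGLang's actual tensor_dump_forward_hook code: the helper attach_dump_hooks and the toy nn.Sequential model are hypothetical, and only plain tensor outputs are recorded.

```python
import torch
import torch.nn as nn


def attach_dump_hooks(model: nn.Module, store: dict) -> list:
    """Register a forward hook on every named submodule and record its output."""
    handles = []
    for name, module in model.named_modules():
        if name == "":
            continue  # skip the root module itself

        def hook(mod, inputs, output, name=name):
            # Only plain tensor outputs are recorded in this sketch.
            if isinstance(output, torch.Tensor):
                store[name] = output.detach().cpu().clone()

        handles.append(module.register_forward_hook(hook))
    return handles


# Toy model standing in for a real transformer (hypothetical).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
captured = {}
handles = attach_dump_hooks(model, captured)
model(torch.randn(2, 8))
for handle in handles:
    handle.remove()  # detach the hooks once the pass has been captured

for name, tensor in captured.items():
    print(name, tuple(tensor.shape))
```

Each forward pass fills the dictionary with one entry per submodule, which is the same shape of data (name to tensor) that the per-pass .pt files in this article contain.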
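Key files
- python/sglang/srt/debug_utils/tensor_dump_forward_hook.py: the automatic dump mechanism
- python/sglang/srt/debug_utils/dump_comparator.py: the comparison tool
- python/sglang/srt/server_args.py: command-line argument definitions
Implementation steps
Step 1: Prepare the test script
Create the test script test_hidden_states.py: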
import requests

# The port must match the --port passed to the server (8055 in Step 2).
url = "http://localhost:8055/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "qwen3next",
    "messages": [
        {"role": "user", "content": "你好,请介绍一下你自己。"}
    ],
    "max_tokens": 3,
    "temperature": 0.0,  # deterministic (greedy) sampling
    "top_p": 1.0,
}
response = requests.post(url, headers=headers, json=data, stream=False)
print(response.json())

Step 2: Launch method 1 (baseline)
# Launch the SGLang server
python3 -m sglang.launch_server --model-path /data/models/Qwen3-Next-80B-A3B-Instruct \
    --host 0.0.0.0 --port 8055 \
    --trust-remote-code --log-requests --collect-tokens-histogram --enable-metrics --enable-cache-report \
    --tensor-parallel-size 4 --mem-fraction-static 0.8 \
    --chunked-prefill-size 8192 \
    --tool-call-parser qwen \
    --log-requests-level 0 \
    --disable-cuda-graph \
    --debug-tensor-dump-output-folder ~/sglang_baseline

Important parameters:
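- --debug-tensor-dump-output-folder: the directory the dumps are written to
- --disable-cuda-graph: required; tensor dump does not work while CUDA graph is enabled
- --skip-server-warmup: skips warmup to avoid extra forward passes (not strictly necessary)
- (optional) --debug-tensor-dump-layers 0 1 2 3 ...: dump only the listed layers to reduce the amount of data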
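Because the comparison is only meaningful when the two runs differ in exactly the setting under test, it can help to build both launch commands from one shared argument list. The sketch below is one way to do that; build_cmd is a hypothetical helper, the flags are the ones used in this article, and the extra argument given to the target run is only a placeholder for whatever the second launch method actually changes.

```python
import os
import shlex

# Arguments shared by both runs (taken from the launch command above).
COMMON_ARGS = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", "/data/models/Qwen3-Next-80B-A3B-Instruct",
    "--host", "0.0.0.0", "--port", "8055",
    "--tensor-parallel-size", "4",
    "--disable-cuda-graph",  # tensor dump needs CUDA graph disabled
]


def build_cmd(dump_dir, extra_args=()):
    """Return the argv list for one run; only extra_args and dump_dir vary."""
    dump_dir = os.path.expanduser(dump_dir)  # "~" is not expanded outside a shell
    return COMMON_ARGS + list(extra_args) + ["--debug-tensor-dump-output-folder", dump_dir]


baseline_cmd = build_cmd("~/sglang_baseline")
# Placeholder difference: replace with the flag(s) of the second launch method.
target_cmd = build_cmd("~/sglang_target", extra_args=["--chunked-prefill-size", "8192"])

print(shlex.join(baseline_cmd))
print(shlex.join(target_cmd))
```

The printed commands can be pasted into a shell, or the lists can be passed to subprocess.Popen directly.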
Send the test request from another terminal:

python test_hidden_states.py

The server log then shows:
[2026-01-23 12:32:34 TP0] Dump 00000th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00000.pt
[2026-01-23 12:32:34 TP3] Dump 00000th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00000.pt
[2026-01-23 12:32:34 TP2] Dump 00000th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00000.pt
[2026-01-23 12:32:34 TP1] Dump 00000th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00000.pt
[2026-01-23 12:32:37 TP0] Prefill batch, #new-seq: 1, #new-token: 2048, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0,
[2026-01-23 12:32:40 TP0] Dump 00001th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00001.pt
[2026-01-23 12:32:40 TP2] Dump 00001th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00001.pt
[2026-01-23 12:32:40 TP3] Dump 00001th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00001.pt
[2026-01-23 12:32:40 TP1] Dump 00001th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00001.pt
[2026-01-23 12:32:43 TP0] Prefill batch, #new-seq: 1, #new-token: 2048, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0,
[2026-01-23 12:32:45 TP2] Dump 00002th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00002.pt
[2026-01-23 12:32:45 TP1] Dump 00002th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00002.pt
[2026-01-23 12:32:45 TP3] Dump 00002th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00002.pt
[2026-01-23 12:32:45 TP0] Dump 00002th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00002.pt
[2026-01-23 12:32:48 TP0] Prefill batch, #new-seq: 1, #new-token: 2048, #cached-token: 0, full token usage: 0.01, mamba usage: 0.00, #running-req: 0, #queue-req: 0,
[2026-01-23 12:32:50 TP2] Dump 00003th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00003.pt
[2026-01-23 12:32:50 TP1] Dump 00003th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00003.pt
[2026-01-23 12:32:50 TP3] Dump 00003th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00003.pt
[2026-01-23 12:32:50 TP0] Dump 00003th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00003.pt
[2026-01-23 12:32:53 TP0] Prefill batch, #new-seq: 1, #new-token: 126, #cached-token: 0, full token usage: 0.01, mamba usage: 0.00, #running-req: 0, #queue-req: 0,
[2026-01-23 12:32:54 TP0] Dump 00004th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00004.pt
[2026-01-23 12:32:54 TP2] Dump 00004th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00004.pt
[2026-01-23 12:32:54 TP1] Dump 00004th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00004.pt
[2026-01-23 12:32:54 TP3] Dump 00004th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00004.pt
[2026-01-23 12:32:54 TP2] Dump 00005th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00005.pt
[2026-01-23 12:32:54 TP3] Dump 00005th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00005.pt
[2026-01-23 12:32:54 TP1] Dump 00005th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00005.pt
[2026-01-23 12:32:54 TP0] Dump 00005th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00005.pt
[2026-01-23 12:32:55 TP2] Dump 00006th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00006.pt
[2026-01-23 12:32:55 TP1] Dump 00006th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00006.pt
[2026-01-23 12:32:55 TP3] Dump 00006th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00006.pt
[2026-01-23 12:32:55 TP0] Dump 00006th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00006.pt
[2026-01-23 12:32:55 TP2] Dump 00007th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00007.pt
[2026-01-23 12:32:55 TP1] Dump 00007th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00007.pt
[2026-01-23 12:32:55 TP3] Dump 00007th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00007.pt
[2026-01-23 12:32:55 TP0] Dump 00007th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00007.pt
[2026-01-23 12:32:55 TP2] Dump 00008th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00008.pt
[2026-01-23 12:32:55 TP3] Dump 00008th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00008.pt
[2026-01-23 12:32:55 TP1] Dump 00008th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00008.pt
[2026-01-23 12:32:55 TP0] Dump 00008th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00008.pt
[2026-01-23 12:32:55 TP2] Dump 00009th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00009.pt
[2026-01-23 12:32:55 TP3] Dump 00009th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00009.pt
[2026-01-23 12:32:55 TP1] Dump 00009th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00009.pt
[2026-01-23 12:32:55 TP0] Dump 00009th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00009.pt
[2026-01-23 12:32:55 TP2] Dump 00010th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00010.pt
[2026-01-23 12:32:55 TP1] Dump 00010th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00010.pt
[2026-01-23 12:32:55 TP3] Dump 00010th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00010.pt
[2026-01-23 12:32:55 TP0] Dump 00010th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00010.pt
[2026-01-23 12:32:55 TP1] Dump 00011th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00011.pt
[2026-01-23 12:32:55 TP2] Dump 00011th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00011.pt
[2026-01-23 12:32:55 TP3] Dump 00011th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00011.pt
[2026-01-23 12:32:55 TP0] Dump 00011th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00011.pt
[2026-01-23 12:32:55 TP2] Dump 00012th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00012.pt
[2026-01-23 12:32:55 TP1] Dump 00012th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00012.pt
[2026-01-23 12:32:55 TP3] Dump 00012th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00012.pt
[2026-01-23 12:32:55 TP0] Dump 00012th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00012.pt
[2026-01-23 12:32:56 TP2] Dump 00013th pass to /root/sglang/TP2_PP0_Rank2_pid1309270/Pass00013.pt
[2026-01-23 12:32:56 TP1] Dump 00013th pass to /root/sglang/TP1_PP0_Rank1_pid1309269/Pass00013.pt
[2026-01-23 12:32:56 TP3] Dump 00013th pass to /root/sglang/TP3_PP0_Rank3_pid1309271/Pass00013.pt
[2026-01-23 12:32:56 TP0] Dump 00013th pass to /root/sglang/TP0_PP0_Rank0_pid1309268/Pass00013.pt
[2026-01-23 12:32:56] Finish: obj=GenerateReqInput(validation_time=4.4018030166625977e-05, received_time=1769142750.2554507, received_time_perf=17874627.972324956, rid='00793417c23c4b369a80c20ff01f61e9', http_worker_ipc=None, video_data=None, return_logprob=False, logprob_start_len=-1, top_logprobs_num=0, token_ids_logprob=None, return_text_in_logprobs=True, stream=False, log_metrics=True, return_hidden_states=False, modalities=[], session_params=None, lora_id=None, custom_logit_processor=None, bootstrap_host=None, bootstrap_port=None, bootstrap_room=None, bootstrap_pair_key=None, decode_tp_size=None, reasoning=False, data_parallel_rank=None, background=False, conversation_id=None, priority=None, extra_key=None, no_logs=False, custom_labels=None, return_bytes=False, return_entropy=False), out={'meta_info': {'id': '00793417c23c4b369a80c20ff01f61e9', 'finish_reason': {'type': 'length', 'length': 10}, 'prompt_tokens': 8318, 'weight_version': 'default', 'total_retractions': 0, 'queue_time': 0.0021848902106285095, 'prefill_launch_delay': 23.25320667028427, 'prefill_launch_latency': 1.1308907568454742, 'completion_tokens': 10, 'cached_tokens': 0, 'e2e_latency': 25.879807472229004, 'request_received_ts': 1769142750.2554507, 'request_sent_to_scheduler_ts': 1769142750.2834263, 'decode_finished_ts': 1769142776.1352582, 'inference_time': 25.848451234400272, 'response_sent_to_client_ts': 1769142776.139939}}

The log covers 4 TP ranks, and each rank writes 14 .pt files:
- The first 5 passes are prefill: this run used a chunked-prefill-size of 2048, and the request's prompt_tokens is 8318 = 4*2048 + 126.
- The remaining 9 passes are decode: this request set max_tokens to 10, and the first output token is already produced by the final prefill pass.
Later requests keep incrementing the same counter, so the next dump would start at Pass00014.
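The pass count above can be reproduced with a few lines of arithmetic. This is a sketch under simplifying assumptions: a single request, no prefix-cache hits, no speculative decoding, and one dump per forward pass. The function name expected_passes is made up for illustration.

```python
import math


def expected_passes(prompt_tokens, chunked_prefill_size, completion_tokens):
    """Estimate how many forward passes (and therefore dumps) one request needs."""
    # Each prefill chunk is one forward pass.
    prefill = math.ceil(prompt_tokens / chunked_prefill_size)
    # The last prefill pass already yields the first output token.
    decode = max(completion_tokens - 1, 0)
    return prefill, decode, prefill + decode


prefill, decode, total = expected_passes(
    prompt_tokens=8318, chunked_prefill_size=2048, completion_tokens=10
)
print(prefill, decode, total)  # 5 9 14
```

If the number of Pass files on disk differs from this estimate, something else (for example a warmup request or a cached prefix) contributed forward passes, and pass indices between the two runs may no longer line up.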
We can also inspect the output directory:
$ tree ~/sglang_baseline
/root/sglang
├── TP0_PP0_Rank0_pid1309268
│ ├── Pass00000.pt
│ ├── Pass00001.pt
│ ├── Pass00002.pt
│ ├── Pass00003.pt
│ ├── Pass00004.pt
│ ├── Pass00005.pt
│ ├── Pass00006.pt
│ ├── Pass00007.pt
│ ├── Pass00008.pt
│ ├── Pass00009.pt
│ ├── Pass00010.pt
│ ├── Pass00011.pt
│ ├── Pass00012.pt
│ └── Pass00013.pt
├── TP1_PP0_Rank1_pid1309269
│ ├── Pass00000.pt
│ ├── Pass00001.pt
│ ├── Pass00002.pt
│ ├── Pass00003.pt
│ ├── Pass00004.pt
│ ├── Pass00005.pt
│ ├── Pass00006.pt
│ ├── Pass00007.pt
│ ├── Pass00008.pt
│ ├── Pass00009.pt
│ ├── Pass00010.pt
│ ├── Pass00011.pt
│ ├── Pass00012.pt
│ └── Pass00013.pt
├── TP2_PP0_Rank2_pid1309270
│ ├── Pass00000.pt
│ ├── Pass00001.pt
│ ├── Pass00002.pt
│ ├── Pass00003.pt
│ ├── Pass00004.pt
│ ├── Pass00005.pt
│ ├── Pass00006.pt
│ ├── Pass00007.pt
│ ├── Pass00008.pt
│ ├── Pass00009.pt
│ ├── Pass00010.pt
│ ├── Pass00011.pt
│ ├── Pass00012.pt
│ └── Pass00013.pt
└── TP3_PP0_Rank3_pid1309271
  ├── Pass00000.pt
  ├── Pass00001.pt
  ├── Pass00002.pt
  ├── Pass00003.pt
  ├── Pass00004.pt
  ├── Pass00005.pt
  ├── Pass00006.pt
  ├── Pass00007.pt
  ├── Pass00008.pt
  ├── Pass00009.pt
  ├── Pass00010.pt
  ├── Pass00011.pt
  ├── Pass00012.pt
  └── Pass00013.pt
5 directories, 56 files

Step 3: Launch method 2 (the run to compare)
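Launch the server with the second method and send the same request as above, but write the dumps to a different output directory (the comparison command in Step 4 assumes ~/sglang_target).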
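Each rank directory name ends with a process id, so the same rank gets a different directory name in every run. When loading files by hand, the two runs can be paired on the TP/PP/Rank prefix instead. The sketch below assumes the directory layout shown in the tree output above; index_rank_dirs and pair_dumps are hypothetical helpers, and the demo builds a throwaway tree with made-up pids.

```python
import re
import tempfile
from pathlib import Path

RANK_DIR = re.compile(r"^(TP\d+_PP\d+_Rank\d+)_pid\d+$")


def index_rank_dirs(root):
    """Map 'TP0_PP0_Rank0' -> directory, ignoring the pid suffix."""
    ranks = {}
    for child in Path(root).expanduser().iterdir():
        match = RANK_DIR.match(child.name)
        if child.is_dir() and match:
            ranks[match.group(1)] = child
    return ranks


def pair_dumps(baseline_root, target_root):
    """Return (baseline_file, target_file) pairs that exist in both runs."""
    base, tgt = index_rank_dirs(baseline_root), index_rank_dirs(target_root)
    pairs = []
    for rank in sorted(base.keys() & tgt.keys()):
        for base_file in sorted(base[rank].glob("Pass*.pt")):
            tgt_file = tgt[rank] / base_file.name
            if tgt_file.exists():
                pairs.append((base_file, tgt_file))
    return pairs


# Demo on a temporary directory tree with made-up pids.
with tempfile.TemporaryDirectory() as tmp:
    for run, pid in (("baseline", 111), ("target", 222)):
        rank_dir = Path(tmp) / run / f"TP0_PP0_Rank0_pid{pid}"
        rank_dir.mkdir(parents=True)
        (rank_dir / "Pass00000.pt").touch()
    pairs = pair_dumps(Path(tmp) / "baseline", Path(tmp) / "target")
    demo = [(b.parent.name, t.parent.name, b.name) for b, t in pairs]

print(demo)
```

If one output folder holds dumps from several server launches, more than one directory will share a rank prefix and this sketch keeps only one of them, so a fresh folder per run is the simpler choice.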
Step 4: Compare the two sets of dumps
Use the dump_comparator tool:
python3 -m sglang.srt.debug_utils.dump_comparator \
    --baseline-path ~/sglang_baseline \
    --target-path ~/sglang_target \
    --start-id 0 \
    --end-id 3 \
    --diff-threshold 1e-3

What the output contains:
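- Every intermediate tensor of each pass (forward pass) is compared
- Shape, dtype, and summary statistics (mean, std, min, max) are shown
- Relative and absolute differences are shown
- ✅/❌ marks whether a difference exceeds the threshold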
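To make the reported numbers concrete, here is a small pure-Python sketch of the three metrics on flat lists of floats. The absolute differences are unambiguous; the rel_diff formula used here (one minus a cosine-like similarity) is an assumption and may not match dump_comparator's exact definition. The function name diff_metrics is made up for illustration.

```python
def diff_metrics(baseline, target):
    """Compute max/mean absolute difference and an assumed relative difference."""
    if len(baseline) != len(target):
        raise ValueError("tensors must have the same number of elements")
    abs_diffs = [abs(t - b) for b, t in zip(baseline, target)]
    dot = sum(b * t for b, t in zip(baseline, target))
    denom = sum(b * b + t * t for b, t in zip(baseline, target))
    # Assumed definition: 1 - 2*<x,y> / (|x|^2 + |y|^2); 0 means identical.
    rel_diff = 1.0 - 2.0 * dot / denom if denom > 0 else 0.0
    return {
        "rel_diff": rel_diff,
        "max_abs_diff": max(abs_diffs),
        "mean_abs_diff": sum(abs_diffs) / len(abs_diffs),
    }


metrics = diff_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.5])
print(metrics)
```

A single large max_abs_diff with a tiny rel_diff usually points at a few outlier elements, while a large rel_diff means the tensors disagree as a whole.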
Step 5: Interpret the comparison results
dump_comparator prints output similar to:
Check: target=.../Pass00002.pt baseline=.../Pass00002.pt
Raw [shape] torch.Size([1, 3, 4096]) vs torch.Size([1, 3, 4096]) [✅dtype] torch.float32 vs torch.float32
After preprocessor [shape] torch.Size([1, 3, 4096]) vs torch.Size([1, 3, 4096]) [dtype] torch.float32 vs torch.float32
[mean] 0.0123 vs 0.0125 (diff: 0.0002)
[std] 0.4567 vs 0.4571 (diff: 0.0004)
[min] -2.3456 vs -2.3458 (diff: -0.0002)
[max] 3.4567 vs 3.4569 (diff: 0.0002)
✅ rel_diff=0.0001 max_abs_diff=0.0005 mean_abs_diff=0.0002

Key information:
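- Pass number: which forward pass the file belongs to. For a prompt that fits in a single prefill chunk, 0 = prefill, 1 = decode of the 1st token, 2 = decode of the 2nd token, and so on; with chunked prefill each chunk is its own pass, as in the log above.
- Tensor name: a name such as model.layers.5 denotes the output of layer 5.
- Difference metrics: rel_diff (relative difference) and max_abs_diff (maximum absolute difference).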
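When walking through tensor names layer by layer, plain string sorting puts model.layers.10 before model.layers.2. The sketch below extracts the numeric layer index from a key and sorts on it; layer_index and layer_sort_key are hypothetical helpers, and the key names are examples in the style of the dumps discussed here.

```python
import re

LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)(?:\.|$)")


def layer_index(key):
    """Return the integer layer index in a tensor name, or None if there is none."""
    match = LAYER_RE.search(key)
    return int(match.group(1)) if match else None


def layer_sort_key(key):
    """Sort keys by numeric layer index; keys without a layer index come first."""
    idx = layer_index(key)
    return (-1 if idx is None else idx, key)


keys = [
    "model.layers.10.mlp",
    "model.layers.2.mlp",
    "model.embed_tokens",
    "model.layers.2.input_layernorm",
]
ordered = sorted(keys, key=layer_sort_key)
print(ordered)
```

Sorting this way makes it easier to spot the first layer at which the two runs diverge, since earlier layers always print before later ones.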
Step 6: Locate the problematic layer
If a layer in some pass shows a difference, load the tensors manually for a closer look:
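import os
import torch

# Load the same pass from both runs. torch.load does not expand "~",
# and <pid> differs between the baseline run and the target run.
baseline = torch.load(os.path.expanduser("~/sglang_baseline/TP0_PP0_Rank0_pid<pid>/Pass00002.pt"), map_location="cpu")
target = torch.load(os.path.expanduser("~/sglang_target/TP0_PP0_Rank0_pid<pid>/Pass00002.pt"), map_location="cpu")
print(baseline.keys())  # shows the available keys, e.g. dict_keys(['model.embed_tokens', 'model.layers.0.input_layernorm', 'model.layers.0.linear_attn.in_proj_qkvz', ...

# Print the difference for every layer output
for key in sorted(baseline.keys()):
    if "layers" in key:
        # Compare in float32 so low-precision dtypes do not hide small differences
        diff = (target[key].float() - baseline[key].float()).abs()
        print(f"{key}: max_diff={diff.max().item():.6f}, mean_diff={diff.mean().item():.6f}")

Optional optimizations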
1. Save only specific layers
To focus on a subset of layers (for example layers 0 through 10):
python3 -m sglang.launch_server \
    --model-path /path/to/qwen3next \
    --debug-tensor-dump-output-folder ~/sglang \
    --debug-tensor-dump-layers 0 1 2 3 4 5 6 7 8 9 10 \
    --disable-cuda-graph

2. List the saved tensors
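import os
import torch

# torch.load does not expand "~", so expand it explicitly
data = torch.load(os.path.expanduser("~/sglang/TP0_PP0_Rank0_pid<pid>/Pass00002.pt"), map_location="cpu")
print("\n".join(sorted(data.keys())))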




