Whether implemented via prompts or via the API, function calling boils down to token processing.
Prompt-based function calling puts the structured-output requirements into the system prompt, then regex-matches the reply for the function-call format.
API-based function calling is delegated to the inference framework: the request carries a tools field, and the framework tokenizes the tools content and folds it into the prompt, so the input is still just tokens. On the output side, the framework regex-matches the structured function-call format; if it matches, the output is treated as a function call and the response's finish_reason is set to function_call; otherwise it is treated as plain text.
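As an illustration of the prompt-based approach, here is a toy parser. The `<tool_call>` tag convention below is an assumption made for this sketch (real chat templates, e.g. Qwen's, define their own exact format):

```python
import json
import re

# Regex for a function call wrapped in <tool_call>...</tool_call> tags
# (an assumed convention for this sketch, prescribed via the system prompt).
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_reply(text: str) -> dict:
    """Classify a model reply as a function call or plain text."""
    m = TOOL_CALL_RE.search(text)
    if m:
        # Matched: treat as a function call, report finish_reason accordingly.
        return {"finish_reason": "function_call", "call": json.loads(m.group(1))}
    # No match: ordinary text output.
    return {"finish_reason": "stop", "content": text}

reply = '<tool_call>{"name": "get_weather", "arguments": {"city": "Beijing"}}</tool_call>'
print(parse_reply(reply)["finish_reason"])  # function_call
```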
NOTE
The following uses sglang + qwen in non-streaming mode to walk through how sglang parses function-call requests and responses.
First, sglang launches the qwen model with the flag
--tool-ca ...
NOTE
This article references https://oigi8odzc5w.feishu.cn/wiki/LWqEwXNkBibT0ykrbI0cvptBnAf
Example code for calling an LLM with API-based function calling
Code example:
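The code block above was truncated during extraction. As a hedged sketch, this is the shape of the request body such a call sends to an OpenAI-compatible endpoint; the model name and function schema here are illustrative, not the original example's:

```python
import json

# Sketch of a /v1/chat/completions request body carrying a tools field,
# as sent to an inference server such as sglang (values are illustrative).
payload = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Query the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
print(json.dumps(payload, ensure_ascii=False))
```

The server tokenizes the tools content into the prompt as described above; on the way back, a matched structured output arrives as a function call with finish_reason set accordingly.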
Deploying the easyimage image host with Docker
Deploy with docker-compose
version: '3.3'
services:
easyimage:
image: docker.1ms.run/ddsderek/easyimage:latest
container_name: easyimage
ports:
- '61021:80'
environment:
- TZ=Asia/Shanghai
- PUID=1000
- PGID=1000
- DEBUG=false
volumes:
- './data/config:/app/web/config'
- './data/i:/app/web/i'
restart: unless-stopped
Using the easyimage image host from Typora via picgo-core
Configure picgo-core
Install and configure picgo-core:
npm ...
TokenizerManager
Streaming-response architecture explained
Overall flow:
sequenceDiagram
participant User as User
participant API as FastAPI Server
participant TM as TokenizerManager
participant Router as Router process
participant Model as Model RPC
participant Detok as Detokenizer
User->>API: POST /generate (GenerateReqInput)
API->>API: obj.post_init()
API->>TM: generate_request(obj)
TM->>TM: first request creates handle_loop
TM->>TM: tokenizer.encode(text)
TM->>TM: ...
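The request flow in the diagram can be mimicked with a toy asyncio pipeline. Everything here (fake_tokenize, fake_detokenize, the queue wiring) is made up for illustration and is not the SGLang API; it only shows the shape of the tokenize → forward → stream → detokenize loop:

```python
import asyncio

def fake_tokenize(text):
    # Stand-in for tokenizer.encode (hypothetical)
    return [ord(c) for c in text]

def fake_detokenize(ids):
    # Stand-in for the Detokenizer stage (hypothetical)
    return "".join(chr(i) for i in ids)

async def model_worker(inbox, outbox):
    # Stand-in for the Router/Model process: streams tokens back one by one,
    # mimicking incremental decoding.
    req_id, ids = await inbox.get()
    for tid in ids:
        await outbox.put((req_id, tid))
    await outbox.put((req_id, None))  # end-of-stream marker

async def generate(text):
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    worker = asyncio.create_task(model_worker(inbox, outbox))
    await inbox.put(("req-0", fake_tokenize(text)))  # TokenizerManager: encode + forward
    out_ids = []
    while True:
        _, tid = await outbox.get()
        if tid is None:
            break
        out_ids.append(tid)  # consume the token stream
    await worker
    return fake_detokenize(out_ids)

print(asyncio.run(generate("hi")))  # hi
```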
MLP computation flow
First, look at the classic MLP computation in the figure above:
the gate and up projections can be concatenated and computed in one matmul
gate is followed by a SiLU activation; the activated values are multiplied element-wise with the up output, and these two operations are also fused
the element-wise product goes into down_proj, which produces the final output
For the non-MoE MLP, qwen2 and qwen3 both use the same Qwen2MLP-style class
The core computations, MergedColumnParallelLinear and RowParallelLinear, are just torch linear matmuls; under tensor parallelism the matrices are simply partitioned into blocks
class Qwen2MLP(nn.Module):
def __init__(
self,
hidden_size: int,
intermediate_size: int,
hidden_act: str,
quant_config: Optional[Qu ...
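The three computation steps can be written out numerically. A minimal NumPy sketch of the math (shapes are illustrative, not Qwen's real dimensions, and this ignores parallelism and quantization):

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

hidden, inter = 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal(hidden)

# Step 1: gate_proj and up_proj concatenated into one matrix, one matmul
w_gate_up = rng.standard_normal((hidden, 2 * inter))
gate_up = x @ w_gate_up
gate, up = gate_up[:inter], gate_up[inter:]

# Step 2: SiLU on the gate branch, fused element-wise multiply with up
act = silu(gate) * up

# Step 3: down_proj produces the final output
w_down = rng.standard_normal((inter, hidden))
out = act @ w_down
print(out.shape)  # (8,)
```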
SGLang source-code analysis cover
Process structure
With tp=1 there are three processes in total.
Launch command: python3 -m sglang.launch_server --model-path /tmp-data/models/llama-2-7b --port 30000 --mem-fraction-static 0.8 --tp 1
View the process tree: ps -aux --forest
luban 3049112 21.6 0.0 7700748 758940 pts/6 Sl+ 11:48 0:09 | \_ python3 -m sglang.launch_server --model-path /tmp-data/models/llama-2-7b --port 30000 --mem-fraction-static 0.8 --tp 1
luban 3052085 32.5 0.0 56482984 769368 pts/6 Sl+ 11:48 0:06 | \_ python3 -m sglang.launch_server --model-path /tm ...
qwen3-next structure
The model has 48 layers in total, divided into 12 groups (4 layers each).
Within each group, the first 3 layers use Gated DeltaNet linear attention, which markedly improves compute efficiency and reduces memory usage.
The 4th layer is conventional full softmax self-attention, with an extra gate applied at the output.
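The grouping described above can be sketched in a few lines (the layer labels are illustrative, not real config names):

```python
# 48 layers in groups of 4: positions 1-3 are linear attention,
# position 4 is full softmax attention.
layers = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(48)
]
print(layers.count("linear_attention"), layers.count("full_attention"))  # 36 12
```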
Gated DeltaNet analysis
The Gated DeltaNet computation is shown in more detail in this figure:
Computation order:
five linear matmuls
a 1-D convolution
the SiLU activation
the Gated Delta Rule
the Gated Delta Rule output goes through a zero-centered RMSNorm
the RMSNorm output is multiplied element-wise with the gate branch's result
The code below is mainly from the transformers library, with parts removed.
1. Compute the linear matmuls
self.config = config
self.h ...
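As a complement to the truncated excerpt, here is a toy NumPy sketch of a single Gated Delta Rule recurrence step. This shows the update rule in isolation under one common convention (state S maps keys to values); it is not the transformers implementation, and real code batches this over heads and chunks:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One Gated Delta Rule step: S_t = alpha * (I - beta k k^T) S_{t-1} + beta k v^T.

    S is the (d_k, d_v) state, alpha the gated decay, beta the write strength.
    """
    return alpha * (S - beta * np.outer(k, k @ S)) + beta * np.outer(k, v)

d_k, d_v = 4, 3
k = np.zeros(d_k)
k[0] = 1.0                       # unit-norm key
v = np.array([1.0, 2.0, 3.0])
S = gated_delta_step(np.zeros((d_k, d_v)), k, v, alpha=1.0, beta=1.0)
print(k @ S)  # reading back the key recovers v: [1. 2. 3.]
```

With alpha < 1 the state decays (the "gated" part); beta controls how strongly the old value stored under k is replaced by the new one (the "delta rule" part).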