6-AWQ Quantized Model Inference

gogongxt · 2026-01-03 · 2026-01-06

NOTE

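To run inference on a quantized model, an inference framework needs to know a few things:

How does it know the model is quantized?
How does it know which layers are quantized?
How is polymorphism used to plug in the quantized implementation?
How is the quantized operator actually computed?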

Everything below uses llama2-7b-awq as the example; other models and quantization methods are essentially similar.

How does it know the model is quantized?

Looking at config.json, you can see a quantization configuration inside:

LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "float16",
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "use_cache": true,
  "vocab_size": 32000
}

This tells us the model uses AWQ quantization: each weight is quantized from fp16/bf16 (dtype) to int4 (bits), and every 128 (group_size) values share one zero/scale pair.

So an inference framework only needs to look at the quantization_config field to learn the exact quantization scheme:

with _set_default_torch_dtype(torch.float16):
    with torch.device("cuda"):
        hf_quant_config = getattr(
            self.model_config.hf_config, "quantization_config", None
        )
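        # Only AWQ is supported for now; a real framework would branch on
        # quant_method here to support more quantization methods.
        linear_method = None  # stays None when the model is not quantized
        if hf_quant_config is not None:
            quant_config = AWQConfig.from_config(hf_quant_config)
            linear_method = quant_config.get_linear_method()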
        model = model_class(
            config=self.model_config.hf_config, linear_method=linear_method
        )
    model.load_weights(
        self.model_config.path,
        cache_dir=None,
        load_format=self.load_format,
        revision=None,
    )
self.model = model
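
As a minimal sketch of that check (not the nanosglang code), the snippet below parses a config.json string and dispatches on quant_method. QuantInfo and detect_quantization are hypothetical names introduced for illustration, and only awq is handled.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantInfo:
    """Hypothetical container for the fields a loader cares about."""
    method: str
    bits: int
    group_size: int
    zero_point: bool


def detect_quantization(config_text: str) -> Optional[QuantInfo]:
    """Return QuantInfo if the config declares quantization, else None."""
    config = json.loads(config_text)
    quant = config.get("quantization_config")
    if quant is None:
        return None  # plain fp16/bf16 checkpoint
    method = quant.get("quant_method")
    if method != "awq":
        # A real framework would map each method to its own config class.
        raise NotImplementedError(f"unsupported quant_method: {method!r}")
    return QuantInfo(
        method=method,
        bits=quant["bits"],
        group_size=quant["group_size"],
        zero_point=quant.get("zero_point", True),
    )


awq_text = (
    '{"model_type": "llama", "quantization_config": {"bits": 4, '
    '"group_size": 128, "quant_method": "awq", "version": "gemm", '
    '"zero_point": true}}'
)
info = detect_quantization(awq_text)
plain = detect_quantization('{"model_type": "llama"}')
print(info, plain)
```

Returning None for an unquantized checkpoint mirrors the getattr(..., None) check in the loader code above.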

How does it know which layers are quantized?

In general, quantization targets the weight matrices of matrix multiplications (text LLMs are mostly matrix computation anyway).

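Matrix computations tied to feature extraction are usually left unquantized: the embedding layer is not quantized, for example, and multimodal models also skip some of their feature-extraction matmuls.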

Some configs also explicitly list the layers that are not quantized:

For example, here is a fragment of config.json from a qwen3-next fp8-quantized model:

"quantization_config": {
  "activation_scheme": "dynamic",
  "fmt": "e4m3",
  "modules_to_not_convert": [
    "lm_head",
    "model.layers.0.input_layernorm",
    "model.layers.0.linear_attn.A_log",
    "model.layers.0.linear_attn.conv1d",
    "model.layers.0.linear_attn.dt_bias",
    "model.layers.0.linear_attn.in_proj_ba",
    "model.layers.0.linear_attn.norm",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.shared_expert_gate",
    "model.layers.0.post_attention_layernorm",
    "model.layers.1.input_layernorm",
    "model.layers.1.linear_attn.A_log",
    ......
  ],
  "quant_method": "fp8",
  "weight_block_size": [
    128,
    128
  ]
}
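
A framework can consult this list while it builds each layer. The sketch below shows one way to do that. is_layer_skipped is a hypothetical helper, and the matching rule (exact name or parent-module prefix) is an assumption, not the rule of any particular framework. The module name model.layers.0.mlp.gate_proj is also made up, purely to show why a plain string-prefix test would be too loose.

```python
from typing import List


def is_layer_skipped(prefix: str, modules_to_not_convert: List[str]) -> bool:
    """True if the module at `prefix` should stay unquantized.

    Assumption: an entry matches when it equals the module's full name
    or names one of its parent modules.
    """
    for name in modules_to_not_convert:
        if prefix == name or prefix.startswith(name + "."):
            return True
    return False


skip_list = [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.shared_expert_gate",
]

gate_skipped = is_layer_skipped("model.layers.0.mlp.gate", skip_list)
# Hypothetical sibling whose name merely starts with the same characters.
gate_proj_skipped = is_layer_skipped("model.layers.0.mlp.gate_proj", skip_list)
print(gate_skipped, gate_proj_skipped)
```

A layer that is skipped would then receive the unquantized linear method instead of the quantized one.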

How is polymorphism used to plug in the quantized implementation?

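If you look at the files under models/**, such as llama.py and qwen3.py, and check the operators inside, for example mlp and o_proj, you will find that they all take a linear_method.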

This linear_method is the key. It is passed in when the model is created and then handed down layer by layer to concrete instances such as o_proj.

Looking back at the Python code above, you can see the linear_method instance being passed to model_class when the model instance is created.

In effect, linear_method is a polymorphic operator whose implementation differs per quantization method:

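For plain fp16/bf16, where weights and activations share a dtype, the implementation usually just calls torch.nn.functional.linear (F.linear).
For AWQ, sglang and vllm implement their own kernels that fuse (int4 weight + zero + scale) * fp16/bf16 activation into one operator (otherwise the weights would first have to be dequantized to fp16/bf16 before calling the regular F.linear, which is slower).
For models like deepseek, where weights are fp8 and each matmul produces fp16 activations, the deepgemm library does the fp8 matmul efficiently: the fp16 activations are quantized to fp8 and multiplied with the fp8 weights to produce fp16 activations, and the next matmul quantizes to fp8 again and repeats the process.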

How is the quantized operator actually computed?

Weight loading

Taking the AWQ-quantized o_proj in nanosglang as the example, this splits into weight loading and the actual matrix computation.

First, compare the weight shapes of the same llama2-7b model, unquantized versus AWQ:

# Unquantized
| model.layers.0.self_attn.o_proj.weight          | torch.Size([4096, 4096])  | torch.float16 | 32.00 MB  | model-00001-of-00002.safetensors |

# After AWQ quantization
| model.layers.0.self_attn.o_proj.qweight         | torch.Size([4096, 512])   | torch.int32   | 8.00 MB   | model.safetensors |
| model.layers.0.self_attn.o_proj.qzeros          | torch.Size([32, 512])     | torch.int32   | 0.06 MB   | model.safetensors |
| model.layers.0.self_attn.o_proj.scales          | torch.Size([32, 4096])    | torch.float16 | 0.25 MB   | model.safetensors |

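From this we can see a few things:

The single weight matrix, originally named weight, becomes three tensors (note the q prefix on qweight).
The second dimension of qweight shrinks from 4096 to 512 int32 values because each value is quantized from fp16 to int4 and eight int4 values are packed into one int32, so 4096/8=512.
The 32 in qzeros and scales comes from every 128 values sharing one zero and scale, so 4096/128=32.
qzeros is also int4 with eight values packed per int32, while scales is fp16 with one element per value.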
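
The shape arithmetic above can be checked with a few lines of Python. awq_shapes and size_mb are hypothetical helpers written for this post, and the layout ([in, out // 8] for qweight and so on) is the one listed for o_proj above.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int]


def awq_shapes(in_features: int, out_features: int,
               bits: int = 4, group_size: int = 128) -> Dict[str, Shape]:
    """Shapes of the three AWQ tensors for one linear layer."""
    pack_factor = 32 // bits              # int4 values per int32 word
    groups = in_features // group_size    # one zero/scale per group
    return {
        "qweight": (in_features, out_features // pack_factor),
        "qzeros": (groups, out_features // pack_factor),
        "scales": (groups, out_features),
    }


def size_mb(shape: Shape, bytes_per_element: int) -> float:
    """Tensor size in MiB."""
    return shape[0] * shape[1] * bytes_per_element / 2**20


shapes = awq_shapes(4096, 4096)
sizes = {
    "qweight": size_mb(shapes["qweight"], 4),  # int32
    "qzeros": size_mb(shapes["qzeros"], 4),    # int32
    "scales": size_mb(shapes["scales"], 2),    # float16
}
fp16_mb = size_mb((4096, 4096), 2)
awq_mb = sum(sizes.values())
print(shapes, sizes, fp16_mb, awq_mb)
```

For this layer the three AWQ tensors add up to a little over a quarter of the fp16 weight; the overhead beyond the ideal 4x comes from the per-group zeros and scales.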

Weight loading flow

Model startup
  │
  ├─> ModelRunner.__init__()
  │    └─> load_model()
  │         ├─> Detect quantization_config → AWQConfig
  │         └─> Create AWQLinearMethod
  │              └─> Passed to every layer
  │
  ├─> LlamaForCausalLM.__init__(linear_method=AWQLinearMethod)
  │    └─> LlamaModel.__init__(linear_method=AWQLinearMethod)
  │         └─> LlamaDecoderLayer.__init__(linear_method=AWQLinearMethod)
  │              └─> LlamaAttention.__init__(linear_method=AWQLinearMethod)
  │                   └─> RowParallelLinear(o_proj, linear_method=AWQLinearMethod)
  │                        ├─> self.linear_method.create_weights()
  │                        │    └─> Returns {qweight, qzeros, scales}  # quantized weights
  │                        └─> Register parameters, set attributes
  │
  ├─> model.load_weights()
  │    └─> Iterate over the safetensors files
  │         └─> Call param.weight_loader() for each weight
  │              └─> Shard the weight according to param.attributes

Weight initialization code:

from typing import Dict

import torch
from torch.nn import Parameter

# LinearMethodBase and set_weight_attrs come from the project itself.


class AWQLinearMethod(LinearMethodBase):
    def create_weights(
        self, input_size: int, output_size: int, params_dtype: torch.dtype
    ) -> Dict[str, torch.Tensor]:
        pack_factor = 32 // self.weight_bits  # 4-bit → 8 values per int32

        # 1. Create qweight: [input_size, output_size // pack_factor]
        #    Shape: [2048, 512] (tp=2, 7B model)
        qweight = Parameter(
            torch.empty(input_size, output_size // pack_factor,
                       device="cuda", dtype=torch.int32),
            requires_grad=False,
        )
        # Key step: set the attributes that weight loading relies on later
        set_weight_attrs(qweight, {
            "input_dim": 0,
            "output_dim": 1,
            "packed_dim": 1,             # dimension 1 is the packed one
            "pack_factor": pack_factor,  # eight 4-bit values per int32
        })

        # 2. Create qzeros: [input_size // group_size, output_size // pack_factor]
        #    Shape: [16, 512] (group_size=128, 2048/128=16)
        qzeros = Parameter(
            torch.empty(input_size // self.quant_config.group_size,
                       output_size // pack_factor,
                       device="cuda", dtype=torch.int32),
            requires_grad=False,
        )
        set_weight_attrs(qzeros, {
            "input_dim": 0,
            "output_dim": 1,
            "packed_dim": 1,
            "pack_factor": pack_factor,
        })

        # 3. Create scales: [input_size // group_size, output_size]
        #    Shape: [16, 4096]
        scales = Parameter(
            torch.empty(input_size // self.quant_config.group_size,
                       output_size,
                       device="cuda", dtype=params_dtype),  # float16
            requires_grad=False,
        )
        set_weight_attrs(scales, {
            "input_dim": 0,
            "output_dim": 1,
        })

        return {"qweight": qweight, "qzeros": qzeros, "scales": scales}

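The values are then assigned in load_weights. Weight assignment needs no extra processing here, so the default copy is enough and the packed values stay stored as int32.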
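
To make the role of the attributes concrete, here is a rough sketch of how a loader could use input_dim to pick the slice that belongs to one tensor-parallel rank of a row-parallel layer such as o_proj. row_parallel_shard is a hypothetical helper, not the nanosglang weight_loader. It only computes shapes and row ranges, so it runs without torch.

```python
from typing import Dict, Tuple


def row_parallel_shard(
    full_shape: Tuple[int, int], attrs: Dict[str, int], tp_rank: int, tp_size: int
) -> Tuple[Tuple[int, ...], range]:
    """Return (shard_shape, kept_indices) along input_dim for one rank."""
    dim = attrs["input_dim"]
    shard_size = full_shape[dim] // tp_size
    start = tp_rank * shard_size
    shard_shape = list(full_shape)
    shard_shape[dim] = shard_size
    return tuple(shard_shape), range(start, start + shard_size)


packed = {"input_dim": 0, "output_dim": 1, "packed_dim": 1, "pack_factor": 8}
unpacked = {"input_dim": 0, "output_dim": 1}

# Full checkpoint shapes of o_proj, split across 2 ranks; we look at rank 1.
qweight_shape, qweight_rows = row_parallel_shard((4096, 512), packed, 1, 2)
qzeros_shape, qzeros_rows = row_parallel_shard((32, 512), packed, 1, 2)
scales_shape, scales_rows = row_parallel_shard((32, 4096), unpacked, 1, 2)
print(qweight_shape, qzeros_shape, scales_shape)
```

Because o_proj is split along dimension 0 and the packed dimension is 1, pack_factor never enters the calculation here. A layer that is split along the packed dimension would instead have to divide its shard size and offset by pack_factor, which is presumably why the attribute is recorded at all.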

Comparison summary:

Property	Unquantized model	AWQ-quantized model
Number of weights	1 (weight)	3 (qweight, qzeros, scales)
Weight dtype	float16	int32 + int32 + float16
Attributes	input_dim, output_dim	+ packed_dim, pack_factor

Polymorphic matmul operator call

Again using the forward call of o_proj as the example:

class RowParallelLinear(torch.nn.Module):
    def forward(self, input_):
        # input_.shape: [batch_size, seq_len, hidden_size]
        #           or: [total_tokens, hidden_size]

        # 1. Split the input if it is not already partitioned
        if self.input_is_parallel:
            input_parallel = input_
        else:
            tp_rank = get_tensor_model_parallel_rank()
            splitted_input = split_tensor_along_last_dim(
                input_, num_partitions=self.tp_size
            )
            input_parallel = splitted_input[tp_rank]

        # 2. Key step: call linear_method.apply_weights()
        #    For AWQ: self.linear_method = AWQLinearMethod
        #    For unquantized models: self.linear_method = UnquantizedLinearMethod
        output_parallel = self.linear_method.apply_weights(
            self.linear_weights,  # {qweight, qzeros, scales}
            input_parallel,
            bias=None
        )

        # 3. All-reduce (tensor-parallel case)
        if self.reduce_results and self.tp_size > 1:
            output_ = tensor_model_parallel_all_reduce(output_parallel)
        else:
            output_ = output_parallel

        # Bias handling is omitted in this excerpt, so no separate bias is returned
        output_bias = None
        return output_, output_bias
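
Why step 3 needs an all-reduce can be shown with a tiny single-process simulation. Each rank holds a slice of the input's last dimension and the matching rows of the weight, so it can only produce a partial sum. This is a sketch in pure Python lists, with row_parallel_forward as a hypothetical name. The weight uses the [K, N] layout of qweight rather than the [N, K] layout of F.linear.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain [M, K] x [K, N] matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_parallel_forward(x: Matrix, w: Matrix, tp_size: int) -> Matrix:
    """x: [M, K], w: [K, N]. Simulates tp_size ranks in one process."""
    shard = len(w) // tp_size
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        x_part = [row[lo:hi] for row in x]  # split the input along its last dim
        w_part = w[lo:hi]                   # this rank's rows of the weight
        partials.append(matmul(x_part, w_part))
    # "All-reduce": element-wise sum of the partial outputs of all ranks
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1.0, 2.0, 3.0, 4.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
full = matmul(x, w)
sharded = row_parallel_forward(x, w, tp_size=2)
print(full, sharded)
```

Each rank's partial result has the full output shape [M, N]; only the sum over ranks equals the unsharded product.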

1. AWQLinearMethod.apply_weights:

class AWQLinearMethod(LinearMethodBase):
    def apply_weights(
        self,
        weights: Dict[str, torch.Tensor],  # {qweight, qzeros, scales}
        x: torch.Tensor,                   # [M, K]
        bias: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        qweight = weights["qweight"]    # [K, N//8]
        qzeros = weights["qzeros"]      # [K//G, N//8]
        scales = weights["scales"]      # [K//G, N]
        pack_factor = self.quant_config.pack_factor  # 8

        # Compute the output shape
        out_shape = x.shape[:-1] + (qweight.shape[-1] * pack_factor,)  # [M, N]
        reshaped_x = x.reshape(-1, x.shape[-1])  # [M, K]

        # Key step: call the Triton kernel for the quantized GEMM
        # Internally: unpack 4-bit → apply zero/scale → matrix multiply
        out = awq_gemm_triton(
            reshaped_x,  # [M, K]
            qweight,     # [K, N//8]
            scales,      # [K//G, N]
            qzeros,      # [K//G, N//8]
            pack_factor  # 8
        )

        if bias is not None:
            out = out + bias

        return out.reshape(out_shape)  # [M, N]
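
The three internal steps (unpack 4-bit, apply zero/scale, multiply) can be written out as a slow reference in pure Python. This is only a sketch of the math, not the Triton kernel. pack_row, unpack_row and awq_gemm_reference are hypothetical names, and the nibble order (lowest four bits first) is an assumption made for readability; as far as I know, real AWQ checkpoints store the nibbles in an interleaved order, which this sketch ignores.

```python
from typing import List

PACK_FACTOR = 8  # eight 4-bit values per 32-bit word


def pack_row(values: List[int]) -> List[int]:
    """Pack 4-bit ints (0..15) into words, lowest nibble first (assumed order)."""
    words = []
    for i in range(0, len(values), PACK_FACTOR):
        word = 0
        for j, v in enumerate(values[i:i + PACK_FACTOR]):
            word |= (v & 0xF) << (4 * j)
        words.append(word)  # kept as an unsigned Python int
    return words


def unpack_row(words: List[int]) -> List[int]:
    """Inverse of pack_row."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(PACK_FACTOR)]


def awq_gemm_reference(x, qweight, scales, qzeros, group_size):
    """x: [M, K]; qweight: [K, N//8]; scales: [K//G, N]; qzeros: [K//G, N//8]."""
    n_dim = len(scales[0])
    zeros = [unpack_row(row) for row in qzeros]  # [K//G, N]
    out = [[0.0] * n_dim for _ in x]
    for k, packed in enumerate(qweight):
        g = k // group_size                      # group index for this row
        q = unpack_row(packed)                   # N ints in 0..15
        for n in range(n_dim):
            w = (q[n] - zeros[g][n]) * scales[g][n]  # dequantize one weight
            for m, row in enumerate(x):
                out[m][n] += row[k] * w
    return out


K, N, G = 4, 8, 2
q_values = [[(k + n) % 16 for n in range(N)] for k in range(K)]
qweight = [pack_row(row) for row in q_values]     # [K, N//8]
qzeros = [pack_row([8] * N), pack_row([0] * N)]   # [K//G, N//8]
scales = [[0.5] * N, [1.0] * N]                   # [K//G, N]
x = [[1.0, 2.0, 3.0, 4.0]]
out = awq_gemm_reference(x, qweight, scales, qzeros, group_size=G)
print(out)
```

A fused kernel performs the same dequantization per tile in registers, so the fp16 weight matrix is never materialized in memory.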

2. apply_weights for the unquantized model:

class UnquantizedLinearMethod(LinearMethodBase):
    def apply_weights(
        self,
        weights: Dict[str, torch.Tensor],  # {weight}
        x: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        weight = weights["weight"]  # [output_size, input_size]

        # Use PyTorch's F.linear directly
        if self.separate_bias_add:
            if bias is not None:  # truth-testing a multi-element tensor raises
                return F.linear(x, weight) + bias
            return F.linear(x, weight)
        return F.linear(x, weight, bias)

Comparison summary:

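Property	Unquantized model	AWQ-quantized model
Compute function	F.linear()	awq_gemm_triton()
Dequantization	Not needed	On the fly inside the Triton kernel
Memory access	Reads fp16 weights directly	Unpacks 4-bit values + looks up scales/zeros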

Design pattern summary

Strategy Pattern

# Abstract interface
class LinearMethodBase:
    def create_weights(...) -> Dict[str, Tensor]: ...
    def apply_weights(...) -> Tensor: ...

# Concrete strategies
class UnquantizedLinearMethod(LinearMethodBase): ...
class AWQLinearMethod(LinearMethodBase): ...

# Client
class RowParallelLinear:
    def __init__(self, ..., linear_method: LinearMethodBase):
        self.linear_method = linear_method  # the strategy is chosen at runtime

Polymorphism

# Same calling code, different implementations
output = self.linear_method.apply_weights(weights, x, bias)

# Unquantized: F.linear(x, weight)
# AWQ: awq_gemm_triton(x, qweight, scales, qzeros, pack_factor)
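
For a version of the pattern that actually runs, here is a toy sketch in pure Python. The class names LinearMethodBase and UnquantizedLinearMethod follow the outline above, but the bodies are simplified stand-ins. ToyScaledLinearMethod and Linear are hypothetical, with integer weights and one scale per output row playing the role of a quantized method.

```python
from typing import Dict, List, Optional

Vector = List[float]


class LinearMethodBase:
    def create_weights(self, input_size: int, output_size: int) -> Dict[str, list]:
        raise NotImplementedError

    def apply_weights(self, weights: Dict[str, list], x: Vector) -> Vector:
        raise NotImplementedError


class UnquantizedLinearMethod(LinearMethodBase):
    def create_weights(self, input_size, output_size):
        return {"weight": [[0.0] * input_size for _ in range(output_size)]}

    def apply_weights(self, weights, x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights["weight"]]


class ToyScaledLinearMethod(LinearMethodBase):
    """Hypothetical quantized method: integer weights, one scale per output row."""

    def create_weights(self, input_size, output_size):
        return {
            "qweight": [[0] * input_size for _ in range(output_size)],
            "scales": [1.0] * output_size,
        }

    def apply_weights(self, weights, x):
        return [
            s * sum(q * v for q, v in zip(row, x))
            for row, s in zip(weights["qweight"], weights["scales"])
        ]


class Linear:
    """The layer never branches on the quantization type."""

    def __init__(self, input_size: int, output_size: int,
                 linear_method: Optional[LinearMethodBase] = None):
        self.linear_method = linear_method or UnquantizedLinearMethod()
        self.linear_weights = self.linear_method.create_weights(input_size, output_size)

    def forward(self, x: Vector) -> Vector:
        return self.linear_method.apply_weights(self.linear_weights, x)


dense = Linear(2, 2)
dense.linear_weights["weight"] = [[0.5, 1.0], [1.5, 2.0]]
quant = Linear(2, 2, linear_method=ToyScaledLinearMethod())
quant.linear_weights["qweight"] = [[1, 2], [3, 4]]
quant.linear_weights["scales"] = [0.5, 0.5]

x = [2.0, 4.0]
dense_out = dense.forward(x)
quant_out = quant.forward(x)
print(dense_out, quant_out)
```

Adding another quantization method then means adding one more LinearMethodBase subclass, while the layer code stays unchanged.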