SGLang Code Change Summary (UTC+8 2026-04-04)

This document summarizes all commits to the main branch of the SGLang project on April 4, 2026 (00:00 to 24:00, UTC+8), 38 commits in total.

Overview

Category	Commits	Key changes
New models / model enhancements	4	LFM2-VL vision-language model, Reasoning Tokens Usage, Score API, GLM-4.7 loading format
Performance / kernels	8	LoRA CUDA Graph, FA4 speculative decoding, VLM chunk-aware ViT, NVFP4 CUTLASS default, DSV3 router GEMM benchmark, norm fusion, flashinfer 0.6.7.post2, sglang-kernel 0.4.1
Bug fixes	6	killall_sglang, flaky spec decoding test, Mistral embedding regression, XGrammarGrammarBackend reset, DP attention IPv6, step3.5-flash crash
New server_args parameters	1	`--stream-response-default-include-usage`
New environment variables	0	None
Diffusion	6	LTX-2 two-stage pipeline, Ring Attention validation, NVFP4 shape benchmarks, gated repo fix, CI improvements, z-image norm fusion
CI / workflow / tests	6	auto benchmark tool, diffusion preset alignment, rerun-test CPU stage, PD fixture extraction, Python 3.11 lint, MoE UT fix
Reverts / cleanup	3	Revert JIT activation, Revert NVFP4 Marlin fallback, Revert TRTLLM skip
Other	5	RL mxfp8 DeepSeek V3, DP profile hook, pause_generation fix, FA3/FA4 lazy import, HiSparse argument check

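1. New Models and Model Enhancements

1.1 LFM2-VL Vision-Language Model

Adds support for the Liquid Foundation Model 2 Vision-Language model, including a new lfm2_vl.py model file, the siglip2.py vision encoder, configuration files, and a multimodal processor, +1149 lines in total.

Commit Message	Summary	PR link
`model: support LFM2-VL (Liquid Foundation Model 2 Vision-Language) (#21230)`	Adds the LFM2-VL vision-language model, including the model file, the siglip2 encoder, configuration, and a multimodal processor	PR #21230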

1.2 Reasoning Tokens Usage

Adds reasoning-token usage accounting: responses from the OpenAI-compatible API now report the number of reasoning tokens.

Commit Message	Summary	PR link
`[Feature] Add Reasoning Tokens Usage (#15562)`	Adds reasoning-token usage statistics to the streaming/chat/completion responses of the OpenAI API	PR #15562
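
As a rough illustration of how a client might read the new counter, the sketch below parses a hand-written response payload. The field path `usage.completion_tokens_details.reasoning_tokens` follows the OpenAI chat completions schema and is an assumption here: this summary does not state which field name PR #15562 uses. The helper name `reasoning_tokens` and the payload values are made up for the example.

```python
from typing import Any


def reasoning_tokens(response: dict[str, Any]) -> int:
    """Return the reasoning-token count from a chat completion response.

    Assumes the OpenAI-style path
    usage.completion_tokens_details.reasoning_tokens (an assumption, not
    confirmed by the PR); returns 0 when the server does not report it.
    """
    usage = response.get("usage") or {}
    details = usage.get("completion_tokens_details") or {}
    return int(details.get("reasoning_tokens") or 0)


# Hand-written example payload; the numbers are illustrative, not measured.
sample = {
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 40,
        "total_tokens": 52,
        "completion_tokens_details": {"reasoning_tokens": 25},
    }
}

print(reasoning_tokens(sample))           # 25
print(reasoning_tokens({"usage": None}))  # 0
```

Falling back to 0 keeps the helper usable against servers or models that do not emit reasoning tokens at all.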

1.3 Score API

Implements EngineScoreMixin for scoring and refactors the scoring logic out of TokenizerManager.

Commit Message	Summary	PR link
`[Score API] Implement EngineScoreMixin for scoring functionality and refactor TokenizerManager (#21342)`	Implements EngineScoreMixin for scoring and refactors the scoring logic in TokenizerManager	PR #21342

1.4 GLM-4.7 Loading Format

Commit Message	Summary	PR link
`GLM-4.7 and GLM-4.7-Flash Loading and import format (#21851)`	Standardizes model loading and import formatting for GLM-4.7 and GLM-4.7-Flash	PR #21851

2. Performance Optimization and Kernels

2.1 LoRA CUDA Graph Support

Adds CUDA Graph support for LoRA, covering the MoE LoRA runner, Triton kernels, the memory pool, and other components, with the aim of improving LoRA inference throughput.

Commit Message	Summary	PR link
`[5/n] Lora support cuda graph (#21647)`	Adds CUDA Graph support for LoRA, covering the MoE LoRA runner and kernels at multiple levels, to improve inference performance	PR #21647

2.2 FA4 Speculative Decoding

Commit Message	Summary	PR link
`[Speculative Decoding] Add FA4-based Spec Support (#21080)`	Adds speculative decoding support based on FlashAttention 4	PR #21080

2.3 VLM Chunk-aware ViT Encoding

Commit Message	Summary	PR link
`[VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer (#22038)`	Implements chunk-aware ViT encoding with a per-image cache and lazy device transfer, reducing peak memory	PR #22038
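
To make the idea of a per-image cache combined with chunked encoding concrete, here is a minimal sketch. It is not the code from PR #22038: the class `PerImageCache`, the `encode_chunk` callback, and the toy `fake_encoder` are hypothetical stand-ins, and the lazy device transfer part of the change is left out. The point is only that already-seen images skip the encoder, and that misses are encoded in bounded-size chunks so that peak memory depends on the chunk size rather than on the request size.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Embedding = List[float]


class PerImageCache:
    """Caches encoder outputs per image, keyed by a hash of the image bytes."""

    def __init__(
        self,
        encode_chunk: Callable[[List[bytes]], List[Embedding]],
        chunk_size: int,
    ) -> None:
        self.encode_chunk = encode_chunk  # hypothetical stand-in for a ViT forward pass
        self.chunk_size = chunk_size
        self.cache: Dict[str, Embedding] = {}
        self.encoder_calls = 0

    @staticmethod
    def _key(image: bytes) -> str:
        return hashlib.sha256(image).hexdigest()

    def encode(self, images: Sequence[bytes]) -> List[Embedding]:
        # Collect unique images that are not cached yet, preserving order.
        missing: Dict[str, bytes] = {}
        for img in images:
            key = self._key(img)
            if key not in self.cache and key not in missing:
                missing[key] = img
        keys = list(missing)
        # Encode the misses in fixed-size chunks to bound peak memory.
        for start in range(0, len(keys), self.chunk_size):
            chunk_keys = keys[start:start + self.chunk_size]
            outputs = self.encode_chunk([missing[k] for k in chunk_keys])
            self.encoder_calls += 1
            for key, out in zip(chunk_keys, outputs):
                self.cache[key] = out
        return [self.cache[self._key(img)] for img in images]


def fake_encoder(batch: List[bytes]) -> List[Embedding]:
    """Toy encoder: one number per image, so results are easy to check."""
    return [[float(len(img))] for img in batch]


cache = PerImageCache(fake_encoder, chunk_size=2)
first = cache.encode([b"aaa", b"bb", b"c"])  # 3 misses, encoded as 2 chunks
second = cache.encode([b"bb", b"dddd"])      # 1 hit, 1 miss, 1 more chunk
print(first, second, cache.encoder_calls)
```

In this toy run the second call reuses the cached result for `b"bb"` and only sends `b"dddd"` to the encoder.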

2.4 NVFP4 Defaults to CUTLASS

Commit Message	Summary	PR link
`[diffusion] Default NVFP4 to CUTLASS and add all-model shape benchmarks (#22091)`	Switches the default NVFP4 backend to CUTLASS and adds shape benchmarks for all models	PR #22091

2.5 DSV3 Router GEMM Benchmark

Commit Message	Summary	PR link
`Add dsv3 router gemm benchmark on blackwell (#17707)`	Adds a DeepSeek V3 router GEMM performance benchmark on Blackwell GPUs	PR #17707

2.6 Diffusion Norm Fusion

Commit Message	Summary	PR link
`[diffusion] improve: norm fusion for z-image (#18762)`	Implements a norm fusion optimization for z-image	PR #18762

2.7 Dependency Version Bumps

Commit Message	Summary	PR link
`chore: bump flashinfer version to 0.6.7.post2 (#22097)`	Bumps flashinfer to 0.6.7.post2	PR #22097
`chore: bump sglang-kernel version to 0.4.1 (#22009)`	Bumps sglang-kernel to 0.4.1	PR #22009

3. Bug Fixes

Commit Message	Summary	PR link
`Fix killall_sglang missing the main sglang serve process (#22103)`	Fixes killall_sglang missing the main sglang serve process	PR #22103
`Relax spec decoding accuracy threshold to fix flaky test (#22100)`	Relaxes the speculative decoding accuracy threshold to fix a flaky test	PR #22100
`fix: mistral embedding regression fix (#21913)`	Fixes a Mistral embedding regression	PR #21913
`[Fix] XGrammarGrammarBackend reset to clear inherited cache (#22054)`	Fixes XGrammarGrammarBackend reset so that it clears the inherited cache	PR #22054
`Fix DP attention worker port binding for IPv6 support (#21917)`	Fixes DP attention worker port binding for IPv6	PR #21917
`Tiny fix step3.5-flash launch crash (#22076)`	Fixes a launch crash of the step3.5-flash model	PR #22076
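
For readers unfamiliar with why IPv6 tends to break port-binding code, the sketch below shows one common pitfall: an IPv6 literal must be wrapped in brackets before a port is appended, otherwise the colons are ambiguous. This is a generic illustration, not the change made in PR #21917; the helper name `format_tcp_endpoint` and the `tcp://` scheme are assumptions chosen for the example.

```python
import ipaddress


def format_tcp_endpoint(host: str, port: int) -> str:
    """Build a tcp:// endpoint string, bracketing IPv6 literals.

    Hypothetical helper for illustration only.
    """
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_v6 = False  # not an IP literal, e.g. a hostname
    return f"tcp://[{host}]:{port}" if is_v6 else f"tcp://{host}:{port}"


print(format_tcp_endpoint("127.0.0.1", 30000))  # tcp://127.0.0.1:30000
print(format_tcp_endpoint("::1", 30000))        # tcp://[::1]:30000
print(format_tcp_endpoint("localhost", 30000))  # tcp://localhost:30000
```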

4. New Parameters in server_args.py

One new command-line parameter was added in this time window:

Commit Message	Parameter	Description	PR link
`Add --stream-response-default-include-usage server flag (#16711)`	`--stream-response-default-include-usage` (bool, default False)	Includes usage information in every streaming response even when stream_options is not specified	PR #16711
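
The interaction between the new server-wide default and the per-request `stream_options` can be sketched as a small decision function. This is an assumption-laden model rather than SGLang's implementation: the function name `should_include_usage` is hypothetical, and the precedence rule (an explicit `include_usage` in the request overrides the server default) is assumed here, since the summary only says that usage is included even when `stream_options` is not specified.

```python
from typing import Optional


def should_include_usage(stream_options: Optional[dict], server_default: bool) -> bool:
    """Decide whether a streaming response carries usage information.

    Assumed precedence: an explicit request-level
    stream_options["include_usage"] wins; otherwise the server-wide default
    (the value of --stream-response-default-include-usage) applies.
    """
    if stream_options is not None and "include_usage" in stream_options:
        return bool(stream_options["include_usage"])
    return server_default


# Flag enabled on the server, request says nothing: usage is included.
print(should_include_usage(None, server_default=True))   # True
# Flag left at its default (False), request says nothing: no usage.
print(should_include_usage(None, server_default=False))  # False
# The request opts in explicitly, regardless of the server default.
print(should_include_usage({"include_usage": True}, server_default=False))  # True
```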

server_args.py also received the following non-parameter changes:

Commit Message	Change	PR link
`[HiSparse]: Optimize server args checking-HiSparse is temporarily only available for DSA models. (#22065)`	Adds validation for `enable_hisparse`, restricting it to DSA models (DeepSeek V3.2, GLM-5)	PR #22065
`[Bugfix] Temporarily skip TRTLLM attention on (G)B300 (SM103) to avoid high-concurrency hang (#21906)` → subsequently reverted	Merged, then reverted: temporarily skips TRTLLM attention on (G)B300 to avoid a high-concurrency hang	PR #21906 / PR #22098

5. New Environment Variables

No new environment variables were added in this time window.

6. Diffusion

Commit Message	Summary	PR link
`[diffusion] model: support two stage pipeline of LTX-2 (#20707)`	Supports two-stage pipeline inference for LTX-2	PR #20707
`[diffusion] fix: validate attention backend for Ring Attention in USPAttention (#21828)`	Validates the attention backend configured for Ring Attention in USPAttention	PR #21828
`[Diffusion] Fix weight scale swizzle and add large-M kernel config for FLUX.2-dev-NVFP4 (#22064)`	Fixes the weight scale swizzle for FLUX.2-dev-NVFP4 and adds a large-M kernel config	PR #22064
`[diffusion] fix: fix gated repo failing the generate cmd (#22040)`	Fixes the generate command failing on gated repos	PR #22040
`[diffusion] CI: improve diffusion comparison benchmark setting for realistic perf and auto-discover ut (#22086)`	Improves the diffusion comparison benchmark settings for more realistic performance numbers and adds automatic UT discovery	PR #22086
`Align diffusion nightly presets and broaden skill discovery (#22099)`	Aligns the diffusion nightly presets and broadens skill discovery	PR #22099

7. CI / Workflow / Tests

Commit Message	Summary	PR link
`[Benchmark] Add auto benchmark tool with YAML-driven server flag search and canonical dataset format (#21736)`	Adds an auto benchmark tool with YAML-driven server flag search and a canonical dataset format	PR #21736
`[CI] Support CPU stage and auto-batch same-stage files in /rerun-test (#22081)`	Adds CPU stage support and automatic batching of same-stage test files to /rerun-test	PR #22081
`[Test] Extract common PD server setup into base fixture (#22080)`	Extracts the common PD server setup into a base fixture	PR #22080
`Fix Python 3.11 f-string lint error in deepgemm Blackwell benchmark (#22108)`	Fixes a Python 3.11 f-string lint error in the deepgemm Blackwell benchmark	PR #22108
`fix ut test_moe (#21735)`	Fixes the MoE unit test	PR #21735

8. Other Changes

Commit Message	Summary	PR link
`[RL] Support mxfp8 DeepSeek V3 (#21280)`	Supports mxfp8-quantized DeepSeek V3 in RL scenarios	PR #21280
`dp: add profile req hook (#22083)`	Adds a DP profile request hook	PR #22083
`fix: pause_generation should not populate running_batch on prefill nodes (#20273)`	Fixes pause_generation so that it no longer populates running_batch on prefill nodes	PR #20273
`[Kernel] Make FA3/FA4 imports lazy in FlashAttentionBackend (#22028)`	Makes the FA3/FA4 imports in FlashAttentionBackend lazy, reducing startup time	PR #22028
`Revert "[Feature] JIT activation and update skills (by codex)" (#22078)`	Reverts the previous day's JIT activation feature (compatibility issues)	PR #22078
`Revert "[Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+)…" (#22047)`	Reverts the NVFP4 Marlin fallback feature	PR #22047
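
The lazy-import change (#22028) relies on a general Python technique: defer a costly import until the module is first used, so that processes which never touch it do not pay for it at startup. The sketch below shows that technique in isolation. The `LazyModule` wrapper is hypothetical and is not how FlashAttentionBackend is written; the standard-library module `fractions` stands in for a heavy kernel package.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


class LazyModule:
    """Defers importing a module until one of its attributes is first used."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    @property
    def loaded(self) -> bool:
        return self._module is not None

    def _load(self) -> ModuleType:
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return self._module

    def __getattr__(self, attr: str) -> Any:
        # Only called when normal lookup fails, i.e. for names that live on
        # the wrapped module rather than on this wrapper.
        return getattr(self._load(), attr)


heavy = LazyModule("fractions")  # stand-in for a heavy kernel package
print(heavy.loaded)              # False: nothing imported yet
half = heavy.Fraction(1, 2)      # first attribute access triggers the import
print(half, heavy.loaded)        # 1/2 True
```

An alternative with the same effect is to move the `import` statement into the function that needs it; the wrapper form is shown only because it keeps call sites unchanged.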

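Key Takeaways

New models

LFM2-VL: Liquid Foundation Model 2 vision-language model, with the siglip2 vision encoder
GLM-4.7 / GLM-4.7-Flash: standardized loading format

New features

Reasoning Tokens Usage: OpenAI API responses report reasoning-token usage
Score API: EngineScoreMixin for scoring, decoupled from TokenizerManager
Auto Benchmark Tool: YAML-driven automatic benchmarking tool

Performance optimization

LoRA CUDA Graph: adds CUDA Graph support for LoRA, aimed at higher throughput
FA4 Speculative Decoding: speculative decoding based on FlashAttention 4
VLM Chunk-aware ViT: chunk-aware ViT encoding with per-image cache and lazy device transfer, reducing peak memory
NVFP4 CUTLASS default: NVFP4 uses the CUTLASS backend by default in diffusion workloads
DSV3 Router GEMM benchmark: router GEMM performance benchmark on Blackwell
FA3/FA4 lazy import: reduces startup time
flashinfer 0.6.7.post2 + sglang-kernel 0.4.1: dependency version bumps

New server_args.py parameters

--stream-response-default-include-usage: includes usage in streaming responses by default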

New environment variables

None

Bug fixes

killall_sglang missing the main process, flaky spec decoding test, Mistral embedding regression, XGrammarGrammarBackend cache, DP attention IPv6 binding, step3.5-flash launch crash