SGLang Code Change Summary (UTC+8, 2026-04-05)

This document summarizes all commits to the main branch of the SGLang project on April 5, 2026 (00:00 to 24:00, UTC+8), 23 commits in total.

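Overview

Category	Commits	Key changes
New models	1	Voxtral speech-to-text model
Performance/features	4	DeepSeek V3.2 IndexCache, AMD MLA FP8 (Kimi K2.5), Flux 2 accuracy fix, diffusion float64 platform support
Bug fixes	3	Hi-MambaRadixTree invariant violation, PD staging warmup, missing f-string prefixes
New server_args.py parameters	0	None
New environment variables	0	None
CI/Workflow/tooling	5	CI auto-bisect workflow, failfast flag, nightly test fix, reasoning test consolidation, auto benchmark test suspended
Speculative Decoding	2	SpecV2 qwen3 accuracy test reopened, Spec V1 path isolation
Diffusion	2	Flux series accuracy fix, is_float64_supported platform support
Refactoring/cleanup	4	think_end_id unification, reasoning test consolidation, dump_metric in eval paths, flaky test removal
Documentation	1	GLM-5 documentation update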

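1. New Models

1.1 Voxtral speech-to-text model

Adds support for the Voxtral speech-to-text model, including the full model file, a multimodal processor, and enhancements to the transformers utility functions, +777 lines in total.

Commit Message	Summary	PR Link
`[model] support voxtral (speech-to-text) (#21635)`	Adds the Voxtral speech-to-text model, including the model implementation, a multimodal processor, and HF transformers compatibility support	PR #21635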

2. Performance Optimizations and Features

2.1 DeepSeek V3.2 IndexCache

Enables IndexCache for DeepSeek V3.2, optimizing index caching for MLA attention to improve inference efficiency.

Commit Message	Summary	PR Link
`Enable IndexCache for DeepSeek V3.2 (#21405)`	Enables IndexCache for DeepSeek V3.2, optimizing MLA attention index caching to improve inference performance	PR #21405

2.2 AMD MLA + FP8 KV Cache (Kimi K2.5)

Supports MLA with nhead<16 and FP8 KV cache for Kimi K2.5 on AMD GPUs (TP=8 scenario).

Commit Message	Summary	PR Link
`[AMD]: Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5) (#21213)`	Supports MLA with nhead<16 and FP8 KV cache (TP=8) for Kimi K2.5 on AMD GPUs	PR #21213

2.3 Diffusion platform float64 support

Commit Message	Summary	PR Link
`[diffusion] Add is_float64_supported to Platform (#22112)`	Adds an `is_float64_supported` method to the Platform interface, unifying float64 capability detection across platforms	PR #22112
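
A capability check of this kind can be sketched as follows. This is a minimal illustration, not the SGLang implementation: only the name `is_float64_supported` comes from the commit title, while the `Platform` base class shown here, the two subclasses, and the `pick_dtype` helper are hypothetical, and the real method may have a different signature.

```python
from abc import ABC, abstractmethod


class Platform(ABC):
    """Hypothetical stand-in for a platform interface (not SGLang's class)."""

    @abstractmethod
    def is_float64_supported(self) -> bool:
        """Report whether this backend can run float64 computations."""


class FullPrecisionPlatform(Platform):
    # Hypothetical backend that supports float64.
    def is_float64_supported(self) -> bool:
        return True


class NoFloat64Platform(Platform):
    # Hypothetical backend without float64 support.
    def is_float64_supported(self) -> bool:
        return False


def pick_dtype(platform: Platform) -> str:
    """Hypothetical helper: choose the widest float dtype name available."""
    return "float64" if platform.is_float64_supported() else "float32"


print(pick_dtype(FullPrecisionPlatform()))  # float64
print(pick_dtype(NoFloat64Platform()))      # float32
```

Routing the decision through one method keeps callers from hard-coding per-backend checks.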

2.4 SpecV2 Qwen3 accuracy test

Commit Message	Summary	PR Link
`[SpecV2]: Reopen kl accuracy test for qwen3 + SpecV2 (#22104)`	Reopens the KL accuracy test for Qwen3 + SpecV2	PR #22104

3. Bug Fixes

Commit Message	Summary	PR Link
`[BugFix][RadixTree]: Fix backup invariant violation in Hi-MambaRadixTree (#22062)`	Fixes a cache error caused by a backup invariant violation in Hi-MambaRadixTree	PR #22062
`[PD] Fix staging warmup for GQA prefill decode different tp (#22153)`	Fixes a staging warmup failure when GQA prefill and decode use different TP sizes	PR #22153
`fix: add missing f-string prefixes in warning and assert messages (#22067)`	Adds the missing f-string prefixes in warning and assert messages so that variables are interpolated correctly	PR #22067
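
The class of bug addressed by the f-string fix is easy to reproduce. The snippet below is a generic Python illustration with made-up variable names, not code taken from the SGLang repository.

```python
name = "tp_size"  # illustrative variable, not from SGLang
value = 8

# Without the f prefix the braces are kept literally in the message.
broken = "Invalid {name}: {value}"

# With the f prefix the variables are interpolated.
fixed = f"Invalid {name}: {value}"

print(broken)  # Invalid {name}: {value}
print(fixed)   # Invalid tp_size: 8
```

Because the unprefixed string is still valid Python, the mistake raises no error and only shows up when someone reads the message.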

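4. New server_args.py Parameters

No new command-line arguments were added in this time window.

5. New Environment Variables

No new environment variables were added in this time window.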

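6. Diffusion

Commit Message	Summary	PR Link
`[diffusion] fix: fix accuracy for flux series (#22059)`	Fixes inference accuracy for Flux series models and adds the Flux 2 encoder configuration and DiT model	PR #22059
`[diffusion] Add is_float64_supported to Platform (#22112)`	Adds float64 support detection to the diffusion platform layer, covering multiple hardware backends	PR #22112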

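7. CI / Workflow / Tooling

7.1 CI auto-bisect workflow

Adds a CI auto-bisect workflow for automated regression analysis, including the GitHub Actions configuration, a Python script, and Slack notifications.

Commit Message	Summary	PR Link
`feat: CI auto-bisect workflow for automated regression analysis (#22119)`	Adds a CI auto-bisect workflow, including the workflow configuration, the bisect script, and Slack notifications	PR #22119
`Update ci_auto_bisect.py to use correct model (#22142)`	Fixes ci_auto_bisect.py, which was using the wrong model	PR #22142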
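
The idea behind bisecting a regression can be sketched as a binary search over an ordered commit list. This is a generic illustration and not the logic of `ci_auto_bisect.py`: the function name `first_bad_commit`, the `is_bad` callback, and the commit labels are all hypothetical. It assumes the first commit is known good, the last is known bad, and every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in an ordered list.

    Assumes commits[0] is good, commits[-1] is bad, and badness is
    monotonic (once a commit is bad, all later ones are bad too).
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Illustrative data: commits c0..c7, regression introduced at c5.
commits = [f"c{i}" for i in range(8)]
print(first_bad_commit(commits, lambda c: int(c[1:]) >= 5))  # c5
```

With n commits in the range, this needs about log2(n) test runs, which is what makes automated bisection practical in CI.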

7.2 Other CI changes

Commit Message	Summary	PR Link
`Add failfast flag to rerun-test workflow (#22141)`	Adds a failfast flag to the rerun-test workflow so that it stops at the first failure	PR #22141
`[Fix] Fix nightly tests (#22140)`	Fixes issues in the nightly tests	PR #22140
`[CI]Temporary ban auto benchmark tool test (#22138)`	Temporarily disables the auto benchmark tool test (unstable)	PR #22138

8. Speculative Decoding

Commit Message	Summary	PR Link
`[SpecV2]: Reopen kl accuracy test for qwen3 + SpecV2 (#22104)`	Reopens the KL accuracy test for Qwen3 + SpecV2	PR #22104
`Isolate spec V1 path in decode post-processing (#22146)`	Isolates the Spec V1 code path in decode post-processing to avoid confusion with V2	PR #22146

9. Refactoring and Cleanup

Commit Message	Summary	PR Link
`Unify think_end_id to model_config as single source of truth (#22148)`	Moves `think_end_id` into model_config as the single source of truth	PR #22148
`Consolidate reasoning tests into test/registered/reasoning/ (#22139)`	Consolidates reasoning-related tests into the `test/registered/reasoning/` directory	PR #22139
`Add dump_metric to MMMU, lm-eval, and NeMo Skills eval paths (#22147)`	Adds dump_metric to the MMMU, lm-eval, and NeMo Skills evaluation paths	PR #22147
`Migrate reasoning_tokens tests to existing server fixtures (#22102)`	Migrates the reasoning_tokens tests to existing server fixtures	PR #22102
`Remove flaky TestToolChoiceLfm2Moe from test_tool_choice (#22137)`	Removes the flaky TestToolChoiceLfm2Moe test	PR #22137

10. Other Changes

Commit Message	Summary	PR Link
`[Doc] Update GLM-5 instructions in sglang documentation (#21716)`	Updates the usage instructions for the GLM-5 model in the SGLang documentation	PR #21716
`DEBUG: reproduce flaky test_load_weights_from_remote_instance (#22150)`	Debug commit to reproduce the flaky behavior of `test_load_weights_from_remote_instance`	PR #22150

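Highlights

New models

Voxtral: speech-to-text model, including a multimodal processor

Performance optimizations

DeepSeek V3.2 IndexCache: enables IndexCache to optimize MLA attention index caching
AMD MLA + FP8 KV Cache: Kimi K2.5 supports MLA with nhead<16 and FP8 KV cache on AMD GPUs
Diffusion float64 platform support: unifies float64 capability detection across platforms
Flux series accuracy fix: fixes inference accuracy for Flux series models

Bug fixes

Hi-MambaRadixTree: fixes a backup invariant violation
PD staging warmup: fixes the warmup failure when GQA prefill and decode use different TP sizes
Missing f-string prefixes: fixes message interpolation errors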

New server_args.py parameters

None

New environment variables

None

Tooling

CI Auto-Bisect: CI workflow that automatically bisects regressions, with Slack notifications