SGLang main Branch Commit Summary (2026-04-21, UTC+8)
This document summarizes all commits to the SGLang main branch on April 21, 2026 (0:00-24:00 UTC+8), 51 commits in total.
1. New Model Support
Support for the following models or model features was added yesterday:

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [AMD] Fused qk rmsnorm bf16 for amd/Kimi-K2.5-MXFP4 (#23186) | Implements a fused qk rmsnorm bf16 optimization for the Kimi-K2.5-MXFP4 model on AMD GPUs | #23186 |
| [AMD] Enable MTP for GLM-5-mxfp4 model (#23219) | Enables MTP (Multi-Token Prediction) for the GLM-5-mxfp4 model | #23219 |
| [Diffusion][CPU] Init CPU platform support for SGLang Diffusion (#20816) | Initial CPU platform support for SGLang Diffusion, including a CPU worker, scheduler, and CPU fallbacks for the relevant Triton kernels | #20816 |
| Add new Mintlify documentation site (docs_new/) (#23001) | Adds a brand-new Mintlify documentation site, including cookbook docs for many new models | #23001 |
2. Performance Optimizations
2.1 Attention and Kernel Optimizations

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [KDA] Fuse gate+cumsum and reuse chunk index for KDA (#23038) | Fuses KDA's gate and cumsum operations and reuses the chunk index, reducing memory traffic and improving performance | #23038 |
| Optimize LTX2 feed-forward tensor parallelism (#23221) | Improves tensor-parallel performance of the LTX2 model's feed-forward layers | #23221 |
| [Perf] Make EAGLE bigram key an O(1) view on RadixKey (#23106) | Turns the EAGLE bigram key into an O(1) view on RadixKey, speeding up speculative decoding | #23106 |
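The "O(1) view" idea behind #23106 can be illustrated with a minimal sketch. All names below (`BigramView`) are hypothetical and not SGLang's actual API: rather than materializing a new list of `(token[i], token[i+1])` pairs on every lookup, the key wraps the underlying token list lazily, so constructing it costs O(1) and pairs are only formed on access.

```python
# Hypothetical sketch of an O(1) "bigram view" over a token-id list.
# Constructing BigramView copies nothing; pairs materialize only on access.
class BigramView:
    def __init__(self, tokens):
        self._tokens = tokens  # shared reference, no copy

    def __len__(self):
        return max(len(self._tokens) - 1, 0)

    def __getitem__(self, i):
        return (self._tokens[i], self._tokens[i + 1])


tokens = [10, 20, 30, 40]
view = BigramView(tokens)       # O(1), regardless of len(tokens)
print(len(view))                # 3
print(view[0])                  # (10, 20)
print(list(view))               # [(10, 20), (20, 30), (30, 40)]
```

Iteration falls back on `__getitem__`, so `list(view)` builds the bigrams on demand; until then no pair list exists in memory.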
2.2 Cache System Optimizations

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| Opt-in strip of thinking tokens from radix cache (#23315) | Adds the --strip-thinking-cache flag, which optionally skips caching a reasoning model's thinking tokens to reduce cache footprint | #23315 |
| [Hybrid-Cache]: Refactor hybrid_pool_assembler.py (#23243) | Refactors the hybrid pool assembler and streamlines the hybrid-cache assembly logic | #23243 |
| fix(hicache): emit KV events for L2 host cache insertions (#22894) | Fixes HiCache so that KV events are emitted for L2 host cache insertions | #22894 |
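The effect of #23315 can be sketched as follows. This is a hedged illustration of the behavior described above, not SGLang's actual cache code, and the function name `tokens_to_cache` is invented: with the flag enabled, only the prompt prefix is inserted into the radix cache, so a reasoning model's thinking and answer tokens never occupy cache space.

```python
# Illustrative sketch of the opt-in behavior behind --strip-thinking-cache:
# when enabled, only the prompt prefix enters the radix cache, so the
# model's output (thinking + answer) tokens are not cached.
def tokens_to_cache(prompt_ids, output_ids, strip_thinking_cache=False):
    if strip_thinking_cache:
        return list(prompt_ids)                  # cache the prompt prefix only
    return list(prompt_ids) + list(output_ids)   # default: cache everything


prompt = [1, 2, 3]
output = [7, 8, 9]  # thinking + answer tokens
print(tokens_to_cache(prompt, output))                             # [1, 2, 3, 7, 8, 9]
print(tokens_to_cache(prompt, output, strip_thinking_cache=True))  # [1, 2, 3]
```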
2.3 Other Optimizations

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| Support moe_dp_size = 1 for various attention_cp_size (#22003) | Supports setting moe_dp_size=1 under various attention context-parallel sizes | #22003 |
3. Bug Fixes
3.1 Core Functionality Fixes

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| Fix: Add token heuristic increment in total_tokens load balancing (#22614) | Adds the missing token heuristic increment in total_tokens load balancing | #22614 |
| [HiCache]Fix hybrid model move_indices (#22940) | Fixes broken move_indices logic for hybrid models in HiCache | #22940 |
| [PD] Fix clip logic when state indices lens are mismatch (#23323) | Fixes the clip logic in PD disaggregation when state index lengths mismatch | #23323 |
| Fix trtllm mla chunked-prefill zero-length bug (#22291) (#22688) | Fixes a zero-length bug in trtllm MLA chunked prefill | #22688 |
| [Fix] Solve the error lead by _commit_transfer_to_req() when using IntraNode NVLink in PD disaggregation (#23252) | Fixes an error caused by _commit_transfer_to_req() when using IntraNode NVLink in PD disaggregation | #23252 |
| Fix hybrid swa chunked prefill oom (#23174) | Fixes an OOM caused by hybrid SWA chunked prefill | #23174 |
| Fix segfault in cudaMemcpyBatchAsync on CUDA 13.0 (#23136) | Fixes a segfault in cudaMemcpyBatchAsync on CUDA 13.0 | #23136 |
| fix: reset empty prefill batch fullness (#23138) | Fixes resetting the fullness of an empty prefill batch | #23138 |
| fix: add back priorty as radix cache policy (#23275) | Restores priority as a radix cache policy option | #23275 |
3.2 Speculative Decoding Fixes

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [sgl] fix incorrect behavior in cuda graph draft extend (#22832) | Fixes incorrect behavior in CUDA graph draft extend | #22832 |
| [sgl] multilayereagleworkerv2 fix (#22954) | Fixes an issue in MultiLayerEagleWorkerV2 | #22954 |
| [AMD] Resolve Qwen3.5 MTP (speculative decoding) radix cache conflict. (#22908) | Resolves the conflict between Qwen3.5 MTP speculative decoding and the radix cache | #22908 |
3.3 Platform-Specific Fixes

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [XPU] Fix DeepSeek-OCR tests under transformers 5.x (#23044) | Fixes DeepSeek-OCR test failures under transformers 5.x (Intel XPU) | #23044 |
| [ROCm] Uniform docker to support AMD AINIC, BRCM Thor2 IBGDA NIC for MoRI-EP (#23263) | Unifies the AMD ROCm docker configuration to support AMD AINIC and BRCM Thor2 IBGDA NICs for MoRI-EP | #23263 |
| fix legacy deepep path for flashinfer_cutedsl (#22925) | Fixes the legacy deepep path for flashinfer_cutedsl | #22925 |
4. New Features and APIs
4.1 Score API

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [Score API] Add Multi-Item Scoring with pre-computed delimiter indices (#22544) | Adds Multi-Item Scoring to the Score API, scoring multiple items via pre-computed delimiter indices; introduces the --enable-mis flag | #22544 |
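The packing scheme described above can be sketched in a few lines. This is a hedged illustration, not the PR's actual implementation; `pack_items` and the `DELIM` sentinel are invented names. The key point is that delimiter positions are recorded while packing, so each item's span in the merged sequence can later be located in O(1) without rescanning for delimiters.

```python
# Illustrative sketch of Multi-Item Scoring sequence packing: the query
# and all items are joined into one token sequence separated by a
# delimiter, and the delimiter indices are recorded up front so each
# item's span can be found without rescanning the sequence.
DELIM = -1  # stand-in for a real delimiter token id

def pack_items(query_ids, item_ids_list):
    seq = list(query_ids)
    delim_indices = []
    for item_ids in item_ids_list:
        delim_indices.append(len(seq))  # position where this delimiter lands
        seq.append(DELIM)
        seq.extend(item_ids)
    return seq, delim_indices


seq, idx = pack_items([1, 2], [[10, 11], [20]])
print(seq)  # [1, 2, -1, 10, 11, -1, 20]
print(idx)  # [2, 5]
```

With the indices in hand, item k occupies `seq[idx[k] + 1 : idx[k + 1]]` (or to the end for the last item), which is what makes per-item score extraction cheap.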
4.2 Speculative Decoding Enhancements

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [SPEC][1/N] feat: add adaptive speculative_num_steps for EAGLE topk=1 (#21599) | Adds adaptive speculative_num_steps for EAGLE topk=1, dynamically adjusting the number of draft steps; introduces the --speculative-adaptive and --speculative-adaptive-config flags | #21599 |
| [sgl] add support for weight update function in spedec (#22088) | Adds weight-update support for speculative decoding | #22088 |
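The adaptive-steps idea can be sketched with a simple threshold controller. This is an assumption for illustration only: #21599's actual policy and thresholds may differ, and `adapt_num_steps` is an invented name. The intuition is that a high acceptance rate means deeper drafts still pay off, while a low one means draft work is being wasted.

```python
# Hypothetical controller for adaptive speculative_num_steps: deepen the
# draft while accepted tokens keep paying for it, shrink it when
# acceptance drops. Thresholds here are illustrative, not from #21599.
def adapt_num_steps(num_steps, acceptance_rate, lo=0.4, hi=0.8,
                    min_steps=1, max_steps=8):
    if acceptance_rate > hi:
        num_steps += 1   # drafts mostly accepted: speculate deeper
    elif acceptance_rate < lo:
        num_steps -= 1   # drafts mostly rejected: speculate less
    return max(min_steps, min(max_steps, num_steps))


steps = 4
steps = adapt_num_steps(steps, acceptance_rate=0.9)  # high acceptance
print(steps)  # 5
steps = adapt_num_steps(steps, acceptance_rate=0.2)  # low acceptance
print(steps)  # 4
```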
5. New server_args.py Parameters
The following parameters were added to or modified in server_args.py yesterday:

| Parameter | Type | Description | PR |
| --- | --- | --- | --- |
| --strip-thinking-cache | store_true | Skips caching a reasoning model's output (thinking + answer), keeping only the prompt prefix to reduce cache footprint | #23315 |
| --enable-mis | store_true | Enables the Multi-Item Scoring optimization, which packs the query and multiple items into a single sequence for efficient batching | #22544 |
| --speculative-adaptive | store_true | Enables adaptive speculative decoding, dynamically adjusting num_steps based on the acceptance rate | #21599 |
| --speculative-adaptive-config | str | Path to a JSON config file for adaptive speculative decoding | #21599 |
Other server_args.py changes:

| Commit Message | Change | PR Link |
| --- | --- | --- |
| [AMD] Resolve Qwen3.5 MTP (speculative decoding) radix cache conflict. (#22908) | Resolves the compatibility issue between Qwen3.5 MTP and the radix cache | #22908 |
| Support moe_dp_size = 1 for various attention_cp_size (#22003) | Adjusts the argument-validation logic to allow moe_dp_size=1 | #22003 |
| fix: add back priorty as radix cache policy (#23275) | Restores priority as a radix cache eviction policy | #23275 |
| Multiple changes | Improves automatic attention backend selection, falling back from FlashInfer to Triton when attention sinks are unsupported | - |
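The shapes of the four new flags (three store_true toggles plus one string path) can be reproduced with standard argparse. This is an illustrative sketch matching the table above, not SGLang's actual server_args.py source, and the help strings are paraphrased:

```python
# Illustrative argparse declarations matching the flag types listed above
# (store_true toggles plus one string path); not the real server_args.py.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--strip-thinking-cache", action="store_true",
                    help="Cache only the prompt prefix, not thinking/answer tokens")
parser.add_argument("--enable-mis", action="store_true",
                    help="Enable Multi-Item Scoring")
parser.add_argument("--speculative-adaptive", action="store_true",
                    help="Adapt speculative num_steps to the acceptance rate")
parser.add_argument("--speculative-adaptive-config", type=str, default=None,
                    help="Path to a JSON config for adaptive speculation")

args = parser.parse_args(["--enable-mis",
                          "--speculative-adaptive-config", "cfg.json"])
print(args.enable_mis, args.strip_thinking_cache)  # True False
print(args.speculative_adaptive_config)            # cfg.json
```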
6. Environment Variables
Yesterday's commits mainly touched environment-variable documentation (via the docs_new site updates); no code commits introduced new environment variables.
7. CI/CD and Infrastructure

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [AMD] prepare for MI300x PR runner pool: registry mirror, runner routing, threshold tuning (#23156) | Prepares the MI300x PR runner pool: registry mirror, runner routing, and threshold tuning | #23156 |
| [AMD] CI - Fix the cancelled guard to AMD CI (#23338) | Fixes the cancelled-guard logic in AMD CI | #23338 |
| [misc] CI hygiene: enforce __main__ entry, drop silent-skipped tests, fix rerun-test protoc (#23305) | CI hygiene: enforces a __main__ entry point, drops silently skipped tests, and fixes the rerun-test protoc | #23305 |
| [CI][MLA] Enable deterministic inference for MGSM MLA FP8 test (#23303) | Enables deterministic inference for the MGSM MLA FP8 test | #23303 |
| [CI][LoRA] Drop flaky all-None batch from multi-LoRA parity test (#23287) | Drops the flaky all-None batch from the multi-LoRA parity test | #23287 |
| ci: reduce scheduled PR test from 4x to 3x daily (#23313) | Reduces scheduled PR test runs from four to three per day | #23313 |
| [docker] Fix stray backslash dropping sgl-model-gateway COPY (#23097) | Fixes a stray backslash in the dockerfile that dropped the sgl-model-gateway COPY step | #23097 |
| [Docker] Move Rust toolchain install to torch_deps stage (#23278) | Moves the Rust toolchain install to the torch_deps stage | #23278 |
| [CI] Fix nightly docker builds failing on root-owned workspace leftovers (#23279) | Fixes nightly docker builds failing on root-owned workspace leftovers | #23279 |
| [CI] Fix wait-for-jobs hanging when matrix job skipped at job level (#23277) | Fixes wait-for-jobs hanging when a matrix job is skipped at the job level | #23277 |
| [release] install rust toolchain in main dockerfile (#23014) | Installs the Rust toolchain in the main dockerfile | #23014 |
| [Diffusion][NPU][CI] update perf numbers (#23056) | Updates perf numbers for the Diffusion NPU CI | #23056 |
8. Code Refactoring

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [Refactor] Move radix-cache utils onto RadixKey as methods (#23209) | Moves radix-cache utility functions onto RadixKey as methods, improving code organization | #23209 |
| [Refactor] Replace page_align_keys helper with RadixKey.page_aligned method (#23107) | Replaces the page_align_keys helper with the RadixKey.page_aligned method | #23107 |
| [CPU] expand the interface of shared_expert without scaling factor (#22933) | Expands the CPU shared_expert interface to support the no-scaling-factor case | #22933 |
9. Documentation and Miscellaneous

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [Docs] Update installation and TPU documentation to fix the render problem (#23344) | Updates the installation and TPU docs to fix a rendering problem | #23344 |
| docs: redirect /cookbook to /cookbook/intro (#23348) | Adds a redirect from /cookbook to /cookbook/intro | #23348 |
| [Docs] Sync docs_new with legacy docs and update migration redirects (#23337) | Syncs docs_new with the legacy docs and updates the migration redirects | #23337 |
| Docs/url redirect (#23312) | Documentation URL redirect configuration | #23312 |
| Fix formatting for ACM-VIT in README acknowledgements section (#23325) | Fixes the ACM-VIT formatting in the README acknowledgements section | #23325 |
| [NPU] [DOC] Quick start doc for Ascend NPU (#23238) | Adds a quick-start doc for Ascend NPU | #23238 |
| Update CODEOWNERS to include new documentation paths for docs and doc… (#23293) | Updates CODEOWNERS to include the new documentation paths | #23293 |
10. Summary
Yesterday's 51 commits cover the following areas:
- New model support: Kimi-K2.5-MXFP4 (AMD optimization), GLM-5-mxfp4 (MTP support), and CPU platform support for SGLang Diffusion
- Performance: KDA kernel fusion, LTX2 tensor-parallel optimization, O(1) EAGLE bigram keys, and cache-system refactoring
- Bug fixes: roughly 15 fixes spanning load balancing, HiCache, PD disaggregation, trtllm MLA, OOM, and CUDA 13.0 compatibility
- New features: the Multi-Item Scoring API, adaptive speculative decoding, and speculative-decoding weight updates
- New server_args.py parameters: --strip-thinking-cache, --enable-mis, --speculative-adaptive, --speculative-adaptive-config
- CI/CD: MI300x runner pool preparation, CI hygiene improvements, and Docker build fixes