Summary of Commits to the SGLang main Branch (2026-04-21, UTC+8)

This post summarizes all commits merged into the SGLang main branch between 00:00 and 24:00 on 2026-04-21 (UTC+8), 51 commits in total.


1. New Model Support

Support was added yesterday for the following models and model features:

| Commit Message | Summary | PR |
| --- | --- | --- |
| [AMD] Fused qk rmsnorm bf16 for amd/Kimi-K2.5-MXFP4 (#23186) | Implements a fused qk rmsnorm bf16 optimization for the Kimi-K2.5-MXFP4 model on AMD GPUs | #23186 |
| [AMD] Enable MTP for GLM-5-mxfp4 model (#23219) | Enables MTP (Multi-Token Prediction) for the GLM-5-mxfp4 model | #23219 |
| [Diffusion][CPU] Init CPU platform support for SGLang Diffusion (#20816) | Initial CPU platform support for SGLang Diffusion, including a CPU worker, scheduler, and CPU fallbacks for the relevant Triton kernels | #20816 |
| Add new Mintlify documentation site (docs_new/) (#23001) | Adds a brand-new Mintlify documentation site with cookbook docs for many new models | #23001 |

2. Performance Optimizations

2.1 Attention and Kernel Optimizations

| Commit Message | Summary | PR |
| --- | --- | --- |
| [KDA] Fuse gate+cumsum and reuse chunk index for KDA (#23038) | Fuses KDA's gate and cumsum operations and reuses the chunk index, reducing memory traffic and improving performance | #23038 |
| Optimize LTX2 feed-forward tensor parallelism (#23221) | Optimizes tensor parallelism of the LTX2 model's feed-forward layers | #23221 |
| [Perf] Make EAGLE bigram key an O(1) view on RadixKey (#23106) | Turns the EAGLE bigram key into an O(1) view on RadixKey, speeding up speculative decoding | #23106 |
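The bigram-key change can be illustrated with a toy zero-copy view. All names below are hypothetical, not SGLang's actual RadixKey API; the sketch only shows the general idea of exposing an O(1) view over an existing token buffer instead of materializing a shifted copy for bigram lookup.

```python
class BigramView:
    """Illustrative O(1) view pairing each token with its successor.

    Construction does no copying: the view shares the underlying list,
    so building it is O(1) regardless of sequence length.
    """

    def __init__(self, token_ids):
        self._ids = token_ids  # shared reference, not a copy

    def __len__(self):
        return max(len(self._ids) - 1, 0)

    def __getitem__(self, i):
        # The i-th item is the (token, next_token) bigram.
        return (self._ids[i], self._ids[i + 1])


tokens = [5, 9, 9, 2]
view = BigramView(tokens)
assert list(view) == [(5, 9), (9, 9), (9, 2)]
```

The design point is that the view is derived lazily from data the cache already holds, so no per-lookup allocation is needed.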

2.2 Cache System Optimizations

| Commit Message | Summary | PR |
| --- | --- | --- |
| Opt-in strip of thinking tokens from radix cache (#23315) | Adds a --strip-thinking-cache flag that opts out of caching a reasoning model's thinking tokens, reducing cache footprint | #23315 |
| [Hybrid-Cache]: Refactor hybrid_pool_assembler.py (#23243) | Refactors the hybrid pool assembler and streamlines hybrid-cache assembly logic | #23243 |
| fix(hicache): emit KV events for L2 host cache insertions (#22894) | Fixes HiCache to emit KV events for L2 host cache insertions | #22894 |

2.3 Other Optimizations

| Commit Message | Summary | PR |
| --- | --- | --- |
| Support moe_dp_size = 1 for various attention_cp_size (#22003) | Supports setting moe_dp_size=1 under various attention context-parallel sizes | #22003 |

3. Bug Fixes

3.1 Core Functionality Fixes

| Commit Message | Summary | PR |
| --- | --- | --- |
| Fix: Add token heuristic increment in total_tokens load balancing (#22614) | Fixes a missing token-heuristic increment in total_tokens load balancing | #22614 |
| [HiCache]Fix hybrid model move_indices (#22940) | Fixes incorrect move_indices logic for hybrid models in HiCache | #22940 |
| [PD] Fix clip logic when state indices lens are mismatch (#23323) | Fixes the clip logic in PD disaggregation when state index lengths are mismatched | #23323 |
| Fix trtllm mla chunked-prefill zero-length bug (#22291) (#22688) | Fixes a zero-length bug in trtllm MLA chunked prefill | #22688 |
| [Fix] Solve the error lead by _commit_transfer_to_req() when using IntraNode NVLink in PD disaggregation (#23252) | Fixes an error caused by _commit_transfer_to_req() when using intra-node NVLink in PD disaggregation | #23252 |
| Fix hybrid swa chunked prefill oom (#23174) | Fixes an OOM triggered by hybrid SWA chunked prefill | #23174 |
| Fix segfault in cudaMemcpyBatchAsync on CUDA 13.0 (#23136) | Fixes a segfault in cudaMemcpyBatchAsync on CUDA 13.0 | #23136 |
| fix: reset empty prefill batch fullness (#23138) | Fixes the fullness reset of an empty prefill batch | #23138 |
| fix: add back priorty as radix cache policy (#23275) | Restores priority as an available radix cache policy | #23275 |

3.2 Speculative Decoding Fixes

| Commit Message | Summary | PR |
| --- | --- | --- |
| [sgl] fix incorrect behavior in cuda graph draft extend (#22832) | Fixes incorrect behavior in CUDA graph draft extend | #22832 |
| [sgl] multilayereagleworkerv2 fix (#22954) | Fixes an issue in MultiLayerEagleWorkerV2 | #22954 |
| [AMD] Resolve Qwen3.5 MTP (speculative decoding) radix cache conflict. (#22908) | Resolves a conflict between Qwen3.5 MTP speculative decoding and the radix cache | #22908 |

3.3 Platform-Specific Fixes

| Commit Message | Summary | PR |
| --- | --- | --- |
| [XPU] Fix DeepSeek-OCR tests under transformers 5.x (#23044) | Fixes the DeepSeek-OCR tests under transformers 5.x (Intel XPU) | #23044 |
| [ROCm] Uniform docker to support AMD AINIC, BRCM Thor2 IBGDA NIC for MoRI-EP (#23263) | Unifies the AMD ROCm Docker configuration to support AMD AINIC and BRCM Thor2 IBGDA NICs for MoRI-EP | #23263 |
| fix legacy deepep path for flashinfer_cutedsl (#22925) | Fixes the legacy deepep path for flashinfer_cutedsl | #22925 |

4. New Features and APIs

4.1 Score API

| Commit Message | Summary | PR |
| --- | --- | --- |
| [Score API] Add Multi-Item Scoring with pre-computed delimiter indices (#22544) | Adds Multi-Item Scoring to the Score API, using pre-computed delimiter indices to score multiple items; introduces the --enable-mis flag | #22544 |
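The idea behind pre-computed delimiter indices can be sketched as follows. This is a simplified illustration, not SGLang's actual code: the function name and the packing layout (query and items joined by a single delimiter token) are assumptions. With the delimiter positions recorded once at request-build time, each item's token span can be recovered without rescanning the packed sequence.

```python
def item_spans(seq_len, delim_indices):
    """Return half-open (start, end) spans between consecutive delimiters.

    The first span is the query; each following span is one item.
    Delimiter tokens themselves are excluded from every span.
    """
    bounds = list(delim_indices) + [seq_len]
    spans = []
    start = 0
    for b in bounds:
        spans.append((start, b))
        start = b + 1  # skip past the delimiter token
    return spans


# A 9-token packed sequence: query | item1 | item2,
# with delimiter tokens at positions 3 and 6.
assert item_spans(9, [3, 6]) == [(0, 3), (4, 6), (7, 9)]
```

Because the spans are derived from indices computed once up front, multiple items can be scored from one batched forward pass instead of one request per item.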

4.2 Speculative Decoding Enhancements

| Commit Message | Summary | PR |
| --- | --- | --- |
| [SPEC][1/N] feat: add adaptive speculative_num_steps for EAGLE topk=1 (#21599) | Adds adaptive speculative_num_steps for EAGLE topk=1, dynamically adjusting the number of draft steps; introduces the --speculative-adaptive and --speculative-adaptive-config flags | #21599 |
| [sgl] add support for weight update function in spedec (#22088) | Adds weight-update support to speculative decoding | #22088 |
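A minimal sketch of what an acceptance-rate-driven controller for speculative_num_steps might look like. The function name, thresholds, and bounds here are illustrative assumptions, not the PR's actual logic; the sketch only conveys the feedback loop: draft deeper when recent drafts are mostly accepted, back off when they are mostly rejected.

```python
def adapt_num_steps(num_steps, accept_rate,
                    low=0.4, high=0.8, min_steps=1, max_steps=8):
    """Adjust the draft depth based on the recent draft acceptance rate."""
    if accept_rate >= high:
        num_steps += 1   # drafts are paying off, go deeper
    elif accept_rate <= low:
        num_steps -= 1   # too many wasted drafts, back off
    return max(min_steps, min(num_steps, max_steps))


assert adapt_num_steps(4, 0.9) == 5   # high acceptance: deepen
assert adapt_num_steps(4, 0.2) == 3   # low acceptance: shrink
assert adapt_num_steps(8, 0.95) == 8  # clamped at the upper bound
```

The appeal of such a controller is that draft depth stops being a fixed launch-time tuning knob and instead tracks the workload at runtime.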

5. New Parameters in server_args.py

The following parameters were added to server_args.py yesterday:

| Parameter | Type | Description | PR |
| --- | --- | --- | --- |
| --strip-thinking-cache | store_true | Skips caching a reasoning model's output (thinking + answer) and keeps only the prompt prefix, reducing cache footprint | #23315 |
| --enable-mis | store_true | Enables the Multi-Item Scoring optimization, packing the query and multiple items into a single sequence for efficient batching | #22544 |
| --speculative-adaptive | store_true | Enables adaptive speculative decoding, dynamically adjusting num_steps based on the acceptance rate | #21599 |
| --speculative-adaptive-config | str | Path to a JSON config file for adaptive speculative decoding | #21599 |
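For reference, flags of this shape are typically declared as in the argparse sketch below. This is illustrative only: SGLang's server_args.py has its own parser, defaults, and help text, so the declarations here are assumptions about the general pattern rather than the actual source.

```python
import argparse

# Hypothetical sketch of how store_true / str flags like these
# are commonly registered; not copied from server_args.py.
parser = argparse.ArgumentParser()
parser.add_argument("--strip-thinking-cache", action="store_true",
                    help="Do not cache thinking tokens in the radix cache.")
parser.add_argument("--enable-mis", action="store_true",
                    help="Enable Multi-Item Scoring.")
parser.add_argument("--speculative-adaptive", action="store_true",
                    help="Adapt speculative num_steps at runtime.")
parser.add_argument("--speculative-adaptive-config", type=str, default=None,
                    help="Path to a JSON config for adaptive spec decoding.")

args = parser.parse_args(["--enable-mis", "--speculative-adaptive"])
assert args.enable_mis and args.speculative_adaptive
assert not args.strip_thinking_cache
```

Note that argparse converts the dashes to underscores, so --strip-thinking-cache becomes args.strip_thinking_cache.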

Other server_args.py changes:

| Commit Message | Change | PR |
| --- | --- | --- |
| [AMD] Resolve Qwen3.5 MTP (speculative decoding) radix cache conflict. (#22908) | Resolves a compatibility issue between Qwen3.5 MTP and the radix cache | #22908 |
| Support moe_dp_size = 1 for various attention_cp_size (#22003) | Adjusts parameter validation to allow moe_dp_size=1 | #22003 |
| fix: add back priorty as radix cache policy (#23275) | Restores priority as a radix cache eviction policy | #23275 |
| Multiple commits | Improves automatic attention-backend selection: falls back from FlashInfer to Triton when attention sinks are unsupported | - |

6. Environment Variables

Yesterday's commits only touched environment variables at the documentation level (via the docs_new site updates); no code commits introduced new environment variables.


7. CI/CD and Infrastructure

| Commit Message | Summary | PR |
| --- | --- | --- |
| [AMD] prepare for MI300x PR runner pool: registry mirror, runner routing, threshold tuning (#23156) | Prepares the MI300x PR runner pool: registry mirror, runner routing, and threshold tuning | #23156 |
| [AMD] CI - Fix the cancelled guard to AMD CI (#23338) | Fixes the cancelled-guard logic in AMD CI | #23338 |
| [misc] CI hygiene: enforce __main__ entry, drop silent-skipped tests, fix rerun-test protoc (#23305) | CI hygiene: enforces a __main__ entry point, drops silently skipped tests, and fixes the rerun-test protoc | #23305 |
| [CI][MLA] Enable deterministic inference for MGSM MLA FP8 test (#23303) | Enables deterministic inference for the MGSM MLA FP8 test | #23303 |
| [CI][LoRA] Drop flaky all-None batch from multi-LoRA parity test (#23287) | Drops the flaky all-None batch from the multi-LoRA parity test | #23287 |
| ci: reduce scheduled PR test from 4x to 3x daily (#23313) | Reduces the scheduled PR test from four runs per day to three | #23313 |
| [docker] Fix stray backslash dropping sgl-model-gateway COPY (#23097) | Fixes a stray backslash in the Dockerfile that dropped the sgl-model-gateway COPY | #23097 |
| [Docker] Move Rust toolchain install to torch_deps stage (#23278) | Moves the Rust toolchain install to the torch_deps stage | #23278 |
| [CI] Fix nightly docker builds failing on root-owned workspace leftovers (#23279) | Fixes nightly Docker builds failing on root-owned workspace leftovers | #23279 |
| [CI] Fix wait-for-jobs hanging when matrix job skipped at job level (#23277) | Fixes wait-for-jobs hanging when a matrix job is skipped at the job level | #23277 |
| [release] install rust toolchain in main dockerfile (#23014) | Installs the Rust toolchain in the main Dockerfile | #23014 |
| [Diffusion][NPU][CI] update perf numbers (#23056) | Updates the perf numbers for the Diffusion NPU CI | #23056 |

8. Code Refactoring

| Commit Message | Summary | PR |
| --- | --- | --- |
| [Refactor] Move radix-cache utils onto RadixKey as methods (#23209) | Moves radix-cache utility functions onto RadixKey as methods, improving code organization | #23209 |
| [Refactor] Replace page_align_keys helper with RadixKey.page_aligned method (#23107) | Replaces the page_align_keys helper with the RadixKey.page_aligned method | #23107 |
| [CPU] expand the interface of shared_expert without scaling factor (#22933) | Extends the CPU shared_expert interface to support use without a scaling factor | #22933 |

9. Documentation and Miscellaneous

| Commit Message | Summary | PR |
| --- | --- | --- |
| [Docs] Update installation and TPU documentation to fix the render problem (#23344) | Updates the installation and TPU docs to fix a rendering problem | #23344 |
| docs: redirect /cookbook to /cookbook/intro (#23348) | Adds a redirect from /cookbook to /cookbook/intro | #23348 |
| [Docs] Sync docs_new with legacy docs and update migration redirects (#23337) | Syncs docs_new with the legacy docs and updates the migration redirects | #23337 |
| Docs/url redirect (#23312) | Configures documentation URL redirects | #23312 |
| Fix formatting for ACM-VIT in README acknowledgements section (#23325) | Fixes ACM-VIT formatting in the README acknowledgements section | #23325 |
| [NPU] [DOC] Quick start doc for Ascend NPU (#23238) | Adds a quick-start doc for Ascend NPU | #23238 |
| Update CODEOWNERS to include new documentation paths for docs and doc… (#23293) | Updates CODEOWNERS to cover the new documentation paths | #23293 |

10. Summary

Yesterday's 51 commits fall into the following areas:

  1. New model support: Kimi-K2.5-MXFP4 (AMD optimizations), GLM-5-mxfp4 (MTP support), and CPU platform support for SGLang Diffusion
  2. Performance: KDA kernel fusion, LTX2 tensor-parallel optimization, the O(1) EAGLE bigram key, and cache-system refactoring
  3. Bug fixes: roughly 15 fixes covering load balancing, HiCache, PD disaggregation, trtllm MLA, OOM issues, and CUDA 13.0 compatibility
  4. New features: the Multi-Item Scoring API, adaptive speculative decoding, and speculative-decoding weight updates
  5. New server_args.py parameters: --strip-thinking-cache, --enable-mis, --speculative-adaptive, and --speculative-adaptive-config
  6. CI/CD: MI300x runner pool preparation, CI hygiene improvements, and Docker build fixes