Summary of Commits to the SGLang main Branch (2026-04-21, UTC+8)

This post summarizes all commits merged into the SGLang main branch between 00:00 and 24:00 on 2026-04-21 (UTC+8), 51 commits in total.


1. New Model Support

Support was added yesterday for the following models and model features:

| Commit Message | Summary | PR |
| --- | --- | --- |
| [AMD] Fused qk rmsnorm bf16 for amd/Kimi-K2.5-MXFP4 (#23186) | Implements a fused qk rmsnorm bf16 optimization for the Kimi-K2.5-MXFP4 model on AMD GPUs | #23186 |
| [AMD] Enable MTP for GLM-5-mxfp4 model (#23219) | Enables MTP (Multi-Token Prediction) for the GLM-5-mxfp4 model | #23219 |
| [Diffusion][CPU] Init CPU platform support for SGLang Diffusion (#20816) | Initial CPU platform support for SGLang Diffusion, including a CPU worker, scheduler, and CPU fallbacks for the relevant Triton kernels | #20816 |
| Add new Mintlify documentation site (docs_new/) (#23001) | Adds a brand-new Mintlify documentation site with cookbook docs for many new models | #23001 |

2. Performance Optimizations

2.1 Attention and Kernel Optimizations

| Commit Message | Summary | PR |
| --- | --- | --- |
| [KDA] Fuse gate+cumsum and reuse chunk index for KDA (#23038) | Fuses KDA's gate and cumsum operations and reuses the chunk index, reducing memory traffic and improving performance | #23038 |
| Optimize LTX2 feed-forward tensor parallelism (#23221) | Optimizes tensor parallelism of the LTX2 model's feed-forward layers | #23221 |
| [Perf] Make EAGLE bigram key an O(1) view on RadixKey (#23106) | Turns the EAGLE bigram key into an O(1) view on RadixKey, speeding up speculative decoding | #23106 |
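The bigram-key change can be illustrated with a toy zero-copy view. All names below are hypothetical, not SGLang's actual RadixKey API; the sketch only shows the general idea of exposing an O(1) view over an existing token buffer instead of materializing a shifted copy for bigram lookup.

```python
class BigramView:
    """Illustrative O(1) view pairing each token with its successor.

    Construction does no copying: the view shares the underlying list,
    so building it is O(1) regardless of sequence length.
    """

    def __init__(self, token_ids):
        self._ids = token_ids  # shared reference, not a copy

    def __len__(self):
        return max(len(self._ids) - 1, 0)

    def __getitem__(self, i):
        # The i-th item is the (token, next_token) bigram.
        return (self._ids[i], self._ids[i + 1])


tokens = [5, 9, 9, 2]
view = BigramView(tokens)
assert list(view) == [(5, 9), (9, 9), (9, 2)]
```

The design point is that the view is derived lazily from data the cache already holds, so no per-lookup allocation is needed.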

2.2 Cache System Optimizations

| Commit Message | Summary | PR |
| --- | --- | --- |
| Opt-in strip of thinking tokens from radix cache (#23315) | Adds a --strip-thinking-cache flag that opts out of caching a reasoning model's thinking tokens, reducing cache footprint | #23315 |
| [Hybrid-Cache]: Refactor hybrid_pool_assembler.py (#23243) | Refactors the hybrid pool assembler and streamlines hybrid-cache assembly logic | #23243 |
| fix(hicache): emit KV events for L2 host cache insertions (#22894) | Fixes HiCache to emit KV events for L2 host cache insertions | #22894 |

2.3 Other Optimizations

| Commit Message | Summary | PR |
| --- | --- | --- |
| Support moe_dp_size = 1 for various attention_cp_size (#22003) | Supports setting moe_dp_size=1 under various attention context-parallel sizes | #22003 |

3. Bug Fixes

3.1 Core Functionality Fixes

| Commit Message | Summary | PR |
| --- | --- | --- |
| Fix: Add token heuristic increment in total_tokens load balancing (#22614) | Fixes a missing token-heuristic increment in total_tokens load balancing | #22614 |
| [HiCache]Fix hybrid model move_indices (#22940) | Fixes incorrect move_indices logic for hybrid models in HiCache | #22940 |
| [PD] Fix clip logic when state indices lens are mismatch (#23323) | Fixes the clip logic in PD disaggregation when state index lengths are mismatched | #23323 |
| Fix trtllm mla chunked-prefill zero-length bug (#22291) (#22688) | Fixes a zero-length bug in trtllm MLA chunked prefill | #22688 |
| [Fix] Solve the error lead by _commit_transfer_to_req() when using IntraNode NVLink in PD disaggregation (#23252) | Fixes an error caused by _commit_transfer_to_req() when using intra-node NVLink in PD disaggregation | #23252 |
| Fix hybrid swa chunked prefill oom (#23174) | Fixes an OOM triggered by hybrid SWA chunked prefill | #23174 |
| Fix segfault in cudaMemcpyBatchAsync on CUDA 13.0 (#23136) | Fixes a segfault in cudaMemcpyBatchAsync on CUDA 13.0 | #23136 |
| fix: reset empty prefill batch fullness (#23138) | Fixes the fullness reset of an empty prefill batch | #23138 |
| fix: add back priorty as radix cache policy (#23275) | Restores priority as an available radix cache policy | #23275 |

3.2 Speculative Decoding Fixes

| Commit Message | Summary | PR |
| --- | --- | --- |
| [sgl] fix incorrect behavior in cuda graph draft extend (#22832) | Fixes incorrect behavior in CUDA graph draft extend | #22832 |
| [sgl] multilayereagleworkerv2 fix (#22954) | Fixes an issue in MultiLayerEagleWorkerV2 | #22954 |
| [AMD] Resolve Qwen3.5 MTP (speculative decoding) radix cache conflict. (#22908) | Resolves a conflict between Qwen3.5 MTP speculative decoding and the radix cache | #22908 |

3.3 Platform-Specific Fixes

| Commit Message | Summary | PR |
| --- | --- | --- |
| [XPU] Fix DeepSeek-OCR tests under transformers 5.x (#23044) | Fixes the DeepSeek-OCR tests under transformers 5.x (Intel XPU) | #23044 |
| [ROCm] Uniform docker to support AMD AINIC, BRCM Thor2 IBGDA NIC for MoRI-EP (#23263) | Unifies the AMD ROCm Docker configuration to support AMD AINIC and BRCM Thor2 IBGDA NICs for MoRI-EP | #23263 |
| fix legacy deepep path for flashinfer_cutedsl (#22925) | Fixes the legacy deepep path for flashinfer_cutedsl | #22925 |

4. New Features and APIs

4.1 Score API

| Commit Message | Summary | PR |
| --- | --- | --- |
| [Score API] Add Multi-Item Scoring with pre-computed delimiter indices (#22544) | Adds Multi-Item Scoring to the Score API, using pre-computed delimiter indices to score multiple items; introduces the --enable-mis flag | #22544 |
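The idea behind pre-computed delimiter indices can be sketched as follows. This is a simplified illustration, not SGLang's actual code: the function name and the packing layout (query and items joined by a single delimiter token) are assumptions. With the delimiter positions recorded once at request-build time, each item's token span can be recovered without rescanning the packed sequence.

```python
def item_spans(seq_len, delim_indices):
    """Return half-open (start, end) spans between consecutive delimiters.

    The first span is the query; each following span is one item.
    Delimiter tokens themselves are excluded from every span.
    """
    bounds = list(delim_indices) + [seq_len]
    spans = []
    start = 0
    for b in bounds:
        spans.append((start, b))
        start = b + 1  # skip past the delimiter token
    return spans


# A 9-token packed sequence: query | item1 | item2,
# with delimiter tokens at positions 3 and 6.
assert item_spans(9, [3, 6]) == [(0, 3), (4, 6), (7, 9)]
```

Because the spans are derived from indices computed once up front, multiple items can be scored from one batched forward pass instead of one request per item.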

4.2 Speculative Decoding Enhancements

| Commit Message | Summary | PR |
| --- | --- | --- |
| [SPEC][1/N] feat: add adaptive speculative_num_steps for EAGLE topk=1 (#21599) | Adds adaptive speculative_num_steps for EAGLE topk=1, dynamically adjusting the number of draft steps; introduces the --speculative-adaptive and --speculative-adaptive-config flags | #21599 |
| [sgl] add support for weight update function in spedec (#22088) | Adds weight-update support to speculative decoding | #22088 |
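A minimal sketch of what an acceptance-rate-driven controller for speculative_num_steps might look like. The function name, thresholds, and bounds here are illustrative assumptions, not the PR's actual logic; the sketch only conveys the feedback loop: draft deeper when recent drafts are mostly accepted, back off when they are mostly rejected.

```python
def adapt_num_steps(num_steps, accept_rate,
                    low=0.4, high=0.8, min_steps=1, max_steps=8):
    """Adjust the draft depth based on the recent draft acceptance rate."""
    if accept_rate >= high:
        num_steps += 1   # drafts are paying off, go deeper
    elif accept_rate <= low:
        num_steps -= 1   # too many wasted drafts, back off
    return max(min_steps, min(num_steps, max_steps))


assert adapt_num_steps(4, 0.9) == 5   # high acceptance: deepen
assert adapt_num_steps(4, 0.2) == 3   # low acceptance: shrink
assert adapt_num_steps(8, 0.95) == 8  # clamped at the upper bound
```

The appeal of such a controller is that draft depth stops being a fixed launch-time tuning knob and instead tracks the workload at runtime.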

5. New Parameters in server_args.py

The following parameters were added to server_args.py yesterday:

| Parameter | Type | Description | PR |
| --- | --- | --- | --- |
| --strip-thinking-cache | store_true | Skips caching a reasoning model's output (thinking + answer) and keeps only the prompt prefix, reducing cache footprint | #23315 |
| --enable-mis | store_true | Enables the Multi-Item Scoring optimization, packing the query and multiple items into a single sequence for efficient batching | #22544 |
| --speculative-adaptive | store_true | Enables adaptive speculative decoding, dynamically adjusting num_steps based on the acceptance rate | #21599 |
| --speculative-adaptive-config | str | Path to a JSON config file for adaptive speculative decoding | #21599 |
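For reference, flags of this shape are typically declared as in the argparse sketch below. This is illustrative only: SGLang's server_args.py has its own parser, defaults, and help text, so the declarations here are assumptions about the general pattern rather than the actual source.

```python
import argparse

# Hypothetical sketch of how store_true / str flags like these
# are commonly registered; not copied from server_args.py.
parser = argparse.ArgumentParser()
parser.add_argument("--strip-thinking-cache", action="store_true",
                    help="Do not cache thinking tokens in the radix cache.")
parser.add_argument("--enable-mis", action="store_true",
                    help="Enable Multi-Item Scoring.")
parser.add_argument("--speculative-adaptive", action="store_true",
                    help="Adapt speculative num_steps at runtime.")
parser.add_argument("--speculative-adaptive-config", type=str, default=None,
                    help="Path to a JSON config for adaptive spec decoding.")

args = parser.parse_args(["--enable-mis", "--speculative-adaptive"])
assert args.enable_mis and args.speculative_adaptive
assert not args.strip_thinking_cache
```

Note that argparse converts the dashes to underscores, so --strip-thinking-cache becomes args.strip_thinking_cache.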

Other server_args.py changes:

| Commit Message | Change | PR |
| --- | --- | --- |
| [AMD] Resolve Qwen3.5 MTP (speculative decoding) radix cache conflict. (#22908) | Resolves a compatibility issue between Qwen3.5 MTP and the radix cache | #22908 |
| Support moe_dp_size = 1 for various attention_cp_size (#22003) | Adjusts parameter validation to allow moe_dp_size=1 | #22003 |
| fix: add back priorty as radix cache policy (#23275) | Restores priority as a radix cache eviction policy | #23275 |
| Multiple commits | Improves automatic attention-backend selection: falls back from FlashInfer to Triton when attention sinks are unsupported | - |

6. Environment Variables

Yesterday's commits only touched environment variables at the documentation level (via the docs_new site updates); no code commits introduced new environment variables.


7. CI/CD and Infrastructure

| Commit Message | Summary | PR |
| --- | --- | --- |
| [AMD] prepare for MI300x PR runner pool: registry mirror, runner routing, threshold tuning (#23156) | Prepares the MI300x PR runner pool: registry mirror, runner routing, and threshold tuning | #23156 |
| [AMD] CI - Fix the cancelled guard to AMD CI (#23338) | Fixes the cancelled-guard logic in AMD CI | #23338 |
| [misc] CI hygiene: enforce __main__ entry, drop silent-skipped tests, fix rerun-test protoc (#23305) | CI hygiene: enforces a __main__ entry point, drops silently skipped tests, and fixes the rerun-test protoc | #23305 |
| [CI][MLA] Enable deterministic inference for MGSM MLA FP8 test (#23303) | Enables deterministic inference for the MGSM MLA FP8 test | #23303 |
| [CI][LoRA] Drop flaky all-None batch from multi-LoRA parity test (#23287) | Drops the flaky all-None batch from the multi-LoRA parity test | #23287 |
| ci: reduce scheduled PR test from 4x to 3x daily (#23313) | Reduces the scheduled PR test from four runs per day to three | #23313 |
| [docker] Fix stray backslash dropping sgl-model-gateway COPY (#23097) | Fixes a stray backslash in the Dockerfile that dropped the sgl-model-gateway COPY | #23097 |
| [Docker] Move Rust toolchain install to torch_deps stage (#23278) | Moves the Rust toolchain install to the torch_deps stage | #23278 |
| [CI] Fix nightly docker builds failing on root-owned workspace leftovers (#23279) | Fixes nightly Docker builds failing on root-owned workspace leftovers | #23279 |
| [CI] Fix wait-for-jobs hanging when matrix job skipped at job level (#23277) | Fixes wait-for-jobs hanging when a matrix job is skipped at the job level | #23277 |
| [release] install rust toolchain in main dockerfile (#23014) | Installs the Rust toolchain in the main Dockerfile | #23014 |
| [Diffusion][NPU][CI] update perf numbers (#23056) | Updates the perf numbers for the Diffusion NPU CI | #23056 |

8. Code Refactoring

| Commit Message | Summary | PR |
| --- | --- | --- |
| [Refactor] Move radix-cache utils onto RadixKey as methods (#23209) | Moves radix-cache utility functions onto RadixKey as methods, improving code organization | #23209 |
| [Refactor] Replace page_align_keys helper with RadixKey.page_aligned method (#23107) | Replaces the page_align_keys helper with the RadixKey.page_aligned method | #23107 |
| [CPU] expand the interface of shared_expert without scaling factor (#22933) | Extends the CPU shared_expert interface to support use without a scaling factor | #22933 |

9. Documentation and Miscellaneous

| Commit Message | Summary | PR |
| --- | --- | --- |
| [Docs] Update installation and TPU documentation to fix the render problem (#23344) | Updates the installation and TPU docs to fix a rendering problem | #23344 |
| docs: redirect /cookbook to /cookbook/intro (#23348) | Adds a redirect from /cookbook to /cookbook/intro | #23348 |
| [Docs] Sync docs_new with legacy docs and update migration redirects (#23337) | Syncs docs_new with the legacy docs and updates the migration redirects | #23337 |
| Docs/url redirect (#23312) | Configures documentation URL redirects | #23312 |
| Fix formatting for ACM-VIT in README acknowledgements section (#23325) | Fixes ACM-VIT formatting in the README acknowledgements section | #23325 |
| [NPU] [DOC] Quick start doc for Ascend NPU (#23238) | Adds a quick-start doc for Ascend NPU | #23238 |
| Update CODEOWNERS to include new documentation paths for docs and doc… (#23293) | Updates CODEOWNERS to cover the new documentation paths | #23293 |

10. Summary

Yesterday's 51 commits fall into the following areas:

  1. New model support: Kimi-K2.5-MXFP4 (AMD optimizations), GLM-5-mxfp4 (MTP support), and CPU platform support for SGLang Diffusion
  2. Performance: KDA kernel fusion, LTX2 tensor-parallel optimization, the O(1) EAGLE bigram key, and cache-system refactoring
  3. Bug fixes: roughly 15 fixes covering load balancing, HiCache, PD disaggregation, trtllm MLA, OOM issues, and CUDA 13.0 compatibility
  4. New features: the Multi-Item Scoring API, adaptive speculative decoding, and speculative-decoding weight updates
  5. New server_args.py parameters: --strip-thinking-cache, --enable-mis, --speculative-adaptive, and --speculative-adaptive-config
  6. CI/CD: MI300x runner pool preparation, CI hygiene improvements, and Docker build fixes