Reporting window: 2026-04-17 00:00 to 24:00 (UTC+8)
Branch: main
Total commits: 47


I. Overview of Key Changes

New Model Support

Commit Message | Summary | PR Link
[CPU] Add gemma4_rmsnorm_cpu kernel (#22842) | Adds an RMSNorm kernel for Gemma-4 on the CPU backend | PR #22842
[CI] Adding Gemma 4 to Nightly CI (#22408) | Adds Gemma-4 to the nightly CI test suite | PR #22408
feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143) | Supports MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (adds the petit_mxfp4 quantization option; later reverted) | PR #19143
Revert "feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143)" (#23031) | Reverts the MXFP4 AMD support above (likely due to compatibility issues) | PR #23031

Performance Optimizations

Commit Message | Summary | PR Link
feat: add coordinated checkpoint prefetch for network filesystem loading (#20843) | Adds coordinated checkpoint prefetching; on network filesystems (NFS/Lustre), it reduces redundant checkpoint reads from N (one per rank) to a single read, significantly speeding up multi-GPU model loading | PR #20843
[VLM] Reduce GPU memory footprint of CUDA IPC MM feature transport (#22662) | Lowers the default CUDA IPC cache for VLM multimodal features from 4 GB to 1 GB, reducing GPU memory footprint | PR #22662
[sgl] improve accuracy of additional page requirement during spec decode (#22406) | Improves the accuracy of the additional-page estimate during speculative decoding, reducing wasted memory | PR #22406
Allow piecewise CUDA graph with speculative decoding (#22128) | Allows piecewise CUDA graph to be used together with speculative decoding; the two were previously mutually exclusive and can now be combined for additional performance gains | PR #22128
[diffusion] feat: disaggregated diffusion (#21701) | Implements a disaggregated deployment architecture for diffusion models with prefill/decode separation, improving diffusion inference throughput | PR #21701
feat(observability): add OpenTelemetry tracing for speculative decoding (#19545) | Adds OpenTelemetry tracing for speculative decoding, aiding performance analysis and observability | PR #19545

Bug Fixes

Commit Message | Summary | PR Link
fix: normalize tool message content for GLM5.1 chat template (#22595) | Fixes normalization of tool message content in the GLM-5.1 chat template | PR #22595
Fix for the low-probability garbled output issue in the GLM-5 series models. (#22811) | Fixes a low-probability garbled-output issue in the GLM-5 series models | PR #22811
[Pipeline Parallelism][Bug] Fix scheduler hang in pipeline parallelism setup (#23006) | Fixes a scheduler hang during pipeline-parallelism setup | PR #23006
[Bug Fix] Ensure prefill_info_table is populated before honoring disagg_prefill_dp_rank (#22990) | Fixes an error in disaggregated prefill where disagg_prefill_dp_rank was used before prefill_info_table was populated | PR #22990
[AMD] Qwen3.5 MXFP4 breaks after shared expert fusion is enabled (#22948) | Fixes a crash of Qwen3.5 MXFP4 on AMD when shared expert fusion is enabled | PR #22948
[Bugfix] [NPU] Fix check_env on Ascend for CANN 8.5 (#22888) | Fixes the check_env detection logic on Ascend NPUs with CANN 8.5 | PR #22888
[CPU][sgl-kernel] extend_attention_cpu and flash_attn_varlen_func: fix nan for large seq (#22434) | Fixes NaN outputs from extend_attention and flash_attn on the CPU backend for long sequences | PR #22434
add check for none status code in FinishAbort (#22535) | Adds a check for a None status code in FinishAbort to prevent exceptions | PR #22535
fix(loads): switch get_loads_communicator to watching mode (#22919) | Switches the loads communicator to watching mode, improving communication behavior | PR #22919
fix(loads): preserve include filtering after watching mode switch (#22959) | Fixes include filtering being lost after the switch to watching mode | PR #22959

New Parameters in server_args.py

Commit Message | Summary | PR Link
feat: add coordinated checkpoint prefetch for network filesystem loading (#20843) | Adds two parameters: --weight-loader-prefetch-checkpoints (enables checkpoint prefetching) and --weight-loader-prefetch-num-threads (number of prefetch threads, default 4) | PR #20843
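As a quick illustration, a hypothetical launch sketch combining the two new flags from PR #20843. The launch module and model path are placeholders, not taken from this digest; only the two --weight-loader-prefetch-* flags are documented above:

```shell
# Hypothetical launch sketch; the module and model path are placeholders.
# The two prefetch flags below are the new server_args from PR #20843.
python -m sglang.launch_server \
  --model-path /path/to/checkpoint-dir \
  --weight-loader-prefetch-checkpoints \
  --weight-loader-prefetch-num-threads 4
```

The related SGLANG_PREFETCH_BLOCK_SIZE_MB environment variable (default 16 MB, also from PR #20843) sets the prefetch block size.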

New Environment Variables

Environment Variable | Type | Default | Description | Source PR
SGLANG_PREFETCH_BLOCK_SIZE_MB | Int | 16 | Block size (MB) for checkpoint prefetching | PR #20843
SGLANG_USE_AITER_UNIFIED_ATTN | Bool | False | Enables the AITER unified attention backend (AMD/ROCm) | PR #22994
SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Int | 4096 | Maximum number of dispatch tokens per rank for MORI-EP | PR #22994
SGLANG_MORI_MOE_MAX_INPUT_TOKENS | Int | 0 | Truncates the dispatch buffer to this many rows before MoE computation, reducing kernel overhead from padding tokens; 0 disables truncation | PR #22952
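For reference, a minimal sketch of how these variables could be read with their documented defaults. The helper functions below are illustrative only, not SGLang's actual envs module; the names, types, and defaults come from the table above:

```python
import os

def get_int_env(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to a default."""
    return int(os.environ.get(name, default))

def get_bool_env(name: str, default: bool) -> bool:
    """Read a boolean environment variable ('1'/'true'/'yes' style strings)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")

# Defaults as listed in the table above (the parsing helpers are
# hypothetical, not SGLang's real implementation):
prefetch_block_mb = get_int_env("SGLANG_PREFETCH_BLOCK_SIZE_MB", 16)
use_aiter_unified = get_bool_env("SGLANG_USE_AITER_UNIFIED_ATTN", False)
max_dispatch_tokens = get_int_env("SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK", 4096)
moe_max_input_tokens = get_int_env("SGLANG_MORI_MOE_MAX_INPUT_TOKENS", 0)  # 0 = truncation disabled
```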

II. Commit Details by Module

1. Models and Quantization

Commit Message | Summary | PR Link
feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143) | Adds MXFP4 quantization support for AMD CDNA2/CDNA3 with a new petit_mxfp4 quantization backend | PR #19143
Revert "feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143)" (#23031) | Reverts the MXFP4 AMD support above (likely due to compatibility issues) | PR #23031
[CPU] Add gemma4_rmsnorm_cpu kernel (#22842) | Adds a Gemma-4 RMSNorm kernel for the CPU backend | PR #22842
[AMD] Qwen3.5 MXFP4 breaks after shared expert fusion is enabled (#22948) | Fixes the Qwen3.5 MXFP4 issue on AMD when shared expert fusion is enabled | PR #22948
[CPU][sgl-kernel] extend_attention_cpu and flash_attn_varlen_func: fix nan for large seq (#22434) | Fixes NaN in CPU attention for long sequences | PR #22434
[misc] refine outdated comments for chain-style multi-layer MTP (#22996) | Updates outdated comments for chain-style multi-layer MTP models | PR #22996

2. Performance and Inference Engine

Commit Message | Summary | PR Link
feat: add coordinated checkpoint prefetch for network filesystem loading (#20843) | Coordinated checkpoint prefetching, greatly reducing multi-GPU model-loading I/O on network filesystems | PR #20843
[VLM] Reduce GPU memory footprint of CUDA IPC MM feature transport (#22662) | Cuts the VLM CUDA IPC cache from 4 GB to 1 GB, reducing GPU memory usage | PR #22662
[sgl] improve accuracy of additional page requirement during spec decode (#22406) | Improves the accuracy of page-requirement estimation during spec decode | PR #22406
Allow piecewise CUDA graph with speculative decoding (#22128) | Removes the mutual exclusion between piecewise CUDA graph and speculative decoding | PR #22128
[diffusion] feat: disaggregated diffusion (#21701) | Implements disaggregated deployment for diffusion models (prefill/decode separation) | PR #21701
feat(observability): add OpenTelemetry tracing for speculative decoding (#19545) | Adds OpenTelemetry tracing for spec decode | PR #19545
[diffusion] refactor: extract LTX2 image encoding from denoising stage (#22976) | Refactor: extracts LTX2 image encoding out of the denoising stage | PR #22976
refactor: extract FanOutCommunicator and use declarative spec table (#22967) | Refactor: extracts FanOutCommunicator and uses a declarative spec table | PR #22967

3. Bug Fixes

Commit Message | Summary | PR Link
fix: normalize tool message content for GLM5.1 chat template (#22595) | Fixes GLM-5.1 tool message content normalization | PR #22595
Fix for the low-probability garbled output issue in the GLM-5 series models. (#22811) | Fixes low-probability garbled output in the GLM-5 series | PR #22811
[Pipeline Parallelism][Bug] Fix scheduler hang in pipeline parallelism setup (#23006) | Fixes a scheduler hang during pipeline-parallelism setup | PR #23006
[Bug Fix] Ensure prefill_info_table is populated before honoring disagg_prefill_dp_rank (#22990) | Fixes the population order of prefill_info_table in disaggregated prefill | PR #22990
[Bugfix] [NPU] Fix check_env on Ascend for CANN 8.5 (#22888) | Fixes the Ascend CANN 8.5 environment check | PR #22888
add check for none status code in FinishAbort (#22535) | Adds a None status-code check to FinishAbort | PR #22535
fix(loads): switch get_loads_communicator to watching mode (#22919) | Switches the loads communicator to watching mode | PR #22919
fix(loads): preserve include filtering after watching mode switch (#22959) | Fixes include filtering being lost after the watching-mode switch | PR #22959
[Diffusion] [NPU] Fix multimodal gen CI (#22879) | Fixes the NPU multimodal generation CI | PR #22879
test: fix flaky required function calling assertion (#22890) | Fixes a flaky assertion in a function-calling test | PR #22890

4. Ray and Distributed

Commit Message | Summary | PR Link
[Ray] Support multi-replica serving by making scheduler actor names unique (#22917) | Supports multi-replica serving by making scheduler actor names unique | PR #22917
[Ray] Bind scheduler actors to GPU-local NUMA node (#22989) | Binds scheduler actors to the GPU-local NUMA node, improving memory-access performance | PR #22989
use envs in server_args (#22994) | Migrates environment-variable access in server_args to the envs module; adds definitions for the SGLANG_USE_AITER_UNIFIED_ATTN and SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK environment variables | PR #22994

5. Diffusion Models

Commit Message | Summary | PR Link
[diffusion] feat: disaggregated diffusion (#21701) | Disaggregated architecture for diffusion models, supporting independently deployed denoise and encode nodes | PR #21701
[diffusion] refactor: extract LTX2 image encoding from denoising stage (#22976) | Decouples LTX2 image-encoding logic from the denoising stage | PR #22976
[codex] Update diffusion skills (#23028) | Updates the diffusion-related Claude Code skills docs | PR #23028

6. HiCache and Memory Management

Commit Message | Summary | PR Link
[UnifiedRadixTree]: Add HiCache hook interface for TreeComponent (#22924) | Adds a HiCache hook interface for TreeComponent | PR #22924
[HiSparse]: Adding e2e ut for hisparse (#22979) | Adds end-to-end unit tests for HiSparse | PR #22979

7. CI and Testing

Commit Message | Summary | PR Link
[CI] Adding Gemma 4 to Nightly CI (#22408) | Adds Gemma-4 to the nightly CI test suite | PR #22408
test(4-gpu-b200): split test_qwen35_models.py + bump partitions 5→6 (#22913) | Splits the Qwen3.5 test file and bumps the partition count from 5 to 6 | PR #22913
migrate CPU-only unit tests from openai_server to unit/ (#22965) | Moves CPU-only unit tests from openai_server to the unit/ directory | PR #22965
[AMD] CI Job Monitor: fix queue time, utilization, and summary metrics (#22274) | Fixes the queue-time, utilization, and summary metrics in the AMD CI job monitor | PR #22274
ci: install rust toolchain in ci_install_dependency.sh (#23017) | Adds Rust toolchain installation to the CI dependency-install script | PR #23017
CI: fix lint (#22991) | Fixes CI lint issues | PR #22991

8. Documentation

Commit Message | Summary | PR Link
[NPU] [DOC] Update npu best practice docs to match latest code (#22975) | Updates the NPU best-practice docs to match the latest code | PR #22975
[Docs] fix profiling endpoint (#22982) | Fixes the profiling endpoint documentation | PR #22982
[Doc] correct the HTTP endpoint for stopping profiling in benchmark_and_profiling.md (#22523) | Corrects the HTTP endpoint for stopping profiling in benchmark_and_profiling.md | PR #22523
[Docs] [npu] change the feature support status (#23041) | Updates the NPU feature-support status table | PR #23041
[PD]feat(bench): add --fake-prefill flag for decode-only stress testing (#22973) | Adds a --fake-prefill flag (and docs) for decode-only stress testing | PR #22973

9. Miscellaneous

Commit Message | Summary | PR Link
[misc] update .github/CODEOWNERS (#22993) | Updates the CODEOWNERS file | PR #22993