Time window: 2026-04-17 00:00 – 24:00 (UTC+8)
Branch: main
Total commits: 47
I. Key Changes Overview
New Model Support

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [CPU] Add gemma4_rmsnorm_cpu kernel (#22842) | Adds an RMSNorm kernel for the Gemma-4 model to the CPU backend | PR #22842 |
| [CI] Adding Gemma 4 to Nightly CI (#22408) | Adds the Gemma-4 model to the nightly CI tests | PR #22408 |
| feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143) | Supports MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (adds the petit_mxfp4 quantization option; later reverted) | PR #19143 |
| Revert "feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143)" (#23031) | Reverts the MXFP4 AMD support above (likely due to compatibility issues) | PR #23031 |
Performance Optimizations

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| feat: add coordinated checkpoint prefetch for network filesystem loading (#20843) | Adds coordinated checkpoint prefetching that cuts checkpoint I/O on network filesystems (NFS/Lustre) from N× to 1×, significantly speeding up multi-GPU model loading | PR #20843 |
| [VLM] Reduce GPU memory footprint of CUDA IPC MM feature transport (#22662) | Lowers the default CUDA IPC cache for VLM multimodal features from 4 GB to 1 GB, reducing GPU memory usage | PR #22662 |
| [sgl] improve accuracy of additional page requirement during spec decode (#22406) | Improves the accuracy of the additional-page-requirement estimate during speculative decoding, reducing wasted memory | PR #22406 |
| Allow piecewise CUDA graph with speculative decoding (#22128) | Allows piecewise CUDA graph and speculative decoding to be used together; previously mutually exclusive, they can now be combined for additional performance gains | PR #22128 |
| [diffusion] feat: disaggregated diffusion (#21701) | Implements a disaggregated deployment architecture for diffusion models with prefill/decode separation, improving diffusion inference throughput | PR #21701 |
| feat(observability): add OpenTelemetry tracing for speculative decoding (#19545) | Adds OpenTelemetry tracing for speculative decoding, aiding performance analysis and observability | PR #19545 |
Bug Fixes

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| fix: normalize tool message content for GLM5.1 chat template (#22595) | Fixes normalization of tool message content in the GLM-5.1 chat template | PR #22595 |
| Fix for the low-probability garbled output issue in the GLM-5 series models. (#22811) | Fixes a low-probability garbled-output issue in GLM-5 series models | PR #22811 |
| [Pipeline Parallelism][Bug] Fix scheduler hang in pipeline parallelism setup (#23006) | Fixes a scheduler hang during pipeline-parallelism setup | PR #23006 |
| [Bug Fix] Ensure prefill_info_table is populated before honoring disagg_prefill_dp_rank (#22990) | Fixes an error in disaggregated prefill where dp_rank was honored before prefill_info_table was populated | PR #22990 |
| [AMD] Qwen3.5 MXFP4 breaks after shared expert fusion is enabled (#22948) | Fixes a crash with Qwen3.5 MXFP4 on AMD when shared expert fusion is enabled | PR #22948 |
| [Bugfix] [NPU] Fix check_env on Ascend for CANN 8.5 (#22888) | Fixes the check_env detection logic on Ascend NPU under CANN 8.5 | PR #22888 |
| [CPU][sgl-kernel] extend_attention_cpu and flash_attn_varlen_func: fix nan for large seq (#22434) | Fixes NaN outputs from extend_attention and flash_attn on the CPU backend for long sequences | PR #22434 |
| add check for none status code in FinishAbort (#22535) | Adds a check for a None status code in FinishAbort to prevent exceptions | PR #22535 |
| fix(loads): switch get_loads_communicator to watching mode (#22919) | Switches the loads communicator to watching mode, improving communication behavior | PR #22919 |
| fix(loads): preserve include filtering after watching mode switch (#22959) | Fixes include filtering being lost after the switch to watching mode | PR #22959 |
New Parameters in server_args.py

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| feat: add coordinated checkpoint prefetch for network filesystem loading (#20843) | Adds two parameters: --weight-loader-prefetch-checkpoints (enables checkpoint prefetching) and --weight-loader-prefetch-num-threads (number of prefetch threads, default 4) | PR #20843 |
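As an illustration, the two new flags might be combined in a launch command like the sketch below; the entrypoint, model path, and --tp value are placeholder assumptions, not taken from the PR.

```shell
# Sketch only: using the new checkpoint-prefetch flags from PR #20843.
# The entrypoint and model path are hypothetical placeholders.
CMD="python -m sglang.launch_server \
  --model-path /nfs/models/example-model \
  --tp 8 \
  --weight-loader-prefetch-checkpoints \
  --weight-loader-prefetch-num-threads 8"
# Print the assembled command instead of executing it.
echo "$CMD"
```

Raising the thread count above the default of 4 mainly helps when the network filesystem sustains many concurrent readers.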
New Environment Variables

| Environment Variable | Type | Default | Description | Source PR |
| --- | --- | --- | --- | --- |
| SGLANG_PREFETCH_BLOCK_SIZE_MB | Int | 16 | Block size (MB) for checkpoint prefetching | PR #20843 |
| SGLANG_USE_AITER_UNIFIED_ATTN | Bool | False | Enables the AITER unified attention backend (AMD/ROCm) | PR #22994 |
| SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK | Int | 4096 | Maximum dispatch tokens per rank for MORI-EP | PR #22994 |
| SGLANG_MORI_MOE_MAX_INPUT_TOKENS | Int | 0 | Truncates the dispatch buffer's rows before MoE computation, reducing kernel overhead from padding tokens; 0 disables truncation | PR #22952 |
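For reference, the variables above could be exported before launch as in this sketch; the values shown are the documented defaults, and the lowercase boolean spelling is an assumption about how the envs module parses it.

```shell
# Sketch: the new environment variables set to their documented defaults.
export SGLANG_PREFETCH_BLOCK_SIZE_MB=16                    # checkpoint prefetch block size in MB
export SGLANG_USE_AITER_UNIFIED_ATTN=false                 # AITER unified attention backend (AMD/ROCm)
export SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=4096   # MORI-EP dispatch-token cap per rank
export SGLANG_MORI_MOE_MAX_INPUT_TOKENS=0                  # 0 disables dispatch-buffer truncation
# Show what the server process would inherit.
env | grep '^SGLANG_' | sort
```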
II. Commit Details by Module
1. Models & Quantization

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143) | Adds MXFP4 quantization support for AMD CDNA2/CDNA3, including the new petit_mxfp4 quantization backend | PR #19143 |
| Revert "feat: Support MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs (#19143)" (#23031) | Reverts the MXFP4 AMD support above (likely due to compatibility issues) | PR #23031 |
| [CPU] Add gemma4_rmsnorm_cpu kernel (#22842) | Adds a Gemma-4 RMSNorm kernel for the CPU backend | PR #22842 |
| [AMD] Qwen3.5 MXFP4 breaks after shared expert fusion is enabled (#22948) | Fixes Qwen3.5 MXFP4 on AMD when shared expert fusion is enabled | PR #22948 |
| [CPU][sgl-kernel] extend_attention_cpu and flash_attn_varlen_func: fix nan for large seq (#22434) | Fixes attention NaNs on CPU for long sequences | PR #22434 |
| [misc] refine outdated comments for chain-style multi-layer MTP (#22996) | Refreshes outdated comments for chain-style multi-layer MTP models | PR #22996 |
2. Performance & Inference Engine

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| feat: add coordinated checkpoint prefetch for network filesystem loading (#20843) | Coordinated checkpoint prefetching, greatly reducing multi-GPU model-loading I/O on network filesystems | PR #20843 |
| [VLM] Reduce GPU memory footprint of CUDA IPC MM feature transport (#22662) | Shrinks the VLM CUDA IPC cache from 4 GB to 1 GB, cutting GPU memory usage | PR #22662 |
| [sgl] improve accuracy of additional page requirement during spec decode (#22406) | Improves the accuracy of page-requirement estimation during spec decode | PR #22406 |
| Allow piecewise CUDA graph with speculative decoding (#22128) | Removes the mutual exclusion between piecewise CUDA graph and speculative decoding | PR #22128 |
| [diffusion] feat: disaggregated diffusion (#21701) | Implements disaggregated deployment for diffusion models (prefill/decode separation) | PR #21701 |
| feat(observability): add OpenTelemetry tracing for speculative decoding (#19545) | Adds OpenTelemetry tracing for spec decode | PR #19545 |
| [diffusion] refactor: extract LTX2 image encoding from denoising stage (#22976) | Refactor: extracts LTX2 image encoding from the denoising stage | PR #22976 |
| refactor: extract FanOutCommunicator and use declarative spec table (#22967) | Refactor: extracts FanOutCommunicator and uses a declarative spec table | PR #22967 |
3. Bug Fixes

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| fix: normalize tool message content for GLM5.1 chat template (#22595) | Fixes tool-message content normalization for GLM-5.1 | PR #22595 |
| Fix for the low-probability garbled output issue in the GLM-5 series models. (#22811) | Fixes low-probability garbled output in the GLM-5 series | PR #22811 |
| [Pipeline Parallelism][Bug] Fix scheduler hang in pipeline parallelism setup (#23006) | Fixes a scheduler hang during pipeline-parallelism setup | PR #23006 |
| [Bug Fix] Ensure prefill_info_table is populated before honoring disagg_prefill_dp_rank (#22990) | Fixes the prefill_info_table readiness ordering in disaggregated prefill | PR #22990 |
| [Bugfix] [NPU] Fix check_env on Ascend for CANN 8.5 (#22888) | Fixes the environment check on Ascend under CANN 8.5 | PR #22888 |
| add check for none status code in FinishAbort (#22535) | Adds a None status-code check in FinishAbort | PR #22535 |
| fix(loads): switch get_loads_communicator to watching mode (#22919) | Switches the loads communicator to watching mode | PR #22919 |
| fix(loads): preserve include filtering after watching mode switch (#22959) | Fixes include filtering being lost after the watching-mode switch | PR #22959 |
| [Diffusion] [NPU] Fix multimodal gen CI (#22879) | Fixes the NPU multimodal generation CI | PR #22879 |
| test: fix flaky required function calling assertion (#22890) | Fixes a flaky assertion in the function-calling tests | PR #22890 |
4. Ray & Distributed

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [Ray] Support multi-replica serving by making scheduler actor names unique (#22917) | Supports multi-replica serving by making scheduler actor names unique | PR #22917 |
| [Ray] Bind scheduler actors to GPU-local NUMA node (#22989) | Binds scheduler actors to the GPU-local NUMA node, improving memory-access performance | PR #22989 |
| use envs in server_args (#22994) | Migrates environment-variable access in server_args to the envs module, adding definitions for SGLANG_USE_AITER_UNIFIED_ATTN and SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK | PR #22994 |
5. Diffusion Models

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [diffusion] feat: disaggregated diffusion (#21701) | Disaggregated architecture for diffusion models, supporting independently deployed denoise and encode nodes | PR #21701 |
| [diffusion] refactor: extract LTX2 image encoding from denoising stage (#22976) | Decouples the LTX2 image-encoding logic from the denoising stage | PR #22976 |
| [codex] Update diffusion skills (#23028) | Updates the diffusion-related Claude Code skills docs | PR #23028 |
6. HiCache & Memory Management

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [UnifiedRadixTree]: Add HiCache hook interface for TreeComponent (#22924) | Adds a HiCache hook interface for TreeComponent | PR #22924 |
| [HiSparse]: Adding e2e ut for hisparse (#22979) | Adds HiSparse end-to-end unit tests | PR #22979 |
7. CI & Testing

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [CI] Adding Gemma 4 to Nightly CI (#22408) | Adds Gemma-4 to the nightly CI tests | PR #22408 |
| test(4-gpu-b200): split test_qwen35_models.py + bump partitions 5→6 (#22913) | Splits the Qwen3.5 test file and bumps the partition count from 5 to 6 | PR #22913 |
| migrate CPU-only unit tests from openai_server to unit/ (#22965) | Migrates CPU-only unit tests from openai_server to the unit/ directory | PR #22965 |
| [AMD] CI Job Monitor: fix queue time, utilization, and summary metrics (#22274) | Fixes queue-time, utilization, and summary metrics in the AMD CI job monitor | PR #22274 |
| ci: install rust toolchain in ci_install_dependency.sh (#23017) | Adds Rust toolchain installation to the CI dependency-install script | PR #23017 |
| CI: fix lint (#22991) | Fixes CI lint issues | PR #22991 |
8. Documentation

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [NPU] [DOC] Update npu best practice docs to match latest code (#22975) | Updates the NPU best-practice docs to match the latest code | PR #22975 |
| [Docs] fix profiling endpoint (#22982) | Fixes the profiling endpoint documentation | PR #22982 |
| [Doc] correct the HTTP endpoint for stopping profiling in benchmark_and_profiling.md (#22523) | Corrects the HTTP endpoint for stopping profiling in benchmark_and_profiling.md | PR #22523 |
| [Docs] [npu] change the feature support status (#23041) | Updates the NPU feature-support status table | PR #23041 |
| [PD]feat(bench): add --fake-prefill flag for decode-only stress testing (#22973) | Adds a --fake-prefill flag (with docs) for decode-only stress testing | PR #22973 |
9. Miscellaneous

| Commit Message | Summary | PR Link |
| --- | --- | --- |
| [misc] update .github/CODEOWNERS (#22993) | Updates the CODEOWNERS file | PR #22993 |