SGLang Main Branch Commit Summary (UTC+8 2026-04-14)

Time range: UTC+8 2026-04-14 00:00 to 2026-04-14 23:59 (UTC 2026-04-13 16:00 to 2026-04-14 15:59)

46 commits in total.

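1. New Model and Feature Support

This period added support for several models and features, including diffusion models, quantization formats, and storage backends.

Commit Message	Summary	PR Link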
`[diffusion] model: support Ltx 2.3 two stage ti2v (#22667)`	Adds two-stage TI2V (text-and-image-to-video) generation support for LTX 2.3 to the diffusion models	PR #22667
`[diffusion] quant: add FLUX.1-dev modelopt nvfp4 support (#22672)`	Adds ModelOpt NVFP4 quantization support for the FLUX.1-dev model	PR #22672
`[HiCache & HybridModel] mooncake backend support DSA & mamba model (#21259)`	The HiCache Mooncake storage backend adds support for DSA and Mamba models	PR #21259
`[Feature] Add SiMM as sglang HiCache Storage backend (#18016)`	Adds SiMM as a storage backend option for HiCache	PR #18016
`[lora][moe] Virtual experts for LoRA MoE (#22122)`	Adds virtual experts for LoRA MoE	PR #22122
`hicache storage backend mooncake support ascend hixl (#20016)`	The Mooncake HiCache storage backend adds support for Ascend HiXL	PR #20016
`feat(metrics): expose raw KV cache pool token counts as prometheus gauges (#22726)`	Exposes raw KV cache pool token counts as Prometheus gauges	PR #22726
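
As a rough illustration of what exposing pool token counts as gauges looks like on the wire, the sketch below renders two values in the Prometheus text exposition format using only the standard library. The metric names, help strings, and values are hypothetical and are not taken from PR #22726.

```python
def render_gauges(gauges):
    """Render {name: (help_text, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(gauges.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical pool counters and metric names, for illustration only.
pool = {"used": 1200, "total": 4096}
text = render_gauges({
    "example_kv_pool_used_tokens": ("Tokens currently held in the KV cache pool", pool["used"]),
    "example_kv_pool_total_tokens": ("Token capacity of the KV cache pool", pool["total"]),
})
print(text)
```

Raw counts leave derived quantities such as utilization to the query side, for example by dividing the used gauge by the total gauge.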

2. Performance Optimizations

This period brought performance optimizations to EPLB routing, DP attention, FP8 models, the PCG path, and embedding mode.

Commit Message	Summary	PR Link
`[sgl] perf optimization for eplb (#21232)`	Performance optimization for EPLB (expert-parallel load balancing), improving the load-balancing algorithm and dispatch logic	PR #21232
`Replace all-reduce + dp_scatter with reduce_scatterv for DP attention (#22642)`	Replaces all-reduce + dp_scatter in DP attention with the more efficient reduce_scatterv collective	PR #22642
`perf: optimize PCG inductor path for FP8 models (#21734)`	Optimizes the Piecewise CUDA Graph (PCG) inductor path for FP8 models	PR #21734
`perf: skip KV cache in FA backend for embedding mode (#21971)`	Skips KV cache operations in the FlashAttention backend in embedding mode, removing unnecessary overhead	PR #21971
`Use reshape instead of contiguous().view() in TRTLLMHAAttnBackend (#22517)`	Uses reshape instead of contiguous().view() in the TRTLLM MHA backend, avoiding a memory copy when the strides already allow a view	PR #22517
`Clean up TokenizerManager and req_time_stats: reduce overhead and simplify (#21646)`	Cleans up TokenizerManager and the request time statistics logic, lowering runtime overhead	PR #21646
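
To make the DP attention change concrete, the sketch below simulates both communication patterns with plain Python lists and checks that they produce the same per-rank result. It is a toy model, not SGLang code: the function names, the three "ranks", and the chunk sizes are all illustrative assumptions.

```python
def all_reduce_sum(per_rank):
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(col) for col in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def scatter_by_counts(buf, counts):
    """Split one buffer into consecutive chunks of the given (possibly unequal) sizes."""
    out, start = [], 0
    for n in counts:
        out.append(buf[start:start + n])
        start += n
    return out


def reduce_scatterv(per_rank, counts):
    """Rank r receives only chunk r of the element-wise sum."""
    chunks_per_rank = [scatter_by_counts(buf, counts) for buf in per_rank]
    result = []
    for r in range(len(per_rank)):
        pieces = [chunks[r] for chunks in chunks_per_rank]
        result.append([sum(col) for col in zip(*pieces)])
    return result


# Three simulated ranks, each holding a partial result for six tokens.
per_rank = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
counts = [1, 3, 2]  # uneven number of tokens owned by each rank

# Two-step pattern: full all-reduce, then each rank keeps only its own slice.
reduced = all_reduce_sum(per_rank)
two_step = [scatter_by_counts(buf, counts)[r] for r, buf in enumerate(reduced)]

# One-step pattern: each rank receives only its slice of the sum.
one_step = reduce_scatterv(per_rank, counts)

assert two_step == one_step
print(one_step)
```

In the two-step pattern every rank first receives the full reduced buffer and then discards most of it; the one-step collective delivers only the slice each rank keeps, which is where the saving comes from.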

3. Bug Fixes

This period fixed bugs in EPLB dispatch, attention padding-token computation, Prometheus metrics, FP8 prefill output, and other areas.

Commit Message	Summary	PR Link
`fix: EPLB dispatch OOB when shared experts fusion enabled under DeepEP (#22525)`	Fixes an out-of-bounds access in EPLB dispatch when shared experts fusion is enabled under DeepEP	PR #22525
`[bugfix] avoid attention padding tokens computation in pcg (#17706)`	Avoids redundant attention computation on padding tokens in PCG	PR #17706
`[Anthropic] Fix clock mismatch in received_time causing negative Prometheus metrics (#22247)`	Fixes a clock mismatch in received_time in the Anthropic API compatibility layer that produced negative Prometheus metrics	PR #22247
`[ROCm]fix(aiter): cast fp8 prefill output back to model dtype (#22626)`	Casts the FP8 prefill output of the ROCm aiter backend back to the model dtype	PR #22626
`GLM-5/5.1 MXFP4 Checkpoint Inference Compatibility Fix (#22543)`	Fixes inference compatibility for GLM-5/5.1 MXFP4 checkpoints	PR #22543
`Restore Qwen3 rope config fallback (#22739)`	Restores the fallback logic for the Qwen3 RoPE config, preventing a crash when the config is missing	PR #22739
`Delete dead rematch path in SessionAwareCache.release_session (#22735)`	Deletes the unreachable rematch code path in SessionAwareCache.release_session	PR #22735
`fix[glm4.7 flash]: properly detect gfx95_quant_format (#22720)`	Fixes gfx95_quant_format detection for the GLM4.7 Flash model	PR #22720
`fix: use describe mode for SGLang version detection (#22600)`	Fixes SGLang version detection by switching to describe mode to obtain the version	PR #22600
`[AMD] Add MoE weights and scales padding (#21097)`	Adds padding of MoE weights and scales on AMD platforms	PR #21097
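
A clock mismatch of the kind named in PR #22247 can be reproduced with the standard library alone. The sketch below assumes, purely for illustration, that the start timestamp comes from the wall clock and the end timestamp from a monotonic clock; the summary does not say which clocks were actually mixed, and the function names are hypothetical.

```python
import time


def duration_mixed_clocks():
    """Buggy pattern: start taken from the wall clock, end from a monotonic clock."""
    received_time = time.time()      # seconds since the Unix epoch
    finished = time.perf_counter()   # seconds since an unspecified reference point
    return finished - received_time


def duration_same_clock():
    """Correct pattern: both timestamps come from the same clock."""
    received_time = time.perf_counter()
    finished = time.perf_counter()
    return finished - received_time


mixed = duration_mixed_clocks()
same = duration_same_clock()
print("mixed clocks:", mixed)
print("same clock:  ", same)
```

Because the two clocks count from different reference points, the mixed difference is meaningless and on typical systems comes out as a large negative number; a latency histogram or gauge fed with it would show exactly the negative values the fix addresses.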

4. New and Modified Parameters in server_args.py

This period added a HiCache storage backend option in server_args.py and refactored some parameters.

Commit Message	Summary	PR Link
`[Feature] Add SiMM as sglang HiCache Storage backend (#18016)`	Adds the `simm` option to `--hicache-storage-backend`	PR #18016
`[Misc] Migrate SGLANG_SET_CPU_AFFINITY to envs and refactor model config building (#22730)`	Migrates SGLANG_SET_CPU_AFFINITY to centralized management in environ and refactors the model config building logic	PR #22730
`[lora][moe] Virtual experts for LoRA MoE (#22122)`	Adds server_args parameters for LoRA MoE virtual experts	PR #22122
`env: add knob to control SWA eviction interval (#22645)`	Adds an environment-variable knob, `SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER`, that controls the SWA eviction interval	PR #22645
`[sgl] perf optimization for eplb (#21232)`	Adjusts EPLB-related parameter handling; the index suffix is stripped from the device argument automatically	PR #21232
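
The two argument-level changes in the table above can be sketched with `argparse`. This is a stand-alone illustration, not the contents of server_args.py: the parser, the reduced choice list, and `strip_device_index` are hypothetical, and the real normalization in PR #21232 may differ.

```python
import argparse

# Hypothetical parser; only the flag name and the "mooncake"/"simm" values come from the summary.
parser = argparse.ArgumentParser(prog="example-server")
parser.add_argument(
    "--hicache-storage-backend",
    choices=["mooncake", "simm"],  # illustrative subset, not the full list
    default=None,
    help="Storage backend used by HiCache.",
)

args = parser.parse_args(["--hicache-storage-backend", "simm"])
print(args.hicache_storage_backend)


def strip_device_index(device):
    """Drop a trailing ':<index>' from a device string, e.g. 'cuda:0' becomes 'cuda'."""
    return device.split(":", 1)[0]


print(strip_device_index("cuda:0"))
print(strip_device_index("cpu"))
```

Restricting the flag with `choices` makes an unsupported backend name fail at startup rather than at first use.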

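5. New Environment Variables

The following environment variable was added in this period:

Environment Variable	Type	Default	Description	Source PR
`SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER`	EnvFloat	1.0	Multiplier for the SWA (Sliding Window Attention) eviction interval	PR #22645

In addition, SGLANG_SET_CPU_AFFINITY was moved into the environ module for centralized management (not a new variable, only relocated). PR #22730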
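
A float-typed variable with a default of 1.0 could be consumed roughly as follows. Only the variable name and default come from the table above; `read_env_float`, `effective_interval`, and the base interval of 128 are assumptions made for illustration and do not reflect how SGLang's EnvFloat or the SWA eviction logic is implemented.

```python
import os

ENV_NAME = "SGLANG_SWA_EVICTION_INTERVAL_MULTIPLIER"


def read_env_float(name, default, env=os.environ):
    """Return the variable parsed as a float, or the default when unset or empty."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return float(raw)


def effective_interval(base_interval, multiplier):
    """Assumed semantics: the multiplier scales a base interval, with a floor of 1."""
    return max(1, int(base_interval * multiplier))


# Use plain dicts as stand-ins for the process environment.
unset_env = {}
tuned_env = {ENV_NAME: "2.5"}

default_mult = read_env_float(ENV_NAME, 1.0, env=unset_env)
tuned_mult = read_env_float(ENV_NAME, 1.0, env=tuned_env)

print(effective_interval(128, default_mult))
print(effective_interval(128, tuned_mult))
```

With a default of 1.0, leaving the variable unset keeps the existing behavior, which is the usual reason for shipping such a knob as a multiplier instead of an absolute value.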

6. CI/CD and Infrastructure

Commit Message	Summary	PR Link
`[AMD] Add MiniMax-M2.7 accuracy and performance nightly tests (#22722)`	Adds nightly accuracy and performance tests for MiniMax-M2.7 on AMD MI30X/MI35X	PR #22722
`[AMD] Replace push trigger with scheduled runs and enable parallel stage execution (#22489)`	Switches the AMD CI workflow from push triggers to scheduled runs and enables parallel stage execution	PR #22489
`[CI] Add optional image input to GB200 nightly workflow_dispatch (#22745)`	Adds an optional image input to the GB200 nightly workflow	PR #22745
`[CI] Add workflow_dispatch and environment gate to GB200 nightly pipeline (#22733)`	Adds manual triggering and an environment gate to the GB200 nightly pipeline	PR #22733
`[CI] Reinstall flashinfer-jit-cache on CUDA version mismatch (#22741)`	Reinstalls flashinfer-jit-cache automatically when the CUDA version does not match	PR #22741
`ci: skip full rerun when sgl-kernel wheel already built (#22534)`	Skips the full rerun when the sgl-kernel wheel has already been built	PR #22534
`[Docker] Remove flashinfer cache copy (#22653)`	Removes the flashinfer cache copy from the Docker build	PR #22653
`Revert "Upgrade CI default CUDA version from 12.9 to 13.0" (#22727)`	Reverts the CI default CUDA version from 13.0 back to 12.9	PR #22727

7. NPU/Ascend

Commit Message	Summary	PR Link
`[NPU] qwen3next low latency best practice docs. (#22808)`	Adds low-latency best-practice documentation for Qwen3-Next on NPU	PR #22808
`[NPU] [DOC] Update NPU docs to match latest code (#22796)`	Updates the NPU docs to match the latest code	PR #22796
`[NPU] Modify the parameter name and optional values, and add the parameter restrictions. Modify some parameters supported type. (#22804)`	Changes NPU parameter names, allowed values, and restrictions, and updates the supported parameter types	PR #22804
`[NPU] [DOC] Fix outdated descriptions in the NPU documentation (#22707)`	Fixes outdated descriptions in the NPU documentation	PR #22707
`fix:[NPU]correct the full name of then Kimi model (#22799)`	Corrects the full name of the Kimi model in the NPU materials	PR #22799
`Offloading docs update (#22795)`	Updates the offloading documentation	PR #22795

8. Other Changes

Commit Message	Summary	PR Link
`[gateway] Support SGLANG_LOG_MS for millisecond precision in router logs (#22506)`	Gateway (router) logs support the SGLANG_LOG_MS environment variable for millisecond precision	PR #22506
`[HiSparse] Clarify decode token usage logs (#22331)`	Clarifies the decode token usage logs of HiSparse	PR #22331
`[Misc] Add @cache_once to is_arch_support_pdl in jit_kernel (#22724)`	Adds the @cache_once decorator to is_arch_support_pdl in jit_kernel	PR #22724
`Refactor unified radix cache UT into parameterized test suite (#22812)`	Refactors the unified radix cache unit tests into a parameterized test suite	PR #22812
`Add page_size and SWA coverage to unified radix cache bench test (#22815)`	Adds page_size and SWA coverage to the unified radix cache bench test	PR #22815
`[Docs] Fix formatting of tool-call-parser options (#22793)`	Fixes the documentation formatting of the tool-call-parser options	PR #22793
`Update CODEOWNERS for musa/mlx (#22593)`	Updates CODEOWNERS for musa/mlx	PR #22593
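
The effect of caching a capability check can be shown with the standard library. The sketch below uses `functools.lru_cache` as a stand-in; it is not SGLang's `@cache_once`, whose exact semantics are not described in this summary, and the function name and threshold are placeholders rather than the real PDL support rule.

```python
import functools

calls = {"n": 0}  # counts how often the function body actually runs


@functools.lru_cache(maxsize=None)
def is_arch_supported_example(major):
    """Hypothetical stand-in for an architecture check that is wasteful to repeat."""
    calls["n"] += 1
    return major >= 9  # placeholder threshold for illustration only


results = [is_arch_supported_example(9) for _ in range(3)]
print(results, calls["n"])
```

The body runs once for a given argument and later calls return the stored result, which matters when the check sits on a path that is hit for every kernel launch or JIT lookup.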

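Highlights

Notable New Models and Features

LTX 2.3 two-stage TI2V: the diffusion models gain LTX 2.3 ti2v support
FLUX.1-dev NVFP4 quantization: the FLUX.1-dev model supports the ModelOpt NVFP4 quantization format
SiMM storage backend: SiMM is added as a HiCache storage backend option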
Virtual Experts for LoRA MoE: LoRA MoE now supports virtual experts
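DSA & Mamba model support: the Mooncake backend adds support for DSA and Mamba models

Notable Performance Optimizations

EPLB performance optimization: improved expert-parallel load-balancing algorithm
DP attention reduce_scatterv: all-reduce + dp_scatter is replaced with the more efficient reduce_scatterv
PCG optimization for FP8 models: improves FP8 model performance on the PCG inductor path
KV cache skip in embedding mode: unnecessary KV cache operations are skipped in embedding mode

Important Bug Fixes

EPLB dispatch OOB: fixes the out-of-bounds access when shared experts fusion is enabled under DeepEP
Negative Prometheus metrics: fixes negative Prometheus metrics caused by a clock mismatch
GLM-5/5.1 MXFP4 compatibility: fixes inference compatibility for MXFP4 checkpoints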