SGLang Daily Change Summary - 2026-04-15

Time range covered: 2026-04-15 00:00:00 ~ 2026-04-15 23:59:59 (UTC+8)
A total of 26 commits were merged into the main branch.

I. New Features

1. Ray DataParallel support

Adds Ray-based DataParallel (DP) and DP Attention support, enabling distributed deployment through RayEngine.

Commit Message	Summary	PR Link
`[Ray] Add data parallel (DP) and DP attention support to RayEngine (#21887)`	Adds RayDataParallelController, which uses Ray SchedulerActor instead of multiprocessing.Process to run DP / DP Attention distributed inference	PR #21887

2. Diffusion RL training support

Adds full RL rollout capability for T2I (Text-to-Image) post-training, including a standalone rollout API, denoising environment backpass, and SP-aligned log-prob computation.

Commit Message	Summary	PR Link
`[diffusion] rl: support standalone rollout api, denoising environment backpass and sp-aligned log-prob for T2I post-training (#22604)`	Adds a standalone rollout API, a denoising environment backpass mechanism, and sequence-parallel-aligned log-prob computation; supports the Qwen Image and ZImage pipelines	PR #22604

3. Pooler returns pooled_hidden_states

Several reward models (Llama, Gemma2, InternLM2, Qwen2) and the pooler layer can now return a pooled_hidden_states field, so callers can obtain the raw hidden states alongside the classification or scoring result.

Commit Message	Summary	PR Link
`[Score API] Add return_pooled_hidden_states to Scoring API for SequenceClassification / RewardModel (#22427)`	pooler.py factors out a shared pool_hidden_states function; the Llama/Gemma2/InternLM2/Qwen2 RMs return pooled_hidden_states, with support for the MIS path	PR #22427
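
As a rough illustration of returning pooled hidden states together with a score, the sketch below applies last-token pooling in plain Python. The names `pool_last_token` and `score_with_pooled`, the choice of last-token pooling, and the toy linear head are assumptions made for this example; they are not taken from SGLang's pooler.py, which operates on tensors.

```python
from typing import List, Tuple

Vector = List[float]


def pool_last_token(hidden_states: List[Vector]) -> Vector:
    """Last-token pooling: use the final token's hidden state (assumed pooling type)."""
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one token")
    return list(hidden_states[-1])


def score_with_pooled(
    hidden_states: List[Vector], head_weights: Vector
) -> Tuple[float, Vector]:
    """Return (score, pooled_hidden_states) from a toy linear reward head."""
    pooled = pool_last_token(hidden_states)
    score = sum(h * w for h, w in zip(pooled, head_weights))
    return score, pooled


# Three tokens with a hidden size of 2 (made-up numbers).
tokens = [[0.1, 0.2], [0.5, -1.0], [2.0, 3.0]]
score, pooled = score_with_pooled(tokens, head_weights=[1.0, 0.5])
print(score, pooled)
```

Returning the pooled vector next to the score lets a caller reuse the same forward pass for both purposes instead of running the model twice.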

II. Performance Optimizations

1. AMD / Qwen3.5 optimizations

Two key optimizations for Qwen3.5 models on AMD GPUs: share expert fusion and a fused Triton kernel.

Commit Message	Summary	PR Link
`[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8 (#20736)`	Enables share expert fusion with router experts for Qwen2MoE and Qwen3_5, supporting BF16 and FP8 precision	PR #20736
`[AMD] Optimize _append_shared_to_topk_output by a single fused Triton kernel for Qwen3.5 (#22844)`	Fuses _append_shared_to_topk_output into a single Triton kernel, reducing MoE topk overhead for Qwen3.5 on AMD	PR #22844
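
To make the idea of appending a shared expert to the top-k routing output concrete, here is a minimal pure-Python sketch. It assumes the shared expert is indexed right after the routed experts and receives a fixed weight; the function name `append_shared_to_topk` and both assumptions are illustrative only. The actual change is a fused Triton kernel whose layout and weighting may differ.

```python
from typing import List, Tuple


def append_shared_to_topk(
    topk_ids: List[List[int]],
    topk_weights: List[List[float]],
    num_routed_experts: int,
    shared_weight: float = 1.0,
) -> Tuple[List[List[int]], List[List[float]]]:
    """Append one shared-expert slot to every token's top-k routing result.

    Assumption: the shared expert is stored right after the routed experts,
    so its id is `num_routed_experts`.
    """
    out_ids, out_weights = [], []
    for ids, weights in zip(topk_ids, topk_weights):
        out_ids.append(ids + [num_routed_experts])
        out_weights.append(weights + [shared_weight])
    return out_ids, out_weights


# Two tokens, top-2 routing over 8 routed experts (made-up numbers).
ids, weights = append_shared_to_topk(
    topk_ids=[[3, 7], [0, 5]],
    topk_weights=[[0.6, 0.4], [0.9, 0.1]],
    num_routed_experts=8,
)
print(ids)
print(weights)
```

With the shared expert expressed as one more routing slot, a single fused MoE call can cover both routed and shared experts, which is the general motivation behind this kind of fusion.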

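2. Serving-layer performance optimization

Replaces the O(n²) string concatenation in stream_buffer with an O(n) integer-offset approach.

Commit Message	Summary	PR Link
`[serving] replace O(n²) stream_buffer string concat with integer offset (#22606)`	Replaces string concatenation with an integer offset, reducing buffer handling for streaming chat/completions from O(n²) to O(n)	PR #22606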
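
The difference between the two approaches can be sketched as follows, assuming each streamed update carries the cumulative text generated so far. Both function names and the snapshot format are hypothetical and do not reproduce the serving code; they only contrast keeping a growing string with keeping an integer offset.

```python
from typing import Iterable, List


def deltas_with_concat(snapshots: Iterable[str]) -> List[str]:
    """Baseline: remember the already-sent text as a growing string."""
    stream_buffer = ""
    deltas = []
    for text in snapshots:
        delta = text[len(stream_buffer):]
        stream_buffer = stream_buffer + delta  # rebuilds the sent text on every chunk
        deltas.append(delta)
    return deltas


def deltas_with_offset(snapshots: Iterable[str]) -> List[str]:
    """Offset variant: remember only how many characters were already sent."""
    sent_offset = 0
    deltas = []
    for text in snapshots:
        delta = text[sent_offset:]
        sent_offset = len(text)  # constant-time bookkeeping, no string rebuild
        deltas.append(delta)
    return deltas


snapshots = ["He", "Hello", "Hello, wor", "Hello, world"]
assert deltas_with_concat(snapshots) == deltas_with_offset(snapshots)
print(deltas_with_offset(snapshots))
```

Because the buffer was only ever used to know where the next delta starts, an integer carries the same information without copying the accumulated text on each chunk.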

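3. Step3p5 model optimization

Communication optimization for the Step3p5 MoE layers: the share_expert and MoE outputs are combined first and then all-reduced together, saving one TP all-reduce per layer. The intermediate tensor dump code was also removed.

Commit Message	Summary	PR Link
`[Step3p5] Optimize allreduce in MoE layers (#22773)`	share_expert sets reduce_results=False and its output is merged with the MoE output before a single all-reduce; removes the SGLANG_DUMP_STEP3P5_INTERMEDIATE debug code; layer_id lookup changed from a list to a set	PR #22773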
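
The reason one all-reduce can replace two is that a sum all-reduce is linear: reducing two partial outputs separately and adding the results gives the same values as adding them on each rank and reducing once. The sketch below simulates this with plain Python lists; `FakeAllReduce` is a hypothetical stand-in for a real collective, and combining the outputs by addition is an assumption of the example.

```python
from typing import List

Tensor = List[int]  # toy 1-D tensor; integers keep the comparison exact


class FakeAllReduce:
    """Simulates a sum all-reduce over per-rank partial results and counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, per_rank: List[Tensor]) -> Tensor:
        self.calls += 1
        return [sum(values) for values in zip(*per_rank)]


def add(a: Tensor, b: Tensor) -> Tensor:
    return [x + y for x, y in zip(a, b)]


# Partial outputs held by two tensor-parallel ranks (made-up numbers).
shared_partial = [[1, 2], [3, 4]]
moe_partial = [[10, 20], [30, 40]]

# Before: reduce each output on its own, then add.
separate = FakeAllReduce()
out_separate = add(separate(shared_partial), separate(moe_partial))

# After: add locally on each rank, then reduce once.
merged = FakeAllReduce()
out_merged = merged([add(s, m) for s, m in zip(shared_partial, moe_partial)])

print(out_separate, separate.calls)
print(out_merged, merged.calls)
```

Both paths produce the same result, while the merged path issues one collective instead of two per layer.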

4. Other optimizations

Commit Message	Summary	PR Link
`[Misc] Use cache_once for is_arch_support_pdl in sgl-kernel (#22725)`	is_arch_support_pdl in sgl-kernel is cached with cache_once to avoid repeated detection	PR #22725
`[diffusion] chore: auto-enable best parallel setting if unspecified (#22763)`	Diffusion automatically enables the best parallel setting when none is specified	PR #22763
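
The caching pattern behind the is_arch_support_pdl change can be sketched with the standard library. This example uses `functools.lru_cache` as a stand-in for sgl-kernel's `cache_once`, whose implementation is not shown here; the function `is_arch_supported`, the placeholder capability value, and the threshold are all hypothetical.

```python
import functools

probe_count = 0  # counts how often the expensive probe body actually runs


@functools.lru_cache(maxsize=None)
def is_arch_supported() -> bool:
    """Stand-in for an expensive capability probe (hypothetical logic)."""
    global probe_count
    probe_count += 1
    compute_capability_major = 9  # placeholder; a real check would query the device
    return compute_capability_major >= 9  # placeholder threshold


# Called on every kernel launch in a hot path, but probed only once.
results = [is_arch_supported() for _ in range(1000)]
print(results[0], probe_count)
```

Since the hardware architecture does not change during the process lifetime, caching the first answer is safe and removes the repeated detection cost.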

III. Bug Fixes

1. Streaming session fixes (series)

A series of memory-leak and counting fixes for streaming sessions.

Commit Message	Summary	PR Link
`Streaming session: fix retract tail leak via _free_tail (#22862)`	Fixes the tail memory leak on retract via _free_tail; refactors the free logic in session_aware_cache	PR #22862
`Refactor streaming session abort handling (#22790)`	Refactors the streaming session abort flow; improves exception handling in session_controller and session_aware_cache	PR #22790
`Fix streaming session busy-check double-counting via active_pool_idxs (#22753)`	Fixes double-counting in the streaming session busy-check via active_pool_idxs	PR #22753
`Rename _alive_streaming_session_count; use _is_streaming helper (#22755)`	Renames _alive_streaming_session_count and uses the _is_streaming helper consistently	PR #22755

2. HiCache / HiSparse fixes

Commit Message	Summary	PR Link
`[HiCache]Fix CP support for hybrid model (#22782)`	Fixes HiCache support for hybrid models under Context Parallel	PR #22782
`[HiCache] Fix memory host free logic when share_indices_with_anchor enabled (#22767)`	Fixes the host memory pool free logic when share_indices_with_anchor is enabled	PR #22767
`[HiSparse][BugFix]: Fix the memory leak issue during health checks. (#22882)`	Fixes a memory leak during health checks	PR #22882

3. Other bug fixes

Commit Message	Summary	PR Link
`[BugFix] Fix EAGLE speculative decoding missing grammar-based finish (#21723)`	Fixes EAGLE speculative decoding missing the grammar-based finish condition	PR #21723
`[VLM] fix LFM2-VL offline inference and GPU JPEG decode (#22448)`	Fixes LFM2-VL offline inference and GPU JPEG decoding	PR #22448

IV. Model-Related Changes

Commit Message	Summary	PR Link
`[AMD] Enable share expert fusion with router experts for Qwen3.5 BF16 & FP8 (#20736)`	Enables AMD share expert fusion for Qwen2MoE/Qwen3_5	PR #20736
`[AMD] Optimize _append_shared_to_topk_output by a single fused Triton kernel for Qwen3.5 (#22844)`	Fused Triton kernel for the Qwen3.5 MoE topk output	PR #22844
`[VLM] Enable per-image ViT cache and avoid TP CUDA context creation for Kimi-K2.5 (#22858)`	KimiK25 removes the vision_tower_forward_auto function and calls vision_tower + mm_projection_auto directly; corrects the grid_thws field name to image_grid_thw	PR #22858
`[Bugfix] Preserve auto-detected quant_config for GLM NextN draft model (#22823)`	GLM4 MoE NextN fixes the needs_quant_draft condition by adding a quant_config is not None check	PR #22823
`[Step3p5] Optimize allreduce in MoE layers (#22773)`	Step3p5 MoE communication optimization and debug code cleanup	PR #22773

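V. server_args.py Changes

Commit Message	Summary	PR Link
`Cleanup server_args.py and minor code tidying (#22820)`	Code cleanup: reorders constant definitions (alphabetically / by logical group); adds blank lines to separate logical blocks; removes the `add_mamba_ssm_dtype_choices` extension function and changes the `--mamba-ssm-dtype` choices to the inline literal `["float32", "bfloat16", "float16"]`. No new parameters.	PR #22820

New parameters: no new server_args parameters today.
Removed: the add_mamba_ssm_dtype_choices extension function (a helper, not a CLI parameter; behavior unchanged, the choices are now inlined).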
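
For readers unfamiliar with inline choices, the following standalone argparse sketch shows the general pattern. Only the flag name and the three choice values come from the summary above; the default value and the rest of the parser are assumptions for illustration and do not mirror the real server_args.py.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--mamba-ssm-dtype",
    type=str,
    default="float32",  # assumed default, for illustration only
    choices=["float32", "bfloat16", "float16"],  # inline literal instead of a helper
)

args = parser.parse_args(["--mamba-ssm-dtype", "bfloat16"])
print(args.mamba_ssm_dtype)
```

Passing a value outside the list makes argparse exit with a usage error, so the accepted values stay visible at the point where the flag is declared.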

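VI. Environment Variable Changes

Environment Variable	Description	Related PR
None added	No new environment variables were found in today's commits. The environment variables that appear in the code, `SGLANG_SET_CPU_AFFINITY`, `SGLANG_NUMA_BIND_V2`, and `SGLANG_USE_AITER`, are uses of existing variables.	-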

VII. CI / Docs / Other

CI

Commit Message	Summary	PR Link
`[AMD][CI] Add GLM-5-MXFP4 accuracy and perf nightly tests for MI35x (#21773)`	Adds GLM-5-MXFP4 accuracy and performance nightly tests for MI35x	PR #21773
`ci: skip approval for nightly gb200 runs, keep for manual triggers (#22768)`	Skips approval for GB200 nightly runs; manual triggers still require it	PR #22768
`Update CI Permissions (#22826)`	Updates the CI permission configuration	PR #22826
`[diffusion] CI: refactor diffusion ci and reduce redundancy (#22810)`	Refactors the diffusion CI and reduces redundancy	PR #22810
`[diffusion] CI: reset thresholds (#22854)`	Resets the diffusion CI thresholds	PR #22854

Docs

Commit Message	Summary	PR Link
`[NPU] Offloading docs update (#22860)`	Updates the Ascend NPU offloading docs	PR #22860
`[diffusion] quant: update modelopt quantization docs and CI coverage (#22772)`	Updates the modelopt quantization docs and CI coverage	PR #22772
`[Docs] Move ptxas sm_103a workaround into For CUDA 13 section (#22852)`	Moves the ptxas sm_103a workaround docs into the CUDA 13 section	PR #22852

Infrastructure

Commit Message	Summary	PR Link
`[PD] Add a fallback to bypass rust dep for mini_lb (#21982)`	Adds a fallback to mini_lb that bypasses the Rust dependency	PR #21982
`Add runai-model-streamer into Python packages installed in Dockerfile and fix NotADirectoryError Docker regression (#22537)`	Adds the runai-model-streamer package to the Dockerfile and fixes the NotADirectoryError Docker regression	PR #22537

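VIII. Key Changes Summary

New models

No brand-new model support was added today, but there were notable improvements to existing models:

Qwen3.5: share expert fusion and fused Triton kernel optimizations on AMD
Kimi-K2.5: vision tower forward refactor and field name fix
GLM4 MoE NextN: quant draft condition fix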

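Performance optimizations

MoE share expert fusion and fused topk kernel for Qwen3.5 on AMD GPUs
Serving-layer stream_buffer reduced from O(n²) to O(n)
Step3p5 MoE communication optimization (saves one TP all-reduce per layer)
Diffusion automatic parallel configuration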

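Bug fixes

Streaming session fix series (memory leak, abort handling, busy-check double-counting)
HiCache CP support and host memory free logic fixes
HiSparse health check memory leak fix
EAGLE speculative decoding grammar-based finish fix
LFM2-VL offline inference and GPU JPEG decoding fixes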

server_args.py

No new parameters; only code cleanup and constant reordering.

Environment variables

No new environment variables.