Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4
Mixed-precision quantized version of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled — a Claude 4.6 Opus reasoning-distilled Qwen3.5-27B model.
~25 GB on disk (~23.7 GiB in VRAM). Recommended GPU: NVIDIA RTX PRO 6000 (96 GB) — works out of the box with vLLM 0.17.0, serving 8 concurrent requests at full 262K context. Also fits on a single RTX 5090 (32 GB) with ~225K context once upcoming vLLM fixes land (see Hardware Requirements).
RTX 5090 Setup Guide (single GPU, 32 GB)
If you have a GPU with >= 48 GB VRAM (RTX PRO 6000, A100, H100, etc.), skip this section — stock `pip install vllm` works out of the box.
vLLM 0.17.0 has two Blackwell-specific bugs that cause OOM on 32 GB GPUs when running Qwen3.5's hybrid DeltaNet architecture. Until the upstream fixes are merged, you can install vLLM from the PR branches:
Step 1: Install vLLM from the Triton warmup fix PR
```shell
pip install git+https://github.com/AuYang261/vllm.git@fix/gdn-triton-warmup
```
This installs PR #36599, which pre-warms DeltaNet Triton kernels during vLLM's profiling phase so autotuning doesn't OOM after KV cache allocation.
Step 2: Apply the TMA fix (one-line patch)
PR #36325 disables Triton's TMA codepath on Blackwell (SM120), which allocates oversized descriptor buffers. Apply it manually:
```shell
# Find the file inside the installed vLLM package
UTILS_FILE=$(python -c "import vllm; import os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fla/ops/utils.py'))")

# Replace the TMA check (restrict TMA to Hopper only)
sed -i 's/is_nvidia and torch.cuda.get_device_capability(0)\[0\] >= 9/is_nvidia and 9 <= torch.cuda.get_device_capability(0)[0] < 12/' "$UTILS_FILE"
echo "Patched: $UTILS_FILE"
```
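To see what the one-line change does, the snippet below compares the old and new capability predicates across GPU generations (illustrative only — the real check lives in vLLM's `fla/ops/utils.py`). SM120 (consumer Blackwell) passes the old check but is excluded by the new one:

```python
# Illustrative comparison of the TMA capability checks before and after the patch.
# Compute-capability majors: Ampere = 8, Hopper = 9, consumer Blackwell = 12.

def old_check(major: int) -> bool:
    # Original vLLM condition: enables TMA on anything >= Hopper,
    # which incorrectly includes Blackwell (SM120).
    return major >= 9

def new_check(major: int) -> bool:
    # Patched condition: TMA restricted to Hopper-class GPUs only.
    return 9 <= major < 12

for name, major in [("Ampere (SM80)", 8), ("Hopper (SM90)", 9), ("Blackwell (SM120)", 12)]:
    print(f"{name}: old={old_check(major)}, new={new_check(major)}")
# Blackwell (SM120): old=True, new=False — the oversized-TMA path is no longer taken
```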
Step 3: Run
```shell
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enforce-eager \
  --reasoning-parser qwen3
```
Notes:
- `--enforce-eager` disables CUDA graph capture, which avoids a third issue with DeltaNet's mamba cache sizing on small GPUs. This costs roughly 10-15% throughput but ensures stable operation.
- `--gpu-memory-utilization 0.85` leaves headroom for Triton kernel workspace allocations.
- Once both PRs are merged into a vLLM release, you can switch back to stock `pip install vllm` and remove `--enforce-eager`.
Quantization Strategy
Non-uniform mixed-precision quantization using llm-compressor v0.10.1, with layer-level precision assignment based on architectural sensitivity analysis of the Qwen3.5 hybrid DeltaNet architecture. The quantization uses the compressed-tensors format with two quantization groups plus unquantized layers:
| Precision | Layers | Rationale |
|---|---|---|
| FP8 W8A8 (per-channel weights, per-token dynamic activations) | DeltaNet projections (in_proj_qkv, in_proj_z, out_proj), softmax q_proj/k_proj/v_proj, MLP down_proj | Sensitive layers: QKV projections directly shape attention computation, DeltaNet accumulates error in recurrent state, down_proj is the MLP accumulation point. Uniform QKV precision is required for fused QKV weight loading. |
| NVFP4 W4A4 (FP4 E2M1 weights + activations, FP8 E4M3 per-group scales, group_size=16) | Softmax o_proj, MLP gate_proj/up_proj | Error-tolerant layers: o_proj is a post-attention reduction; gate/up are pre-activation (errors dampened by SiLU gating) |
| BF16 (unquantized) | lm_head, embed_tokens, DeltaNet small projections (in_proj_a, in_proj_b), all norms, visual encoder | Industry consensus: lm_head amplifies errors across the 248K vocab; embed_tokens is a lookup table; DeltaNet low-rank projections are numerically sensitive; vision tower retained at full precision |
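As an illustration of what NVFP4's `group_size=16` scheme means numerically, here is a simplified sketch of group quantization to FP4 E2M1 levels. This is not llm-compressor's actual kernel — real NVFP4 also quantizes the per-group scale itself to FP8 E4M3, which this sketch keeps in full precision:

```python
# Simplified sketch of NVFP4-style group quantization (group_size=16).
# FP4 E2M1 can represent only these magnitudes; each group of 16 weights
# shares one scale chosen so the group's max maps to the largest level (6.0).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(weights):
    scale = max(abs(w) for w in weights) / 6.0  # per-group scale (FP8 E4M3 in real NVFP4)
    if scale == 0.0:
        return [0.0] * len(weights)
    quantized = []
    for w in weights:
        # Snap each weight's magnitude to the nearest representable FP4 level
        level = min(E2M1_LEVELS, key=lambda l: abs(abs(w) / scale - l))
        quantized.append(level * scale * (1 if w >= 0 else -1))
    return quantized

group = [0.31, -0.02, 0.11, 0.62, -0.45, 0.05, 0.0, -0.29,
         0.18, 0.07, -0.13, 0.24, 0.50, -0.08, 0.02, 0.36]
print(quantize_group(group))
```

Because only the 8 E2M1 magnitudes exist, small outlier-free groups quantize well — which is why the error-tolerant o_proj/gate/up layers can absorb 4-bit precision while QKV and down_proj stay at FP8.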
Weight Breakdown
| Component | Size | Precision |
|---|---|---|
| MLP | 11.4 GB | FP8 + NVFP4 |
| DeltaNet attention | 5.6 GB | FP8 + BF16 |
| lm_head | 2.5 GB | BF16 |
| embed_tokens | 2.5 GB | BF16 |
| Softmax attention | 1.4 GB | FP8 + NVFP4 |
| Visual encoder | 0.9 GB | BF16 (vision tower retained; text-only calibration — vision capability not separately validated post-quantization) |
| Quantization scales | 0.8 GB | Mixed |
| Total | ~25 GB | |
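The larger entries in the table can be sanity-checked from the numbers listed under Architecture. A back-of-envelope sketch (0.5 B/param for NVFP4 weights, 1 B for FP8, 2 B for BF16; scales and small projections ignored):

```python
# Back-of-envelope check of the weight breakdown above.
HIDDEN, INTERMEDIATE, VOCAB, LAYERS = 5120, 17408, 248320, 64
GB = 1e9  # table sizes are decimal gigabytes

# BF16 embedding and output head: 2 bytes per parameter
embed_gb = VOCAB * HIDDEN * 2 / GB    # ~2.5 GB, matches the table
lm_head_gb = VOCAB * HIDDEN * 2 / GB  # ~2.5 GB, matches the table

# MLP per layer: gate_proj + up_proj in NVFP4 (0.5 B/param), down_proj in FP8 (1 B/param)
mlp_per_layer = (2 * HIDDEN * INTERMEDIATE * 0.5) + (HIDDEN * INTERMEDIATE * 1.0)
mlp_gb = mlp_per_layer * LAYERS / GB  # ~11.4 GB, matches the table

print(f"embed_tokens ~ {embed_gb:.1f} GB, lm_head ~ {lm_head_gb:.1f} GB, MLP ~ {mlp_gb:.1f} GB")
```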
Self-Calibration
Calibration data was generated by the model itself (512 reasoning traces across 8 categories: math, code, logic, analysis, creative writing, general knowledge, tool calling, Korean). Self-calibration produces traces that match the model's actual activation distributions during reasoning, yielding better quantization accuracy than generic web text for reasoning-distilled models.
- Calibration samples: 512
- Mean sequence length: 2,180 tokens
- Max sequence length: 4,096 tokens
- Generation: sglang offline Engine with `enable_thinking=True`
Architecture
Qwen3.5-27B uses a hybrid DeltaNet + softmax attention architecture with full_attention_interval=4:
Layer pattern (64 layers):
[DeltaNet, DeltaNet, DeltaNet, Softmax] × 16
= 48 DeltaNet layers + 16 softmax attention layers
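The repeating schedule implied by `full_attention_interval=4` can be written out directly (a trivial sketch of the layer assignment):

```python
# Hybrid layer schedule: every 4th layer is softmax attention, the rest DeltaNet.
FULL_ATTENTION_INTERVAL = 4
NUM_LAYERS = 64

pattern = ["Softmax" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "DeltaNet"
           for i in range(NUM_LAYERS)]

print(pattern[:4])                # ['DeltaNet', 'DeltaNet', 'DeltaNet', 'Softmax']
print(pattern.count("DeltaNet"))  # 48
print(pattern.count("Softmax"))   # 16
```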
Key architectural parameters:
- Hidden size: 5,120
- Attention heads: 24 (query), 4 (KV, GQA)
- Head dimension: 256
- DeltaNet heads: 16 key, 48 value (dim 128 each)
- MLP intermediate: 17,408
- Vocabulary: 248,320
- Max position embeddings: 262,144
Only 16 of 64 layers require KV cache — the 48 DeltaNet layers use a fixed-size recurrent state that doesn't grow with sequence length. This gives ~4x more context capacity than a standard transformer of the same size.
KV Cache Budget
Per-token KV cache cost (only 16 softmax layers):
- FP16: 4 KV heads x 256 dim x 2 (K+V) x 2 bytes x 16 layers = 64 KB/token
- FP8: 32 KB/token
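These budgets check out numerically (a sketch assuming FP8 KV at 1 byte per element and the ~7 GiB / ~71 GiB free-VRAM estimates used here):

```python
# Per-token KV cache cost for the 16 softmax attention layers (FP8: 1 byte/element).
KV_HEADS, HEAD_DIM, SOFTMAX_LAYERS = 4, 256, 16
fp8_bytes_per_token = KV_HEADS * HEAD_DIM * 2 * 1 * SOFTMAX_LAYERS  # K and V
print(fp8_bytes_per_token // 1024, "KB/token")  # 32 KB/token (64 KB at FP16)

GIB = 2**30
rtx5090_tokens = 7 * GIB // fp8_bytes_per_token
pro6000_tokens = 71 * GIB // fp8_bytes_per_token
print(rtx5090_tokens)                 # 229376 — ~225K after overhead
print(pro6000_tokens >= 8 * 262_144)  # True: 8 concurrent full-context requests fit
```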
| GPU | Available for KV | Max Context (FP8 KV) |
|---|---|---|
| RTX 5090 (32 GB) | ~7 GiB | ~225K tokens (single request) |
| RTX PRO 6000 (96 GB) | ~71 GiB | 8 concurrent requests × 262K tokens each |
Usage
Serving with vLLM (recommended)
RTX PRO 6000 / high-VRAM GPUs (>= 48 GB) — works out of the box with vLLM 0.17.0:
```shell
pip install "vllm>=0.17.0"

vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
  --max-model-len 262144 \
  --reasoning-parser qwen3
```
RTX 5090 / consumer Blackwell (32 GB) — requires upcoming vLLM fixes (see Known Issues):
```shell
vllm serve mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --reasoning-parser qwen3
```
Note: The model loads (~23.7 GiB) but vLLM's Triton autotuner and CUDA graph capture currently trigger OOM on 32 GB GPUs. Two upstream PRs are expected to resolve this — see Known Issues below. On GPUs with >= 48 GB VRAM, the extra headroom makes these issues irrelevant.
Transformers (direct loading)
```python
from transformers import AutoTokenizer, Qwen3_5ForConditionalGeneration
import torch

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4",
    trust_remote_code=True,
)
```
Compatibility
| Framework | Supported | Notes |
|---|---|---|
| vLLM >= 0.17.0 | Yes | compressed-tensors NVFP4 with Blackwell FP4 acceleration. Works on >= 48 GB GPUs; 32 GB GPUs require upcoming fixes (see Known Issues) |
| transformers >= 5.3.0 | Yes | Direct loading with device_map="auto" |
| SGLang | No | compressed-tensors NVFP4 not supported (only ModelOpt format) |
| llama.cpp / GGUF | No | NVFP4 format not supported |
Hardware Requirements
| Configuration | VRAM | Notes |
|---|---|---|
| Minimum | 28 GB | Weights only, minimal context |
| RTX 5090 | 32 GB | ~225K context with FP8 KV cache. Requires upcoming vLLM fixes (see Known Issues) |
| RTX PRO 6000 (recommended) | 96 GB | 8 concurrent × 262K context with FP8 KV cache. Works out of the box with vLLM 0.17.0 |
| 2x RTX 5090 | 64 GB | Tensor parallel, full context |
Known Issues
Qwen3.5's hybrid DeltaNet architecture combines recurrent (DeltaNet) and softmax attention layers, which exercises less-tested code paths in vLLM on Blackwell GPUs with limited VRAM. Two open PRs address these issues:
| Issue | PR | Status | Impact |
|---|---|---|---|
| Triton autotuner OOM during first inference on Blackwell | vllm-project/vllm#36599 | Open | DeltaNet Triton kernels are not pre-warmed during vLLM's profiling phase; autotuning triggers after KV cache allocation fills available VRAM |
| TMA descriptor allocation OOM on SM120 (Blackwell) | vllm-project/vllm#36325 | Open | Triton's TMA codepath allocates oversized buffers on Blackwell; fix restricts TMA to Hopper (SM90) |
Workaround: Use a GPU with >= 48 GB VRAM (e.g., RTX PRO 6000, A100, H100), where the extra memory headroom makes these issues irrelevant. Once both PRs are merged, RTX 5090 and other 32 GB Blackwell GPUs will work out of the box.
Source Model
This is a quantization of Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled, which is an SFT fine-tune of Qwen/Qwen3.5-27B using Claude 4.6 Opus reasoning distillation data.
Training datasets:
- nohurry/Opus-4.6-Reasoning-3000x-filtered
- TeichAI/claude-4.5-opus-high-reasoning-250x
- Jackrong/Qwen3.5-reasoning-700x
Quantization Details
- Tool: llm-compressor v0.10.1
- Format: compressed-tensors (mixed FP8 + NVFP4)
- Scheme: Two quantization groups — FP8 W8A8 (per-channel weights, per-token activations) for sensitive layers; NVFP4 W4A4 (FP4 E2M1, group_size=16) for error-tolerant layers; BF16 for embeddings, head, norms, and small projections
- Calibration: Self-generated reasoning traces (512 samples, 8 categories)
- Hardware: NVIDIA RTX PRO 6000 Blackwell (96 GB)
- Time: ~62 minutes (calibration pass)
License
Apache 2.0, following the base model license.