Qwen3.5-122B-A10B-MTP-NVFP4
This is Sehyo/Qwen3.5-122B-A10B-NVFP4 with MTP (Multi-Token Prediction) weights added, enabling speculative decoding with vLLM.
The original Sehyo NVFP4 checkpoint omitted the MTP head during quantization. The MTP weights were extracted from the official Qwen/Qwen3.5-122B-A10B BF16 checkpoint and added as a separate shard (`mtp_weights.safetensors`). `model.safetensors.index.json` was updated to include the MTP tensor references.
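Updating the index amounts to pointing each `mtp.*` tensor at the new shard in the standard safetensors `weight_map`. A minimal sketch of that patch (the helper name and tensor keys are illustrative, not the checkpoint's actual values):

```python
import json

def add_mtp_shard(index_path: str, mtp_tensor_names: list[str],
                  shard_name: str = "mtp_weights.safetensors") -> None:
    """Map each MTP tensor name to the new shard in model.safetensors.index.json."""
    with open(index_path) as f:
        index = json.load(f)
    # weight_map maps tensor name -> shard filename; existing entries are untouched
    for name in mtp_tensor_names:
        index["weight_map"][name] = shard_name
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2)
```

The same pattern works for any extra shard: only `weight_map` entries need to change, since loaders resolve each tensor's file through the index.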
What's Different from Sehyo/Qwen3.5-122B-A10B-NVFP4
- ✅ `mtp_weights.safetensors`: MTP head weights (extracted from the BF16 base, ~1.4 GB)
- ✅ `model.safetensors.index.json`: updated to map `mtp.*` tensors to the new shard
- ✅ `config.json`: added `in_proj_ba` and `in_proj_qkvz` to the quantization ignore list (avoids vLLM loading warnings)
- ✅ All other config/tokenizer files are included in this repo
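The `config.json` change is just appending the two module names to the quantization config's ignore list. A sketch, assuming the common `quantization_config.ignore` layout used by compressed checkpoints (the exact key path may differ for other formats):

```python
import json

def extend_ignore_list(config_path: str, extra: list[str]) -> None:
    """Append module patterns to quantization_config.ignore, skipping duplicates."""
    with open(config_path) as f:
        cfg = json.load(f)
    ignore = cfg.setdefault("quantization_config", {}).setdefault("ignore", [])
    for name in extra:
        if name not in ignore:
            ignore.append(name)
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)

# For this repo the added entries were: ["in_proj_ba", "in_proj_qkvz"]
```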
Downloading the Model
The two large model weight shards (`model-00001-of-00002.safetensors`, 50 GB; `model-00002-of-00002.safetensors`, 26 GB) are hosted in Sehyo's original repo due to HuggingFace's per-file size limits. Download everything as follows:
```bash
# Install
pip install huggingface_hub

# Download model shards from Sehyo's repo
huggingface-cli download Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  model-00001-of-00002.safetensors \
  model-00002-of-00002.safetensors \
  --local-dir ./Qwen3.5-122B-A10B-MTP-NVFP4

# Download everything else (MTP weights + updated configs) from this repo
huggingface-cli download scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --local-dir ./Qwen3.5-122B-A10B-MTP-NVFP4
```
Both downloads land in the same local directory, giving a complete checkpoint ready for use.
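Because the shards come from two different repos, it is easy to end up with a partial directory. A small sanity check before serving (file names taken from this card; file sizes are not verified):

```python
from pathlib import Path

# Required files in the merged directory, per this model card
REQUIRED = [
    "model-00001-of-00002.safetensors",  # from Sehyo's repo
    "model-00002-of-00002.safetensors",  # from Sehyo's repo
    "mtp_weights.safetensors",           # from this repo
    "model.safetensors.index.json",
    "config.json",
]

def missing_files(model_dir: str) -> list[str]:
    """Return the required files that are absent from the merged directory."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).exists()]
```

If `missing_files("./Qwen3.5-122B-A10B-MTP-NVFP4")` returns a non-empty list, one of the two download commands above did not complete.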
Usage with vLLM (MTP Speculative Decoding)
```bash
vllm serve scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --speculative-model "[ngram]" \
  --num-speculative-tokens 2 \
  --kv-cache-dtype fp8 \
  ...
```
Or use the MTP draft-model approach (requires a vLLM build patched with `qwen3_5_mtp.py`):
```bash
vllm serve scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --speculative-model scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --speculative-model-quantization nvfp4 \
  --num-speculative-tokens 2 \
  ...
```
Changelog
- 28/02/2026: Added MTP weights from BF16 base; updated index and config.
- 25/02/2026: Original NVFP4 quantization by Sehyo.
Original Model Details (from Sehyo)
- Quantization: NVFP4 via vLLM's LLM Compressor
- Calibration Samples: 512 (256 from each dataset)
- Datasets: HuggingFaceH4/ultrachat_200k, nvidia/Nemotron-Post-Training-Dataset-v2
- Max sequence length: 4096
- All experts calibrated: `moe_calibrate_all_experts=True`
Credits
- Original NVFP4 quantization by Sehyo
- Base model: Qwen/Qwen3.5-122B-A10B by Alibaba Cloud