Qwen3.5-122B-A10B-MTP-NVFP4

This is Sehyo/Qwen3.5-122B-A10B-NVFP4 with MTP (Multi-Token Prediction) weights added, enabling speculative decoding with vLLM.

The original Sehyo NVFP4 checkpoint omitted the MTP head during quantization. The MTP weights were extracted from the official Qwen/Qwen3.5-122B-A10B BF16 checkpoint and added as a separate shard (mtp_weights.safetensors). The model.safetensors.index.json was updated to include the MTP tensor references.
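The index update itself is just a JSON edit: every MTP tensor name is pointed at the new shard in the weight_map, and the total size is bumped. A minimal sketch of that step (the helper name and example tensor names are illustrative, not the exact script used):

```python
import json

MTP_SHARD = "mtp_weights.safetensors"

def add_mtp_to_index(index: dict, mtp_tensor_names: list, shard_size: int) -> dict:
    """Map each MTP tensor to the new shard and bump the recorded total size."""
    for name in mtp_tensor_names:
        index["weight_map"][name] = MTP_SHARD
    index["metadata"]["total_size"] += shard_size
    return index

# Toy example: in practice, load model.safetensors.index.json, list the
# tensors saved into mtp_weights.safetensors, and rewrite the index file.
index = {
    "metadata": {"total_size": 100},
    "weight_map": {"model.embed_tokens.weight": "model-00001-of-00002.safetensors"},
}
updated = add_mtp_to_index(index, ["mtp.layers.0.input_layernorm.weight"], 42)
print(json.dumps(updated, indent=2))
```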

What's Different from Sehyo/Qwen3.5-122B-A10B-NVFP4

  • βœ… mtp_weights.safetensors β€” MTP head weights (extracted from BF16 base, ~1.4 GB)
  • βœ… model.safetensors.index.json β€” Updated to map mtp.* tensors to the new shard
  • βœ… config.json β€” Added in_proj_ba and in_proj_qkvz to quantization ignore list (avoids vLLM loading warnings)
  • βœ… All other config/tokenizer files included in this repo

Downloading the Model

The two large model weight shards (model-00001-of-00002.safetensors, 50 GB; model-00002-of-00002.safetensors, 26 GB) are hosted in Sehyo's original repo due to HuggingFace's per-file size limits. Download everything as follows:

# Install
pip install huggingface_hub

# Download model shards from Sehyo's repo
huggingface-cli download Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  model-00001-of-00002.safetensors \
  model-00002-of-00002.safetensors \
  --local-dir ./Qwen3.5-122B-A10B-MTP-NVFP4

# Download everything else (MTP weights + updated configs) from this repo
huggingface-cli download scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --local-dir ./Qwen3.5-122B-A10B-MTP-NVFP4

This will merge both into a single local directory ready for use.
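To sanity-check the merged directory, you can verify that every shard the index references is actually on disk. A small sketch (the helper names are illustrative):

```python
import json
from pathlib import Path

def missing_shards(index: dict, present: set) -> set:
    """Return shard filenames the index references that are not in `present`."""
    return set(index["weight_map"].values()) - present

def check_local_dir(model_dir: str) -> set:
    """Read the index from a local model directory and report absent shards."""
    d = Path(model_dir)
    index = json.loads((d / "model.safetensors.index.json").read_text())
    return missing_shards(index, {p.name for p in d.glob("*.safetensors")})

# Toy index: before downloading the MTP shard, it shows up as missing.
demo_index = {"weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "mtp.layers.0.weight": "mtp_weights.safetensors",
}}
missing = missing_shards(demo_index, {"model-00001-of-00002.safetensors"})
```

Running check_local_dir("./Qwen3.5-122B-A10B-MTP-NVFP4") after both downloads should return an empty set.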

Usage with vLLM (Speculative Decoding)

For n-gram speculation (does not use the MTP head):

vllm serve scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --speculative-model "[ngram]" \
  --num-speculative-tokens 2 \
  --kv-cache-dtype fp8 \
  ...

Or with the MTP draft model approach (requires patched vLLM with qwen3_5_mtp.py):

vllm serve scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --speculative-model scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --speculative-model-quantization nvfp4 \
  --num-speculative-tokens 2 \
  ...
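Either way, the server exposes vLLM's OpenAI-compatible API. A minimal request sketch, assuming the default port 8000 (the helper name is illustrative):

```python
import json

def build_completion_request(model: str, prompt: str, max_tokens: int = 128) -> bytes:
    """Build a JSON body for the OpenAI-compatible /v1/completions endpoint."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode("utf-8")

body = build_completion_request(
    "scottgl/Qwen3.5-122B-A10B-MTP-NVFP4",
    "Explain speculative decoding in one sentence.",
)
# POST `body` to http://localhost:8000/v1/completions with
# Content-Type: application/json (e.g. via urllib.request or curl).
```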

Changelog

  • 28/02/2026: Added MTP weights from BF16 base; updated index and config.
  • 25/02/2026: Original NVFP4 quantization by Sehyo.

Original Model Details (from Sehyo)

  • Quantization: NVFP4 via vLLM's LLM Compressor
  • Calibration Samples: 512 (256 from each dataset)
  • Datasets: HuggingFaceH4/ultrachat_200k, nvidia/Nemotron-Post-Training-Dataset-v2
  • Max sequence length: 4096
  • All experts calibrated: moe_calibrate_all_experts=True

Credits

  β€’ Sehyo for the original NVFP4 quantization
  β€’ Qwen for the official Qwen3.5-122B-A10B BF16 checkpoint the MTP weights were extracted from
