Qwen3.5-122B-A10B-MTP-NVFP4

This is Sehyo/Qwen3.5-122B-A10B-NVFP4 with MTP (Multi-Token Prediction) weights added, enabling speculative decoding with vLLM.

The original Sehyo NVFP4 checkpoint omitted the MTP head during quantization. The MTP weights were extracted from the official Qwen/Qwen3.5-122B-A10B BF16 checkpoint and added as a separate shard (mtp_weights.safetensors). The model.safetensors.index.json was updated to include the MTP tensor references.
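The index update itself is just a JSON edit: every MTP tensor name is pointed at the new shard in the weight_map, and the total size is bumped. A minimal sketch of that step (the helper name and example tensor names are illustrative, not the exact script used):

```python
import json

MTP_SHARD = "mtp_weights.safetensors"

def add_mtp_to_index(index: dict, mtp_tensor_names: list, shard_size: int) -> dict:
    """Map each MTP tensor to the new shard and bump the recorded total size."""
    for name in mtp_tensor_names:
        index["weight_map"][name] = MTP_SHARD
    index["metadata"]["total_size"] += shard_size
    return index

# Toy example: in practice, load model.safetensors.index.json, list the
# tensors saved into mtp_weights.safetensors, and rewrite the index file.
index = {
    "metadata": {"total_size": 100},
    "weight_map": {"model.embed_tokens.weight": "model-00001-of-00002.safetensors"},
}
updated = add_mtp_to_index(index, ["mtp.layers.0.input_layernorm.weight"], 42)
print(json.dumps(updated, indent=2))
```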

What's Different from Sehyo/Qwen3.5-122B-A10B-NVFP4

  • βœ… mtp_weights.safetensors β€” MTP head weights (extracted from BF16 base, ~1.4 GB)
  • βœ… model.safetensors.index.json β€” Updated to map mtp.* tensors to the new shard
  • βœ… config.json β€” Added in_proj_ba and in_proj_qkvz to quantization ignore list (avoids vLLM loading warnings)
  • βœ… All other config/tokenizer files included in this repo

Downloading the Model

The two large model weight shards (model-00001-of-00002.safetensors, 50 GB; model-00002-of-00002.safetensors, 26 GB) are hosted in Sehyo's original repo due to HuggingFace's per-file size limits. Download everything as follows:

# Install
pip install huggingface_hub

# Download model shards from Sehyo's repo
huggingface-cli download Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  model-00001-of-00002.safetensors \
  model-00002-of-00002.safetensors \
  --local-dir ./Qwen3.5-122B-A10B-MTP-NVFP4

# Download everything else (MTP weights + updated configs) from this repo
huggingface-cli download scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --local-dir ./Qwen3.5-122B-A10B-MTP-NVFP4

This will merge both into a single local directory ready for use.
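To sanity-check the merged directory, you can verify that every shard the index references is actually on disk. A small sketch (the helper names are illustrative):

```python
import json
from pathlib import Path

def missing_shards(index: dict, present: set) -> set:
    """Return shard filenames the index references that are not in `present`."""
    return set(index["weight_map"].values()) - present

def check_local_dir(model_dir: str) -> set:
    """Read the index from a local model directory and report absent shards."""
    d = Path(model_dir)
    index = json.loads((d / "model.safetensors.index.json").read_text())
    return missing_shards(index, {p.name for p in d.glob("*.safetensors")})

# Toy index: before downloading the MTP shard, it shows up as missing.
demo_index = {"weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "mtp.layers.0.weight": "mtp_weights.safetensors",
}}
missing = missing_shards(demo_index, {"model-00001-of-00002.safetensors"})
```

Running check_local_dir("./Qwen3.5-122B-A10B-MTP-NVFP4") after both downloads should return an empty set.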

Usage with vLLM (Speculative Decoding)

For n-gram speculation (does not use the MTP head):

vllm serve scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --speculative-model "[ngram]" \
  --num-speculative-tokens 2 \
  --kv-cache-dtype fp8 \
  ...

Or with the MTP draft model approach (requires patched vLLM with qwen3_5_mtp.py):

vllm serve scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --speculative-model scottgl/Qwen3.5-122B-A10B-MTP-NVFP4 \
  --speculative-model-quantization nvfp4 \
  --num-speculative-tokens 2 \
  ...
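Either way, the server exposes vLLM's OpenAI-compatible API. A minimal request sketch, assuming the default port 8000 (the helper name is illustrative):

```python
import json

def build_completion_request(model: str, prompt: str, max_tokens: int = 128) -> bytes:
    """Build a JSON body for the OpenAI-compatible /v1/completions endpoint."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode("utf-8")

body = build_completion_request(
    "scottgl/Qwen3.5-122B-A10B-MTP-NVFP4",
    "Explain speculative decoding in one sentence.",
)
# POST `body` to http://localhost:8000/v1/completions with
# Content-Type: application/json (e.g. via urllib.request or curl).
```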

Changelog

  • 28/02/2026: Added MTP weights from BF16 base; updated index and config.
  • 25/02/2026: Original NVFP4 quantization by Sehyo.

Original Model Details (from Sehyo)

  • Quantization: NVFP4 via vLLM's LLM Compressor
  • Calibration Samples: 512 (256 from each dataset)
  • Datasets: HuggingFaceH4/ultrachat_200k, nvidia/Nemotron-Post-Training-Dataset-v2
  • Max sequence length: 4096
  • All experts calibrated: moe_calibrate_all_experts=True

Credits

  β€’ Sehyo for the original NVFP4 quantization
  β€’ Qwen for the official Qwen3.5-122B-A10B BF16 checkpoint the MTP weights were extracted from
