stepfun-ai/Step-3.7-Flash
Production-grade vision-language MoE (~198B total / 11B active parameters) combining a 196B sparse language backbone with a 1.8B perception encoder, hybrid SWA/Global attention, and 3-way Multi-Token Prediction
Sparse MoE VLM with hybrid attention and 3-layer MTP speculative decoding
Guide
Overview
Step-3.7-Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model from StepFun, pairing a 196B language backbone with a 1.8B perception encoder. It activates ~11B parameters per token and supports a 256k context window with three selectable reasoning levels (low / medium / high).
Key highlights:
- Multimodal Understanding: Native vision encoder for single and multi-image inputs alongside text
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (512-token window) and Global Attention at a 3:1 ratio
- Sparse MoE: 11B active parameters out of 198B total
- Multi-Layer MTP: 3-way Multi-Token Prediction (MTP-3) for low-latency reasoning chains
Available precisions:
- stepfun-ai/Step-3.7-Flash (BF16)
- stepfun-ai/Step-3.7-Flash-FP8
- stepfun-ai/Step-3.7-Flash-NVFP4 (Blackwell only)
Prerequisites
- vLLM version: nightly (the model registry hasn't shipped in a stable release yet)
- Hardware: 8xH200/B200 for BF16 and FP8; 4xB200 for NVFP4
Install vLLM (nightly)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
--extra-index-url https://wheels.vllm.ai/nightly
Or via Docker:
docker pull vllm/vllm-openai:stepfun37
Launching the Server
BF16
vllm serve stepfun-ai/Step-3.7-Flash \
--served-model-name step3p7-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
--trust-remote-code
FP8
vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
--served-model-name step3p7-flash \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--disable-cascade-attn \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
--trust-remote-code
NVFP4 (Blackwell only)
Requires modelopt quantization and FP8 KV cache alignment.
vllm serve stepfun-ai/Step-3.7-Flash-NVFP4 \
--served-model-name step3p7 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--enable-expert-parallel \
--quantization modelopt \
--kv-cache-dtype fp8 \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--async-scheduling \
--trust-remote-code
DGX Station Single-GPU
The DGX Station
ships a single GB300 Grace-Blackwell Ultra Superchip with 252 GB of HBM3e, so the
FP8 and NVFP4 checkpoints both fit entirely in VRAM on one GPU (BF16 at ~475 GB does
not). Use the dedicated vllm/vllm-openai:stepfun37 image and serve on a single GPU:
vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
--served-model-name step3p7-flash \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--reasoning-parser step3p5 \
--tool-call-parser step3p5 \
--enable-auto-tool-choice \
--trust-remote-code
Swap in stepfun-ai/Step-3.7-Flash-NVFP4 to run the NVFP4 checkpoint instead. Or
launch the prebuilt container directly:
docker run -d --name vllm-server \
--gpus all --ipc host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-p 8000:8000 \
-e HF_TOKEN="$HF_TOKEN" \
-v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
vllm/vllm-openai:stepfun37 \
stepfun-ai/Step-3.7-Flash-FP8 \
--gpu-memory-utilization 0.95 \
--trust-remote-code \
--reasoning-parser step3p5 \
--enable-auto-tool-choice \
--tool-call-parser step3p5 \
--kv-cache-dtype fp8
See the NVIDIA build DGX Station instructions for the full container setup.
Benchmarking
vllm bench serve \
--backend vllm \
--model stepfun-ai/Step-3.7-Flash \
--endpoint /v1/completions \
--dataset-name random \
--random-input 2048 \
--random-output 1024 \
--max-concurrency 10 \
--num-prompt 100
Troubleshooting
- MoE kernel tuning: See tune-moe-kernel to tune Triton kernels for your hardware.
- NVFP4 + TP > 4: The author recommends TP4+EP for NVFP4. Higher TP isn't validated.
- Cascade attention: Always pass
--disable-cascade-attn— the hybrid SWA/GA schedule is not compatible with cascade attention in vLLM.