stepfun-ai/Step-3.7-Flash

Production-grade vision-language MoE (~198B total / 11B active parameters) combining a 196B sparse language backbone with a 1.8B perception encoder, hybrid SWA/Global attention, and 3-way Multi-Token Prediction

Sparse MoE VLM with hybrid attention and 3-layer MTP speculative decoding

View on HuggingFace

moe198B / 11B262,144 ctxvLLM nightly+multimodal

Guide

Overview

Step-3.7-Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model from StepFun, pairing a 196B language backbone with a 1.8B perception encoder. It activates ~11B parameters per token and supports a 256k context window with three selectable reasoning levels (low / medium / high).

Key highlights:

Multimodal Understanding: Native vision encoder for single and multi-image inputs alongside text
Hybrid Attention Architecture: Interleaves Sliding Window Attention (512-token window) and Global Attention at a 3:1 ratio
Sparse MoE: 11B active parameters out of 198B total
Multi-Layer MTP: 3-way Multi-Token Prediction (MTP-3) for low-latency reasoning chains

Available precisions:

stepfun-ai/Step-3.7-Flash (BF16)
stepfun-ai/Step-3.7-Flash-FP8
stepfun-ai/Step-3.7-Flash-NVFP4 (Blackwell only)

Prerequisites

vLLM version: nightly (the model registry hasn't shipped in a stable release yet)
Hardware: 8xH200/B200 for BF16 and FP8; 4xB200 for NVFP4

Install vLLM (nightly)

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

Or via Docker:

docker pull vllm/vllm-openai:stepfun37

Launching the Server

BF16

vllm serve stepfun-ai/Step-3.7-Flash \
    --served-model-name step3p7-flash \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
    --trust-remote-code

FP8

vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
    --served-model-name step3p7-flash \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
    --trust-remote-code

NVFP4 (Blackwell only)

Requires modelopt quantization and FP8 KV cache alignment.

vllm serve stepfun-ai/Step-3.7-Flash-NVFP4 \
    --served-model-name step3p7 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --enable-expert-parallel \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --async-scheduling \
    --trust-remote-code

DGX Station Single-GPU

The DGX Station ships a single GB300 Grace-Blackwell Ultra Superchip with 252 GB of HBM3e, so the FP8 and NVFP4 checkpoints both fit entirely in VRAM on one GPU (BF16 at ~475 GB does not). Use the dedicated vllm/vllm-openai:stepfun37 image and serve on a single GPU:

vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
    --served-model-name step3p7-flash \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --kv-cache-dtype fp8 \
    --reasoning-parser step3p5 \
    --tool-call-parser step3p5 \
    --enable-auto-tool-choice \
    --trust-remote-code

Swap in stepfun-ai/Step-3.7-Flash-NVFP4 to run the NVFP4 checkpoint instead. Or launch the prebuilt container directly:

docker run -d --name vllm-server \
    --gpus all --ipc host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -p 8000:8000 \
    -e HF_TOKEN="$HF_TOKEN" \
    -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
    vllm/vllm-openai:stepfun37 \
    stepfun-ai/Step-3.7-Flash-FP8 \
      --gpu-memory-utilization 0.95 \
      --trust-remote-code \
      --reasoning-parser step3p5 \
      --enable-auto-tool-choice \
      --tool-call-parser step3p5 \
      --kv-cache-dtype fp8

See the NVIDIA build DGX Station instructions for the full container setup.

Benchmarking

vllm bench serve \
    --backend vllm \
    --model stepfun-ai/Step-3.7-Flash \
    --endpoint /v1/completions \
    --dataset-name random \
    --random-input 2048 \
    --random-output 1024 \
    --max-concurrency 10 \
    --num-prompt 100

Troubleshooting

MoE kernel tuning: See tune-moe-kernel to tune Triton kernels for your hardware.
NVFP4 + TP > 4: The author recommends TP4+EP for NVFP4. Higher TP isn't validated.
Cascade attention: Always pass --disable-cascade-attn — the hybrid SWA/GA schedule is not compatible with cascade attention in vLLM.