⚡ Performance and Efficiency Benchmarks

This section reports the performance of Qwen 3 on NPU with FastFlowLM (FLM).

Note:

  • Results are based on FastFlowLM v0.9.19.
  • Under FLM’s default NPU power mode (Performance)
  • Test system spec: AMD Ryzen™ AI 7 350 (Krakan Point) with 32 GB DRAM.
  • Newer versions may deliver improved performance.

🚀 Decoding Speed (TPS, or Tokens per Second, starting @ different context lengths)

Model Hardware 1k 2k 4k 8k 16k 32k Model
Qwen 3 0.6B NPU (FLM) 53.7 47.3 38.7 28.7 18.9 11.2 Qwen 3 0.6B
Qwen 3 1.7B NPU (FLM) 28.4 26.8 23.9 19.6 14.5 9.5 Qwen 3 1.7B
Qwen 3 4B NPU (FLM) 14.3 13.7 12.7 11.1 8.8 6.3 Qwen 3 4B
Qwen 3 8B NPU (FLM) 8.3 8.2 7.7 7.2 6.1 4.8 Qwen 3 8B

🚀 Prefill Speed (TPS, or Tokens per Second, with different prompt lengths)

Model Hardware 1k 2k 4k 8k 16k 32k Model
Qwen 3 0.6B NPU (FLM) 1280 1356 1128 779 444 236 Qwen 3 0.6B
Qwen 3 1.7B NPU (FLM) 939 1029 916 674 408 225 Qwen 3 1.7B
Qwen 3 4B NPU (FLM) 509 574 543 435 282 164 Qwen 3 4B
Qwen 3 8B NPU (FLM) 357 420 408 345 243 150 Qwen 3 8B

🚀 Prefill TTFT with Image Input (seconds)

Prefill time-to-first-token (TTFT) for Qwen3-VL-4B on NPU (FastFlowLM) with different image resolutions.

Model Hardware 720p (1280×720) 1080p (1920×1080) 2K (2560×1440) 4K (3840×2160)
Qwen3-VL-4B NPU (FLM) 3.5 9.1 21.0 84.7

This test uses a short prompt: “Describe this image.”