⚡ Performance and Efficiency Benchmarks

This section reports the performance of Qwen 3 on NPU with FastFlowLM (FLM).

Note:

  • Results are based on FastFlowLM v0.9.8.
  • Under FLM’s default NPU power mode (Performance)
  • Test system spec: AMD Ryzen™ AI 7 350 (Krakan Point) with 32 GB DRAM.
  • Newer versions may deliver improved performance.

🚀 Decoding Speed (TPS, or Tokens per Second, @ different context lengths)

Model Hardware 1k 2k 4k 8k 16k 32k Model
Qwen 3 0.6B NPU (FLM) 50.4 45.3 36.0 25.7 16.4 9.6 Qwen 3 0.6B
Qwen 3 1.7B NPU (FLM) 27.8 26.1 22.7 18.3 13.1 8.3 Qwen 3 1.7B
Qwen 3 4B NPU (FLM) 14.0 13.3 11.9 10.1 7.7 5.3 Qwen 3 4B
Qwen 3 8B NPU (FLM) 8.1 7.9 7.4 6.6 5.5 4.1 Qwen 3 8B

🚀 Prefill Speed (TTFT, or Time to First Token in Seconds, with different prompt lengths)

Model Hardware 1k 2k 4k 8k 16k 32k Model
Qwen 3 0.6B NPU (FLM) 0.88 1.42 3.59 10.66 37.20 139.44 Qwen 3 0.6B
Qwen 3 1.7B NPU (FLM) 1.19 2.11 4.70 12.50 40.81 146.53 Qwen 3 1.7B
Qwen 3 4B NPU (FLM) 2.04 3.61 7.68 19.23 58.85 203.42 Qwen 3 4B
Qwen 3 8B NPU (FLM) 2.87 4.87 10.17 23.87 68.02 224.52 Qwen 3 8B