⚡ Performance and Efficiency Benchmarks

This section reports the performance of the LLaMA 3.x models running on the NPU with FastFlowLM (FLM).

Note:

  • Results are based on FastFlowLM v0.9.21.
  • Measurements were taken under FLM’s default NPU power mode (Performance).
  • Test system: AMD Ryzen™ AI 7 350 (Krackan Point) with 32 GB DRAM.
  • Newer versions may deliver improved performance.

🚀 Decoding Speed (TPS, or tokens per second, with decoding starting at different context lengths)

| Model | Hardware | 1k | 2k | 4k | 8k | 16k | 32k | 64k | 128k |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA 3.2 1B | NPU (FLM) | 62.7 | 58.8 | 52.7 | 44.8 | 33.7 | 23.6 | 14.6 | 10.6 |
| LLaMA 3.2 3B | NPU (FLM) | 26.2 | 24.6 | 22.1 | 18.3 | 13.7 | 9.1 | 6.8 | OOM |
| LLaMA 3.1 8B | NPU (FLM) | 12.7 | 12.4 | 11.6 | 10.4 | 8.6 | 6.3 | OOM | OOM |

  • OOM: out of memory.
  • The NPU can access less than 50% of system DRAM.
  • Systems with more than 32 GB of DRAM support longer context lengths; FLM supports the full context length available for each model.
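If you want to reproduce decoding numbers like the ones above, the standard convention is to exclude prefill: start the clock at the first generated token and divide the remaining tokens by the elapsed time. A minimal sketch (the function name and timestamp format are illustrative, not part of FLM):

```python
def decode_tps(token_times):
    """Compute decoding speed in tokens per second.

    token_times: arrival timestamps (in seconds) of each generated token.
    Prefill is excluded by measuring from the first token onward, so the
    first token contributes a timestamp but no counted interval.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two generated tokens")
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])
```

For example, four tokens arriving 0.1 s apart yield `decode_tps([0.0, 0.1, 0.2, 0.3])` = 10.0 TPS.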


🚀 Prefill Speed (TPS, or tokens per second, with different prompt lengths)

| Model | Hardware | 1k | 2k | 4k | 8k | 16k | 32k |
|---|---|---|---|---|---|---|---|
| LLaMA 3.2 1B | NPU (FLM) | 1442 | 1766 | 1750 | 1473 | 967 | 577 |
| LLaMA 3.2 3B | NPU (FLM) | 678 | 797 | 738 | 583 | 373 | 214 |
| LLaMA 3.1 8B | NPU (FLM) | 384 | 447 | 426 | 376 | 267 | 167 |
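The prefill and decoding tables can be combined into a rough end-to-end latency estimate: time-to-first-token is approximately the prompt length divided by prefill TPS, and generation time is the output length divided by decoding TPS. A small sketch (the function name is illustrative; the numbers in the usage note come from the tables above):

```python
def estimated_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough end-to-end latency estimate in seconds.

    time-to-first-token ~= prompt_tokens / prefill_tps
    generation time     ~= output_tokens / decode_tps
    """
    return prompt_tokens / prefill_tps + output_tokens / decode_tps
```

For example, LLaMA 3.1 8B with a 4k (4096-token) prompt prefills at about 426 TPS and decodes at about 11.6 TPS, so generating 256 tokens takes roughly 4096 / 426 + 256 / 11.6 ≈ 31.7 s, of which about 9.6 s is time-to-first-token.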