⚡ Performance and Efficiency Benchmarks

This section reports the performance of LLaMA 3.x models running on the NPU with FastFlowLM (FLM).

Note:

  • Results are based on FastFlowLM v0.9.8, running in FLM's default NPU power mode (Performance).
  • Test system: AMD Ryzen™ AI 7 350 (Krackan Point) with 32 GB DRAM.
  • Newer versions may deliver improved performance.

🚀 Decoding Speed (TPS, or Tokens per Second, at different context lengths)

| Model        | Hardware  | 1k   | 2k   | 4k   | 8k   | 16k  | 32k  | 64k  | 128k |
|--------------|-----------|------|------|------|------|------|------|------|------|
| LLaMA 3.2 1B | NPU (FLM) | 41.5 | 40.6 | 38.1 | 33.2 | 25.6 | 18.6 | 12.2 | 8.9  |
| LLaMA 3.2 3B | NPU (FLM) | 18.3 | 17.8 | 15.9 | 13.6 | 10.5 | 7.3  | 6.3  | OOM  |
| LLaMA 3.1 8B | NPU (FLM) | 9.1  | 9.0  | 8.3  | 7.5  | 6.2  | 4.6  | OOM  | OOM  |

OOM: Out of Memory. Systems with more than 32 GB DRAM support longer context lengths; FLM supports the full context length available for each model.
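To make the table's units concrete, here is a minimal sketch of how a TPS figure relates to wall-clock decode time. The function and the example timing are illustrative, not taken from FLM's measurement harness:

```python
def tokens_per_second(num_generated_tokens: int, decode_time_s: float) -> float:
    """Decoding speed: tokens generated divided by wall-clock decode time."""
    return num_generated_tokens / decode_time_s

# Illustration: at ~41.5 TPS (LLaMA 3.2 1B, 1k context in the table above),
# generating 1000 tokens takes roughly 1000 / 41.5 ≈ 24.1 s.
print(round(tokens_per_second(1000, 24.1), 1))  # → 41.5
```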


🚀 Prefill Speed (TTFT, or Time to First Token in Seconds, with different prompt lengths)

| Model        | Hardware  | 1k   | 2k   | 4k   | 8k    | 16k   | 32k    |
|--------------|-----------|------|------|------|-------|-------|--------|
| LLaMA 3.2 1B | NPU (FLM) | 0.76 | 1.16 | 2.41 | 5.76  | 16.93 | 57.02  |
| LLaMA 3.2 3B | NPU (FLM) | 1.81 | 2.64 | 5.56 | 14.05 | 43.88 | 152.91 |
| LLaMA 3.1 8B | NPU (FLM) | 3.04 | 4.53 | 9.32 | 21.79 | 61.46 | 195.65 |
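A TTFT number also implies a prefill throughput: prompt tokens processed per second before the first output token appears. The sketch below assumes "1k" means 1024 tokens, which may differ from how the benchmark counts prompt length:

```python
def prefill_throughput(prompt_tokens: int, ttft_s: float) -> float:
    """Prefill throughput implied by a TTFT measurement (tokens/second)."""
    return prompt_tokens / ttft_s

# Assuming a 1k prompt is 1024 tokens, LLaMA 3.2 1B's 0.76 s TTFT implies
# roughly 1024 / 0.76 ≈ 1347 prompt tokens processed per second.
print(round(prefill_throughput(1024, 0.76)))  # → 1347
```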