For Production Deployment:
Optimization Strategies:
The first lever is matching model size to your hardware. The measured generation throughput below shows how steeply speed drops as parameter count grows:
| Hardware | Llama 3.1 8B (tokens/sec) | Llama 3.3 70B (tokens/sec) | Llama 3.2 3B (tokens/sec) |
|---|---|---|---|
| RTX 4090 | 89.2 | 12.1 | |
| RTX 3090 | 67.4 | 8.3 | |
| A100 40GB | 156.7 | 45.2 | |
| M3 Max 128GB | 34.8 | 4.2 | |
| Strix Halo 128GB (ollama) | | 5.1 | 85.02 |
| Strix Halo 128GB (llama.cpp) | | | 90 |
| RTX 3060 | | | 131.76 |
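These numbers come straight from Ollama's own timing counters, so they are easy to reproduce on your hardware. A minimal sketch, assuming an Ollama server on the default `localhost:11434` and a model you have already pulled (the model name and prompt here are placeholders):

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint
MODEL = "llama3.2"                     # any locally pulled model

# A non-streaming /api/generate call returns timing counters in the
# final JSON: eval_count/eval_duration for generation, and
# prompt_eval_count/prompt_eval_duration for prompt processing.
resp = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": MODEL,
        "prompt": "Explain the difference between a process and a thread.",
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
data = resp.json()

# All durations are reported in nanoseconds.
eval_rate = data["eval_count"] / (data["eval_duration"] / 1e9)
prompt_rate = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
print(f"eval rate:        {eval_rate:.2f} tokens/s")
print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
```

Run it a few times and discard the first result: the first call also pays model load time (reported separately as `load_duration`). The CLI equivalent, `ollama run <model> --verbose`, prints the same counters, which is where per-model tables like the one below typically come from.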
| model | capabilities | parameters | context length [tokens] | quantization | eval rate [tokens/s] | prompt eval rate [tokens/s] |
|---|---|---|---|---|---|---|
| llama3.2 | completion, tools | 3.2B | 131072 | Q4_K_M | 88.14 | 715.43 |
| ministral-3:14b | completion, vision, tools | 13.9B | 262144 | Q4_K_M | 23.78 | 302.07 |
| qwen3-coder:30b | completion, tools | 30.5B | 262144 | Q4_K_M | 73.75 | 72.41 |
| llama3:70b | completion | 70.6B | 8192 | Q4 | 5.55 | 9.72 |
| llava | completion, vision | 7B | 32768 | Q4 | 49.92 | 207.27 |
| deepseek-coder-v2:16b | completion, insert | 15.7B | 163840 | Q4 | 84.44 | 111.71 |
| bjoernb/qwen3-coder-30b-1m:latest | completion, tools | 30.5B | 1048576 | Q4_K_M | 74.23 | 94.84 |
| freehuntx/qwen3-coder:8b | completion, tools | 8.2B | 40960 | Q4_K_M | 37.97 | 565.68 |
| networkjohnny/deepseek-coder-v2-lite-base-q4km-gguf:latest | completion, tools | 3.2B | 131072 | Q4_K_M | 86.02 | 1124.53 |
| phi4-mini | completion, tools | 3.8B | 131072 | Q4_K_M | 72.24 | 31.37 |
| qwen2.5:7b | completion, tools | 7.6B | 32768 | Q4_K_M | 42.98 | 153.34 |
| llama3.3:70b-instruct-q4KM | completion, tools | 70.6B | 131072 | Q4_K_M | 5.06 | 15.50 |
| functiongemma | completion, tools | 268.10M | 32768 | Q8_0 | 364.21 | 240.50 |
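The metadata columns (capabilities, parameters, context length, quantization) can be read back programmatically from Ollama's `/api/show` endpoint. A sketch, assuming a recent Ollama build that includes the `capabilities` list in the response (older versions omit it); the model names in the loop are just examples from the table:

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint

def describe(model: str) -> dict:
    """Collect the metadata columns of the table above for one model."""
    resp = requests.post(f"{OLLAMA_URL}/api/show", json={"model": model}, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    details = data.get("details", {})
    info = data.get("model_info", {})
    # The context length is stored under an architecture-specific key,
    # e.g. "llama.context_length" or "qwen2.context_length".
    ctx = next((v for k, v in info.items() if k.endswith(".context_length")), None)
    return {
        "model": model,
        "capabilities": ", ".join(data.get("capabilities", [])),
        "parameters": details.get("parameter_size"),
        "context": ctx,
        "quantization": details.get("quantization_level"),
    }

for m in ["llama3.2", "qwen2.5:7b"]:
    print(describe(m))
```

Combined with the throughput snippet above, this is enough to regenerate the whole table for whatever models you have pulled locally.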