====== LLM ======

  - https://collabnix.com/best-ollama-models-in-2025-complete-performance-comparison/

For Production Deployment:

  * Primary Choice: DeepSeek-R1 32B for reasoning-heavy applications
  * Coding Tasks: Qwen2.5-Coder 7B for optimal balance of capability and efficiency
  * General Purpose: Llama 3.3 70B for maximum versatility
  * Edge Computing: Phi-4 14B for resource-constrained environments

Optimization Strategies:

  * Always enable **Flash Attention** and KV-cache quantization
  * Use **Q4_K_M** quantization for production deployments
  * Implement caching for repeated queries
  * Monitor GPU memory usage and implement automatic model swapping
  * Use load balancing for high-throughput applications

^ Hardware ^ Llama 3.3 8B (tokens/sec) ^ Llama 3.3 70B (tokens/sec) ^ Llama 3.2 (tokens/sec) ^
| RTX 4090 | 89.2 | 12.1 |  |
| RTX 3090 | 67.4 | 8.3 |  |
| A100 40GB | 156.7 | 45.2 |  |
| M3 Max 128GB | 34.8 | 4.2 |  |
| Strix Halo 128GB ollama |  | 5.1 | 85.02 |
| Strix Halo 128GB llama.cpp |  |  | 90 |
| RTX 3060 |  |  | 131.76 |

^ model ^ capabilities ^ size ^ context ^ quantization ^ eval rate [token/s] ^ prompt eval rate [token/s] ^
| llama3.2 | completion tools | "3.2B" | 131072 | "Q4_K_M" | 88.14 | 715.43 |
| ministral-3:14b | completion vision tools | "13.9B" | 262144 | "Q4_K_M" | 23.78 | 302.07 |
| qwen3-coder:30b | completion tools | "30.5B" | 262144 | "Q4_K_M" | 73.75 | 72.41 |
| llama3:70b | completion | "70.6B" | 8192 | "Q4" | 5.55 | 9.72 |
| llava | completion vision | "7B" | 32768 | "Q4" | 49.92 | 207.27 |
| deepseek-coder-v2:16b | completion insert | "15.7B" | 163840 | "Q4" | 84.44 | 111.71 |
| bjoernb/qwen3-coder-30b-1m:latest | completion tools | "30.5B" | 1048576 | "Q4_K_M" | 74.23 | 94.84 |
| freehuntx/qwen3-coder:8b | completion tools | "8.2B" | 40960 | "Q4_K_M" | 37.97 | 565.68 |
| networkjohnny/deepseek-coder-v2-lite-base-q4_k_m-gguf:latest | completion tools | "3.2B" | 131072 | "Q4_K_M" | 86.02 | 1124.53 |
| phi4-mini | completion tools | "3.8B" | 131072 | "Q4_K_M" | 72.24 | 31.37 |
| qwen2.5:7b | completion tools | "7.6B" | 32768 | "Q4_K_M" | 42.98 | 153.34 |
| llama3.3:70b-instruct-q4_K_M | completion tools | "70.6B" | 131072 | "Q4_K_M" | 5.06 | 15.50 |
| functiongemma | completion tools | "268.10M" | 32768 | "Q8" | 364.21 | 240.50 |
| danielsheep/Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth | completion tools | "30.5B" | 1048576 | "Q4_K_M" | 71.60 | 33.14 |
| gpt-oss:20b | completion tools thinking | "20.9B" | 131072 | "MXFP4" | 47.32 | 402.47 |
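
The eval rate and prompt eval rate columns look like the figures Ollama reports (for example with ''ollama run --verbose''). As a rough sketch of how such numbers could be reproduced, the snippet below derives both rates from the timing fields of a non-streaming ''/api/generate'' response (''eval_count'', ''eval_duration'', ''prompt_eval_count'', ''prompt_eval_duration'', with durations in nanoseconds). The helper names ''rates_from_response'' and ''measure'', the default host and the prompt are assumptions made for illustration, not part of the measurements above.

```python
import json
import urllib.request

NS_PER_SECOND = 1_000_000_000


def rates_from_response(resp: dict) -> dict:
    """Derive tokens/s from Ollama timing fields (durations are in nanoseconds)."""

    def rate(count_key: str, duration_key: str):
        count = resp.get(count_key)
        duration = resp.get(duration_key)
        # Fields can be missing or zero (e.g. a fully cached prompt); report None then.
        if not count or not duration:
            return None
        return count / (duration / NS_PER_SECOND)

    return {
        "eval_rate": rate("eval_count", "eval_duration"),
        "prompt_eval_rate": rate("prompt_eval_count", "prompt_eval_duration"),
    }


def measure(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Hypothetical helper: run one non-streaming generation and return both rates."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return rates_from_response(json.load(response))


if __name__ == "__main__":
    # Assumes a local Ollama server with the model already pulled.
    print(measure("llama3.2", "Explain KV-cache quantization in two sentences."))
```

A single run is noisy, and the first request also includes model load time, so averaging a few runs after a warm-up request is probably closer to how a table like the one above would be filled in.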