====== LLM ======

- https://collabnix.com/best-ollama-models-in-2025-complete-performance-comparison/

For Production Deployment:
  * Primary choice: DeepSeek-R1 32B for reasoning-heavy applications
  * Coding tasks: Qwen2.5-Coder 7B for an optimal balance of capability and efficiency
  * General purpose: Llama 3.3 70B for maximum versatility
  * Edge computing: Phi-4 14B for resource-constrained environments

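The model recommendations above can be fetched from the ollama library. This is a sketch; the exact tags (''deepseek-r1:32b'', ''qwen2.5-coder:7b'', ''llama3.3:70b'', ''phi4:14b'') are assumptions based on common library naming, so check the ollama model library for the current names:

```shell
# Pull the recommended models (tags are assumed, verify in the ollama library)
ollama pull deepseek-r1:32b    # reasoning-heavy applications
ollama pull qwen2.5-coder:7b   # coding tasks
ollama pull llama3.3:70b       # general purpose; needs roughly 40+ GB RAM/VRAM at Q4
ollama pull phi4:14b           # edge / resource-constrained environments
```

`ollama list` afterwards shows what was downloaded and the on-disk size of each model.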
Optimization Strategies:
  * Always enable **Flash Attention** and KV-cache quantization
  * Use **Q4_K_M** quantization for production deployments
  * Implement caching for repeated queries
  * Monitor GPU memory usage and implement automatic model swapping
  * Use load balancing for high-throughput applications

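The first two points are controlled through environment variables on the ollama server. A minimal sketch, assuming a systemd-free manual start; ''q8_0'' is one common choice for the cache type, not the only valid one:

```shell
# Enable Flash Attention and quantize the KV cache before starting the server.
# OLLAMA_KV_CACHE_TYPE accepts f16 (default), q8_0, or q4_0;
# q8_0 roughly halves KV-cache memory with minimal quality loss.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```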
^ Hardware ^ Llama 3.3 8B (tokens/sec) ^ Llama 3.3 70B (tokens/sec) ^ Llama 3.2 (tokens/sec) ^
| RTX 4090 | 89.2 | 12.1 | |
| RTX 3090 | 67.4 | 8.3 | |
| A100 40GB | 156.7 | 45.2 | |
| M3 Max 128GB | 34.8 | 4.2 | |
| Strix Halo 128GB ollama | | 5.1 | 85.02 |
| Strix Halo 128GB llama.cpp | | | 90 |
| RTX 3060 | | | 131.76 |

===== ROCm =====
^ model ^ capabilities ^ size ^ context ^ quantization ^ eval rate [token/s] ^ prompt eval rate [token/s] ^
| llama3.2 | completion tools | 3.2B | 131072 | Q4_K_M | 52.78 | 1957.30 |
| qwen-strixhalo | completion tools | 30.5B | 262144 | Q4_K_M | 53.54 | 1056.37 |
| qwen3-coder | completion tools | 30.5B | 262144 | Q4_K_M | 52.10 | 776.55 |
| qwen3:30b-a3b | completion tools thinking | 30.5B | 262144 | Q4_K_M | 50.19 | 803.06 |
| gpt-oss:20b | completion tools thinking | 20.9B | 131072 | MXFP4 | 45.37 | 519.90 |
| glm-4.7-flash | completion tools thinking | 29.9B | 202752 | Q4_K_M | 41.54 | 470.09 |
| qwen3:8b | completion tools thinking | 8.2B | 40960 | Q4_K_M | 32.68 | 890.98 |
| qwen3-coder-next | completion tools | 79.7B | 262144 | Q4_K_M | 33.06 | 380.21 |
| qwen2.5-coder:14b-instruct-q4_K_M | completion tools insert | 14.8B | 32768 | Q4_K_M | 17.25 | 527.74 |

===== Vulkan =====
^ model ^ capabilities ^ size ^ context ^ quantization ^ eval rate [token/s] ^ prompt eval rate [token/s] ^
| qwen3-coder | completion tools | 30.5B | 262144 | Q4_K_M | 54.03 | 805.43 |
| llama3.2 | completion tools | 3.2B | 131072 | Q4_K_M | 52.54 | 1838.82 |
| gpt-oss:20b | completion tools thinking | 20.9B | 131072 | MXFP4 | 43.36 | 475.60 |
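The eval-rate and prompt-eval-rate figures in the tables above can be reproduced with ollama's verbose mode, which prints timing statistics after each response. A sketch; the model tag and prompt are just examples:

```shell
# --verbose makes ollama report "eval rate" (generation tokens/s)
# and "prompt eval rate" (prompt-processing tokens/s) after the reply.
ollama run llama3.2 --verbose "Write a haiku about GPUs."
```

Run the same prompt a couple of times and discard the first result, since the initial run includes model load time.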

Ollama Modelfile example:
<code>
FROM qwen3-coder

# STRIX HALO AGENTIC TUNING
PARAMETER num_ctx 128000
PARAMETER num_batch 1024
PARAMETER num_predict 4096

SYSTEM """
You are a Strix Halo Optimized Coding Agent.
Always use asynchronous patterns and favor memory-efficient algorithms.
"""
</code>
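Assuming the Modelfile above is saved as ''Modelfile'' in the current directory, the tuned model can be built and tested as follows (the model name ''qwen3-coder-strix'' is an arbitrary example):

```shell
# Build a local model from the Modelfile
ollama create qwen3-coder-strix -f Modelfile

# Confirm the parameters and system prompt took effect
ollama show qwen3-coder-strix

# Interactive test run
ollama run qwen3-coder-strix
```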
  
tips/llm.1766126549.txt.gz · Last modified: by sscipioni