====== LLM ======

- https://

For Production Deployment (example pull commands follow this list):
  * Primary Choice: DeepSeek-R1 32B for reasoning-heavy applications
  * Coding Tasks: Qwen2.5-Coder 7B for an optimal balance of capability and efficiency
  * General Purpose: Llama 3.3 70B for maximum versatility
  * Edge Computing: Phi-4 14B for resource-constrained environments
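
A minimal sketch of fetching these models with ollama (the runtime used in the benchmark table below); the exact library tags are assumptions and may need adjusting:

<code bash>
# Assumed ollama library tags for the recommended models; verify against the ollama registry.
ollama pull deepseek-r1:32b    # reasoning-heavy applications
ollama pull qwen2.5-coder:7b   # coding tasks
ollama pull llama3.3:70b       # general purpose (~40 GB+ of memory at Q4_K_M)
ollama pull phi4:14b           # edge / resource-constrained hosts
</code>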
| + | |||
| + | Optimization Strategies: | ||
| + | * Always enable **Flash Attention** and KV-cache quantization | ||
| + | * Use **Q4_K_M** quantization for production deployments | ||
| + | * Implement caching for repeated queries | ||
| + | * Monitor GPU memory usage and implement automatic model swapping | ||
| + | * Use load balancing for high-throughput applications | ||
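
The first two points map to ollama server environment variables; a minimal sketch, assuming a recent ollama release that supports ''OLLAMA_FLASH_ATTENTION'' and ''OLLAMA_KV_CACHE_TYPE'':

<code bash>
# Enable Flash Attention and quantize the KV cache before starting the server.
# OLLAMA_KV_CACHE_TYPE accepts f16 (default), q8_0 or q4_0; q8_0 roughly halves
# KV-cache memory with minimal quality loss.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
</code>

Note that Q4_K_M weights are selected per model via the tag you pull, not via the server configuration.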
| + | |||
| + | |||
| + | |||
^ Hardware ^ Llama 3.3 8B (tokens/s) ^  ^  ^
| RTX 4090 | 89.2 | 12.1 | |
| RTX 3090 | 67.4 | 8.3 | |
| A100 40GB | 156.7 | 45.2 | |
| M3 Max 128GB | 34.8 | 4.2 | |
| Strix Halo 128GB ollama | | 5.1 | 85.02 |
| Strix Halo 128GB llama.cpp | | | 90 |
| RTX 3060 | | | 131.76 |
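
Throughput figures like these can be reproduced with ollama's built-in timing output; a sketch (the model tag here is just an example):

<code bash>
# --verbose prints timing statistics after the response, including
# "eval rate: NN.NN tokens/s", which is the generation throughput.
ollama run llama3.3:70b --verbose "Explain KV-cache quantization in one paragraph."
</code>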
| llava | completion vision | " |
| deepseek-coder-v2: | | |
| bjoernb/ | | |
| freehuntx/ | | |
| networkjohnny/ | | |
| phi4-mini | | |
| qwen2.5: | | |
| llama3.3: | | |
| functiongemma | | |
| danielsheep/ | | |
| gpt-oss: | | |
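
Any of these models can be queried through ollama's HTTP API on its default port; a minimal sketch using ''phi4-mini'' from the table above:

<code bash>
# Single non-streaming completion against a local ollama server (default port 11434).
curl -s http://localhost:11434/api/generate -d '{
  "model": "phi4-mini",
  "prompt": "Write a one-line summary of Q4_K_M quantization.",
  "stream": false
}'
</code>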