tips:llm [2026/02/15 08:29] (current) – sscipioni
====== LLM ======
  - https://

For Production Deployment:
  * General Purpose: **Llama 3.3 70B** for maximum versatility
  * Edge Computing: **Phi-4 14B** for resource-constrained environments

Optimization Strategies:
  * Always enable **Flash Attention** and KV-cache quantization
  * Use **Q4_K_M** quantization for production deployments
  * Implement caching for repeated queries
  * Monitor GPU memory usage and implement automatic model swapping
  * Use load balancing for high-throughput applications
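Flash Attention and the KV-cache quantization from the list above are toggled through the ollama server's environment variables; a minimal sketch (the ''q8_0'' cache type is an assumption — ''q4_0'' is also accepted for more aggressive savings at some quality cost):

```shell
# Enable Flash Attention in the ollama server (CUDA/ROCm builds)
export OLLAMA_FLASH_ATTENTION=1
# Quantize the KV cache to 8-bit to roughly halve context memory
export OLLAMA_KV_CACHE_TYPE=q8_0
# Restart the server afterwards so the settings take effect, e.g.:
#   systemctl restart ollama   (or re-run `ollama serve`)
```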
| + | |||
| + | |||
^ Hardware ^ Llama 3.3 8B (tokens/s) ^  ^  ^
| RTX 4090 | 89.2 | 12.1 | |
| RTX 3090 | 67.4 | 8.3 | |
| A100 40GB | 156.7 | 45.2 | |
| M3 Max 128GB | 34.8 | 4.2 | |
| Strix Halo 128GB ollama | | 5.1 | 85.02 |
| Strix Halo 128GB llama.cpp | | | 90 |
| RTX 3060 | | | 131.76 |
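The tokens/s columns above can be measured locally with ollama's ''--verbose'' flag, which prints prompt and generation throughput after each response; a sketch, assuming the model tag is already pulled and the server is running:

```shell
# Run once with timings; the closing "eval rate: N tokens/s" line
# is the generation throughput reported in the table above
ollama pull qwen3-coder
ollama run qwen3-coder --verbose "Write a hello world in Rust"
```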
| + | |||
| + | |||
| + | ROCM | ||
^ model ^ capabilities ^
| llama3.2 | |
| qwen-strixhalo | |
| qwen3-coder | |
| qwen3:30b-a3b | |
| gpt-oss:20b | |
| glm-4.7-flash | |
| qwen3: | |
| qwen3-coder-next | completion tools |
| qwen2.5-coder:14b-instruct-q4_K_M | |
| + | |||
| + | |||
===== VULKAN =====
^ model ^ capabilities ^
| qwen3-coder | |
| llama3.2 | |
| gpt-oss: | |
| + | |||
| + | |||
ollama Modelfile:
<code>
FROM qwen3-coder

# STRIX HALO AGENTIC TUNING
PARAMETER num_ctx 128000
PARAMETER num_batch 1024
PARAMETER num_predict 4096

SYSTEM """
You are a Strix Halo Optimized Coding Agent.
Always use asynchronous patterns and favor memory-efficient algorithms.
"""
</code>