References:
- https://

This report outlines the deployment of the **Ollama LLM runtime** on **Arch Linux**, specifically tailored for the **AMD Ryzen AI Max+ 395 APU**. The primary focus is optimizing performance by leveraging the integrated **Radeon 8060S iGPU** through the **Vulkan** backend, and considering the potential of the **XDNA 2 NPU** for heterogeneous acceleration.

```bash
sudo pacman -S mesa vulkan-radeon lib32-vulkan-radeon vulkan-headers
```

| - | |||
### 3. ROCm (Optional but Recommended)

```bash
# Install essential ROCm packages
yay -S rocm-core
yay -S rocm-hip-sdk
# Allow the current user to access the GPU devices (the video group is assumed
# alongside render; log out and back in for the change to take effect)
sudo usermod -a -G render,video $USER

# xdna: NPU kernel driver and XRT runtime from the AUR
yay -S amdxdna-driver-bin xrt-npu-git
```
| + | |||
| + | **IMPORTANT**: | ||
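Before moving on, it is worth checking that both compute units are actually visible to the system. A minimal sketch, assuming `rocminfo` was pulled in by the ROCm packages above and that the machine has been rebooted so the `amdxdna` module is loaded:

```bash
# ROCm should report the iGPU as an agent (a gfx11xx ISA for the Radeon 8060S)
rocminfo | grep -i gfx

# The XDNA driver registers the NPU with the kernel's accel subsystem
lsmod | grep amdxdna
ls /dev/accel/

# The usermod group change above only takes effect after logging out and back in
groups
```
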
### 4. Memory Configuration
| ----- | ----- | ||

## ⚙️ Performance Tuning Notes
| + | |||
| + | ====== lemonade-server ====== | ||
| + | |||
| + | <code | download> | ||
| + | yay -S lemonade-server | ||
| + | </ | ||
| + | |||
| + | oga-hybrid mode: this splits the work, the NPU handles the prefill (prompt processing), | ||
| + | <code | download> | ||
| + | lemonade-server run Qwen3-Coder-30B-A3B-Instruct-GGUF --recipe oga-hybrid --llamacpp rocm | ||
| + | </ | ||
| + | |||
| + | <code | download> | ||
| + | curl http:// | ||
| + | -H " | ||
| + | -d '{ | ||
| + | " | ||
| + | " | ||
| + | " | ||
| + | }' | ||
| + | curl http:// | ||
| + | </ | ||
====== Benchmark ======
<code bash>
# Use CMAKE to enable Vulkan
cmake -DGGML_VULKAN=ON -B build
cmake --build build --config Release

# This command downloads the 4.92 GB model file directly.
wget https://
https://

# bench
./
./
</code>
| + | |||
| + | ollama | ||
| + | <code bash> | ||
| + | ollama run llama3.2 | ||
| + | </ | ||
| + | |||
| + | |||
| + | < | ||
| + | yay -S python-huggingface-hub | ||
| + | hf download Qwen/ | ||
| </ | </ | ||