This report outlines the deployment of the Ollama LLM runtime on Arch Linux specifically tailored for the AMD Ryzen AI Max+ 395 APU. The primary focus is optimizing performance by leveraging the integrated Radeon 8060S iGPU through the Vulkan backend, and considering the potential of the XDNA 2 NPU for heterogeneous acceleration.
The AMD Ryzen AI Max+ 395 APU is a highly integrated system-on-a-chip (SoC) whose heterogeneous architecture is well suited to AI workloads: Zen 5 CPU cores, the integrated Radeon 8060S iGPU (RDNA 3.5), and the XDNA 2 NPU, all sharing a unified memory pool.
Before deploying Ollama, the base Arch Linux installation must have the correct drivers and utilities to fully expose the APU's capabilities, especially for Vulkan and unified memory management.
Ensure the system is running a recent kernel (e.g., 6.10 or later) for optimal Zen 5 and RDNA 3.5 support.
```bash
# Update the system and install a recent kernel if not already running one
sudo pacman -Syu linux linux-headers
```
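After rebooting into the updated kernel, a quick check confirms the running version:

```bash
# Confirm the running kernel version
uname -r
```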
The Vulkan backend relies on the open-source Mesa stack via the RADV driver.
```bash
# Install Mesa with Vulkan support for AMDGPU
sudo pacman -S mesa vulkan-radeon lib32-vulkan-radeon
```
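To confirm that the RADV driver is the one exposed to Vulkan, the `vulkaninfo` utility from the vulkan-tools package can be used (field names may vary slightly between versions):

```bash
# Install the Vulkan introspection tools and check which device/driver is exposed
sudo pacman -S vulkan-tools
vulkaninfo --summary | grep -Ei 'deviceName|driverName'
```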
While the objective is Vulkan, installing the ROCm stack is often useful for complete AMD GPU compute support and may be leveraged by other frameworks or future Ollama features. The Radeon 8060S in the Ryzen AI Max+ 395 corresponds to the gfx1151 target, which has improved support in recent ROCm releases (e.g., 6.4 or later).
```bash
# Install essential ROCm packages
sudo pacman -S rocm-core
# Install rocm-hip-sdk if developing or using other ROCm-based tools
# sudo pacman -S rocm-hip-sdk
```
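A quick way to confirm which gfx target the runtime sees is `rocminfo` (a sketch, assuming the rocminfo package; the exact target string reported may differ on your system):

```bash
# List ROCm agents and their gfx targets
sudo pacman -S rocminfo
rocminfo | grep -i gfx
```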
The iGPU uses shared system RAM (Unified Memory). For optimal LLM performance, a large dedicated memory pool for graphics (iGPU/VRAM) is essential.
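The current split can be inspected through the amdgpu driver's sysfs nodes (a sketch; the card index varies between systems, and values are reported in bytes):

```bash
# Dedicated (UMA carve-out) memory visible to the iGPU
cat /sys/class/drm/card*/device/mem_info_vram_total
# GTT (shared system RAM) that the driver can additionally map for the iGPU
cat /sys/class/drm/card*/device/mem_info_gtt_total
```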
Ollama relies on an underlying LLM engine (typically a fork of llama.cpp). Recent Ollama releases (e.g., 0.12.6 or later) include experimental support for Vulkan acceleration, which must be explicitly enabled via an environment variable.
The easiest method is to use the Arch package manager or the official install script.
```bash
# Install Ollama from the Arch Linux repository (or AUR)
sudo pacman -S ollama
# OR
# curl -L https://ollama.com/download/install.sh | sh
```
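The Arch package provides a systemd unit (ollama.service); enable it and confirm the installed version meets the 0.12.6+ requirement noted above:

```bash
# Enable and start the Ollama service, then verify the version
sudo systemctl enable --now ollama
ollama --version
```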
To force Ollama to utilize the more performant Vulkan backend on the AMD iGPU, the OLLAMA_VULKAN environment variable must be set when running the service.
Modify the Ollama systemd service unit to include the required environment variable.
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d/
```

Create a drop-in override file (e.g., /etc/systemd/system/ollama.service.d/vulkan_override.conf):

```ini
[Service]
Environment="OLLAMA_VULKAN=1"
# Optional: set the number of threads for CPU fallback/host operations
# Environment="OLLAMA_NUM_THREADS=16"
```
Reload systemd and restart the Ollama service:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
Check the service logs for confirmation that Vulkan initialization was successful:

```bash
sudo journalctl -u ollama -f
```
Look for messages indicating Vulkan/GGML initialization on the iGPU. Ollama may also log which accelerator is being used when a model is run.
Test a small model, ensuring the output indicates GPU/Vulkan usage. The number of layers offloaded (exposed in Ollama as the num_gpu option, analogous to llama.cpp's --gpu-layers) is usually determined automatically from the available VRAM (shared RAM).
```bash
ollama run llama3:8b
# After the model downloads, monitor system resource usage (e.g., with htop and radeontop)
# The prompt prefill phase will typically show high iGPU usage
```
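As a further check (a sketch; output formats vary across Ollama versions), `ollama ps` reports how a loaded model is split between GPU and CPU, and the layer offload can be pinned explicitly through the API:

```bash
# In a second terminal: show loaded models and their GPU/CPU split
ollama ps

# Optionally pin the number of offloaded layers for a single request
# (num_gpu is the Ollama option mirroring llama.cpp's --gpu-layers; 32 is an illustrative value)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Hello",
  "options": { "num_gpu": 32 }
}'
```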
The Ryzen AI Max+ 395 introduces the XDNA 2 NPU, enabling true heterogeneous computing. While Ollama's Vulkan backend (via llama.cpp) accelerates inference on the iGPU, direct, standardized, and easily configured NPU acceleration within Ollama on Linux is currently limited and experimental, and often requires niche frameworks or ONNX models (e.g., for hybrid execution).
The optimal, long-term deployment strategy will involve splitting the LLM workload to maximize the strengths of each compute unit.
```mermaid
graph LR
    subgraph Frontend
        A[User Prompt] --> B(Ollama Server/API)
    end
    B --> C{"Workload Scheduler (Ollama)"}
    C -->|"Prompt/Context Prefill (High TTFT focus)"| D(XDNA 2 NPU)
    C -->|"Token Decoding (High TPS focus)"| E(Radeon 8060S iGPU - Vulkan)
    D --> F[NPU Results]
    E --> G[iGPU Results]
    F & G --> H(Final Token Stream)
    H --> B
```
**Figure: Conceptual Heterogeneous LLM Pipeline for AMD Ryzen AI Max+ 395**
### 3. NPU Exploration (Advanced)
For the skilled Linux user aiming for NPU utilization, direct integration currently requires bypassing Ollama in favor of NPU-specific tooling (e.g., a build of llama.cpp with NPU support or a dedicated NPU LLM tool); a quick driver check is sketched below.
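As a starting point, confirm that the XDNA kernel driver has bound the NPU (a sketch; the amdxdna driver was mainlined around kernel 6.14, so older kernels may need an out-of-tree build):

```bash
# Check that the amdxdna driver is loaded and the NPU is exposed via the accel subsystem
lsmod | grep amdxdna
ls /dev/accel/   # an accelN node appears when the NPU is available
```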
The following table summarizes the key tuning actions for maximizing LLM performance on this platform:

| Component | Tuning Action | Rationale |
| :--- | :--- | :--- |
| System RAM | Maximize physical RAM (e.g., 64GB/128GB). | The iGPU uses unified memory; more RAM directly equates to more VRAM for larger models/context. |
| BIOS/UEFI | Set UMA Frame Buffer Size to max (e.g., 16GB-32GB). | Crucial for allocating a large, dedicated memory pool for GPU offload. |
| Ollama | Use Q4_K_M or Q5_K_M quantization. | Optimal balance of VRAM usage and inference speed/quality. Larger quantizations (Q6/Q8) may be slower or consume too much VRAM. |
| Vulkan | Ensure OLLAMA_VULKAN=1 is set. | Forces the use of the Vulkan backend, which is generally reported as faster than ROCm on APUs for LLM workloads. |
| Model Selection | Prioritize MoE (Mixtral, DeepSeek) models. | The architecture excels at MoE models by efficiently leveraging the available memory bandwidth and unified memory for large models with small active experts. |
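For example, to pull an explicitly quantized build rather than the default tag (the tag name below is illustrative; check the Ollama model library for the quantizations actually published):

```bash
# Prefer a Q4_K_M build to balance VRAM usage and quality, per the table above
ollama pull llama3:8b-instruct-q4_K_M
ollama run llama3:8b-instruct-q4_K_M
```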