Update from me: I’ve been gaming on CachyOS instead of Windows 11 for the last six months or so. I’d been wanting to get back into Linux since I left Microsoft, and I finally got around to it. Windows 11’s quality drift and Microsoft’s lack of vision on what a performant workstation looks like led me down that path.
Anyway, after putting some time into Dune: Awakening, I decided to spend some time on my future career and learning. I had a Dell Precision workstation gathering dust and figured hey, I’ll build a lab box on CachyOS Linux and do some local AI projects. This wasn’t the best idea, it turns out. No knock on CachyOS or the maintainers of ZFS, but when my lab box upgraded to CachyOS Linux 7, ZFS didn’t support the new kernel yet, which left the box unable to boot. A research miss on my part. It’s fine; lesson learned. I rebuilt the Dell with Ubuntu 24.04 LTS and proceeded to build a local AI agent I could use, with goals similar to the post I did for the Microsoft community a while back (archive) on TBI recovery.
Let’s discuss hardware real quick.
The Dell is a dual-Xeon with 192GB of DDR4 RAM and an Nvidia P1000 video card with 4GB of VRAM. That wasn’t going to work for this project; I needed a GPU with more oomph and more memory onboard. So I resorted to Facebook Marketplace, where I combed posts for a used Nvidia 3090. The key here is two-fold: the GPU must have Tensor cores, which effectively rules out everything before the Ampere line, and it needs enough VRAM to hold a model. For more information I turned to Google Gemini for reasoning:
For modern data science and deep learning, you should focus on chips starting from the Ampere (GAxxx) family and newer. These architectures include Tensor Cores, which are critical for accelerating the matrix math used in neural networks and modern data processing.
Fair enough. While shopping, I landed a 3090 FE from a coworker. Expect to spend roughly $650–750 for a card (at the time of writing). System RAM and CPU matter a little, but not nearly as much as the GPU, the core component here. There is some research out there suggesting AMD’s X3D CPUs handle RAG workloads better than other CPUs due to their huge cache size, but the Dell with its two Xeons should be fine.
This is what I’m working with:
————–
OS: Ubuntu 24.04.4 LTS x86_64
Host: Precision 7920 Tower
Kernel: 6.17.0-22-generic
Uptime: 2 days, 15 hours, 48 mins
Packages: 1705 (dpkg), 14 (snap)
Shell: bash 5.2.21
Resolution: 3440×1440
Terminal: /dev/pts/1
CPU: Intel Xeon Silver 4208 (32) @ 3.200GHz
GPU: NVIDIA Quadro P1000
GPU: NVIDIA GeForce RTX 3090
Memory: 15761MiB / 192057MiB
For storage, speed matters. The Dell chassis precluded mounting the 3090 internally, so I grabbed an Oculink PCIe adapter (x4) and put it in slot 2 of the Dell. Not all PCIe slots are created equal, especially on multi-socket systems, and it’s worth paying attention to: if I’d plugged the Oculink into the wrong slot, I could have halved the transfer bandwidth to the card. The same holds true for the adapters I added for NVMe drives. Some slots run at x1 speed, some hang off CPU 1, some off CPU 0.

I put in three WD_BLACK SN7100 1TB NVMe drives on dedicated PCIe adapters. I could have purchased a single card with four slots, or two cards with two NVMe slots each, etc. But PCIe bus speed and transfer rates were a concern here. I wanted this as performant as possible, so I went with three single-slot cards; that way each NVMe gets PCIe 3.0 x4 bandwidth to itself.

To further accelerate things, I built a ZFS RAIDZ1 pool on the three drives. Some will say RAIDZ2 is better, etc. But for a home lab I went with RAIDZ1: it’s redundant to a single drive failure and can be rebuilt/recovered, so it works for these pseudo-production needs of mine. I’m backing up the contents to a spinner anyway, so I’ve got some good CYA here.
Anyway here’s what I did:
sudo zpool create -f -o ashift=12 fastpool raidz1 \
  /dev/disk/by-id/nvme-WD_BLACK_SN7100_1TB_251344800057 \
  /dev/disk/by-id/nvme-WD_BLACK_SN7100_1TB_253854800702 \
  /dev/disk/by-id/nvme-WD_BLACK_SN7100_1TB_254595802964
sudo zfs set compression=lz4 fastpool
sudo zfs set atime=off fastpool
sudo zfs set recordsize=1M fastpool
sudo zfs create fastpool/ollama
sudo zfs create fastpool/anythingllm
The boot/OS drive sits outside the RAID on a single 512GB NVMe drive.
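Some back-of-the-envelope numbers for this layout, as a sketch; the ~985 MB/s usable-per-lane figure for PCIe 3.0 is an approximation, and real throughput depends on the drives and ZFS overhead:

```python
# Rough arithmetic for the storage layout above: three 1TB NVMe drives,
# each on its own PCIe 3.0 x4 adapter, pooled as RAIDZ1.
PCIE3_LANE_GBS = 0.985      # ~985 MB/s usable per PCIe 3.0 lane (approximate)
LANES_PER_DRIVE = 4
DRIVES = 3
DRIVE_TB = 1.0

per_drive_bw = PCIE3_LANE_GBS * LANES_PER_DRIVE   # each drive gets its own x4 link
aggregate_bw = per_drive_bw * DRIVES              # no shared-slot bottleneck
usable_tb = (DRIVES - 1) * DRIVE_TB               # RAIDZ1 spends one drive on parity

print(f"per drive:  ~{per_drive_bw:.2f} GB/s")
print(f"aggregate:  ~{aggregate_bw:.2f} GB/s")
print(f"usable:     ~{usable_tb:.0f} TB")
```

This is why a single x4 card with four M.2 slots was out: the three drives would have shared one x4 link instead of getting one each.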
AnythingLLM runs best in Docker (for workspace management), but the Docker bridge interface must exist before Ollama can bind to it. So after building the ZFS pool, I installed Docker:
sudo apt-get update
sudo apt-get install ca-certificates curl -y
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
Ollama is installed natively to bypass the “Docker GPU tax”. Reading up on this, it sounds like some FUD mixed in with real concerns about performance decay over time and the special tweaks needed for GPU passthrough. I settled on running Ollama natively to keep it simple and extensible, in case I want to integrate Ollama into something else later on.
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl stop ollama
Then copy the files over from the default Ollama install location onto the ZFS dataset:
# Sync existing data to the new dataset
sudo rsync -avzh /usr/share/ollama/.ollama/ /fastpool/ollama/

# Set correct ownership so the service user can read/write to the pool
sudo chown -R ollama:ollama /fastpool/ollama

# Optional: Remove old data to reclaim space on the OS drive
sudo rm -rf /usr/share/ollama/.ollama/
In a multi-GPU system like mine, the P1000 typically gets picked before the eGPU due to PCI ID priority. That’s OK: you can use nvidia-smi to find the UUID of the specific GPU you’ve installed for this project and tell Ollama to use just that one. Take the GPU- value from the output and pop it into the service config:
nvidia-smi -L
sudo systemctl edit ollama.service
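If you script this step, the UUID can be pulled out of the `nvidia-smi -L` listing programmatically. A minimal sketch; the sample listing and its UUIDs are made up for illustration, so substitute your real output:

```python
import re

# Hypothetical `nvidia-smi -L` output from a two-GPU box like this one.
SAMPLE = """\
GPU 0: NVIDIA Quadro P1000 (UUID: GPU-11111111-2222-3333-4444-555555555555)
GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
"""

def uuid_for(model: str, listing: str):
    """Return the GPU-... UUID of the first GPU whose name contains `model`."""
    for line in listing.splitlines():
        m = re.match(r"GPU \d+: (.+) \(UUID: (GPU-[0-9a-f-]+)\)", line)
        if m and model in m.group(1):
            return m.group(2)
    return None

# The value to paste into CUDA_VISIBLE_DEVICES in the service config:
print(uuid_for("3090", SAMPLE))
```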
In the ollama.service configuration, put the UUID in as shown below:
[Service]
# Storage Path
Environment="OLLAMA_MODELS=/fastpool/ollama"
# Networking: Bind to Docker Bridge for AnythingLLM access
Environment="OLLAMA_HOST=172.17.0.1:11434"
Environment="OLLAMA_ORIGINS=http://192.168.1.50:3001,http://localhost:3001"
# Hardware Isolation (Replace with your 3090 UUID)
Environment="CUDA_VISIBLE_DEVICES=GPU-your-3090-uuid-here"
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
# 2026 Performance Fixes
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=1"
# Preload Gemma 4 into VRAM on Service Start
ExecStartPost=/usr/bin/bash -c "sleep 5 && /usr/bin/curl http://172.17.0.1:11434/api/generate -d '{\"model\": \"gemma4:e4b-it-q8_0\"}'"
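Since Ollama binds to the Docker bridge address, it has to start after Docker has created that interface. One way to encode the ordering is a [Unit] stanza in the same systemctl edit drop-in; a sketch, assuming the stock docker.service and ollama.service unit names:

```ini
# Drop-in for ollama.service: wait for Docker so 172.17.0.1 exists before binding.
[Unit]
After=docker.service
Wants=docker.service
```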
I landed on the model gemma4:e4b-it-q8_0 because I wanted to be able to do Stable Diffusion image generation as well as run a model on the GPU concurrently. If I weren’t concerned with making graphics (I’m kind of on the fence on whether I need this, tbh), I’d be using gemma4:26b-a4b-it-q4_K_M, which is a mixture-of-experts model. You can read more about that here. TL;DR: it makes the model more performant and useful. For a list of available Gemma4 models, go here: https://ollama.com/library/gemma4/tags
Anyway, the ExecStartPost command makes the Ollama service load the model at startup, so when I first go to AnythingLLM’s interface I don’t have to wait on model load times for a response.
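To sanity-check that the preload worked, you can ask Ollama which models it knows about. A small sketch: the JSON shape follows Ollama’s /api/tags response, but the sample payload here is made up; in practice you’d fetch http://172.17.0.1:11434/api/tags with curl or urllib and feed the body in:

```python
import json

def has_model(tags_json: str, name: str) -> bool:
    """True if any model in an Ollama /api/tags response starts with `name`."""
    data = json.loads(tags_json)
    return any(m.get("name", "").startswith(name) for m in data.get("models", []))

# Made-up sample body standing in for the live /api/tags response.
SAMPLE = json.dumps({"models": [{"name": "gemma4:e4b-it-q8_0"}]})
print(has_model(SAMPLE, "gemma4"))  # True when the preload landed
```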
With Ollama running, I deployed AnythingLLM to my ZFS pool:
sudo chown -R 1000:1000 /fastpool/anythingllm
docker run -d -p 3001:3001 \
  --name anythingllm \
  --cap-add SYS_ADMIN \
  -v /fastpool/anythingllm:/app/server/storage \
  -e STORAGE_DIR="/app/server/storage" \
  --add-host=host.docker.internal:host-gateway \
  --restart always \
  mintplexlabs/anythingllm:latest
The reasoning for the add-host flag is to give Ollama and AnythingLLM an interface to communicate on (and to allow connections from a remote box).
You may also need sudo ufw allow 3001/tcp to open the firewall for AnythingLLM.
The Result
By pointing the browser on my gaming rig to the headless node’s IP (http://192.168.1.50:3001), I now have a fully functional RAG environment.
- Model: gemma4:e4b-it-q8_0 (128K context)
- Inference Speed: ~95 tokens/sec on the 3090.
- Storage Latency: Near-zero sequential load times via RAIDZ1.
- Persistence: All models and vector databases reside on a redundant, scrubbable ZFS array.
What do you do with it? Up to you really. I’m using it to make a TTRPG campaign and exploring some vibe coding projects.
An easy-button project of interest in this vein: https://aitherium.com/, written by David Parkhurst. I’ll probably turn to it once I’m done kicking the tires on what I’ve built.
Hopefully this was of interest to someone out there.