Architecting a Local, Headless AI Inference Lab

Building a local, bare-metal AI lab means combining container images designed cloud-first with very specific on-premises hardware, and the two do not always fit together cleanly. In this guide, I document the deployment of a unified text-and-image inference stack using OpenWebUI and ComfyUI on a headless Ubuntu Server LTS host. Along the way I cover isolating the containers, binding their storage to datasets on a fast ZFS pool (fastpool), mapping an external NVIDIA RTX 3090 FE connected via Oculink, and debugging the API mapping errors between OpenWebUI and ComfyUI.

I’ll be working with a Dell Precision and an Oculink-connected NVIDIA RTX 3090 FE, running Ubuntu Server LTS. The server OS is installed and working, and I’m SSH’d into the server from a terminal on a gaming machine. The Dell Precision has three NVMe drives attached through NVMe-to-PCIe adapters; they run ZFS and make up the pool fastpool. The Precision also has an NVIDIA P1000 card for the local display. That card is out of scope for this exercise.

Phase 1: Establishing Persistent ZFS Storage Layers

Containers are disposable: anything written to a container’s own storage layer is lost when the container is removed and recreated. To keep application data persistent and snapshot-capable, each application’s data directory is mapped to its own ZFS dataset on the host filesystem.

# Create dedicated datasets inside the local NVMe ZFS storage pool
sudo zfs create -o mountpoint=/opt/ai fastpool/ai
sudo zfs create fastpool/ai/openwebui
sudo zfs create fastpool/ai/comfyui

# Give your login user ownership of the dataset tree so the bind-mounted paths are writable
sudo chown -R "$USER:$USER" /opt/ai
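
Since the point of separate datasets is that they can be snapshotted, here is a minimal sketch of taking a recursive snapshot before an upgrade. The helper names snapshot_name and snapshot_ai and the label-plus-timestamp scheme are my own illustration, not part of ZFS; zfs snapshot -r is the standard recursive snapshot command.

```shell
# Build a timestamped snapshot name, e.g. fastpool/ai@pre-upgrade-20250101-120000.
# (snapshot_name is a hypothetical helper, not a ZFS command.)
snapshot_name() {
  printf '%s@%s-%s\n' "$1" "$2" "$(date +%Y%m%d-%H%M%S)"
}

# Snapshot fastpool/ai and every child dataset (openwebui, comfyui) in one step.
# The label defaults to "manual" when no argument is given.
snapshot_ai() {
  sudo zfs snapshot -r "$(snapshot_name fastpool/ai "${1:-manual}")"
}

# Usage:
#   snapshot_ai pre-upgrade
#   zfs list -t snapshot -r fastpool/ai
```

Taking the snapshot at the parent dataset with -r captures both application datasets at the same moment, which makes it easier to roll them back together.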

Phase 2: OpenWebUI Deployment & Remote Network Access

OpenWebUI is the user-facing dashboard and orchestrator. Text inference runs in Ollama directly on the Ubuntu host rather than inside Docker, so from within the container, localhost refers to the container itself and a request to localhost:11434 never reaches Ollama. To resolve this, I map the host’s gateway address into the container:

docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v /opt/ai/openwebui:/app/backend/data \
  --name openwebui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

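The --add-host=host.docker.internal:host-gateway flag adds a hosts entry inside the container that resolves host.docker.internal to the host’s gateway IP, so OpenWebUI can reach host services (like Ollama on port 11434) at http://host.docker.internal:11434. Ollama must be listening on an address that is reachable from the Docker bridge. By default it binds to 127.0.0.1 only, and setting OLLAMA_HOST=0.0.0.0 in its service environment changes that.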
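
To confirm the mapping works before touching the UI, a quick check can be run from inside the container. This is a sketch under two assumptions: that curl is available inside the OpenWebUI image, and that the container is named openwebui as above. check_ollama_from_container is a hypothetical helper name; /api/tags is Ollama’s model-listing endpoint.

```shell
# Ask Ollama for its model list from inside the OpenWebUI container.
# A JSON response means the host.docker.internal mapping resolves and Ollama answers.
# (check_ollama_from_container is a hypothetical helper; it assumes curl exists in the image.)
check_ollama_from_container() {
  docker exec "${1:-openwebui}" \
    curl -fsS --max-time 5 http://host.docker.internal:11434/api/tags
}

# Usage:
#   check_ollama_from_container && echo "Ollama reachable from the container"
```

If the request fails here, the problem is on the host side of the connection rather than in OpenWebUI’s settings.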

Crucial Step: How to Access OpenWebUI Remotely from Your LAN
Because this is a headless server setup, you will interact with OpenWebUI from your primary workstation or gaming machine on the same local network.

1. Find your Server IP: Run ip a or hostname -I on your headless Ubuntu host to locate its local network IP (e.g., 192.168.1.150).
2. Open the Web URL: From your client machine’s web browser, navigate directly to http://<your-ubuntu-server-ip>:3000.
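3. Configure the Host Firewall: If you are running Ubuntu’s Uncomplicated Firewall (UFW), add an explicit rule for port 3000. Be aware that Docker publishes ports through its own iptables rules, which are processed ahead of UFW’s, so a published port can be reachable even without a UFW rule. Adding the rule keeps the firewall configuration consistent with what is actually exposed: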
sudo ufw allow 3000/tcp

Phase 3: Hardware Translation & Stable Oculink Mapping

Unlike standard internal PCIe expansion cards, external GPUs connected via Oculink or other eGPU interfaces are not guaranteed the same enumeration order across reboots. If the ordering changes, a container assigned to index 0 could end up on the wrong GPU (on this machine, the P1000 display card) or fail to initialize. To avoid that, I target the card by its NVIDIA UUID, which identifies the physical GPU and stays the same regardless of enumeration order.

# List the installed GPUs together with their UUIDs, as reported by the NVIDIA driver
nvidia-smi -L

Example UUID (the value shown in the UUID field of the output): GPU-b169214d-e64e-5222-f36b-330e62645e1e
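
With two NVIDIA cards in the box, it helps to pick the UUID by card name instead of copying it by eye. The sketch below assumes the usual nvidia-smi -L line format, "GPU 0: <name> (UUID: GPU-...)". gpu_uuid_by_name is a hypothetical helper name of my own.

```shell
# Print the UUID of the first GPU whose listing line contains the given text.
# Reads `nvidia-smi -L` style lines on stdin, for example:
#   GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-b169214d-e64e-5222-f36b-330e62645e1e)
# (gpu_uuid_by_name is a hypothetical helper name.)
gpu_uuid_by_name() {
  grep -F -- "$1" | sed -n 's/.*(UUID: \(GPU-[0-9a-fA-F-]*\)).*/\1/p' | head -n 1
}

# Usage:
#   RTX_UUID=$(nvidia-smi -L | gpu_uuid_by_name "RTX 3090")
#   echo "$RTX_UUID"
```

The function prints nothing when no card matches, so an empty RTX_UUID is a signal that the 3090 was not detected at all.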

Next, install the NVIDIA Container Toolkit, which lets Docker expose the host’s NVIDIA driver and GPU devices to containers:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
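
Before deploying anything heavy, it is worth checking that a container can actually see the card. A minimal sketch, assuming the stock ubuntu image as a throwaway test container (the toolkit injects nvidia-smi from the host driver). verify_gpu_in_container is a hypothetical helper name.

```shell
# Run nvidia-smi in a throwaway container that is given only the GPU with this UUID.
# If the toolkit is configured correctly, the listing shows exactly that one card.
# (verify_gpu_in_container is a hypothetical helper; "ubuntu" is just a small test image.)
verify_gpu_in_container() {
  docker run --rm --gpus "device=$1" ubuntu nvidia-smi -L
}

# Usage:
#   verify_gpu_in_container GPU-b169214d-e64e-5222-f36b-330e62645e1e
```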

Phase 4: ComfyUI & The Cloud-First Container Conflict

For image generation I use the widely used ai-dock/comfyui image because it ships with CUDA libraries, PyTorch extensions, and xformers already layered in. However, the image is built cloud-first, and running it on an isolated local network exposes some serious mismatches.

Encountered Anomaly: Unexpected External Telemetry & Daemon Proliferation
Inspecting the first deployment with docker logs comfyui showed the container starting several background daemons I had not asked for: cloudflared instances dialing out to establish public quick tunnels, Syncthing peer-to-peer sync, and built-in proxies trying to force redirects to port 1111. The result was loopback binding failures and stale redirects cached by the local browser.

The Architecture Solution: Surgical Docker Command Override
Rather than working around the image’s initialization layer, I override the container command. Arguments placed after the image name in docker run replace the image’s default command (CMD), so appending the ComfyUI Python invocation means the supervisor scripts never start and Caddy, cloudflared, SSH, and Syncthing are never launched. This relies on the image starting its supervisor from CMD; for an image that uses ENTRYPOINT instead, the same effect requires --entrypoint.

docker run -d \
  --name comfyui \
  --gpus device=GPU-b169214d-e64e-5222-f36b-330e62645e1e \
  -p 8188:8188 \
  -v /opt/ai/comfyui:/workspace \
  --restart always \
  ghcr.io/ai-dock/comfyui:latest-cuda \
  /opt/environments/python/comfyui/bin/python /opt/ComfyUI/main.py --listen 0.0.0.0 --port 8188

This leaves a minimal runtime: only the ComfyUI Python process starts, listening on all network interfaces via --listen 0.0.0.0 on the standard port 8188. ComfyUI itself does not authenticate clients, so keep this port restricted to your LAN. To connect your remote desktop to the raw ComfyUI canvas independently, ensure your host firewall permits traffic: sudo ufw allow 8188/tcp.
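
To check that the override really suppressed the extra daemons, the process list of the running container can be inspected. This is a sketch, not an exhaustive audit: the daemon names in the pattern are the ones observed in the logs above, and comfyui_processes and count_side_daemons are hypothetical helper names.

```shell
# List the processes running inside the container (docker top is a standard Docker command).
comfyui_processes() {
  docker top "${1:-comfyui}"
}

# Count the lines of a process listing (read from stdin) that mention a known side daemon.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the function's status clean.
count_side_daemons() {
  grep -c -i -E 'cloudflared|syncthing|caddy|sshd' || true
}

# Usage:
#   comfyui_processes | count_side_daemons
```

A result of 0 means none of those daemons are running inside the container.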

Phase 5: Headless Model Sourcing

The ZFS pool is local to the headless server, so downloading multi-gigabyte model weights on a desktop and then copying them across a slow network share adds a needless extra hop. Instead, I run the download inside the container and pull the weights straight from Hugging Face into ComfyUI’s checkpoints directory:

# Direct injection of Juggernaut XL (Base Text-To-Image Weights)
docker exec -it comfyui wget -O /opt/ComfyUI/models/checkpoints/Juggernaut_XL.safetensors https://huggingface.co/RunDiffusion/Juggernaut-XL-v9/resolve/main/Juggernaut-XL_v9_RunDiffusionPhoto_v2.safetensors

# Direct injection of DreamShaper XL Turbo (High-Speed Latent Consistency Model)
docker exec -it comfyui wget -O /opt/ComfyUI/models/checkpoints/DreamShaperXL_Turbo_v2_1.safetensors https://huggingface.co/Lykon/dreamshaper-xl-v2-turbo/resolve/main/DreamShaperXL_Turbo_v2_1.safetensors

Tuning note: When using the DreamShaper XL Turbo model, keep the sampler at 6 to 8 steps to avoid over-saturation artifacts.

Phase 6: API Data Mapping & Structural Variable Resolution

OpenWebUI drives ComfyUI by submitting a complete node graph in API format. To get that graph, I export the workflow using ComfyUI’s Save (API Format) dev tool option. In my testing, the standard variable substitutions failed with errors like [ERROR: 'NoneType' object is not subscriptable] and [ERROR: 'NoneType' object has no attribute 'lower']. This happens when the node IDs configured in OpenWebUI do not match those in the exported JSON, or when an active chat still holds onto a model that is not mapped.

To resolve this, the fields in OpenWebUI’s Admin Panel -> Settings -> Images dashboard (accessible remotely via your port 3000 URL) must be set to the exact node IDs and input keys used in the exported JSON:

Image Creation Mappings: Prompt (Node 6, key: text), Model (Node 4, key: ckpt_name), Seed/Steps (Node 3, keys: seed/steps), Width/Height (Node 5, keys: width/height).
Image Modification Mappings: Image (Node 10, key: image), Prompt (Node 6, key: text), Model (Node 4, key: ckpt_name). Leave width and height parameters blank.
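
The node numbers above come from my export and may differ in yours, so it is worth listing them straight from the file. A minimal sketch, assuming jq is installed on the host and that the export was saved as workflow_api.json (both are my assumptions). An API-format export is a JSON object keyed by node ID, where each node carries a class_type and an inputs object. list_workflow_nodes is a hypothetical helper name.

```shell
# Print "node_id<TAB>class_type<TAB>input keys" for every node in an API-format export.
# Requires jq on the host (an assumption; on Ubuntu: sudo apt-get install -y jq).
# (list_workflow_nodes is a hypothetical helper name.)
list_workflow_nodes() {
  jq -r 'to_entries[] | "\(.key)\t\(.value.class_type)\t\(.value.inputs | keys | join(","))"' "$1"
}

# Usage:
#   list_workflow_nodes workflow_api.json
```

Each printed line gives the node ID to enter in OpenWebUI and the input keys available on that node, which makes an ID mismatch easy to spot.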

Phase 7: Prompt-Driven Image Editing & Agentic Execution

To handle prompt-driven image modification without getting canned “As an AI language model…” refusals, two orchestration settings need to change. Navigate to Admin Panel -> Settings -> Models -> Global Settings (Gear Icon), and set Function Calling to Native. Next, make sure the image_generation capability checkbox is enabled on the settings page of the specific LLM you are using.

Finally, an image uploaded without a mask gives the inpainting workflow an empty (solid black) mask, so nothing is repainted and the picture comes back unmodified. To prevent that, switch your ComfyUI graph from an Inpainting structure to a plain Image-to-Image (Img2Img) topology: delete the VAEEncodeForInpaint node, link the LoadImage node directly into a standard VAE Encode node, and lower the KSampler’s denoise value to 0.45 or 0.50. A value in that range preserves the shapes of your original layout while leaving the sampler enough freedom to render your modifications.

Conclusion

I hope you enjoyed this walkthrough and found it useful. Good luck with your own building projects, and have fun with it. Consumer-grade hardware is capable of a lot, certainly enough to learn how this stuff works and how it can go sideways 🙂

Jeff


Jeff Stokes
