
Multimodal LLMs Locally: What Vision Models Can Actually Do on Your GPU

Local multimodal models can handle screenshot analysis, document extraction, and image description without sending data to an API. Here's an honest assessment of what runs on your GPU.


Multimodal LLMs that accept both images and text as input have been available via API for a while. Running them locally is a different story. The gap between what cloud-hosted vision models can do and what you can run on your own GPU without an internet dependency is real, and most coverage of local multimodal models either overstates capability or undersells the hardware requirements. Here’s what vision models can actually do on your GPU when you’re running them locally.

The use case for local multimodal models is clearer than it might seem. If your workflow involves processing images that contain sensitive information (screenshots of internal dashboards, medical imagery, private documents, proprietary designs), sending them to an external API is a data handling decision you may not be authorized to make. Running a local LLM on your own hardware keeps that data on your machine. That's not a feature; it's a requirement in certain contexts.

The Models Worth Testing

LLaVA (Large Language and Vision Assistant) was one of the first practically useful open-weight multimodal models and it’s still a reasonable baseline. The 7B variant runs on a GPU with 8GB of VRAM at 4-bit quantization. It can describe images, answer questions about visual content, read text in images with moderate accuracy, and follow visual instructions. It’s not GPT-4V, but it’s useful for document processing, screenshot analysis, and basic visual QA tasks.

LLaVA-NeXT and its variants improved significantly on the original's OCR and spatial reasoning. If your use case involves reading structured data from images (tables, forms, charts), the newer variants are worth the slightly higher compute requirement. BakLLaVA combines the Mistral base model with LLaVA's vision components and tends to produce better instruction-following behavior for practical tasks.

MiniCPM-V deserves attention specifically for lower-VRAM deployments. The 2B and 8B variants run on 6-8GB of VRAM and produce surprisingly capable visual reasoning for their size. For basic document understanding and image description tasks, this is the model to reach for when VRAM is the constraint. Choosing the right model for your GPU tier matters more with multimodal models than with text-only ones because vision encoders add memory overhead on top of the language model’s base requirements.

What They Actually Do Well

Screenshot analysis and UI description are genuine strengths of current local vision models. Give a multimodal model a screenshot of a web application and ask it to describe the layout, identify interactive elements, or flag potential usability issues, and it does a reasonable job. This has direct practical value for QA workflows where you want to automate visual inspection without building a full computer vision pipeline.

Document extraction from image-based PDFs and scanned documents is another real use case. OCR quality varies; it's not Tesseract-level reliability on structured forms, but for unstructured document content like letters, reports, and notes, local vision models extract text and answer questions about the content at a usable level of accuracy. The workflow is straightforward: pass the image to the model, ask specific extraction questions, and post-process the output.
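As a minimal sketch of that workflow, here's what a single extraction call can look like against a LLaVA-class model served by Ollama on its default local endpoint. The model tag, file names, and prompts are illustrative; swap in whatever you've pulled locally.

```python
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def ask_about_image(image_path: str, question: str, model: str = "llava") -> str:
    """Send one image plus an extraction question to a locally served vision model."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": question,
        "images": [image_b64],  # the generate endpoint takes base64-encoded image data
        "stream": False,        # one complete response instead of a token stream
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"].strip()

# Illustrative extraction questions against a scanned letter
full_text = ask_about_image("scanned_letter.png", "Transcribe all readable text in this document.")
doc_date = ask_about_image("scanned_letter.png", "What date appears on this document? Reply with the date only.")
```

Post-processing is whatever your pipeline needs: validate the date format, strip filler, or re-ask with a narrower question when the first answer comes back vague.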

Image-to-text description for accessibility and cataloging purposes works well within the capability range of current models. Batch processing a folder of images to generate alt text, descriptions, or searchable metadata is a practical automation task that runs reliably on local hardware without API costs.
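The same call pattern scales to a batch job. A rough sketch for alt-text generation over a folder of images, again assuming a local Ollama endpoint; the folder, model tag, and output file are placeholders.

```python
import base64
import json
from pathlib import Path

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
PROMPT = "Write one concise sentence of alt text describing this image."

metadata = {}
for img in sorted(Path("images").glob("*.png")):  # folder of images to catalog
    b64 = base64.b64encode(img.read_bytes()).decode("utf-8")
    resp = requests.post(OLLAMA_URL, json={
        "model": "llava",
        "prompt": PROMPT,
        "images": [b64],
        "stream": False,
    }, timeout=300)
    resp.raise_for_status()
    metadata[img.name] = resp.json()["response"].strip()

# Write searchable metadata alongside the images
Path("alt_text.json").write_text(json.dumps(metadata, indent=2))
```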

Where They Fall Apart

Fine-grained spatial reasoning is still weak in most local vision models. Asking a model to count specific objects, identify precise locations of elements in a complex image, or reason about spatial relationships between multiple objects produces inconsistent results. Don’t build a pipeline that depends on accurate spatial reasoning from a local vision model without extensive testing first.

High-resolution image processing is constrained by how the model tiles or resizes its input. Most local multimodal models internally resize images to a fixed resolution before encoding them, which means fine detail in large images gets lost. If your use case requires reading small text in dense images or analyzing high-resolution technical diagrams, test your specific inputs against your target model before committing.

Video understanding is largely unavailable in local multimodal models at practical speeds. Some models accept multiple image frames, but processing video at anything resembling real-time speed requires hardware and model architecture that isn't yet accessible at the consumer GPU tier. If you need video analysis locally, frame sampling with image-level analysis is the practical workaround.
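A rough sketch of that workaround, using OpenCV to pull one frame every couple of seconds (the interval and paths are arbitrary); each saved frame can then go through the same image-level calls shown earlier.

```python
import cv2  # pip install opencv-python
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, every_n_seconds: float = 2.0) -> list:
    """Save one frame every N seconds for downstream image-level analysis."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, int(fps * every_n_seconds))
    saved, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            path = out / f"frame_{frame_idx:06d}.jpg"
            cv2.imwrite(str(path), frame)
            saved.append(path)
        frame_idx += 1
    cap.release()
    return saved
```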

Running Them via Ollama

Ollama supports several multimodal models, including LLaVA variants and BakLLaVA, through the same interface you'd use for text-only models. The n8n and Ollama pipeline approach works for multimodal tasks too: pass an image path or a base64-encoded image alongside your text prompt and the model handles the rest. The local API is identical in structure to the text endpoint, which makes integrating vision capability into existing local AI workflows straightforward.
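For completeness, here's the same image-plus-prompt call sketched with the official ollama Python client; the model tag and file name are placeholders, and the raw HTTP endpoint shown earlier behaves the same way.

```python
import ollama  # pip install ollama -- Python client for the local Ollama server

# Same chat-style call you'd make to a text-only model, plus an images list
response = ollama.chat(
    model="llava",  # any vision-capable model tag you've pulled
    messages=[{
        "role": "user",
        "content": "Describe the layout of this dashboard screenshot.",
        "images": ["dashboard.png"],  # file path; base64-encoded data also works
    }],
)
print(response["message"]["content"])
```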

Local multimodal models are genuinely useful for a narrower set of tasks than their cloud equivalents, and that’s fine. Know what they can do reliably, build workflows around those capabilities, and don’t expect them to replace a cloud vision API for tasks that require precision. For everything else, keeping your data local and your API costs at zero is a real advantage.
