What is Agentic Vision?
Agentic Vision transforms image understanding from a static, one-shot process into an active investigation. Instead of simply describing what it sees, the model formulates plans to zoom in, inspect, manipulate, and analyze images step-by-step—grounding answers in visual evidence gathered through code execution.
The Think-Act-Observe Loop
At the core of agentic vision is a rigorous iterative process that mirrors how humans investigate complex visual information.
Think
The model analyzes the user's request and the initial image, then formulates a multi-step plan for how to extract the needed information.
Act
The model generates and executes Python code to manipulate or analyze the image—cropping regions of interest, running calculations, counting objects, or drawing annotations.
Observe
The transformed image is appended back into the model's context window, allowing it to inspect the results before deciding on the next action or producing a final answer.
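The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real agent: the `think`, `act`, and `observe` functions are hypothetical stand-ins for calls to a vision model and a code-execution sandbox, and the stopping condition (a target zoom level) is an assumption for the example.

```python
# Minimal sketch of the Think-Act-Observe loop. All functions here are
# hypothetical stand-ins; a real agent would call a vision model and run
# generated code in a sandbox.

def think(context):
    """Plan the next operation from the query and images seen so far."""
    # Illustrative policy: keep zooming until the detail is legible.
    if context["zoom_level"] < 4:
        return {"op": "crop_and_zoom", "factor": 2}
    return {"op": "answer"}

def act(plan, context):
    """Execute the planned operation (e.g. run generated Python)."""
    if plan["op"] == "crop_and_zoom":
        context["zoom_level"] *= plan["factor"]
        return f"image at {context['zoom_level']}x zoom"
    return None

def observe(result, context):
    """Append the transformed image back into the model's context."""
    if result is not None:
        context["images"].append(result)

def run_agent(query, max_steps=5):
    context = {"query": query, "images": ["original image"], "zoom_level": 1}
    for _ in range(max_steps):
        plan = think(context)
        if plan["op"] == "answer":
            break
        observe(act(plan, context), context)
    return context

ctx = run_agent("What does the gauge read?")
print(ctx["zoom_level"])   # 4
print(len(ctx["images"]))  # 3
```

Each pass through the loop adds a new observation to the context, so later "think" steps can reason over everything gathered so far.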
Key Capabilities
Agentic vision enables several powerful capabilities that passive vision models cannot match.
Zoom & Inspect
The model detects when details are too small to read (like a distant gauge or serial number) and writes code to crop and re-examine the area at higher resolution.
Visual Math
Run multi-step calculations using code—summing line items on a receipt, measuring angles in a diagram, or generating charts from extracted data.
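For instance, once line items have been extracted from a receipt crop, the model can verify the printed total with code instead of mental arithmetic. The items and prices below are illustrative, not taken from a real receipt.

```python
# Sketch: verify a receipt total in code. The extracted line items are
# illustrative assumptions standing in for OCR output.

from decimal import Decimal

line_items = [
    ("Coffee", Decimal("4.50")),
    ("Bagel",  Decimal("3.25")),
    ("Juice",  Decimal("5.00")),
]
printed_total = Decimal("12.75")

computed = sum(price for _, price in line_items)
print(computed)                   # 12.75
print(computed == printed_total)  # True
```

Using `Decimal` rather than floats avoids rounding surprises when checking currency sums exactly.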
Image Annotation
Draw arrows, bounding boxes, or other annotations directly onto images to answer spatial questions like "Where should this item go?"
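The kind of annotation code the model might emit could look like the following sketch (it assumes Pillow is installed; the blank canvas, coordinates, and filename are illustrative stand-ins for a real photo and detected region).

```python
# Sketch of model-generated annotation code, assuming Pillow is available.
# The blank image and all coordinates are illustrative assumptions.

from PIL import Image, ImageDraw

img = Image.new("RGB", (400, 300), "white")  # stand-in for the real photo
draw = ImageDraw.Draw(img)

# Bounding box around the region of interest.
box = (150, 100, 250, 180)
draw.rectangle(box, outline="red", width=3)

# Arrow shaft plus a simple triangular head pointing at the box corner.
draw.line((60, 40, 150, 100), fill="red", width=3)
draw.polygon([(150, 100), (138, 92), (142, 108)], fill="red")

img.save("annotated.png")
print(img.size)  # (400, 300)
```

The annotated image is then fed back into the model's context, letting it confirm the marker actually lands on the intended object before answering.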
Iterative Refinement
If the first approach doesn't yield clear results, the model can try alternative strategies—different crop regions, image enhancement, or multiple counting methods.
How It Works
When you ask an agentic vision model a question about an image, it doesn't just look and respond. It reasons about what operations would help answer the question, executes code to perform those operations, and uses the results to inform its answer.
Receive Query
User asks a question about an image that requires detailed analysis.
Plan Operations
Model determines what visual operations (crop, zoom, annotate) would help answer the question.
Execute Code
Python code is generated and run to manipulate the image as planned.
Analyze Results
The modified image is fed back to the model for inspection.
Iterate or Answer
Model either performs additional operations or provides the final answer with evidence.
Example: Reading a Distant Serial Number
Imagine asking "What's the serial number on that device in the corner of the photo?"
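A hedged sketch of the code the model might generate in response: crop the corner region, then resample it larger so the small text becomes legible. The blank canvas, crop coordinates, and zoom factor are illustrative assumptions, not part of any real API.

```python
# Sketch: crop a corner region and upscale it so fine print is readable.
# The image, coordinates, and zoom factor are illustrative assumptions.

from PIL import Image

img = Image.new("RGB", (1024, 768), "gray")  # stand-in for the photo

# Step 1: crop the bottom-right corner where the device sits.
w, h = img.size
corner = img.crop((int(w * 0.75), int(h * 0.75), w, h))

# Step 2: upscale 4x with a high-quality filter to sharpen the text.
zoomed = corner.resize((corner.width * 4, corner.height * 4), Image.LANCZOS)

print(corner.size)  # (256, 192)
print(zoomed.size)  # (1024, 768)
```

In the agentic loop, the zoomed crop would be appended back into the model's context for a second read; if the serial number were still illegible, the model could try a tighter crop or an enhancement filter.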
Models with Agentic Vision
Several frontier models now support agentic vision capabilities.
Google Gemini 3 Flash
First major model to introduce "Agentic Vision" as a named feature, combining visual reasoning with code execution. Reports a 5-10% quality boost on vision benchmarks when code execution is enabled.
NVIDIA Cosmos Reason
A 7B parameter reasoning VLM designed for physical AI applications. Can understand and act in real-world environments using prior knowledge and physics understanding.
OpenAI Computer-Using Agent
Combines large reasoning models with reinforcement-learned UI interaction, enabling pixel-precise pointing at objects and UI elements.
Real-World Applications
Agentic vision is already being deployed in production systems.
Document Processing
Automatically zoom into tables, charts, and fine print to extract accurate data from complex documents.
Quality Inspection
Detect defects by systematically inspecting different regions of product images at high resolution.
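Systematic inspection often means scanning the image tile by tile so each region can be examined at full resolution. A minimal sketch (the tile size and overlap are assumptions; overlapping tiles avoid missing defects that straddle a boundary):

```python
# Sketch: generate overlapping tiles that cover an image, so each region
# can be cropped and inspected at full resolution. Tile size and overlap
# are illustrative assumptions.

def tile_regions(width, height, tile=512, overlap=64):
    """Yield (left, top, right, bottom) boxes covering the whole image."""
    step = tile - overlap
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            yield (left, top, min(left + tile, width), min(top + tile, height))

boxes = list(tile_regions(1024, 768))
print(len(boxes))            # 6
print(boxes[0], boxes[-1])   # (0, 0, 512, 512) (896, 448, 1024, 768)
```

Each box would then be cropped and passed through the inspection model, with any flagged tile re-examined at an even tighter crop.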
Spatial Reasoning
Answer "where should this go?" questions by annotating images with arrows and placement guides.
Receipt Analysis
Extract line items, calculate totals, and verify math by combining OCR with code-based computation.
Passive vs Agentic Vision
Understanding the fundamental difference in approach.
Passive Vision
Single forward pass through the model. What you see is what you get. Limited by initial image resolution and model attention.
Agentic Vision
Iterative investigation loop. Can zoom, crop, enhance, and re-examine. Grounds answers in executed code and visual evidence.
Key Takeaways
1. Agentic vision treats image understanding as an active investigation, not passive perception.
2. The Think-Act-Observe loop enables models to zoom, crop, and analyze images iteratively.
3. Code execution provides verifiable, grounded visual reasoning.
4. Enabling agentic capabilities shows a reported 5-10% improvement on vision benchmarks.
5. This paradigm bridges the gap between how humans and AI systems investigate visual information.