Agentic Vision

How AI models transform passive image viewing into active visual investigation through code execution and iterative reasoning.

What is Agentic Vision?

Agentic Vision transforms image understanding from a static, one-shot process into an active investigation. Instead of simply describing what it sees, the model formulates plans to zoom in, inspect, manipulate, and analyze images step-by-step—grounding answers in visual evidence gathered through code execution.


The Think-Act-Observe Loop

At the core of agentic vision is a rigorous iterative process that mirrors how humans investigate complex visual information.

1

Think

The model analyzes the user's request and the initial image, then formulates a multi-step plan for how to extract the needed information.

2

Act

The model generates and executes Python code to manipulate or analyze the image—cropping regions of interest, running calculations, counting objects, or drawing annotations.

3

Observe

The transformed image is appended back into the model's context window, allowing it to inspect the results before deciding on the next action or producing a final answer.
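The loop above can be sketched in a few lines of Python. This is an illustrative skeleton, not a real model API: `think`, `act`, and the scripted plan are hypothetical stand-ins, and the "image" is a toy nested list.

```python
# Minimal sketch of a Think-Act-Observe loop. The model and tool
# functions here are hypothetical stand-ins, not a real API.

def think(context):
    # A real model would plan the next operation from its context;
    # here we just pop a scripted plan for illustration.
    return context["plan"].pop(0) if context["plan"] else None

def act(operation, image):
    # Execute the planned operation (e.g. a crop) on the image.
    return operation(image)

def agentic_loop(image, plan, max_steps=5):
    context = {"plan": list(plan), "observations": []}
    for _ in range(max_steps):
        operation = think(context)             # Think: choose the next action
        if operation is None:
            break                              # Nothing left to do: answer
        image = act(operation, image)          # Act: run code on the image
        context["observations"].append(image)  # Observe: feed the result back
    return context["observations"]

# Toy "image" (a nested list) and a crop operation
toy_image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
crop_top_left = lambda img: [row[:2] for row in img[:2]]
observations = agentic_loop(toy_image, [crop_top_left])
print(observations[-1])  # [[1, 2], [4, 5]]
```

In a real system the observation step would append an actual rendered image to the model's context rather than a Python value.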

Key Capabilities

Agentic vision enables several powerful capabilities that passive vision models cannot match.

Zoom & Inspect

The model detects when details are too small to read (like a distant gauge or serial number) and writes code to crop and re-examine the area at higher resolution.

Visual Math

Run multi-step calculations using code—summing line items on a receipt, measuring angles in a diagram, or generating charts from extracted data.
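As a toy illustration of code-based visual math, suppose the model has already extracted line items from a receipt; the items and tax rate below are made up for the example.

```python
# Sketch: once line items have been read off a receipt (e.g. via OCR),
# the model verifies totals with executed code instead of mental math.
# These items and the tax rate are illustrative, not from a real receipt.

line_items = [
    ("Coffee", 4.50),
    ("Sandwich", 8.25),
    ("Cookie", 2.75),
]
TAX_RATE = 0.08  # assumed tax rate for illustration

subtotal = sum(price for _, price in line_items)
total = round(subtotal * (1 + TAX_RATE), 2)
print(f"Subtotal: {subtotal:.2f}, Total with tax: {total:.2f}")
# Subtotal: 15.50, Total with tax: 16.74
```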

Image Annotation

Draw arrows, bounding boxes, or other annotations directly onto images to answer spatial questions like "Where should this item go?"
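A minimal sketch of annotation using Pillow; the blank canvas and box coordinates are placeholders for a real image and a model-chosen region.

```python
# Sketch: drawing a bounding box onto an image to answer a spatial
# question. The canvas and coordinates are illustrative placeholders.
from PIL import Image, ImageDraw

image = Image.new("RGB", (200, 150), "white")  # stand-in for a real photo
draw = ImageDraw.Draw(image)

# Highlight the region the model wants to point at.
box = (50, 40, 120, 100)  # (left, top, right, bottom)
draw.rectangle(box, outline="red", width=3)

# A pixel on the box edge is now red.
print(image.getpixel((50, 40)))  # (255, 0, 0)
```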

Iterative Refinement

If the first approach doesn't yield clear results, the model can try alternative strategies—different crop regions, image enhancement, or multiple counting methods.

How It Works

When you ask an agentic vision model a question about an image, it doesn't just look and respond. It reasons about what operations would help answer the question, executes code to perform those operations, and uses the results to inform its answer.

1

Receive Query

User asks a question about an image that requires detailed analysis.

2

Plan Operations

Model determines what visual operations (crop, zoom, annotate) would help answer the question.

3

Execute Code

Python code is generated and run to manipulate the image as planned.

4

Analyze Results

The modified image is fed back to the model for inspection.

5

Iterate or Answer

Model either performs additional operations or provides the final answer with evidence.

Example: Reading a Distant Serial Number

Imagine asking "What's the serial number on that device in the corner of the photo?"

1. Model identifies that the device sits in the bottom-right corner of the frame
2. Generates code to crop that region and upscale it 4x
3. Inspects the zoomed image and reads the serial number text
4. Returns the serial number with confidence, noting the crop it used
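Step 2 might look like the following Pillow snippet; the blank 800×600 image stands in for the actual photo, and the crop coordinates are illustrative.

```python
# Sketch of the crop-and-upscale step. Uses Pillow; the source image
# here is a blank placeholder for the real photo.
from PIL import Image

image = Image.new("RGB", (800, 600), "gray")  # stand-in for the photo

# Crop the bottom-right quadrant, where the device was located.
w, h = image.size
region = image.crop((w // 2, h // 2, w, h))   # 400 x 300 pixels

# Upscale 4x so fine text becomes legible on re-inspection.
zoomed = region.resize((region.width * 4, region.height * 4),
                       Image.LANCZOS)
print(zoomed.size)  # (1600, 1200)
```

The zoomed crop, not the original photo, is what gets appended back into the model's context for the read.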

Models with Agentic Vision

Several frontier models now support agentic vision capabilities.

Google Gemini 3 Flash

First major model to introduce "Agentic Vision" as a named feature, combining visual reasoning with code execution. Shows a 5-10% quality boost on vision benchmarks when code execution is enabled.

NVIDIA Cosmos Reason

A 7B-parameter reasoning VLM designed for physical AI applications. It can understand and act in real-world environments using prior knowledge and an understanding of physics.

OpenAI Computer-Using Agent

Combines large reasoning models with reinforcement-learned UI interaction, enabling pixel-precise pointing at objects and UI elements.

Real-World Applications

Agentic vision is already being deployed in production systems.

Document Processing

Automatically zoom into tables, charts, and fine print to extract accurate data from complex documents.

Quality Inspection

Detect defects by systematically inspecting different regions of product images at high resolution.

Spatial Reasoning

Answer "where should this go?" questions by annotating images with arrows and placement guides.

Receipt Analysis

Extract line items, calculate totals, and verify math by combining OCR with code-based computation.

Passive vs Agentic Vision

Understanding the fundamental difference in approach.

Passive Vision

Single forward pass through the model. What you see is what you get. Limited by the initial image resolution and the model's attention.

Agentic Vision

Iterative investigation loop. Can zoom, crop, enhance, and re-examine. Grounds answers in executed code and visual evidence.

Key Takeaways

  • Agentic vision treats image understanding as an active investigation, not passive perception
  • The Think-Act-Observe loop enables models to zoom, crop, and analyze images iteratively
  • Code execution provides verifiable, grounded visual reasoning
  • Enabling agentic capabilities yields a 5-10% improvement on vision benchmarks
  • This paradigm bridges the gap between how humans and AI systems investigate visual information