Google DeepMind this week added agentic vision capabilities to its Gemini 3 Flash model, turning image analysis into an active rather than a passive task.
While typical multimodal models process images in a single “glance,” the new agentic capabilities allow the model to actively study a picture and home in on specific details, such as street signs or a serial number on a microchip.
The new feature works by generating and running Python code that zooms, manipulates and inspects images methodically.
“By combining visual reasoning with code execution, one of the first tools supported by Agentic Vision, the model formulates plans to zoom in, inspect and manipulate images step-by-step, grounding answers in visual evidence,” Rohan Doshi, product manager at Google DeepMind, wrote in a blog post about the announcement.
The feature uses a Think-Act-Observe loop: Gemini 3 Flash studies a user query and image and formulates a plan, runs Python code to actively analyze the image, and then inspects the results before generating its final response.
According to Google, the update delivered a quality improvement of between 5% and 10% across vision benchmarks.
Google said the update has already demonstrated a range of new agentic behaviors in Google AI Studio, such as iterative zooming, direct image annotation and visual plotting. The latter is said to reduce hallucinations, a common problem with visual math tasks.
Looking ahead, the company said it plans to add more implicit code-driven behaviors to the model, meaning certain capabilities that currently require a specific prompt will run autonomously.
More features, such as web and reverse image search, as well as a greater range of model sizes, are also expected to be rolled out in the future.



