Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation
For years, the computer vision community has operated on two separate tracks: generative models (which produce images) and discriminative models (which understand them). The assumption was straightforward: models that are good at making pictures aren't necessarily good at reading them. A new paper from Google, titled "Image Generators are Generalist Vision Learners" (arXiv:2604.20329), published April 22, 2026, blows that assumption apart.

A team of Google DeepMind researchers introduced Vision Banana, a single unified model that surpasses or matches state-of-the-art specialist systems across a wide range of visual understanding tasks, including semantic segmentation, instance segmentation, monocular metric depth estimation, and surface normal estimation, while simultaneously retaining the image generation capabilities of its base model.

The LLM Analogy That Changes Everything

If you've worked with large language models, you already understand the...

