‘Visual’ AI Models Might Not See Anything at All

July 14, 2024

Recent multimodal AI models such as GPT-4o and Gemini 1.5 Pro are marketed as able to understand images and audio as well as text. However, a new study suggests they may not perceive visual information the way humans do.

While no one asserts these AIs see like humans, marketing rhetoric often refers to their “vision capabilities” and “visual understanding.” They are portrayed as capable of tasks ranging from analyzing images to interpreting videos.

The study, conducted by researchers from Auburn University and the University of Alberta, tested leading multimodal models on basic visual tasks. The tasks included determining whether two shapes overlap, counting pentagons in an image, and identifying a circled letter in a word, all elementary enough for a first-grader to excel at.

“Our seven tasks are extremely simple, where humans would perform at 100% accuracy. We expect AIs to do the same, but they are currently NOT,” explained co-author Anh Nguyen in correspondence with TechCrunch. “Our findings emphasize that even the best models are still struggling.”

For instance, when asked whether two shapes such as circles overlap, the models performed inconsistently. GPT-4o did well when the circles were far apart but answered correctly only 18% of the time when they were close together. Gemini 1.5 Pro did better but still faltered at close distances, answering correctly in only 70% of cases.
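To show how simple the overlap question is when the geometry is known, here is a minimal Python sketch of a ground-truth check. It is an illustration only, not the researchers' code: the function name `circles_overlap` and the sample coordinates are hypothetical, and it assumes each circle is described by a center point and a radius.

```python
import math

def circles_overlap(center_a, radius_a, center_b, radius_b):
    """Return True if the two circles share any interior area.

    Hypothetical helper for illustration: two circles overlap when the
    distance between their centers is less than the sum of their radii.
    Circles that only touch at a single point are treated as not overlapping.
    """
    distance = math.dist(center_a, center_b)
    return distance < radius_a + radius_b

# Far apart: centers 3 units apart, radii sum to 2.
print(circles_overlap((0, 0), 1.0, (3, 0), 1.0))
# Close together: centers 1.5 units apart, radii sum to 2.
print(circles_overlap((0, 0), 1.0, (1.5, 0), 1.0))
```

A check like this gives an unambiguous answer for every pair of circles, which is the kind of yes-or-no question the models were asked to answer from the image alone.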

Counting tasks also revealed stark limitations. While the models excelled when an image showed five interlocking circles, their accuracy plummeted when additional rings were introduced. This variability suggests a lack of true visual understanding and a reliance on patterns ingrained during training, such as the Olympic rings, which appear frequently in training data.

Researchers speculate that while these models process visual data abstractly, for example by noting that a circle is present in an image, they lack the capacity for finer visual judgment. This gap leads to erratic performance on seemingly straightforward tasks, undermining claims of comprehensive visual understanding.

In conclusion, while these AI models excel in certain domains, such as recognizing human actions or everyday objects, they operate without true visual perception. Research such as this is crucial in dispelling misconceptions propagated by AI marketing and illuminating the actual capabilities and limitations of these advanced systems.

Forbes Staff

