On Jagged AGI: o3, Gemini 2.5, and everything after

Amid today’s AI boom, it’s disconcerting that we still don’t know how to measure how smart, creative, or empathetic these systems are. Our tests for these traits, never great in the first place, were made for humans, not AI. Worse, our recent paper on prompting techniques finds that AI test scores can change dramatically based simply on how questions are phrased. Even famous challenges like the Turing Test, in which humans try to distinguish an AI from another person in a text conversation, were designed as thought experiments at a time when such tasks seemed impossible. Now that a new paper shows AI passing the Turing Test, we need to admit that we really don’t know what that means.

So it should come as little surprise that one of the most important milestones in AI development, Artificial General Intelligence (AGI), is badly defined and much debated. Everyone agrees that it has something to do with the ability of AIs to perform human-level tasks, though no one agrees whether this means expert or average human performance, or how many tasks, and of what kinds, an AI would need to master to qualify. Given this definitional morass, tracing AGI’s nuances and history, from its precursors, to its initial coining by Shane Legg, Ben Goertzel, and Peter Voss, to today, is challenging.

Full summary: Models like o3 and Gemini 2.5 Pro feel like “Jagged AGI”: unreliable, even at some mundane tasks, but still offering superhuman capabilities in many areas.