When A.I. Passes This Test, Look Out

01/23/2025

If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that A.I. systems can’t pass. For years, A.I. systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, S.A.T.-caliber problems in areas like math, science and logic. Comparing the models’ scores over time served as a rough measure of A.I. progress. But A.I. systems eventually got too good at those tests, so new, harder tests were created — often with the types of questions graduate students might encounter on their exams. Those tests aren’t in good shape, either. New models from companies like OpenAI, Google and Anthropic have been getting high scores on many Ph.D.-level challenges, limiting those tests’ usefulness and leading to a chilling question: Are A.I. systems getting too smart for us to measure?

Full report : CAIS and Scale AI release “Humanity’s Last Exam”, which they claim is the hardest- AI test yet, consisting of ~3,000 multiple-choice and short answer questions.

Tagged: AI Benchmark

Subscribe Sign In

Related Posts