It took just one weekend for the self-proclaimed new king of open source AI models to have its crown tarnished. Reflection 70B, a variant of Meta's Llama 3.1 open source large language model (LLM) (or was it a variant of the older Llama 3? More on that below), was trained and released by small New York startup HyperWrite (formerly OthersideAI) and boasted impressive, chart-leading scores on third-party benchmarks. Those results have now been aggressively questioned, as other third-party evaluators have failed to reproduce some of the claimed performance measures.

The model was triumphantly announced in a post on the social network X by HyperWrite AI co-founder and CEO Matt Shumer on Friday, September 6, 2024, as "the world's top open-source model."

In a series of public X posts documenting parts of Reflection 70B's training process, and in a subsequent interview over X direct messages with VentureBeat, Shumer explained how the new LLM used "Reflection Tuning," a previously documented technique developed by researchers outside the company, in which an LLM checks the correctness of, or "reflects" on, its own generated responses before outputting them to users, improving accuracy on a number of tasks in writing, math, and other domains.

However, on Saturday, September 7, a day after the initial HyperWrite announcement and VentureBeat article were published, Artificial Analysis, an organization dedicated to "Independent analysis of AI models and hosting providers," posted its own analysis on X stating that "our evaluation of Reflection Llama 3.1 70B's MMLU score" (referencing the commonly used Massive Multitask Language Understanding benchmark) "resulted in the same score as Llama 3 70B and significantly lower than Meta's Llama 3.1 70B," a major discrepancy with the results HyperWrite and Shumer originally posted.
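For readers unfamiliar with the idea, the sketch below approximates the "reflect before answering" pattern at the prompt level. It is a minimal illustration under stated assumptions, not HyperWrite's code: Reflection Tuning as published bakes this behavior into training, while the `answer_with_reflection` helper, the `call_llm` callback, and the `REFLECTION_PROMPT` text here are all hypothetical.

```python
from typing import Callable

# Hypothetical prompt that asks the model to critique and, if needed,
# correct its own draft answer before the user ever sees it.
REFLECTION_PROMPT = (
    "You previously answered:\n{answer}\n\n"
    "Check that answer for factual or logical errors. If you find any, "
    "output a corrected answer; otherwise restate the original answer."
)

def answer_with_reflection(
    question: str,
    call_llm: Callable[[str], str],  # any chat-completion wrapper: prompt in, text out
    rounds: int = 1,
) -> str:
    """Draft an answer, then run one or more self-critique passes
    before returning the final response to the user."""
    answer = call_llm(question)  # first-pass draft
    for _ in range(rounds):
        # Feed the draft back to the model for review and possible revision.
        answer = call_llm(REFLECTION_PROMPT.format(answer=answer))
    return answer
```

A prompt-level loop like this trades extra inference calls for the training-time approach the original Reflection Tuning research describes, but the underlying idea, generate, self-check, then respond, is the same.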
Full story: New open source AI leader Reflection 70B's performance questioned, accused of 'fraud.'