Thoughts by a non-economist on AI and economics

11/05/2025

METR has published a very influential paper by Kwa, West, et al. on measuring AI’s ability to complete long tasks. Its main result is the following remarkable graph. On the X axis is the release date of flagship LLMs. On the Y axis is the following measure of their capabilities: take software-engineering tasks that these models can solve successfully 50% of the time, and measure the time it takes humans to solve them. While it is not surprising that models improve over time, what makes this graph remarkable is that the Y axis is on a log scale, and the points still fall roughly on a straight line. This means that there is a fixed period of time after which models double the length of tasks they can complete successfully. Specifically, METR estimates this “doubling time” (which is proportional to the inverse of the slope of the line in this graph) at about 7 months, although they note that it may have accelerated recently (to as little as 3 months if considering only models after 2024).
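To make the relation between the slope and the doubling time concrete, here is a minimal Python sketch. It is not METR's code or data: the function names and the four data points are made up for illustration, and the points are constructed to double exactly every 7 months. The idea is simply to fit a straight line to log2 of the task length against time; the doubling time is then the inverse of the slope.

```python
import math

def fit_log2_slope(times_months, horizons_minutes):
    """Least-squares slope of log2(horizon) against time, in doublings per month."""
    ys = [math.log2(h) for h in horizons_minutes]
    n = len(times_months)
    mean_t = sum(times_months) / n
    mean_y = sum(ys) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_months, ys))
    var = sum((t - mean_t) ** 2 for t in times_months)
    return cov / var

def doubling_time_months(slope):
    """A slope of s doublings per month means one doubling every 1/s months."""
    return 1.0 / slope

def horizon_after(h0_minutes, months, doubling_months):
    """Extrapolate the task length, assuming the exponential trend continues."""
    return h0_minutes * 2 ** (months / doubling_months)

# Hypothetical data (NOT METR's measurements): release time in months since
# the first model, and the human time (in minutes) of tasks solved 50% of the time.
times = [0, 7, 14, 21]
horizons = [1, 2, 4, 8]

slope = fit_log2_slope(times, horizons)
print(doubling_time_months(slope))    # doubling time implied by the toy data
print(horizon_after(60.0, 14, 7.0))   # a 60-minute horizon, 14 months later
```

With a 7-month doubling time, a model that handles one-hour tasks today would handle four-hour tasks 14 months from now; with a 3-month doubling time, the same 14 months would give roughly a 25-fold increase instead. Any such extrapolation assumes the trend keeps holding.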

Full research: AI’s ability to complete long and complex software engineering tasks doubles every 6-7 months, but there is a “messiness tax” for real-world tasks.

