Fascinating new paper from Google DeepMind that makes a very convincing case that their Veo 3 model – and generative video models in general – serve a role in the machine-learning visual ecosystem similar to the one LLMs serve for text. LLMs took the ability to predict the next token and turned it into general-purpose foundation models for all manner of tasks that used to require dedicated models: summarization, translation, part-of-speech tagging, and more can now all be handled by single huge models, which are getting both more powerful and cheaper as time progresses. Generative video models like Veo 3 may well serve the same role for vision and image-reasoning tasks.