Fascinating new paper from Google DeepMind that makes a very convincing case that their Veo 3 model – and generative video models in general – serve a role in the machine-learning visual ecosystem similar to the one LLMs serve for text. LLMs took the ability to predict the next token and turned it into general-purpose foundation models for all manner of tasks that used to require dedicated models: summarization, translation, part-of-speech tagging, and more can now all be handled by single huge models, which are getting both more powerful and cheaper as time progresses. Generative video models like Veo 3 may well serve the same role for vision and image-reasoning tasks.