As competition in the generative AI field shifts toward multimodal models, Meta has released a preview of what could be its answer to the models released by frontier labs. Chameleon, its new family of models, has been designed to be natively multimodal rather than assembled from components trained on different modalities. While Meta has not released the models yet, its reported experiments show that Chameleon achieves state-of-the-art performance on various tasks, including image captioning and visual question answering (VQA), while remaining competitive on text-only tasks. Chameleon's architecture can unlock new AI applications that require a deep understanding of both visual and textual information.

The popular way to create multimodal foundation models is to patch together models that have been trained for different modalities. This approach is called "late fusion": the AI system receives different modalities, encodes them with separate models, and then fuses the encodings for inference. While late fusion works well, it limits the models' ability to integrate information across modalities and to generate sequences of interleaved images and text.

Chameleon instead uses an "early-fusion token-based mixed-modal" architecture, which means it has been designed from the ground up to learn from an interleaved mixture of images, text, code, and other modalities. Chameleon transforms images into discrete tokens, much as language models do with words, and uses a unified vocabulary consisting of text, code, and image tokens. This makes it possible to apply the same transformer architecture to sequences that contain both image and text tokens.

According to the researchers, the model most similar to Chameleon is Google Gemini, which also uses an early-fusion token-based approach. However, Gemini uses separate image decoders in the generation phase, whereas Chameleon is an end-to-end model that both processes and generates tokens.
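To make the unified-vocabulary idea concrete, here is a minimal sketch of how discrete image tokens and text tokens could share one ID space so a single transformer can consume the interleaved sequence. The vocabulary sizes, function names, and token IDs below are illustrative assumptions, not Chameleon's actual tokenizer.

```python
# Toy sketch of an early-fusion, token-based mixed-modal sequence.
# All sizes and IDs are illustrative assumptions, not Chameleon's actuals.

TEXT_VOCAB_SIZE = 32000      # assumed text/code vocabulary size
IMAGE_CODEBOOK_SIZE = 8192   # assumed image quantizer codebook size

# Image tokens share one vocabulary with text by occupying an
# ID range offset past the text IDs.
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE

def image_to_tokens(codebook_indices):
    """Map discrete image codes (e.g., from a vector-quantizing image
    tokenizer) into the unified vocabulary's image ID range."""
    return [IMAGE_TOKEN_OFFSET + i for i in codebook_indices]

def build_mixed_sequence(text_ids, image_codes):
    """Concatenate text and image tokens into one flat sequence that a
    single transformer could process end to end."""
    return text_ids + image_to_tokens(image_codes)

# Example: a pretend-tokenized caption followed by a tiny fake
# quantized image.
text_ids = [15, 204, 7]        # hypothetical text token IDs
image_codes = [0, 511, 8191]   # hypothetical image codebook indices
sequence = build_mixed_sequence(text_ids, image_codes)
print(sequence)  # [15, 204, 7, 32000, 32511, 40191]
```

Because every token, textual or visual, is just an integer in one shared ID space, the same embedding table and transformer stack handle both, and generation can emit image and text tokens in any interleaved order.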
Full report: Meta introduces Chameleon, a state-of-the-art multimodal model.