Molmo and PixMo: A New Way to Build Vision-Language Models

The field of vision-language models (VLMs) is evolving rapidly, but much of the progress happens behind closed doors. Proprietary models and datasets are hard to study, reproduce, or build on, which slows research and collaboration. Molmo and PixMo take the opposite approach: both the models and the data they are trained on are open.

The Problem with Closed-Source VLMs

Most of today's cutting-edge VLMs are proprietary. Companies keep their models, training data, and even their code secret. This secrecy makes it difficult for the research community to understand how these models work, reproduce their results, or build on them, and that in turn slows progress on VLMs as a whole.

Enter Molmo and PixMo: Open for All

Molmo is a family of open-source VLMs, and PixMo is the open collection of datasets they are trained on. Because the models, the data, and the code are all available, researchers can inspect how Molmo was built, reproduce the results, and extend the work. Learn more in the paper and try the demo online.

The Secret Sauce: PixMo's Innovative Data

What makes PixMo stand out? Many open VLM datasets rely on synthetic data generated by other (often proprietary) VLMs. PixMo is instead built from the ground up on rich, human-generated descriptions of images. The team used a novel collection method: they asked people to describe images out loud rather than in writing, which encourages more natural and detailed descriptions. They also recorded the audio, so there is evidence that the descriptions are authentic.

Pointing the Way to the Future

PixMo also includes a new type of data: pointing data. This data teaches Molmo to point at specific elements within an image, so it can answer location questions ("where is it?"), count objects by pointing at each one, and give visual explanations for its answers. Pointing also opens up ways for VLMs to act on what they see, such as controlling robots or interpreting visual information in complex environments.
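
To make the idea concrete, here is a minimal Python sketch of how an application might consume pointing output. It assumes, purely for illustration, that the model's reply embeds each point as a tag of the form <point x="..." y="...">label</point>, with x and y given as percentages of the image width and height. The tag format, the sample reply, and the helper names parse_points, to_pixels, and count_objects are assumptions made for this example, not a documented Molmo interface; check the model's own documentation for the actual output format.

```python
import re
from typing import List, Tuple

# ASSUMPTION (illustrative only): points appear in the reply as
# <point x="..." y="...">label</point>, with x and y expressed as
# percentages (0-100) of the image width and height.
POINT_TAG = re.compile(r'<point\s+x="([0-9.]+)"\s+y="([0-9.]+)"')

Point = Tuple[float, float]


def parse_points(reply: str) -> List[Point]:
    """Extract (x, y) percentage pairs from every point tag in the reply."""
    return [(float(x), float(y)) for x, y in POINT_TAG.findall(reply)]


def to_pixels(points: List[Point], width: int, height: int) -> List[Tuple[int, int]]:
    """Convert percentage coordinates to pixel coordinates for one image size."""
    return [(round(x / 100 * width), round(y / 100 * height)) for x, y in points]


def count_objects(reply: str) -> int:
    """Counting by pointing: the count is the number of points returned."""
    return len(parse_points(reply))


# Hypothetical reply to "Point to the mugs." for a 640x480 image.
reply = (
    'Mugs: <point x="25.0" y="50.0" alt="mug">mug</point> '
    '<point x="75.0" y="50.0" alt="mug">mug</point>'
)
print(count_objects(reply))
print(to_pixels(parse_points(reply), width=640, height=480))
```

In this sketch, counting falls out of pointing: each object gets one point, so the count is simply the number of points, and the same points can be drawn on the image as a visual explanation of the answer.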

Impressive Performance

Figure: Evaluation results for Molmo

Molmo models trained on PixMo rival, and in some cases surpass, leading proprietary VLMs. MolmoE-1B, the most efficient model in the family, performs comparably to GPT-4V, a powerful closed-source VLM. The most advanced Molmo model outperforms several other proprietary models, which shows that open development can compete with closed systems.

A Win for Open Science and Open Source

Molmo and PixMo represent a significant step forward for VLM research. Because the models and data are open, anyone can access them, study them, and build on them, which fosters collaboration and speeds up progress. This is a win for open science, and it points toward a more transparent and collaborative future for vision-language models.