Overview.
We are thrilled to announce the release of OpenFlamingo, an open-source reproduction of DeepMind’s Flamingo model. At its core, OpenFlamingo is a framework that enables training and evaluation of large multimodal models (LMMs). Check out our GitHub repository and demo to get started!
For this first release, our contributions are as follows:
- 🏋️ A Python framework to train Flamingo-style LMMs (based on Lucidrains’ flamingo implementation and David Hansmair’s flamingo-mini repository).
- 🪅 A large-scale multimodal dataset with interleaved image and text sequences.
- 🧪 An in-context learning evaluation benchmark for vision-language tasks.
- 🤖 A first version of our OpenFlamingo-9B model based on LLaMA, with much better models to come!
Recent open-source LMM releases such as BLIP-2 and FROMAGe have shown the exciting potential of multimodal systems. We hope that OpenFlamingo will help drive progress in multimodal machine learning, and we have more exciting contributions in the pipeline, so stay tuned!
Goal.
Our goal with OpenFlamingo is to develop a multimodal system that can tackle a diverse range of vision-language tasks. Ultimately, we aim to match the power and versatility of GPT-4 in handling visual and text input. To achieve this goal, we are creating an open-source version of DeepMind’s Flamingo model, an LMM capable of processing and reasoning about images, videos, and text. We are committed to building fully open-source models, and we believe this transparency is essential for fostering collaboration, accelerating progress, and democratizing access to state-of-the-art LMMs. Our release is the first step towards this goal.
We are sharing the first checkpoint of our OpenFlamingo-9B model. While the model is not yet fully optimized, it demonstrates the potential of this project. By working together and receiving feedback from the community, we can train better LMMs. We encourage the community to participate in the development process by providing feedback and contributing to the repository.
Technical Details.
Our implementation largely follows that of Flamingo. Flamingo models are trained on large-scale web corpora containing interleaved text and images, which is crucial for endowing them with in-context few-shot learning capabilities. OpenFlamingo implements the same architecture (Perceiver resamplers, cross-attention layers) proposed in the original Flamingo paper. However, since the training data for Flamingo is not available to the public, we use open-source datasets for training our models. Specifically, the released OpenFlamingo-9B checkpoint is trained on 5M samples from our new Multimodal C4 dataset and 10M samples from LAION-2B.
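To make the architecture concrete, here is a minimal sketch of a Flamingo-style gated cross-attention layer as described in the original paper: language-model hidden states cross-attend to visual tokens produced by the Perceiver resampler, and zero-initialized tanh gates leave the frozen language model unchanged at the start of training. This is an illustrative simplification, not the actual OpenFlamingo code; the module and argument names are our own.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative Flamingo-style layer: text tokens cross-attend to resampled
    visual tokens, with zero-initialized tanh gates so the frozen language
    model is unaffected at the beginning of training."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Gates start at 0, so tanh(gate) = 0 and the block is initially an identity map.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens:   (batch, text_len, dim)  -- language model hidden states
        # visual_tokens: (batch, vis_len, dim)   -- outputs of the Perceiver resampler
        attn_out, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x

# Example: 4 text tokens attending to 64 resampled visual tokens.
block = GatedCrossAttentionBlock(dim=512)
out = block(torch.randn(2, 4, 512), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 4, 512])
```

In the full model, blocks like this are interleaved between the frozen layers of the language model, so only the cross-attention and resampler parameters are trained.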
Multimodal C4.
The Multimodal C4 dataset is an expansion of the text-only C4 dataset, which was used to train T5 models. For each document in the C4 en.clean dataset, we retrieve the original webpage from Common Crawl, then collect the downloadable images. Data cleaning is carried out through deduplication and content filtering, which aims to eliminate not-safe-for-work (NSFW) and unrelated images, such as advertisements. Additionally, we run face detection and discard images with positive identifications. Finally, images and sentences are interleaved using bipartite matching within a document, with CLIP ViT-L/14 image-text similarities serving as edge weights.
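As a rough illustration of the interleaving step, the snippet below assigns images to sentences within a single document by solving a maximum-weight bipartite matching over an image-text similarity matrix. It is a minimal sketch: the similarity values here are placeholders, whereas the real pipeline computes them with CLIP ViT-L/14.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Placeholder similarity matrix: rows = images, columns = sentences in one document.
# In the real pipeline, these entries are CLIP ViT-L/14 image-text similarities.
similarity = np.array([
    [0.31, 0.05, 0.12],
    [0.08, 0.27, 0.22],
])

# Maximum-weight bipartite matching: linear_sum_assignment minimizes cost,
# so we negate the similarities. Each image is then placed next to its
# matched sentence when building the interleaved sequence.
img_idx, sent_idx = linear_sum_assignment(-similarity)
for i, j in zip(img_idx, sent_idx):
    print(f"image {i} -> sentence {j} (similarity {similarity[i, j]:.2f})")
```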