Today I am announcing OpenOrca, an open-source dataset and series of instruct-tuned language models.
As I read Orca: Progressive Learning from Complex Explanation Traces of GPT-4 by Mukherjee et al. of Microsoft, I had to consider the implications for Open Source AI.
This was pretty awesome stuff. But I realized that, while Microsoft would probably release their LLaMA-13b-based model (as of this writing, they still haven't), they might not release the dataset.
Therefore, I resolved to replicate their efforts: gather the data myself and train the models myself, so that OpenOrca can be released on other sizes of LLaMA as well as other foundation models such as Falcon, OpenLLaMA, RedPajama, MPT, and RWKV.
This was a nontrivial undertaking. With the help of an all-star team of open-source AI/ML engineers, we have completed the OpenOrca dataset.
Our dataset consists of:
- ~1 million FLANv2 entries augmented with GPT-4 completions
- ~3.5 million FLANv2 entries augmented with GPT-3.5 completions
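For anyone who wants to poke at the data once it is published, here is a minimal sketch of streaming a few entries with the Hugging Face `datasets` library. The repository ID `Open-Orca/OpenOrca` and the exact column names are assumptions for illustration; check the dataset card for the actual schema.

```python
# Minimal sketch: inspect a few OpenOrca entries without downloading the full dataset.
# The repo ID "Open-Orca/OpenOrca" is an assumption; adjust to the published dataset ID.
from datasets import load_dataset

# Stream the data so we don't pull millions of rows onto disk at once.
dataset = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

# Peek at the first few augmented FLANv2 entries.
for i, example in enumerate(dataset):
    print(example)
    if i >= 2:
        break
```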