annas-blog.org, 2023-10-04
TL;DR: Anna’s Archive acquired a unique collection of 7.5 million / 350TB Chinese non-fiction books — larger than Library Genesis. We’re willing to give an LLM company exclusive access, in exchange for high-quality OCR and text extraction.
This is a short blog post. We’re looking for some company or institution to help us with OCR and text extraction for a massive collection we acquired, in exchange for exclusive early access. After the embargo period, we will of course release the entire collection.
High-quality academic text is extremely useful for training LLMs. While our collection is Chinese, it should even be useful for training English LLMs: models seem to encode concepts and knowledge regardless of the source language.
For this, text needs to be extracted from the scans. What does Anna’s Archive get out of it? Full-text search of the books for its users.
Because our goals align with those of LLM developers, we’re looking for a collaborator. We’re willing to give you exclusive early access to this collection in bulk for 1 year, if you can do proper OCR and text extraction. If you’re willing to share the entire code of your pipeline with us, we’d be willing to embargo the