LLM companies to access largest Chinese non-fiction book collection by sillysaurusx

annas-blog.org, 2023-10-04, Chinese version 中文版, Discuss on Hacker News

TL;DR: Anna’s Archive acquired a unique collection of 7.5 million / 350TB Chinese non-fiction books — larger than Library Genesis. We’re willing to give an LLM company exclusive access, in exchange for high-quality OCR and text extraction.

This is a short blog post. We’re looking for some company or institution to help us with OCR and text extraction for a massive collection we acquired, in exchange for exclusive early access. After the embargo period, we will of course release the entire collection.

High-quality academic text is extremely useful for training LLMs. While our collection is Chinese, it should be useful even for training English LLMs: models seem to encode concepts and knowledge regardless of the source language.

For this, text needs to be extracted from the scans. What does Anna’s Archive get out of it? Full-text search of the books for its users.

Because our goals align with those of LLM developers, we’re looking for a collaborator. We’re willing to give you exclusive early access to this collection in bulk for 1 year, if you can do proper OCR and text extraction. If you’re willing to share the entire code of your pipeline with us, we’d be willing to embargo the
