searchcode.com’s SQLite database is probably one of the largest in the world, at least for a public facing website. Its actual size is 6.4 TB, which is probably 6 terabytes bigger than yours.
-rw-r--r-- 1 searchcode searchcode 6.4T Feb 17 04:30 searchcode.db
At least, I think it’s bigger. I have no evidence to the contrary, and it is by far the largest I have ever heard of. Poking around the internet did not turn up anyone talking publicly about anything bigger. The largest I could find using any search engine or LLM was 1 TB (without a source), plus a few people on HN and Reddit claiming to be working with SQLite databases in the tens of gigabytes.
However, lack of evidence does not mean such a thing does not exist, so if you do have a larger database please let me know, either by mocking me in some comment section somewhere, or by direct abuse using my email below, and I will be appropriately scolded.
Probably more interesting is why searchcode.com has such a large SQLite database. For those who don’t know, https://searchcode.com/ is my side / desperately-needs-to-pay-for-itself / passion project, which as the name suggests is a place to search source code. It covers multiple sources including GitHub, Bitbucket, CodePlex, SourceForge and GitLab, and 300+ languages. I have written about it a lot; probably one of the more interesting posts is the one about building its own index: https://boyter.org/posts/how-i-built-my-own-index-for-searchcode/
searchcode.com itself is now officially a ship of Theseus project, as every part I started with has now been replaced. A brief history:
- First version released using PHP, CodeIgniter, MySQL, Memcached, Apache2 and Sphinx search
- Rewritten using Python, Django, MySQL, Memcached, Sphinx search, Nginx and RabbitMQ
- Never publicly released version using Java, MySQL, Memcached, Nginx and Sphinx search
- Rewritten at the start of Covid-19 using Go, MySQL, Redis, Caddy and Manticore search
- Replaced Manticore search with a custom index, so the stack consisted of Go, MySQL, Redis and Caddy
- As of a few days ago: Go, SQLite and Caddy
The constant through all of this, until now, was MySQL as the storage layer. The reasons for using it initially were that it was there, I knew how to work with it, and it would scale along with my needs fairly well. So what changed? If you look at my previous choices you will see a general move towards reducing the number of dependencies. The older and more crusty I get, the more I appreciate having a single binary I can just deploy. Single binary deploys are very simple to reason about.
So why SQLite? Well mostly because you can compile it directly into the binary, so I can have my single binary deploy. No need to install any dependencies. Why not write my own? I am not confident enough in my abilities to write something like this myself, at least not in any reasonable time frame. While I may be crazy enough to write my own index engine, I am not crazy enough to write my own storage persistence layer.
As for other embedded databases, I had previously played around with embedded Go databases such as bbolt, but they never worked at the sort of scale I was expecting to deal with, and since the data I was working with was already mostly relational I wanted to stay in the SQL world.
SQLite has not been totally fault free however. Previously I had used it a
7 Comments
antithesis-nl
Yup, they win. My biggest SQLite database is 1.7TB with, as of just now 2314851188 records (all JSON documents with a few keyword indexes via json_extract).
Works like a charm, as in: the web app consuming the API linked to it returns paginated results for any relevant search term within a second or so, for a handful of concurrent users.
1f60c
searchcode doesn't seem to work for me. All queries (even the ones recommended by the site) unfortunately return zero results. Maybe it got hugged?
https://searchcode.com/?q=re.compile+lang%3Apython
bborud
I've been using RWMutex'es around SQLite calls as a precaution since I couldn't quite figure out if it was safe for concurrent use. This is perhaps overkill?
Since I do a lot of cross-compiling I have been using https://modernc.org/sqlite for SQLite. Does anyone have some knowledge/experience/observations to share on concurrency and SQLite in Go?
lokimedes
I’ll bet you some CERN PhD student has a forgotten 100 TB detector calibration database in sqlite somewhere in the dead caverns of collaboration effort.
leighleighleigh
I've been looking for a service just like searchcode, to try and track down obscure source code. All the best, hope it can be sustainable for you.
Alifatisk
Is the site like grep.app?
feverzsj
I don't think any relational db scales reads vertically better than SQLite. For writes, you can batch them or distribute them to attached dbs. Either way, you may lose some transaction guarantees.