
Show HN: how I built the largest open database of Australian law by ubutler
Late last year, while attempting to train a large language model to solve legal problems, I made a surprising discovery: there weren’t any open databases of Australian law to train my model on. While there were certainly a few free-to-access legal databases, none were truly open, at least not in the sense of being able to just download their data and start training models without fear of infringing on anyone’s copyright. They all had policies against web scraping, and they were all either unable or unwilling to license their content.
So, before I could start training an LLM on Australian legal data, I’d need to get my hands on that data. As with most of my projects, this sounded much easier than it turned out to be. Almost a year later, I am still hard at work expanding my database to encompass all of Australia’s legal code.
In this article, I’ll walk you through the entire process of how I built the Open Australian Legal Corpus, the largest open database of Australian law, from months-long negotiations with governments to reverse engineering ancient web technologies to hacking together a multitude of different solutions for extracting text from documents.
The first step was to ask Australian governments and courts for permission to scrape their data. For some jurisdictions like New South Wales and Queensland, which incidentally both use the same legislation management system, this was relatively easy, and I was helpfully directed to the endpoints of their public APIs. For others, however, the process was anything but simple. A few imposed restrictions on my access to and use of their data, one simply didn’t respond to my enquiries, and a great many outright refused my requests, often without giving a reason. Perhaps my most memorable interaction with a data source was when I was told that I would overburden their team by requiring them to manually sift through and