(fyi, here’s the code for everything I built below, you can clone and deploy the full app here as well.)
“AskHN” is a GPT-3 bot built on a corpus of over 6.5 million Hacker News comments to represent the collective wisdom of the HN community in a single bot. You can query it in our Discord using the /askhn
command if you want to play around (I’ve rate-limited the bot for now to keep my OpenAI bill from bankrupting me, so you might have to wait around for your spot).
More details on how I built it below, but I found this LLM embodiment of HN’s wisdom to be an impressive and genuinely useful resource. Thousands of the sharpest minds in technology have volunteered their expertise, wisdom, and occasional flames over the years. There are very few topics Ask HN Bot does not have a solid answer to, even the question of finding love!
It does well with job advice:
…classic flame wars:
…the important questions:
…and unanswerable questions:
The GPT-3 prompted summaries are pretty good! And the linked articles are all highly relevant: the 1,536 dimensions of OpenAI's ada-002 embeddings are enough to slice the semantic space for this use case.
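"Relevance" here is just vector math: two texts are close when the cosine similarity of their embedding vectors is high. A minimal sketch in pure Python (in practice you'd use numpy or your database's built-in distance operator):

```python
import math

def cosine_similarity(a, b):
    # a, b: equal-length dense vectors, e.g. 1,536-dim ada-002 embeddings
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Since OpenAI embeddings are unit-length, the dot product alone gives the same ranking, but the normalized form is safer if you ever mix in vectors from another source.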
Its main failure mode is summarizing contradictory or irrelevant comments into a single response, which can lead to incoherent answers.
I built it in Patterns Studio over the course of a few days. Here was my rough plan at the start:
- Ingest the entire HN corpus (bulk or API) – 34.8 million stories and comments as of this writing
- Play around with filtering to get the right set of content (top articles, top comments)
- Bundle article title and top comment snippets into a document and fetch the OpenAI embedding via text-embedding-ada-002
- Index the embeddings in a database
- Accept questions via Discord bot (as well as manual input for testing)
- Look up closest embeddings
- Put top matching content into a prompt and ask GPT-3 to summarize
- Return summary along with direct links to comments back to Discord user
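The "put top matching content into a prompt" step is mostly string assembly. A sketch of what that could look like; `build_prompt` and its instruction wording are illustrative, not the prompt I actually used:

```python
def build_prompt(question, matches):
    """Assemble a GPT-3 summarization prompt from retrieved HN content.

    `matches` is a list of (story_title, comment_snippet) pairs pulled
    from the embedding lookup; this wording is a hypothetical example.
    """
    context = "\n\n".join(
        f"Story: {title}\nComment: {snippet}" for title, snippet in matches
    )
    return (
        "Answer the question using only the Hacker News content below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The resulting string goes to the completions endpoint, and the (title, comment) pairs double as the direct links returned to the Discord user.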
This plan went OK, but results tended to be a bit too generic: I didn’t have enough of the specific relevant comment content in the prompt to give technical and specific answers. Embedding every single one of the 6.5 million eligible comments was prohibitively time-consuming and expensive (12 hours and ~$2,000). So I ended up with this extension for step 6:
- Look up closest embeddings (stories)
- With the list of top N matching stories, get embeddings for all top comments on those stories
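Sketched as code, this two-stage lookup might look like the following. Everything here is a hypothetical in-memory stand-in: `story_index` is a list of (id, embedding) pairs, `comments_for` fetches a story's top comments, and `embed` would call text-embedding-ada-002 in the real app:

```python
import math

def _cos(a, b):
    # cosine similarity between two equal-length dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def two_stage_retrieve(query_emb, story_index, comments_for, embed,
                       n_stories=3, k_comments=5):
    # Stage 1: rank stories by their pre-computed embeddings.
    top_stories = sorted(story_index, key=lambda s: _cos(query_emb, s[1]),
                         reverse=True)[:n_stories]
    # Stage 2: embed only the top stories' comments on demand
    # and rank those against the query.
    scored = []
    for story_id, _ in top_stories:
        for comment in comments_for(story_id):
            scored.append((story_id, comment, _cos(query_emb, embed(comment))))
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored[:k_comments]
```

The point of the split is cost: instead of embedding all 6.5 million comments up front, you only pay to embed the comments on the handful of stories that match the question.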