Reader view is Firefox’s answer to bloated websites and to pages that bury their content under unreadable clutter, but how does it function underneath?
The crux is a big bag of heuristics, plus knowledge of how semantic HTML elements should be used and structured, applied to strip the page of content that offers no immediate relevance to the reader. All this while also making the text bigger and more readable, and the page more distraction-free.
If you haven’t tried it out (and are still sticking with Firefox), please do! If you have, then welcome to an episode of how does it work (with more code than usual).
The reader view code is available as a library on Mozilla’s GitHub and is pure JavaScript. The code is split into modules for checking the readability of a page (Readability-readerable.js) and the actual conversion of a document to reader view (Readability.js).
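A minimal sketch of how the two modules might be combined, check first and convert second. The `Readability` constructor, its `parse()` method, and `isProbablyReaderable` are the entry points the library documents; the wrapper `toReaderView` and the idea of passing the library in as a `lib` argument are my own assumptions, made so the snippet runs without the package installed.

```javascript
// Hypothetical wrapper: `toReaderView` is not part of the library.
// `lib` is assumed to expose `Readability` and `isProbablyReaderable`,
// e.g. the object returned by require("@mozilla/readability").
function toReaderView(doc, lib) {
  // Cheap pre-check (the Readability-readerable.js half).
  if (!lib.isProbablyReaderable(doc)) {
    return null;
  }
  // parse() modifies the document it is given, so hand it a deep copy.
  const copy = doc.cloneNode(true);
  // The Readability.js half; the result may itself be null
  // if no article content could be extracted.
  return new lib.Readability(copy).parse();
}
```

In Node this would typically be called with a jsdom document and with `require("@mozilla/readability")` as `lib`; in a browser, `document` itself can be passed in.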
The conversion function takes in the document and returns the article title, an HTML string of the processed article content, the length of the article in characters, an article description (or an excerpt automatically extracted from the content), and author metadata (byline). The whole process is heuristic in approach, as can be seen in, for example, the hardcoding of site names and attributes, the calculation of title similarity using the length difference, and the node removal thresholds, but in a crude sense the codebase feels pragmatic. Almost industrial. Thankfully nobody has decided to call the if statements with arbitrary numeric thresholds machine learning yet.
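To give a feel for what "similarity using the length difference" plus a numeric threshold can look like, here is a rough sketch. It is not taken from Readability.js: the function names and the 0.75 cutoff are made up for illustration, and the library's own formula and constants may differ.

```javascript
// Hypothetical helper, not the library's code.
// Scores two strings purely by how close their lengths are:
// 1 means equal length, values near 0 mean very different lengths.
function lengthSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings count as identical
  return 1 - Math.abs(a.length - b.length) / longest;
}

// Arbitrary illustrative cutoff (an assumption, not a value from the library).
const SIMILARITY_THRESHOLD = 0.75;

// Hypothetical decision function: "close enough to be the same title?"
function looksLikeSameTitle(a, b) {
  return lengthSimilarity(a, b) >= SIMILARITY_THRESHOLD;
}
```

Note that a score like this never looks at the characters themselves, so two unrelated strings of equal length come out as identical. That is the kind of crudeness a pragmatic heuristic accepts in exchange for being cheap and predictable.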
Retrieving the title from the article is simple enough; just get the page title or alternatively whatever metadata we’re trying to push using Dublin Core (dc:title), T