In recent years, the web has become very hostile to the lowly web scraper. It’s a result of the natural progression of web technologies away from statically rendered pages towards dynamic apps built with frameworks like React and CSS-in-JS. Developers no longer need to label their data with class names or IDs – doing so is only a courtesy to screen readers now.
There’s also been a concerted effort by large companies to protect their public data. Facebook, for example, employs a team of over 100 people to make sure it is as difficult as possible for any data to escape the black hole. Granted, some of these large companies do offer APIs for their data, but this access is rarely unrestricted. You’re usually at the whim of their app review process, or granted only a partial view of the data – data that would otherwise be public if you did a Google search and clicked through to their website manually.
How HTML looks nowadays
This can be frustrating if you’re like me – somebody who wanted to build a small, local, non-profit app that uses data hosted on a closed platform. The data is public but completely inaccessible to machines because of aggressive anti-web-scraping measures. That gave me two options – input the data manually or play the web-scraping game. Of course, I chose the latter.
After a couple of attempts at extracting the data using the usual CSS-selector method of web scraping – that is, fetching the raw HTML, or booting up a browser via something like Puppeteer and trying to pick the data out of the HTML structure – I was close to admitting defeat. Then I had an epiphany: the data is inside the web page.
I plucked out a unique string from the data visible on the web page and took a heap snapshot of the browser’s JavaScript runtime via Chrome’s Dev Tools. A heap snapshot is a raw dump of everything in the web app’s memory (or heap). Using the Dev Tools’ search function and my unique string, I managed to find a nice, well-structured object containing my string and, adjacent to it, all the data that my app needed. From that point I focused my energy on automating the process of finding and extracting the data from the heap snapshot. puppeteer-heap-snapshot is born.
Chrome Dev Tools’ Memory Profiler
puppeteer-heap-snapshot is a Node.js module that, given a Puppeteer browser page, can capture and parse a heap snapshot and deserialize objects that contain a given set of properties. It comes with a nifty CLI tool too, so we can quickly prototype scrapers from our terminal.
For example, let’s fetch the metadata from the above video:
$ puppeteer-heap-snapshot query \
    --url https://www.youtube.com/watch?v=L_o_O7v1ews \
    --properties channelId,viewCount,keywords \
    --no-headless
>> Opening Puppeteer page at: https://www.youtube.com/watch?v=L_o_O7v1ews
>> Taking heap snapshot..
[
  {
    "videoId": "L_o_O7v1ews",
    "title": "Zoolander - The Files are IN the Computer!",
    "lengthSeconds": "21",
    "keywords": [
      "Zoolander",
      "Movie Quotes",
      "2000s",
      "Humor",
      "Files",
      "IN the Computer",
      "Hansel"
    ],
    "channelId": "UCGQ6kU3NRI3WDxvTGIsaVjA",
    "isOwnerViewing": false,
    "shortDescription": "",
    "isCrawlable": true,
    "thumbnail": {
      "thumbnails": [
        {
          "url": "https://i.ytimg.com/vi/L_o_O7v1ews/hqdefault.jpg?sqp=-oaymwEbCKgBEF5IVfKriqkDDggBFQ