In recent years, the web has become very hostile to the lowly web scraper. It’s a result of the natural progression of web technologies away from statically rendered pages towards dynamic apps built with frameworks like React and CSS-in-JS. Developers no longer need to label their data with class names or IDs – doing so is only a courtesy to screen readers now.
There’s also been a concerted effort by large companies to protect their public data. Facebook, for example, employs a team of over 100 people to make sure it is as difficult as possible for any data to escape the black hole. Granted, some of these large companies do offer APIs for their data, but this access is rarely unrestricted. You’re usually at the whim of their app review process, or granted only a partial view of the data – data that would otherwise be public if you did a Google search and clicked through to their website manually.
How HTML looks nowadays
This can be frustrating if you’re like me – somebody who wanted to build a small, local, non-profit app that uses data hosted on a closed platform. The data is public but completely inaccessible to machines because of aggressive anti-web-scraping measures. That gave me two options – input the data manually or play the web-scraping game. Of course, I chose the latter.
After a couple of attempts at extracting the data using the usual CSS-selector method of web scraping – that is, fetching the raw HTML, or booting up a browser via something like Puppeteer and trying to pick the data out of the HTML structure – I was close to admitting defeat. Then I had an epiphany: the data is inside the web page.
I plucked out a unique string from the data visible on the web page and took a heap snapshot of the browser’s JavaScript runtime via Chrome’s Dev Tools. A heap snapshot is a raw dump of everything in the web app’s memory (or heap). Using the Dev Tools’ search function and my unique string, I managed to find a nice, well-structured object containing my string and, adjacent to it, all the data that my app needed. From that point I focused my energy on automating the process of finding and extracting the data from the heap snapshot. puppeteer-heap-snapshot is born.
Chrome Dev Tools’ Memory Profiler
puppeteer-heap-snapshot is a Node.js module that, given a Puppeteer browser page, can capture and parse a heap snapshot and deserialize objects that contain a given set of properties. It comes with a nifty CLI tool too, so we can quickly prototype scrapers from our terminal.
For example, let’s fetch the metadata from the above video:
$ puppeteer-heap-snapshot query \
    --url https://www.youtube.com/watch?v=L_o_O7v1ews \
    --properties channelId,viewCount,keywords \
    --no-headless
>> Opening Puppeteer page at: https://www.youtube.com/watch?v=L_o_O7v1ews
>> Taking heap snapshot..
[
  {
    "videoId": "L_o_O7v1ews",
    "title": "Zoolander - The Files are IN the Computer!",
    "lengthSeconds": "21",
    "keywords": [
      "Zoolander",
      "Movie Quotes",
      "2000s",
      "Humor",
      "Files",
      "IN the Computer",
      "Hansel"
    ],
    "channelId": "UCGQ6kU3NRI3WDxvTGIsaVjA",
    "isOwnerViewing": false,
    "shortDescription": "",
    "isCrawlable": true,
    "thumbnail": {
      "thumbnails": [
        {
          "url": "https://i.ytimg.com/vi/L_o_O7v1ews/hqdefault.jpg?sqp=-oaymwEbCKgBEF5IVfKriqkDDggBFQ