You run the API on your machine, send it a URL, and get back the website data as a file plus screenshots of the site. Simple as that.
This project was made to support Abbey, an AI platform. Its author is Gordon Kamer.
Some highlights:
- Scrolls through the page and takes screenshots of different sections
- Runs in a Docker container
- Browser-based (will run websites’ JavaScript)
- Gives you the HTTP status code and headers from the first request
- Automatically handles 302 redirects
- Handles download links properly
- Tasks are processed in a queue with configurable memory allocation
- Blocking API
- Zero state or other complexity
This web scraper is resource intensive but higher quality than many alternatives. Websites are scraped using Playwright, which launches a Firefox browser context for each job.
You should have Docker and docker compose installed.
- Clone this repo
- Run `docker compose up` (a `docker-compose.yml` file is provided for your use)
…and the service will be available at http://localhost:5006. See the Usage section below for details on how to interact with it.
You may set an API key using a `.env` file inside the `/scraper` folder (same level as `app.py`).
You can set as many API keys as you’d like; allowed API keys are those set in environment variables whose names start with `SCRAPER_API_KEY`. For example, here is a `.env` file that provides three available keys:
SCRAPER_API_KEY=should-be-secret
SCRAPER_API_KEY_OTHER=can-also-be-used
SCRAPER_API_KEY_3=works-too
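To make that convention concrete, here is a minimal sketch (not the project's actual code) of how a server could collect the allowed keys from environment variables whose names start with `SCRAPER_API_KEY`, assuming `python-dotenv` is used to load the `.env` file:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # read the .env file from the working directory

# Any variable whose *name* starts with SCRAPER_API_KEY contributes one allowed key.
ALLOWED_KEYS = {
    value
    for name, value in os.environ.items()
    if name.startswith("SCRAPER_API_KEY")
}
```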
API keys are sent to the service using the Authorization Bearer scheme.
The root path `/` returns status 200 if online, plus some Gilbert and Sullivan lyrics (you can visit it in your browser to check that the service is up).
The only other path is `/scrape`, to which you send a JSON-formatted POST request and (if all goes well) receive a `multipart/mixed` type response.
The response will be either:
- Status 200: a `multipart/mixed` response where the first part is type `application/json` with information about the request; the second part is the website data (usually `text/html`); and the remaining parts are up to 5 screenshots.
- Not status 200: an `application/json` response with information about the error.
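As an illustration of the request/response flow described above, here is a hedged client-side sketch in Python. It assumes the service is running locally on port 5006, reuses the first key from the `.env` example, and uses `requests` plus `requests-toolbelt` to parse the `multipart/mixed` body; the `url` field name in the JSON payload is an assumption, so check the project for the exact schema.

```python
import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder

# Assumptions: service on localhost:5006, API key from the .env example,
# and a "url" field in the JSON body (field name assumed).
resp = requests.post(
    "http://localhost:5006/scrape",
    json={"url": "https://example.com"},
    headers={"Authorization": "Bearer should-be-secret"},
)

if resp.status_code == 200:
    # multipart/mixed: part 0 is JSON metadata, part 1 is the page data
    # (usually text/html), and any remaining parts are screenshots.
    parts = MultipartDecoder.from_response(resp).parts
    print("metadata:", parts[0].text)
    with open("page.html", "wb") as f:
        f.write(parts[1].content)
    for i, part in enumerate(parts[2:]):
        # Screenshot format depends on configuration; JPEG assumed here.
        with open(f"screenshot_{i}.jpg", "wb") as f:
            f.write(part.content)
else:
    print("error:", resp.status_code, resp.text)
```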
17 Comments
tantaman
us ai?
xnx
For anyone who might not be aware, Chrome also has the ability to save screenshots from the command line using:
chrome --headless --screenshot="path/to/save/screenshot.png" --disable-gpu --window-size=1280,720 "https://www.example.com"
aspeckt-112
I’m looking forward to giving this a go. Great idea!
manmal
Being a bit frustrated with Linkwarden’s resource usage, I’ve thought about making my own self-hosted bookmarking service. This could be a low-effort way of loading screenshots for these links, very cool! It’ll be interesting to see how many concurrent requests this can process.
synthomat
That's nice and everything but what to do about the EU cookie banners? Does hosting outside of the EU help?
quink
> SCREENSHOT_JPEG_QUALITY
Not two words that should be near each other, and JPEG is the only option.
Almost like it’s designed to nerd-snipe someone into a PR to change the format based on Accept headers.
mpetrovich
Reminds me of this open source library I wrote to do the same thing: https://github.com/nextbigsoundinc/imagely
It uses puppeteer and chrome headless behind the scenes.
joshstrange
This is cool but at this point MCP is the clear choice for exposing tools to LLMs, I'm sure someone will write a wrapper around this to provide the same functionality as an MCP-SSE server.
I want to try this out though and see how I like it compared to the MCP Puppeteer I'm using now (which does a great job of visiting pages, taking screenshots, interacting with the page, etc).
jot
If you’re worried about the security risks, edge cases, maintenance pain and scaling challenges of self hosting there are various solid hosted alternatives:
– https://browserless.io – low level browser control
– https://scrapingbee.com – scraping specialists
– https://urlbox.com – screenshot specialists*
They’re all profitable and have been around for years so you can depend on the businesses and the tech.
* Disclosure: I work on this one and was a customer before I joined the team.
morbusfonticuli
Similar project: gowitness [1].
A really cool tool I recently discovered. Besides scraping and taking screenshots of websites and saving them in multiple formats (including sqlite3), it can grab and save the headers, console logs & cookies, and it has a super cool web GUI to access all the data and compare, e.g., the different records.
I'm planning to build my personal archive.org/waybackmachine-like web-log tool via gowitness in the not-so-distant future.
[1] https://github.com/sensepost/gowitness
westurner
simonw/shot-scraper has a number of cli args, a GitHub actions repo template, and docs:
https://shot-scraper.datasette.io/en/stable/
From https://news.ycombinator.com/item?id=30681242 :
> Awesome Visual Regression Testing
> lists quite a few tools and online services: https://github.com/mojoaxel/awesome-regression-testing
> "visual-regression": https://github.com/topics/visual-regression
kevinsundar
I'm looking for something similar that can also extract the diff of content on the page over time, in addition to screenshots. Any suggestions?
I have a homegrown solution using an LLM and scrapegraphai for https://getchangelog.com but would rather offload that to a service that does a better job rendering websites. There are some websites that I get error pages from using Playwright, but they load fine in my usual Chrome browser.
mlunar
A similar one I wrote a while ago using Puppeteer for IoT low-power display purposes. A neat trick is that it learns the refresh interval, so that it takes a snapshot just before it's requested :) https://github.com/SmilyOrg/website-image-proxy
rpastuszak
Cool! I’m using something similar on my site to generate screenshots of tweets (for privacy purposes):
https://untested.sonnet.io/notes/xitterpng-privacy-friendly-…
ranger_danger
No license?
jchw
One thing to be cognizant of: if you're planning to run this sort of thing against potentially untrusted URLs, the browser might be able to make requests to internal hosts on whatever network it is on. On Linux, it would be wise to use network namespaces and block any local IP range inside the namespace, or to use a namespace to limit the browser to a WireGuard VPN tunnel to some other network.