You run the API on your machine, send it a URL, and get back the website data as a file plus screenshots of the site. Simple as that.
This project was made to support Abbey, an AI platform. Its author is Gordon Kamer.
Some highlights:
- Scrolls through the page and takes screenshots of different sections
- Runs in a Docker container
- Browser-based (will run websites’ JavaScript)
- Gives you the HTTP status code and headers from the first request
- Automatically handles 302 redirects
- Handles download links properly
- Tasks are processed in a queue with configurable memory allocation
- Blocking API
- Zero state or other complexity
This web scraper is resource intensive but higher quality than many alternatives. Websites are scraped using Playwright, which launches a Firefox browser context for each job.
You should have Docker and docker compose installed.
- Clone this repo
- Run `docker compose up` (a `docker-compose.yml` file is provided for your use)
…and the service will be available at http://localhost:5006. See the Usage section below for details on how to interact with it.
You may set an API key using a `.env` file inside the `/scraper` folder (same level as `app.py`).
You can set as many API keys as you’d like; allowed API keys are those set in environment variables whose names start with `SCRAPER_API_KEY`. For example, here is a `.env` file that provides three available keys:
SCRAPER_API_KEY=should-be-secret
SCRAPER_API_KEY_OTHER=can-also-be-used
SCRAPER_API_KEY_3=works-too
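To make that convention concrete, here is a minimal sketch (not the project's actual code) of how a server could collect the allowed keys from environment variables whose names start with `SCRAPER_API_KEY`, assuming `python-dotenv` is used to load the `.env` file:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # read the .env file from the working directory

# Any variable whose *name* starts with SCRAPER_API_KEY contributes one allowed key.
ALLOWED_KEYS = {
    value
    for name, value in os.environ.items()
    if name.startswith("SCRAPER_API_KEY")
}
```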
API keys are sent to the service using the Authorization Bearer scheme.
The root path `/` returns status 200 if online, plus some Gilbert and Sullivan lyrics (you can visit it in your browser to check that the service is up).
The only other path is `/scrape`, to which you send a JSON-formatted POST request and (if all goes well) receive a `multipart/mixed` type response.
The response will be either:
- Status 200: a `multipart/mixed` response where the first part is type `application/json` with information about the request; the second part is the website data (usually `text/html`); and the remaining parts are up to 5 screenshots.
- Not status 200: an `application/json` response with information about the error.
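As an illustration of the request/response flow described above, here is a hedged client-side sketch in Python. It assumes the service is running locally on port 5006, reuses the first key from the `.env` example, and uses `requests` plus `requests-toolbelt` to parse the `multipart/mixed` body; the `url` field name in the JSON payload is an assumption, so check the project for the exact schema.

```python
import requests
from requests_toolbelt.multipart.decoder import MultipartDecoder

# Assumptions: service on localhost:5006, API key from the .env example,
# and a "url" field in the JSON body (field name assumed).
resp = requests.post(
    "http://localhost:5006/scrape",
    json={"url": "https://example.com"},
    headers={"Authorization": "Bearer should-be-secret"},
)

if resp.status_code == 200:
    # multipart/mixed: part 0 is JSON metadata, part 1 is the page data
    # (usually text/html), and any remaining parts are screenshots.
    parts = MultipartDecoder.from_response(resp).parts
    print("metadata:", parts[0].text)
    with open("page.html", "wb") as f:
        f.write(parts[1].content)
    for i, part in enumerate(parts[2:]):
        # Screenshot format depends on configuration; JPEG assumed here.
        with open(f"screenshot_{i}.jpg", "wb") as f:
            f.write(part.content)
else:
    print("error:", resp.status_code, resp.text)
```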
17 Comments
tantaman
us ai?
xnx
For anyone who might not be aware, Chrome also has the ability to save screenshots from the command line using:
chrome --headless --screenshot="path/to/save/screenshot.png" --disable-gpu --window-size=1280,720 "https://www.example.com"
aspeckt-112
I’m looking forward to giving this a go. Great idea!
manmal
Being a bit frustrated with Linkwarden’s resource usage, I’ve thought about making my own self-hosted bookmarking service. This could be a low-effort way of loading screenshots for these links, very cool! It’ll be interesting to see how many concurrent requests this can process.
synthomat
That's nice and everything but what to do about the EU cookie banners? Does hosting outside of the EU help?
quink
> SCREENSHOT_JPEG_QUALITY
Not two words that should be near each other, and JPEG is the only option.
Almost like it’s designed to nerd-snipe someone into a PR to change the format based on Accept headers.
mpetrovich
Reminds me of this open source library I wrote to do the same thing: https://github.com/nextbigsoundinc/imagely
It uses puppeteer and chrome headless behind the scenes.
joshstrange
This is cool but at this point MCP is the clear choice for exposing tools to LLMs, I'm sure someone will write a wrapper around this to provide the same functionality as an MCP-SSE server.
I want to try this out though and see how I like it compared to the MCP Puppeteer I'm using now (which does a great job of visiting pages, taking screenshots, interacting with the page, etc).
jot
If you’re worried about the security risks, edge cases, maintenance pain and scaling challenges of self hosting there are various solid hosted alternatives:
– https://browserless.io – low level browser control
– https://scrapingbee.com – scraping specialists
– https://urlbox.com – screenshot specialists*
They’re all profitable and have been around for years so you can depend on the businesses and the tech.
* Disclosure: I work on this one and was a customer before I joined the team.
morbusfonticuli
Similar project: gowitness [1].
A really cool tool I recently discovered. Besides scraping and taking screenshots of websites and saving them in multiple formats (including sqlite3), it can grab and save the headers, console logs & cookies, and it has a super cool web GUI to access all the data and compare, e.g., the different records.
I'm planning to build my personal archive.org/waybackmachine-like web-log tool via gowitness in the not-so-distant future.
[1] https://github.com/sensepost/gowitness
westurner
simonw/shot-scraper has a number of cli args, a GitHub actions repo template, and docs:
https://shot-scraper.datasette.io/en/stable/
From https://news.ycombinator.com/item?id=30681242 :
> Awesome Visual Regression Testing
> lists quite a few tools and online services: https://github.com/mojoaxel/awesome-regression-testing
> "visual-regression": https://github.com/topics/visual-regression
kevinsundar
I'm looking for something similar that can also extract the diff of content on the page over time, in addition to screenshots. Any suggestions?
I have a homegrown solution using an LLM and scrapegraphai for https://getchangelog.com but would rather offload that to a service that does a better job rendering websites. There are some websites that I get error pages from using Playwright, but they load fine in my usual Chrome browser.
mlunar
A similar one I wrote a while ago using Puppeteer for IoT low-power display purposes. A neat trick is that it learns the refresh interval, so that it takes a snapshot just before it's requested :) https://github.com/SmilyOrg/website-image-proxy
rpastuszak
Cool! I’m using something similar on my site to generate screenshots of tweets (for privacy purposes):
https://untested.sonnet.io/notes/xitterpng-privacy-friendly-…
ranger_danger
No license?
jchw
One thing to be cognizant of: if you're planning to run this sort of thing against potentially untrusted URLs, the browser might be able to make requests to internal hosts on whatever network it is on. On Linux, it would be wise to use network namespaces and block any local IP range inside the namespace, or to use a namespace to limit the browser to a WireGuard VPN tunnel to some other network.