Tools for taking automated screenshots of websites
For background on this project see shot-scraper: automated screenshots for documentation, built on Playwright.
To get started without installing any software, use the shot-scraper-template template to create your own GitHub repository which takes screenshots of a page using shot-scraper. See Instantly create a GitHub repository to take screenshots of a web page for details.
- The shot-scraper-demo repository uses this tool to capture recently spotted owls in El Granada, CA according to this page, and to generate an annotated screenshot illustrating a Datasette feature as described in my blog.
- Ben Welsh built @newshomepages, a Twitter bot that uses shot-scraper and GitHub Actions to take screenshots of news website homepages and publish them to Twitter. The code for that lives in palewire/news-homepages.
- scrape-hacker-news-by-domain uses shot-scraper javascript to scrape a web page. See Scraping web pages from the command-line with shot-scraper for details of how this works.
Install this tool using pip:
pip install shot-scraper
This tool depends on Playwright, which first needs to install its own dedicated Chromium browser.
Run shot-scraper install once to install that:
% shot-scraper install
Downloading Playwright build of chromium v965416 - 117.2 Mb [====================] 100% 0.0s
Playwright build of chromium v965416 downloaded to /Users/simon/Library/Caches/ms-playwright/chromium-965416
Downloading Playwright build of ffmpeg v1007 - 1.1 Mb [====================] 100% 0.0s
Playwright build of ffmpeg v1007 downloaded to /Users/simon/Library/Caches/ms-playwright/ffmpeg-1007
If you want to use other browsers such as Firefox you should install those too:
% shot-scraper install -b firefox
Full --help for the shot-scraper install command:
Usage: shot-scraper install [OPTIONS]
Install the Playwright browser needed by this tool.
Usage:
shot-scraper install
Or for browsers other than the Chromium default:
shot-scraper install -b firefox
Options:
-b, --browser [chromium|firefox|webkit|chrome|chrome-beta]
Which browser to install
-h, --help Show this message and exit.
To take a screenshot of a web page and write it to datasette-io.png run this:
shot-scraper https://datasette.io/
If a file called datasette-io.png already exists the filename datasette-io.1.png will be used instead.
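The naming pattern described above can be approximated in a few lines of Python. This is only a sketch inferred from the examples in this document (datasette-io.png, datasette-io.1.png, www-example-com.png); the function name default_filename is hypothetical, it is not shot-scraper's own code, and continuing the counter past .1 is an assumption.

```python
import re
from urllib.parse import urlparse


def default_filename(url, existing=(), ext="png"):
    """Approximate the default output name for a URL (illustrative only)."""
    # Add a scheme if it is missing so urlparse puts the host in netloc.
    if "://" not in url:
        url = "http://" + url
    parsed = urlparse(url)
    # Join host and path, then collapse each run of non-alphanumerics to "-".
    base = re.sub(r"[^a-zA-Z0-9]+", "-", parsed.netloc + parsed.path).strip("-")
    candidate = f"{base}.{ext}"
    counter = 1
    # Assumption: keep counting (.1, .2, ...) until an unused name is found.
    while candidate in existing:
        candidate = f"{base}.{counter}.{ext}"
        counter += 1
    return candidate
```

Here existing stands in for the set of filenames already present in the output directory.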
You can use the -o option to specify a filename:
shot-scraper https://datasette.io/ -o datasette.png
Use -o - to write the PNG image to standard output:
shot-scraper https://datasette.io/ -o - > datasette.png
If you omit the protocol, http:// will be added automatically, and any redirects will be followed:
shot-scraper datasette.io -o datasette.png
The browser window used to take the screenshots defaults to 1280px wide and 780px tall.
You can adjust these with the --width and --height options:
shot-scraper https://datasette.io/ -o small.png --width 400 --height 800
If you provide both options, the resulting screenshot will be of that size. If you omit --height a full page length screenshot will be produced (the default).
To take a screenshot of a specific element on the page, use --selector or -s with its CSS selector:
shot-scraper https://simonwillison.net/ -s '#bighead'
When using --selector the height and width, if provided, will set the size of the browser window when the page is loaded but the resulting screenshot will still be the same dimensions as the element on the page.
You can pass --selector multiple times. The resulting screenshot will cover the smallest area of the page that contains all of the elements you specified, for example:
shot-scraper https://simonwillison.net/ \
-s '#bighead' -s .overband \
-o bighead-multi-selector.png
To capture a rectangle around every element that matches a CSS selector, use --selector-all:
shot-scraper https://simonwillison.net/ \
--selector-all '.day' \
-o just-the-day-boxes.png
You can add --padding 20 to add 20px of padding around the elements when the shot is taken.
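The "smallest area that contains all of the elements" is the union of the elements' bounding rectangles, optionally grown by the padding. The sketch below illustrates that geometry only; the function name bounding_box and the (x, y, width, height) tuple layout are assumptions for illustration, and how shot-scraper treats padding that extends past the page edge is not covered here.

```python
def bounding_box(rects, padding=0):
    """Return the smallest (x, y, width, height) box containing every rect.

    rects is an iterable of (x, y, width, height) tuples in page pixels.
    """
    rects = list(rects)
    if not rects:
        raise ValueError("at least one rectangle is required")
    # Outermost edges across all rectangles.
    left = min(x for x, y, w, h in rects)
    top = min(y for x, y, w, h in rects)
    right = max(x + w for x, y, w, h in rects)
    bottom = max(y + h for x, y, w, h in rects)
    # Padding is added on every side, so it counts twice per dimension.
    return (
        left - padding,
        top - padding,
        (right - left) + 2 * padding,
        (bottom - top) + 2 * padding,
    )
```

For two elements at (50, 60, 100, 50) and (200, 80, 50, 100), the box is (50, 60, 200, 120); with a padding of 20 it becomes (30, 40, 240, 160).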
The --js-selector and --js-selector-all options can be used to select elements with JavaScript expressions when CSS selectors alone cannot target them.
Each option should be passed a JavaScript expression that operates on the el variable, returning true if that element should be included in the screenshot selection.
To take a screenshot of the first paragraph on the page that contains the text "shot-scraper" you could run the following:
shot-scraper https://github.com/simonw/shot-scraper \
--js-selector 'el.tagName == "P" && el.innerText.includes("shot-scraper")'
The el.tagName == "P" part is needed here because otherwise the <html> element on the page will be the first to match the expression.
The generated JavaScript that will be executed on the page looks like this:
Array.from(document.getElementsByTagName('*')).find(
el => el.tagName == "P" && el.innerText.includes("shot-scraper")
).classList.add("js-selector-a1f5ba0fc4a4317e58a3bd11a0f16b96");
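One way to picture how such a snippet could be produced is the Python sketch below: it wraps the user's expression in the find() call shown above and tags the match with a unique class. The helper name build_js_selector is hypothetical, and both the use of a random 32-hex-digit suffix and the idea that the tagged class is then targeted like an ordinary CSS selector are assumptions based on the generated code shown here, not a description of the tool's internals.

```python
import secrets


def build_js_selector(expression):
    """Wrap a JS (el) expression so the first matching element gets a unique class.

    Returns the JavaScript to run and the CSS selector for the tagged element.
    """
    # Assumption: a random suffix keeps the class from clashing with page classes.
    class_name = "js-selector-" + secrets.token_hex(16)
    script = (
        "Array.from(document.getElementsByTagName('*')).find(\n"
        f"  el => {expression}\n"
        f').classList.add("{class_name}");'
    )
    return script, "." + class_name
```

Calling build_js_selector('el.tagName == "P" && el.innerText.includes("shot-scraper")') yields a script of the same shape as the one shown above, plus a selector such as .js-selector-<32 hex digits>.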
The --js-selector-all option will select all matching elements, in a similar fashion to the --selector-all option described above.
Sometimes a page will not have completely loaded before a screenshot is taken. You can use --wait X to wait the specified number of milliseconds after the page load event has fired before taking the screenshot:
shot-scraper https://simonwillison.net/ --wait 2000
You can use the --javascript option to run custom JavaScript that modifies the page after it has loaded (after the 'onload' event has fired) but before the screenshot is taken:
shot-scraper https://simonwillison.net/ \
-o simonwillison-pink.png \
--javascript "document.body.style.backgroundColor = 'pink';"
Screenshots default to PNG. You can save as a JPEG by specifying a -o filename that ends with .jpg.
You can also use --quality X to save as a JPEG with the specified quality, in order to reduce the filesize. 80 is a good value to use here:
shot-scraper https://simonwillison.net/ \
-h 800 -o simonwillison.jpg --quality 80
% ls -lah simonwillison.jpg
-rw-r--r--@ 1 simon staff 168K Mar 9 13:53 simonwillison.jpg
The --retina option sets a device scale factor of 2. This means that an image will have its resolution effectively doubled, emulating the display of images on retina or higher pixel density screens.
shot-scraper https://simonwillison.net/ -o simon.png \
--width 400 --height 600 --retina
This example will produce an image that is 800px wide and 1200px high.
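The arithmetic behind that result is just the window size multiplied by the device scale factor. A tiny sketch, with output_size as a hypothetical helper name:

```python
def output_size(width, height, retina=False):
    """Pixel dimensions of the saved image for a given browser window size."""
    # --retina corresponds to a device scale factor of 2; otherwise it is 1.
    scale = 2 if retina else 1
    return width * scale, height * scale
```

output_size(400, 600, retina=True) gives (800, 1200), matching the example above.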
Sometimes it's useful to be able to manually interact with a page before the screenshot is captured.
Add the --interactive option to open a browser window that you can interact with. Then hit <enter> in the terminal when you are ready to take the shot and close the window.
shot-scraper https://simonwillison.net/ -o after-interaction.png \
--height 800 --interactive
This will output:
Hit <enter> to take the shot and close the browser window:
# And after you hit <enter>...
Screenshot of 'https://simonwillison.net/' written to 'after-interaction.png'
You can pass the path to an HTML file on disk to take a screenshot of that rendered file:
shot-scraper index.html -o index.png
CSS and images referenced from that file using relative paths will also be included.
Full --help for this command:
Usage: shot-scraper shot [OPTIONS] URL
Take a single screenshot of a page or portion of a page.
Usage:
shot-scraper www.example.com
This will write the screenshot to www-example-com.png
Use "-o" to write to a specific file:
shot-scraper https://www.example.com/ -o example.png
You can also pass a path to a local file on disk:
shot-scraper index.html -o index.png
Using "-o -" will output to standard out:
shot-scraper https://www.example.com/ -o - > example.png
Use -s to take a screenshot of one area of the page, identified using one or
more CSS selectors:
shot-scraper https://simonwillison.net -s '#bighead'
Options:
-a, --auth FILENAME Path to JSON authentication context file
-w, --width INTEGER Width of browser window, defaults to 1280
-h, --height INTEGER Height of browser window and shot - defaults
to the full height of the page
-o, --output FILE
-s, --selector TEXT Take shot of first element matching this CSS
selector
--selector-all TEXT Take shot of all elements matching this CSS
selector
--js-selector TEXT Take shot of first element matching this JS
(el) expression
--js-selector-all TEXT Take shot of all elements matching this JS
(el) expression
-p, --padding INTEGER When using selectors, add this much padding in
pixels
-j, --javascript TEXT Execute this JS prior to taking the shot
--retina Use device scale factor of 2
--quality INTEGER Save as JPEG with this quality, e.g. 80
--wait INTEGER Wait this many milliseconds before taking the
screenshot
--timeout INTEGER Wait this many milliseconds before failing
-i, --interactive Interact with the page in a browser before
taking the shot
--devtools Interact mode with developer tools
-b, --browser [chromium|firefox|webkit|chrome|chrome-beta]
Which browser to use
--user-agent TEXT User-Agent header to use
--reduced-motion Emulate 'prefers-reduced-motion' media feature
--help Show this message and exit.
If you want to take screenshots of a site that has some form of authentication, you will first need to authenticate with that website manually.
You can do that using the shot-scraper auth command:
shot-scraper auth https://datasette-auth-passwords-demo.datasette.io/-/login auth.json
(For this demo, use username = root and password = password!)
This will open a browser window on your computer showing the page you specified.
You can then sign in using that browser window - including 2FA or CAPTCHAs or other more complex forms of authentication.
When you are finished, hit <enter> at the shot-scraper command-line prompt. The browser will close and the authentication credentials (usually cookies) for that browser session will be written out to the auth.json file.
To take authenticated screenshots you can then use the -a or --auth options to point to the JSON file that you created:
shot-scraper https://datasette-auth-passwords-demo.datasette.io/ \
-a auth.json -o authed.png
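If you want to check what was captured before relying on it, a short script can list the cookies in the file. This sketch assumes auth.json follows Playwright's storage state layout, with a top-level "cookies" list whose entries carry "name" and "domain" keys; that layout is an assumption here, and summarize_auth is a hypothetical helper name.

```python
import json


def summarize_auth(path):
    """List (domain, cookie name) pairs from an authentication context file."""
    with open(path) as f:
        state = json.load(f)
    # Assumption: cookies live under a top-level "cookies" key.
    # Cookie values are deliberately left out of the summary.
    return [
        (cookie.get("domain"), cookie.get("name"))
        for cookie in state.get("cookies", [])
    ]
```

Because the file holds live session credentials, it is worth treating it like a password and keeping it out of public repositories.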
Full --help for shot-scraper auth:
Usage: shot-scraper auth [OPTIONS] URL CONTEXT_FILE
Open a browser so user can manually authenticate with the specified site, then
save the resulting authentication context to a file.
Usage:
shot-scraper auth https://github.com/ auth.json
Options:
-b, --browser [chromium|firefox|webkit|chrome|chrome-beta]
Which browser to use
--user-agent TEXT User-Agent header to use
-h, --help Show this message and exit.
You can configure multiple screenshots using a YAML file. Create a file called shots.yml that looks like this:
- output: example.com.png
  url: http://www.example.com/
- output: w3c.org.png
  url: https://www.w3.org/
Then run the tool like so:
shot-scraper multi shots.yml
This will create two image files, example.com.png and w3c.org.png, containing screenshots of those two URLs.
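When the list of pages is long or comes from another system, the YAML file can be generated rather than written by hand. The sketch below builds the same output/url structure from a list of pairs using plain string formatting; shots_yaml is a hypothetical helper, and it assumes the values are simple enough not to need YAML quoting (a value starting with "#", for instance, would have to be quoted).

```python
def shots_yaml(shots):
    """Build shots.yml content from (output, url) pairs."""
    lines = []
    for output, url in shots:
        # Each screenshot is one list item with two keys.
        lines.append(f"- output: {output}")
        lines.append(f"  url: {url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    content = shots_yaml([
        ("example.com.png", "http://www.example.com/"),
        ("w3c.org.png", "https://www.w3.org/"),
    ])
    print(content, end="")
```

Write the returned text to shots.yml and pass that file to shot-scraper multi as usual.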
You can set url: to a path to a file on disk as well:
- output: index.png
  url: index.html
Use --retina to take all screenshots at retina resolution instead, doubling the dimensions of the files:
shot-scraper multi shots.yml --retina
Use --fail-on-error to fail noisily on error (may be helpful in CI):
shot-scraper multi shots.yml --fail-on-error
To take a screenshot of just the area of a page defined by a CSS selector, add selector to the YAML block:
- output: bighead.png
  url: https://simonwillison.net/
  selector: "#bighead"
You can pass more than one selector using a selectors: list. You can also use padding: to specify additional padding:
- output: bighead-multi-selector.png
  url: https://simonwillison.net/
  selectors:
  - "#bighead"
  - .overband
  padding: 20
You can use selector_all: to capture every element matching a selector, or selectors_all: to pass a list of such selectors:
- output: selectors-all.png
  url: https://simonwillison.net/
  selectors_all:
  - .day
  - .entry:nth-of-type(1)
  padding: 20
The --js-selector and --js-selector-all options can be provided using the js_selector:, js_selectors:, js_selector_all: and js_selectors_all: keys:
- output: js-selector-all.png
  url: https://github.com/simonw/shot-scraper
  js_selector: |-
    el.tagName == "P" && el.innerText.includes("shot-scraper")
  padding: 20
To execute JavaScript after the page has loaded but before the screenshot is taken, add a javascript key:
- output: bighead-pink.png
  url: https://simonwillison.net/
  selector: "#bighead"
  javascript: |
    document.body.style.backgroundColor = 'pink'
You can include desired height, width, quality and wait options on each item as well:
- output: simon-narrow.jpg
  url: https://simonwillison.net/
  width: 400
  height: 800
  quality: 80
  wait: 500
Full --help for this command:
Usage: shot-scraper multi [OPTIONS] CONFIG
Take multiple screenshots, defined by a YAML file
Usage:
shot-scraper multi config.yml
Where config.yml contains configuration like this:
- output: example.png
url: http://www.example.com/
https://github.com/simonw/shot-scraper/blob/main/README.md#multi
Options:
-a, --auth FILENAME Path to JSON authentication context file
--retina Use device scale factor of 2
--timeout INTEGER Wait this many milliseconds before failing
--fail-on-error Fail noisily on error
-b, --browser [chromium|firefox|webkit|chrome|chrome-beta]
Which browser to use
--user-agent TEXT User-Agent header to use
--reduced-motion Emulate 'prefers-reduced-motion' media feature
-h, --help Show this message and exit.
The shot-scraper pdf command saves a PDF version of a web page - the equivalent of using Print -> Save to PDF in Chromium.
shot-scraper pdf https://datasette.io/
This will save to datasette-io.pdf. You can use -o to specify a filename:
shot-scraper pdf https://datasette.io/tutorials/learn-sql \
-o learn-sql.pdf
Full --help for this command:
Usage: shot-scraper pdf [OPTIONS] URL
Create a PDF of the specified page
Usage:
shot-scraper pdf https://datasette.io/
Use -o to specify a filename:
shot-scraper pdf https://datasette.io/ -o datasette.pdf
Options:
-a, --auth FILENAME Path to JSON authentication context file
-o, --output FILE
-j, --javascript TEXT Execute this JS prior to creating the PDF
--wait INTEGER Wait this many milliseconds before taking the
screenshot
--media-screen Use screen rather than print styles
--landscape Use landscape orientation
-h, --help Show this message and exit.
The shot-scraper javascript command can be used to execute JavaScript directly against a page and return the result as JSON.
This command doesn't produce a screenshot, but has interesting applications for scraping.
To retrieve a string title of a document:
shot-scraper javascript https://datasette.io/ "document.title"
This returns a JSON string:
"Datasette: An open source multi-tool for exploring and publishing data"
To return a JSON object, wrap an object literal in parentheses:
shot-scraper javascript https://datasette.io/ "({
title: document.title,
tagline: document.querySelector('.tagline').innerText
})"
This returns:
{
"title": "Datasette: An open source multi-tool for exploring and publishing data",
"tagline": "An open source multi-tool for exploring and publishing data"
}
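Because the command prints JSON, it is straightforward to call from a script and work with the parsed result. The following is a minimal sketch that assumes shot-scraper is installed and on your PATH; run_javascript is a hypothetical wrapper name, not part of the tool.

```python
import json
import subprocess


def run_javascript(url, expression):
    """Run `shot-scraper javascript URL EXPRESSION` and return the parsed JSON."""
    result = subprocess.run(
        ["shot-scraper", "javascript", url, expression],
        capture_output=True,
        text=True,
        # Raise CalledProcessError if the command exits with a non-zero code.
        check=True,
    )
    return json.loads(result.stdout)
```

For example, run_javascript("https://datasette.io/", "({title: document.title})") would return a Python dictionary with a "title" key.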
You can pass an async function if you want to use await, including to import modules from external URLs. This example loads the Readability.js library from Skypack and uses it to extract the core content of a page:
shot-scraper javascript https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/ "
async () => {
const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
return (new readability.Readability(document)).parse();
}"
To use functions such as setInterval(), for example if you need to delay the shot for a second to allow an animation to finish running, return a promise:
shot-scraper javascript datasette.io "
new Promise(done => setInterval(
() => {
done({
title: document.title,
tagline: document.querySelector('.tagline').innerText
});
}, 1000
));"
You can also save JavaScript to a file and execute it like this:
shot-scraper javascript datasette.io -i script.js
Or read it from standard input like this:
echo "document.title" | shot-scraper javascript datasette.io
If a JavaScript error occurs, a stack trace will be written to standard error and the tool will terminate with an exit code of 1.
This can be used to run JavaScript tests in continuous integration environments, by taking advantage of the throw "error message" JavaScript statement.
This example uses GitHub Actions:
- name: Test page title
  run: |-
    shot-scraper javascript datasette.io "
      if (document.title != 'Datasette') {
        throw 'Wrong title detected';
      }"
Full --help for this command:
Usage: shot-scraper javascript [OPTIONS] URL [JAVASCRIPT]
Execute JavaScript against the page and return the result as JSON
Usage:
shot-scraper javascript https://datasette.io/ "document.title"
To return a JSON object, use this:
"({title: document.title, location: document.location})"
To use setInterval() or similar, pass a promise:
"new Promise(done => setInterval(
() => {
done({
title: document.title,
h2: document.querySelector('h2').innerHTML
});
}, 1000
));"
If a JavaScript error occurs an exit code of 1 will be returned.
Options:
-i, --input FILENAME Read input JavaScript from this file
-a, --auth FILENAME Path to JSON authentication context file
-o, --output FILENAME Save output JSON to this file
-b, --browser [chromium|firefox|webkit|chrome|chrome-beta]
Which browser to use
--user-agent TEXT User-Agent header to use
--reduced-motion Emulate 'prefers-reduced-motion' media feature
-h, --help Show this message and exit.
The shot-scraper accessibility command dumps out the Chromium accessibility tree for the provided URL, as JSON:
shot-scraper accessibility https://datasette.io/
Use -o filename.json to write the output to a file instead of displaying it.
Add --javascript SCRIPT to execute custom JavaScript before taking the snapshot.
Full --help for this command:
Usage: shot-scraper accessibility [OPTIONS] URL
Dump the Chromium accessibility tree for the specifed page
Usage:
shot-scraper accessibility https://datasette.io/
Options:
-a, --auth FILENAME Path to JSON authentication context file
-o, --output FILENAME
-j, --javascript TEXT Execute this JS prior to taking the snapshot
--timeout INTEGER Wait this many milliseconds before failing
-h, --help Show this message and exit.
If you are using the --javascript option to execute code, that code will be executed after the page load event has fired but before the screenshot is taken.
You can use that code to do things like hide or remove specific page elements, click on links to open menus, or even add annotations to the page such as this pink arrow example.
This code hides any element with a [data-ad-rendered] attribute and the element with id="ensNotifyBanner":
document.querySelectorAll(
'[data-ad-rendered],#ensNotifyBanner'
).forEach(el => el.style.display = 'none')
You can execute that like so:
shot-scraper https://www.latimes.com/ -o latimes.png --javascript "
document.querySelectorAll(
'[data-ad-rendered],#ensNotifyBanner'
).forEach(el => el.style.display = 'none')
"
In some cases you may need to add a pause that executes during your custom JavaScript before the screenshot is taken - for example if you click on a button that triggers a short fading animation.
You can do that using the following pattern:
new Promise(takeShot => {
// Your code goes here
// ...
setTimeout(() => {
// Resolving the promise takes the shot
takeShot();
}, 1000);
});
If your custom code defines a Promise, shot-scraper will wait for that promise to complete before taking the screenshot. Here the screenshot does not occur until the takeShot() function is called.
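If you build shot-scraper commands from a script, the promise pattern can be generated around any snippet. This is a small sketch; with_delay is a hypothetical helper, and it simply reproduces the pattern shown above with your code and a delay filled in.

```python
def with_delay(js_code, delay_ms=1000):
    """Wrap JavaScript in a promise that resolves after delay_ms milliseconds."""
    return (
        "new Promise(takeShot => {\n"
        # The caller's code runs first, inside the promise body.
        f"  {js_code}\n"
        # Resolving the promise is what allows the shot to be taken.
        f"  setTimeout(() => takeShot(), {int(delay_ms)});\n"
        "});"
    )
```

The returned string can be passed as the value of the --javascript option, for example with_delay("document.querySelector('button').click();", 500).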
To contribute to this tool, first check out the code. Then create a new virtual environment:
cd shot-scraper
python -m venv venv
source venv/bin/activate
Or if you are using pipenv:
pipenv shell
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
pytest
Some of the tests exercise the CLI utility directly. Run those like so:
tests/run_examples.sh