Selecting elements for screenshotting based on tag content #43

psychemedia · 2022-03-14T12:17:55Z

Selecting elements as the target for screenshots based on CSS selectors does not (currently) allow for the selection of elements based on tag content, or relative to DOM elements selected based on tag content.

However, elements can be selected based on tag content using Javascript. It would be useful to allow for the selection of elements via Javascript as well as CSS.

Alternatively, support a method that can be called from a javascript scraper call that will apply a screen shot to a Javascript selected element.

At the moment, selector based screenshots seem to be focused in _selector_javascript(selectors) by:

shot-scraper/shot_scraper/cli.py

Line 520 in e40070d

let els = %s.map(s => document.querySelector(s));

As well as passing --selector(s), s, one approach might be to pass element(s) el returned from a --js-selector script?

The text was updated successfully, but these errors were encountered:

simonw · 2022-03-20T23:09:55Z

This is an interesting idea. Right now there is a workaround for this, but it's pretty messy. You can use JavaScript to assign an additional class to an element, then take a screenshot of that:

shot-scraper https://simonwillison.net -s '.screenshot-me' --javascript '
var el = Array.from(
  document.querySelectorAll("p")
).filter(el => el.innerText.includes("search"))[0];
el.classList.add("screenshot-me");
'

Running that now gets me this:

simonw · 2022-03-20T23:14:56Z

I'm trying to think of a syntactically elegant way to support this.

One way would be to support a fragment of JavaScript which is executed against every element on the page - so you would do this:

shot-scraper ... --js-selector 'el.innerText.includes("search")'

And it would effectively run the equivalent of my code above, except filtering against the list of elements returned by document.querySelectorAll('*')

That could actually work pretty well.

simonw · 2022-03-20T23:15:48Z

Would --js-selector X return just the first element matching that filter, or would it select ALL elements matching that selector? Maybe a --js-selector-all option could select all of them?

simonw · 2022-03-20T23:17:03Z

Alternative names for this option:

--js-select X and --js-select-all X
--js-filter X and --js-filter-all X
--filter X and --filter-all X

I'm not completely convinced by any of those yet.

simonw · 2022-03-20T23:17:40Z

As a reminder, the current options for the shot-selector shot command are:

Options:
  -a, --auth FILENAME    Path to JSON authentication context file
  -w, --width INTEGER    Width of browser window, defaults to 1280
  -h, --height INTEGER   Height of browser window and shot - defaults to the
                         full height of the page
  -o, --output FILE
  -s, --selector TEXT    Take shot of first element matching this CSS selector
  -p, --padding INTEGER  When using selectors, add this much padding in pixels
  -j, --javascript TEXT  Execute this JS prior to taking the shot
  --retina               Use device scale factor of 2
  --quality INTEGER      Save as JPEG with this quality, e.g. 80
  --wait INTEGER         Wait this many milliseconds before taking the
                         screenshot
  -i, --interactive      Interact with the page in a browser before taking the
                         shot
  --devtools             Interact mode with developer tools
  --help                 Show this message and exit.

simonw · 2022-03-20T23:18:30Z

I don't like --filter X meaning "first of the things that match this filter" because I would assume that --filter meant "ALL things that match this filter".

simonw · 2022-03-20T23:19:13Z

Oh hang on...

  -s, --selector TEXT    Take shot of first element matching this CSS selector

We already have a precedent for --selector meaning "first thing that matches" rather than "all things that match".

simonw · 2022-03-20T23:19:57Z

--js-selector is the most consistent with --selector, maybe that's the best option.

Or --js-filter since it's more of a filter than a selector operation. But again, that implies match all, not match first to me.

simonw · 2022-03-20T23:20:50Z

I think I've talked myself around to your original suggestion of --js-selector here!

simonw · 2022-03-20T23:23:50Z

If it's returning the first match it can use Array.find() instead of Array.filter(): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/find

The find() method returns the first element in the provided array that satisfies the provided testing function. If no values satisfy the testing function, undefined is returned.

psychemedia · 2022-03-21T10:10:27Z

TBH, your original work-around isn't that clunky.. it's really elegant in how it riffs on what's currently available, feature wise

rdmurphy · 2022-03-29T06:22:27Z

It'd require some reworking of a couple pieces of shot-scraper but I think what you're after @psychemedia could probably be accomplished using XPath via document.evaluate. Playwright supports XPath out of the box but as you mentioned the hard-coded querySelector is the limiting factor.

There could of course be an --xpath-selector or --xpath option, but it might be interesting to match Playwright's logic on this:

Selector starting with // or .. is assumed to be an xpath selector. For example, Playwright converts '//html/body' to 'xpath=//html/body'.

Then you could just use --selector (or add --multi-selector or something) and have it swap into XPath mode when it picks up on a XPath signal. (xpath= or // or ..)

Relevant Playwright client code:

https://github.com/microsoft/playwright/blob/1961959dcb726143586a384799f25f149b694214/packages/playwright-core/src/server/injected/xpathSelectorEngine.ts

simonw · 2022-04-09T04:03:51Z

OK, I built a protoype. @psychemedia you can try it out by running:

pip install https://github.com/simonw/shot-scraper/archive/8a61c7e1ba90f1210868cf7f84ece78e5f8300db.zip

Here's a demo:

shot-scraper https://simonwillison.net/2022/Apr/8/weeknotes/ \
  --js-selector 'el.tagName == "P" && el.innerText.includes("shot-scraper")' \
  --padding 15 --retina

Which produced this image (of the first paragraph on the page to contain the term shot-scraper):

Note that I had to include el.tagName == "P" in the expression - because it scans through every tag on the page, so without that it returns the html tag since that includes the shot-scraper text somewhere in its innerText!

The demo also currently outputs debug info showing the JavaScript it generated - I'd remove that before shipping the feature:

% shot-scraper https://simonwillison.net/2022/Apr/8/weeknotes/ \
      --js-selector 'el.tagName == "P" && el.innerText.includes("shot-scraper")' \
      --padding 15 --retina
() => {
Array.from(
  document.getElementsByTagName('*')
).find(el => el.tagName == "P" && el.innerText.includes("shot-scraper")).classList.add("js-selector-a1f5ba0fc4a4317e58a3bd11a0f16b96");
} ['.js-selector-a1f5ba0fc4a4317e58a3bd11a0f16b96']
Screenshot of '.js-selector-a1f5ba0fc4a4317e58a3bd11a0f16b96' on 'https://simonwillison.net/2022/Apr/8/weeknotes/' written to 'simonwillison-net-2022-Apr-8-weeknotes.1.png'

psychemedia · 2022-04-09T16:52:30Z

Ah, wonderful.. will look for something to try it out on...

psychemedia · 2022-04-09T17:06:28Z

I notice that you need to uppercase the tagname?

psychemedia · 2022-04-09T17:21:09Z

Ooh... nifty...

shot-scraper https://rallydatajunkie.com/sweden_2022_timing/overall-standings.html#overall-standings --js-selector 'el.tagName == "TR" && el.innerText.includes("ROV")' --padding 0 --retina

psychemedia · 2022-04-09T17:21:40Z

Sort of thing you could easily weave into a twitterbot..

simonw · 2022-04-11T17:35:22Z

The thing that's holding me off from merging this right now is that it's got me thinking about the multiple element case.

In trying this out I found myself wanting the ability to say "take a screenshot that incorporates all of the matching elements on the page" - where this mechanism currently filters for just the first match (using .find()):

shot-scraper/shot_scraper/cli.py

Lines 656 to 658 in 8a61c7e

    
                   Array.from( 
        
                     document.getElementsByTagName('*') 
        
                   ).find(el => {}).classList.add("{}");

This then made me think that I actually want the same ability for just raw CSS selectors. But what should that look like?

simonw · 2022-04-11T18:14:02Z

I implemented --selector-all. The next challenge is how the JavaScript equivalent of that should work.

--js-selector currently takes a JavaScript expression, e.g. --js-selector 'el.tagName == "TR" && el.innerText.includes("ROV")'

The obvious complement to the new --selector-all option would be --js-selector-all - does this make intuitive sense?

shot-scraper $URL --js-selector-all 'el.tagName == "TR" && el.innerText.includes("ROV")'

I was worried that it wouldn't make sense because there's no equivalent to the querySelector/querySelectorAll methods used by the CSS selector version - but having written it down I think it's OK. And the consistence between --selector-all and --js-selector-all feels right to me.

So I think the answer here is a --js-selector-all option, in addition to --js-selector.

simonw · 2022-04-11T18:37:21Z

Manual testing (with debug output):

% shot-scraper https://simonwillison.net/ --js-selector-all 'Array.from(el.classList).includes("day")' --padding 10
() => {
Array.from(
  document.getElementsByTagName('*')
).filter(el => Array.from(el.classList).includes("day")).forEach(el => el.classList.add("js-selector-all-d43c969104a6001c9d50b56c59430864"));
} []
Screenshot of '.js-selector-all-d43c969104a6001c9d50b56c59430864' on 'https://simonwillison.net/' written to 'simonwillison-net.2.png'

Imitates --selector-all from #64

simonw · 2022-04-11T19:02:17Z

Documentation: https://github.com/simonw/shot-scraper/blob/59b156ddd7963fd19edca214edc0ac59552efbb2/README.md#specifying-elements-using-javascript-filters

Refs #43, #56, #63, #64

simonw added the enhancement New feature or request label Mar 20, 2022

simonw added a commit that referenced this issue Apr 9, 2022

--js-selector prototype, refs #43

8a61c7e

simonw mentioned this issue Apr 9, 2022

--js-selector and --js-selector-all options #62

Merged

simonw mentioned this issue Apr 11, 2022

--selector-all to screenshot of all elements matching a selector #64

Closed

simonw added a commit that referenced this issue Apr 11, 2022

Implemented --js-selector-all, refs #43

0c7b42e

Imitates --selector-all from #64

simonw added a commit that referenced this issue Apr 11, 2022

Documentation for --js-selector and --js-selector-all, refs #43

59b156d

simonw closed this as completed in 518b6cc Apr 11, 2022

simonw added a commit that referenced this issue Apr 11, 2022

Release 0.12

7f04706

Refs #43, #56, #63, #64

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Selecting elements for screenshotting based on tag content #43

Selecting elements for screenshotting based on tag content #43

psychemedia commented Mar 14, 2022

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022 •

edited

Loading

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022 •

edited

Loading

simonw commented Mar 20, 2022 •

edited

Loading

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022

psychemedia commented Mar 21, 2022

rdmurphy commented Mar 29, 2022 •

edited

Loading

simonw commented Apr 9, 2022 •

edited

Loading

psychemedia commented Apr 9, 2022

psychemedia commented Apr 9, 2022

psychemedia commented Apr 9, 2022

psychemedia commented Apr 9, 2022

simonw commented Apr 11, 2022

simonw commented Apr 11, 2022

simonw commented Apr 11, 2022

simonw commented Apr 11, 2022

Selecting elements for screenshotting based on tag content #43

Selecting elements for screenshotting based on tag content #43

Comments

psychemedia commented Mar 14, 2022

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022 • edited Loading

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022 • edited Loading

simonw commented Mar 20, 2022 • edited Loading

simonw commented Mar 20, 2022

simonw commented Mar 20, 2022

psychemedia commented Mar 21, 2022

rdmurphy commented Mar 29, 2022 • edited Loading

simonw commented Apr 9, 2022 • edited Loading

psychemedia commented Apr 9, 2022

psychemedia commented Apr 9, 2022

psychemedia commented Apr 9, 2022

psychemedia commented Apr 9, 2022

simonw commented Apr 11, 2022

simonw commented Apr 11, 2022

simonw commented Apr 11, 2022

simonw commented Apr 11, 2022

simonw commented Mar 20, 2022 •

edited

Loading

simonw commented Mar 20, 2022 •

edited

Loading

simonw commented Mar 20, 2022 •

edited

Loading

rdmurphy commented Mar 29, 2022 •

edited

Loading

simonw commented Apr 9, 2022 •

edited

Loading