shot-scraper pdf -i #92

honzajde · 2022-10-12T12:27:08Z

I think shot-scraper pdf should support interactive mode (-i) as well. It does not right now.

BTW, could we by any chance have an HTML mode?

simonw · 2022-10-14T19:09:50Z

What would HTML mode do?

You can run shot-scraper against an HTML file on disk already, like this:

shot-scraper example.html

simonw · 2022-10-14T19:17:41Z

Unfortunately it doesn't look like it's possible to provide interactive mode for PDF printing.

I tried this prototype:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 04e4ef5..b5157b5 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -550,6 +550,12 @@ def javascript(
     type=click.FloatRange(min=0.1, max=2.0),
     help="Scale of the webpage rendering",
 )
+@click.option(
+    "-i",
+    "--interactive",
+    is_flag=True,
+    help="Interact with the page in a browser before taking the shot",
+)
 @click.option("--print-background", is_flag=True, help="Print background graphics")
 def pdf(
     url,
@@ -563,6 +569,7 @@ def pdf(
     width,
     height,
     scale,
+    interactive,
     print_background,
 ):
     """
@@ -584,13 +591,22 @@ def pdf(
     if output is None:
         output = filename_for_url(url, ext="pdf", file_exists=os.path.exists)
     with sync_playwright() as p:
-        context, browser_obj = _browser_context(p, auth)
-        page = context.new_page()
-        page.goto(url)
-        if wait:
-            time.sleep(wait / 1000)
-        if javascript:
-            _evaluate_js(page, javascript)
+        context, browser_obj = _browser_context(p, auth, interactive=interactive)
+        if interactive:
+            page = context.new_page()
+            page.goto(url)
+            context = page
+            click.echo(
+                "Hit <enter> to take the shot and close the browser window:", err=True
+            )
+            input()
+        else:
+            page = context.new_page()
+            page.goto(url)
+            if wait:
+                time.sleep(wait / 1000)
+            if javascript:
+                _evaluate_js(page, javascript)
 
         kwargs = {
             "landscape": landscape,

But when I run it I get this error:

% shot-scraper pdf -i simonwillison.net
Hit <enter> to take the shot and close the browser window:

Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/bin/shot-scraper", line 33, in <module>
    sys.exit(load_entry_point('shot-scraper', 'console_scripts', 'shot-scraper')())
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/shot-scraper/shot_scraper/cli.py", line 625, in pdf
    pdf = page.pdf(**kwargs)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 9274, in pdf
    self._sync(
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
    return task.result()
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_page.py", line 869, in pdf
    encoded_binary = await self._channel.send("pdf", params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Protocol error (Page.printToPDF): Printing is not available

It looks like the problem is that saving to PDF is only available in headless mode: https://stackoverflow.com/a/70937997/6083

PDF creation is only supported in headless mode.
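
A minimal sketch of how that constraint could be handled in code, assuming Playwright's Python sync API with Chromium. The helper names pdf_launch_options and save_pdf are hypothetical and are not part of shot-scraper; the idea is only to reject an interactive flag up front instead of failing later with the Page.printToPDF protocol error.

```python
def pdf_launch_options(interactive: bool) -> dict:
    """Return Chromium launch options for PDF output (hypothetical helper).

    PDF printing only works in headless mode, so a request for an
    interactive (headed) browser is rejected before anything is launched.
    """
    if interactive:
        raise ValueError(
            "PDF creation is only supported in headless mode; "
            "interactive mode cannot be combined with PDF output"
        )
    return {"headless": True}


def save_pdf(url: str, output: str, interactive: bool = False) -> None:
    """Render url to a PDF file at output using headless Chromium."""
    # Imported lazily so the option check above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    options = pdf_launch_options(interactive)
    with sync_playwright() as p:
        browser = p.chromium.launch(**options)
        page = browser.new_page()
        page.goto(url)
        page.pdf(path=output)
        browser.close()
```

Calling save_pdf("https://simonwillison.net/", "out.pdf") would be expected to write the PDF, while passing interactive=True raises immediately with a readable message.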

honzajde · 2022-10-15T16:36:38Z

My bad! Now we both know it...

By HTML mode I meant the ability to save the HTML of the scraped page.

simonw · 2022-10-15T18:28:39Z

You could do that using shot-scraper javascript like this:

shot-scraper javascript datasette.io 'document.body.innerHTML' | jq -r > page.html

The | jq -r bit is there because without it you get back a JSON-encoded string, with newlines converted to \n and suchlike. Piping through jq -r turns that into a regular string which you can then save to a file.
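
For illustration, here is roughly what jq -r does to a single JSON string, sketched in Python with only the standard library. The function name decode_json_string is made up for this example, and unlike jq it rejects non-string values to keep the sketch short.

```python
import json


def decode_json_string(raw: str) -> str:
    """Decode one JSON string literal into plain text, like jq -r does for strings."""
    value = json.loads(raw)
    if not isinstance(value, str):
        raise TypeError("expected a JSON string literal")
    return value


# A command that prints its result as JSON emits the quoted, escaped form.
encoded = json.dumps("<h1>Title</h1>\n<p>Body</p>")
print(encoded)                      # one line, with a literal backslash-n inside quotes
print(decode_json_string(encoded))  # two lines of plain HTML
```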

simonw · 2022-10-15T18:32:28Z

Actually that only gets everything inside <body> - if you want <html> and downwards this seems to do the trick:

shot-scraper javascript datasette.io 'document.body.parentElement.outerHTML' | jq -r > page.html

Given how non-obvious this is I wonder if it does deserve having its own special feature?

simonw · 2022-10-15T18:37:04Z

Actually this is better:

shot-scraper javascript datasette.io 'document.documentElement.outerHTML' | jq -r

simonw · 2022-10-15T18:40:33Z

I realized that pattern doesn't give you the doctype.

If you want the doctype, there's a Playwright API that can do it: https://playwright.dev/python/docs/api/class-page#page-content

page.content()

Added in: v1.8

returns: <str>

Gets the full HTML contents of the page, including the doctype.

This has convinced me that shot-scraper html would be worth adding! I'll open a new issue for that.
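
The difference between the outerHTML approach and page.content() can be sketched with Playwright's Python sync API. This is an illustration only and not how shot-scraper itself is implemented; fetch_html and has_doctype are hypothetical names, and the comparison assumes the target page actually declares a doctype.

```python
from typing import Tuple


def has_doctype(html: str) -> bool:
    """Return True if the markup starts with a doctype declaration."""
    return html.lstrip().lower().startswith("<!doctype")


def fetch_html(url: str) -> Tuple[str, str]:
    """Return (outerHTML of the root element, page.content()) for url."""
    # Imported lazily so has_doctype can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Serializes <html> and everything below it, but not the doctype.
        outer = page.evaluate("document.documentElement.outerHTML")
        # Full HTML contents of the page, including the doctype.
        full = page.content()
        browser.close()
    return outer, full
```

With a page that declares a doctype, has_doctype would be expected to return False for the first value and True for the second.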

simonw · 2022-10-15T19:28:25Z

I built that feature - documentation is here: https://shot-scraper.datasette.io/en/latest/html.html

honzajde · 2022-10-17T13:53:01Z

Thank you so much!

simonw added the enhancement label (New feature or request) Oct 14, 2022

simonw closed this as completed Oct 14, 2022

simonw added the wontfix label (This will not be worked on) Oct 14, 2022

simonw reopened this Oct 15, 2022

simonw mentioned this issue Oct 15, 2022

shot-scraper javascript -r/--raw option #95

Closed

simonw closed this as completed Oct 15, 2022

simonw mentioned this issue Oct 15, 2022

shot-scraper html command #96

Closed
