shot-scraper pdf -i #92

Closed
honzajde opened this issue Oct 12, 2022 · 9 comments
Labels
enhancement New feature or request wontfix This will not be worked on

Comments

@honzajde

honzajde commented Oct 12, 2022

I think shot-scraper pdf should support interactive mode (-i) as well. It does not right now.

BTW, can we by any chance have an HTML mode?

@simonw
Owner

simonw commented Oct 14, 2022

What would HTML mode do?

You can run shot-scraper against an HTML file on disk already, like this:

shot-scraper example.html

@simonw simonw added the enhancement New feature or request label Oct 14, 2022
@simonw
Owner

simonw commented Oct 14, 2022

Unfortunately it doesn't look like it's possible to provide interactive mode for PDF printing.

I tried this prototype:

diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 04e4ef5..b5157b5 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -550,6 +550,12 @@ def javascript(
     type=click.FloatRange(min=0.1, max=2.0),
     help="Scale of the webpage rendering",
 )
+@click.option(
+    "-i",
+    "--interactive",
+    is_flag=True,
+    help="Interact with the page in a browser before taking the shot",
+)
 @click.option("--print-background", is_flag=True, help="Print background graphics")
 def pdf(
     url,
@@ -563,6 +569,7 @@ def pdf(
     width,
     height,
     scale,
+    interactive,
     print_background,
 ):
     """
@@ -584,13 +591,22 @@ def pdf(
     if output is None:
         output = filename_for_url(url, ext="pdf", file_exists=os.path.exists)
     with sync_playwright() as p:
-        context, browser_obj = _browser_context(p, auth)
-        page = context.new_page()
-        page.goto(url)
-        if wait:
-            time.sleep(wait / 1000)
-        if javascript:
-            _evaluate_js(page, javascript)
+        context, browser_obj = _browser_context(p, auth, interactive=interactive)
+        if interactive:
+            page = context.new_page()
+            page.goto(url)
+            context = page
+            click.echo(
+                "Hit <enter> to take the shot and close the browser window:", err=True
+            )
+            input()
+        else:
+            page = context.new_page()
+            page.goto(url)
+            if wait:
+                time.sleep(wait / 1000)
+            if javascript:
+                _evaluate_js(page, javascript)
 
         kwargs = {
             "landscape": landscape,

But when I run it I get this error:

% shot-scraper pdf -i simonwillison.net
Hit <enter> to take the shot and close the browser window:

Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/bin/shot-scraper", line 33, in <module>
    sys.exit(load_entry_point('shot-scraper', 'console_scripts', 'shot-scraper')())
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/shot-scraper/shot_scraper/cli.py", line 625, in pdf
    pdf = page.pdf(**kwargs)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/sync_api/_generated.py", line 9274, in pdf
    self._sync(
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_sync_base.py", line 111, in _sync
    return task.result()
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_page.py", line 869, in pdf
    encoded_binary = await self._channel.send("pdf", params)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/Users/simon/.local/share/virtualenvs/shot-scraper-sQHOtKI2/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Protocol error (Page.printToPDF): Printing is not available

It looks like the problem is that save to PDF is only available in headless mode: https://stackoverflow.com/a/70937997/6083

PDF creation is only supported in headless mode.
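
One way to picture the constraint is the sketch below: a small guard that rejects the interactive combination up front, plus a headless PDF export using Playwright's sync API. The helper names check_pdf_mode and save_pdf are hypothetical and are not part of shot-scraper; this is an illustration under those assumptions, not the project's implementation.

```python
def check_pdf_mode(interactive: bool) -> None:
    """Hypothetical guard: fail early instead of hitting 'Printing is not available'."""
    if interactive:
        raise ValueError(
            "PDF creation is only supported in headless mode; "
            "interactive mode cannot be combined with PDF output"
        )


def save_pdf(url: str, output: str, interactive: bool = False) -> None:
    """Hypothetical helper: render url to a PDF file with headless Chromium."""
    check_pdf_mode(interactive)
    # Imported lazily so the guard above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=True is what makes page.pdf() available.
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.pdf(path=output)
        browser.close()


# Example call (needs Playwright and its Chromium build installed):
# save_pdf("https://simonwillison.net/", "simonwillison.pdf")
```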

@simonw simonw closed this as completed Oct 14, 2022
@simonw simonw added the wontfix This will not be worked on label Oct 14, 2022
@honzajde
Author

My bad! Now we both know it...

By HTML mode I meant the possibility of saving the HTML of the scraped page.

@simonw
Owner

simonw commented Oct 15, 2022

You could do that using shot-scraper javascript like this:

shot-scraper javascript datasette.io 'document.body.innerHTML' | jq -r > page.html

The | jq -r bit is needed because without it you get back a JSON-encoded string, wrapped in quotes and with newlines converted to \n and suchlike. Piping through jq -r turns that into a regular string which you can then save to a file.
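
To see what that decoding step does, here is a small Python sketch using only the standard library. The sample markup is made up for illustration, and json.loads stands in for the role jq -r plays in the pipeline.

```python
import json

# Made-up markup, standing in for what document.body.innerHTML might return.
html = '<h1>Title</h1>\n<p class="intro">Hello</p>'

# JSON-encoded form: wrapped in double quotes, with the newline and inner quotes escaped.
encoded = json.dumps(html)
print(encoded)

# Decoding recovers the raw string, which is the form you want to save to a file.
decoded = json.loads(encoded)
print(decoded)
```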

@simonw simonw reopened this Oct 15, 2022
@simonw
Owner

simonw commented Oct 15, 2022

Actually that only gets everything inside <body> - if you want <html> and downwards this seems to do the trick:

shot-scraper javascript datasette.io 'document.body.parentElement.outerHTML' | jq -r > page.html

Given how non-obvious this is I wonder if it does deserve having its own special feature?

@simonw
Owner

simonw commented Oct 15, 2022

Actually this is better:

shot-scraper javascript datasette.io 'document.documentElement.outerHTML' | jq -r

@simonw
Owner

simonw commented Oct 15, 2022

I realized that pattern doesn't give you the doctype.

If you want the doctype, there's a Playwright API that can do it: https://playwright.dev/python/docs/api/class-page#page-content

page.content()

Added in: v1.8

Gets the full HTML contents of the page, including the doctype.
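
A minimal sketch of using that API directly from Python might look like the following. The function names full_html and fetch_html are hypothetical, and this is not necessarily how a shot-scraper html command would be implemented.

```python
def full_html(page) -> str:
    """Return the serialized document, doctype included, via Playwright's page.content()."""
    return page.content()


def fetch_html(url: str) -> str:
    """Hypothetical helper: load url in headless Chromium and return its full HTML."""
    # Imported lazily so full_html() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = full_html(page)
        browser.close()
    return html


# Example call (needs Playwright and its Chromium build installed):
# print(fetch_html("https://datasette.io/"))
```

Unlike the document.documentElement.outerHTML approach, this returns a plain Python string, so no JSON decoding step is needed.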

This has convinced me that shot-scraper html would be worth adding! I'll open a new issue for that.

@simonw
Owner

simonw commented Oct 15, 2022

I built that feature - documentation is here: https://shot-scraper.datasette.io/en/latest/html.html

@honzajde
Author

Thank you so much!
