shot-scraper pdf -i #92
Comments
What would HTML mode do? You can run
|
Unfortunately it doesn't look like it's possible to provide interactive mode for PDF printing. I tried this prototype:
diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 04e4ef5..b5157b5 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -550,6 +550,12 @@ def javascript(
     type=click.FloatRange(min=0.1, max=2.0),
     help="Scale of the webpage rendering",
 )
+@click.option(
+    "-i",
+    "--interactive",
+    is_flag=True,
+    help="Interact with the page in a browser before taking the shot",
+)
 @click.option("--print-background", is_flag=True, help="Print background graphics")
 def pdf(
     url,
@@ -563,6 +569,7 @@ def pdf(
     width,
     height,
     scale,
+    interactive,
     print_background,
 ):
     """
@@ -584,13 +591,22 @@ def pdf(
     if output is None:
         output = filename_for_url(url, ext="pdf", file_exists=os.path.exists)
     with sync_playwright() as p:
-        context, browser_obj = _browser_context(p, auth)
-        page = context.new_page()
-        page.goto(url)
-        if wait:
-            time.sleep(wait / 1000)
-        if javascript:
-            _evaluate_js(page, javascript)
+        context, browser_obj = _browser_context(p, auth, interactive=interactive)
+        if interactive:
+            page = context.new_page()
+            page.goto(url)
+            context = page
+            click.echo(
+                "Hit <enter> to take the shot and close the browser window:", err=True
+            )
+            input()
+        else:
+            page = context.new_page()
+            page.goto(url)
+            if wait:
+                time.sleep(wait / 1000)
+            if javascript:
+                _evaluate_js(page, javascript)
         kwargs = {
             "landscape": landscape,
But when I run it I get this error:
It looks like the problem is that save to PDF is only available in headless mode: https://stackoverflow.com/a/70937997/6083
|
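To make the headless constraint concrete, here is a minimal sketch (not shot-scraper's actual code) of a helper that always launches Chromium headless for PDF output and rejects an interactive request up front. The name `save_pdf` and its `interactive` parameter are hypothetical. The Playwright calls (`chromium.launch`, `new_page`, `goto`, `page.pdf`) come from its Python sync API, and the headless-only restriction is taken from the Stack Overflow answer linked above, so it may not hold for every Playwright version.

```python
def save_pdf(playwright, url, output, interactive=False):
    """Save `url` as a PDF at `output` using headless Chromium.

    `playwright` is the object yielded by `sync_playwright()`.
    `save_pdf` and `interactive` are hypothetical names for illustration.
    """
    if interactive:
        # Assumption from the thread: page.pdf() fails in a headed browser,
        # so refuse early instead of opening a window that cannot print.
        raise ValueError("PDF output requires headless mode")
    browser = playwright.chromium.launch(headless=True)
    try:
        page = browser.new_page()
        page.goto(url)
        page.pdf(path=output)
    finally:
        # Close the browser even if navigation or printing fails.
        browser.close()
    return output
```

With Playwright and Chromium installed, this would be called as `with sync_playwright() as p: save_pdf(p, "https://example.com", "example.pdf")`.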
My bad! Now we both know. By HTML mode I meant the ability to save the HTML of the scraped page. |
You could do that using
The |
Actually that only gets everything inside
Given how non-obvious this is, I wonder if it deserves its own dedicated feature? |
Actually this is better:
|
I realized that pattern doesn't give you the doctype. If you want the doctype, there's a Playwright API that can do it: https://playwright.dev/python/docs/api/class-page#page-content
This has convinced me that |
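The doctype point can be sketched as follows. This is an illustration under assumptions, not the code behind the feature: `page_html` and `demo` are hypothetical names, and `document.documentElement.outerHTML` stands in for a JavaScript pattern that omits the doctype (the exact snippets from the thread are not preserved here). `page.content()`, `page.evaluate()` and `page.set_content()` are real methods of Playwright's Python `Page` class.

```python
def page_html(page, full_document=True):
    """Return the HTML of a Playwright page (hypothetical helper).

    full_document=True  -> page.content(), which includes the doctype.
    full_document=False -> outerHTML of the root element, without the doctype.
    """
    if full_document:
        return page.content()
    return page.evaluate("document.documentElement.outerHTML")


def demo():
    """Print both variants; needs Playwright and a Chromium install."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content("<!DOCTYPE html><html><body><p>Hello</p></body></html>")
        print(page_html(page))                       # full document
        print(page_html(page, full_document=False))  # root element only
        browser.close()
```

`demo()` is not called automatically; run it after `pip install playwright` and `playwright install chromium` to compare the two outputs.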
I built that feature - documentation is here: https://shot-scraper.datasette.io/en/latest/html.html |
Thank you so much! |
I think it should support interactive mode for PDF as well. It does not right now.
BTW, can we by any chance have an HTML mode?