
Scraper logs 1gb #186

Closed
supercoolspy opened this issue Jul 15, 2024 · 4 comments · Fixed by #190
Labels: S-cannot-reproduce (The maintainers cannot reproduce the bug.)

Comments

@supercoolspy

Scraper seems to log around 1 GB of output when I request a web page. Is there a reason for this, or a feature flag to turn it off?

@adamreichold
Member

Hard to say without seeing an example of the logs. You can certainly suppress them by configuring the log framework you employ in your program. Most of them support a per-module configuration that can be used to suppress only some sources of logs.
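A minimal sketch of that per-module approach, assuming the program uses env_logger (other log-compatible frameworks offer similar filters); the module path below is a placeholder, not a real source of these logs:

    use log::LevelFilter;

    fn main() {
        // Sketch assuming env_logger; "noisy_crate::some_module" is a
        // placeholder for whichever source of logs you want to silence.
        env_logger::Builder::new()
            .filter_level(LevelFilter::Info)
            .filter_module("noisy_crate::some_module", LevelFilter::Off)
            .init();

        // ... rest of the program ...
    }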

@cfvescovo
Member

Could you please help us reproduce this issue by showing us your code?

@cfvescovo added the S-cannot-reproduce label Jul 16, 2024
@adamreichold
Member

One specific thing we had to do to limit log volume was to set

RUST_LOG=html5ever::tree_builder=off

to prevent some messages from html5ever, the HTML parser used by this crate, about being unable to fix up invalid HTML.

@gfaster
Contributor

gfaster commented Jul 30, 2024

This is due to the derived Debug on ElementRef. ego_tree::NodeRef prints the entire tree in its derived Debug impl, but there are already outstanding (soundness?) bugs in ego_tree and I don't think they'll be fixed anytime soon.
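A small sketch of the effect with the scraper versions discussed here (before the fix referenced in this issue):

    use scraper::{Html, Selector};

    fn main() {
        let html = Html::parse_document("<html><body><p>hi</p></body></html>");
        let selector = Selector::parse("p").unwrap();
        let element = html.select(&selector).next().unwrap();

        // With the derived Debug described above, this single line prints
        // every Node in the backing ego_tree::Tree, not just the <p>.
        println!("{:?}", element);
    }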

To reproduce, you need an active log::Log logger enabled at log::Level::Debug.
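Something like the following should trigger it (a sketch, assuming env_logger; the HTML and selector are placeholders):

    use scraper::{Html, Selector};

    fn main() {
        // Any log::Log implementation enabled at Debug will do;
        // env_logger is used here only as an example.
        env_logger::Builder::new()
            .filter_level(log::LevelFilter::Debug)
            .init();

        let html = Html::parse_document("<html><body><p>hi</p></body></html>");
        let selector = Selector::parse("body p").unwrap();

        // Iterating the selection drives the selector matching whose
        // debug-level messages appear in the snippet below.
        for _element in html.select(&selector) {}
    }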

Snippet of produced output:
Matching complex selector body .entry-content* p for ElementRef { node: NodeRef { id: NodeId(173), tree: Tree { vec: [Node { parent: None, prev_sibling: None, next_sibling: None, children: Some((NodeId(2), NodeId(1001))), value: Document }, Node { parent: Some(NodeId(1)), prev_sibling: None, next_sibling: Some(NodeId(3)), children: None, value: Doctype(<!DOCTYPE html PUBLIC "" "">) }, Node { parent: Some(NodeId(1)), prev_sibling: Some(NodeId(2)), next_sibling: Some(NodeId(999)), children: Some((NodeId(4), NodeId(245))), value: Element(<html lang="en-US" data-wp-dark-mode-preset="2" data-wp-dark-mode-active="true" data-wp-dark-mode-loading="true">) }, Node { parent: Some(NodeId(3)), prev_sibling: None, next_sibling: Some(NodeId(244)), children: Some((NodeId(5), NodeId(243))), value: Element(<head>) }, Node { parent: Some(NodeId(4)), prev_sibling: None, next_sibling: Some(NodeId(6)), children: None, value: Text("\n") }, Node { parent: Some(NodeId(4)), prev_sibling: Some(NodeId(5)), next_sibling: Some(NodeId(7)), children: None, value: Element(<meta charset="UTF-8">) }, Node { parent: Some(NodeId(4)), prev_sibling: Some(NodeId(6)), next_sibling: Some(NodeId(8)), children: None, value: Text("\n") }, Node { parent: Some(NodeId(4)), prev_sibling: Some(NodeId(7)), next_sibling: Some(NodeId(9)), children: None, value: Element(<meta name="viewport" content="width=device-width, initial-scale=1">) }, Node { parent: Some(NodeId(4)), prev_sibling: Some(NodeId(8)), next_sibling: Some(NodeId(10)), children: None, value: Text("\n") }, Node { parent: Some(NodeId(4)), prev_sibling: Some(NodeId(9)), next_sibling: Some(NodeId(11)), children: None, value: Element(<link rel="profile" href="https://gmpg.org/xfn/11">) }, Node { parent: Some(NodeId(4)), prev_sibling: Some(NodeId(10)), next_sibling: Some(NodeId(12)), children: None, value: Text("\n") }, Node { parent: Some(NodeId(4)), prev_sibling: Some(NodeId(11)), next_sibling: Some(NodeId(13)), children: No

Running my program produced 4 GB in 40 seconds before I stopped it.

I will submit a PR for this.
