[META] Summary of HPLT requested changes #40

Open · 6 of 8 tasks

jelmervdl (Member) opened this issue Oct 4, 2023 · 0 comments

  • Further extend the JSONL output to contain all text and metadata so it forms the complete output, without base64-encoding the document. We'll need proper JSON escaping to deal with non-Unicode data, or a guarantee that all text coming out of warc2text is valid Unicode (see the escaping sketch after this list). See alternative output format based on JSONlines #34 and Add --jsonl option #35.
  • For each text segment (i.e. line) in the text, also mark the block-level tag it was found in. This should help identify the short <li> and <td> data, although I would not be surprised if we see a lot of <div>. A bookkeeping sketch follows the list. Track html tags #46
  • Output the crawl timestamp with the metadata. Add --jsonl option #35
  • Output the byte offset at which the gzip-compressed WARC record begins (see the offset-scanning sketch after this list).
  • Replace fasttext with fastertext. It's free speed, except that that repo is currently missing the string_view modification.
  • Add an option to skip langid entirely and just write a single output; we can then do langid downstream if we decide to. The idea being that any mistake we make with langid in warc2text is irrecoverable: once a document is wrongly classified, the only correction we can do is remove it at the end. We have no way of moving a document into the correct stream. We discussed improving the langid inside warc2text, but the argument was that developing good langid in C++ alone is harder.
    Right now you could decide to ignore the language attribute in the JSON output, since that doesn't get split into multiple files anyway. I don't think the current lang-id is slow enough to justify a special bypass option for it.
  • Add an option à la pdf-pass to write the robots.txt responses to a separate warc. Also include 404s etc., so we know which domains were asked but did not give us a robots.txt (which we'll interpret as crawling being allowed). Shunt robots.txt responses to separate warc #41
  • Boilerplate detection like trafilatura's might work, but it is relatively expensive since it needs to build a proper DOM tree, and it would be a lot of work to port to C++. We will first try some simpler rule/classification-based document prefix/suffix removal on the text data itself (a minimal sketch follows).
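
The escaping question in the first item boils down to: what do we do with bytes that are not valid UTF-8, given that JSON strings must be valid Unicode? A minimal sketch of one option, replacing invalid sequences with U+FFFD while escaping the JSON control characters. The function names are hypothetical, not part of warc2text, and the validator deliberately skips overlong/surrogate rejection for brevity:

```cpp
#include <cstdio>
#include <string>

// Length (1-4) of a structurally valid UTF-8 sequence at s[i], or 0 if the
// bytes are not valid UTF-8. (Overlong encodings and surrogates are not
// rejected here; a production validator should also handle those.)
static size_t utf8_sequence_length(const std::string &s, size_t i) {
    unsigned char c = s[i];
    size_t len;
    if (c < 0x80) return 1;
    else if ((c & 0xE0) == 0xC0) len = 2;
    else if ((c & 0xF0) == 0xE0) len = 3;
    else if ((c & 0xF8) == 0xF0) len = 4;
    else return 0;
    if (i + len > s.size()) return 0;
    for (size_t j = 1; j < len; ++j)
        if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80) return 0;
    return len;
}

// Escape arbitrary bytes into the body of a JSON string literal.
std::string json_escape(const std::string &s) {
    std::string out;
    for (size_t i = 0; i < s.size();) {
        unsigned char c = s[i];
        switch (c) {
            case '"':  out += "\\\""; ++i; continue;
            case '\\': out += "\\\\"; ++i; continue;
            case '\b': out += "\\b";  ++i; continue;
            case '\f': out += "\\f";  ++i; continue;
            case '\n': out += "\\n";  ++i; continue;
            case '\r': out += "\\r";  ++i; continue;
            case '\t': out += "\\t";  ++i; continue;
        }
        if (c < 0x20) {                      // remaining control chars: \u00XX
            char buf[8];
            std::snprintf(buf, sizeof(buf), "\\u%04x", c);
            out += buf;
            ++i;
        } else if (size_t len = utf8_sequence_length(s, i)) {
            out.append(s, i, len);           // valid UTF-8 passes through
            i += len;
        } else {
            out += "\xEF\xBF\xBD";           // U+FFFD REPLACEMENT CHARACTER
            ++i;
        }
    }
    return out;
}
```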
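For the block-level tag item, the core is just a stack of open block elements during traversal; each emitted text segment gets stamped with the innermost one. The callbacks below are hypothetical stand-ins for whatever HTML walker warc2text already uses, and the tag set and struct names are made up for illustration:

```cpp
#include <set>
#include <string>
#include <vector>

struct Segment {
    std::string text;
    std::string block_tag;   // e.g. "p", "li", "td"; expect a lot of "div"
};

static const std::set<std::string> kBlockTags = {
    "p", "div", "li", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6",
    "blockquote", "pre", "dd", "dt", "figcaption", "caption"};

struct Extractor {
    std::vector<std::string> stack{"html"};  // innermost block tag on top
    std::vector<Segment> segments;

    void on_open_tag(const std::string &tag) {
        if (kBlockTags.count(tag)) stack.push_back(tag);
    }
    void on_close_tag(const std::string &tag) {
        // Only pop on a matching close; malformed HTML closes out of order.
        if (kBlockTags.count(tag) && stack.size() > 1 && stack.back() == tag)
            stack.pop_back();
    }
    void on_text(const std::string &text) {
        if (!text.empty()) segments.push_back({text, stack.back()});
    }
};
```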
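On the byte-offset item: WARC files are conventionally compressed with one gzip member per record, so the offset where each member starts is the offset of the record. A rough sketch of recovering those offsets with zlib (windowBits 15 + 16 for the gzip wrapper, inflateReset at each member boundary); warc2text would track this inside its existing reader rather than in a standalone scanner like this:

```cpp
#include <cstdio>
#include <vector>
#include <zlib.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) return 1;

    z_stream zs = {};
    inflateInit2(&zs, 15 + 16);               // 15 + 16: expect a gzip wrapper

    std::vector<unsigned char> in(1 << 16), out(1 << 16);
    long long consumed = 0;                   // compressed bytes consumed so far
    long long offset = 0;                     // start of the current member
    bool pending = true;                      // a member starts at `offset`

    while (true) {
        if (zs.avail_in == 0) {
            size_t n = std::fread(in.data(), 1, in.size(), f);
            if (n == 0) break;                // EOF: no further member
            zs.next_in = in.data();
            zs.avail_in = static_cast<uInt>(n);
        }
        if (pending) {
            std::printf("record member at byte offset %lld\n", offset);
            pending = false;
        }
        zs.next_out = out.data();             // decompressed output is discarded
        zs.avail_out = static_cast<uInt>(out.size());
        uInt before = zs.avail_in;
        int ret = inflate(&zs, Z_NO_FLUSH);
        consumed += before - zs.avail_in;
        if (ret == Z_STREAM_END) {
            offset = consumed;                // the next member starts here
            inflateReset(&zs);                // reuse state for the next member
            pending = true;
        } else if (ret != Z_OK) {
            break;                            // corrupt input; real code would report it
        }
    }
    inflateEnd(&zs);
    std::fclose(f);
    return 0;
}
```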
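And for the last item, the "simpler rule-based prefix/suffix removal" could start as small as trimming edge lines that don't look like prose. The thresholds below are invented for illustration; real rules would need to be tuned on crawled data:

```cpp
#include <string>
#include <vector>

// Heuristic: a line "looks like content" if it is reasonably long or ends
// like a sentence. Everything else near the document edges is treated as
// navigation chrome.
static bool looks_like_content(const std::string &line) {
    if (line.size() >= 80) return true;                  // long lines are usually prose
    if (line.empty()) return false;
    char last = line.back();
    return last == '.' || last == '!' || last == '?';    // short but sentence-like
}

std::vector<std::string> trim_boilerplate(const std::vector<std::string> &lines) {
    size_t begin = 0, end = lines.size();
    while (begin < end && !looks_like_content(lines[begin])) ++begin;   // prefix
    while (end > begin && !looks_like_content(lines[end - 1])) --end;   // suffix
    return {lines.begin() + begin, lines.begin() + end};
}
```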