[META] Summary of HPLT requested changes #40

Open · 6 of 8 tasks

jelmervdl (Member) opened this issue Oct 4, 2023 · 0 comments

  • Further extend the JSONL output to contain all text and metadata so it forms the complete output, without base64-encoding the document. We'll need proper JSON escaping to deal with non-Unicode data, or a guarantee that all text coming out of warc2text is valid Unicode (see the escaping sketch after this list). See alternative output format based on JSONlines #34 and Add --jsonl option #35.
  • For each text segment (i.e. line) in the text, also mark the block-level tag it was found in. This should help identify the short <li> and <td> data, although I would not be surprised if we see a lot of <div>. A bookkeeping sketch follows the list. Track html tags #46
  • Output the crawl timestamp with the metadata. Add --jsonl option #35
  • Output the byte offset at which the gzip-compressed WARC record begins (see the offset-scanning sketch after this list).
  • Replace fasttext with fastertext. It's free speed, except that that repo is currently missing the string_view modification.
  • Add an option to skip langid entirely and just write a single output; we can then do langid downstream if we decide to. The idea being that any mistake we make with langid in warc2text is irrecoverable: once a document is wrongly classified, the only correction we can do is remove it at the end. We have no way of moving a document into the correct stream. We discussed improving the langid inside warc2text, but the argument was that developing good langid in C++ alone is harder.
    Right now you could decide to ignore the language attribute in the JSON output, since that doesn't get split into multiple files anyway. I don't think the current lang-id is slow enough to justify a special bypass option for it.
  • Add an option à la pdf-pass to write the robots.txt responses to a separate warc. Also include 404s etc., so we know which domains were asked but did not give us a robots.txt (which we'll interpret as crawling being allowed). Shunt robots.txt responses to separate warc #41
  • Boilerplate detection like trafilatura's might work, but it is relatively expensive since it needs to build a proper DOM tree, and it would be a lot of work to port to C++. We will first try some simpler rule/classification-based document prefix/suffix removal on the text data itself (a minimal sketch follows).
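
The escaping question in the first item boils down to: what do we do with bytes that are not valid UTF-8, given that JSON strings must be valid Unicode? A minimal sketch of one option, replacing invalid sequences with U+FFFD while escaping the JSON control characters. The function names are hypothetical, not part of warc2text, and the validator deliberately skips overlong/surrogate rejection for brevity:

```cpp
#include <cstdio>
#include <string>

// Length (1-4) of a structurally valid UTF-8 sequence at s[i], or 0 if the
// bytes are not valid UTF-8. (Overlong encodings and surrogates are not
// rejected here; a production validator should also handle those.)
static size_t utf8_sequence_length(const std::string &s, size_t i) {
    unsigned char c = s[i];
    size_t len;
    if (c < 0x80) return 1;
    else if ((c & 0xE0) == 0xC0) len = 2;
    else if ((c & 0xF0) == 0xE0) len = 3;
    else if ((c & 0xF8) == 0xF0) len = 4;
    else return 0;
    if (i + len > s.size()) return 0;
    for (size_t j = 1; j < len; ++j)
        if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80) return 0;
    return len;
}

// Escape arbitrary bytes into the body of a JSON string literal.
std::string json_escape(const std::string &s) {
    std::string out;
    for (size_t i = 0; i < s.size();) {
        unsigned char c = s[i];
        switch (c) {
            case '"':  out += "\\\""; ++i; continue;
            case '\\': out += "\\\\"; ++i; continue;
            case '\b': out += "\\b";  ++i; continue;
            case '\f': out += "\\f";  ++i; continue;
            case '\n': out += "\\n";  ++i; continue;
            case '\r': out += "\\r";  ++i; continue;
            case '\t': out += "\\t";  ++i; continue;
        }
        if (c < 0x20) {                      // remaining control chars: \u00XX
            char buf[8];
            std::snprintf(buf, sizeof(buf), "\\u%04x", c);
            out += buf;
            ++i;
        } else if (size_t len = utf8_sequence_length(s, i)) {
            out.append(s, i, len);           // valid UTF-8 passes through
            i += len;
        } else {
            out += "\xEF\xBF\xBD";           // U+FFFD REPLACEMENT CHARACTER
            ++i;
        }
    }
    return out;
}
```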
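For the block-level tag item, the core is just a stack of open block elements during traversal; each emitted text segment gets stamped with the innermost one. The callbacks below are hypothetical stand-ins for whatever HTML walker warc2text already uses, and the tag set and struct names are made up for illustration:

```cpp
#include <set>
#include <string>
#include <vector>

struct Segment {
    std::string text;
    std::string block_tag;   // e.g. "p", "li", "td"; expect a lot of "div"
};

static const std::set<std::string> kBlockTags = {
    "p", "div", "li", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6",
    "blockquote", "pre", "dd", "dt", "figcaption", "caption"};

struct Extractor {
    std::vector<std::string> stack{"html"};  // innermost block tag on top
    std::vector<Segment> segments;

    void on_open_tag(const std::string &tag) {
        if (kBlockTags.count(tag)) stack.push_back(tag);
    }
    void on_close_tag(const std::string &tag) {
        // Only pop on a matching close; malformed HTML closes out of order.
        if (kBlockTags.count(tag) && stack.size() > 1 && stack.back() == tag)
            stack.pop_back();
    }
    void on_text(const std::string &text) {
        if (!text.empty()) segments.push_back({text, stack.back()});
    }
};
```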
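On the byte-offset item: WARC files are conventionally compressed with one gzip member per record, so the offset where each member starts is the offset of the record. A rough sketch of recovering those offsets with zlib (windowBits 15 + 16 for the gzip wrapper, inflateReset at each member boundary); warc2text would track this inside its existing reader rather than in a standalone scanner like this:

```cpp
#include <cstdio>
#include <vector>
#include <zlib.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) return 1;

    z_stream zs = {};
    inflateInit2(&zs, 15 + 16);               // 15 + 16: expect a gzip wrapper

    std::vector<unsigned char> in(1 << 16), out(1 << 16);
    long long consumed = 0;                   // compressed bytes consumed so far
    long long offset = 0;                     // start of the current member
    bool pending = true;                      // a member starts at `offset`

    while (true) {
        if (zs.avail_in == 0) {
            size_t n = std::fread(in.data(), 1, in.size(), f);
            if (n == 0) break;                // EOF: no further member
            zs.next_in = in.data();
            zs.avail_in = static_cast<uInt>(n);
        }
        if (pending) {
            std::printf("record member at byte offset %lld\n", offset);
            pending = false;
        }
        zs.next_out = out.data();             // decompressed output is discarded
        zs.avail_out = static_cast<uInt>(out.size());
        uInt before = zs.avail_in;
        int ret = inflate(&zs, Z_NO_FLUSH);
        consumed += before - zs.avail_in;
        if (ret == Z_STREAM_END) {
            offset = consumed;                // the next member starts here
            inflateReset(&zs);                // reuse state for the next member
            pending = true;
        } else if (ret != Z_OK) {
            break;                            // corrupt input; real code would report it
        }
    }
    inflateEnd(&zs);
    std::fclose(f);
    return 0;
}
```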
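And for the last item, the "simpler rule-based prefix/suffix removal" could start as small as trimming edge lines that don't look like prose. The thresholds below are invented for illustration; real rules would need to be tuned on crawled data:

```cpp
#include <string>
#include <vector>

// Heuristic: a line "looks like content" if it is reasonably long or ends
// like a sentence. Everything else near the document edges is treated as
// navigation chrome.
static bool looks_like_content(const std::string &line) {
    if (line.size() >= 80) return true;                  // long lines are usually prose
    if (line.empty()) return false;
    char last = line.back();
    return last == '.' || last == '!' || last == '?';    // short but sentence-like
}

std::vector<std::string> trim_boilerplate(const std::vector<std::string> &lines) {
    size_t begin = 0, end = lines.size();
    while (begin < end && !looks_like_content(lines[begin])) ++begin;   // prefix
    while (end > begin && !looks_like_content(lines[end - 1])) --end;   // suffix
    return {lines.begin() + begin, lines.begin() + end};
}
```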