xh_scanner loses data when tag name or attribute is too long #32

jelmervdl · 2021-12-13T21:03:22Z

While debugging browsermt/bergamot-translator#273 I noticed that xh_scanner checks MAX_TOKEN_SIZE everywhere it adds characters to a buffer, but does not call push_back(c) when the limit is hit. As a result, if any of the for-loops that fill its internal buffers hits that limit, the character that was just read may be lost.

I think this only affects CDATA sections, comments, attribute values and tag names, so for the main use case of warc2text the impact of this bug is small.

Edit: Thinking about it, it would only affect the tag filters.
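
To illustrate the pattern, here is a minimal sketch. It is not the actual xh_scanner code: `MiniScanner`, `kMaxTokenSize`, `unread()` and both `read_name_*` functions are hypothetical names made up for this example, with `unread()` playing the role of `push_back(c)` and a deliberately tiny limit so the effect is easy to see.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for MAX_TOKEN_SIZE; kept tiny for the demonstration.
constexpr std::size_t kMaxTokenSize = 4;

// Hypothetical scanner over an in-memory string (not the xh_scanner API).
struct MiniScanner {
    std::string input;
    std::size_t pos = 0;

    explicit MiniScanner(std::string s) : input(std::move(s)) {}

    // Returns the next character, or '\0' at end of input.
    char get_char() { return pos < input.size() ? input[pos++] : '\0'; }

    // Puts the most recently read character back into the input.
    void unread() {
        assert(pos > 0);
        --pos;
    }

    // Lossy variant: when the buffer is full, the character that was
    // just read is neither stored nor returned to the input.
    std::string read_name_lossy() {
        std::string token;
        while (char c = get_char()) {
            if (token.size() >= kMaxTokenSize) break;  // c is dropped here
            token.push_back(c);
        }
        return token;
    }

    // Variant that returns the overflowing character to the input
    // before stopping, so the next read starts with it.
    std::string read_name_safe() {
        std::string token;
        while (char c = get_char()) {
            if (token.size() >= kMaxTokenSize) {
                unread();
                break;
            }
            token.push_back(c);
        }
        return token;
    }
};
```

With the input `abcdefgh`, two consecutive calls to `read_name_lossy()` yield `abcd` and `fgh` (the `e` disappears), while two calls to `read_name_safe()` yield `abcd` and `efgh`.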


jelmervdl mentioned this issue Dec 13, 2021

Remove value length limit from HTML parser & interpolated alignments browsermt/bergamot-translator#274

Merged
