xh_scanner loses data when tag name or attribute is too long #32

Open
jelmervdl opened this issue Dec 13, 2021 · 0 comments

jelmervdl commented Dec 13, 2021

While debugging browsermt/bergamot-translator#273, I noticed that xh_scanner checks MAX_TOKEN_SIZE everywhere it adds characters to a buffer, but does not call push_back(c) when the limit is hit. As a result, whenever one of the for-loops that fill its internal buffers hits that limit, a character may be lost.
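To illustrate the pattern, here is a minimal sketch. It is not the actual xh_scanner code: `MiniScanner`, `read_chunk` and `read_all` are hypothetical names, and it assumes that `push_back(c)` hands a character back to the input so the next read sees it again. The loop reads one character before it checks the limit, so unless that character is pushed back, it is dropped.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a scanner with one character of push-back.
class MiniScanner {
public:
  explicit MiniScanner(std::string input) : input_(std::move(input)) {}

  // Returns the pushed-back character if there is one, otherwise the next
  // input character, or '\0' at the end of the input.
  char get_char() {
    if (has_pushed_) {
      has_pushed_ = false;
      return pushed_;
    }
    return pos_ < input_.size() ? input_[pos_++] : '\0';
  }

  // Assumed semantics: return c to the input for the next get_char().
  void push_back(char c) {
    pushed_ = c;
    has_pushed_ = true;
  }

  // Reads at most max_size characters into a buffer. The character that
  // triggers the limit has already been consumed, so it is lost unless
  // push_back_on_limit is true.
  std::string read_chunk(std::size_t max_size, bool push_back_on_limit) {
    std::string buffer;
    for (char c = get_char(); c != '\0'; c = get_char()) {
      if (buffer.size() >= max_size) {
        if (push_back_on_limit) push_back(c);
        break;
      }
      buffer.push_back(c);
    }
    return buffer;
  }

private:
  std::string input_;
  std::size_t pos_ = 0;
  char pushed_ = '\0';
  bool has_pushed_ = false;
};

// Reads the whole input chunk by chunk and concatenates the chunks.
// max_size must be greater than zero.
std::string read_all(const std::string& input, std::size_t max_size,
                     bool push_back_on_limit) {
  MiniScanner scanner(input);
  std::string result;
  for (;;) {
    std::string chunk = scanner.read_chunk(max_size, push_back_on_limit);
    if (chunk.empty()) break;
    result += chunk;
  }
  return result;
}
```

With a limit of 3, `read_all("abcdefgh", 3, false)` yields `"abcefg"` (one character lost at each limit), while `read_all("abcdefgh", 3, true)` reproduces the input unchanged.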

I think this only affects CDATA sections, comments, attribute values and tag names, so for the main use case of warc2text this bug has little impact.

Edit: Thinking about it, it would only affect the tag filters.
