Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements #91

Closed
wants to merge 4 commits into from
Closed

Performance improvements #91

wants to merge 4 commits into from

Conversation

cfvescovo
Copy link
Member

@cfvescovo cfvescovo commented Oct 31, 2022

I have tried to improve parsing performance by replacing the standard hash function with fnv and by replacing LocalNames with strings (in node.rs) as suggested in #45.
@teymour-aldridge are there any other changes we could make to improve parsing speed? Could someone that needs to parse multiple/huge HTMLs perform some benchmarks and report back?

@cfvescovo
Copy link
Member Author

I had to amend my commit because cargo fmt was complaining the imports were not sorted...

@adamreichold
Copy link
Member

I don't think using a unkeyed hash like FNV is a reasonable choice here as it opens programs using this library up to denial of service attacks. If better hashing speed is wanted, a keyed hash like ahash (default hasher used by hashbrown) would seem more appropriate IMHO.

@cfvescovo
Copy link
Member Author

You are right. Good catch! I did not think of this as an attack surface but it could definitely be exploited. Will switch to ahash ASAP

@cfvescovo cfvescovo marked this pull request as ready for review January 11, 2023 09:17
@cfvescovo cfvescovo self-assigned this Jan 11, 2023

/// The element classes.
pub classes: HashSet<LocalName>,
pub classes: HashSet<String>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we include #101, it might be preferable to keep this as LocalName as class names should be highly redundant seen over one or even multiple documents. Lazy initialization would then potentially allow avoiding some of the cost of deduplicating these strings when they are never matched for a given element.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Maybe we should consider merging #91 and #101 at the same time after #101 is ready. However, this would require a major update to the crate since #101 causes breaking changes to be made to the API.

@teymour-aldridge
Copy link
Collaborator

Hi, sorry I haven't been responsive as of late - don't really have time to look at this, but if both of you think that the changes are worthwhile then very happy to approve so that we can merge this.

@cfvescovo
Copy link
Member Author

See #101

@cfvescovo cfvescovo closed this Mar 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants