Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficiency improvements in cd/alphabet_languages #122

Merged
merged 16 commits into from
Oct 23, 2021
Merged

Efficiency improvements in cd/alphabet_languages #122

merged 16 commits into from
Oct 23, 2021

Conversation

adbar
Copy link
Contributor

@adbar adbar commented Oct 14, 2021

  • I thought that the accentuation check would be more efficient on a set
  • The values for the features target_have_accents & target_pure_latin can be cached so the code is not run all the time

Another idea: if you don't use character position in the most frequent characters per language in assets/ the chars could be a set/frozenset instead of a list, that would make the line [c for c in language_characters if c in characters] faster.

@codecov-commenter
Copy link

codecov-commenter commented Oct 14, 2021

Codecov Report

Merging #122 (a0ceedd) into master (38cfa45) will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #122      +/-   ##
==========================================
+ Coverage   90.30%   90.35%   +0.04%     
==========================================
  Files          11       11              
  Lines        1165     1171       +6     
==========================================
+ Hits         1052     1058       +6     
  Misses        113      113              
Impacted Files Coverage Δ
charset_normalizer/cd.py 96.49% <100.00%> (+0.08%) ⬆️
charset_normalizer/constant.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 38cfa45...a0ceedd. Read the comment docs.

Copy link
Member

@Ousret Ousret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a quick look over and there are some remarks. Make sure to run the bin/run_autofix.sh to ensure that linters won't fail.
Regarding the "performance" effect/benefit I don't have the time to verify it for now.

charset_normalizer/cd.py Outdated Show resolved Hide resolved
charset_normalizer/cd.py Outdated Show resolved Hide resolved
charset_normalizer/cd.py Outdated Show resolved Hide resolved
charset_normalizer/cd.py Outdated Show resolved Hide resolved
Copy link
Member

@Ousret Ousret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final thought on this.

charset_normalizer/cd.py Outdated Show resolved Hide resolved
charset_normalizer/cd.py Outdated Show resolved Hide resolved
charset_normalizer/cd.py Outdated Show resolved Hide resolved
@Ousret Ousret added the enhancement New feature or request label Oct 17, 2021
@Ousret
Copy link
Member

Ousret commented Oct 23, 2021

I brought some minor modifications to your fork.
Well, after several tests. The performance gap between Chardet and Charset-Normalizer is slightly better.
Before x4.861
After x4.916

x0.1 faster.
Repeated the test several times.

@Ousret Ousret merged commit f1cf425 into jawah:master Oct 23, 2021
@Ousret Ousret mentioned this pull request Nov 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

3 participants