Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create NON_PRINTABLE regex at runtime #301

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

therve
Copy link

@therve therve commented May 17, 2019

This moves Reader.NON_PRINTABLE to a property, instead of a class variable.

We're testing the 5.1 release on Python 2.7 with UCS4 support, and we saw an
increase of 30M due to this single regular expression. As we're using the C
extension, this code path is not even triggered, so making it evaluated at
runtime works around the issue.

I'm curious how the C extension does the job and if there isn't a difference in
behavior.

This isn't an issue with Python3.

@roehling
Copy link

I submitted a similar PR last year, where I replaced the regex with proper unicode character class queries: #124

@therve
Copy link
Author

therve commented May 22, 2019

That would work for me too. I'm still concerned with the behavior of the C code, but that removes the problematic regex.

@roehling
Copy link

AFAICT the C module relies on the libyaml implementation, which does not use regular expressions but a simple range check

@albertvaka
Copy link

albertvaka commented Jun 19, 2019

@nitzmahone Any chance we can get this merged? Or #124 for that matter. It impacts both the resource usage and load time.

@nitzmahone
Copy link
Member

We're planning on poking at stuff for the next release pretty soon- I need to go back and untangle why the current split impl was done the way it was before deciding what makes the most sense here (vs #124). Ideally we shouldn't need a regex at all, because clearly there's a significant expense to it, but I'd need to measure to see which is worse.

@albertvaka
Copy link

albertvaka commented Mar 13, 2020

@nitzmahone Any chance this can be merged? The regex changed so it's conflicting now :(

@mx-psi
Copy link

mx-psi commented Mar 3, 2021

@nitzmahone Are there any blockers to merge this PR (once the merge conflict is solved)? We would benefit from it, having to manually patch pyyaml is not ideal

@therve therve force-pushed the non-printable-regex branch from f84086e to 5080507 Compare March 3, 2021 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants