Minor request: \v for vertical spacing #477

Closed
dchaplinsky opened this issue Aug 15, 2022 · 13 comments
Comments

@dchaplinsky

Hi!

I'm using the regex lib to port the LanguageTool libs (originally Java) for sentence and word tokenization.
These rely heavily on \v and \h. Some of the rules ship in XML files full of regexes, and I'd rather not alter them, so that I don't have to maintain a separate copy. I can sort of work around it by replacing \v with VERTICAL_SPACE: str = "\u000a\u000b\u000c\u000d\u0085\u2028\u2029", but that's another tiny nightmare, since the regexes come in different forms: \v*, [\v\t]*, etc.

Please consider adding a \v escape for vertical whitespace.
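The two forms mentioned above (\v* and [\v\t]*) need different substitutions: outside a character class \v has to become a bracketed class, while inside one only the members should be spliced in. A minimal sketch of such a translator follows, using only the standard library. The function name `expand_vertical_space` is hypothetical, and the scanner is deliberately simplistic: it assumes classes are not nested, that `]` is never the first literal member of a class, and that \v is not used as a range endpoint.

```python
import re

# The seven code points from the workaround above, written as regex escapes
# so the result stays valid even under re.VERBOSE.
VERTICAL_SPACE = r"\x0a\x0b\x0c\x0d\x85\u2028\u2029"


def expand_vertical_space(pattern: str) -> str:
    """Rewrite Java-style \\v into an explicit character class (hypothetical helper)."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "v":
                # Inside [...] splice the members only; outside, add brackets.
                out.append(VERTICAL_SPACE if in_class else "[" + VERTICAL_SPACE + "]")
            else:
                # Any other escape (including an escaped backslash) is copied as a pair.
                out.append(ch + nxt)
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    for source in (r"\v*", r"[\v\t]*", r"a\v+b"):
        print(source, "->", expand_vertical_space(source))
```

Because escapes are consumed in pairs, an escaped backslash followed by a literal v is left untouched.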

@dchaplinsky
Author

I can see that the code suggests a \v pseudo-escape, but I can't understand why it doesn't work then:

In [3]: import regex as re

In [8]: re.search(r"\v", "\n") is None
Out[8]: True

In [9]: re.search(r"\v", "\n", flags=re.M | re.U | re.V1) is None
Out[9]: True

@mrabarnett
Owner

\v already exists in Python as being short for \x0b (LINE TABULATION):

>>> '\v'
'\x0b'
>>> '\v' == '\N{LINE TABULATION}'
True
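This also accounts for the None results reported earlier: in a pattern, \v is a single literal character rather than a class. A small check with the standard library re module (used here only to avoid a third-party dependency; the session above shows the regex module returning the same None for "\n") illustrates the difference.

```python
import re

# \v in a pattern is the literal LINE TABULATION character (U+000B) ...
vt_match = re.search(r"\v", "\x0b")

# ... so it does not match a line feed, which is why the earlier search returned None.
lf_match = re.search(r"\v", "\n")

print(vt_match is not None, lf_match is None)
```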

@dchaplinsky
Author

Thanks for the prompt reply!

Any ideas on the matching of vertical space?

@mrabarnett
Owner

There are far fewer characters that need to match: [\x0A\x0B\x0C\x0D\x85\u2028\u2029] or [\x0A-\x0D\x85\u2028\u2029].
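As a quick sanity check, the range form of that class can be compiled and tried against each of the seven code points. This sketch uses the standard library re module; the names `VERT` and `vertical` are illustrative only.

```python
import re

# Range form of the class: LF, VT, FF, CR via \x0A-\x0D, then NEL, LS, PS.
VERT = re.compile(r"[\x0A-\x0D\x85\u2028\u2029]")

vertical = ["\n", "\x0b", "\x0c", "\r", "\x85", "\u2028", "\u2029"]

# Every vertical-space character matches; space and tab do not.
print(all(VERT.fullmatch(c) for c in vertical))
print(VERT.fullmatch(" ") is None and VERT.fullmatch("\t") is None)
```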

Maybe it could be added as \V, although that would be inconsistent with \h, and there are pairs of lowercase/uppercase escape codes where the uppercase one is the negative of the lowercase one, e.g. \d and \D. On the other hand, those implementations that have \h and \v don't have \H and \V.

Also, I don't want to add something that the re module might do differently if it were added later.

That's why it hasn't been added already.

@dchaplinsky
Author

dchaplinsky commented Aug 16, 2022 via email

@mrabarnett
Owner

I've come across a mention of \H and \V, so using \V would be a bad idea.

@dchaplinsky
Author

dchaplinsky commented Aug 16, 2022 via email

@mrabarnett
Owner

Now I'm thinking about \y and \Y, which look a little like \v and \V. PostgreSQL uses them instead of \b and \B, which every other implementation that I know of uses, possibly because \b normally represents \x08 outside regex, and still does within character classes.

I want the regex module to remain compatible with the re module, and just in case they ever get added there in the future, I'm soliciting opinions on python-dev.

@mrabarnett
Owner

I've added \p{HorizSpace} (\p{H}) and \p{VertSpace} (\p{V}) in regex 2022.8.17, which is currently being built on GitHub and should arrive on PyPI soon.
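With these properties available, Java-style patterns could be rewritten mechanically before compiling them. The following is only a sketch: `to_property_escapes` is a hypothetical helper, it assumes \p{VertSpace} and \p{HorizSpace} cover the same characters the original rules expect from \v and \h, and the demo at the bottom runs only if a new enough regex module is installed.

```python
import re

# Mapping from the Java-style escapes to the property names announced above.
_PROPERTIES = {"v": r"\p{VertSpace}", "h": r"\p{HorizSpace}"}


def to_property_escapes(pattern: str) -> str:
    """Replace \\v and \\h with property escapes (hypothetical helper).

    Escapes are matched as backslash-plus-character pairs, so an escaped
    backslash followed by a literal v is left alone.
    """
    return re.sub(
        r"\\(.)",
        lambda m: _PROPERTIES.get(m.group(1), m.group(0)),
        pattern,
        flags=re.S,
    )


try:
    import regex  # third-party; the properties need 2022.8.17 or later
except ImportError:
    regex = None

if __name__ == "__main__" and regex is not None:
    rx = regex.compile(to_property_escapes(r"[\v\t]*"))
    print(bool(rx.fullmatch("\n\t")))
```

Since a property escape is written the same way inside and outside a character class, this version needs no bracket tracking.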

@dchaplinsky
Author

Wow, many thanks!

@dchaplinsky
Author

dchaplinsky commented Oct 11, 2022 via email

@mrabarnett
Owner

Given the feedback on python-dev, I won't be adding \y and \Y. What I've already added should suffice.

@dchaplinsky
Author

dchaplinsky commented Oct 11, 2022 via email
