-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add --format-email to perform full validation on "email" and "idn-email" formats #460
base: main
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
I was able to try this out locally and confirmed that strings like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for coming to the table with working code and putting forth so much initial effort! This is one where I need to think about what exactly to do.
There are three issues, as I see it:
- a featureset this size needs unit tests
- as a maintainer, I need to understand it
- what is the right CLI interface?
Regarding (1), adding tests is something that we can address, but having good tests involves some understanding of the regexes.
Which leads to (2), understanding the expressions. I left an inline comment on one of them -- I think it needs a little bit of work to clarify but not too much. The RFC6531 regex is, TBH, really intimidating. I need to take some time to figure out how to even approach learning what it does.
My third item, (3), is that I don't know that if we're adding these regexes they shouldn't be the only behavior. The presence of a new CLI switch for this gives me some pause. There's more nuance around the regex format, and I don't think it's necessarily a good example to follow. Too many CLI flags and options can overwhelm a user, so we should only expose a control if there's a strong reason to do so. Do these expressions have false negatives? If not, I think it would be better not to expose any kind of flag and simply make these the behavior.
I'm not sure what the exact next step is here. Adding tests and tinkering on the interface are relatively straightforward TODO items. Understanding that sizable regex has a less obvious path. I'll need to find some time to really sit down and read through it carefully, and probably sit with RFC 6531 as well. (I did this with the datetime formats and it wasn't too bad, but the email format is significantly more complex.)
\[[\t -Z^-~]*] | ||
) | ||
$ | ||
""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding this regex a bit difficult to read. In particular, I'm seeing some unconventional range expressions in the character classes, like /-9
, -Z
, and ^-~
.
These are valid, but they aren't the way character classes are typically written. Perhaps other people have an expert and intuitive knowledge of what chr(ord(" ")+1)
will be, but I definitely don't.
Can these be rewritten such that the suite of characters matched is more obvious to a reader? For example, rather than [/-9]
, I would much rather see [/0-9]
. The fact that the lowercase letters are captured with ^-~
caught me particularly off-guard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I'm not the original author of these regexes either (links in PR description). I'll see what I can do about cleaning up those character classes though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some time this week to revisit this, and reworked the simpler RFC5321 validation. I also added in some length checks as well and tried validating it against the examples in this wikipedia page: https://en.wikipedia.org/wiki/Email_address#Examples
It validated/invalidated as expected except for a couple cases:
I❤️CHOCOLATE@example.com
was incorrectly found invalid, likely because of UTF-8- This is actually the correct behavior. UTF-8 email addresses should only be allowed in the
idn-email
format
- This is actually the correct behavior. UTF-8 email addresses should only be allowed in the
i.like.underscores@but_they_are_not_allowed_in_this_part
was incorrectly found invalid, we allow underscores in the domain part of the regex.
Let me know if the regex is easier to understand and if you think we should need to handle non-ascii strings like that utf-8 one.
I have not revisited the idn-email validator yet
Co-authored-by: Stephen Rosen <sirosen@globus.org>
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
Add more idn email tests
for more information, see https://pre-commit.ci
Setting
--format-email=full
will perform a stronger regex matching on strings with format set toemail
oridn-email
, rather than doing the simple check from jsonschema. The default option of--format-email=default
will only check that there is a@
symbol in the string.idn-email regex based on: https://gist.github.com/baker-ling/3b4b014ee809aa9732f9873fe060c098
email regex based on: https://stackoverflow.com/questions/13992403/regex-validation-of-email-addresses-according-to-rfc5321-rfc5322