Support unicode email addresses #17
Comments
Just processed a few hundred thousand emails and ran into a number of Unicode addresses that were failing, e.g. |
Ah, good to know this applies to someone. I'll take a look. Do you think you could submit a PR that adds some useful test cases for Unicode? |
what's the unicode character range we are aiming to check for as valid? |
I haven't had a chance to read and digest the existing RFCs pertaining to unicode in email addresses. My intention is to follow the standard, which I hope will include the full unicode range. Sorry this hasn't moved forward, my school year finally ended, so I hope to have some time to work on this. |
Re: RFC citations and full implementation - RFC 6530 |
Thanks for summarizing your research, that's super helpful! I'm busy at this exact moment, I'll try to follow up with some questions in about an hour. |
Do you think the Unicode v6 stipulation is important for the implementation itself, given js/v8 support for unicode, broadly speaking? Do you think, in pursuit of supporting unicode email addresses, that this module should provide a normalization function which would punycode the domain as appropriate? In the latter case, would the normalization function need to apply any transformations to the local part (apart from stripping out comments and folding whitespace) to make it compatible with any SMTP implementations? |
I wouldn't think so, no. Just as a basic "support at minimum these characters" guide. I think you're right that the Node/v8 unicode support would cover the RFC. From my understanding of the RFC (I most likely misunderstood at minimum one part :) ), punycoding should not be necessary. The RFC covers internationalization of not only the local/domain piece of email addresses, but "SMTP Extension for Internationalized Email Address" as well. I'm not 100% on that assertion though |
This is the part of the RFC I'm looking at: |
Thanks for that. I brought up punycode because at some point, in order to be effective for a delivery mechanism, the domain name must be looked up in the DNS to find the IP address to connect to. As such, it must follow the requirements of a valid domain name, which I recall as being a subset of ASCII characters. Whether the email address is valid for the envelope is a separate issue, and I'd like to find a more definitive specification than "possibly using encoded words." |
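To make the DNS point above concrete, here is a minimal sketch of converting an internationalized domain to its ASCII (punycode) form before a lookup. It assumes a Node.js version that ships the built-in `url.domainToASCII`; that function is not part of this module, and the example domain is only illustrative.

```javascript
'use strict';

const url = require('url');

// url.domainToASCII is a Node.js built-in, not an isemail API.
// It returns the ASCII (punycode) serialization of a domain,
// or an empty string if the domain is invalid.
const asciiDomain = url.domainToASCII('espa\u00f1ol.com');

console.log(asciiDomain); // xn--espaol-zwa.com
```

The ASCII form is what would be handed to a DNS resolver; the Unicode form is what the user typed.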
Ah, ok, I'm with you. I'll keep digging through the RFC(s), but you may be right that punycoding the domain would be a useful feature. |
Yep. I have this dream (predicated on having free time) that I'll transform this library into a general-purpose email address validation, normalization and transformation library. There are parts of nodemailer that seem redundant wrt email address parsing and the subsequent paths of validation and delivery. I also seem to recall that there was an RFC that proposed specific encodings for unicode compatibility for email addresses, which was maybe deprecated in favor of UTF-8? Not sure. |
Oh, cool! I've only been able to find UTF-8 support in the RFCs I've crawled through so far (6530 being the main reference). |
To that end, it might just make sense to pull in a parser generator (or maybe PEG if it's sufficiently powerful) which supports the right Unicode features. Then this entire library would basically be generating the appropriate errors. Not entirely sure that error handling/recovery in existing parser generators is sufficient, though - the existing implementation does its best to output usable error codes with the hope that they might let users know what they did wrong. |
Gotcha. Maybe that idea was set aside in favor of the better supported existing UTF-8 encoding. |
So how would you feel about splitting this out into 2 bodies of work? One to allow for simple validation of emails containing non-ascii unicode characters, and one for building out the punycode normalizing function? That way the basic email validation is useable sooner rather than later, and the long-term goal of the project doesn't get dropped off? |
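As a rough illustration of the first body of work (simple validation that tolerates non-ASCII characters), the sketch below checks a dot-atom local part. It assumes the RFC 6532 extension that lets any non-ASCII UTF-8 character appear wherever RFC 5322 `atext` is allowed. The helper name `isInternationalDotAtom` is hypothetical and is not how this module is implemented.

```javascript
'use strict';

// ASCII atext as listed in RFC 5322: letters, digits and a fixed set of symbols.
const ASCII_ATEXT = /^[A-Za-z0-9!#$%&'*+\/=?^_`{|}~-]$/;

// Hypothetical helper, for illustration only.
// Returns true when every dot-separated atom is non-empty and consists of
// ASCII atext characters or characters above U+007F.
function isInternationalDotAtom(localPart) {
    return localPart.split('.').every((atom) => {
        if (atom.length === 0) {
            return false; // rejects empty input and leading, trailing or doubled dots
        }

        // Array.from iterates by code point, so astral characters stay intact.
        return Array.from(atom).every((ch) => {
            return ch.codePointAt(0) > 0x7f || ASCII_ATEXT.test(ch);
        });
    });
}

console.log(isInternationalDotAtom('jos\u00e9.garcia')); // true
console.log(isInternationalDotAtom('john..doe'));        // false
```

This deliberately ignores quoted strings, comments and length limits, which a real validator also has to handle.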
Sounds good to me. Feel free to submit a pull request based on your findings (bonus points if it references specific sections of RFCs, though it sounds like some of the references are simply that there aren't many specific things to reference). I'd like it to pass #18, and not break existing tests. I'd also like you to run the PR against the code style checker for the hapijs org using lab (pretty sure that's just included in One final thing to consider: I know UTF-8 has support for multibyte encoding of e.g. null characters. Pretty sure it's a non-issue for this, but just double-check that it won't have unintended impacts. If you don't get around to this, I'll try to block out some time in the next day or so to take care of it. |
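On the null-character concern: the bytes `0xC0 0x80` are the overlong two-byte form of U+0000 and are not valid UTF-8. A quick way to double-check how input decoded by Node behaves is sketched below; it assumes the address arrives as raw bytes and is decoded with `Buffer#toString`, which may not match how callers of this module actually obtain their strings.

```javascript
'use strict';

// 'a', then the overlong encoding of U+0000 (0xC0 0x80), then 'b'.
const overlongNul = Buffer.from([0x61, 0xc0, 0x80, 0x62]);

// Node's UTF-8 decoder substitutes U+FFFD for invalid sequences
// instead of producing a real null character.
const decoded = overlongNul.toString('utf8');

console.log(decoded.includes('\u0000')); // false
console.log(decoded.includes('\ufffd')); // true
```

If that holds for the decoding path in use, an overlong null cannot reach the validator as U+0000; it shows up as a replacement character instead.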
Sounds great, thanks @skeggse. I'll see what we can pull out this afternoon/tomorrow AM. :) |
RE: comments on #18
I think it would be sufficient to add two tests:
I'd like the latter test to be separate because I'd prefer to avoid mocking/mimicking separately defined functionality. It's not the worst, but I'd prefer to find a canonical example i18n domain we can rely on, and use that in the normal test process. EDIT: maybe you could also pull in the other test example email addresses from #19 |
Relevant: nodejs/node#11218 |
So, here's the thing... Domain punycode is necessary, and the Node punycode API is being deprecated in favor of the community punycode.js library. nodejs/node#11218 As I move forward with this work I'm going to try out rewire and punycode.js. |
@skeggse looks like using I know that is technically against Hapi code standards since |
Ah, interesting. I'm tempted to refactor the API, and upgrade isemail to a new major rev, without support for I think using |
Thanks! I'll use Just as a side note, I would support a major rev that drops |
@Marsup I know joi uses this package - do you think I should release this as |
I am not worried about that "breaking change". I am worried about the new external dependency. |
Hm, granted. Alternative idea: I want to stop supporting the dns resolution check. As a result, we wouldn't need punycode.js, and would be able to simplify the API. On that note, @WesTyler: I just realized that while we're checking the number of octets it takes to represent the domain, that check doesn't account for the extra four bytes, or for the actual encoding scheme used by punycode. The simplest solution there is to just use punycode.js, but maybe we can get it to work without punycode (and then could get rid of the external dependency altogether). |
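To show why the octet count matters, here is a small sketch comparing the UTF-8 length of a domain with the length of its punycode form. It assumes the "extra four bytes" are the `xn--` prefix added to each encoded label, and it uses Node's built-in `url.domainToASCII` rather than punycode.js. The helper `domainLengths` is hypothetical.

```javascript
'use strict';

const url = require('url');

// Hypothetical helper, for illustration only.
// Reports how many octets a domain occupies as UTF-8 and as punycode.
function domainLengths(domain) {
    const ascii = url.domainToASCII(domain); // '' if the domain is invalid

    return {
        utf8Octets: Buffer.byteLength(domain, 'utf8'),
        asciiOctets: ascii.length // every character is one octet in ASCII
    };
}

console.log(domainLengths('espa\u00f1ol.com'));
// { utf8Octets: 12, asciiOctets: 18 }
```

The two counts differ, so a length check done on the UTF-8 form can accept a domain whose punycode form exceeds the DNS limits (63 octets per label, 255 per name).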
FWIW, this module is the one advised by node core itself, so I'd assume this is a safe dependency. |
It is still an external dep that I need to review every time they change something. If there is a way not to require it, that would be preferred. |
@skeggse so are you proposing removing the normalization, or rolling out an internal normalization implementation...? |
No, the normalization is here to stay, and it doesn't handle punycode anyway. At the moment, it seems that dns checking is superfluous to the goals of this module, and getting rid of it would also mean we wouldn't need the output of punycode. We'd need to figure out how to calculate what a punycoded domain's length would be, though. Down the line, though, I'd like to have this module provide parsing and normalization (to avoid duplicate processing of email addresses), which would necessitate punycode. So not entirely sure that's the right way to go, but it would solve the problem temporarily. |
Oh, my bad, sorry. I forgot that the punycoding and normalization were 2 different steps for 2 different problems XD. I agree with the judgement that dns checking (and therefore punycoding) is likely superfluous for this module right now. Is it worth opening up a new issue to isolate that effort from the Unicode work in this ticket and the merged PR? If so I can open it up and pull the info from above into it. I can probably carve out some time this afternoon or tomorrow morning to work on it if nobody else gets to it first. |
Yeah that sgtm |
See also buildmail:lib/buildmain.js line 890, and RFC 6530.