Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept A-labels #31

Closed
jsha opened this issue Oct 24, 2016 · 23 comments
Closed

Accept A-labels #31

jsha opened this issue Oct 24, 2016 · 23 comments
Assignees

Comments

@jsha
Copy link
Contributor

jsha commented Oct 24, 2016

In letsencrypt/boulder#2278 we have an issue where TLDs that are IDNs are not recognized by Boulder as ending in a public suffix. Boulder currently calls publicsuffix.DefaultList.Find with the A-label form of the domain.

We have a few options:

  1. Boulder could convert to U-label for querying publicsuffix
  2. publicsuffix-go could switch to looking up by A-label by default
  3. publicsuffix-go could offer a different set of methods for looking up by A-label.

I think (2) is my ideal solution. What do you think?

@pfigel
Copy link

pfigel commented Oct 24, 2016

FWIW, Go's x/net/publicsuffix accepts both U- and A-labels, but does not work accurately for suffixes with multiple U-labels:

EffectiveTLDPlusOne("foo.example.xn--o1ach.xn--90a3ac") == "example.xn--o1ach.xn--90a3ac"
EffectiveTLDPlusOne("foo.example.упр.срб") == "упр.срб"

I don't think that's quite intentional, probably just a side-effect of how it's been implemented (and should probably be made more consistent, or at least documented).

Anyway, if compatibility with x/net/publicsuffix is a goal, the easiest way to replicate this behaviour would probably be to run idna.ToASCII when generating the rules (which is pretty much what x/net/publicsuffix does).

@weppos
Copy link
Owner

weppos commented Oct 24, 2016

Thanks for the ticket @jsha @PatF

Historically, the PSL never included Punycode entries. Instead, only Unicode values are stored (althought I frankly don't recall by heard the reason of the decision, assuming there was one).

So far, I've refrained from adding U/A conversion in most of my libraries, primarily because of the extra-overhead (and lack of support for good IDNA libraries).

However, Go provides a pretty good IDNA library and I may reconsider providing native support for both A/U directly in the library.

If I understood correctly, Let's Encrypt is currently storing and handling names as A labels. Is it correct?

@jsha I'm not thrilled by 2), mostly because it's exactly the opposite behavior of what the PSL is doing. 3) is a potentially solution, although I see a value in (potentially) not having to determine whether the input is A or U (and leave it up to the library). Of course, assuming the overhead is reasonable.

On other words, I'm inclined to consider having:

  • a default version that works with U as today
  • a potential helper methods that performs the conversion and/or
  • a single convenient method (in addition to Find) that will automagically determine if A/U and deal with that

What do you think?

@weppos
Copy link
Owner

weppos commented Oct 24, 2016

Anyway, if compatibility with x/net/publicsuffix is a goal, the easiest way to replicate this behaviour would probably be to run idna.ToASCII when generating the rules (which is pretty much what x/net/publicsuffix does).

It is, as long as it's not a bug. That seems more a bug than a feature to me. I have to investigate it, but if that's the case I will likely not support the buggy version.

In fact, the x/net/publicsuffix comes essentially as an handy transition mechanism (and I bet most of the tools transitioning are not fully supporting IDN).

@jsha
Copy link
Contributor Author

jsha commented Oct 25, 2016

Historically, the PSL never included Punycode entries. Instead, only Unicode values are stored (althought I frankly don't recall by heard the reason of the decision, assuming there was one).

I was aware of this, but thinking about it some more it's surprising to me. The IDN RFCs indicate that U-labels are intended for presentation, while A-labels should be used for under-the-hood processing. To me, this feels like a historic accident. But perhaps it was for a good reason!

If I understood correctly, Let's Encrypt is currently storing and handling names as A labels. Is it correct?

Correct.

I'm inclined to consider having:
a default version that works with U as today
a potential helper methods that performs the conversion and/or
a single convenient method (in addition to Find) that will automagically determine if A/U and deal with that

I'm in favor of (1) and (2). (3) seems unnecessarily risky. Callers should know whether they are handling A or U, and choose the appropriate function.

@weppos
Copy link
Owner

weppos commented Oct 25, 2016

I was aware of this, but thinking about it some more it's surprising to me. The IDN RFCs indicate that U-labels are intended for presentation, while A-labels should be used for under-the-hood processing. To me, this feels like a historic accident. But perhaps it was for a good reason!

That's a reasonable feedback, and in fact I must say that we do the same at DNSimple and in pretty much most of the environments where I worked with names.

In the light of that, switching to A name by default may be logical, although the list is Unicode. It's also true that the internal list representation is irrelevant, as I pre-process it in any case.

I need to take a couple of days to experiment a little bit on which internal representation may be more appropriate. I'd also have a quick check with @sleevi and @gerv (which have an extensive experience on how Chromium/Firefox browsers handles the A/U transformation) to see if they have some specific hint (in addition to a possible explanation of why we store Unicode values).

@jsha I see the issue is currently causing you a bug in the rate-limiting. Is it if I take a few days to research the issue, or do you need a super quick fix? In case, you can merge letsencrypt/boulder#2278 and I'll happily provide a patch once publicsuffix-go is updated. Does it work for you?

@jsha
Copy link
Contributor Author

jsha commented Oct 25, 2016

I'm willing to a wait a few days while you research the issue. The rate limiting issue makes things too strict rather than too lax, and I haven't heard from people who are having problems with it. Thanks!

@gerv
Copy link

gerv commented Oct 26, 2016

Why we chose U-label as the default (the A-label is in a comment) is lost in the mists of history, but perhaps it was because we thought that was the "proper" version of the name, and the A-label is an implementation detail. Anyway, as there are lossless conversions between the two, it doesn't seem like a big deal.

The two namespaces also don't overlap - it's not permitted to have a U-label of any sort, even an ASCII-only one, with hyphens in the third and fourth position. So it would be entirely reasonable (I'd say) for a smart API to take either, work out which, do the lookup, and return the results in the form submitted.

But if you don't want to do that, any of these other fixes are fine. But I don't think we'll be changing the PSL.

Gerv

@sleevi
Copy link

sleevi commented Oct 26, 2016

On Wednesday, October 26, 2016, Gervase Markham notifications@github.com
wrote:

Why we chose U-label as the default (the A-label is in a comment) is lost
in the mists of history, but perhaps it was because we thought that was the
"proper" version of the name, and the A-label is an implementation detail.
Anyway, as there are lossless conversions between the two, it doesn't seem
like a big deal.

IDNA2003 vs IDNA2008 was the concern at the time, and the fact that a given
name can, in theory, have two separate representations. At the time,
Mozilla was trying to keep its flexibility - at least based on past code I
can tell from around that time.

I haven't heard any substantial movement to IDNA2008 for quite some time.
Perhaps it's not happening? Or did it already happen and I missed it?

The two namespaces also don't overlap - it's not permitted to have a
U-label of any sort, even an ASCII-only one, with hyphens in the third and
fourth position. So it would be entirely reasonable (I'd say) for a smart
API to take either, work out which, do the lookup, and return the results
in the form submitted.

But if you don't want to do that, any of these other fixes are fine. But I
don't think we'll be changing the PSL.

Gerv


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#31 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABayJ6W9p1v4YdBIJYmYUJ2pd-u4DIzBks5q32cfgaJpZM4KfIXf
.

@gerv
Copy link

gerv commented Oct 26, 2016

We switched 8 months ago, in Firefox 45:
https://bugzilla.mozilla.org/show_bug.cgi?id=1218179

Feel free to join us :-)
https://bugs.chromium.org/p/chromium/issues/detail?id=505262

@jsha
Copy link
Contributor Author

jsha commented Oct 26, 2016

FWIW, the current ACME spec requires IDNA2008: https://ietf-wg-acme.github.io/acme/#rfc.section.6.1.4.

@sleevi
Copy link

sleevi commented Oct 26, 2016

@gerv I'll poke jshin on that, but I think it would at least highlight a 'possibility' where the U-label form is valuable - to allow consistent decomposition to the A-label in both IDNA2003 and IDNA2008 apps. Of course, I wish we'd just go with IDNA2008.

Perhaps an interim solution is convert to A-label across the board (which will certainly break Chrome's PSL implementation, and presumably Firefox, but we can fix both of those easily) in IDNA2008, and then perhaps add a consistency check as part of the PSL's tests to make sure the IDNA2003 and 2008 representations are consistent, until the PSL full throws IDNA2003 under the bus?

@gerv
Copy link

gerv commented Oct 27, 2016

Converting to A-label would presumably break everyone's implementation? What would be the advantages of doing that?

@sleevi
Copy link

sleevi commented Oct 27, 2016

@gerv - As jsha@ mentioned, it would allow applications to deal only with A-Labels when dealing with the PSL, without having to worry about the IDNA transformation. I certainly know that some of our mobile applications that don't support U-Labels don't want to carry the weight of an IDNA translation, to keep download size small.

@weppos
Copy link
Owner

weppos commented Oct 28, 2016

@sleevi @gerv I agree that changing the list now may potentially break implementation. In fact, we can even consider providing 2 alternate sources if we decide to go with the transition, and automatically convert one into another.

It's also worth to mention that consumers can perform the translation on their own, pre-processing the file.

In fact, my initial research goal is to understand which representation is better for the specific case of this library. If it happens to be A-labels, I'll be happy to add an extra step in the pre-processing that converts it into A-label strings.
https://github.com/weppos/publicsuffix-go/blob/master/cmd/gen/gen.go#L60

I can check directly, but you may save me some time. Given that Chrome/Firefox are pre-processing the list, do you know if the data is stored in A-labels or U-labels?

@gerv
Copy link

gerv commented Oct 28, 2016

weppos: if we are going to change the canonical representation, we should roll this change in with all the other format changes we want to make. Perhaps we need a wiki page to track them all? :-)

Firefox stores them as A-labels:
http://hg.mozilla.org/mozilla-central/file/default/netwerk/dns/prepare_tlds.py

@sleevi
Copy link

sleevi commented Oct 28, 2016 via email

@weppos
Copy link
Owner

weppos commented Oct 28, 2016

Perhaps we need a wiki page to track them all? :-)

That sounds like a plan @gerv. I'll start it, and post on the new PSL ml for discussion.

So it seems to me that A-labels seems really to be the best representation here. So I'm actually more convinced about pre-process the list into A-names, and switch it to use A-names by default as originally suggested by @jsha

Once done, I may provide U-label helpers.

Thanks for joining the discussion @sleevi and @gerv

@weppos weppos self-assigned this Oct 28, 2016
@kachkaev
Copy link

kachkaev commented Nov 4, 2016

Any updates or a roadmap on this by a chance? Just wondering as it seems that this issue makes it impossible to obtain let's encrypt certificates for certain IDN domains. Seems like thousands of users may be affected: letsencrypt/boulder#2277

If anyone knows a workaround, could you please share?

@weppos
Copy link
Owner

weppos commented Nov 7, 2016

@kachkaev I've spent this week finally releasing a major version of my whois library that has been under development for over 1 year. I honestly didn't have a lot of time to work on a patch.

I'll try to work on it this week. Unless any big objection, it's likely I'll switch to A-labels by default. However, it's not a 10-minutes change as I have to put in place a procedure to pre-process the definition, as I definitely don't want to do it at runtime.

I have a script that pulls the PSL every day and auto-submit PR for changes, and that needs to be changed as well.

I'll post updates here, and I'll make sure to share the code in a branch as soon as ready, so that it's possible to start testing it against letsencrypt/boulder#2277 and letsencrypt/boulder#2278

@pzb
Copy link

pzb commented Nov 7, 2016

I agree that A-labels are the way to go. Everytime I use one of @weppos' great libraries, I end up adding a check for the xn-- substring in the calling function and doing conversion there and then converting back. This is because the PSL is inherently tied to DNS, which uses A-labels. There is no need to guess how to encode 手表; if it s a TLD is it always xn--kpu716f.

As for IDNA2003 vs IDNA2008, isn't this solved by using A-labels? From what I can tell, the A-label to U-label conversion is the same for both, but the reverse may vary. If correct, then the only question is whether the IDNA2003 and IDNA2008 algorithms have different output for entries on the list, and the only answer the matters is what is in DNS.

@slavonnet
Copy link

example "рф". National russian domain!
https://xn----itbzecmx.xn--p1ai/

weppos added a commit that referenced this issue Nov 20, 2016
weppos added a commit that referenced this issue Nov 21, 2016
* Add test cases for GH-31

* Switch the list internal representation to A-labels by default

* Pre-process the list to A-labels

* Enforce official PSL test cases to ASCII

* Add note about A-labels in the README
@weppos
Copy link
Owner

weppos commented Nov 21, 2016

I want to personally thank everyone who provided feedback in this discussion, especially the folks at Let's Encrypt for reporting the issue, and @gerv @sleevi @pzb for providing their personal experience.

The issue is fixed by #40. The library now defaults to A-labels, and pre-process the PSL to make sure it is consistent with the expected internal representation.

New autopull requests will automatically provide the A-label form of any upstream PSL update.

@weppos
Copy link
Owner

weppos commented Nov 21, 2016

FYI @pzb I'll see if I can replicate the same work on the publicsuffix-ruby lib.

@weppos weppos closed this as completed Nov 21, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants