
Should PartialEq<str> for Locale be strict or lenient? #1709

Closed
sffc opened this issue Mar 17, 2022 · 18 comments · Fixed by #2020
Assignees
Labels
C-locale Component: Locale identifiers, BCP47 S-small Size: One afternoon (small bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Member

sffc commented Mar 17, 2022

Background: #1704

@sffc sffc added C-locale Component: Locale identifiers, BCP47 discuss Discuss at a future ICU4X-SC meeting v1 labels Mar 17, 2022
@zbraniecki
Member

I believe for 1.0 we should have a mental model of how our APIs communicate different strictness levels of comparison, and what the role of traits such as Eq and PartialEq is in this.

For example, we could say that Eq expects case normalization and `-` separators, while PartialEq would return true for en-US == eN_us.

We should also consider incomplete matching - en-US vs en-Latn-US or en-US-u-ca-buddhist vs en-US.
In the language-matching API for l10n needs, I added a matches method, which allows treating missing fields on either side as a "catch-all" or "*".
You can read through the steps here to see how it's used: https://github.com/projectfluent/fluent-langneg-rs/blob/master/src/negotiate/mod.rs#L54

The case-normalized vs. unnormalized API design question mostly concerns comparisons against str, but the treatment of missing fields applies to typed comparisons as well. (Should en-US-x-foo match en-US-x-bar?)
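The two strictness levels under discussion can be sketched with a stdlib-only example (`lenient_eq` is a hypothetical helper for illustration, not an ICU4X API):

```rust
/// Hypothetical helper: lenient equality that ignores ASCII case and
/// treats '-' and '_' as the same separator.
fn lenient_eq(a: &str, b: &str) -> bool {
    let norm = |s: &str| -> String {
        s.chars()
            .map(|c| if c == '_' { '-' } else { c.to_ascii_lowercase() })
            .collect()
    };
    norm(a) == norm(b)
}

fn main() {
    // Strict comparison: only byte-identical strings are equal.
    assert!("en-US" != "eN_us");
    // Lenient comparison: case and separator differences are forgiven.
    assert!(lenient_eq("en-US", "eN_us"));
    assert!(!lenient_eq("en-US", "en-GB"));
}
```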

@zbraniecki
Member

To recap: I don't think we need to implement all of them, but I think before 1.0 we should have a good direction on which API compares how, so that as the needs arise we can implement the particular methods/traits consistently, in the shape we defined and set prior to 1.0.

@sffc
Member Author

sffc commented Mar 19, 2022

A few considerations:

  1. My reading of the docs is that PartialEq and PartialOrd must be consistent, and if Eq/Ord is implemented, their behavior must be the same as the PartialEq/PartialOrd behavior.
  2. PartialEq and PartialOrd can be implemented across types, but Eq and Ord are only defined on Rhs = Self.

I tend to have a fairly strong preference that the Self comparisons be strict in all cases where possible, because then we can implement Eq and Ord which enables the use of various Rust data structures.

Across types, my preference is not as strong, but I tend to prefer strict comparison where possible. I think we should offer lenient comparison, but I don't think it's what users should get by default if they don't know better. In most API design cases I've come across, "strict by default" usually ends up being the chosen option.
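The data-structure point can be illustrated with plain string literals standing in for canonical locale tags (a sketch, not ICU4X code): a total order is exactly what BTreeMap and slice::binary_search require.

```rust
use std::collections::BTreeMap;

fn main() {
    // Ord lets canonical locale strings key ordered collections...
    let mut map: BTreeMap<&str, u32> = BTreeMap::new();
    map.insert("en-US", 1);
    map.insert("de-DE", 2);
    assert_eq!(map.get("en-US"), Some(&1));

    // ...and support binary search over a sorted slice. Note that a
    // non-canonical spelling misses under strict byte-wise comparison:
    let locales = ["de-DE", "en-US", "fr-FR"]; // sorted
    assert!(locales.binary_search(&"en-US").is_ok());
    assert!(locales.binary_search(&"en_US").is_err());
}
```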

@zbraniecki
Member

When you say strict, I assume you mean case-strict. That's reasonable to me, since Locale::from("en-US") == Locale::from("eN_us") will result in the same canonicalized locale.

Across types, in particular for comparing a locale against a string, I think we should be lenient (diverging from your position): Locale::from("en-US") == "eN_uS" should return true, because most users don't care about canonicalization and believe that differently cased strings still mean "American English".

On the other axis (how to indicate that und-US and en-US should "match"), I lean toward special methods like match/matches.

@sffc
Member Author

sffc commented Mar 19, 2022

When you say strict I assume you mean case strict.

What I mean is a comparison that satisfies the properties of a "total order", which implies case sensitive, yes.

How would you feel about just not implementing PartialEq across types and then having two methods users can choose from to make the choice between strict and lenient more explicit?

Here are four reasons I tend to prefer strict comparison by default:

  1. Can be used for binary search and other situations that require it
  2. More efficient and smaller code because we don't need special logic to massage the string into the correct form
  3. More well-defined; leniency can be on many levels, but there is only one way for strict to work. For example, leniency could make en-Latn-US equal any subset of the following depending on how lenient we make it: EN-LATN-US, en_Latn_US, en-US-Latn. On this topic, if we decided to change the leniency in the future, it may be a breaking change in behavior of PartialEq. It would be easier to be able to simply add another version of the lenient comparison function.
  4. What I said earlier about strict-by-default being less surprising.
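Reason 3 can be made concrete: even under one fixed case-plus-separator leniency (sketched here with a hypothetical normalize helper), some of those variants match and others do not, so changing the leniency level later would change observable behavior.

```rust
// Hypothetical normalization: lowercase ASCII, '_' folded to '-'.
fn normalize(s: &str) -> String {
    s.chars()
        .map(|c| if c == '_' { '-' } else { c.to_ascii_lowercase() })
        .collect()
}

fn main() {
    // Case and separator variants collapse to the same form...
    assert_eq!(normalize("EN-LATN-US"), normalize("en-Latn-US"));
    assert_eq!(normalize("en_Latn_US"), normalize("en-Latn-US"));
    // ...but reordered subtags do not, so a more lenient scheme that
    // also accepted "en-US-Latn" would behave differently.
    assert_ne!(normalize("en-US-Latn"), normalize("en-Latn-US"));
}
```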

@zbraniecki
Member

How would you feel about just not implementing PartialEq across types and then having two methods users can choose from to make the choice between strict and lenient more explicit?

I feel that if there's no "canonical" expectation, that makes sense - we're effectively kicking developers out of their comfort zone and asking them to explicitly decide.

Below, I'll argue against the assumption that we don't know what the canonical expectation is, though :)

Can be used for binary search and other situations that require it

I think I disagree.

// (pseudocode) a sorted list keyed by Locale, binary-searched with a raw string:
let loc_map: Vec<(Locale, SomeData)>; // sorted by locale

assert!(loc_map.binary_search_by(|(loc, _)| loc.to_string().as_str().cmp("en_US")).is_err());

I'd argue that if someone searches against a string, they expect en-US to match en_US. Simultaneously, I see no scenario where a user would pass a string and want it not to match.
Why would they? To add an en_US entry next to an en-US entry? To say "we have no data for en_US"?

If they really need to, they could check that their string is not canonical locale and reject it prior to matching.

More efficient and smaller code because we don't need special logic to massage the string into the correct form

Agreed. Strict vs. loose matching can be seen as a performance question. I'd argue that by default we should be lenient on input, and allow a "strict_cmp" for cases that need it.

More well-defined;

I don't think your examples are plausible. en-US-Latn is an invalid locale tag and will fail to parse. EN-LATN-US and en_Latn_US will both parse. I believe case-insensitivity plus -/_ interchangeability is the extent of the leniency, and it won't change.

What I said earlier about strict-by-default being less surprising.

I think that's a subjective argument. For one person it may be surprising to see loc!("en-US") == "en_US" succeed; for another it may be surprising to see it not match.

I think if we embrace "lenient on input, strict on output" then we should be lenient here. If we reject that, we should be strict and be very clear in the docs that any string used for comparison against any component of a locale has to be canonicalized beforehand for the matching to work.

We could also do any of the:

  • PartialEq<str> is strict, and component.matches(str) is lenient
  • PartialEq<str> is lenient and component.matches_strict(str) is strict

and then a separate matrix for the meaning of missing components (und-US matches en-US, en-US matches en-Latn-US, etc.)

@sffc
Member Author

sffc commented Mar 22, 2022

Can be used for binary search and other situations that require it

I think I disagree.

Rust binary_search is defined only for Ord types, and Ord is defined to mean types that have a "total order".

Is binary_search well-defined on non-total-order impls? If so, then why does the standard library require Ord instead of PartialOrd?

More efficient and smaller code because we don't need special logic to massage the string into the correct form

Agree. The strict vs loose matching can be seen as performance question. I'd argue that by default we should be lenient on input, and allow for "strict_cmp" for cases that need it.

Ok. My position, weakly held, is that the thing that is better for the ICU4X value proposition (performance, memory usage, or in this case code size) should be the default behavior.

One could make the argument that lenient comparison is better for the "i18n correctness" value proposition, an argument with which I disagree, since we should be expressly encouraging people to use BCP-47 language tags, rather than being lenient by default and allowing the old format to continue to exist and be supported in the wild.

More well-defined;

I don't think your examples are plausible. en-US-Latn is an incorrect locale tag and will reject parsing. EN-LATN-US and en_Latn_US will both parse. I believe case-insensitive + -/_ is the lenient and it won't change.

Does UTS-35 define exactly what sets of strings are considered valid but non-canonical?

I'd like to draw a parallel with the related issue of Date.parse() in ECMAScript. ECMA-262 has long said that Date.parse() has lenient parsing based on implementation-defined behavior. What has happened is that different browser engines have defined the lenient parsing differently, and there's now a big mishmash of what Date.parse() can and cannot do. The Temporal proposal resolves this issue by only parsing ISO 8601 strings.

What I said earlier about strict-by-default being less surprising.

I think it's a subjective argument. For one person it may be surprising to see loc!("en-US") == "en_US" for another it may be surprising to see it not match.

My position here is that if users require lenient behavior, they will discover fairly quickly through testing that the default impl is not lenient. However, if users require strict behavior, they may never find out that they are using the less efficient lenient method, since the lenient method works in all the cases where the strict method works. Strict-as-default is therefore less likely to cause surprise.

@zbraniecki
Member

ECMA-262 has long said that Date.parse() has lenient parsing based on implementation-defined behavior.

I see the problem in the fact that they let it be "implementation-defined". We would be more strict allowing only for case-insensitive and -/_.

However, if the users require strict behavior,

I struggle to imagine a case where a user would like the behavior of matching a Locale to a stringified version of it to be strict.

@sffc
Member Author

sffc commented Mar 22, 2022

I struggle to imagine a case where a user would like the behavior of matching a Locale to a stringified version of it to be strict.

If the user has a list of strings that they know are definitely BCP-47 canonical, and they don't want to carry the extra code size / performance cost of "canonicalization on the fly". I have such a case in the data provider.

@sffc
Member Author

sffc commented Mar 22, 2022

ECMA-262 has long said that Date.parse() has lenient parsing based on implementation-defined behavior.

I see the problem in the fact that they let it be "implementation-defined". We would be more strict allowing only for case-insensitive and -/_.

If we're clear that these are the only exceptions, then this satisfies argument 3. Argument 2 is weakened, except for my side note about the negative impact on the ecosystem of allowing the old stuff to propagate. Arguments 1 and 4 still stand.

@zbraniecki
Member

If the user has a list of strings that they know are definitely BCP-47 canonical, and they don't want to carry the extra code size / performance cost of "canonicalization on the fly". I have such a case in the data provider.

I wasn't explicit enough. I struggle to find a use case where a consumer of our library who is not very familiar with the nuances of canonicalization, and who wants to "just use" the Locale type, would use the convenience PartialEq trait and be disappointed that it matches en-US to en_US.

I imagine a special-case consumer who is very performance/code-size conscious and they would want to use matches_strict or something like that.

Arguments 1 and 4 still stand.

My responses still stand as well.

Anyhoo. I'm 0.2 on the Apache scale and would prefer not to invest much more time in bikeshedding, since we seem to have our arguments laid out and I doubt we are in a position to convince one another.

Maybe we can discuss it in icu4x bi-weekly, vote and move on.

@sffc sffc added this to the ICU4X 1.0 milestone Apr 1, 2022
@sffc sffc removed the v1 label Apr 1, 2022
@dminor dminor added discuss-priority Discuss at the next ICU4X meeting and removed discuss Discuss at a future ICU4X-SC meeting labels May 26, 2022
@dminor
Contributor

dminor commented May 26, 2022

Marking this as discuss-priority so we can come to a decision on this at the next meeting.

@sffc
Member Author

sffc commented May 26, 2022

Discussion:

  • @Manishearth - In general, PartialEq should not be doing magic stuff, so if it exists, it should be strict, but I'd prefer not having it for str.
  • @robertbastian - Having PartialEq<str> is confusing, anyway, because it's not clear what logic it uses
  • @nordzilla - Does it affect functions that assume you have an impl? I guess most of those let you pass a closure instead.

Proposal: Do not have PartialEq<str> and only have methods.

LGTM: @echeran @sffc @Manishearth @robertbastian @dminor

@sffc
Member Author

sffc commented May 26, 2022

Naming for the two functions:

  • cmp_bytes is currently a strict comparison against &[u8].

New names:

  • normalizing_eq(&str) -> bool
  • strict_cmp(&[u8]) -> Ordering

LGTM: @QnnOkabayashi @Manishearth @robertbastian @nordzilla @zbraniecki

In the future we could add:

  • normalizing_cmp(&str) -> Ordering
  • strict_eq(&[u8]) -> bool
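A toy sketch of how the two chosen methods could behave (a stand-in struct for illustration, not the real icu_locid::Locale; the leniency rules follow the case-insensitive plus -/_ scope discussed above):

```rust
use std::cmp::Ordering;

/// Stand-in for illustration; assumes the stored string is already
/// the canonical BCP-47 form, e.g. "en-US".
struct Locale(String);

impl Locale {
    /// Strict, byte-wise total order against a byte slice.
    fn strict_cmp(&self, other: &[u8]) -> Ordering {
        self.0.as_bytes().cmp(other)
    }

    /// Lenient equality: ASCII-case-insensitive, '-' and '_' equivalent.
    fn normalizing_eq(&self, other: &str) -> bool {
        let norm = |s: &str| -> Vec<u8> {
            s.bytes()
                .map(|b| if b == b'_' { b'-' } else { b.to_ascii_lowercase() })
                .collect()
        };
        norm(&self.0) == norm(other)
    }
}

fn main() {
    let loc = Locale("en-US".to_string());
    assert_eq!(loc.strict_cmp(b"en-US"), Ordering::Equal);
    assert_ne!(loc.strict_cmp(b"eN_us"), Ordering::Equal);
    assert!(loc.normalizing_eq("eN_us"));
    assert!(!loc.normalizing_eq("en-GB"));
}
```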

@sffc sffc added T-core Type: Required functionality S-small Size: One afternoon (small bug fix or enhancement) and removed discuss-priority Discuss at the next ICU4X meeting labels May 26, 2022
@robertbastian
Member

Should we also remove PartialEq<str> from the subtags? LocaleCanonicalizer does a lot of subtag/str comparisons and I think it'd be better to make it explicit whether they're strict or normalizing.

@sffc
Member Author

sffc commented Jun 8, 2022

I believe so, yes. I expect that the use cases in LocaleCanonicalizer might benefit anyway from using &[u8] rather than str, for additional deserialization performance gains.

We should use strict comparison for data coming from the DataProvider, because we expect the data to have been pre-normalized in datagen.

@robertbastian
Member

LocaleCanonicalizer's data is mainly unvalidated TinyStrs, and its inputs are valid Locales, but somehow the logic ends up comparing &strs internally.

@sffc
Member Author

sffc commented Jun 8, 2022

Oh right, it uses TinyStr because the subtags might not all be the same length. You can get the &[u8] from the TinyStr.
