Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add joining-group sub-command #32

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

wezm
Copy link
Contributor

@wezm wezm commented Apr 24, 2020

This PR adds a joining-group sub-command that exposes the field of the same name from ArabicShaping.txt.

There is an initial commit that changes the joining-group field on ArabicShaping from a String to and enum.

This field is a little weird in that it's stored in ArabicShaping.txt in a different form to the "formal" form in PropertyValueAliases.txt. As a result there is a function to convert from the input ArabicShaping form to the PropertyValueAliases form. The enum and associated methods are implmented in terms of the PropertyValueAliases form.

For example in this row:

0629; TEH MARBUTA; R; TEH MARBUTA

TEH MARBUTA (last field) corresponds to Joining_Group Teh_Marbuta.

Sample output:

// DO NOT EDIT THIS FILE. IT WAS AUTOMATICALLY GENERATED BY:
//
//   ucd-generate joining-group --rust-enum /home/wmoore/Downloads/ucd-13.0
//
// Unicode version: 13.0.0.
//
// ucd-generate 0.2.8 is available on crates.io.

#[derive(Clone, Copy, Debug, Hash, Eq, PartialEq)]
pub enum JoiningGroup {
  AfricanFeh, AfricanNoon, AfricanQaf, Ain, Alaph, Alef, Beh, Beth,
  BurushaskiYehBarree, Dal, DalathRish, E, FarsiYeh, Fe, Feh, FinalSemkath,
  Gaf, Gamal, Hah, HamzaOnHehGoal, HanifiRohingyaKinnaYa, HanifiRohingyaPa,
  He, Heh, HehGoal, Heth, Kaf, Kaph, Khaph, KnottedHeh, Lam, Lamadh,
  MalayalamBha, MalayalamJa, MalayalamLla, MalayalamLlla, MalayalamNga,
  MalayalamNna, MalayalamNnna, MalayalamNya, MalayalamRa, MalayalamSsa,
  MalayalamTta, ManichaeanAleph, ManichaeanAyin, ManichaeanBeth,
  ManichaeanDaleth, ManichaeanDhamedh, ManichaeanFive, ManichaeanGimel,
  ManichaeanHeth, ManichaeanHundred, ManichaeanKaph, ManichaeanLamedh,
  ManichaeanMem, ManichaeanNun, ManichaeanOne, ManichaeanPe, ManichaeanQoph,
  ManichaeanResh, ManichaeanSadhe, ManichaeanSamekh, ManichaeanTaw,
  ManichaeanTen, ManichaeanTeth, ManichaeanThamedh, ManichaeanTwenty,
  ManichaeanWaw, ManichaeanYodh, ManichaeanZayin, Meem, Mim, NoJoiningGroup,
  Noon, Nun, Nya, Pe, Qaf, Qaph, Reh, ReversedPe, RohingyaYeh, Sad, Sadhe,
  Seen, Semkath, Shin, StraightWaw, SwashKaf, SyriacWaw, Tah, Taw, TehMarbuta,
  Teth, Waw, Yeh, YehBarree, YehWithTail, Yudh, YudhHe, Zain, Zhain,
}

pub const JOINING_GROUP: &'static [(u32, u32, JoiningGroup)] = &[
  (0, 1567, JoiningGroup::NoJoiningGroup), (1568, 1568, JoiningGroup::Yeh),
  (1569, 1569, JoiningGroup::NoJoiningGroup),
  (1570, 1571, JoiningGroup::Alef), (1572, 1572, JoiningGroup::Waw),
  (1573, 1573, JoiningGroup::Alef), (1574, 1574, JoiningGroup::Yeh),
  (1575, 1575, JoiningGroup::Alef), (1576, 1576, JoiningGroup::Beh),
  (1577, 1577, JoiningGroup::TehMarbuta), (1578, 1579, JoiningGroup::Beh),(68900, 1114111, JoiningGroup::NoJoiningGroup),
];

@@ -24,7 +24,114 @@ pub struct ArabicShaping {
/// The "joining type" of this codepoint.
pub joining_type: JoiningType,
/// The "joining group" of this codepoint.
pub joining_group: String,
pub joining_group: JoiningGroup,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. Breaking changes should be called out in the PR description.

Could we please back this out? I'd really prefer not to release a new semver incompatible version of ucd-parse at this point. I could, but I would prefer such things be deliberate instead of a reactive "oh this PR came in with a breaking change so let's just put out an incompatible release." That's how churn happens. :-(

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative here is to add a new field that is an enum, and leave the existing string field. Maybe joining_group2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking changes should be called out in the PR description.

Sorry I didn't think about the impact of the change when doing it so didn't think to call it out in the PR.


/// The Joining_Group field read from ArabicShaping.txt
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum JoiningGroup {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking at the values, it kind of feels like this should be marked as #[non_exhaustive]. I don't really have any confidence that this list won't expand in future versions of Unicode.

Honestly might be a good reason to just drop the enum altogether. Why did you make this an enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly might be a good reason to just drop the enum altogether. Why did you make this an enum?

The addition of the joining-group sub-command was done to as part of our work to add support for Syriac shaping to Allsorts. We need to match on the joining group to do this:

https://github.com/yeslogic/allsorts/blob/ade4ef90dddecf63370d78b5ada58439e910faaf/src/scripts/syriac.rs

It was changed to an enum to be consistent with joining_type, which has been used just prior to implement Arabic shaping https://github.com/yeslogic/allsorts/blob/ade4ef90dddecf63370d78b5ada58439e910faaf/src/scripts/arabic.rs#L33-L67

I guess in the context of doing one then the other, it seems like an oversight that one was an enum and the other a string. In hindsight I should have realised this was a breaking change.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we add joining_group2 that is the enum and leave joining_group as a String? Then add a note to the (public) docs of joining_group that it will be dropped in the next semver bump of the crate.

What about the non_exhaustive bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we add joining_group2 that is the enum and leave joining_group as a String? Then add a note to the (public) docs of joining_group that it will be dropped in the next semver bump of the crate.

Yes that sounds fine.

What about the non_exhaustive bit?

Giving this some thought I'm a bit on the fence. non_exhaustive allows backwards compatibility when new variants are added but I'm not sure this is desirable in this this case. I.e. maybe the default behaviour of reporting unhandled variants is desirable so that you get a prompt to handle the new cases versus having them silently hit a catch all case. The scenario where new variants are added coincides with new Unicode releases so perhaps it's reasonable to require updating code when adding support for newer Unicode releases. Then again if you're just a downstream consumer of some unicode related thing maybe you don't care enough about those details and would just like your code to keep compiling... 🤔

Perhaps it's worth noting that none of the other enums have it so far either and they also see new additions. E.g. I think it's beneficial that the addition of new scripts (which happens in most Unicode releases) prompts code to handle them.

Anyway I don't think I have strong enough feelings to advocate one way or another. Since you have more experience maintaining popular crates I'm happy to defer to what you think is best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants