Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Organize properties code to be more like the other crates (get rid of free functions) #3573

Closed
Tracked by #2856
Manishearth opened this issue Jun 22, 2023 · 9 comments
Assignees
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement)

Comments

@Manishearth
Copy link
Member

Manishearth commented Jun 22, 2023

This is probably a 2.0 change (#2856)

The properties code uses a lot of free functions for loading property maps and such, rather than using constructors on CodePointSetData and CodePointMapData as would be tradition in the other crates.

This makes things a bit awkward. We should consider this design again for 2.0.

Some open questions:

  • How do we deal with the fact that CodePointMapData is generic, and that Rust users will be using CodePointMapData<GeneralCategory> (etc) whereas FFI uses an erased version?
  • How do we expose default baked data in this regime?

To-do items:

  • icu_properties::exemplar_chars fns should not need "exemplar" in the name
  • The mapper functions should be named load_ instead of get_

cc @robertbastian @sffc

Low-priority discussion since it's likely a 2.0 issue.

@Manishearth Manishearth added the discuss Discuss at a future ICU4X-SC meeting label Jun 22, 2023
@Manishearth Manishearth added this to the ICU4X 2.0 milestone Jun 22, 2023
This was referenced Jun 22, 2023
@sffc
Copy link
Member

sffc commented Jun 22, 2023

Discuss with:

Optional:

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 22, 2023
@Manishearth
Copy link
Member Author

Proposal:

CodePointSetData/etc get const fn propname() and fn propname_unstable() with an option to add buffer/any as needed in the future.

All of these are added on CodePointSetData even if they return CPSDataBorrowed

We can alternatively or additionally add: const PROPNAME: &CPSDataBorrowed

@Manishearth
Copy link
Member Author

Manishearth commented Mar 11, 2024

March 11, 2024

(manish and robert talk about FFI)

  • @sffc: i think properties are weird so it's ok for them to be weird in how they are called, so "uniformity across components" isn't sufficient for me
  • @Manishearth: i think the free functions are weird in rust and discoverability.
  • @robertbastian: at the moment we group the free functions in modules, constructors can be naturally grouped in structs. This helps discoverability
  • @echeran: want to be clear about the problem we're solving. I don't think being weird in Rust is a technical problem. I think it's okay for properties to be different for the domain. It's not a formatter; it's not a struct being passed to the formatter.
  • @Manishearth: A pile of free functions is hard to find, bad documentation, etc.
  • @echeran: If you create a bunch of statics, does that have different implications?
  • @Manishearth: It's a difference between a function and a raw static.
  • @robertbastian: It's a const, not a static. (it's a static over ffi). And we already do this
  • @sffc: I don't think stuffing functions on CodePointSetData facilitates discoverability.
  • @robertbastian: We could rename CodePointSetData to BinaryProperty, which will be one of a handful of items in the crate, making it much more discoverable
  • @echeran: We have CodePointSetData because it emphasizes that the backend can change over time.
  • @Manishearth: BinaryProperty doesn't say anything about the implementation.
  • @echeran: "ennumerated property" lookups don't necesarily need to be backed by a map; CodePointMapData emphasizes that this is a map lookup. The interface could be different.
  • @sffc: what about BinaryProperty with static functions that return CodePointSetData
  • @Manishearth: I think the core issue is that there are two ways of slicing this and both are important. Firstly there's the discoverability: finding properties. There's also the interface being used, and it's important to not mix up the different kinds of lookups.

Proposals:

  1. free functions, better modules

icu_properties:

  • binary
    • free functions and/or statics
  • enumerated
    • ditto
  • script
  • etc
  • CodePointSetData, CodePointMapData
  • names?

Con: free functions still not super idiomatic for this kind of thing, but not as bad
Con: we still need some type to anchor over at FFI
Con: you need to go dig in a different module to find the constructors for CPSData
Pros:

  1. zero sized types, no modules

icu_properties:

  • CodePointSetData, CodePointMapData
  • BinaryProperty, EnumeratedProperty, etc
    • these are unit types with methods or statics
  • bidi, names, etc

Con: we kinda have a useless zero sized type here, also not idiomatic Rust when you just want to dangle methods
Con: you need to go dig in a different type to find the constructors for CPSData
Pro: Same on FFI and Rust

  1. zst marker types per property, traits per kind of property
  • struct EmojiProperty; (implements BinaryProperty?)
  • CPSData is still not generic, but the constructor is:
  • CodePointSetData::new::<EmojiProperty>()
  • FFI: you do CodePointSetData::new_emoji() which is kinda what we do now :/
  • open question: do we use marker types or just the value type for enum properties

Pro: no need to have five billion methods and we can do any/buffer/whatever ctors without exploding the API size
Con: idk but manish was going down this route originally with his properties refactor and then was asked not to and doesn't recall why lol
Con: FFI will still have to do 1 or 2 (2 because we don't have free functions)

Decision: per-property ZST approach (3)

Agreed: @Manishearth, @robertbastian, @echeran

@sffc was present for part of the discussion and hasn't seen the final agreement, may further discuss

@Manishearth
Copy link
Member Author

@sffc was a bit lukewarm on the marker types originally because he didn't want to have divergence between FFI and Rust, but we already have significant divergence here because of the free functions

He's happy with the per-property ZST approach, however. We can be more careful in FFI docs to help people comparing FFI and Rust.

@robertbastian noted that the BinaryProperty trait may need to be sealed with some hidden methods.

@sffc
Copy link
Member

sffc commented Mar 11, 2024

I like the marker type approach; but can we be more explicit about how we specify the marker types? For example, we already have pub struct CanonicalCombiningClass. Do we add these for all other properties? Maybe make them wrap booleans? Or should we make these be true marker types, pub enum EmojiProperty {}?

@Manishearth
Copy link
Member Author

Manishearth commented Mar 11, 2024

@sffc that was the "open question: do we use marker types or just the value type for enum properties"

Personally I think we should use the value type for the enum properties, the only potential downside is if we want to reuse them; but they're "open enum" newtypes anyway so i don't think there's a huge case for reuse.

naming can get tricky, are we going to have the binary properties called Ascii? Or are we going to have everyone be FooProperty?

@robertbastian robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Jun 19, 2024
@robertbastian robertbastian self-assigned this Jun 19, 2024
@Manishearth
Copy link
Member Author

  • @sffc - We could have a marker type
CodePointSetData::<Alnum>::load()

which would allow a client to ask for a type CodePointSetData<Alnum> specifically.

  • @Manishearth - intriguing but not sure if worth it
  • @sffc - Could potentially move props into separate crate later, named icu_properties_types
    • Current proposal in the issue is CodePointSetData::new::<Alnum>();
  • @echeran -
  • @hsivonen - Two questions: will this change the API shape for obtaining the sets in the baked data case? If our API has a Rust generic type name, what does this imply for FFI?
  • @sffc - Yes it changes Rust code. Over FFI, we already have the functions on CodePointSetData.
    • What do we do about the baked things? Do they live on CPSDataBorrowed? or CPSData?
  • @sffc They should live on Borrowed. Actually CodePointSetData has new_unstable::<Foo>(provider) (etc) functions and CodePointSetDataBorrowed has new::<Foo>() functions. This facilitates feature-gating.
  • @sffc - I like this path because it gives us a clear way forward to split icu_properties to smaller crates or at least multiple features.
  • @hsivonen - I wasn't seriously suggesting one-crate-per-property for icu4x. A feature per property could be an area to explore.
  • @sffc - So we could split by property or by functionality. 4 functionalities:
    • Struct (no data)
    • Static property
    • Yoke property
    • Property names
  • @sffc - I think we should start with the split in this direction. We could add property-specific features in the future on a case by case basis.
  • @hsivonen - This generally sounds good but I don't know if I understand all the implications

Proposal:

  • icu_properties Module structure
    • CPSData,CPSDataBorrowed,CPMData,CPMDataBorrowed all live at the top level
    • props
      • BinaryProperty and EnumeratedProperty sealed traits (could be hoisted to lib.rs?)
      • enumerated props (NOT hoisted anymore at lib.rs)
      • binary props
    • no maps or sets module at the toplevel (each only has 2 types!!!)
    • script can stay, but gets rid of free functions
    • bidi/bidi_data can stay, but gets rid of free functions
  • CodePointSetData::new_unstable::<Alnum>(), CodePointSetDataBorrowed::new::<Alnum>(), etc. CodePointSetData stays non-generic
  • CodePointMapData::<GeneralCategory>::new_unstable(), CodePointMapDataBorrowed::<GeneralCategory>::new(), with methods for erasing.
  • Baked functions live on Borrowed. Non-borrowed has try_new_unstable etc. Docs properly advertise this.
  • FFI stays basically the same

LGTM: @sffc @Manishearth @robertbastian @echeran (@hsivonen)

@Manishearth
Copy link
Member Author

@robertbastian may have code for this

otherwise i can work on it

@sffc sffc added C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement) and removed discuss Discuss at a future ICU4X-SC meeting discuss-priority Discuss at the next ICU4X meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 11, 2024
This was referenced Sep 24, 2024
@Manishearth
Copy link
Member Author

I think this was done by #5548?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-unicode Component: Props, sets, tries S-medium Size: Less than a week (larger bug fix or enhancement)
Projects
Status: Done
Development

No branches or pull requests

4 participants