Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data provider trait definition #8

Closed
sffc opened this issue Mar 15, 2020 · 15 comments · Fixed by #61
Closed

Data provider trait definition #8

sffc opened this issue Mar 15, 2020 · 15 comments · Fixed by #61
Assignees
Labels
A-design Area: Architecture or design C-data-infra Component: provider, datagen, fallback, adapters T-docs-tests Type: Code change outside core library
Milestone

Comments

@sffc
Copy link
Member

sffc commented Mar 15, 2020

In data-pipeline.md, I explain the need for a data provider trait, with an open question being exactly what the definition of this trait should be. The purpose of this thread is to reach a cohesive solution for that trait.

Requirements list (evolving):

  1. Should not require unsafe code in Rust
  2. Should be fast when loading data from static memory (avoid requirement to clone)
  3. Should be amenable to data providers being connected in a chain, performing transformations on certain keys and passing through others
  4. Should be ergonomic to use across an FFI boundary
@sffc
Copy link
Member Author

sffc commented Mar 15, 2020

Option A: Enum returned by value

I have the following solution implemented in https://github.com/sffc/rust-wasm-i18n/blob/master/wasm-no-std/src/intl/util/data.rs:

pub struct CompactData {
	pub prefix: String,
	pub suffix: String,
}

pub struct CurrencyData {
	// ...
}

pub struct CompactCurrencyData {
	// ...
}

pub struct ScientificData {
	// ...
}

pub struct DecimalData {
	pub zero_digit: char,
	pub decimal_separator: char,
	pub grouping_separator: char,
	pub primary_group: Option<u8>,
	pub secondary_group: Option<u8>,
}

pub enum DataKey {
	// TODO: Upgrade Box to something that doesn't require owned data (Cow?)
	Compact(Option<Box<CompactData>>),
	Currency(Option<Box<CurrencyData>>),
	CompactCurrency(Option<Box<CompactCurrencyData>>),
	Scientific(Option<Box<ScientificData>>),
	Decimal(Option<Box<DecimalData>>),
}

pub trait DataProvider {
	fn get_data(&self, key: &DataKey) -> DataKey;
}

// Example implementation

pub struct DummyDataProvider {
}

impl DataProvider for DummyDataProvider {
	fn get_data(&self, key: &DataKey) -> DataKey {
		match key {
			DataKey::Compact(_) => DataKey::Compact(Some(Box::new(CompactData {
				prefix: String::from("p"),
				suffix: String::from("s"),
			}))),
			DataKey::Currency(_) => DataKey::Currency(Some(Box::new(CurrencyData {}))),
			DataKey::CompactCurrency(_) => DataKey::CompactCurrency(Some(Box::new(CompactCurrencyData {}))),
			DataKey::Scientific(_) => DataKey::Scientific(Some(Box::new(ScientificData {}))),
			DataKey::Decimal(_) => DataKey::Decimal(Some(Box::new(DecimalData {
				zero_digit: '0',
				decimal_separator: '.',
				grouping_separator: ',',
				primary_group: Some(3),
				secondary_group: Some(2),
			}))),
		}
	}
}

Pros:

  • Unsafe code not required
  • Easy for providers to "pass through" data they don't care about

Cons:

  • The returned enum is the same as the requested enum only by convention; the receiver would still need to check and have a failure path if the enum types are mismatched

@sffc
Copy link
Member Author

sffc commented Mar 24, 2020

@Manishearth @zbraniecki @hsivonen @hagbard

Thoughts on this? I'd like to have a more concrete data provider trait definition soon, so we can start implementing more pieces of it.

@sffc sffc self-assigned this Mar 24, 2020
@sffc sffc added this to the 2020 Q1 milestone Mar 24, 2020
@hagbard

This comment has been minimized.

@sffc
Copy link
Member Author

sffc commented Mar 26, 2020

Let just start with something vaguely sensible, iterate and do a proper review later.

Yes, of course. What I'm missing is a list of "vaguely sensible" options, beyond the one option I proposed earlier in the thread. My ask is to compile a list of options, pick one to start with, and we can change later if need be.

@hagbard
Copy link
Contributor

hagbard commented Mar 26, 2020

Ahh okay. Then since this is conceptually a "have key, get value" thing, any API that's semantically equivalent to "given URI, get resource" sounds fine. The issues of how to implement wrapped suppliers etc. (as suggested in the linked doc) all boil down to the semantics of the kays and what (if anything) can be assumed about the structure of the returned data.

A sketch of what a wrapper might do (i.e. pseudo code for a realistic example that interprets the key in order to fetch from one of several sub-suppliers) would be nice. E.g. how would I write a wrapper than deals with a legacy path, no longer supported in the default supplier but needed by my legacy library code for some bizarre reason (or is that never going to work?).

As a related aside, if the key were represented as a URI, what do you think it would look like (I think this would help me understand what properties things like version and the optional data key are meant to have and how they relate to each other). If this isn't something that could be represented as a URI, I think it says something quite interested about the design.

@sffc
Copy link
Member Author

sffc commented Mar 26, 2020

The issues of how to implement wrapped suppliers etc. (as suggested in the linked doc) all boil down to the semantics of the kays and what (if anything) can be assumed about the structure of the returned data.

There is a list of example keys in the draft, and the draft lists which specific types are allowed to be in the returned data.

A sketch of what a wrapper might do (i.e. pseudo code for a realistic example that interprets the key in order to fetch from one of several sub-suppliers) would be nice. E.g. how would I write a wrapper than deals with a legacy path, no longer supported in the default supplier but needed by my legacy library code for some bizarre reason (or is that never going to work?).

Sure. I'll give an example of a provider that tries several providers to get data for a particular locale.

// I'm writing this in Python because it's for illustration purposes only
class LocaleForkingProvider:
  def __init__(self, child_providers):
    self.child_providers = child_providers

  def get_data(self, key, request_vars):
    for provider in self.child_providers:
      (data, response_vars) = provider.get_data(key, request_vars)
      if response_vars.langid != "und":
        return (data, response_vars)
    # None of the providers had data other than root; return from the first one
    return child_providers[0].get_data(key, request_vars)

As a related aside, if the key were represented as a URI

My proposal as currently approved has keys as enums instead of string URIs. If there's a reasonable trait definition that is type-safe with string URIs, then that's something we could consider changing in the draft.


Just to rewind a bit, I want to make sure we're on the same page that this issue is not for general discussion of the data pipeline proposal. This thread is to answer the specific question of the Rust trait definition for the data provider. I would like to see some concrete options, such that we can discuss their pros and cons.

@hagbard
Copy link
Contributor

hagbard commented Mar 26, 2020

The issues of how to implement wrapped suppliers etc. (as suggested in the linked doc) all boil down to the semantics of the kays and what (if anything) can be assumed about the structure of the returned data.

There is a list of example keys in the draft, and the draft lists which specific types are allowed to be in the returned data.

Example keys don't tell me what the semantics are. I can't understand issues like how we deal gracefully with deprecated keys (if we can). I'm not questioning whether what's suggested will work now, I'm trying to understand what it feel like to be using this API in 5 years time after a bunch of things have changed.

Do clients end up having a mix of versioned keys in their code? I.e. "use v2 or this key, but version 7 of that key because reasons". How does that feel to a maintainer of this client code when (if) keys have mix of versions scattered around? How many keys is a "module" of code going to use? If 1 or 2 it might be fine, but if 10 or 20 it might get unmaintainable.

This is what I want to understand.

A sketch of what a wrapper might do (i.e. pseudo code for a realistic example that interprets the key in order to fetch from one of several sub-suppliers) would be nice. E.g. how would I write a wrapper than deals with a legacy path, no longer supported in the default supplier but needed by my legacy library code for some bizarre reason (or is that never going to work?).

Sure. I'll give an example of a provider that tries several providers to get data for a particular locale.

// I'm writing this in Python because it's for illustration purposes only
class LocaleForkingProvider:
  def __init__(self, child_providers):
    self.child_providers = child_providers

  def get_data(self, key, request_vars):
    for provider in self.child_providers:
      (data, response_vars) = provider.get_data(key, request_vars)
      if response_vars.langid != "und":
        return (data, response_vars)
    # None of the providers had data other than root; return from the first one
    return child_providers[0].get_data(key, request_vars)

So what order does it try in? Are suppliers going to advertise some "precedence" for ordering? Do they have version numbers? Or, if multiple suppliers provide data for a key, must they always provide identical results?

As a related aside, if the key were represented as a URI

My proposal as currently approved has keys as enums instead of string URIs. If there's a reasonable trait definition that is type-safe with string URIs, then that's something we could consider changing in the draft.

You misunderstand, I don't mind what the key is in code, I'm proposing a thought experiment where you say "if we were using a URI, this is what it would be" because a URI has parts which have different semantics (scheme, path, anchor). Know what it would look-like as a URI lets you frame the key data in a way that more easily explains its semantics and properties like stability.

E.g. a URI using an anchor has the property that "scheme://path#anchor" implies the existence of the resource "scheme://path", so if there's something in keys for getting "a bit of a data block", then it would be compatible with an anchor.

A path is different, but it shows order scheme://foo/bar may imply the existence of scheme://foo (but need not) but it does say how "foo" relates to "bar". In you case, you talk about an optional "key" (e.g. locale ID). It that "path like" or "anchor like"?

What about versioning? Would a URI be: scheme://version/path or scheme://path/version ?

Just to rewind a bit, I want to make sure we're on the same page that this issue is not for general discussion of the data pipeline proposal. This thread is to answer the specific question of the Rust trait definition for the data provider. I would like to see some concrete options, such that we can discuss their pros and cons.

I can't answer specifics until I understand the concepts better. An example, even working code, doesn't let me understand the intended semantics (which is what matters most in design).

@sffc
Copy link
Member Author

sffc commented Mar 28, 2020

I can't understand issues like how we deal gracefully with deprecated keys (if we can).

My intent is that the section Supported Keys in the explainer addresses this question. Versioning of various pieces of the pipeline (key version, CLDR version, etc) is something into which I put a lot of thought when writing the explainer.

Do clients end up having a mix of versioned keys in their code?

The keys are not something that client code generally needs to see, unless it is implementing a custom data provider (e.g., to override a certain piece of data). OmnICU requests data for a key, and the provider responds with data. The client just passes through the request and response.

When designing the data provider trait, a higher emphasis should be put on efficiency and type safety, and perhaps a bit less on ergonomics than usual, since the average OmnICU user won't be needing to call or implement it.

So what order does it try in?

In this example, they're tried in the order in which the user listed them in the constructor. I can imagine other data provider implementations doing some kind of sorting on the list.

Are suppliers going to advertise some "precedence" for ordering?

If there's a need, I think it would be very reasonable for a data provider to provide metadata about itself, like what version of CLDR it provides, etc.

Do they have version numbers? Or, if multiple suppliers provide data for a key, must they always provide identical results?

Data is versioned. The data version is one of the response variables explained in the explainer.

It's also worth pointing out that I was reminded in #19 that this API should be async. Since the data provider can be reading data from a file, service, etc., we want it done async.

@Manishearth
Copy link
Member

So, I think the passthrough here should be handled externally rather than internally. I.e. the provider is able to respond "don't know" (this could simply be a None result, though we may need more variants to distinguish between "don't know" and "bad query", and depending on how we handle async there may be another "ask again later" option where you need to run some async fetcher first). Whoever is using the provider can handle fallback, and we can also write a generic provider that can handle fallback between multiple existing providers.

Given that, I'd probably have DataKey not contain any data, and instead have a DataValue or Data for the return type.

This can also be typed: we can have fn get_data<D: DataKind>(&self) -> D where DataKind is one of the FooData types. This is a very nice, zero-cost Rusty API, however this makes it tricky to turn this trait into a trait object, which will likely be necessary (there are tricks to make this work, though, so this isn't insurmountable)

@sffc
Copy link
Member Author

sffc commented Apr 8, 2020

This can also be typed: we can have fn get_data<D: DataKind>(&self) -> D where DataKind is one of the FooData types. This is a very nice, zero-cost Rusty API

With function generics, a DataProvider would have to implement get_data for every known type, right? This goes against the model of a DataProvider being a trait exposing a single function that handles all data requests in one place.

however this makes it tricky to turn this trait into a trait object, which will likely be necessary (there are tricks to make this work, though, so this isn't insurmountable)

Can you elaborate?

If I were writing this in C++ using casts, I would do something like,

const void* get_data(DataKey key) { /* ... */ }

auto currency_data = static_cast<const CurrencyData*> get_data(DataKey::CURRENCY);

So I want the return type to be linked to the argument type in some way, but I also want this to work such that data providers can pass around keys and data hunks opaquely without generics.

@Ralith
Copy link

Ralith commented Apr 8, 2020

With function generics, a DataProvider would have to implement get_data for every known type, right?

No, you just write one implementation that's capable of dealing with any type that satisfies DataKind. This is the idiomatic approach if the caller statically knows what type it wants.

If I were writing this in C++ using casts, I would do something like:

The analog to this would be returning a Box<dyn Any> for owned data, or &dyn Any if you can refer to some other storage, but of course it's preferred to use static types because then you avoid any chance of getting the type wrong at runtime (and save overhead besides).

@sffc sffc added the T-docs-tests Type: Code change outside core library label Apr 16, 2020
@sffc
Copy link
Member Author

sffc commented Apr 16, 2020

Okay, so to make sure I understand, you're proposing something like this?

use std::borrow::Cow;

enum Encoding {
    Utf8,
    Utf16,
}

struct Request<T2> {
    // *Requested* LangID
    // Note: use LangID once it is available
    langid: String,
    encoding: Encoding,
    params: T2,
}

struct Response {
    // *Supported* LangID
    langid: String,
    encoding: Encoding,
    data_version: String,
}

trait DataProvider {
    fn get_data<T1: std::clone::Clone, T2>(req: Request<T2>) -> (Cow<'static, T1>, Response);
}

Unfortunately the following usage doesn't compile:

struct ForkingDataProvider<'a> {
    provider_1: &'a dyn DataProvider,
    provider_2: &'a dyn DataProvider,
}

impl<'a> DataProvider for ForkingDataProvider<'a> {
    fn get_data<T1: std::clone::Clone, T2>(&self, req: Request<T2>) -> (Cow<'static, T1>, Response) {
        if req.langid == "de-CH" {
            return self.provider_1.get_data(req);
        } else {
            return self.provider_2.get_data(req);
        }
    }
}

/*
error[E0038]: the trait `DataProvider` cannot be made into an object
  --> src/main.rs:28:5
   |
24 |     fn get_data<T1: std::clone::Clone, T2>(&self, req: Request<T2>) -> (Cow<'static, T1>, Response);
   |        -------- method `get_data` has generic type parameters
...
28 |     provider_1: &'a dyn DataProvider,
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `DataProvider` cannot be made into an object
*/

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ca6116e38393791dbeffa2480e6c7668

@Manishearth
Copy link
Member

Right, that's what I meant by "this is tricky to turn into a trait object".

With function generics, a DataProvider would have to implement get_data for every known type, right?

No, your DataKind implementation would handle the abstraction there, so you only need to write code for every known DataKind.

@Ralith
Copy link

Ralith commented Apr 16, 2020

If you need DataProvider to be suitable for use as a trait object, then one way or another you'll have to route through dyn Any, yeah.

@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters A-design Area: Architecture or design labels May 7, 2020
@sffc sffc linked a pull request Jun 17, 2020 that will close this issue
@sffc sffc modified the milestones: 2020 Q1, 2020 Q2 Jun 17, 2020
@sffc
Copy link
Member Author

sffc commented Jun 17, 2020

Fixed by #61

@sffc sffc closed this as completed Jun 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-design Area: Architecture or design C-data-infra Component: provider, datagen, fallback, adapters T-docs-tests Type: Code change outside core library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants