Skip to content

Commit

Permalink
Merge pull request #55 from unicode-org/docs-meeting-notes-2020-02-24
Browse files Browse the repository at this point in the history
Create notes-2020-02-24.md
  • Loading branch information
romulocintra authored Feb 29, 2020
2 parents 68e7e56 + 76cc294 commit e22ee56
Showing 1 changed file with 208 additions and 0 deletions.
208 changes: 208 additions & 0 deletions meetings/2020/notes-2020-02-24.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,208 @@
#### February 24 Attendees:
- Romulo Cintra - CaixaBank (RCA)
- Mike McKenna - PayPal (MGM)
- George Rhoten - Apple (GWR)
- Eemeli Aro - OpenJSF (EAO)
- Steven R. Loomis - IBM (SRL) / at halftime
- Mark Davis (MED) — Google
- Nicolas Bouvrette - Expedia (NIC)
- Johan Jongman
- Staś Małolepszy (SMY)
- Jan Mühlemann - Locize (JMU)
- Hugo van der Merwe - Google (HUG) - fly on the wall / observer.
- Mick Monaghan - Guidewire - (MMN)
- Mihai Nita - Google (MIH)
- Shane Carr - Google (SFC)
- Elango Cheran - Google (ECH)
- Janne Tynkkynen - PayPal (JMT)
- Zibi Braniecki - Mozilla (ZBI)
- Nick Felker - Google (NFR)
- George Rhoten - Apple (GWR)
- Dan Chiba - Oracle (DCA)


## MessageFormat Working Group Contacts:

- [Mailing list](https://groups.google.com/a/chromium.org/forum/#!forum/message-format-wg)


## Next Meeting

March 23, 10am PDT (6pm GMT)

## Agenda
- [ Agenda on Github ](https://github.com/unicode-org/message-format-wg/issues/46)


## Presentations

### Introducing: Object model for Message Format

MIH: presents [slides](https://docs.google.com/presentation/d/1dyW29SlqjPRZVScobqEXjnP29fhbqMkCfgxPOWj3Tnw)


EAO: Clarifying Question: "data model" =?= "AST"

ECH: I kind-of disagree. I hope we can talk about terminology. AST and Data Model get used to mean the same thing, but they're not exactly the same thing.

MIH: An AST would contain a lot of fluff, like this is the beginning of an array marker, and a lot of tokens that are irrelevant.

EAO: So as an output of what we're doing, we consider not only an updated source language itself, but a data model that could be the target of messages coming from other formats as well?

MIH: I can receive stuff from a database, etc. It doesn't matter how they get to me.

RCA: Clarifying Question: The data model is something should be something "fixed" ?

MIH: One of the big things is allowing developers to plug in their own data types. The first is the placeholder, the second is the type (flag). We can say the type is flexible. You register a formatter for it. It's expandable, but it's fixed. The structure itself doesn't change. It allows you to add in pieces here and there. If something is a Set, it is a Set. If it's a Map, it's a Map.

RCA: Cool, so this is a model that can be extended.

MIH: You can also glue on top of it. Placeholder, style, key name foo, cannot be changed by translators. You can represent …

NIC: Clarifying Question: Isn't the data model largely influenced by the problems (e.g. linguistic) we want to solve? Do we know all the problems we want to solve? There is a relation between both.

MIH: I think we should take the initial one I presented in the beginning and modify it to represent the features we want. The data model I presented in the beginning might not be able to represent everything we want. I think it presents everything we have today, but we may need to change it to represent everything we want.

NIC: As we explore the problem space, we could discover something that could modify the model?

MIH: Hopefully not drastically, but yes, it's possible.

GWR: One thing that Siri has are the notions semantic concepts and sentence fragments? We need a library that can fill in information (ex: word inflections) so that you don't have to repeat info over & over. I think semantic concepts are very important. I think sentence fragments are iffy -- they can split, have case, and one fragment can interact with another fragment, so we need to be careful about that.

MIH: Yeah. Personally I have big doubts that we can represent everything that is needed to represent perfect formation of human speech. Google Assistant tries to solve the same problems as Siri I think. Generating speech from concepts is really hard. There are a lot of things behind it: you have servers that have data models, …

GWR: We used to have it that you fill in every single answer, and we found that doesn't scale, and now we have to generate stuff using a lexical dictionary.

MIH: I think I know what you mean, but if we're talking about cases, we could have a dative case, or accusative case, etc. The parameter of which noun case could come from the backend somewhere, instead of just coming from the user.

GWR: If we have a personal relationship: your mom called, your dad called. Those could have cases potentially. Those are relationships: a bounded set of things to consider. That's separate from the current MessageFormat. It's program-defined vocabulary, but it can be very generic across the language.

MIH: I think, if we can design our data structures to accommodate that we should try it.

GWR: Do you think we also need to define the grammatical information for each language? Sometimes linguists have trouble defining these things.

MIH: I know I've seen this several times, sometimes translators do not know about the grammar rules of their own language, they just know how to translate and what sounds correct.

MED: One technique we've found that works pretty well is if you pick a set of phrases that exemplify the particular grammatical forms in the language. Instead of a translator’s picking a technical linguistic term, they pick among the phrases. The software then maps that phrase back to the internal linguistic term for that (eg, dative).

MIH: I think the question is : Should we collect all possible cases across languages?

MED: I think we could define a set of grammatical terms which could be internal to the model. We've been doing some of that internally in CLDR, which focuses on noun grammatical features. Most things are noun phrases. We should come up with the internal terminology for describing this, but that's not what you pass to translators.

MIH: What we can do is add an registration mechanism for grammatical cases, etc. that can be similar to how BCP 47 represents locales (language/region/writing system/etc.)

RCA: calls for some conclusions on this agenda item. Do we discuss it now, or work on it for the next meeting?

MIH: If there are questions, I will try to answer them now, but I think people should spend time thinking about this. I don't expect people to wrap their head around this on the spot.

SFC: TCQ - if in doubt, use "New Topic".
- Clarifying question: something that can be answered within 10-15 seconds.
- Point of order: "I need help with notes / can't hear the call / audio isn't working", etc.


### Open Issues - #50 #26 #47

#### Issue [#50](https://github.com/unicode-org/message-format-wg/issues/50) - Design Principles by SMY.
Computational vs Manual; Developer Control vs Localizer Control; DRY vs. WET; Resilient vs. Brittle; People-Friendly vs. Machine-Friendly

SMY: This was a way for me to organize my thoughts. I feel some of these design principles for what we want and how we want to build. The examples I gave here are relevant when we were designing Fluent. I think they're relevant for syntax as well as APIs. I think the first one (HUG: Computational vs Manual?) is one of the most important ones. At the end of the day, we're building something for non-technical users. Maybe we can discuss each of these dimensions right now. Do you think that's a good idea? I'm happy to file separate issues for each of these dimensions. One of the goals of having these guidelines is the decision-making process. Pretty soon we need to have decisions

ECH: I think these are good things to discuss. If we don't tackle them head-on, we will have these discussions over and over.

MIH: agrees. Something like that is needed, to provide some consistency.

SMY: Should I file separate issues?

RCA: Let's talk about issue 30. I think SMY should organize these offline.

SMY: Sure. It's a little hard to extract a question from several different issues. For example, there's a good discussion going on about file format simplicity. It covers a lot of the decisions we need to make. Maybe we can start with the dimensions I put out there; they are pretty universal. One trouble that I may have myself is that some of these dimensions are required by some of the other requirements we're discussing. Some of the requirements may change the landscape of the design principles. For example, the manual versus computational dimension. Comments from GWR were very interesting to me. Seeing how Siri does things and how Siri needs to compute phrases out of smaller pieces changed the way I think about these.

EAO: I think we should start making decisions. We've been talking a lot and mapping the space, but I think we should start making decisions on specific things. We can iterate later, but if we don't decide on things, we can't make progress. I think we should, once we have specific issues on GitHub, we can go forward from there.

ECH: Do you consider data model vs. syntax as an example of a design principle dimension?

SMY: Yeah, I've been wondering that myself. I think it's a consequence of my design principles. I really like how MIH approached this. Focusing on the data model will get us faster to some decisions. It's a hard question.

ECH: I think it is an important one. To me, I think that's one of the most fundamental kinds of decisions we can make. I think it influences the process of designing what we're doing. I think we can compartmentalize, these are implementation details, etc., by addressing this particular dimension.

RCA: I see doing that as an easy way to segment issues and create focus groups to work on these specific issues.

ZBI: Some design principles may not be addressable via data model alone. I think there are questions that can be resolved at the data model level. One thing we did with Fluent is we did a quick iteration of data model change and map it onto a data model syntax. One of the things I have on my mind is that if we go with MIH's approach, that we still have some dedicated time/group to make a syntax that proves that the data model can be implemented.

MIH: I agree that the data model and design principles don't fully overlap. We need both.

ZBI: I don't fully agree with that. Depending on how far we go with error recovery, salvaging a message that has a broken fallback, etc., those kind of things are only verifiable if we try to implement the data model in some kind of syntax.

MIH: I think it does make sense. I think if we have doubts about the data model, we should try stuff, we should throw stuff at it. For multi-line, I don't see how it influences it. I would say that yeah, we should independently throw ideas at it and try stuff.

ECH: I agree with both MIH and ZBI. Some of the concerns ZBI was bringing up seem to relate to a dimension of a design principle, which is where does MessageFormat end and a Translation Management System begin? I would like to address that more explicitly, so that we know what is in scope or out of scope.

SMY: For the data model, should we use JSON? Do you have any suggestions?
- Could we have more concrete/experimental github-based examples to help make WG discussions flow better?

MIH: I did something on top of protobufs because I'm more familiar with it. JavaScript, Python, Dart, C, etc., support protobufs. I like that it's more opinionated than just JSON. I think Thrift would be an equally good format. I don't have a strong opinion. I think it would be useful to prototype on top of that.

SFC: JSON is probably the most general format that's not company-specific. Together with use of JSON Schema

MIH: is skeptical of JSON Schema, as something that's not as widely supported.

SRL notes that even if an implementation doesn't use JSON Schema specifically in their implementation, it would be a good way to move prototyping forward if JSON is used as the format.

MIH campaigns again for some prototyping with different tools. We'd be able to have more concrete discussions.

ZBI: Encourages use of a strongly-typed language. When implementing Fluent, we started with JavaScript, then Python, and I was the first person to attempt to implement the data model in strongly typed language (Rust). That exposed a number of areas that were underspecified.

##### Conclusion

SMY: I will file issues for each of these axes, and we should also file a separate issue for the data model vs syntax axis.

#### Issue [#30](https://github.com/unicode-org/message-format-wg/issues/30) -Define technical terms - by ECH.

ECH: I want to make sure we can agree on certain terms. Reduced vocabulary reduces the chance that we talk past each other (reduces ambiguity). I grouped these: some might be related, used in the same way, different, etc. If you've had the same experiences I've had, if you can chime in, that would be great. For example, "placeholder" and "placeable" might be referring to the same thing. "AST" also came up earlier. To me, AST means that you are parsing tokens from some syntax, usually in the context of a compiler. I don't see that quite the same as a representation of data. The way it gets used is synonymous with the other.

See list of terms / collective brainstorming at: https://github.com/unicode-org/message-format-wg/issues/30

RCA: Please also update the wiki. Should we place and add new terms at : https://github.com/unicode-org/message-format-wg/wiki/Glossary-&-Resources.

ECH: The wiki of terms right now is very basic. Hopefully we can start to boil it down.

SFC: Are there any particular terms you wanted to clarify right now?

ECH: The whole cluster of API argument syntax: I want to know how they relate to each other. I want us to define them so that we have our own definitions. I don't want us to make any assumptions about what they mean.

#### Issue [#47](https://github.com/unicode-org/message-format-wg/issues/30) - File Format - by NIC.

NIC: My preference is to push for file-format-agnostic. I think if we have a file syntax, it would hurt adoption. I think it should be as flexible as possible. Adopting a new file format is something that wouldn't be done quickly. If our goal is to have better adoption, do we have a way to solve all the problems we want to solve in a way that is file format agnostic? I think it would help guide discussion to decide which are the key features we want to solve and do a stack ranking. That can help us understand whether or not we need a special file format.

SFC: It would be useful to sort issues/features and make an explainer doc, check-in to the wiki. A shared list that takes the features and identifies what design principles are required to support those features. If there's a feature that can't be supported without a file format, we can determine whether we need to prioritise such a feature?

RCA: This is related with Data Model definition shouldn’t we wait to have it defined or more advanced to start working on that ?

EAO: Big decision we ought to be making: do we really want to be driving all of i18n/l10n to be using one message format, or do we want to build an environment which can support many or all message formatting languages? All of what we're considering and covering could work with either of those, but they're very different worlds.

Which is better? We need to answer this question.

SRL: We should have a syntax, a file format, as the core of this standardization effort.
Specifying transforms between formats could be done. For example, if the core file format is XLIFF or not, we should be able to transform it to XLIFF.

ECH: It's a good question that EAO brought up.
I think this is another way of looking at the same question of: Data model vs syntax?
E.g. let's standardize on an existing format: MessageFormat / Fluent / some other / etc?
Or yes: we're trying to define one new single syntax to use them all?

SMY: This discussion made me realize that maybe besides the design principles discussion, we should have a specific discussion about the goals and non-goals of the topic. For example, compatibility with XLIF is a bounding criterion. I have a couple of other ideas. I'll throw them into an issue.

* End of meeting (should ideally have been a point of order?)


## Not discussed Issues
- [#48](https://github.com/unicode-org/message-format-wg/issues/48)
- [#26](https://github.com/unicode-org/message-format-wg/issues/26)


## Next meeting actions/issues

- Create issues to split topics related with Design Principles #50
- Data Model Discussion, Open issue to design and work on that with examples , discussions etc …
- Decide/vote : “Big decision we ought to be making: do we really want to be driving all of i18n/l10n to be using one message - format, or do we want to build an environment which can support many or all message formatting languages? All of what we're considering and covering could work with either of those, but they're very different worlds.”
- Define goals and non-goals of MFWG related with #49

0 comments on commit e22ee56

Please sign in to comment.