rfc: updated file-structure for multi-locale .lg and .lu authoring #1922

Closed
cwhitten wants to merge 11 commits into master from cwhitten/multi-locale

Member

@cwhitten cwhitten commented Jan 30, 2020

closes #1922

Multi-locale LG & LU authoring in Composer


This RFC is guided by the following scenarios, which Composer should support:

  1. I can create, modify, or delete a .lg or .lu file for a dialog or create a common.lg file in a language of my choosing.
  2. I can set the language for assets to be rendered in the authoring surface and forms.
  3. If the Shell cannot find the configured language, the original authored language of the asset will be used.
  4. I can create a full set of files in a language (base -> target(s)) that copies base as target(s) initial implementation.
  5. I can copy all Bot directory assets to a location of my choosing.
  6. I can load bot assets to replace the current assets; where identically named files exist, they are overwritten with the new or modified versions.

Implementation:

Providing a good experience for translating these files can be complex. In considering the UX for language-specific .lg and .lu files, we should take the opportunity to reconsider what has become the convention for how Composer bot assets are represented on the filesystem. This RFC lays out different options for writing files to disk logically and proposes an update to the current convention.

The distribution of .lg and .lu files in a set of Composer assets currently looks like the following:

/ComposerDialogs
  /common
    common.lg
  /Main
    Main.dialog
    Main.lg
    Main.lu
  /DialogFoo
    DialogFoo.dialog
    DialogFoo.lg
    DialogFoo.lu
  /DialogBar
    DialogBar.dialog
    DialogBar.lg
    DialogBar.lu
  /DialogBaz
    DialogBaz.dialog
    DialogBaz.lg
    DialogBaz.lu
Problem

We want to provide an editing experience for these files, allow a user to add .lg and .lu files in different languages, and make sensible choices on the user's behalf about how we structure the asset directory.

What this would look like in today's file representation:

/ComposerDialogs
  /common
    common.en-us.lg
    common.fr.lg
    common.de.lg
  /Main
    Main.dialog
    Main.en-us.lg
    Main.fr.lg
    Main.de.lg
    Main.lu
  /DialogFoo
    DialogFoo.dialog
    DialogFoo.en-us.lg
    DialogFoo.fr.lg
    DialogFoo.de.lg
    DialogFoo.lu
  /DialogBar
    DialogBar.dialog
    DialogBar.en-us.lg
    DialogBar.fr.lg
    DialogBar.de.lg
    DialogBar.lu
  /DialogBaz
    DialogBaz.dialog
    DialogBaz.en-us.lg
    DialogBaz.fr.lg
    DialogBaz.de.lg
    DialogBaz.lu
Issues with current file structure

"Main" became the convention for naming the entry dialog, but this is a heavy constraint. We can reconsider it in favor of something more expressive. Instead of generating a /BotName/Main.dialog, why can't we generate a /BotName/<BotName>.dialog as the entry point?

Keeping the .lu and .lg files alongside the .dialog file is logical because it places the files where they are used. This makes a Dialog directory more portable in a world where Dialogs are not only used in a single bot. This file structure is a natural stepping stone to a system where Dialogs hold their own dependencies (.lu, .lg) and can be published or shared outside of the current bot.

An example downside of this approach is that this distribution of files may not be set up for domain-specific work in one of the file formats. One could prefer that all the .lu files live in their own directory, and all the .lg files live in their own directory, or that all "en-us" files live in an "en-us" directory, and so on. Because we anticipate that a Dialog and its associated content files (.lu, .lg) will be shared via mechanisms currently planned to be built, a structure that implies a tighter binding between .dialog, .lg, and .lu is currently the preferred approach.

Note
  1. This proposal only applies to a filesystem-based storage plugin, and has little bearing on a database-backed store plugin implementation.
  2. This is ideally the final time we make a significant naming or serialization decision before Composer hits GA. If we wanted to, for example, lowercase files and/or directories, this would be the time to do it.
Alternative structures
  1. Assets partitioned based on dialog and dependent assets

Benefit: Dependency encapsulation; the layout is recursive, and the convention can be applied to scenarios like publishing local dialogs and their associated dependencies, or pulling down dialogs and their associated dependencies from an external/third-party source.

/coolbot
  coolbot.dialog
  /language-generation
    /en-us
      common.en-us.lg
      coolbot.en-us.dialog
  /language-understanding
    main.en-us.lu
  /dialogs
    /foo
      foo.dialog
      /language-generation
        /en-us
          foo.en-us.dialog
      /language-understanding
        foo.en-us.lu
  2. Assets partitioned by asset type

Benefit: Physically maps to a content editing scenario (.lu, .lg)

/coolbot
  /dialogs
    coolbot.dialog
    foo.dialog
    bar.dialog
    baz.dialog
  /language-generation
    /en-us
      common.en-us.lg
      coolbot.en-us.lg
      foo.en-us.lg
      bar.en-us.lg
      baz.en-us.lg
  /language-understanding
    /en-us
      main.en-us.lu
      foo.en-us.lu
      bar.en-us.lu
      baz.en-us.lu
Proposal
  1. Adopt a lower-case naming convention for files and directories
  2. Remove hard-coded "Main" entrypoint requirement and key off of the bot name .dialog
  3. Adopt the #1 alternative structure option for the physical layout of .dialog, .lu, .lg
Important consideration:

When attempting file lookups, we should be as agnostic to the file structure as possible, to support the scenario where one authors these assets outside of Composer. We shouldn't rule out the realistic scenario in which users author files in a different text editor or IDE and load them into Composer expecting a full experience. To fully support this, we aim to use the Adaptive Dialog ResourceManager and supporting modules so that there is near-exact parity in how the runtime and the authoring surface do file lookups and resolution. Whatever we choose for a directory convention, we should not hardcode it into the resolution logic.
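
To make the "do not hardcode the directory convention" point concrete, here is a minimal sketch of a layout-agnostic lookup. It is not the ResourceManager/ResourceExplorer implementation; `index_resources` and `write_assets` are hypothetical helpers, and treating the lowercased file name as the resource id is an assumption made for illustration.

```python
import os
import tempfile
from pathlib import Path

ASSET_EXTENSIONS = {".dialog", ".lg", ".lu"}


def index_resources(root: Path) -> dict[str, Path]:
    """Map each asset file name (used here as the resource id) to its path."""
    index: dict[str, Path] = {}
    for directory, _subdirs, files in os.walk(root):
        for file_name in files:
            if Path(file_name).suffix not in ASSET_EXTENSIONS:
                continue
            resource_id = file_name.lower()  # assumption: ids are case-insensitive
            if resource_id in index:
                raise ValueError(f"duplicate resource id: {resource_id}")
            index[resource_id] = Path(directory) / file_name
    return index


def write_assets(root: Path, relative_paths: list[str]) -> None:
    """Create empty placeholder files so the two layouts can be compared."""
    for relative in relative_paths:
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("")


with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "nested"  # partitioned by dialog
    flat = Path(tmp) / "flat"      # partitioned by asset type
    write_assets(nested, ["coolbot.dialog", "dialogs/foo/foo.dialog",
                          "dialogs/foo/language-understanding/foo.en-us.lu"])
    write_assets(flat, ["dialogs/coolbot.dialog", "dialogs/foo.dialog",
                        "language-understanding/en-us/foo.en-us.lu"])
    nested_ids = sorted(index_resources(nested))
    flat_ids = sorted(index_resources(flat))
```

Both layouts yield the same set of ids, so a lookup written against the index does not need to know which convention produced the files.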

@github-actions

Coverage Status

Coverage remained the same at 42.413% when pulling f44ed7c on cwhitten/multi-locale into 12d77a0 on master.

@boydc2014 boydc2014 requested a review from vishwacsena January 30, 2020 05:40
@cwhitten cwhitten changed the title from "rfc: multi-locale .lg and .lu authoring" to "rfc: updated file-structure for multi-locale .lg and .lu authoring" Jan 30, 2020
@benbrown
Contributor

A few thoughts:

  • As (currently) implemented, the storage system, even when database backed, represents things with "paths" that are compatible with this proposal and would not require major changes to how it works. This may not apply to all possible storage systems, but it is hard to say.
  • We cannot assume someone can just "copy" or "move" files around. Composer needs to provide an interface for this, à la "import an asset", so that a literal file or group of files can be added into the storage system at a certain location. This would apply to all types of assets, not just LG files.
  • I definitely vote for lowercasing all file names!

@vishwacsena
Contributor

I'm a bit lost on what we actually intend to do. Based on this,

Because we anticipate that a Dialog and its associated content files (.lu, .lg) will be shared via mechanisms currently planned to be built, a structure that implies a tighter binding between .dialog, .lg, and .lu is currently the preferred approach.

I believe we are going to keep the related .lu, .lg files in the same location as the .dialog. Yes?

Experientially, we need to continue to push the concept of a file away from the user and provide a seamless, contextual authoring experience.

ResourceExplorer and typeloader do not really care about file location; they will continue to use the fileName, or a combination of fileName and locale encoded directly in the fileName, to find and load the right resource.
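
As an illustration of locale encoded in the file name, the sketch below parses names like `Main.en-us.lg` and falls back to the originally authored language when the requested one is missing (scenario 3 in the RFC). This is not ResourceExplorer code: `parse_asset_name` and `resolve_asset` are hypothetical, and the intermediate fallback from `fr-ca` to `fr` is an assumption.

```python
from typing import NamedTuple, Optional


class AssetName(NamedTuple):
    name: str
    locale: Optional[str]
    extension: str


def parse_asset_name(file_name: str) -> AssetName:
    """Split 'Main.en-us.lg' into name, locale, extension; the locale is optional."""
    parts = file_name.split(".")
    if len(parts) == 2:
        return AssetName(parts[0], None, parts[1])
    if len(parts) == 3:
        return AssetName(parts[0], parts[1].lower(), parts[2])
    raise ValueError(f"unexpected asset file name: {file_name}")


def resolve_asset(available: list[str], name: str, extension: str,
                  locale: str, authored_locale: str) -> Optional[str]:
    """Pick the file for `locale`, falling back to broader and authored locales."""
    by_key = {}
    for file_name in available:
        parsed = parse_asset_name(file_name)
        by_key[(parsed.name.lower(), parsed.locale, parsed.extension)] = file_name
    # Try the exact locale, then its language part (assumption), then the
    # authored locale, then a file with no locale segment at all.
    candidates = [locale.lower()]
    if "-" in locale:
        candidates.append(locale.lower().split("-")[0])
    candidates += [authored_locale.lower(), None]
    for candidate in candidates:
        match = by_key.get((name.lower(), candidate, extension))
        if match is not None:
            return match
    return None


files = ["Main.en-us.lg", "Main.fr.lg", "Main.lu"]
french = resolve_asset(files, "Main", "lg", "fr-ca", "en-us")
german = resolve_asset(files, "Main", "lg", "de", "en-us")
```

Because the lookup only inspects file names, it behaves the same whether the files sit next to the .dialog file or in a per-locale folder.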

@cwhitten
Member Author

@vishwacsena the experience is out of scope of the doc, this is more of an infrastructure proposal that we can align on and defend. I am proposing we keep the existing convention and keep language files associated with the dialog file. The experience will continue to abstract the file metaphor away.


#####Note

1. This proposal only applies to a filesystem-based storage plugin, and has little bearing on a database-backed store plugin implementation. **It may have merit to choose a structure that better aligns with a database-driven index approach.**
Contributor

@boydc2014 boydc2014 Feb 1, 2020

It's totally OK with me that Composer only has an fs-based abstraction layer for storage, and lets any other backend storage implement a few fs primitives.

This is similar to the idea in Unix\Linux\Plan9 that everything is a file. Anyhow, I feel this is a very widely adopted approach to abstracting storage, and I see no need to seek a more generic storage abstraction than fs.

Main.de.lu
Contributor

I actually like this alternative more, because in the workflow I know, users tend to group assets by locale (VA is an example).

If our folder structure is like this

/Dialogs
/LanguageGeneration
  /en-us

I can imagine that adding a new language, fr-fr, would be as simple as copying the en-us folder into an fr-fr folder and editing in place.
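
A rough sketch of that copy step, assuming the per-locale folder layout above and file names of the form `name.<locale>.ext`. `clone_locale` is a hypothetical helper, not an existing Composer function.

```python
import shutil
import tempfile
from pathlib import Path


def clone_locale(asset_root: Path, base: str, target: str) -> list[str]:
    """Copy <asset_root>/<base> to <asset_root>/<target> and rename name.<base>.ext files."""
    source_dir = asset_root / base
    target_dir = asset_root / target
    if target_dir.exists():
        raise FileExistsError(f"{target_dir} already exists")
    shutil.copytree(source_dir, target_dir)
    created = []
    # sorted() materializes the listing first, so renaming while looping is safe.
    for path in sorted(target_dir.rglob("*")):
        if not path.is_file():
            continue
        stem, _, extension = path.name.rpartition(".")
        suffix = "." + base
        if stem.endswith(suffix):
            new_name = f"{stem[: -len(suffix)]}.{target}.{extension}"
            path = path.rename(path.with_name(new_name))
        created.append(path.name)
    return created


with tempfile.TemporaryDirectory() as tmp:
    lg_root = Path(tmp) / "language-generation"
    (lg_root / "en-us").mkdir(parents=True)
    (lg_root / "en-us" / "common.en-us.lg").write_text("# Greeting\n- Hello\n")
    (lg_root / "en-us" / "foo.en-us.lg").write_text("# Prompt\n- What next?\n")
    created = clone_locale(lg_root, "en-us", "fr-fr")
    copied_text = (lg_root / "fr-fr" / "common.fr-fr.lg").read_text()
```

The copied files start as exact duplicates of the base language, which matches scenario 4 of the RFC (base copied as the target's initial implementation).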

In my opinion, this will also help team collaboration because it separates the concerns of conversation designers, content writers, and even model trainers.

Member Author

@cwhitten cwhitten Feb 1, 2020

I wouldn't over-index on a physical layout of the files to align well with collaboration scenarios, though it was mentioned in the preface of the RFC and should be considered to an extent. Abstraction of the file metaphor will need to exist regardless to provide an appropriate experience for content writers.

That said, partitioning on dialog/lu/lg is a valid alternative, but I'd like to discuss a bit more. I agree that in a content editing scenario there is an advantage to physically laying out the files this way. From a dialog "clone" or sharing/publishing to some central location for re-use, physically laying out the dialog/lu/lg in a way that encapsulates the dialog's dependencies would have the advantage.

Additionally, I'd like to propose we hoist the main.dialog to the root of the bot. I see the following layouts as reasonable adjustments.

  1. main.dialog at root to signify entry-point, partitioned on dialog
/coolbot
  main.dialog
  /language-generation
    /en-us
      common.en-us.lg
      main.en-us.dialog
  /language-understanding
    main.en-us.lu
  /dialogs
    /foo
      foo.dialog
      /language-generation
        /en-us
          foo.en-us.dialog
      /language-understanding
        foo.en-us.lu
  2. main.dialog inside /dialogs with the rest of the dialogs, partitioned on asset-type
/coolbot
  /dialogs
    main.dialog
    foo.dialog
    bar.dialog
    baz.dialog
  /language-generation
    /en-us
      common.en-us.lg
      main.en-us.lg
      foo.en-us.lg
      bar.en-us.lg
      baz.en-us.lg
  /language-understanding
    /en-us
      main.en-us.lu
      foo.en-us.lu
      bar.en-us.lu
      baz.en-us.lu

While #2 looks clean physically, I tend to prefer the encapsulation and recursive nature of #1, which sets us up nicely to move/share dialogs between bots in the future.

cc @vishwacsena @benbrown

Member Author

I've updated the RFC to reflect this.

Contributor

@boydc2014 boydc2014 Feb 2, 2020

#2 does look clean physically, and it would be even cleaner if we also covered the "settings" folder and "schemas" folder.

#1 is recursive and reflects the dialog structure to a certain extent, and it is better than #2 in certain scenarios such as sharing. But my biggest concern with a recursive representation is that it enforces a tree structure, while the dialog structure is actually a graph. If two dialogs, A and B, both refer to C, which one should encapsulate C? Maybe symbolic links can help with this, but AFAIK symbolic links on Windows are a mess, and this would also make our solution more complex and more coupled to a very specific fs concept.
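
The tree-versus-graph concern can be checked mechanically. The sketch below, under the assumption that dialog references are available as a simple mapping from a dialog to the dialogs it calls, lists every dialog that has more than one parent and therefore has no single obvious owner in a nested layout. `shared_dialogs` and the sample data are hypothetical.

```python
def shared_dialogs(references: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return each dialog referenced by more than one parent, with its parents."""
    parents: dict[str, set[str]] = {}
    for parent, children in references.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)
    return {child: sorted(owners)
            for child, owners in parents.items() if len(owners) > 1}


# main calls a and b; both a and b call c.
references = {"main": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
ambiguous = shared_dialogs(references)
```

An empty result means the references form a tree and a nested layout is unambiguous; any entry is a dialog that would need a shared location or some linking mechanism.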

From another perspective, I agree that we are not designing the physical layout for collaboration, but I would argue that we probably should not end up with a structure that somehow restricts or limits collaboration on physical files.

A recursive structure, in my opinion, is very easy to get wrong if people ever touch the files themselves without knowing what went wrong. It's also hard to reason over the structure, for example to figure out how many language models are being used. If we don't want users to touch or reason over physical files manually, then why should we align the physical files recursively, why not a layout more friendly for both Composer and user with other tools? What do you guys think @vishwacsena @benbrown @christopheranderson

Member Author

It's easy to jump into the mental gymnastics of what are full-blown package manager and dependency resolution challenges, like tree/graph and local module linking, etc. My hope is we can table that discussion but keep it in mind when we make a decision here. But this is my answer to your question:

then why should we align the physical files recursively, why not a layout more friendly for both Composer and user with other tools?

While #1 physically is more nested it is still sensible to reason about and edit with some education and clarity. Can you expand more on how #1 restricts & limits collaboration scenarios? I don't immediately see that.

#2 feels limiting and I'm concerned it suits a point in time (now) that won't work in the future. What if a dialog/sub-tree of dialogs want their own settings file? What if a dialog/sub-tree of dialogs want their own schema definition? We're not nearly as boxed in with #1.

Contributor

I'm late (or early to the party depending on your perspective.)

The early design decision for ResourceExplorer was to make resource ids unique and location independent. The reasoning for this was that:
a. It mapped to flat storage easier
b. It allows people to organize files in any manner which makes sense to them.
c. It means references to resources are less brittle because they continue to be correct even if you move files around.

That said, having a convention for how Composer represents them, or for the way we decide to have templates organize things, seems like a good idea and will end up being something that people copy.

Some questions/comments I have are:
a. I don't get why making everything lowercase is a good idea. What is driving that? If it's to make it easier to avoid mistakes in references, we can make lookups case-insensitive, but case is super useful for readability.
b. I am definitely biased towards assets being co-located so that LU/Dialog/LG can be worked with locally, but I also believe there will be "global" shared assets which will be consumed by the things in the local. It feels like we aren't talking about that. For example, a bunch of LG templates defined at the root which are imported into the local LG files.

Member Author

@cwhitten cwhitten Feb 2, 2020

but I also believe there will be "global" shared assets which will be consumed by the things in the local. It feels like we aren't talking about that.

What's implied in this thread is the common.<locale-code>.lg convention, which exists in the proposal, and we do this today as well. The local .lg files will have the ability to import from this asset regardless of how Composer lays the files out. We should use a local .lg template that imports the shared asset automatically that users can then extend to their needs.

You bring up a good point that I mention in the RFC - I posit that as soon as it is ready, Composer takes a hard dependency on the JS Adaptive ResourceExplorer in its storage plugin so the asset resolution mechanisms are exactly the same.

Contributor

By "limiting or restricting collaboration" for #1, my assumption was that "people will collaborate, to some extent, on raw files, no matter what UI we provide".

Based on that assumption, my feeling is that a recursive structure is a little harder to collaborate on in some scenarios I was imagining, such as:

  • When copying and moving the files
    • It's a little hard for a user to quickly confirm that a bot was copied completely, because it's hard to spot at a glance that something is missing when files are nested, especially when the bot is big (Vodafone has thousands of dialogs).
    • I'm also concerned that it could easily exceed the path length limit on Windows (260 characters by default).
    • If I want to send all my .lg or .lu files to a translator or a writer (I assume this is a very common workflow, because I see many users build a simple version first and then send the language assets to a content writer without sending all the dialogs), I cannot easily locate all the files. And once I get back a new version of the .lg files, I can't easily get them back into my bot. With a flat structure, I can simply create a folder for that.
    • If something is wrong, the path in the error could be a little less readable, e.g. dialogA\dialogs\dialogC\dialogs\dialogsD\language-generation\a.lg.
    • If our lg\lu files refer to each other, will we use relative paths like [import](../../dialogA/LG/b.lg)? Moving the dialog will then break the reference. (Using ids and resourceExplorer to implement a customized importResolver can solve this.)

These are not big issues; it's just that thinking through these scenarios (not all of which may be valid) gives me a general feeling that a recursive structure is not friendly to physically copying, moving, and manipulating the files. I hope we take this into consideration.
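
To illustrate the id-based import idea from the last bullet, here is a minimal sketch that resolves import targets by file name rather than by relative path. It does not use the real resourceExplorer or importResolver APIs; `resolve_imports`, the regular expression, and the sample index are all assumptions made for illustration.

```python
import re

# Matches import lines written as in the comment above: [import](some/path.lg)
IMPORT_PATTERN = re.compile(r"\[import\]\(([^)]+)\)")


def resolve_imports(lg_text: str, index: dict[str, str]) -> list[str]:
    """Resolve each import target by file name (id) instead of by relative path."""
    resolved = []
    for target in IMPORT_PATTERN.findall(lg_text):
        # Keep only the last path segment, accepting either separator style.
        resource_id = re.split(r"[\\/]", target)[-1].lower()
        if resource_id not in index:
            raise KeyError(f"unresolved import: {target}")
        resolved.append(index[resource_id])
    return resolved


# Hypothetical id -> current location index, e.g. built by scanning the bot folder.
index = {"b.lg": "dialogs/dialogA/language-generation/b.lg"}
stale = "[import](../../dialogA/LG/b.lg)\n# Greeting\n- Hello\n"
by_id = "[import](b.lg)\n# Greeting\n- Hello\n"
from_stale = resolve_imports(stale, index)
from_id = resolve_imports(by_id, index)
```

Both forms resolve to the same file, so a relative path that went stale after a move would still load as long as the file name stays unique.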

Regarding the flexibility you talked about

#2 feels limiting and I'm concerned it suits a point in time (now) that won't work in the future. What if a dialog/sub-tree of dialogs want their own settings file? What if a dialog/sub-tree of dialogs want their own schema definition? We're not nearly as boxed in with #1.

If we organize the dialogs as a tree\sub-tree, we definitely gain some extra room to configure\customize per tree\sub-tree, but the cost is that we have to organize the dialogs as a tree.

If this is the last chance to change the folder structure, a structure with less flavor is perhaps more likely to last than a structure with more flavor.

And in the end, we should in any case pick resourceExplorer in JS to identify and load resources. What's missing today in resourceExplorer is creating resources following some pattern\layout; that's a gap Composer needs to fill.

Contributor

I'm even later than Tom. A few things:

  1. In the generator we currently put all .lg/.lu files into a single localized directory, i.e. en-us. We also have a single top-level .lg/.lu file in that directory which points to all of the component .lg/.lu files.
  2. For .dialog files, the assumption seems to be that there is a single .dialog file with all assets inline. In the generated dialogs case we make use of the ability to refer to named dialog files to split out all of the individual trigger dialogs. This makes it much easier to look at each dialog--they all fit on less than a page. Would I still be able to do that?

Part of the reason that I care about breaking things up into smaller files is that there is important information in the structure of the filename which should allow being intelligent about merging regenerated assets. We don't have to have separate files if we support id as a first class thing when defining things inline. An id is either explicitly specified inline or cannot be specified and comes from the filename.
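
The "explicit id, otherwise the filename" rule could look roughly like the sketch below. It assumes that an inline id would be a top-level string property named `id` in a JSON asset; that property name and the `resource_id` helper are hypothetical, not taken from the .dialog schema.

```python
import json
from pathlib import PurePath


def resource_id(file_name: str, content: str) -> str:
    """Use an explicit top-level "id" if present, else fall back to the file name stem."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        data = None  # not JSON (e.g. an .lg or .lu file), so only the name is available
    if isinstance(data, dict) and isinstance(data.get("id"), str) and data["id"]:
        return data["id"]
    return PurePath(file_name).stem


explicit = resource_id("foo.dialog", '{"id": "bar"}')
implicit = resource_id("foo.dialog", "{}")
```

With a rule like this, regenerated assets can be matched to existing ones by id whether they were split into separate files or defined inline.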

Contributor

@vishwacsena vishwacsena Feb 5, 2020

I like #1 but I have a few questions (see below).

Also, I was sharing some of my personal experience working on internationalization/localization with @cwhitten the other day. In my prior experience on both Internet Explorer and Cortana, localization is more effective when the person localizing is enabled with three things: a) the ability to see the full context of what's going on rather than mere strings in a text file, b) the ability to readily test their changes, and c) the ability to always see the base language version of the exact same string (so they are not just overwriting the base language string and then losing context on what the old string was).

So my 2c: we should not gate our decision on enabling purely file-based localization. Instead, just have the localization team use Composer, and use source control to reject any changes to .dialog files. In fact, this is how IE was able to ship 60+ languages on the same day as English.

With that said, @cwhitten, a few questions for option #1:

/coolbot
  coolbot.dialog
  /language-generation
Can you elaborate a bit on the logic for deciding when to create a 'lang-locale' sub-folder? In some cases I see one, but in other cases we write out a name.lang-locale.ext file directly.
    /en-us
      common.en-us.lg
Can you help clarify why a .dialog file shows up under LG in this case?
      coolbot.en-us.dialog
  /language-understanding
Same as the previous comment: it is unclear what logic we'd use to decide whether to have a lang-locale sub-folder.
    main.en-us.lu
  /dialogs
    /foo
      foo.dialog
      /language-generation
        /en-us
          foo.en-us.dialog
      /language-understanding
        foo.en-us.lu

@cwhitten cwhitten requested a review from a-b-r-o-w-n as a code owner February 1, 2020 18:57
@chrimc62
Contributor

chrimc62 commented Feb 5, 2020 via email

@a-b-r-o-w-n a-b-r-o-w-n deleted the cwhitten/multi-locale branch April 8, 2020 15:09
6 participants