Automate generation of @en-tagged versions of annotations #840

rjyounes · 2023-05-11T16:17:06Z

See issue #685.

The `@en`-tagged versions will go into a separate file, so the main files keep the canonical, untagged versions.
They will be generated as part of the pre-commit hook so that they stay up to date for development purposes.

Assigning to @Jamie-SA, to be delegated if desired.

rjyounes · 2023-05-11T16:38:25Z

TBD: File naming conventions, file and directory structure

One directory for each language or a shared directory for all languages?
One file per language, or mirroring the canonical files with 5 separate files? If the latter, there should be a directory per language to avoid clutter.
The release package generates multiple serializations of each file, so having one directory per sub-language seems cleaner and more manageable.
Adding status "under review" until these questions are answered.

Also TBD whether to use 2- or 4-letter tags. See discussion thread on issue #685.

rjyounes · 2023-05-16T19:02:39Z

Not sure if this helps, but ontology-toolkit provides an example of how this could be done using `onto_tool` (though the example query replaces the non-tagged version rather than adding to it).

SPARQL tools apply a SPARQL Update query to each input file and serialize the resulting graph into the output file. RDF format is preserved unless overridden with the format option. If the query is specified inline, template substitution will be applied to it, so bundle variables can be used, but double braces ({{ instead of {, }} instead of }) have to be used to escape actual braces.
```yaml
  # Replaces each untagged skos:prefLabel with a copy tagged '{lang}'.
  - name: "add-language-en"
    type: "sparql"
    query: >
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      DELETE {{
        ?subject skos:prefLabel ?nolang .
      }}
      INSERT {{
        ?subject skos:prefLabel ?withlang .
      }}
      WHERE {{
        ?subject skos:prefLabel ?nolang .
        FILTER(lang(?nolang) = '')
        BIND(STRLANG(?nolang, '{lang}') AS ?withlang)
      }}
```
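The quoted query replaces the untagged label. As a rough sketch of the additive behaviour this issue asks for (keep the untagged literal and add a tagged copy next to it), here is the same logic in plain Python over an in-memory list of triples. The `Literal` tuple, the `add_language_tags` function, and the sample data are hypothetical stand-ins for illustration; they are not part of onto_tool or gist.

```python
from typing import NamedTuple

SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal: text plus optional language tag."""
    text: str
    lang: str = ""  # empty string means "no language tag", as with SPARQL lang()


def add_language_tags(triples, lang="en", predicates=(SKOS_PREF_LABEL,)):
    """Return the input triples plus a lang-tagged copy of each untagged literal.

    Unlike the DELETE/INSERT query, nothing is removed: the untagged
    literal is kept and a tagged twin is added alongside it.
    """
    result = list(triples)
    seen = set(triples)
    for subject, predicate, obj in triples:
        if predicate not in predicates:
            continue
        if not isinstance(obj, Literal) or obj.lang != "":
            continue  # skip IRIs and literals that already carry a tag
        tagged = (subject, predicate, Literal(obj.text, lang))
        if tagged not in seen:  # avoid duplicates on repeated runs
            seen.add(tagged)
            result.append(tagged)
    return result


# Illustrative data only; the subject name is made up.
triples = [("ex:Thing", SKOS_PREF_LABEL, Literal("Thing"))]
tagged_triples = add_language_tags(triples, lang="en")
```

Because already-tagged literals are skipped and duplicates are filtered, running the function again on its own output leaves the list unchanged, which is the property a pre-commit hook would need.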

rjyounes · 2023-05-27T15:28:10Z

Team discussion on 2023-05-25 led to the following conclusions:

* Related issues:
    * [Automated generation of en version](https://github.com/semanticarts/gist/issues/840)
    * [Sample French version](https://github.com/semanticarts/gist/issues/841)
    * [Sample Spanish version](https://github.com/semanticarts/gist/issues/851) (Pat McBennett to submit PR)
    * [Documentation](https://github.com/semanticarts/gist/issues/842)
* Open discussion points
    * Language or language+subtag? Subtag could be region or script. E.g., in Chinese the script is of primary significance in written text, not the region. From John Cowan.
        * Maybe language only if there are no significant regional distinctions, otherwise language+region? (Note: for accuracy, subtags are not always 2 letters.)
        * Don't need to prescribe, just let people submit what they want to? We could reject a region tag where there are no regional distinctions.
        * What do we want to do for our versions?
            * English - use language only, may change in future
            * French - use language only, may change in future
    * How do we review PRs in languages in which we have no expertise?
        * Trust factor
        * In-house expertise
        * Test in Google Translate (to English). If the translations look OK, accept them. If not, reject the whole batch.
        * Add a disclaimer for those we haven't been able to review, noting also that they may not be up to date as gist evolves.
        * SA in-house expertise:
            * French: Doug, Jess
            * German: Rebecca
            * Arabic: Dalia
            * Simplified and Traditional Chinese: Katie (some)
            * Japanese: Katie (some)
            * Russian: Irina, Boris
            * Hebrew: Boris (some)
            * Italian: Irina
    * No default untagged version - our default should be `en-US`. From John Cowan.
        * Then this is a major change.
        * Decision: keep untagged versions as defaults in ontology files. Can change later in a major release.
    * File names and directory structure.
        * Only release in Turtle.
        * Most likely only gistCore, not supplementary ontologies, will be translated.
        * There might be a translation of the readme if someone wants to do it.
        * So, we'll only have one ontology file per language. Naming conventions:
            * `gistCore.en.ttl`
            * `README.en.md`
        * So use one directory. [We didn't decide on a directory name, I propose `language_tagged_annotations` or similar.]

rjyounes · 2023-09-28T16:01:30Z

Discussion of whether to put English-tagged versions in a separate file, with a default untagged version, or tag the current version with English tags. If we have a default untagged version, this is the one that would be edited.

The latter breaks backward compatibility.

The existing URL would serve the untagged version, if we have one. The default language-tagged version would be `gistCore.en.ttl`.

We decided previously to use only language, not language + region.

Non-English-tagged versions will go in separate files.

Boris:

Put all tagged versions in the core file, not in separate files, since separate files introduce a maintenance problem.
In the next minor release, create two versions:
- (1) only English labels without tags (for backward compatibility)
- (2) full file, all languages. This will retain backward compatibility.
On a major release, we flip it: the internationalized version becomes the default, and the English-only version becomes non-default and deprecated.

Rebecca: Keeping everything in core file means whenever there's a commit, you have to get a review from a language expert.

Boris: Have a completely separate ontology for each language, containing only annotations, with a dependency on a certain version of gist.

Rebecca: overhead too high, let's not do it. We don't have a large enough team to support internationalization. Some firms subcontract out internationalization.

Peter: Use content negotiation to get different language versions.
Minimally tag everything with @en, rather than having an untagged version. That gets our foot in the door of internationalization, and we can see how far we want to go in future.

Jamie: Keeping in separate files will decrease the burden.

Rebecca: Then we have a maintenance issue.

Rebecca: Internationalized versions have to be very precise to handle the level of precision we put into our annotations.

Boris: Have a gradual on-ramp. Difficult verbiage is in definitions. So maybe first step is to just do pref labels. Do first major version with labels only, and see how much of a burden it is to maintain them before moving forward.
If we are concerned about maintaining a separate file, have explicit version dependencies, merge during release process if the versions are compatible.
Use SHACL to require English tags, and no more than one prefLabel and definition per language.
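As a rough illustration of the two constraints Boris mentions (an English-tagged value must exist, and no language may appear more than once), the sketch below performs the equivalent check in plain Python rather than SHACL. The `check_labels` function and the sample labels are hypothetical. In SHACL itself, `sh:uniqueLang` is the constraint aimed at the one-value-per-language rule.

```python
from collections import Counter


def check_labels(labels, required_lang="en"):
    """Check one subject's values for a single property, e.g. skos:prefLabel.

    `labels` is a list of (text, lang) pairs; lang "" means untagged.
    Returns a list of human-readable problems (empty list means OK).
    """
    problems = []
    # Count tagged values per language; untagged values are ignored here.
    counts = Counter(lang.lower() for _, lang in labels if lang)
    if counts[required_lang.lower()] == 0:
        problems.append(f"missing @{required_lang} value")
    for lang, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"{n} values tagged @{lang}, expected at most 1")
    return problems


# Illustrative data only.
ok = check_labels([("Person", "en"), ("Personne", "fr")])
bad = check_labels([("Person", ""), ("Persona", "es"), ("Individuo", "es")])
```

Here `ok` is an empty list, while `bad` reports both the missing English tag (the untagged "Person" does not count) and the duplicate Spanish values.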

Rebecca: How many users do we really have that don't know good English? We should acknowledge that English is the lingua franca of the corporate world, finance, business, science, academia, IT, engineering...any domain we are likely to enter.

Peter: Have English tags, don't do others if the maintenance is too high.

If you have multiple tags, you have to query as:
- `FILTER(langMatches(lang(?label), 'en'))` - matches `en` and any tag that extends it, such as `en-US`
- `SELECT ?v WHERE { ?v ?p "cat"@en }` - exact match on both the string and the tag
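To make the difference between the two query styles concrete, here is a minimal Python sketch of basic language-range matching, the kind of comparison `langMatches` performs, next to an exact comparison. The `lang_matches` function and the sample labels are hypothetical, and the sketch simply treats untagged literals as never matching.

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Basic language-range filtering in the style of SPARQL langMatches()."""
    tag = tag.lower()
    lang_range = lang_range.lower()
    if not tag:
        return False  # simplification in this sketch: untagged never matches
    if lang_range == "*":
        return True  # wildcard matches any tagged literal
    # A range matches the identical tag or any tag that extends it with "-".
    return tag == lang_range or tag.startswith(lang_range + "-")


# Illustrative (text, lang) pairs; "" means untagged.
labels = [("cat", "en"), ("cat", "en-US"), ("cat", "en-GB"), ("chat", "fr"), ("cat", "")]

# Range match: every English variant is found.
range_hits = [label for label in labels if lang_matches(label[1], "en")]

# Exact match, as in ?v ?p "cat"@en: only the literal tagged exactly "en".
exact_hits = [label for label in labels if label == ("cat", "en")]
```

With these labels the range match returns the three English variants, while the exact comparison returns only `("cat", "en")`. This is why existing queries written against untagged or exactly tagged literals would need to change once several tags are in play.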

rjyounes · 2023-10-12T15:54:33Z

Rebecca - Do nothing

Mark - Don't put language tags in the primary gist file.

Questions:

Do we add any language tags?
If yes, do we offer a non-tagged version?
If we are adding tags, do they go in the main file or ancillary files?

DECISION:

Do nothing until we have a business case.
- Disrupts SPARQL queries
- Large maintenance issue
- No current demand from gist users
If/when a need arises, we will create the infrastructure to do so

rjyounes assigned Jamie-SA May 11, 2023

rjyounes added the `status: under review` label May 11, 2023

This was referenced May 11, 2023

Modify documentation related to language-tagged annotations #842

Closed

Should gist add explicit English language tags to all its existing SKOS annotation values? #685

Closed

rjyounes added and removed the `impact: major` label May 11, 2023

rjyounes added the topic: language tags label Sep 28, 2023

rjyounes closed this as not planned Oct 12, 2023

rjyounes unassigned Jamie-SA Oct 12, 2023

rjyounes removed the `status: under review` label Oct 12, 2023

This was referenced Oct 12, 2023

Add sample @fr-tagged version of gist annotations #841

Closed

Add sample file of Spanish-tagged version of gist annotations #851

Closed
