Should `gist` add explicit English language tags to all its existing SKOS annotation values? #685

pmcb55 · 2022-06-12T01:54:12Z

Currently the skos:prefLabel and skos:definition (and skos:scopeNote and skos:example, etc.) values for all terms in gist are only provided in English, yet they are not explicitly language tagged with @en.

To promote international reuse of gist, and to denote more explicitly the intended use of these values (I assume they are meant for human users to read and interpret), I would strongly advocate always adding explicit language tags for such predicates. Plain xsd:string literals should be kept for predicates whose values would never make sense to translate across human languages, e.g., social security numbers (which may contain hyphens) or alphanumeric passport numbers.


Jamie-SA · 2022-06-13T21:49:21Z

This request has come up several times. I am not sure why we haven't done this yet.

#311
#228
#185
#85

rjyounes · 2022-06-13T22:19:43Z

It's worth reading the discussion thread on #311 to get some insight into why we haven't done this in the past - essentially, the assumption is we have no non-English-speaking users and therefore no use case. But as use of gist broadens, we may well have international users.

If we do want to do it, rdf-toolkit has this option:

 -osl,--override-string-language <arg> 
 sets an override language that is applied to all strings

We could add this argument in the pre-commit hook.

However, this will over-generate in a couple of cases:

* The gist:license value, though it has been suggested that this be an object rather than a datatype property (see "Should gist:license be an object property?" #682).
* The values of vann:preferredNamespacePrefix and vann:preferredNamespaceUri, if we adopt them as suggested in "Should gist describe its preferred prefix and namespace URI using VANN?" #684.

Perhaps it's better placed in the bundler, where we can specify exactly the properties we want it to apply to (note: all annotation properties, not just skos:prefLabel and skos:definition).
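
A minimal sketch of the selective approach described above, assuming a toy in-memory triple model rather than the actual bundler, whose API is not shown in this thread. `Literal`, `TAGGED_PROPERTIES`, and `add_language_tags` are hypothetical names, the allow-list is illustrative only, and the license URL is a placeholder.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Toy stand-in for an RDF literal: lexical form plus optional language tag."""
    value: str
    lang: Optional[str] = None


# Hypothetical allow-list; a real one would name every annotation property in use.
TAGGED_PROPERTIES = {"skos:prefLabel", "skos:definition", "skos:scopeNote", "skos:example"}


def add_language_tags(triples, lang="en"):
    """Return a copy of `triples` where untagged values of allow-listed
    properties carry `lang`; everything else is passed through unchanged."""
    result = []
    for subject, predicate, obj in triples:
        if predicate in TAGGED_PROPERTIES and isinstance(obj, Literal) and obj.lang is None:
            obj = Literal(obj.value, lang)
        result.append((subject, predicate, obj))
    return result


triples = [
    ("gist:Organization", "skos:prefLabel", Literal("Organization")),
    # Placeholder value: a license URL should not receive a language tag.
    ("gist:gistCore", "gist:license", Literal("https://example.org/license")),
]
tagged = add_language_tags(triples)
```

Unlike a blanket override of every string, the allow-list leaves the `gist:license` value untagged.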

Jamie-SA · 2022-06-14T15:17:27Z

Should it be a post processing step? Why not just change the source files and then only verify correctness in the bundling step? Similar to some of the queries Boris wrote for validating style.

Also think through how we'd want to handle having multiple languages in/for gist.

rjyounes · 2022-06-14T15:44:35Z

That works too, but I do think we should use the bundler to verify it in case people forget, as you suggest.
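
As a rough illustration of such a verification step, the sketch below lists annotation values that lack a language tag. It assumes literals are modeled as `(value, lang)` tuples; `untagged_annotations` is a hypothetical helper, not an existing bundler action, and the sample data is illustrative rather than copied from gist.

```python
def untagged_annotations(triples, properties):
    """Return (subject, predicate) pairs whose literal has no language tag.

    Each triple is (subject, predicate, (value, lang)); lang is None or ""
    when the literal is untagged.
    """
    return [
        (subject, predicate)
        for subject, predicate, (_value, lang) in triples
        if predicate in properties and not lang
    ]


ANNOTATIONS = {"skos:prefLabel", "skos:definition"}  # illustrative subset

sample = [
    ("gist:Person", "skos:prefLabel", ("Person", "en")),
    ("gist:Person", "skos:definition", ("A human being.", None)),
    ("gist:gistCore", "gist:license", ("https://example.org/license", None)),
]
missing = untagged_annotations(sample, ANNOTATIONS)
```

A pre-commit or bundling check could fail whenever `missing` is non-empty.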

JonathonGist · 2022-06-29T14:26:56Z

Excellent. I was hoping you all would add @en where appropriate.

rjyounes · 2022-09-08T16:15:32Z

Considerations:

* @en tags could be added by the bundler.
* This is a breaking change, not in terms of inferences, but it will break SPARQL queries that match untagged literals.
* The bundler could add non-tagged versions of the annotations to rdfs:Annotations, for users who don't want to update their queries.
* Alternatively, Jamie suggests keeping the non-tagged versions in gistCore.ttl, at least as a transitional device, so users don't have to add an additional import.
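
To make the breakage concrete: in RDF, `"Organization"` and `"Organization"@en` are distinct terms, so a query that matches the untagged literal exactly stops returning results once the tag is added. The sketch below mimics that with a toy `Literal` type (an assumption for illustration, not a real RDF library); the SPARQL fragments in the docstrings show the analogous query patterns.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Toy RDF literal; equality compares both lexical form and language tag."""
    value: str
    lang: Optional[str] = None


def subjects_with_label(triples, label):
    """Exact term match, like the pattern: ?s skos:prefLabel "Organization" ."""
    return [s for s, p, o in triples if p == "skos:prefLabel" and o == label]


def subjects_with_label_text(triples, text):
    """Match on the lexical form only, like: FILTER(STR(?label) = "Organization")."""
    return [s for s, p, o in triples if p == "skos:prefLabel" and o.value == text]


untagged = [("gist:Organization", "skos:prefLabel", Literal("Organization"))]
tagged = [("gist:Organization", "skos:prefLabel", Literal("Organization", "en"))]

before = subjects_with_label(untagged, Literal("Organization"))
after = subjects_with_label(tagged, Literal("Organization"))
robust = subjects_with_label_text(tagged, "Organization")
```

Here `before` finds the class, `after` is empty, and `robust` finds it again because it ignores the tag.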

Needs further discussion, so moving to December release.

Jamie-SA · 2023-01-26T16:02:59Z

Rebecca is correct, this likely will break someone's queries or code. But I think we should do it. I'll volunteer to do the PR.

I think we should schedule it for the next major release (probably the Units Of Measure changes release).

But there is another problem: if we have only one language, adding a second later would also break queries and code. It seems we have two options: 1) keep translations in a separate file, or 2) add at least one more language at the same time, so that people have to address multiple languages at the same time as the switch to language tags.

Dealing with multiple languages might be harder, for most of us who haven't had to deal with it yet, than just the change of literal type. So part of me likes the idea of keeping other languages in separate files. But I also think it might be a good idea to push us to start preparing for it, and therefore we should put them into the main file. Of course, we could release multiple versions and let people choose.

See also these discussions about dealing with multiple languages:

rjyounes · 2023-01-26T16:14:14Z

I don't consider myself anglocentric, but I don't think ontology terms need to have language versions until we have a specific use case for that. Instances are a different issue. In the Treasury of Lives project, for example, data properties like names and pref labels in taxonomies have different versions for different languages and transliterations. But ontology terms are not user-facing and I see no reason to provide alternate language versions for maintainers of the ontology - again, until we have a use case, which would have to be a gist developer who knows little or no English - arguably extremely unlikely.

Jamie-SA · 2023-01-26T17:36:10Z

I disagree. I think you are being quite anglocentric.

This isn't just for maintainers of gist, it is also for users of gist. I think we do want to increase the adoption of gist, don't we? If you want non-native English speaking developers (quite a bit of the world) to adopt gist, then gist should provide, or at least allow for, annotations in multiple languages.

And, when you move towards model/metadata driven architectures, Ontology terms do surface to the UI.

rjyounes · 2023-01-27T01:26:24Z

It's not anglo-centric to acknowledge that English is the de facto lingua franca in the worlds of business, finance, tech, industry, and academia; it's just a plain fact. I bet every professional ontologist has read the English texts on the subject.

Your final point is convincing, though, so I'll retract my disagreement.

rjyounes · 2023-01-27T13:15:07Z

Do we want to add @en or @en-us, so that our colleagues in the UK can use "Organisation"@en-gb?

JonathonGist · 2023-01-27T13:28:34Z

I think that it would be good to use @en-us, so that @en-GB can be used where needed.

uscholdm · 2023-01-27T20:33:42Z

I wonder what FIBO does.

rjyounes · 2023-03-23T15:54:18Z

Decision:

Use two-letter language tags, rather than 4-letter language + country tags

This is a breaking change due to differences in query results. Deferred to release 13.0.0.

rjyounes · 2023-05-11T16:10:53Z

If someone wants to add a language, they can submit a PR and continue to maintain it.

Should we have a version of each annotation without a language tag?

Should we require a complete implementation of all annotations, or should we allow someone to implement piecemeal?
Steven: Then we would have to maintain it with every update.
Rebecca: We can suggest it.

Michael: Another option is to ask people to create files with their translations.
Jamie: Or put them in a separate file in our repo.

Suggestion: Canonical version doesn't have a tag. All tagged versions go in separate files.
We'll implement the English one as a starter.
Also a partial (or complete) French version.

Jamie: Automate English version and add to pre-commit hook (dev team)

rjyounes · 2023-05-11T16:39:33Z

TBD: File naming conventions, file and directory structure

One directory for each language or a shared directory for all languages?
One file per language, or mirroring the canonical files with 5 separate files? If the latter, there should be a directory per language to avoid clutter.
The release package generates multiple serializations of each file, so having one directory per sub-language seems cleaner and more manageable.

Setting status to "under review" until these questions are answered.

rjyounes · 2023-05-11T16:45:46Z

Added individual issues:

#840 - Automated generation of en version
#841 - Sample French version
#851 - Sample Spanish version
#842 - Documentation

This issue will be closed once the file and directory structure questions have been resolved.

pmcb55 · 2023-05-17T12:45:29Z

Just a very quick comment - in all my US-based customer engagements, the first (and so far only!) additional language they've all requested has been Spanish. (I just mention this as I see you're leaning towards French, but I would have thought most current gist users are probably US-based, and therefore wouldn't Spanish perhaps be a more 'useful' 2nd language for that user-base...?)

rjyounes · 2023-05-17T13:12:47Z

@pmcb55 This is a good point. A partial French version was going to be provided as an example of how to do it, because we have two French speakers on board that could do it. If you are able to provide a Spanish version instead that would be fine. It doesn't have to be complete, just some examples.

pmcb55 · 2023-05-17T13:46:47Z

> If you are able to provide a Spanish version instead...

@rjyounes yeah, you've got a deal! Since this is just some examples, why don't we just add both French and Spanish (we have native French and Spanish speakers here at Inrupt - but how about you provide the French examples, and I'll ask a couple of our Spanish speakers to contribute Spanish too).
If you just point me at the French examples once they're ready (at patm@inrupt.com), I'll take it from there with the Spanish version.

pmcb55 · 2023-05-17T14:30:50Z

On the question of 2- or 4-letter language tags, I'd suggest sticking with 2-letter tags in the main, but biting the bullet now and applying 4-letter tags in the very few situations where they are applicable, e.g.:

gist:Organization a owl:Class ;
    skos:definition "A generic organization that can be formal or informal, legal or non-legal. It can have members, or not."@en-US ;
    skos:definition "A generic organisation that can be formal or informal, legal or non-legal. It can have members, or not."@en-GB ;
    skos:prefLabel "Organization"@en-US ;
    skos:prefLabel "Organisation"@en-GB .

I say 'bite the bullet now' above, because yeah, this indeed puts an added burden on clients/users of gist - in that they now not only need to be language tag aware (but that's a 'good thing', as even the USA is (I believe) rapidly becoming multilingual, especially in the Southern USA), but they also now need to be aware of 4-letter tags (but, I'd argue, this is also a 'good thing').
I think it's important to appreciate that this is not just an AmericanEnglish/EnglishEnglish thing, but also a CanadianFrench/FrenchFrench thing, and also a SouthAmericanSpanish/SpanishSpanish thing, etc.
And given gist's Enterprise/Commercial focus, and its laudable global ambitions, I think it would be a missed opportunity to go in only half-heartedly on this question (i.e., 2-letter tags only). Better to preempt, from the get-go, the frustrations that limiting to 2-letter language tags will inevitably raise.
(And just from a purely selfish perspective, Inrupt continues to be in active discussions with multiple governments and regions around the world, and again and again the 'language variation question' comes up - i.e., it can be a fiercely political issue in many parts of the world - so gist actively using (and therefore encouraging clients to actively support) 4-letter language tags would be a big step forward for everyone!)
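
To give a feel for the client-side burden mentioned above, here is a small sketch of label selection with fallback: try the full tag, then progressively shorter prefixes, then an untagged value. It is a simplified take on the "lookup" idea in RFC 4647, not a complete implementation, and `fallback_chain` and `pick_label` are hypothetical helper names.

```python
def fallback_chain(tag):
    """'en-GB' -> ['en-GB', 'en']: the tag followed by its shorter prefixes."""
    parts = tag.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def pick_label(labels, preferred):
    """Pick the best label for `preferred` from a dict of tag -> text.

    The key None holds an untagged label, used as the last resort.
    Tag comparison is case-insensitive.
    """
    by_tag = {(tag.lower() if tag else None): text for tag, text in labels.items()}
    for candidate in fallback_chain(preferred.lower()):
        if candidate in by_tag:
            return by_tag[candidate]
    return by_tag.get(None)


region_only = {"en-US": "Organization", "en-GB": "Organisation"}
with_generic = {"en": "Organization", "en-GB": "Organisation"}

uk = pick_label(region_only, "en-GB")
au_region_only = pick_label(region_only, "en-AU")
au_with_generic = pick_label(with_generic, "en-AU")
```

With only region-tagged labels, a client asking for `en-AU` finds nothing; with a generic `en` label available, the same request falls back to it.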

Jamie-SA · 2023-05-17T15:48:24Z

@pmcb55 thanks for the comments. I agree with adding 4 letter tags, it basically starts everyone on the path of being able to handle it from the beginning.

But I have a clarifying question: would you use 2 letter language tags when there is no difference? Or do you duplicate everything even when there is no difference between the 4 letter language tag versions?

rjyounes · 2023-05-17T17:10:31Z

@pmcb55 You've convinced me of the four-letter language tags. @Jamie-SA You would use the same tags throughout, even if the languages coincide. That would certainly decrease the burden on querying.
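
The querying burden can be illustrated with the matching rule behind SPARQL's `langMatches`, which follows RFC 4647 basic filtering: a range matches a tag if they are equal ignoring case, or if the tag starts with the range followed by a hyphen. The function below is a hand-written sketch of that rule, not a library call.

```python
def lang_matches(tag, lang_range):
    """Basic filtering: does language tag `tag` fall within `lang_range`?"""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        # The wildcard matches any literal that has a language tag at all.
        return tag != ""
    return tag == lang_range or tag.startswith(lang_range + "-")


# A range of "en" picks up every English variant ...
broad = [t for t in ["en", "en-US", "en-GB", "fr"] if lang_matches(t, "en")]

# ... but a range of "en-US" does not pick up labels tagged plain "en".
narrow = [t for t in ["en", "en-US", "en-GB", "fr"] if lang_matches(t, "en-US")]
```

So if some labels were tagged `en` and others `en-US`, a query filtering on `en-US` would silently miss the former, which is one argument for using the same tag throughout.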

rjyounes · 2023-05-17T18:14:13Z

I would like to get @mkumba's input on the 4-letter tags before moving forward. He expressed a preference for 2-letter tags.

mkumba · 2023-05-17T20:05:17Z

Yeah, the 4-letter tag comes into play about 1% of the time, and even when it fails, it's a soft fail. Most Americans know that colour is really color and most Brits know color is really colour, so there isn't much lost. The real debate is whether or when to allow un languished strings.


uscholdm · 2023-05-18T01:22:32Z

> The real debate is whether or when to allow un languished strings

Yes that is the question - perhaps gist strings have languished in their un-languaged form for long enough! :-)

rjyounes · 2023-05-19T12:29:57Z

I thought we had decided to have the non-tagged strings in gistCore, with the tagged versions in different files? Or is there a pun I'm not getting?

One thing that occurred to me is that it would be optimal to clean up our gist annotations before we implement the samples in other languages. We don't want all the problems we know exist with our annotations to be simply carried over into other languages. We can proceed with generating the English versions automatically, but IMO we should do the full clean-up of our existing annotations before doing any work on other languages. This will cause a delay, of course, but we can make a concerted effort on the current annotations and make them a priority for the next release. I doubt if there'll be anyone banging down the door in the next 6 months wanting to implement a non-English version.

johnwcowan · 2023-05-20T11:51:18Z

It's important to be flexible about the length of a language tag.

For Japanese, the appropriate tag is simply "@ja", because there is no country other than Japan where Japanese is an important language. We do not need to tag anything "@ja-JP", for example, because there is no contrast with "@ja-US" or "@ja-EU" or what have you.

On the other hand, not all languages have 2-letter tags. One of the official languages of the Cook Islands, for example, is Cook Islands Maori (not the same as New Zealand Maori), whose tag is "@rar". Across the entire range of languages (about 7000), most have 3-letter tags, though most of the commercially important ones have 2-letter tags.

On the gripping hand, the important distinction for written Chinese is traditional vs. simplified characters, which are tagged "@zh-Hant" and "@zh-Hans" respectively, and by the same token for Azerbaijani what matters is the script: Latin (@az-Latn), Cyrillic (@az-Cyrl), or Arabic (@az-Arab), as most literate Azeris only know one script.

Links:

Language Subtag Registry

BCP 47. See RFC 5646 (first part of BCP 47), section 4.1 for guidelines for constructing language tags from subtags.

rjyounes · 2023-05-20T14:24:35Z

@johnwcowan Thanks, this is good information. W3C has a helpful discussion of tags and subtags which confirms what you are saying. We've been referring over-simplistically to "two" and "four" letter tags: the correct distinctions are among the language subtag, the region subtag, and the script subtag.

I think you're right that we need not prescribe a certain level of tagging. If someone wants to provide a specifically Canadian French version as opposed to a generic French version, they can do so. On the other hand, we may choose not to accept a PR that uses "ja-JP" tags and request that they be re-written as simply "ja". The primary open implementation question for us is whether our auto-generated English tags should be "en" or "en-US".
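
As a rough illustration of the language/script/region distinction, the sketch below splits a tag into those three subtags. It covers only the common `language[-script][-region]` shape and deliberately ignores the rest of BCP 47 (variants, extensions, private use, and so on); `split_tag` is a hypothetical helper.

```python
import re

# language: 2-3 letters; script: 4 letters; region: 2 letters or 3 digits.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_tag(tag):
    """Return (language, script, region), using the conventional casing
    (lowercase language, Titlecase script, UPPERCASE region)."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed language tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return (
        match.group("language").lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )


examples = {tag: split_tag(tag) for tag in ["en-US", "zh-Hant", "az-Cyrl", "rar", "ja"]}
```

The examples from this thread come out as a region subtag for `en-US`, a script subtag for `zh-Hant` and `az-Cyrl`, and a bare language subtag for `rar` and `ja`.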

johnwcowan · 2023-05-21T09:46:20Z

Minor correction: Japanese is 'ja', whereas Japan is 'JP'.

I recommend the use of 'en-US', since the existing text is written in U.S. English.

In general, anything meant to be linguistic data (descriptions, etc.) as opposed to codes, should be language-tagged. So unless it's a requirement for backward compatibility, I recommend tagging all the linguistic data rather than providing an untagged form. Untagged text should be confined to codes of various sorts.

rjyounes · 2023-05-21T10:57:26Z

If we make this a minor change, we need the untagged versions for backward compatibility. If it's a major change, we don't. This is subject to discussion.

rjyounes · 2023-05-25T15:45:03Z

Closing: all to-do items moved to other issues.

#840 - Automated generation of en version
#841 - Sample French version
#842 - Documentation

rjyounes · 2023-05-27T15:25:48Z

Team discussion on 2023-05-25 led to the following conclusions:

* Related issues:
    * [Automated generation of en version](https://github.com/semanticarts/gist/issues/840)
    * [Sample French version](https://github.com/semanticarts/gist/issues/841)
    * [Sample Spanish version](https://github.com/semanticarts/gist/issues/851) (Pat McBennett to submit PR)
    * [Documentation](https://github.com/semanticarts/gist/issues/842)
* Open discussion points
    * Language or language+subtag? Subtag could be region or script. E.g., in Chinese the script is of primary significance in written text, not the region. From John Cowan.
        * Maybe language only if there are no significant regional distinctions, otherwise language+region? (Note: for accuracy, the subtags are not always 2 letters.)
        * Don't need to prescribe, just let people submit what they want to? We could reject a region tag where there are no regional distinctions.
        * What do we want to do for our versions?
            * English - use language only, may change in future
            * French - use language only, may change in future
    * How do we review PRs in languages we have no expertise in?
        * Trust factor
        * In-house expertise
        * Test in Google Translate (to English); if they look OK, accept them. If not, reject the whole batch.
        * Add disclaimer for those we haven't been able to review, and also that they may not be up-to-date as gist evolves.
        * SA in-house expertise:
            * French: Doug, Jess
            * German: Rebecca
            * Arabic: Dalia
            * Simplified and Traditional Chinese: Katie (some)
            * Japanese: Katie (some)
            * Russian: Irina
            * Italian: Irina
    * No default untagged version - our default should be `en-US`. From John Cowan.
        * Then this is a major change.
        * Decision: keep untagged versions as defaults in ontology files. Can change later in a major release.
    * File names and directory structure.
        * Only release in Turtle.
        * Most likely only gistCore, not supplementary ontologies, will be translated.
        * There might be a translation of the readme if someone wants to do it.
        * So, we'll only have one ontology file per language. Naming conventions:
            * `gistCore.en.ttl`
            * `README.en.md`
        * So use one directory. [We didn't decide on a directory name, I propose `language_tagged_annotations` or similar.]

rjyounes · 2023-10-12T16:06:09Z

See additional discussion and final conclusion on #840.

pmcb55 changed the title ~~Should gist add explicit English language tags to all it's existing skos:prefLabel and skos:defintion values?~~ Should gist add explicit English language tags to all it's existing skos:prefLabel and skos:definition values? Jun 12, 2022

rjyounes added the topic: annotations label Jun 13, 2022

rjyounes changed the title ~~Should gist add explicit English language tags to all it's existing skos:prefLabel and skos:definition values?~~ Should gist add explicit English language tags to all its existing skos:prefLabel and skos:definition values? Jun 13, 2022

rjyounes changed the title ~~Should gist add explicit English language tags to all its existing skos:prefLabel and skos:definition values?~~ Should gist add explicit English language tags to all its existing SKOS annotation values? Jun 13, 2022

rjyounes added the status: deferred to major release label Mar 23, 2023

rjyounes assigned Jamie-SA Mar 23, 2023

This was referenced May 11, 2023

Automate generation of @en-tagged versions of annotations #840

Closed

Add sample @fr-tagged version of gist annotations #841

Closed

rjyounes added the status: under review label May 11, 2023

rjyounes mentioned this issue May 11, 2023

Modify documentation related to language-tagged annotations #842

Closed

rjyounes mentioned this issue May 17, 2023

Add sample file of Spanish-tagged version of gist annotations #851

Closed

rjyounes closed this as completed May 25, 2023

rjyounes added the topic: language tags label Sep 28, 2023

rjyounes unassigned Jamie-SA Oct 12, 2023

rjyounes removed the status: under review and topic: annotations labels Oct 12, 2023
