Should gist add explicit English language tags to all its existing SKOS annotation values? #685

Closed
pmcb55 opened this issue Jun 12, 2022 · 34 comments

Comments

@pmcb55

pmcb55 commented Jun 12, 2022

Currently the skos:prefLabel and skos:definition (and skos:scopeNote and skos:example, etc.) values for all terms in gist are only provided in English, yet they are not explicitly language tagged with @en.

To help promote more international reuse of gist, and to denote more clearly and explicitly the intended use of these values (i.e., I assume, for human users to read and interpret), I would strongly advocate always adding explicit language tags for such predicates, and leaving strings as xsd:string only for predicates whose literal values would never make sense to translate across human languages (e.g., social security numbers, which may contain hyphens, or alphanumeric passport numbers).

@pmcb55 pmcb55 changed the title Should gist add explicit English language tags to all it's existing skos:prefLabel and skos:defintion values? Should gist add explicit English language tags to all it's existing skos:prefLabel and skos:definition values? Jun 12, 2022
@Jamie-SA
Contributor

This request has come up several times. I am not sure why we haven't done this yet.

#311
#228
#185
#85

@rjyounes
Collaborator

rjyounes commented Jun 13, 2022

It's worth reading the discussion thread on #311 to get some insight into why we haven't done this in the past - essentially, the assumption is we have no non-English-speaking users and therefore no use case. But as use of gist broadens, we may well have international users.

If we do want to do it, rdf-toolkit has this option:

 -osl,--override-string-language <arg> 
 sets an override language that is applied to all strings

We could add this argument in the pre-commit hook.

However, this will over-generate in a couple of cases:

Perhaps it's better placed in the bundler, where we can specify exactly the properties we want it to apply to (note: all annotation properties, not just skos:prefLabel and skos:definition).
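
To make the bundler idea concrete, here is a minimal sketch of tagging only a chosen set of annotation properties, rather than every string as `--override-string-language` would. It is not rdf-toolkit or the gist bundler: the `Literal` type, the `tag_annotations` function, the property set, and `ex:passportNumber` are all hypothetical, and triples are modelled as plain tuples.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


class Literal(NamedTuple):
    """Minimal stand-in for an RDF literal (hypothetical, not a library class)."""
    value: str
    lang: Optional[str] = None   # None means "no language tag"
    datatype: str = XSD_STRING


# Assumption: the annotation properties the project wants tagged.
ANNOTATION_PROPERTIES = frozenset(
    {"skos:prefLabel", "skos:definition", "skos:scopeNote", "skos:example"}
)


def tag_annotations(triples, lang="en", properties=ANNOTATION_PROPERTIES):
    """Return a copy of `triples` with `lang` added to untagged xsd:string
    values of the listed properties; everything else is left untouched."""
    tagged = []
    for subject, predicate, obj in triples:
        if (
            predicate in properties
            and isinstance(obj, Literal)
            and obj.lang is None
            and obj.datatype == XSD_STRING
        ):
            obj = Literal(obj.value, lang, RDF_LANGSTRING)
        tagged.append((subject, predicate, obj))
    return tagged


triples = [
    ("gist:Organization", "skos:prefLabel", Literal("Organization")),
    ("ex:_Person_1", "ex:passportNumber", Literal("X1234567")),  # a code, not prose
]
result = tag_annotations(triples)
```

With this input, the `skos:prefLabel` value gains the `en` tag while the passport number stays a plain `xsd:string`, which is the over-generation case the blanket override would get wrong.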

@rjyounes rjyounes changed the title Should gist add explicit English language tags to all it's existing skos:prefLabel and skos:definition values? Should gist add explicit English language tags to all its existing skos:prefLabel and skos:definition values? Jun 13, 2022
@rjyounes rjyounes changed the title Should gist add explicit English language tags to all its existing skos:prefLabel and skos:definition values? Should gist add explicit English language tags to all its existing SKOS annotation values? Jun 13, 2022
@Jamie-SA
Contributor

Should it be a post-processing step? Why not just change the source files and then only verify correctness in the bundling step? Similar to some of the queries Boris wrote for validating style.

Also think through how we'd want to handle having multiple languages in/for gist.

@rjyounes
Collaborator

rjyounes commented Jun 14, 2022

That works too, but I do think we should use the bundler to verify it in case people forget, as you suggest.
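
A verification step of that kind could look roughly like the following sketch. It assumes triples are available as `(subject, predicate, value, lang)` tuples; the function name, the property set, and the `ex:` sample terms are hypothetical and do not reflect how the gist bundler actually works.

```python
# Assumption: the annotation properties that must carry a language tag.
ANNOTATION_PROPERTIES = frozenset(
    {"skos:prefLabel", "skos:definition", "skos:scopeNote", "skos:example"}
)


def find_untagged_annotations(triples, properties=ANNOTATION_PROPERTIES):
    """Return (subject, predicate, value) for every annotation lacking a tag.

    `triples` is an iterable of (subject, predicate, value, lang) tuples,
    where `lang` is None or "" when the literal has no language tag.
    """
    return [
        (subject, predicate, value)
        for subject, predicate, value, lang in triples
        if predicate in properties and not lang
    ]


# Made-up sample data for illustration only.
triples = [
    ("ex:Widget", "skos:prefLabel", "Widget", "en"),
    ("ex:Widget", "skos:definition", "A sample definition.", None),
    ("ex:_Person_1", "ex:passportNumber", "X1234567", None),
]
problems = find_untagged_annotations(triples)
for subject, predicate, value in problems:
    print(f"Missing language tag: {subject} {predicate} {value!r}")
```

A pre-commit hook or bundler step could fail the build whenever the returned list is non-empty, so a forgotten tag is caught before release.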

@JonathonGist

Excellent. I was hoping you all would add @en where appropriate.

@rjyounes
Collaborator

rjyounes commented Sep 8, 2022

Considerations:

  • @en tags could be added by bundler.
  • This is a breaking change, not in terms of inferences, but it will break SPARQL queries.
  • The bundler could add non-tagged versions of the annotations to rdfs:Annotations, for users who don't want to update their queries.
  • Alternatively, Jamie suggests keeping the non-tagged files in gistCore.ttl, at least as a transitional device, so users don't have to add an additional import.
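
Why tagging breaks queries can be illustrated with a small sketch. In RDF, a plain string and the same string with a language tag are different terms, so a query that matches the untagged literal exactly no longer finds the tagged one. The `Literal` type and both helper functions below are hypothetical stand-ins for a triple store, not a real API.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Minimal stand-in for an RDF literal (hypothetical)."""
    value: str
    lang: Optional[str] = None


def subjects_with_exact_label(triples, label):
    """Mimic the pattern `?s skos:prefLabel <label>`: exact term equality."""
    return [s for s, p, o in triples if p == "skos:prefLabel" and o == label]


def subjects_with_label_text(triples, text):
    """Mimic filtering on the lexical form only, like STR(?label) in SPARQL."""
    return [s for s, p, o in triples if p == "skos:prefLabel" and o.value == text]


before = [("gist:Organization", "skos:prefLabel", Literal("Organization"))]
after = [("gist:Organization", "skos:prefLabel", Literal("Organization", "en"))]

existing_query_term = Literal("Organization")  # what an existing query contains

print(subjects_with_exact_label(before, existing_query_term))  # ['gist:Organization']
print(subjects_with_exact_label(after, existing_query_term))   # []
print(subjects_with_label_text(after, "Organization"))         # ['gist:Organization']
```

Queries that compare only the lexical form keep working across the change, which is one possible migration path for users who cannot update everything at once.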

Needs further discussion, so moving to December release.

@Jamie-SA
Contributor

Rebecca is correct, this likely will break someone's queries or code. But I think we should do it. I'll volunteer to do the PR.

I think we should schedule it for the next major release (probably the Units Of Measure changes release).

But there is another problem: if we only have one language, adding a second would also break queries/code. It seems we have two options: 1) keep translations in a separate file, or 2) add at least one more language at the same time, so people would address the issue of multiple languages at the same time as the switch to language tags.

Dealing with multiple languages might be harder for most of us who haven't had to deal with it yet than just the change of literal type. So part of me likes the idea of keeping other languages in separate files. But I also think it might be a good idea to push us to start preparing for it, and therefore we should put them into the main file. Of course, we could release multiple versions and let people choose.
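
For anyone who has not handled multiple languages yet, the core of it is choosing one label from several. The sketch below picks by an ordered language preference and falls back to an untagged value. The function name and the fallback policy are assumptions, and the French and Spanish labels are illustrative translations rather than anything published in gist.

```python
def pick_label(labels, preferred=("en",)):
    """Choose one label from (text, lang) pairs.

    Tries each language in `preferred` in order (case-insensitive exact match),
    then falls back to an untagged label (lang is None), then to any label.
    Returns None when `labels` is empty.
    """
    for wanted in preferred:
        for text, lang in labels:
            if lang is not None and lang.lower() == wanted.lower():
                return text
    for text, lang in labels:
        if lang is None:
            return text
    return labels[0][0] if labels else None


# Illustrative data: one untagged canonical label plus tagged variants.
labels = [
    ("Organization", None),
    ("Organization", "en"),
    ("Organisation", "fr"),
    ("Organización", "es"),
]

print(pick_label(labels, ("es", "fr")))  # Organización
print(pick_label(labels, ("de",)))       # Organization (untagged fallback)
```

Code that simply takes the first `skos:prefLabel` it finds is what breaks when a second language appears; an explicit preference order avoids depending on result ordering.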

See also these discussions about dealing with multiple languages:

@rjyounes
Collaborator

I don't consider myself anglocentric, but I don't think ontology terms need to have language versions until we have a specific use case for that. Instances are a different issue. In the Treasury of Lives project, for example, data properties like names and pref labels in taxonomies have different versions for different languages and transliterations. But ontology terms are not user-facing and I see no reason to provide alternate language versions for maintainers of the ontology - again, until we have a use case, which would have to be a gist developer who knows little or no English - arguably extremely unlikely.

@Jamie-SA
Contributor

I disagree. I think you are being quite anglocentric.

This isn't just for maintainers of gist, it is also for users of gist. I think we do want to increase the adoption of gist, don't we? If you want non-native English speaking developers (quite a bit of the world) to adopt gist, then gist should provide, or at least allow for, annotations in multiple languages.

And when you move towards model/metadata-driven architectures, ontology terms do surface to the UI.

@rjyounes
Collaborator

rjyounes commented Jan 27, 2023

It's not anglo-centric to acknowledge that English is the de facto lingua franca in the worlds of business, finance, tech, industry, and academia; it's just a plain fact. I bet every professional ontologist has read the English texts on the subject.

Your final point is convincing, though, so I'll retract my disagreement.

@rjyounes
Collaborator

rjyounes commented Jan 27, 2023

Do we want to add @en or @en-us, so that our colleagues in the UK can use "Organisation"@en-gb?

@JonathonGist

I think that it would be good to use @en-us, so that @en-GB can be used where needed.

@uscholdm
Contributor

uscholdm commented Jan 27, 2023 via email

@rjyounes rjyounes added the status: deferred to major release Involves a major change so deferred till next major release. An implementation may be specified. label Mar 23, 2023
@rjyounes
Collaborator

rjyounes commented Mar 23, 2023

Decision:

  • Use two-letter language tags, rather than 4-letter language + country tags

This is a breaking change due to differences in query results. Deferred to release 13.0.0.

@rjyounes
Collaborator

If someone wants to add a language, they can submit a PR and continue to maintain it.

Should we have a version of each annotation without a language tag?

Should we require a complete implementation of all annotations, or should we allow someone to implement piecemeal?
Steven: Then we would have to maintain it with every update.
Rebecca: We can suggest it.

Michael: Another option is to ask people to create files with their translations.
Jamie: Or put them in a separate file in our repo.

Suggestion: Canonical version doesn't have a tag. All tagged versions go in separate files.
We'll implement the English one as a starter.
Also a partial (or complete) French version.

Jamie: Automate English version and add to pre-commit hook (dev team)

@rjyounes
Collaborator

TBD: File naming conventions, file and directory structure

  • One directory for each language or a shared directory for all languages?
  • One file per language, or mirroring the canonical files with 5 separate files? If the latter, there should be a directory per language to avoid clutter.
  • The release package generates multiple serializations of each file, so having one directory per sub-language seems cleaner and more manageable.

Setting status to "under review" until these questions are answered.

@rjyounes
Collaborator

rjyounes commented May 11, 2023

Added individual issues:

#840 - Automated generation of en version
#841 - Sample French version
#851 - Sample Spanish version
#842 - Documentation

This issue will be closed once the file and directory structure questions have been resolved.

@rjyounes rjyounes added impact: major Non-backward compatible (changes inferences; e.g., adding a restriction, domain, range) and removed status: deferred to major release Involves a major change so deferred till next major release. An implementation may be specified. impact: major Non-backward compatible (changes inferences; e.g., adding a restriction, domain, range) labels May 11, 2023
@pmcb55
Author

pmcb55 commented May 17, 2023

Just a very quick comment - in all my US-based customer engagements, the first (and so far only!) additional language they've all requested has been Spanish. (I just mention this as I see you're leaning towards French, but I would have thought most current gist users are probably US-based, and therefore wouldn't Spanish perhaps be a more 'useful' 2nd language for that user-base...?)

@rjyounes
Collaborator

@pmcb55 This is a good point. A partial French version was going to be provided as an example of how to do it, because we have two French speakers on board that could do it. If you are able to provide a Spanish version instead that would be fine. It doesn't have to be complete, just some examples.

@pmcb55
Author

pmcb55 commented May 17, 2023

If you are able to provide a Spanish version instead...

@rjyounes yeah, you've got a deal! Since this is just some examples, why don't we just add both French and Spanish (we have native French and Spanish speakers here at Inrupt - but how about you provide the French examples, and I'll ask a couple of our Spanish speakers to contribute Spanish too).
If you just point me at the French examples once they're ready (at patm@inrupt.com), I'll take it from there with the Spanish version.

@pmcb55
Author

pmcb55 commented May 17, 2023

On the question of 2- or 4-letter language tags, I'd suggest sticking with 2-letter tags in the main, but do bite the bullet now and apply 4-letter tags for the very few situations where they are applicable, e.g.:

@prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

gist:Organization a owl:Class ;
    skos:definition "A generic organization that can be formal or informal, legal or non-legal. It can have members, or not."@en-US ;
    skos:definition "A generic organisation that can be formal or informal, legal or non-legal. It can have members, or not."@en-GB ;
    skos:prefLabel "Organization"@en-US ;
    skos:prefLabel "Organisation"@en-GB .

I say 'bite the bullet now' because this does put an added burden on clients/users of gist: they now not only need to be language-tag aware (a 'good thing', as even the USA is, I believe, rapidly becoming multilingual, especially in the South), but also need to be aware of 4-letter tags (which, I'd argue, is also a 'good thing').
I think it's important to appreciate that this is not just an AmericanEnglish/EnglishEnglish thing, but also a CanadianFrench/FrenchFrench thing, and also a SouthAmericanSpanish/SpanishSpanish thing, etc.
And given gist's Enterprise/Commercial focus, and its laudable global ambitions, I think it would be a missed opportunity to go in only half-heartedly on this question (i.e., 2-letter tags only); better to preempt the frustrations that limiting to just 2-letter language tags will inevitably raise from the get-go.
(And just from a purely selfish perspective, Inrupt continues to be in active discussions with multiple governments and regions around the world, and again and again the 'language variation question' comes up - i.e., it can be a fiercely political issue in many parts of the world - so gist actively using (and therefore encouraging clients to actively support) 4-letter language tags would be a big step forward for everyone!)

@Jamie-SA
Contributor

@pmcb55 thanks for the comments. I agree with adding 4 letter tags, it basically starts everyone on the path of being able to handle it from the beginning.

But I have a clarifying question: would you use 2 letter language tags when there is no difference? Or do you duplicate everything even when there is no difference between the 4 letter language tag versions?

@rjyounes
Collaborator

@pmcb55 You've convinced me of the four-letter language tags. @Jamie-SA You would use the same tags throughout, even if the languages coincide. That would certainly decrease the burden on querying.
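
On the querying burden: SPARQL's `langMatches` uses the basic filtering scheme from RFC 4647, under which a range such as `en` matches `en`, `en-US`, and `en-GB` alike. The sketch below re-implements that rule in Python purely for illustration; the function name is hypothetical and it does not cover extended filtering.

```python
def lang_matches(tag, lang_range):
    """Basic language-range filtering in the style of RFC 4647.

    A range matches a tag when they are equal ignoring case, or when the
    range is a prefix of the tag followed by "-". The range "*" matches any
    non-empty tag. An empty tag (untagged literal) never matches.
    """
    if not tag:
        return False
    if lang_range == "*":
        return True
    tag = tag.lower()
    lang_range = lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


print(lang_matches("en-US", "en"))     # True
print(lang_matches("en-GB", "en"))     # True
print(lang_matches("en", "en-US"))     # False
print(lang_matches("en-GB", "en-US"))  # False
print(lang_matches("", "*"))           # False
```

So a query written with a range of `en` keeps working whether annotations are tagged `en` or `en-US`, whereas a query testing the tag for exact equality with `en` would miss `en-US` values.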

@rjyounes
Collaborator

I would like to get @mkumba's input on the 4-letter tags before moving forward. He expressed a preference for 2-letter tags.

@mkumba
Contributor

mkumba commented May 17, 2023 via email

@uscholdm
Contributor

The real debate is whether or when to allow un languished strings

Yes that is the question - perhaps gist strings have languished in their un-languaged form for long enough! :-)

@rjyounes
Collaborator

I thought we had decided to have the non-tagged strings in gistCore, with the tagged versions in different files? Or is there a pun I'm not getting?

One thing that occurred to me is that it would be optimal to clean up our gist annotations before we implement the samples in other languages. We don't want all the problems we know exist with our annotations to be simply carried over into other languages. We can proceed with generating the English versions automatically, but IMO we should do the full clean-up of our existing annotations before doing any work on other languages. This will cause a delay, of course, but we can make a concerted effort on the current annotations and make them a priority for the next release. I doubt if there'll be anyone banging down the door in the next 6 months wanting to implement a non-English version.

@johnwcowan

It's important to be flexible about the length of a language tag.

For Japanese, the appropriate tag is simply "@ja", because there is no country other than Japan where Japanese is an important language. We do not need to tag anything "@ja-JP", for example, because there is no contrast with "@ja-US" or "@ja-EU" or what have you.

On the other hand, not all languages have 2-letter tags. One of the official languages of the Cook Islands, for example, is Cook Islands Maori (not the same as New Zealand Maori), whose tag is "@rar". Across the entire range of languages (about 7000), most have 3-letter tags, though most of the commercially important ones have 2-letter tags.

On the gripping hand, the important distinction for written Chinese is traditional vs. simplified characters, which are tagged "@zh-Hant" and "@zh-Hans" respectively, and by the same token for Azerbaijani what matters is the script: Latin (@az-Latn), Cyrillic (@az-Cyrl), or Arabic (@az-Arab), as most literate Azeris only know one script.
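
The structure of the tags mentioned above can be sketched with a small parser. It deliberately handles only a simplified subset of BCP 47: a 2- or 3-letter language subtag, an optional 4-letter script subtag, and an optional region subtag (2 letters or 3 digits). Variants, extensions, and private-use subtags are ignored, and the function name is hypothetical.

```python
import re

# Simplified subset of BCP 47: language[-script][-region]
_TAG_PATTERN = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def parse_language_tag(tag):
    """Split a tag into language, script, and region subtags.

    Returns a dict with conventional casing (language lower, script Title,
    region UPPER), or None if the tag falls outside the simplified subset.
    """
    match = _TAG_PATTERN.fullmatch(tag)
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    return {
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }


for example in ("ja", "rar", "en-US", "zh-Hant", "az-Latn", "english"):
    print(example, parse_language_tag(example))
```

The point of the sketch is that "length of the tag" is the wrong axis: `rar` is a bare language, `zh-Hant` adds a script, and `en-US` adds a region, and each is a valid level of specificity.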

Links:

Language Subtag Registry

BCP 47. See RFC 5646 (first part of BCP 47), section 4.1 for guidelines for constructing language tags from subtags.

@rjyounes
Collaborator

rjyounes commented May 20, 2023

@johnwcowan Thanks, this is good information. W3C has a helpful discussion of tags and subtags which confirms what you are saying. We've been referring over-simplistically to "two" and "four" letter tags: the correct distinctions are between language subtags, region subtags, and script subtags.

I think you're right that we need not prescribe a certain level of tagging. If someone wants to provide a specifically Canadian French version as opposed to a generic French version, they can do so. On the other hand, we may choose not to accept a PR that uses "ja-JP" tags and request that they be rewritten as simply "ja". The primary open implementation question for us is whether our auto-generated English tags should be "en" or "en-US".

@johnwcowan

Minor correction: Japanese is 'ja', whereas Japan is 'JP'.

I recommend the use of 'en-US', since the existing text is written in U.S. English.

In general, anything meant to be linguistic data (descriptions, etc.) as opposed to codes, should be language-tagged. So unless it's a requirement for backward compatibility, I recommend tagging all the linguistic data rather than providing an untagged form. Untagged text should be confined to codes of various sorts.

@rjyounes
Collaborator

rjyounes commented May 21, 2023

If we make this a minor change, we need the untagged versions for backward compatibility. If it's a major change, we don't. This is subject to discussion.

@rjyounes
Collaborator

rjyounes commented May 25, 2023

Closing: all to-do items moved to other issues.

#840 - Automated generation of en version
#841 - Sample French version
#842 - Documentation

@rjyounes
Collaborator

rjyounes commented May 27, 2023

Team discussion on 2023-05-25 led to the following conclusions:

* Related issues:
    * [Automated generation of en version](https://github.com/semanticarts/gist/issues/840)
    * [Sample French version](https://github.com/semanticarts/gist/issues/841)
    * [Sample Spanish version](https://github.com/semanticarts/gist/issues/851) (Pat McBennett to submit PR)
    * [Documentation](https://github.com/semanticarts/gist/issues/842)
* Open discussion points
    * Language or language+subtag? Subtag could be region or script. E.g., in Chinese the script is of primary significance in written text, not the region. From John Cowan.
        * Maybe language only if there are no significant regional distinctions, otherwise language+region? (Note, for accuracy the subtags are not always 2 letters).
        * Don't need to prescribe, just let people submit what they want to? We could reject a region tag where there are no regional distinctions.
        * What do we want to do for our versions?
            * English - use language only, may change in future
            * French - use language only, may change in future
    * How do we review PRs in languages we have no expertise in?
        * Trust factor
        * In-house expertise
        * Test in Google Translate (to English), if they look OK accept them. If not, reject the whole batch.
        * Add disclaimer for those we haven't been able to review, and also that they may not be up-to-date as gist evolves.
        * SA in-house expertise:
            * French: Doug, Jess
            * German: Rebecca
            * Arabic: Dalia
            * Simplified and Traditional Chinese: Katie (some)
            * Japanese: Katie (some)
            * Russian: Irina
            * Italian: Irina
    * No default untagged version - our default should be `en-US`. From John Cowan.
        * Then this is a major change.
        * Decision: keep untagged versions as defaults in ontology files. Can change later in a major release.
    * File names and directory structure.
        * Only release in Turtle.
        * Most likely only gistCore, not supplementary ontologies, will be translated.
        * There might be a translation of the readme if someone wants to do it.
        * So, we'll only have one ontology file per language. Naming conventions:
            * `gistCore.en.ttl`
            * `README.en.md`
        * So use one directory. [We didn't decide on a directory name, I propose `language_tagged_annotations` or similar.]
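
The naming convention above can be expressed as a small helper, shown here only as a sketch. The function names are hypothetical, and the directory name is the one proposed in the notes, which was not a final decision.

```python
from pathlib import PurePosixPath

# Assumption: directory name proposed in the notes above, not yet decided.
LANGUAGE_DIR = "language_tagged_annotations"


def tagged_file_name(canonical_name, lang):
    """Insert the language tag before the extension: gistCore.ttl -> gistCore.en.ttl."""
    path = PurePosixPath(canonical_name)
    return f"{path.stem}.{lang}{path.suffix}"


def tagged_file_path(canonical_name, lang, directory=LANGUAGE_DIR):
    """Place every language's file in the single shared directory."""
    return str(PurePosixPath(directory) / tagged_file_name(canonical_name, lang))


print(tagged_file_name("gistCore.ttl", "en"))   # gistCore.en.ttl
print(tagged_file_name("README.md", "en"))      # README.en.md
print(tagged_file_path("gistCore.ttl", "fr"))   # language_tagged_annotations/gistCore.fr.ttl
```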

@rjyounes
Collaborator

See additional discussion and final conclusion on #840.
