Should gist add explicit English language tags to all its existing SKOS annotation values?
#685
Comments
It's worth reading the discussion thread on #311 to get some insight into why we haven't done this in the past - essentially, the assumption is we have no non-English-speaking users and therefore no use case. But as use of gist broadens, we may well have international users. If we do want to do it, rdf-toolkit has this option:
We could add this argument in the pre-commit hook. However, this will over-generate in a couple of cases:
Perhaps it's better placed in the bundler, where we can specify exactly the properties we want it to apply to (note: all annotation properties, not just |
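A minimal sketch of the property-scoped approach described above, assuming triples are held as plain Python tuples. The `Literal` class, the `add_language_tags` helper, the property list, and `ex:passportNumber` are all hypothetical illustrations, not the rdf-toolkit or bundler API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal: text plus optional tag or datatype."""
    text: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


# Illustrative subset only; the real list would name every annotation property to tag.
TAGGED_PROPERTIES = {"skos:prefLabel", "skos:definition", "skos:scopeNote", "skos:example"}


def add_language_tags(triples, lang="en"):
    """Apply `lang` to plain, untagged literals of the selected properties only."""
    result = []
    for s, p, o in triples:
        if (p in TAGGED_PROPERTIES and isinstance(o, Literal)
                and o.lang is None and o.datatype is None):
            o = Literal(o.text, lang=lang)
        result.append((s, p, o))
    return result


triples = [
    ("gist:Person", "skos:prefLabel", Literal("Person")),
    ("ex:jane", "ex:passportNumber", Literal("X1234567")),  # a code, so it stays untagged
]
tagged = add_language_tags(triples)
```

Restricting the change to a named set of properties is what avoids the over-generation that a blanket option would cause: literals on any other property pass through unchanged.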
Should it be a post-processing step? Why not just change the source files and then only verify correctness in the bundling step? Similar to some of the queries Boris wrote for validating style. Also think through how we'd want to handle having multiple languages in/for gist. |
That works too, but I do think we should use the bundler to verify it in case people forget, as you suggest. |
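The verification step could look roughly like the following sketch. It assumes literals are modeled as `(text, language tag)` pairs with `None` meaning untagged; `find_untagged`, the property set, and the sample definition text are hypothetical and not taken from the bundler or from gist itself:

```python
def find_untagged(triples, properties):
    """Return (subject, property, text) for each value of `properties` with no language tag."""
    return [
        (s, p, text)
        for s, p, (text, lang) in triples
        if p in properties and lang is None
    ]


ANNOTATIONS = {"skos:prefLabel", "skos:definition"}

triples = [
    ("gist:Person", "skos:prefLabel", ("Person", "en")),
    ("gist:Person", "skos:definition", ("A human being.", None)),  # forgotten tag
]

problems = find_untagged(triples, ANNOTATIONS)
for s, p, text in problems:
    print(f"{s} {p} is missing a language tag: {text!r}")
```

A check of this kind only reports problems; the source files remain the place where tags are actually added.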
Excellent. I was hoping you all would add @en where appropriate. |
Considerations:
Needs further discussion, so moving to December release. |
Rebecca is correct: this will likely break someone's queries or code. But I think we should do it, and I'll volunteer to do the PR. I think we should schedule it for the next major release (probably the Units Of Measure changes release). But there is another problem: if we only have one language, adding a second would also break queries/code. It seems we have two options: 1) keep translations in a separate file, or 2) add at least one more language at the same time, so people would need to address the issue of multiple languages at the same time as the switch to language tags. Dealing with multiple languages might be harder for most of us who haven't had to deal with it yet than just the change of literal type. So part of me likes the idea of keeping other languages in separate files. But I also think it might be a good idea to push us to start preparing for it, and therefore we should put them into the main file. Of course, we could release multiple versions and let people choose. See also these discussions about dealing with multiple languages: |
I don't consider myself anglocentric, but I don't think ontology terms need to have language versions until we have a specific use case for that. Instances are a different issue. In the Treasury of Lives project, for example, data properties like names and pref labels in taxonomies have different versions for different languages and transliterations. But ontology terms are not user-facing and I see no reason to provide alternate language versions for maintainers of the ontology - again, until we have a use case, which would have to be a gist developer who knows little or no English - arguably extremely unlikely. |
I disagree. I think you are being quite anglocentric. This isn't just for maintainers of gist, it is also for users of gist. I think we do want to increase the adoption of gist, don't we? If you want non-native English speaking developers (quite a bit of the world) to adopt gist, then gist should provide, or at least allow for, annotations in multiple languages. And, when you move towards model/metadata driven architectures, Ontology terms do surface to the UI. |
It's not anglo-centric to acknowledge that English is the de facto lingua franca in the worlds of business, finance, tech, industry, and academia; it's just a plain fact. I bet every professional ontologist has read the English texts on the subject. Your final point is convincing, though, so I'll retract my disagreement. |
Do we want to add @en or @en-us, so that our colleagues in the UK can use "@en-GB"? |
I wonder what FIBO does. |
Decision:
This is a breaking change due to differences in query results. Deferred to release 13.0.0. |
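The reason query results differ is that in RDF a plain literal and a language-tagged literal with the same text are distinct terms, so an exact match on the untagged form stops matching once the tag is added. A minimal sketch, with literals modeled as `(text, language tag)` pairs and a hypothetical `subjects_with_label` helper standing in for a query pattern:

```python
# Literals are (text, language tag) pairs; None means untagged.
before = {("gist:Person", "skos:prefLabel", ("Person", None))}
after = {("gist:Person", "skos:prefLabel", ("Person", "en"))}


def subjects_with_label(triples, text, lang=None):
    """Exact-match lookup, analogous to a query pattern on one specific literal."""
    return {s for s, p, o in triples if p == "skos:prefLabel" and o == (text, lang)}


old_query_before = subjects_with_label(before, "Person")     # matches today
old_query_after = subjects_with_label(after, "Person")       # no longer matches
new_query_after = subjects_with_label(after, "Person", "en") # must name the tag
```

Existing consumers would need to update any pattern that matches the literal exactly, which is why the change is treated as breaking.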
If someone wants to add a language, they can submit a PR and continue to maintain it. Should we have a version of each annotation without a language tag? Should we require a complete implementation of all annotations, or should we allow someone to implement piecemeal? Michael: Another option is to ask people to create files with their translations. Suggestion: Canonical version doesn't have a tag. All tagged versions go in separate files. Jamie: Automate English version and add to pre-commit hook (dev team) |
TBD: File naming conventions, file and directory structure
Setting status to "under review" until these questions are answered. |
Just a very quick comment - in all my US-based customer engagements, the first (and so far only!) additional language they've all requested has been Spanish. (I just mention this as I see you're leaning towards French, but I would have thought most current gist users are probably US-based, and therefore wouldn't Spanish perhaps be a more 'useful' 2nd language for that user-base...?) |
@pmcb55 This is a good point. A partial French version was going to be provided as an example of how to do it, because we have two French speakers on board that could do it. If you are able to provide a Spanish version instead that would be fine. It doesn't have to be complete, just some examples. |
@rjyounes yeah, you've got a deal! Since this is just some examples, why don't we just add both French and Spanish (we have native French and Spanish speakers here at Inrupt - but how about you provide the French examples, and I'll ask a couple of our Spanish speakers to contribute Spanish too). |
On the question of 2- or 4-letter language tags, I'd suggest sticking with 2-letters in the main, but do bite the bullet now and apply 4-letter tags for the very few situations where they are applicable, e.g.:
I say 'bite the bullet now' above, because yeah, this indeed puts an added burden on clients/users of gist - in that they now not only need to be language tag aware (but that's a 'good thing', as even the USA is (I believe) rapidly becoming multilingual, especially in the Southern USA), but they also now need to be aware of 4-letter tags (but, I'd argue, this is also a 'good thing'). |
@pmcb55 thanks for the comments. I agree with adding 4 letter tags, it basically starts everyone on the path of being able to handle it from the beginning. But I have a clarifying question: would you use 2 letter language tags when there is no difference? Or do you duplicate everything even when there is no difference between the 4 letter language tag versions? |
I would like to get @mkumba's input on the 4-letter tags before moving forward. He expressed a preference for 2-letter tags. |
Yeah, the 4-letter tag comes into play about 1% of the time, and even when it fails, it's a soft fail. Most Americans know that colour is really color and most Brits know color is really colour, so there isn't much lost.
The real debate is whether or when to allow un languished strings
|
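One way a client could get the "soft fail" behaviour is to fall back from a region-specific tag to the bare language tag, and finally to the untagged string. The sketch below is loosely modeled on the lookup scheme in RFC 4647 but simplified; `pick_label` and the sample labels are hypothetical:

```python
def pick_label(labels, preferred):
    """Pick a label for `preferred`, truncating the tag from the right,
    then falling back to the untagged value.

    `labels` maps a lowercase language tag (or None for untagged) to text.
    """
    tag = preferred.lower()
    while tag:
        if tag in labels:
            return labels[tag]
        tag = tag.rpartition("-")[0]  # "en-gb" -> "en" -> ""
    return labels.get(None)


labels = {"en": "color", "en-gb": "colour", None: "color"}

british = pick_label(labels, "en-GB")    # exact regional match
american = pick_label(labels, "en-US")   # falls back to "en"
french = pick_label(labels, "fr")        # falls back to the untagged string
```

With this kind of fallback, a regional variant only needs to be supplied where it actually differs from the generic language version.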
Yes that is the question - perhaps gist strings have languished in their un-languaged form for long enough! :-) |
I thought we had decided to have the non-tagged strings in gistCore, with the tagged versions in different files? Or is there a pun I'm not getting? One thing that occurred to me is that it would be optimal to clean up our gist annotations before we implement the samples in other languages. We don't want all the problems we know exist with our annotations to be simply carried over into other languages. We can proceed with generating the English versions automatically, but IMO we should do the full clean-up of our existing annotations before doing any work on other languages. This will cause a delay, of course, but we can make a concerted effort on the current annotations and make them a priority for the next release. I doubt if there'll be anyone banging down the door in the next 6 months wanting to implement a non-English version. |
It's important to be flexible about the length of a language tag. For Japanese, the appropriate tag is simply "@ja", because there is no country other than Japan where Japanese is an important language. We do not need to tag anything "@ja-JP", for example, because there is no contrast with "@ja-US" or "@ja-EU" or what have you. On the other hand, not all languages have 2-letter tags. One of the official languages of the Cook Islands, for example, is Cook Islands Maori (not the same as New Zealand Maori), whose tag is "@rar". Across the entire range of languages (about 7000), most have 3-letter tags, though most of the commercially important ones have 2-letter tags. On the gripping hand, the important distinction for written Chinese is traditional vs. simplified characters, which are tagged "@zh-Hant" and "@zh-Hans" respectively, and by the same token for Azerbaijani what matters is the script: Latin (@az-Latn), Cyrillic (@az-Cyrl), or Arabic (@az-Arab), as most literate Azeris only know one script. See BCP 47, in particular RFC 5646 (the first part of BCP 47), section 4.1, for guidelines for constructing language tags from subtags. |
@johnwcowan Thanks, this is good information. W3C has a helpful discussion of tags and subtags which confirms what you are saying. We've been referring over-simplistically to "two" and "four" letter tags: the correct distinctions are between the language subtag, the region subtag, and the script subtag. I think you're right that we need not prescribe a certain level of tagging. If someone wants to provide a specifically Canadian French version as opposed to a generic French version, they can do so. On the other hand, we may choose not to accept a PR that uses "ja-JP" tags and request that they be rewritten as simply "ja". The primary open implementation question for us is whether our auto-generated English tags should be "en" or "en-US". |
Minor correction: Japanese is 'ja', whereas Japan is 'JP'. I recommend the use of 'en-US', since the existing text is written in U.S. English. In general, anything meant to be linguistic data (descriptions, etc.) as opposed to codes, should be language-tagged. So unless it's a requirement for backward compatibility, I recommend tagging all the linguistic data rather than providing an untagged form. Untagged text should be confined to codes of various sorts. |
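The subtag structure discussed here (language, optional script, optional region) can be illustrated with a small parser. This is a simplified sketch, not a full BCP 47 implementation: it ignores variants, extensions, and private-use subtags, and `parse_tag` is a hypothetical helper:

```python
import re

# Simplified pattern: language, optional script, optional region.
TAG_PATTERN = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag):
    """Split a simple language tag into subtags, using conventional casing:
    lowercase language, title-case script, uppercase region."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unsupported language tag: {tag!r}")
    language, script, region = match.group("language", "script", "region")
    return {
        "language": language.lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }


english_us = parse_tag("en-us")
chinese_traditional = parse_tag("zh-Hant")
japanese = parse_tag("ja")
```

Tags are compared case-insensitively, so the casing applied here is only the customary presentation that distinguishes, for example, the language `ja` from the region `JP`.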
If we make this a minor change, we need the untagged versions for backward compatibility. If it's a major change, we don't. This is subject to discussion. |
Team discussion on 2023-05-25 led to the following conclusions:
|
See additional discussion and final conclusion on #840. |
Currently the skos:prefLabel and skos:definition (and skos:scopeNote and skos:example, etc.) values for all terms in gist are only provided in English, yet they are not explicitly language tagged with @en.
To help promote more international reuse of gist, and to more clearly and explicitly denote the intended use of these values (i.e., I assume for human users to read and interpret), I would strongly advocate always adding explicit language tags for such predicates (and to leave strings of type xsd:string for predicates whose literal values would never make sense to translate across human languages (e.g., values for social security numbers (that may contain hyphens), or alphanumeric passport numbers, etc.)).