base direction #9

Closed
gkellogg opened this issue Jan 30, 2023 · 54 comments · Fixed by #48
Labels
i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. spec:substantive Change in the spec affecting its normative content (class 3) –see also spec:bug, spec:new-feature

Comments

@gkellogg
Member

gkellogg commented Jan 30, 2023

A possible issue for RDF 1.2 is to standardize on a solution for the base direction of strings.

This would possibly include updating to the Abstract Syntax and associated changes to the various Concrete Syntax specifications.

See RDF Literals and Base Directions for possible options.

JSON-LD introduced features for specifying the text direction. These included two experimental features compatible with RDF 1.1: the i18n namespace and rdf:CompoundLiteral.

See the issue for the discussion of further options, and the Working Group page for further discussion.

@gkellogg gkellogg added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label Jan 30, 2023
gkellogg added a commit that referenced this issue Jan 30, 2023
@afs
Contributor

afs commented Jan 31, 2023

One for the full WG!

Datatypes for language tags were discussed quite a lot in RDF 1.1 while discussing rdf:langString.

One factor considered was that the additional features of @lang (e.g. case insensitive, regional and script comparison, and others) made it complicated.

Another was that sub/super datatypes (XSD derived types) don't work for these features: a derived datatype is not compatible with its parent datatype if it is written in a different script.

To quote RFC3536: "[UNICODE] has a long and incredibly detailed algorithm for displaying bidirectional text."
and it also notes that mixed direction (e.g. numbers and script) is common.

@gkellogg
Member Author

There was quite a bit of discussion in the JSON-LD group about this, notably including @dlongley, @iherman, and @r12a. There were actually two "experimental" options suggested, both with tradeoffs:

  1. The i18n namespace, and
  2. rdf:CompoundLiteral

Generally, i18n namespace seemed to be more favored, but this was trying to fit RDF 1.1 constraints.

One factor considered was that the additional features of @lang (e.g. case insensitive, regional and script comparison, and others) made it complicated.

In the context of the JSON-LD algorithms, this was addressed by normalizing the language to lower-case, which is not a general solution, and has been somewhat confusing in previous RDF specs and implementations, IMO.

We could conceivably allow the use of literals with both an explicit language and datatype, with some semantic restrictions on the datatype, a solution not available to JSON-LD 1.1. If we were to create sub-properties of rdf:langString such as rdf:langStringLTR and rdf:langStringRTL, and updated the grammars of Turtle, TriG, N-Triples, and N-Quads, we could allow forms such as the following:

@prefix ex: <http://example.org/> .
@prefix i18n: <https://www.w3.org/ns/i18n#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# Proposed syntax, not valid Turtle 1.1: a language tag combined with a datatype.
# Note that this version preserves the base direction using a non-standard datatype.
[
  ex:title "HTML و CSS: تصميم و إنشاء مواقع الويب"@ar-eg^^rdf:langStringRTL;
  ex:publisher "مكتبة"@ar-eg^^rdf:langStringRTL
] .

This is already suggested under RDF 1.1 descriptions for literals, if the restriction of being strictly equal to rdf:langString were relaxed:

Please note that concrete syntaxes may support simple literals consisting of only a lexical form without any datatype IRI or language tag. Simple literals are syntactic sugar for abstract syntax literals with the datatype IRI http://www.w3.org/2001/XMLSchema#string. Similarly, most concrete syntaxes represent language-tagged strings without the datatype IRI because it always equals http://www.w3.org/1999/02/22-rdf-syntax-ns#langString.

The restriction would be that if a literal contains both a language tag and a datatype, the data type must be rdf:langString or a subProperty.

To quote RFC3536: "[UNICODE] has a long and incredibly detailed algorithm for displaying bidirectional text."
and it also notes that mixed direction (e.g. numbers and script) is common.

JSON-LD cites Strings on the Web: Language and Direction Metadata for the more extended discussion of the complexities and limitations of text direction in UNICODE in its informative section on Base Direction.

Note that such a change would be backwards compatible with the JSON-LD 1.1 text direction options, but would allow it to be replaced by a normative statement in the future, while noting the previous experimental usage.

gkellogg added a commit that referenced this issue Feb 1, 2023
@Tpt
Contributor

Tpt commented Feb 2, 2023

+1 to @gkellogg's proposal. However, I have one concern:

The restriction would be that if a literal contains both a language tag and a datatype, the data type must be rdf:langString or a subProperty.

This introduces the concept of subDatatype (not subProperty, I guess) inside of the core RDF interpretation and not only in a datatype-aware interpretation. Implementing this constraint in practice seems hard to me (the parsers would need to get access to a datatype hierarchy...). Or we can see it as a "deductive" constraint i.e. if a datatype is used with a language tag then it is a sub datatype of rdf:langString in some interpretations. But this creates a distinction with respect to the literals without language tags where all datatypes are not required to be sub datatypes of xsd:string even if xsd:string is the default datatype.
What about just stating that rdf:langString is the default datatype when a language tag is present just like xsd:string is the default datatype when there is no language tag?

@afs
Contributor

afs commented Feb 2, 2023

I'd like to see the range of possibilities enumerated.

For example - an extension to language tags (which reduces the impact of literals now having a lang tag and a datatype - something that can break toolkits (an issue considered at RDF 1.1)). Concretely - what about the script and variant subtags?

The JSON-LD solutions cover transmission of information about text which is the basic and important task.

One question to address is does this need to be defined as a conceptual change to RDF? The compound literal approach does not; a common vocabulary ("rdf:") would be useful.

A datatype defines a value space which in turn gives value-equality.
Is RTL("ABC") the same value as LTR("CBA")? As an unqualified string?
What about LANGMATCHES?

XSD datatypes have facets - is that the right approach for this issue? (related to compound literals)

@gkellogg
Member Author

gkellogg commented Feb 3, 2023

@Tpt said:

+1 to @gkellogg's proposal. However, I have one concern:

The restriction would be that if a literal contains both a language tag and a datatype, the data type must be rdf:langString or a subProperty.

This introduces the concept of subDatatype (not subProperty, I guess) inside of the core RDF interpretation and not only in a datatype-aware interpretation. Implementing this constraint in practice seems hard to me (the parsers would need to get access to a datatype hierarchy...). Or we can see it as a "deductive" constraint i.e. if a datatype is used with a language tag then it is a sub datatype of rdf:langString in some interpretations. But this creates a distinction with respect to the literals without language tags where all datatypes are not required to be sub datatypes of xsd:string even if xsd:string is the default datatype. What about just stating that rdf:langString is the default datatype when a language tag is present just like xsd:string is the default datatype when there is no language tag?

It seems that rdf:langString is a subclass of rdfs:Literal, as are all the other vocabulary terms suitable for use as a datatype. Not sure what the appropriate relationship between rdf:langString and the potential sub-types I discussed would be. Perhaps rdfs:subClassOf?

@afs said:

I'd like to see the range of possibilities enumerated.

For example - an extension to language tags (which reduces the impact of literals now having a lang tag and a datatype - something that can break toolkits (an issue considered at RDF 1.1)). Concretely - what about the script and variant subtags?

The JSON-LD solutions cover transmission of information about text which is the basic and important task.

One question to address is does this need to be defined as a conceptual change to RDF? The compound literal approach does not; a common vocabulary ("rdf:") would be useful.

Yes, if we were to say anything it would be that literals with a language tag are more specifically enumerated:

A literal in an RDF graph consists of two or three elements:

  • ...
  • if and only if the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString or a related datatype defined in the rdf vocabulary (or words to that effect), a non-empty language tag as defined by [BCP47]. The language tag must be well-formed according to section 2.2.9 of [BCP47].

A datatype defines a value space which in turn gives value-equality. Is RTL("ABC") the same value as LTR("CBA")? As an unqualified string? What about LANGMATCHES?

Yes, these are good questions, as text direction has not been considered in RDF before. Does "foo"@en^^rdf:langStringLTR entail "foo"@en^^rdf:langString? What about "foo"@en^^rdf:langStringRTL? I'm pretty sure that LTR("foo") is not the same as RTL("foo"), it's only for presentation purposes.

XSD datatypes have facets - is that the right approach for this issue? (related to compound literals)

I don't think so, or they have exactly the same facets (ordered, bound, cardinality, numeric), or anyway, XSD doesn't describe such a facet. It's really about signaling the direction to be used by viewers. It would have been great if Unicode could have included this, but it is only considered in a limited way.

Rather, text direction is an additional property of literals having this datatype, used to signal how viewers should display the result, so in that sense, it is an additional facet. HTML also allows dir to be set to auto, so rdf:langString would be treated as having a text direction facet of auto, while the others would fix it to ltr or rtl. Note, however, that there's no way to associate this facet with another datatype, such as xsd:normalizedString, or any other restriction on xsd:string.

The inference considerations are definitely where this gets to be tricky. I don't think we'll be able to solve this until there's a more concerted discussion involving the I18N group. But, in my interpretation, it's a requirement of specs to consider this now.

@afs
Contributor

afs commented Feb 5, 2023

Strings on the Web: Language and Direction Metadata recommends the 4.2 Metadata approach.

This is compatible with RDF 1.1 and RDF 1.0.

:s :title [ rdf:value "مرحبا بالعالم!"@ar ; rdf:direction rdf:rtl ] .

c.f. adding units to literal values.

The section 4.7 Create a new bidi datatype uses only a datatype, not a language tag. (Example 7 shows this. The text isn't completely clear - and it is only considering JSON.) This is compatible with RDF 1.1 and RDF 1.0.

:s :title "مرحبا بالعالم!"^^i18n:ar_rtl .

There is a new possibility in the metadata style (addressing the verbosity concern) using RDF-star annotation syntax.

:s :p "مرحبا بالعالم!"@ar {| rdf:objectTextDirection rdf:ltr |} .

c.f. adding units to literal values.

This is not compatible with RDF 1.1 or RDF 1.0.

(Beware that the "Strings on the Web" document is not completely consistent in the use of terms like "plain literal", which is from RDF 1.0, and mixes it up with rdf:PlainLiteral which should not be found in any RDF 1.1 syntax ("typed literals with rdf:PlainLiteral as the datatype are considered by this specification to be not valid in syntaxes for RDF graphs or SPARQL."))

See also https://w3c.github.io/i18n-discuss/notes/i18n-action-612.html

@gkellogg
Member Author

gkellogg commented Feb 7, 2023

The i18n namespace seemed to have the most favor in the JSON-LD WG, notwithstanding the issues of different representations of the language tag itself. JSON-LD normalizes this when converting to RDF Triples, but that might not be a generally acceptable solution coming from other RDF representations.

The rdf:CompoundLiteral solution had the by-product of introducing a blank node, which many in the JSON-LD WG found problematic. There was some thinking about separating the language tag from the literal value using rdf:language, which is what is specified for the compound literal option, rather than using a language-tagged string that @iherman may recall.

Using an annotation to define this is interesting, and doesn't really create issues with JSON-LD 1.1, as the two different mechanisms explored were experimental/non-normative. However, it can over-conflate the use of annotations, where you might want to both say something about the statement, and the particular object representation.

The proposed solutions from https://w3c.github.io/i18n-discuss/notes/i18n-action-612.html need some work, but are generally consistent with the mechanism I outlined above.

@iherman
Member

iherman commented Feb 7, 2023

My apologies to be a bit on the sideline in the discussions right now; this is due to some personal circumstances. Two comments, though

  1. There was a public discussion on this issue, with the thoughts of creating a separate WG at the time, which also led to a document: https://w3c.github.io/rdf-dir-literal/. The work was never completed, but I think that write-up is still relevant. That document was the starting point, as far as I remember, for the JSON-LD discussion as well.
  2. As the aforementioned document, as well as this thread, shows, there are many possible approaches and none of them without drawbacks. My personal feeling is that the WG should choose one asap, put it into the new RDF draft, and publish a new TR with that approach. The RDF community may (or may not...) react and see if it carries. Otherwise, we are in the danger of dragging this discussion on for a long time...

Personally, the approach proposed in this thread (which is also documented in that draft) seems to be clean and it works. +1 on my side.

@afs
Contributor

afs commented Feb 7, 2023

RDF Literals and Base Directions is still relevant and it describes the value space so some questions can be answered.

I would add that any solution that encodes information in the lexical form (different from the lexical space) has a backwards compatibility problem. What is the length of the string? How much code out there uses the programming language's string length function, the SPARQL example being STRLEN?

If a system has to microparse the lexical form, even if it is not a technical change in RDF syntax, then pre- and post- system will come up with different answers on the same data.
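The compatibility concern can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that direction is encoded by prefixing the lexical form with U+200F (RIGHT-TO-LEFT MARK). The function names are hypothetical, and Python's len stands in for SPARQL STRLEN.

```python
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, used here only as an illustrative in-band marker


def encode_rtl_in_lexical_form(text: str) -> str:
    """Hypothetical encoding: prepend a direction marker to the lexical form."""
    return RLM + text


def strip_direction_marker(lexical_form: str) -> str:
    """What a direction-aware system would do before measuring the text."""
    if lexical_form.startswith(RLM):
        return lexical_form[len(RLM):]
    return lexical_form


plain = "مكتبة"
marked = encode_rtl_in_lexical_form(plain)

# A direction-unaware system counts the marker as an ordinary character.
naive_length = len(marked)
# A direction-aware system microparses the lexical form first.
aware_length = len(strip_direction_marker(marked))

print(naive_length, aware_length)
```

The two systems disagree by one character on the same data, which is the pre-/post- mismatch described above.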

We do have the option of extending language tag syntax (-d-rtl). The Turtle, N-Triples and SPARQL grammars already have a wider syntax rule than BCP47. It is an RDF spec description change but not a parser or internal data model change - old systems/new data are little impacted.

Given the spread of RDF, old systems/new data as well as new systems/old data need to be considered.

@afs
Contributor

afs commented Feb 7, 2023

My apologies to be a bit on the sideline

Not late - it isn't going to be decided on this issue - it needs to go to the wider WG and public WG mailing list at least, the sooner the better.

@domel
Contributor

domel commented Feb 7, 2023

As an aside, it is worth mentioning that in N3, we prefer the i18n namespace solution. There is also the rdf:CompoundLiteral option shown in the Design Patterns section. Note that the N3 spec is still in development.

@pchampin
Contributor

Piggybacking direction metadata on the language tag is attractive for backward compatibility reasons.
BCP47 makes it possible (either by registering an extension or by using the "private use" tags), but remember that it was frowned upon by i18n experts because they consider that this is out of the scope of BCP47.

Now, @afs points out:

We do have the option of extending language tag syntax (-d-rtl). The Turtle, N-triples and SPARQL grammars already have a wider syntax rule than BCP47.

So we might change RDF concepts, by saying "a language tag is a string of the form X[-Y] where X is a BCP47 tag, and where Y is ...", where Y would

  • satisfy the regular expression used by RDF concrete syntaxes (backward compatible),
  • and ideally could not be conflated with a part of the BCP47 tag.

The second point is challenging. If Y was defined to be one of ltr or rtl, they could be confused with an extended language subtag which may one day be registered.

Looking at the BCP47 grammar, there are several options to fall out of BCP47 while remaining in Turtle's regexp:

  • use more than 8 characters (en-ltrxxxxxx)
  • use digits in a tag with less than 5 characters (en-ltr0)

none of this is super user-friendly...
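As a rough illustration of the first option, the sketch below checks a tag against the Turtle 1.1 LANGTAG pattern (without the leading @) and against one necessary BCP47 condition, namely that no subtag exceeds 8 characters. The function names are made up for this example, and the length check is not a full BCP47 well-formedness test: rejecting en-ltr0 would need the complete RFC 5646 grammar.

```python
import re

# Turtle 1.1: LANGTAG ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*  (the '@' is omitted here)
TURTLE_LANGTAG = re.compile(r"[a-zA-Z]+(-[a-zA-Z0-9]+)*")


def accepted_by_turtle(tag: str) -> bool:
    """True if the tag fits the Turtle LANGTAG pattern."""
    return TURTLE_LANGTAG.fullmatch(tag) is not None


def within_bcp47_subtag_length(tag: str) -> bool:
    """Necessary (not sufficient) BCP47 condition: subtags are 1 to 8 characters."""
    return all(1 <= len(subtag) <= 8 for subtag in tag.split("-"))


for candidate in ("ar-eg", "en-ltrxxxxxx", "en-ltr0"):
    print(candidate, accepted_by_turtle(candidate), within_bcp47_subtag_length(candidate))
```

A tag such as en-ltrxxxxxx passes the Turtle pattern but fails the length condition, so it could not collide with a registered BCP47 subtag.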

Or we might just bite the bullet, and use ltr and rtl hoping that nobody will ever register them as extended language tags.

@afs
Contributor

afs commented Feb 11, 2023

Info: Checking RFC 5646, it seems that the script subtag is discouraged where it is unnecessary (section 4.1).

@afs
Contributor

afs commented Feb 11, 2023

To add to the choices:

A language is 2*3ALPHA (RFC 5646). A starting element "d-rtl-" can be used to add the information.
Or "r-" / "l-".

none of this is super user-friendly

agreed - that is a cost with all backwards syntax-compatible solutions.

@gkellogg
Member Author

It doesn't look to me like ^ is allowed in a language tag, so updating the terminal rules for LANGTAG to be something like LANGTAG ::= "@" [a-zA-Z]+ ( "-" [a-zA-Z0-9]+ )* ( "^" ('rtl' | 'ltr'))? would not confuse with any other language tag, and it shouldn't get confused with a datatype IRI. This helps us avoid RFC 5646 altogether.
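The suggested terminal can be tried out as a regular expression. This is only a sketch of the rule as written above, not text from any spec; the names PROPOSED_LANGTAG and split_langtag are illustrative.

```python
import re

# LANGTAG ::= "@" [a-zA-Z]+ ( "-" [a-zA-Z0-9]+ )* ( "^" ('rtl' | 'ltr'))?
PROPOSED_LANGTAG = re.compile(
    r"@(?P<lang>[a-zA-Z]+(?:-[a-zA-Z0-9]+)*)(?:\^(?P<dir>rtl|ltr))?"
)


def split_langtag(token: str):
    """Return (language, direction) for a matching token, or None if it does not match."""
    match = PROPOSED_LANGTAG.fullmatch(token)
    if match is None:
        return None
    return match.group("lang"), match.group("dir")


print(split_langtag("@ar-eg^rtl"))
print(split_langtag("@en"))
print(split_langtag("@en^up"))
```

Because the direction sits behind a character the existing rule never accepts, a token such as @ar-eg^^rtl does not match, which keeps the form distinct from the ^^ datatype separator.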

Trying to maintain compatibility with 1.1 LANGTAG doesn't really help, other than to keep older parsers from having to change, except it now looks like an odd language with text direction encoded, rather than being separated as a different facet. Do we really expect older systems to do the right thing with text direction? Changing the terminal to properly separate them also helps be sure that older systems will not incorrectly parse data that they can't handle properly, which is part of creating an extension point for such features. JSON-LD 1.1 had a similar problem when introducing new features, as we hadn't considered a versioning system in 1.0.

Are we trying to maintain syntactic compatibility with 1.1 languages to shoe-horn in text direction, or taking advantage of the need to revisit the grammars by properly separating the concepts? Considering alternatives:

Original i18n datatype:

  ex:publisher "مكتبة"^^i18n:ar-eg_rtl

Syntactically separate language tag from text direction:

  ex:publisher "مكتبة"@ar-eg^rtl

Combine in language tag:

  ex:publisher "مكتبة"@d-rtl-ar-eg
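To compare the three alternatives, here is a small sketch that reduces each encoding to the same (language, direction) pair. All three helper functions are hypothetical. The i18n form assumes the JSON-LD convention of namespace + language + "_" + direction, and the d-rtl- prefix form is only the option floated in this thread.

```python
I18N_NS = "https://www.w3.org/ns/i18n#"


def from_i18n_datatype(datatype_iri: str):
    """Split an i18n datatype IRI such as ...i18n#ar-eg_rtl."""
    if not datatype_iri.startswith(I18N_NS):
        raise ValueError("not an i18n datatype IRI")
    language, _, direction = datatype_iri[len(I18N_NS):].partition("_")
    return language, direction or None


def from_caret_suffix(langtag: str):
    """Split the proposed ar-eg^rtl form (leading '@' already removed)."""
    language, _, direction = langtag.partition("^")
    return language, direction or None


def from_direction_prefix(langtag: str):
    """Split the d-rtl-ar-eg form, where direction is folded into the tag."""
    for direction in ("rtl", "ltr"):
        prefix = "d-" + direction + "-"
        if langtag.startswith(prefix):
            return langtag[len(prefix):], direction
    return langtag, None


forms = [
    from_i18n_datatype(I18N_NS + "ar-eg_rtl"),
    from_caret_suffix("ar-eg^rtl"),
    from_direction_prefix("d-rtl-ar-eg"),
]
print(forms)
```

All three carry the same information. The difference is where a direction-unaware system sees it: in the datatype IRI, as a parse error, or as part of the language tag.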

@afs
Contributor

afs commented Feb 12, 2023

If we aren't aiming for compatibility, an extended language tag related extension (new syntax) works better.

Use case: skos:prefLabel/skos:altLabel and looking for a unique language for display to the user.

@afs
Contributor

afs commented Feb 12, 2023

Do we really expect older systems to do the right thing with text direction?

It's something to consider. The "right thing" may be passing information along which is doable, at some cost, with all syntax-backwards-compatible solutions.

e.g. Data published end-to-end (the ends being text-direction capable), through systems such as an RDF 1.1 triplestore or validated by RDF 1.1 tools (SHACL). Consider SHACL sh:languageIn and sh:uniqueLang, and SPARQL LANG, LANGMATCHES.
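For the LANGMATCHES case, a sketch of basic filtering (RFC 4647, which SPARQL's langMatches uses) shows how an unmodified RDF 1.1 system would treat direction folded into the tag. The function name lang_matches is illustrative, and the example tags are the candidate forms from this thread, not agreed syntax.

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Basic filtering: case-insensitive equality or prefix match on a subtag boundary."""
    if lang_range == "*":
        # "*" matches any non-empty language tag.
        return tag != ""
    tag = tag.lower()
    lang_range = lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


# Direction appended after the language still matches a query for "ar".
print(lang_matches("ar-eg-d-rtl", "ar"))
# Direction placed in front hides the language from the same query.
print(lang_matches("d-rtl-ar-eg", "ar"))
```

Under these assumptions, only a trailing direction element passes through existing language filters unchanged. A leading element would require queries to be rewritten.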

Depending on the timescale you expect the transition to new syntax to happen, there may be a significant length of time when clients have evolved, but the data path has not depending on industry domains. It may even accelerate the uptake of direction aware client software; evolving client and server at the same time is hard.

The wiki says:

Requires changes to every RDF serialization format (which we're doing anyway).

For quoted triples.

What about systems uninterested in quoted triples but interested in text direction? (A variant of "weak compliance".)

@gkellogg
Member Author

I think a note that the i18n namespace can be used in systems not fully supporting native text direction could be useful, and there could be some entailment described for adding triples from one to another.

However, syntax aside, the abstract syntax description of language-tagged literals would need to be updated to accommodate datatypes other than rdf:langString, which may have its own implications elsewhere. And, if the abstract syntax supports language-tagged literals with varying datatype IRIs, the syntaxes probably need to support using both @ and ^^ to support this (the @ar^rtl micro syntax probably only applies to Turtle-like syntaxes).

Alternatively, if we deem the cost of updating the abstract syntax and related concrete syntaxes to support text direction in the data model too high, that leaves either the @d-rtl-ar or ^^i18n:ar_rtl methods, the latter of which seems to have some uptake. This leaves a gap in folding this into the abstract syntax, making it available to query, and having some equivalence for @d-ltr-en and @en.

I favor taking the plunge to update the abstract and concrete syntaxes to fold this into the data model. I'm wary of an option soup to signal compliance, but the fact that non-compliant systems would fail to parse documents, in whatever form, that specify the text direction as part of a literal is a form of signal. Adding some facet in the test suite would help vendors not fully supporting these features to filter through them. In any case, informative notes for alternative ways of encoding text direction, similar to what's in JSON-LD, would remain useful during a transition period.

@domel
Contributor

domel commented Feb 13, 2023

It seems to me that the option "مكتبة"@ar-eg^rtl is the least painful. Yes, it requires grammar changes (and parsers in various tools). But it has a small overhead, unlike rdf:CompoundLiteral. The i18n namespace, on the other hand, creates inconsistencies in Turtle-like serializations (why do we tag some languages with @ and some languages with ^^?). Even if we reduced (e.g. using the canonical version) everything to the form ^^i18n:XX-YY_ZZZ, it would involve major changes.

@afs
Contributor

afs commented Feb 13, 2023

I favor taking the plunge to update the abstract and concrete syntaxes to fold this into the data model.

What is "this" as far as concrete syntaxes are concerned?

By data model, do you mean the new datatype with value space of pairs or becoming part of section 3.3?

I prefer the "lang + direction" approach as separate aspects of a literal as part of 3.3.
Then consider RDF 1.1 a compatibility feature (a separate Note perhaps).

If @lang^direction goes in 3.3, apps can look for rdf:langString, not a multiplicity of datatypes.

For FPWD, I think we should say "we're working on it - see issue 9".

@afs
Contributor

afs commented Feb 13, 2023

Generally, i18n namespace seemed to be more favored, but this was trying to fit RDF 1.1 constraints.

The "but ... fit RDF 1.1 constraints." didn't make it to the wiki/doc.

@gkellogg
Member Author

"but ... fit RDF 1.1 constraints." didn't make it to the wiki/doc.

We seem to be having the active discussion here, and I think that once we've reached some consensus we can synchronize that page.

I favor taking the plunge to update the abstract and concrete syntaxes to fold this into the data model.
What is "this" as far as concrete syntaxes are concerned?

By data model, do you mean the new datatype with value space of pairs or becoming part of section 3.3?

I was thinking of the additional datatypes that would describe language-tagged strings with text direction as having the least impact on other implementations.

I prefer the "lang + direction" approach as separate aspects of a literal as part of 3.3.
Then consider RDF 1.1 a compatibility feature (a separate Note perhaps).

By this I presume you mean that a language-tagged string might have a fourth element, in addition to the language tag, and the wording might be something like the following:

If the literal is a language-tagged string, then the literal value is a tuple consisting of its lexical form, its language tag, and optionally its text direction, in that order.
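A minimal sketch of that wording as a data structure, assuming direction is an optional third component of the value and that term equality is plain component-wise comparison. The class name LangLiteral is made up for illustration, and whether an undirected literal should equal a directed one is exactly the open question raised earlier. Here they are simply treated as different.

```python
from dataclasses import dataclass
from typing import Optional

LANG_STRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


@dataclass(frozen=True)
class LangLiteral:
    """Hypothetical language-tagged string with an optional text direction."""
    lexical_form: str
    language: str
    direction: Optional[str] = None
    datatype: str = LANG_STRING  # stays rdf:langString regardless of direction

    def __post_init__(self):
        if self.direction not in (None, "ltr", "rtl"):
            raise ValueError("direction must be 'ltr', 'rtl' or absent")

    def value(self):
        """Tuple of lexical form, language tag, and (if present) text direction."""
        if self.direction is None:
            return (self.lexical_form, self.language)
        return (self.lexical_form, self.language, self.direction)


plain = LangLiteral("foo", "en")
directed = LangLiteral("foo", "en", "ltr")
print(plain.value(), directed.value(), plain == directed)
```

Keeping a single datatype means applications can keep testing for rdf:langString, at the cost of every store having to carry one more field per literal.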

If @lang^direction goes in 3.3, apps can look for rdf:langString, not a multiplicity of datatypes.

This can indeed work, but may have more impact on triple stores than extending the datatype. In the end, I'm agnostic as to which approach is better.

For FPWD, I think we should say "we're working on it - see issue 9".

We already do, see https://w3c.github.io/rdf-concepts/spec/#issue-container-number-9.

@afs
Contributor

afs commented Feb 13, 2023

We already do, see https://w3c.github.io/rdf-concepts/spec/#issue-container-number-9.

It is a list of two options. The text above is i18n only.
It was in a PR "Errata and boilerplate with reverted submodule".

On this issue, i18n is said to be "experimental" and "trying to fit RDF 1.1 constraints".
In the spec text, the section "Internationalization Considerations" does not mention "experimental" or "RDF 1.1 constraints".

Up-issue, LANGTAG ::= "@" [a-zA-Z]+ ( "-" [a-zA-Z0-9]+ )* ( "^" ('rtl' | 'ltr'))?
Where's that gone?

For me, i18n datatypes are the least attractive option for a permanent solution because they cut literals with text direction off from current language tags (noting that questions about the effect on LANG and LANGMATCHES have not been responded to). We then need to work on how to make uses like skos:prefLabel work without having users write code for both cases.

@gkellogg
Member Author

I’ll update the content to just refer to the issue without describing the alternatives in the body.

@pfps
Contributor

pfps commented Feb 16, 2023

Yes, displaying text that has bits with different directional characteristics can be difficult, but RDF language-tagged strings don't allow internal markers. The question is whether having an external direction marker changes the meaning of the string. If not, then direction markers are outside the scope of RDF language-tagged strings.

@r12a

r12a commented Feb 16, 2023

Yes, when information is eventually displayed to a human. If not accompanied by direction metadata, it can result at best in confusion and difficulty, and at worst in incorrect meanings. Mati (who is Israeli) gave one example. Here's another:

A MAC address that should be read:
[image: mac-correct]

will appear as:
[image: mac-incorrect]

in an RTL context without accompanying directional metadata. This is particularly problematic because there are no clues to indicate that it is actually incorrect.

@matial

matial commented Feb 16, 2023 via email

@pfps
Contributor

pfps commented Feb 16, 2023

I'm not saying that there are not lots of complexities in correctly displaying text. I'm just saying that a simple ltr or rtl around strings that are not parts of larger text are not helpful and likely to be misunderstood. After all, all that they can produce are the two outputs I provided - a left-to-right output and a right-to-left output.

As all that language-tagged strings in RDF provide is strings in a single language, there doesn't seem to be any utility in providing an ltr or rtl flag. If one wants to display multi-language text, or even text in a single language that has opposite-direction parts, a more complex mechanism is needed.

As far as "logical" goes, I see 'logical order' in the Unicode documents but nowhere do I see '"logical" order'. Enclosing a word in double quotes without any indication of what the quoting means can have negative connotations, so much so that there is even a phrase for the practice - scare quotes. https://en.wikipedia.org/wiki/Scare_quotes

@TallTed
Member

TallTed commented Feb 17, 2023

Largely because it covers many of the questions and arguments raised above, I give you Unicode, Inc.'s Writing Direction and Bidirectional Text FAQ which itself references relevant W3C tutorials and articles.

As with many topics that appear simple at first glance, this is actually a very complicated subject, with complicated answers. @pfps is correct in asserting that "a simple ltr or rtl around strings that are not parts of larger text are ... likely to be misunderstood" though I disagree with his unqualified assertion that such "simple ltr or rtl" are "not helpful", as for single words or simple phrases, they can be very helpful.

Some "strings" in RDF are such single words or simple phrases, and the LTR vs RTL question has been largely considered irrelevant, in no small part, in my opinion, because people who use RTL languages are already used to being treated as second-class (if that) in the mostly English and Western European LTR Internet "world" of the "World Wide Web".

However, many RDF "strings" are actually full or even multiple paragraphs (which is one of the reasons Turtle makes it easy to embed new lines, though there remains a prejudice against "open" spacing — that is, displaying a larger space between paragraphs than between lines — in Web-involved rendering, even though typesetting examples over hundreds of years show that either a first-line indent or such a larger vertical gap are more than common, and help greatly with readability).

It does appear that (as suggested by @iherman) rdf:html may be the best current solution for such human-text literals, which will require some additional work as it was non-normative at the time of RDF 1.1 (circa 2014-02-25) because W3C DOM4 was not yet final but WHATWG's DOM is now a normatively-citable "Living Standard — Last Updated 5 February 2023". (Note that there are 2 URIs in the preceding, one in w3.org and one in whatwg.org, that lead to the same URL in whatwg.org).

@afs
Contributor

afs commented Feb 18, 2023

The issue marker in the spec just draws off this issue. So, changing the text of the initial comment will update what is shown in the spec.

I've updated the issue description. Hopefully, that will make it into the online draft.

@gkellogg gkellogg added the spec:substantive Change in the spec affecting its normative content (class 3) –see also spec:bug, spec:new-feature label Mar 16, 2023
@gkellogg gkellogg added the needs discussion Proposed for discussion in an upcoming meeting label Mar 31, 2023
@afs
Contributor

afs commented Apr 27, 2023

Drawing on the discussion here, I've put a proposal on the WG wiki. Any work on this area will affect several documents. The idea is to get WG agreement for work on the area, which is easier with a concrete design to focus on, before investing time on text across several documents.

https://github.com/w3c/rdf-star-wg/wiki/Text-Direction-Proposal

I'll edit the wiki page to keep it current as discussion here happens.

@pfps
Contributor

pfps commented Apr 28, 2023

I'm trying to figure out what the benefits are of adding a fourth element to language-tagged literals. I wasn't coming up with anything, so I decided to write down some of my questions and what I came up with as answers. This didn't help me see benefits, so I'm putting them into this discussion.

TL;DR: I don't see that adding a text direction to language-tagged literals achieves anything significant and there are better ways to represent text direction in RDF.

Q: What problem is adding a direction marker to RDF language-tagged strings trying to solve?

Information about bi-directional display of text within RDF.

Q: Who is requiring that this problem be solved?

Unknown.

Q: Should RDF solve this problem?

Unknown.

Q: Is this problem within the scope of the working group?

No. "Adding other improvements or extensions to RDF or SPARQL" is explicitly outside the scope of the working group.

Q: RDF is about meaning. Does a direction marker change the meaning of a string, divorced from any display considerations?

No.

Q: Does adding a direction marker to RDF language-tagged strings solve the problem?

No. Bi-directional display of strings requires changing direction within a string.

Q: Does RDF already have facilities for solving the problem?

Yes. RDF has rdf:HTML and HTML has the 'dir' attribute. The proposal already suggests using rdf:HTML.

Q: Will documents have to change?

Many working group documents will have to change: Concepts, syntaxes, Semantics, Query, and more.

Q: Will implementations have to change?

Yes. Implementations of almost every part of RDF and SPARQL will have to change, from syntax to semantics to storage to querying to update.

Q: Will applications have to change?

Yes. Text direction will affect the results of SPARQL queries.

Q: Are there better ways of solving the problem?

Yes. If rdf:HTML is deemed unsuitable it is possible to create a vocabulary for text direction or a datatype for text that includes direction markers.

@afs
Contributor

afs commented Apr 30, 2023

@pfps: We are not operating in isolation. The context has changed from RDF 1.1:

  • JSON-LD 1.1 already has @direction. In the mapping to RDF, both compound literals (vocabulary) and datatype are non-normative features developed in the context of RDF 1.1.
  • The need for initial direction of text is described on the i18n pages.

If the approach is the option that extends rdf:langString, the changes to RDF Semantics should be localized to those places already covering rdf:langString. That is an argument in favor of that option.

The proposal already suggests using rdf:HTML.

for more detailed control of displayable content. Not all display is HTML; if the app wants detailed control, it would be output format specific.

Initial text direction is carrying information around separate from output format.

Q: Will applications have to change?
Yes. Text direction will affect the results of SPARQL queries.

Any data that includes text direction information affects application use of SPARQL; there is an impact from the non-syntax approaches as well.

With new datatype-based approaches, there would be two different ways to have language information. In some variations, the length of a string is changed.

With compound literals, finding blank nodes where a literal term is expected (cf. SHACL) is an impact on applications, and in SPARQL, following the extra triples means queries need to be modified.

So staying within RDF 1.1 can have more impact on applications.
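
To make the comparison concrete, here is a minimal Python sketch of the two RDF 1.1-compatible encodings that JSON-LD 1.1 describes for a string with a language and a direction: the i18n-namespace datatype and the compound literal. Triples and literals are modelled as plain tuples; the helper names and the `_:b0` blank node label are illustrative assumptions, not part of any library, and the sketch does not cover details such as language tag normalization.

```python
I18N = "https://www.w3.org/ns/i18n#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def i18n_datatype_literal(value, language, direction):
    """Return (lexical form, datatype IRI); language and direction live in the IRI."""
    return (value, f"{I18N}{language}_{direction}")


def compound_literal(bnode, value, language, direction):
    """Return the extra triples that describe the string via a blank node."""
    return [
        (bnode, RDF + "value", value),
        (bnode, RDF + "language", language),
        (bnode, RDF + "direction", direction),
    ]


literal = i18n_datatype_literal("مرحبا", "ar", "rtl")
triples = compound_literal("_:b0", "مرحبا", "ar", "rtl")
```

With the first encoding the literal is a datatyped literal with no language tag, so a query that relies on the language tag no longer matches it. With the second, the object of the original triple is a blank node and a query has to follow three extra triples to recover the string.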

@gkellogg
Member Author

@pfps said:

Q: Who is requiring that this problem be solved?

Unknown.

Any person or group that regularly deals with presentation of text in different directions. Quite a bit of the world, actually.

Q: Should RDF solve this problem?

Unknown.

RDF is used as a data representation format that is often used for creating user interfaces; probably more with JSON-LD than other formats. Allowing the initial text direction to be added as a literal facet (of some kind) preserves this information that is important when presenting this data back to users.

Q: Is this problem within the scope of the working group?

No. "Adding other improvements or extensions to RDF or SPARQL" is explicitly outside the scope of the working group.

This is subject to interpretation, and arguably necessary to address reasonable (and long-standing) internationalization considerations that were not obvious or not given enough weight during previous design cycles.

Q: RDF is about meaning. Does a direction marker change the meaning of a string, divorced from any display considerations?

No.

Given that without the proper use of the initial text direction a presentation would be at least confusing, if not harmful, then I think it absolutely changes the meaning of the string.

Q: Does adding a direction marker to RDF language-tagged strings solve the problem?

No. Bi-directional display of strings requires changing direction within a string.

Unfortunately, BiDi didn't go far enough, and while you can signal text direction change within a string, it does not work properly at the beginning of the string. Things would be so much easier (in retrospect) had Unicode supported this.
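
As an illustration of why the start of the string is the weak spot, here is a simplified Python sketch of the "first strong character" heuristic that a consumer falls back on when no base direction is supplied. It uses the standard library's `unicodedata.bidirectional`; the function name is hypothetical, and the sketch deliberately ignores directional isolates and other details of the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def first_strong_direction(text):
    """Guess 'ltr' or 'rtl' from the first strongly directional character.

    Returns None when the text has no strong character (for example, digits only).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


# A Hebrew sentence that happens to begin with a Latin-script word.
guess = first_strong_direction("HTML היא שפת סימון")
```

Here `guess` is `"ltr"` even though the sentence is Hebrew and should be laid out right-to-left, which is the kind of case where an explicit base direction carried with the literal would help.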

Q: Does RDF already have facilities for solving the problem?

Yes. RDF has rdf:HTML and HTML has the 'dir' attribute. The proposal already suggests using rdf:HTML.

JSON-LD introduced the i18n namespace to address this issue, and it seems to have had some uptake in the community. The problem is that it does not work properly with the language facet of a literal as it encodes both language and text direction as a datatype.

Q: Will documents have to change?

Many working group documents will have to change: Concepts, syntaxes, Semantics, Query, and more.

Definitely a big consideration.

Q: Will implementations have to change?

Yes. Implementations of almost every part of RDF and SPARQL will have to change, from syntax to semantics to storage to querying to update.

That could depend on the nature of the change. But, if done as a separate facet, or as a sub-type of rdf:langString, this would have implications for every implementation that wants to fully conform.

Q: Will applications have to change?

Yes. Text direction will affect the results of SPARQL queries.

Applications that do not currently consider text direction may have no need to adapt if that is not something important to them.

Q: Are there better ways of solving the problem?

Yes. If rdf:HTML is deemed unsuitable it is possible to create a vocabulary for text direction or a datatype for text that includes direction markers.

Considered and rejected previously by the JSON-LD WG for a number of reasons. rdf:HTML is a structured format that can represent a string with other HTML attributes, but it is not suitable in all cases, and conflates a structured value with a faceted string. Not all systems that display text use HTML (some e-readers, I believe), so not even a complete solution for the simple presentation of text.

@pfps
Contributor

pfps commented Apr 30, 2023

Q: Who is requiring that this problem be solved?
Unknown.

Any person or group that regularly deals with presentation of text in different directions. Quite a bit of the world, actually.

This can't be the case. If RDF isn't being used by them then there is no problem to be solved.

I'm now unclear as to what the problem is that is supposed to be solved. Is it providing information about display of text in general, as it appears to be from the answer above? Or is the problem something different?

@TallTed
Member

TallTed commented May 1, 2023

I am concerned that the conversation @pfps had with himself is naturally skewed to English users, who are generally over-represented in Internet and Web technologies.

If RDF isn't being used by them then there is no problem to be solved.

Or perhaps the reason they are not using RDF is that there is a problem which is only surmountable by addressing this issue in a suitably generalized fashion.

I think we need more substantial input (optimally, full participation) from someone(s) who use rtl or otherwise non-ltr languages, whether those folks can be recruited from existing i18n groups or elsewhere.

@afs
Contributor

afs commented May 11, 2023

At the RDF star telecon (2023-05-11) https://www.w3.org/2023/05/11-rdf-star-minutes.html#r02

RESOLUTION: Accept RDF-Concepts issue "text direction" w3c/rdf-concepts#9

@strogonoff

I think we need more substantial input (optimally, full participation) from someone(s) who use rtl or otherwise non-ltr languages

I do, though so far occasionally as a learner.

I want to express support for https://w3c.github.io/rdf-dir-literal/#script-subtag.

I fail to see how writing direction can be divorced from writing system, which can already be specified as part of the language tag by way of script subtag.

Furthermore, I am not sure whether the inclusion of @direction in JSON-LD should be a factor; JSON-LD is just a way of serializing RDF concepts and as such should follow, not lead.

Or perhaps the reason they are not using RDF is that there is a problem which is only surmountable by addressing this issue in a suitably generalized fashion.

If I understood that sentence correctly… There are many reasons that make RDF daunting: a graph is not how most people intuitively tend to think about information and knowledge (legitimate barrier), the tooling is lacking (solvable with time and effort), but inability to specify text direction does not strike me personally as one.

I am somewhat enthusiastic about adopting RDF/JSON-LD in some tooling I work on. I believe RDF tooling could take into account the script when rendering data and facilitate script selection at the authoring stage, whereas allowing mixed signals (specifying a writing system through language tag and then giving a contradictory direction marker) seems liable to introduce more uncertainties as far as tooling implementation.
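
A rough Python sketch of the script-subtag idea, to show what tooling would have to do. The script tables are small illustrative samples, not a complete registry, and the function names are hypothetical. Note that the sketch returns nothing for a tag such as `ar` that carries no explicit script subtag, so a real implementation would also need default-script data for each language.

```python
# Illustrative, non-exhaustive samples of script subtags.
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa", "Nkoo", "Adlm"}
LTR_SCRIPTS = {"Latn", "Cyrl", "Grek"}


def script_subtag(tag):
    """Return the 4-letter script subtag of a language tag, or None if absent."""
    subtags = tag.split("-")
    for sub in subtags[1:]:  # skip the primary language subtag
        if len(sub) == 1:  # singleton: extension or private-use section starts
            break
        if len(sub) == 4 and sub.isalpha():
            return sub.title()
    return None


def direction_from_tag(tag):
    """Map a language tag to 'rtl', 'ltr', or None when the script is unknown."""
    script = script_subtag(tag)
    if script in RTL_SCRIPTS:
        return "rtl"
    if script in LTR_SCRIPTS:
        return "ltr"
    return None
```

For example, `direction_from_tag("az-Arab")` gives `"rtl"` and `direction_from_tag("az-Latn")` gives `"ltr"`, while `direction_from_tag("ar")` gives `None`.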

@gkellogg gkellogg changed the title from "text direction" to "base direction" Jun 27, 2023
@gkellogg gkellogg added the discuss-f2f Proposed for discussion during the next face-to-face meeting label Sep 5, 2023
@ktk ktk removed the discuss-f2f Proposed for discussion during the next face-to-face meeting label Oct 3, 2023
@gkellogg gkellogg removed the needs discussion Proposed for discussion in an upcoming meeting label Oct 9, 2023
gkellogg added a commit that referenced this issue Oct 13, 2023
* Add **base direction** as a fourth element of literals. For #9.
* Add a note on UAX9 determining a default text direction.
* Define "directional language-tagged string".
* Indicate that a plain literal has no explicit base direction, in addition to having no datatype or language tag.
* Remove suggestion to format language tags based on BCP47 rules and for comparing language tags after normalizing to lower case.
* Apply suggestions from I18N review
* Unrelated change not on rdf:HTML and rdf:XMLLiteral datatypes being definitions.

---------

Co-authored-by: Andy Seaborne <andy@apache.org>
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Co-authored-by: Pierre-Antoine Champin <github-100614@champin.net>
Co-authored-by: Addison Phillips <addisonI18N@gmail.com>