Add base direction as a fourth element of literals. #48
(Bike shedding!)
I wonder whether the term "Text Direction" is appropriate. The value of the fourth element does not fully define the text direction, because that may also depend on the specific Unicode characters in the text and may even vary within the text in the bidirectional case.
Elsewhere we used the term "base direction". The HTML specification uses the term "text directionality".
My personal choice is to align with the HTML terminology.
CC: @r12a
See https://github.com/w3c/rdf-star-wg/wiki/Text-Direction-Proposal which uses "initial text direction" - does that capture your point @iherman?
If text direction is to be supported in RDF, let's support it in a general way. Proposals for adding an initial text direction do not meet this requirement, as correct rendering of bidirectional text needs internal direction markers. If there is going to be a partial solution provided in RDF, then I feel that it needs to be backward compatible with existing RDF systems. One way to do this is to use '-x-ltr' and '-x-rtl' at the end of the language tag. This produces valid language tags and is backwards compatible.
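To make the private-use suffix idea concrete, here is a minimal Python sketch of how a consumer might split such a tag. The function name `split_direction` and the decision to match the suffix case-insensitively are assumptions for illustration only; nothing here is defined by RDF or by the proposal.

```python
from typing import Optional, Tuple

# Suffixes taken from the comment above; hypothetical, not part of any RDF spec.
_DIRECTION_SUFFIXES = {"-x-ltr": "ltr", "-x-rtl": "rtl"}


def split_direction(lang_tag: str) -> Tuple[str, Optional[str]]:
    """Return (language tag without the suffix, direction or None)."""
    lowered = lang_tag.lower()
    for suffix, direction in _DIRECTION_SUFFIXES.items():
        if lowered.endswith(suffix):
            # Keep the original casing of the remaining tag untouched.
            return lang_tag[: -len(suffix)], direction
    return lang_tag, None
```

A system that knows nothing about the suffix would simply treat `ar-x-rtl` as an ordinary language tag, which is the backward-compatibility argument made above.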
Maybe, but it is not ideal. "Initial" suggests some sort of an ordering in time. Although I realize that I am on a slippery slope in terms of English terminology... I am a bit worried by getting into unnecessary bike-shedding; the reason I proposed to pick the term used by the HTML spec is to avoid that... Part of the community has already picked that term, I am not sure if it is worth picking our own for something that is essentially the same.
spec/index.html (outdated)
the two <a>datatype IRIs</a>,
the two <a>language tags</a> (if any), and
the two <a>text directions</a> (if any)
compare equal, character by character.
Aside: for language tags, it's case insensitive.
That's a decision from outside RDF - for text direction we can restrict to lower case.
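As a rough illustration of the equality rule in the snippet above, a four-element literal could be modelled like this in Python. The class and field names are hypothetical, and restricting the direction to lower-case `ltr`/`rtl` follows the suggestion in this comment rather than any agreed spec text.

```python
from dataclasses import dataclass
from typing import Optional

RDF_LANG_STRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


@dataclass(frozen=True)
class Literal:
    """Hypothetical four-element literal; not taken from any RDF library."""

    lexical_form: str
    datatype_iri: str
    language_tag: Optional[str] = None
    base_direction: Optional[str] = None  # assumed to be "ltr" or "rtl"

    def __post_init__(self) -> None:
        # Only lower-case direction values are accepted in this sketch.
        if self.base_direction not in (None, "ltr", "rtl"):
            raise ValueError(f"unsupported direction: {self.base_direction!r}")


# The generated __eq__ compares all four fields with plain string equality,
# i.e. character by character, so tags differing only in case are unequal.
same = Literal("abc", RDF_LANG_STRING, "en") == Literal("abc", RDF_LANG_STRING, "EN")
```

Under this strict reading `same` is `False`, which is the behaviour the rest of this thread goes on to question.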
We previously say that the value space of language tag is lower case, is it redundant to say to use a case insensitive comparison here?
is it redundant to say to use a case insensitive comparison here
I think it conforms to Postel's Law, will clearly reflect user intent, and will be better than case sensitivity-based errors.
I was jumping to conclusions. Some investigation: it isn't as simple as "case insensitive".
The text above this paragraph says "MAY lower case" the concrete string that is the language tag.
The text here is about term-equality and does not say it is a value-space comparison. (FWIW "value space" for language tags is a bit meaningless - "value spaces" involve datatypes but we are where we are.)
The RDF 1.1 text:
the two language tags (if any) compare equal, character by character
does allow "abc"@en and "abc"@EN as different terms, whether that was intended or not.
The root problem is that RDF has not used the canonical form for language tags. Some users do care about this.
At users' request Jena has options to leave as-is, always lower-case and always canonicalization.
Maybe better to leave it as "character by character" because otherwise it is an implementation change.
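The three handling options mentioned above can be sketched as plain Python functions. This is not Jena's API. The function names are made up, and "canonicalization" is assumed here to mean only the casing conventions of RFC 5646 section 2.1.1, whereas real canonicalization covers more (for example deprecated subtags).

```python
def as_is(tag: str) -> str:
    """Leave the language tag exactly as written."""
    return tag


def lower_case(tag: str) -> str:
    """Always lower-case the language tag."""
    return tag.lower()


def bcp47_case(tag: str) -> str:
    """Apply a simplified version of the RFC 5646 section 2.1.1 casing."""
    parts = []
    seen_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        # The first subtag and anything after a singleton stay lower case.
        plain = index == 0 or seen_singleton
        if not plain and len(subtag) == 2:
            parts.append(subtag.upper())        # e.g. region: "us" -> "US"
        elif not plain and len(subtag) == 4:
            parts.append(subtag.capitalize())   # e.g. script: "hant" -> "Hant"
        else:
            parts.append(subtag.lower())
        if len(subtag) == 1:
            seen_singleton = True
    return "-".join(parts)
```

With "character by character" term equality, which of these policies a system applies on input decides whether `"abc"@en` and `"abc"@EN` end up as one term or two.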
Hm, yes, I see what you mean. Accidental inconsistency is a rarely vanquished hobgoblin, especially across specs developed separately over years. We have our work cut out for us, in trying to bring consistency to all these docs that we're simultaneously trying to upgrade/update.
"Character by character" is at least clear, and we can include a note that advises deployers of the potential need to enact Jena's options — i.e., keep original langtag casing, make all langtags lower-case, or (whatever you meant by "always canonicalization").
We need something to reconcile the notions of a lower-case value (space) for the language-tag and the fact that it's compared character by character (code point by code point?). Are there systems where "abc"@en and "abc"@EN are not considered the same term?
It might say something like "two language tags (if any) compare equal after normalizing to lower case".
The sentence "The value space of language tags is always in lower case" might be changed to "The value of language tags is always treated as being in lower case".
We should avoid "value" because that is about datatype literals.
"compare equal as if normalized to lower case", which does not imply they are converted (the earlier text is MAY).
Are there systems where "abc"@en and "abc"@EN
Jena keeps them apart but they are the same term (yes - that's contradictory).
It's about meeting the user expectation of round-trip with no change.
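A small Python sketch of the "compare equal as if normalized to lower case" wording, under the assumption that the stored tag is never rewritten so that round-tripping still works. `LangLiteral` and `same_term` are illustrative names, not from any existing library.

```python
from typing import NamedTuple


class LangLiteral(NamedTuple):
    lexical_form: str
    language_tag: str  # stored exactly as the user wrote it


def same_term(a: LangLiteral, b: LangLiteral) -> bool:
    """Compare as if both tags were lower case, without converting either."""
    return (
        a.lexical_form == b.lexical_form
        and a.language_tag.lower() == b.language_tag.lower()
    )


lower = LangLiteral("abc", "en")
upper = LangLiteral("abc", "EN")
```

Here `same_term(lower, upper)` holds while `upper.language_tag` is still `"EN"`, so serializing the literal gives back what the user supplied.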
Yes, it's tricky to get the right terminology. When we talked to i18n, "text direction" was pointed out as suggesting "all this text" whereas this is meant as in "default". In the "פעילות הבינאום, W3C" example there are multiple directions. HTML also has
Yes :-)
@pfps In https://github.com/w3c/rdf-star-wg/wiki/Text-Direction-Proposal the proposed syntax is
RDF 1.2 gives us the opportunity for a syntax change while previous work has operated within the confines of RDF 1.1. Being separate:
@afs said:
Sorry, there were a number of documents on (initial) text direction, and I based the PR off of https://github.com/w3c/rdf-star-wg/blob/main/docs/text-direction.md. We can reconcile the differences. I'm fine with "initial text direction", as that gets to the intent of the element. "Text Directionality" may have a subtly different meaning, as it describes the behavior of a display element, not a property of the text, but we can continue to discuss terminology either in this PR, or subsequently.
@pfps said:
That's not my understanding of how bidirectional text works in Unicode. According to Unicode Bidirectional Algorithm basics, each character already has its own directionality encoded; it is only where character classes are mixed that there is no a priori way of knowing how to begin rendering the text. Once the initial direction is set, the Unicode algorithms handle any subsequent change in direction. That document uses the term "base direction", so perhaps that would be a better term than "initial text direction" or "text directionality".
RDF Literals and Base Directions did explore extending the language tag, but that approach was ultimately rejected. See 2.1.1 Extend language tag for a discussion.
https://github.com/w3c/rdf-star-wg/wiki/Text-Direction-Proposal is a write up based on issue #9.
Co-authored-by: Andy Seaborne <andy@apache.org>
Applied some of @afs's suggestions, leaving the others for now pending further discussion.
I am also fine with "base direction". Actually, when writing up my comment, my initial instinct was to propose that term but (to my surprise) that is not the term used by the HTML standard, and that is why I fell on the "directionality" side. (No idea why that term was chosen for HTML.) Either way is fine with me.
@gkellogg In https://www.w3.org/International/articles/inline-bidi-markup/ there are examples of strings that need embedded markup for correct rendering. My takeaway is that a solution that only provides a language tag and a base direction is insufficient. The worst situation, I think, is including identifiers using strong ltr characters in rtl text, as in "[ARABIC TEXT] A7, B8, X" where the order of the identifiers is reversed from its correct order if there is no embedded markup. Note that the language of this text is entirely Arabic - the identifiers are not English or any other language that uses ltr display. I include an example with rtl identifiers inside ltr script. Here are two identifiers using Hebrew script בבב, אא. The first is בבב the second is אא
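One way that embedded direction information can travel inside a plain string is the Unicode isolate controls (U+2066 to U+2069). The Python sketch below wraps each identifier in an isolate. The helper names are hypothetical, and whether the result displays in the intended order depends on the renderer supporting isolates as described in UAX #9.

```python
from typing import Iterable, Optional

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction: Optional[str] = None) -> str:
    """Wrap text in an isolate; with no direction, use first-strong (FSI)."""
    opener = {"ltr": LRI, "rtl": RLI, None: FSI}[direction]
    return f"{opener}{text}{PDI}"


def join_identifiers(
    identifiers: Iterable[str], direction: Optional[str], separator: str = ", "
) -> str:
    """Isolate each identifier so that neighbours do not merge into one run."""
    return separator.join(isolate(item, direction) for item in identifiers)
```

For the Arabic example above this would be `join_identifiers(["A7", "B8", "X"], "ltr")`, with the result appended to the Arabic text. A base direction on the literal alone carries none of this per-identifier information.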
"base direction" works for me.
I am also good with "base direction". I don't like
How is -x- broken?
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Do we need separate
I was going to include it in the first slot proposed for I18N, along with the Unicode cleanup (if necessary) and discussion of BCP47 case sensitivity when it comes to literal equality, and thus if triples differing in case are the same, or not. I'm separately working on slides for this section.
Reviewed in the I18N TPAC meeting as prep for tomorrow. Some comments included.
spec/index.html (outdated)
<p class="note">The absence of a <a>base direction</a> does not necessarily imply that
the text has no initial text direction;
as described in [[[?UAX9]]],
strings may be embedded within structures which establish an <em>embedding direction</em>,
which determines the default bidirectional orientation of text.</p>
This is slightly misleading. The bidi algorithm determines the base direction in any case. And "embedding" is an overloaded term in the bidi algorithm (strings can be "embedded", but "embedding" in bidi refers to stacking bidirectional states...)
I'm not sure what the note is trying to convey. Are you trying to say "if the direction is not provided as metadata, the string can still be rendered"? Generally, what we say is either (a) when there is no base direction provided for a given string, the auto (first-strong detection) direction should be used; or (b) when the base direction is not provided, the direction of the enclosing document (or content??) is used.
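Option (a) can be illustrated with a short Python sketch built on the standard `unicodedata` module. It is a simplification of UAX #9 rules P2 and P3: it does not skip characters inside isolates, and the function names and the `fallback` parameter are assumptions made for this example.

```python
import unicodedata
from typing import Optional


def first_strong_direction(text: str) -> Optional[str]:
    """Return "ltr" or "rtl" from the first strong character, else None."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None  # only digits, punctuation, whitespace, and the like


def resolve_base_direction(
    text: str, declared: Optional[str] = None, fallback: str = "ltr"
) -> str:
    """Prefer declared metadata, then first-strong detection, then a fallback."""
    if declared is not None:
        return declared
    return first_strong_direction(text) or fallback
```

Option (b) would correspond to passing the enclosing document's direction as `declared`, or as `fallback` if it should only apply when the string contains no strong characters.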
e3dc44d to 811812d
Rebased, after merging in #59.
This was discussed during the TPAC 2023 meeting:
7dab0b4 to 660dca1
* Apply suggestions from I18N review
* Unrelated change not on rdf:HTML and rdf:XMLLiteral datatypes being definitions.
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Co-authored-by: Addison Phillips <addisonI18N@gmail.com>
3cdc417 to f1b884f
Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
This addresses the conceptual representation of text direction in RDF language-tagged literals, based on discussion in:
Fixes #9.
For discussion:
* directional language-tagged string with rdf:dirLangString datatype, or extend the notion of language-tagged string and continue to use rdf:langString.
* text direction – From previous proposals
* base direction – From Unicode bidi basics
* initial text direction – From Wiki
* text directionality – From HTML
(As the infrastructure seems to be having issues, you can also view the document via GitHack.)