Write canonical NTriples 1.1 by default #35

plasticfist · 2021-12-31T00:48:39Z

(Edited) The output does not appear to be UTF-8. Is this a bug? I expected UTF-8 to be the default, given that there is an option to "Write ASCII output if possible".

Example:

source triple from dbpedia/article-templates_lang=en_nested.ttl
<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang=en_nested.ttl
article-templates_lang=en_nested.ttl: UTF-8 Unicode text

serdi output:
<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang=en_nested-serdi.nt
article-templates_lang=en_nested-serdi.nt: ASCII text, with very long lines

apache jena riot output:
<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang=en_nested.ttl.bz2-riot.nt
article-templates_lang=en_nested.ttl.bz2-riot.nt: UTF-8 Unicode text

Spec Reference:
https://www.w3.org/TR/n-triples/#canonical-ntriples

Note: At first I thought this might be a BOM-related rendering/display issue, but `file` would reveal a BOM if there were one, and the same tools were used to find and display all of the examples above.


drobilla · 2022-01-06T21:36:36Z

This is a holdover from back in the day when NTriples was ASCII. serd now supports RDF 1.1 NTriples, which is UTF-8, but the command-line tool behaviour is still the same. The upcoming major version is more precise about this and lets you mix and match all kinds of options to get what you want.

I'm not sure if the default could be changed without breaking things for people in the current version. Maybe? I agree that the option existing (it's meant for Turtle) makes this confusing, but I'm hesitant to change it and potentially break people's existing scripts/workflows/whatever...

drobilla · 2022-01-06T21:44:11Z

For reference, this is how the new command-line tool interfaces look: https://drobilla.net/files/serd_man_pages/ where serd-pipe is the closest thing to serdi. So the default will be UTF-8 everywhere, but you can use -O ascii to ASCIIfy any syntax. This also lets you do nice things like write a "flat Turtle" file (like NTriples but with namespace prefixes), and so on.

joelduerksen · 2022-01-07T01:03:58Z

I understand and can sympathize with the need for backwards compatibility, but the (current) spec seemed clear on this question, or so I thought on first read.

Quote: "The content encoding of N-Triples is always UTF-8."
Reference: 6. Media Type and Content Encoding

That said, I have to say the spec seems to walk back that clear directive in section 6.1 (if the document is served as text/plain, it must be ASCII with escapes, etc.). I guess this gets into the nuances of "web document types" as opposed to files, so when working outside that framework it is left up to individual interpretation. sigh.

drobilla · 2022-01-07T17:26:54Z

ASCII is a subset of UTF-8. In other words, the output of serdi is UTF-8, and valid N-Triples.

It's not canonical RDF 1.1 N-Triples though, because escaping like this is not allowed there (see link in OP).
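To illustrate the distinction, here is a minimal Python sketch (not part of serd). It takes the subject IRI from the original report in both forms and checks that the escaped form is pure ASCII, and therefore also valid UTF-8, while the unescaped form needs multi-byte UTF-8 sequences. The variable names are made up for this example.

```python
# Subject IRI as written by serdi (escaped) and by riot (unescaped).
escaped = r"<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau>"
literal = "<http://dbpedia.org/resource/André_Éric_Létourneau>"

# The escaped form encodes as ASCII without error...
ascii_bytes = escaped.encode("ascii")
is_pure_ascii = all(b < 0x80 for b in ascii_bytes)

# ...and the very same bytes decode as UTF-8, because ASCII is a subset of it.
round_trip_ok = ascii_bytes.decode("utf-8") == escaped

# The unescaped form contains bytes >= 0x80, which is why `file` reports
# "UTF-8 Unicode text" for it rather than "ASCII text".
has_multibyte = any(b >= 0x80 for b in literal.encode("utf-8"))

print(is_pure_ascii, round_trip_ok, has_multibyte)
```

So both files are UTF-8 in the encoding sense; they differ only in whether non-ASCII characters are written directly or as \u escapes.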

plasticfist · 2022-01-07T18:58:03Z

Ok, I'll rephrase the ticket request: I would like a command-line tool that outputs canonical N-Triples (no escaped characters).
Whether you make it the default or not is up to you, as long as it is possible. I wouldn't mind adding --canonical to the command line if required. No worries here.
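As a possible stopgap until a tool does this natively, the escaped output could be post-processed. The following is a rough Python sketch, not part of serd and not a full canonicalizer: the function name `unescape_non_ascii` is hypothetical, and it only turns \uXXXX and \UXXXXXXXX escapes for non-ASCII code points back into literal characters. Escapes for ASCII code points are left alone on purpose, since decoding something like \u0022 inside a literal would break the syntax, so the result may still fall short of the other canonical rules.

```python
import re

# Match every backslash escape, so that an escaped backslash ("\\") is
# consumed as one unit and the "u" after it is never misread as a \u escape.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))")


def unescape_non_ascii(line: str) -> str:
    """Replace \\u/\\U escapes of non-ASCII code points with the characters."""

    def repl(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            return match.group(0)  # some other escape such as \n or \"
        code_point = int(hex_digits, 16)
        # Leave ASCII, surrogates, and out-of-range values untouched.
        if code_point < 0x80 or 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return _ESCAPE.sub(repl, line)


sample = (
    r"<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau> "
    r"<http://dbpedia.org/property/wikiPageUsesTemplate> "
    r"<http://dbpedia.org/resource/Template:Birth_date_and_age> ."
)
print(unescape_non_ascii(sample))
```

To use it as a filter, one would read the serdi output line by line, apply the function, and write the result encoded as UTF-8.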

drobilla · 2022-01-07T20:21:56Z

Sure, I was just responding to the above comment. If you want this right now, I suggest building the serd1 branch from git and using serd-pipe. My top priority is getting the new major version out, so there will probably not be any more non-trivial releases of 0.x.x.

I'll make a note to double-check the other canonical rules and make sure that the default output adheres to them, but I think it does.

plasticfist changed the title ~~UTF-8 characters in input are converted to \u code in the output (ntriples)~~ Output is ASCII? desire UTF-8 output (edited for clarity) Jan 3, 2022

drobilla changed the title ~~Output is ASCII? desire UTF-8 output (edited for clarity)~~ Write canonical NTriples 1.1 by default Jan 20, 2024
