Unicode tests fail in python3.3 when using a non-UTF-8 locale #344

kepstin · 2013-11-18T21:03:33Z

For example, run

cd build/src; LC_ALL=C python3.3 ./run_tests.py

(Note: this is somewhat important for Linux packagers, because most Linux packaging runs in a "clean" environment with the POSIX locale set, and we'd like to run tests where possible.)

You will get several errors like:

ERROR: test.test_dawg.test_dawg((rdflib.term.URIRef('http://www.w3.org/2009/sparql/docs/tests/data-sparql11/functions/manifest#strlang03'), 'STRLANG() TypeErrors', None, 'file:///home/cwalton/Development/rdflib-4.0.1/build/src/test/DAWG/data-sparql11/functions/data.ttl', [], 'file:///home/cwalton/Development/rdflib-4.0.1/build/src/test/DAWG/data-sparql11/functions/strlang03.rq', rdflib.term.URIRef('file:///home/cwalton/Development/rdflib-4.0.1/build/src/test/DAWG/data-sparql11/functions/strlang03.srx'), True),)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python3.3/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/cwalton/Development/rdflib-4.0.1/build/src/test/test_dawg.py", line 412, in query_test
    res = Result.parse(open(resfile[7:]), format='xml')
  File "/home/cwalton/Development/rdflib-4.0.1/build/src/rdflib/query.py", line 197, in parse
    return parser.parse(source, **kwargs)
  File "/home/cwalton/Development/rdflib-4.0.1/build/src/rdflib/plugins/sparql/results/xmlresults.py", line 34, in parse
    return XMLResult(source)
  File "/home/cwalton/Development/rdflib-4.0.1/build/src/rdflib/plugins/sparql/results/xmlresults.py", line 40, in __init__
    xmlstring = source.read()
  File "/usr/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1097: ordinal not in range(128)
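
The failure in the traceback above can be reproduced in isolation. This is a minimal sketch, not the rdflib test code: the file content is a made-up stand-in for a `.srx` result file, and passing `encoding="ascii"` is an assumption used to mimic what a bare `open()` does on python3.3 when the locale is C/POSIX.

```python
import os
import tempfile

# Hypothetical stand-in for a result file that contains a non-ASCII character.
payload = "<literal>caf\u00e9</literal>".encode("utf-8")

with tempfile.NamedTemporaryFile(suffix=".srx", delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Forcing ASCII mimics open(path) under LC_ALL=C on python3.3.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # An explicit encoding makes the read independent of the locale.
    with open(path, encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)
```

With an explicit encoding the read no longer depends on the environment the tests happen to run in.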

The trivial fix of setting the character encoding on the open() calls in test_dawg.py allows the tests to pass in python3.3; but if you set it in a way that also takes effect in python2.7 (for example with codecs.open, as in the traceback below), one test in python2.7 will fail with this output:

ERROR: test.test_dawg.test_dawg((rdflib.term.URIRef(u'http://raw.github.com/RDFLib/rdflib/master/test/DAWG/rdflib/manifest.ttl#unicode'), 'Unicode in SPARQL queries', None, 'file:///home/cwalton/Development/rdflib-4.0.1/test/DAWG/rdflib/unicode.ttl', [], 'file:///home/cwalton/Development/rdflib-4.0.1/test/DAWG/rdflib/unicode.rq', rdflib.term.URIRef(u'file:///home/cwalton/Development/rdflib-4.0.1/test/DAWG/rdflib/unicode.srx'), True),)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/cwalton/Development/rdflib-4.0.1/test/test_dawg.py", line 384, in query_test
    res2 = g.query(codecs.open(query[7:], encoding="utf-8").read(), base=urljoin(query, '.'))
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/graph.py", line 1045, in query
    query_object, initBindings, initNs, **kwargs))
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/plugins/sparql/processor.py", line 72, in query
    parsetree = parseQuery(strOrQuery)
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/plugins/sparql/parser.py", line 1034, in parseQuery
    return Query.parseString(q, parseAll=True)
  File "/usr/lib64/python2.7/site-packages/pyparsing.py", line 1031, in parseString
    loc, tokens = self._parse( instring, 0 )
<snip a bunch of pyparsing internals>
  File "/usr/lib64/python2.7/site-packages/pyparsing.py", line 695, in wrapper
    ret = func(*args[limit[0]:])
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/plugins/sparql/parser.py", line 300, in <lambda>
    lambda x: rdflib.Literal(decodeStringEscape(x[0][1:-1])))
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/py3compat.py", line 129, in decodeStringEscape
    return s.decode('string-escape')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
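
The python2.7 traceback above ends in `s.decode('string-escape')` being called on a unicode object, which in python2 first triggers an implicit ASCII encode and therefore fails on non-ASCII text. One possible way around that, sketched here under the assumption that the input is already decoded text, is to resolve the backslash escapes directly on the text string. `unescape_text` and `_ESCAPES` are hypothetical names, not rdflib's code, and the table only covers the simple single-character escapes.

```python
import re

# Hypothetical escape table: single-character backslash escapes only.
_ESCAPES = {
    "t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f",
    '"': '"', "'": "'", "\\": "\\",
}


def unescape_text(s):
    """Replace backslash escapes in an already-decoded text string.

    Unknown escapes are left untouched. No encode/decode round trip is
    involved, so non-ASCII characters pass through unchanged.
    """
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(0)), s)
```

For example, `unescape_text("caf\u00e9\\n")` turns the two-character sequence backslash-n into a newline and leaves the `é` alone.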

joernhees · 2013-11-19T09:02:01Z

we all love encodings ;)

i'm somewhat sure that setting the encoding in open isn't correct for all rdf serialization formats, some are plain ascii based, some utf-8...

also are you sure that the test is wrong?
to me it rather seems to be a real bug that should be fixed in the lib not in the test.

will have a closer look when i find some time

prologic · 2014-01-09T00:36:58Z

This changeset 98fc6b3 actually breaks several of our Plone-built web applications that use RDFLib for our triple-store storage backend(s). See Plone traceback below:

2014-01-09 10:19:30 ERROR Zope.SiteErrorLog 1389226770.10.423529724865 http://localhost:8499/ccaih/repository/view
Traceback (innermost last):
  Module ZPublisher.Publish, line 138, in publish
  Module ZPublisher.mapply, line 77, in mapply
  Module ZPublisher.Publish, line 48, in call_object
  Module plone.autoform.view, line 49, in __call__
  Module plone.autoform.view, line 40, in render
  Module Products.Five.browser.pagetemplatefile, line 125, in __call__
  Module Products.Five.browser.pagetemplatefile, line 59, in __call__
  Module zope.pagetemplate.pagetemplate, line 132, in pt_render
   - Warning: Macro expansion failed
   - Warning: <type 'exceptions.KeyError'>: 'listing_macro'
  Module zope.pagetemplate.pagetemplate, line 240, in __call__
  Module zope.tal.talinterpreter, line 271, in __call__
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 888, in do_useMacro
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 533, in do_optTag_tal
  Module zope.tal.talinterpreter, line 518, in do_optTag
  Module zope.tal.talinterpreter, line 513, in no_tag
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 954, in do_defineSlot
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 533, in do_optTag_tal
  Module zope.tal.talinterpreter, line 518, in do_optTag
  Module zope.tal.talinterpreter, line 513, in no_tag
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 858, in do_defineMacro
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 954, in do_defineSlot
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 533, in do_optTag_tal
  Module zope.tal.talinterpreter, line 518, in do_optTag
  Module zope.tal.talinterpreter, line 513, in no_tag
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 954, in do_defineSlot
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 533, in do_optTag_tal
  Module zope.tal.talinterpreter, line 518, in do_optTag
  Module zope.tal.talinterpreter, line 513, in no_tag
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 852, in do_condition
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 742, in do_insertStructure_tal
  Module Products.PageTemplates.Expressions, line 218, in evaluateStructure
  Module zope.tales.tales, line 696, in evaluate
   - URL: file:/Users/s2092651/work/ccaih/hub/buildout/eggs/plonetheme.sunburst-1.4.5-py2.7.egg/plonetheme/sunburst/skins/sunburst_templates/main_template.pt
   - Line 115, Column 29
   - Expression: <StringExpr u'plone.abovecontentbody'>
   - Names:
      {'args': (),
       'container': <RepositoryContainer at /ccaih/repository>,
       'context': <RepositoryContainer at /ccaih/repository>,
       'default': <object object at 0x10ab63bd0>,
       'here': <RepositoryContainer at /ccaih/repository>,
       'loop': {},
       'nothing': None,
       'options': {},
       'repeat': <Products.PageTemplates.Expressions.SafeMapping object at 0x10b18b6d8>,
       'request': <HTTPRequest, URL=http://localhost:8499/ccaih/repository/view>,
       'root': <Application at >,
       'template': <Products.Five.browser.pagetemplatefile.ViewPageTemplateFile object at 0x1100b0a90>,
       'traverse_subpath': [],
       'user': <SpecialUser 'Anonymous User'>,
       'view': <Products.Five.metaclass.SimpleViewClass from /Users/s2092651/work/ccaih/hub/buildout/eggs/plone.app.dexterity-2.0.9-py2.7.egg/plone/app/dexterity/browser/container.pt object at 0x110388bd0>,
       'views': <Products.Five.browser.pagetemplatefile.ViewMapper object at 0x1102e8650>}
  Module zope.contentprovider.tales, line 77, in __call__
  Module zope.viewlet.manager, line 112, in update
  Module zope.viewlet.manager, line 118, in _updateViewlets
  Module gu.plone.rdf.browser.viewlet, line 47, in update
  Module gu.plone.rdf.browser.viewlet, line 30, in update
  Module plone.z3cform.fieldsets.extensible, line 59, in update
  Module plone.z3cform.patch, line 30, in GroupForm_update
  Module z3c.form.group, line 137, in update
  Module z3c.form.group, line 49, in update
  Module z3c.form.group, line 45, in updateWidgets
  Module z3c.form.field, line 277, in update
  Module z3c.formwidget.query.widget, line 108, in update
  Module z3c.formwidget.query.widget, line 95, in bound_source
  Module z3c.formwidget.query.widget, line 227, in source
  Module zope.schema._field, line 471, in bind
  Module zope.schema._field, line 352, in bind
  Module Zope2.App.schema, line 33, in get
  Module gu.z3cform.rdf.vocabulary, line 130, in __call__
  Module ordf.handler, line 387, in query
  Module ordf.handler.httpfourstore, line 144, in query
  Module rdflib.query, line 197, in parse
  Module rdflib.plugins.sparql.results.tsvresults, line 61, in parse
  Module pyparsing, line 996, in parseString
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2342, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2708, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2342, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2451, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2596, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2326, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2596, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2451, in parseImpl
  Module pyparsing, line 897, in _parseNoCache
  Module pyparsing, line 660, in wrapper
  Module rdflib.plugins.sparql.parser, line 315, in <lambda>
  Module rdflib.py3compat, line 170, in decodeUnicodeEscape
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

I've bisected this down to the changeset mentioned above but haven't had the time yet to investigate what's going wrong.

gweis · 2014-01-09T01:30:14Z

I think this solves only part of the issue.
What about parsing from file-like objects like urllib2 responses?
There is no mode you can set on them, therefore the current fix assumes it is encoded in 'ascii'.

Other options like reading in the response and converting it manually, or re-wrapping it into something with a mode attribute properly set (or even setting mode on a file like object) are somewhat suboptimal.

IMO the default should assume 'utf-8' encoding if not already unicode or stated otherwise.
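
A minimal sketch of that default, assuming a source object that only offers `read()`: use the source's own `encoding` attribute if it has one, otherwise fall back to UTF-8, and pass already-decoded text through untouched. `read_text` and `default_encoding` are illustrative names, not part of rdflib.

```python
import io


def read_text(source, default_encoding="utf-8"):
    """Read everything from a file-like object and return text.

    Bytes are decoded with the source's ``encoding`` attribute when it is
    set, otherwise with ``default_encoding``. Text is returned as-is.
    """
    data = source.read()
    if isinstance(data, bytes):
        encoding = getattr(source, "encoding", None) or default_encoding
        return data.decode(encoding)
    return data


# A bytes stream without any encoding information (like an HTTP response body).
from_bytes = read_text(io.BytesIO("\u00c4rger".encode("utf-8")))
# A stream that already yields text.
from_text = read_text(io.StringIO("\u00c4rger"))
```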

gromgull · 2014-01-09T08:06:45Z

Actually, after doing this I read Armin Ronacher's guide to unicode wrangling in py2/3, and I think a better way is possible: rather than checking the encoding attribute, you can read a 0-length string from the stream, check whether you get str/bytes or unicode back, and then decode as needed.
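
Roughly, the zero-length read trick could look like this. It is only a sketch: `is_binary_stream` and `read_as_text` are made-up helper names, and UTF-8 as the fallback encoding is an assumption.

```python
import io


def is_binary_stream(stream):
    """Probe the stream type with a zero-length read, which consumes no data."""
    return isinstance(stream.read(0), bytes)


def read_as_text(stream, encoding="utf-8"):
    """Return the stream content as text, decoding only if it yields bytes."""
    if is_binary_stream(stream):
        return stream.read().decode(encoding)
    return stream.read()


binary = io.BytesIO("caf\u00e9".encode("utf-8"))
textual = io.StringIO("caf\u00e9")
```

The probe works on any object with a `read()` method, so it does not depend on a `mode` or `encoding` attribute being present.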

gromgull · 2014-01-09T08:06:59Z

http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

gweis · 2014-01-09T13:16:22Z

Interesting article. Thanks for sharing.

After reading it, I think the py3 model only works if you really separate strings and bytes. Strings are something decoded and they just work, and bytes are just bytes but can be decoded to strings if required. I don't have enough experience with py3 but so far I believe this model can work nicely, and developers have to be explicit when to en/decode strings and bytes, which should help a lot to avoid UnicodeDecodeErrors. (yes it is annoying to do it manually, but I don't see another way around, and it kinda follows the principle: rather explicit than implicit).

Maybe it would be best to require that the input data is unicode or the stream has to produce unicode. If it's not unicode, then the library uses a default decoder, which could be either system default, utf-8 or even mimetype dependent if that information is available. The user of the library can easily wrap a decoder around the input data (shouldn't be more than one line I guess). This would make the parser maybe a bit simpler, and follows the py3 idea of being explicit with bytes and strings. (May well be that I have missed some use-cases here).
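
For illustration, wrapping a decoder around a byte stream really is about one line with the standard library. The payload below is made up and the UTF-8 choice is an assumption; `codecs.getreader` only needs the wrapped object to provide `read()`.

```python
import codecs
import io

# Hypothetical raw response body, as bytes.
raw = io.BytesIO("name\tcaf\u00e9\n".encode("utf-8"))

# The one-line wrapper: the caller states the encoding explicitly.
text_stream = codecs.getreader("utf-8")(raw)

content = text_stream.read()
```

A parser that requires text input could then be handed `text_stream` (or `content`) without having to guess anything about the encoding.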

2013/12/31 RELEASE 4.1
======================

This is a new minor version of RDFLib, which includes a handful of new features:

* A TriG parser was added (we already had a serializer) - it is up-to-date with respect to the newest spec from: http://www.w3.org/TR/trig/
* The Turtle parser was made up to date with respect to the latest Turtle spec.
* Many more tests have been added - RDFLib now has over 2000 (passing!) tests. This is mainly thanks to the NT, Turtle, TriG, NQuads and SPARQL test-suites from W3C. This also included many fixes to the nt and nquad parsers.
* ```ConjunctiveGraph``` and ```Dataset``` now support directly adding/removing quads with ```add/addN/remove``` methods.
* ```rdfpipe``` command now supports datasets, and reading/writing context sensitive formats.
* Optional graph-tracking was added to the Store interface, allowing empty graphs to be tracked for Datasets. The DataSet class also saw a general clean-up, see: RDFLib/rdflib#309
* After long deprecation, ```BackwardCompatibleGraph``` was removed.

Minor enhancements/bugs fixed:
------------------------------

* Many code samples in the documentation were fixed thanks to @PuckCh
* The new ```IOMemory``` store was optimised a bit
* ```SPARQL(Update)Store``` has been made more generic.
* MD5 sums were never reinitialized in ```rdflib.compare```
* Correct default value for empty prefix in N3 [#312](RDFLib/rdflib#312)
* Fixed tests when running in a non UTF-8 locale [#344](RDFLib/rdflib#344)
* Prefix in the original turtle have an impact on SPARQL query resolution [#313](RDFLib/rdflib#313)
* Duplicate BNode IDs from N3 Parser [#305](RDFLib/rdflib#305)
* Use QNames for TriG graph names [#330](RDFLib/rdflib#330)
* \uXXXX escapes in Turtle/N3 were fixed [#335](RDFLib/rdflib#335)
* A way to limit the number of triples retrieved from the ```SPARQLStore``` was added [#346](RDFLib/rdflib#346)
* Dots in localnames in Turtle [#345](RDFLib/rdflib#345) [#336](RDFLib/rdflib#336)
* ```BNode``` as Graph's public ID [#300](RDFLib/rdflib#300)
* Introduced ordering of ```QuotedGraphs``` [#291](RDFLib/rdflib#291)

2013/05/22 RELEASE 4.0.1
========================

Following RDFLib tradition, some bugs snuck into the 4.0 release. This is a bug-fixing release:

* the new URI validation caused lots of problems, but is necessary to avoid "RDF injection" vulnerabilities. In the spirit of "be liberal in what you accept, but conservative in what you produce", we moved validation to serialisation time.
* the ```rdflib.tools``` package was missing from the ```setup.py``` script, and was therefore not included in the PYPI tarballs.
* RDF parser choked on empty namespace URI [#288](RDFLib/rdflib#288)
* Parsing from ```sys.stdin``` was broken [#285](RDFLib/rdflib#285)
* The new IO store had problems with concurrent modifications if several graphs used the same store [#286](RDFLib/rdflib#286)
* Moved HTML5Lib dependency to the recently released 1.0b1 which supports python3

2013/05/16 RELEASE 4.0
======================

This release includes several major changes:

* The new SPARQL 1.1 engine (rdflib-sparql) has been included in the core distribution. SPARQL 1.1 queries and updates should work out of the box.
* SPARQL paths are exposed as operators on ```URIRefs```, these can then be used with graph.triples and friends:

  ```py
  # List names of friends of Bob:
  g.triples(( bob, FOAF.knows/FOAF.name , None ))

  # All super-classes:
  g.triples(( cls, RDFS.subClassOf * '+', None ))
  ```

* a new ```graph.update``` method will apply SPARQL update statements
* Several RDF 1.1 features are available:
  * A new ```DataSet``` class
  * ```XMLLiteral``` and ```HTMLLiterals```
  * ```BNode``` (de)skolemization is supported through ```BNode.skolemize```, ```URIRef.de_skolemize```, ```Graph.skolemize``` and ```Graph.de_skolemize```
* Handling of Literal equality was split into lexical comparison (for normal ```==``` operator) and value space (using new ```Node.eq``` methods). This introduces some slight backwards incompatible changes, but was necessary, as the old version had inconsistent hash and equality methods that could lead to literals not working correctly in dicts/sets. The new way is more in line with how SPARQL 1.1 works. For the full details, see: https://github.com/RDFLib/rdflib/wiki/Literal-reworking
* Iterating over ```QueryResults``` will generate ```ResultRow``` objects, these allow access to variable bindings as attributes or as a dict. I.e.

  ```py
  for row in graph.query('select ... ') :
      print row.age, row["name"]
  ```

* "Slicing" of Graphs and Resources as syntactic sugar: ([#271](RDFLib/rdflib#271))

  ```py
  graph[bob : FOAF.knows/FOAF.name]
            -> generator over the names of Bobs friends
  ```

* The ```SPARQLStore``` and ```SPARQLUpdateStore``` are now included in the RDFLib core
* The documentation has been given a major overhaul, and examples for most features have been added.

Minor Changes:
--------------

* String operations on URIRefs return new URIRefs: ([#258](RDFLib/rdflib#258))

  ```py
  >>> URIRef('http://example.org/')+'test'
  rdflib.term.URIRef('http://example.org/test')
  ```

* Parser/Serializer plugins are also found by mime-type, not just by plugin name: ([#277](RDFLib/rdflib#277))
* ```Namespace``` is no longer a subclass of ```URIRef```
* URIRefs and Literal language tags are validated on construction, avoiding some "RDF-injection" issues ([#266](RDFLib/rdflib#266))
* A new memory store needs much less memory when loading large graphs ([#268](RDFLib/rdflib#268))
* Turtle/N3 serializer now supports the base keyword correctly ([#248](RDFLib/rdflib#248))
* py2exe support was fixed ([#257](RDFLib/rdflib#257))
* Several bugs in the TriG serializer were fixed
* Several bugs in the NQuads parser were fixed

gweis · 2014-03-04T08:25:07Z

squashed into commit b7fa8d6 (also see issue #367)

gromgull closed this as completed in 98fc6b3 Dec 30, 2013

gromgull reopened this Jan 9, 2014

This was referenced Mar 3, 2014

if query result source has no encoding set, fall back to utf-8 encoding. #366

Closed

if query result source has no encoding set, fall back to utf-8 encoding. #367

Closed

gweis closed this as completed Mar 4, 2014

pyup-bot mentioned this issue Nov 8, 2016

Update rdflib to 4.2.1 mytardis/mytardis#733

Closed

This was referenced Jan 16, 2017

Initial Update mozilla/addons-server#4303

Closed

Update rdflib to 4.2.1 mozilla/addons-server#4390

Closed

pyup-bot mentioned this issue Jan 29, 2017

Update rdflib to 4.2.2 mytardis/mytardis#815

Merged

This was referenced Mar 16, 2017

Initial Update mozilla/amo-validator#510

Closed

Update rdflib to 4.2.2 mozilla/amo-validator#515

Closed
