Unicode tests fail in python3.3 when using a non-UTF-8 locale #344

Closed
kepstin opened this issue Nov 18, 2013 · 7 comments
@kepstin

kepstin commented Nov 18, 2013

For example, run

cd build/src; LC_ALL=C python3.3 ./run_tests.py

(Note: this is somewhat important for Linux packagers, because most Linux packaging runs in a "clean" environment with the POSIX locale set, and we'd like to run tests where possible.)

You will get several errors like:

ERROR: test.test_dawg.test_dawg((rdflib.term.URIRef('http://www.w3.org/2009/sparql/docs/tests/data-sparql11/functions/manifest#strlang03'), 'STRLANG() TypeErrors', None, 'file:///home/cwalton/Development/rdflib-4.0.1/build/src/test/DAWG/data-sparql11/functions/data.ttl', [], 'file:///home/cwalton/Development/rdflib-4.0.1/build/src/test/DAWG/data-sparql11/functions/strlang03.rq', rdflib.term.URIRef('file:///home/cwalton/Development/rdflib-4.0.1/build/src/test/DAWG/data-sparql11/functions/strlang03.srx'), True),)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python3.3/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/cwalton/Development/rdflib-4.0.1/build/src/test/test_dawg.py", line 412, in query_test
    res = Result.parse(open(resfile[7:]), format='xml')
  File "/home/cwalton/Development/rdflib-4.0.1/build/src/rdflib/query.py", line 197, in parse
    return parser.parse(source, **kwargs)
  File "/home/cwalton/Development/rdflib-4.0.1/build/src/rdflib/plugins/sparql/results/xmlresults.py", line 34, in parse
    return XMLResult(source)
  File "/home/cwalton/Development/rdflib-4.0.1/build/src/rdflib/plugins/sparql/results/xmlresults.py", line 40, in __init__
    xmlstring = source.read()
  File "/usr/lib/python3.3/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1097: ordinal not in range(128)
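
A minimal sketch of this failure mode, assuming the result file is UTF-8 encoded (the traceback does not say). The file content and temporary path are made up for illustration, and `encoding="ascii"` is passed explicitly to simulate what python3.3 picks as the default text encoding under `LC_ALL=C`:

```python
import io
import os
import tempfile

# Hypothetical stand-in for a result file with a non-ASCII character (assumed UTF-8).
expected = u"<literal>caf\u00e9</literal>"

fd, path = tempfile.mkstemp(suffix=".srx")
with os.fdopen(fd, "wb") as handle:
    handle.write(expected.encode("utf-8"))

# Simulate the C/POSIX locale default by asking for ASCII explicitly.
try:
    with io.open(path, encoding="ascii") as handle:
        handle.read()
    failed = False
except UnicodeDecodeError:
    failed = True

# Naming the encoding makes the read independent of the locale.
with io.open(path, encoding="utf-8") as handle:
    text = handle.read()

os.remove(path)
print(failed, text == expected)
```

`io.open` is used because it accepts `encoding` on both python2.7 and python3.3.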

The trivial fix of setting the character encoding on the open() statements in test_dawg.py allows the tests to pass in python3.3; but with the encoding set that way, one test in python2.7 fails with this output:

ERROR: test.test_dawg.test_dawg((rdflib.term.URIRef(u'http://raw.github.com/RDFLib/rdflib/master/test/DAWG/rdflib/manifest.ttl#unicode'), 'Unicode in SPARQL queries', None, 'file:///home/cwalton/Development/rdflib-4.0.1/test/DAWG/rdflib/unicode.ttl', [], 'file:///home/cwalton/Development/rdflib-4.0.1/test/DAWG/rdflib/unicode.rq', rdflib.term.URIRef(u'file:///home/cwalton/Development/rdflib-4.0.1/test/DAWG/rdflib/unicode.srx'), True),)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/cwalton/Development/rdflib-4.0.1/test/test_dawg.py", line 384, in query_test
    res2 = g.query(codecs.open(query[7:], encoding="utf-8").read(), base=urljoin(query, '.'))
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/graph.py", line 1045, in query
    query_object, initBindings, initNs, **kwargs))
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/plugins/sparql/processor.py", line 72, in query
    parsetree = parseQuery(strOrQuery)
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/plugins/sparql/parser.py", line 1034, in parseQuery
    return Query.parseString(q, parseAll=True)
  File "/usr/lib64/python2.7/site-packages/pyparsing.py", line 1031, in parseString
    loc, tokens = self._parse( instring, 0 )
<snip a bunch of pyparsing internals>
  File "/usr/lib64/python2.7/site-packages/pyparsing.py", line 695, in wrapper
    ret = func(*args[limit[0]:])
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/plugins/sparql/parser.py", line 300, in <lambda>
    lambda x: rdflib.Literal(decodeStringEscape(x[0][1:-1])))
  File "/home/cwalton/Development/rdflib-4.0.1/rdflib/py3compat.py", line 129, in decodeStringEscape
    return s.decode('string-escape')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
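
In the python2.7 traceback above, `s.decode('string-escape')` is applied to a unicode string, which python2 first encodes to ASCII implicitly. One possible way around that, sketched here in Python 3 syntax, is to resolve the escapes directly on the text. The function name and the escape table are assumptions for illustration (the table follows the SPARQL `ECHAR` set), not necessarily what the library does:

```python
import re

# Assumed escape table, based on the SPARQL ECHAR production: \t \b \n \r \f \" \' \\
_ESCAPES = {
    "t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
    '"': '"', "'": "'", "\\": "\\",
}


def decode_string_escape(text):
    """Resolve backslash escapes in a text string without any byte round-trip.

    Unknown escapes are left untouched.
    """
    return re.sub(
        r"\\(.)",
        lambda match: _ESCAPES.get(match.group(1), match.group(0)),
        text,
    )


print(decode_string_escape(u"caf\u00e9\\tbar"))
```

Because the substitution never leaves the text domain, non-ASCII characters pass through unchanged whatever the locale.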
@joernhees
Member

we all love encodings ;)

i'm somewhat sure that setting the encoding in open isn't correct for all rdf serialization formats, some are plain ascii based, some utf-8...

also are you sure that the test is wrong?
to me it rather seems to be a real bug that should be fixed in the lib not in the test.

will have a closer look when i find some time

@prologic

prologic commented Jan 9, 2014

This changeset 98fc6b3 actually breaks several of our Plone-built web applications that use RDFLib for our triple-store storage backend(s). See Plone traceback below:

2014-01-09 10:19:30 ERROR Zope.SiteErrorLog 1389226770.10.423529724865 http://localhost:8499/ccaih/repository/view
Traceback (innermost last):
  Module ZPublisher.Publish, line 138, in publish
  Module ZPublisher.mapply, line 77, in mapply
  Module ZPublisher.Publish, line 48, in call_object
  Module plone.autoform.view, line 49, in __call__
  Module plone.autoform.view, line 40, in render
  Module Products.Five.browser.pagetemplatefile, line 125, in __call__
  Module Products.Five.browser.pagetemplatefile, line 59, in __call__
  Module zope.pagetemplate.pagetemplate, line 132, in pt_render
   - Warning: Macro expansion failed
   - Warning: <type 'exceptions.KeyError'>: 'listing_macro'
  Module zope.pagetemplate.pagetemplate, line 240, in __call__
  Module zope.tal.talinterpreter, line 271, in __call__
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 888, in do_useMacro
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 533, in do_optTag_tal
  Module zope.tal.talinterpreter, line 518, in do_optTag
  Module zope.tal.talinterpreter, line 513, in no_tag
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 954, in do_defineSlot
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 533, in do_optTag_tal
  Module zope.tal.talinterpreter, line 518, in do_optTag
  Module zope.tal.talinterpreter, line 513, in no_tag
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 858, in do_defineMacro
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 954, in do_defineSlot
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 533, in do_optTag_tal
  Module zope.tal.talinterpreter, line 518, in do_optTag
  Module zope.tal.talinterpreter, line 513, in no_tag
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 954, in do_defineSlot
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 533, in do_optTag_tal
  Module zope.tal.talinterpreter, line 518, in do_optTag
  Module zope.tal.talinterpreter, line 513, in no_tag
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 852, in do_condition
  Module zope.tal.talinterpreter, line 343, in interpret
  Module zope.tal.talinterpreter, line 742, in do_insertStructure_tal
  Module Products.PageTemplates.Expressions, line 218, in evaluateStructure
  Module zope.tales.tales, line 696, in evaluate
   - URL: file:/Users/s2092651/work/ccaih/hub/buildout/eggs/plonetheme.sunburst-1.4.5-py2.7.egg/plonetheme/sunburst/skins/sunburst_templates/main_template.pt
   - Line 115, Column 29
   - Expression: <StringExpr u'plone.abovecontentbody'>
   - Names:
      {'args': (),
       'container': <RepositoryContainer at /ccaih/repository>,
       'context': <RepositoryContainer at /ccaih/repository>,
       'default': <object object at 0x10ab63bd0>,
       'here': <RepositoryContainer at /ccaih/repository>,
       'loop': {},
       'nothing': None,
       'options': {},
       'repeat': <Products.PageTemplates.Expressions.SafeMapping object at 0x10b18b6d8>,
       'request': <HTTPRequest, URL=http://localhost:8499/ccaih/repository/view>,
       'root': <Application at >,
       'template': <Products.Five.browser.pagetemplatefile.ViewPageTemplateFile object at 0x1100b0a90>,
       'traverse_subpath': [],
       'user': <SpecialUser 'Anonymous User'>,
       'view': <Products.Five.metaclass.SimpleViewClass from /Users/s2092651/work/ccaih/hub/buildout/eggs/plone.app.dexterity-2.0.9-py2.7.egg/plone/app/dexterity/browser/container.pt object at 0x110388bd0>,
       'views': <Products.Five.browser.pagetemplatefile.ViewMapper object at 0x1102e8650>}
  Module zope.contentprovider.tales, line 77, in __call__
  Module zope.viewlet.manager, line 112, in update
  Module zope.viewlet.manager, line 118, in _updateViewlets
  Module gu.plone.rdf.browser.viewlet, line 47, in update
  Module gu.plone.rdf.browser.viewlet, line 30, in update
  Module plone.z3cform.fieldsets.extensible, line 59, in update
  Module plone.z3cform.patch, line 30, in GroupForm_update
  Module z3c.form.group, line 137, in update
  Module z3c.form.group, line 49, in update
  Module z3c.form.group, line 45, in updateWidgets
  Module z3c.form.field, line 277, in update
  Module z3c.formwidget.query.widget, line 108, in update
  Module z3c.formwidget.query.widget, line 95, in bound_source
  Module z3c.formwidget.query.widget, line 227, in source
  Module zope.schema._field, line 471, in bind
  Module zope.schema._field, line 352, in bind
  Module Zope2.App.schema, line 33, in get
  Module gu.z3cform.rdf.vocabulary, line 130, in __call__
  Module ordf.handler, line 387, in query
  Module ordf.handler.httpfourstore, line 144, in query
  Module rdflib.query, line 197, in parse
  Module rdflib.plugins.sparql.results.tsvresults, line 61, in parse
  Module pyparsing, line 996, in parseString
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2342, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2708, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2342, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2451, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2596, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2326, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2596, in parseImpl
  Module pyparsing, line 871, in _parseNoCache
  Module pyparsing, line 2451, in parseImpl
  Module pyparsing, line 897, in _parseNoCache
  Module pyparsing, line 660, in wrapper
  Module rdflib.plugins.sparql.parser, line 315, in <lambda>
  Module rdflib.py3compat, line 170, in decodeUnicodeEscape
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 10: ordinal not in range(128)

I've bisected this down to the changeset mentioned above but haven't had the time yet to investigate what's going wrong.

@gweis
Member

gweis commented Jan 9, 2014

I think this solves only part of the issue.
What about parsing from file-like objects like urllib2 responses?
There is no mode you can set on them, therefore the current fix assumes it is encoded in 'ascii'.

Other options like reading in the response and converting it manually, or re-wrapping it into something with a mode attribute properly set (or even setting mode on a file like object) are somewhat suboptimal.

IMO the default should assume 'utf-8' encoding if not already unicode or stated otherwise.

@gromgull gromgull reopened this Jan 9, 2014
@gromgull
Member

gromgull commented Jan 9, 2014

Actually, after reading Armin Ronacher's guide to unicode wrangling in py2/3, I think a better way is possible: rather than checking the encoding attribute, you can read a 0-length string from the stream, check whether you get str/bytes or unicode back, and then encode/decode as needed.
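
The zero-length read could look roughly like this. It is only a sketch: `read_as_text` is a hypothetical helper, and the UTF-8 default is an assumption taken from the suggestion earlier in this thread:

```python
import io


def read_as_text(stream, encoding="utf-8"):
    """Return the stream's content as text, decoding only if it yields bytes."""
    probe = stream.read(0)  # consumes nothing; only reveals the type
    data = stream.read()
    if isinstance(probe, bytes):
        return data.decode(encoding)
    return data


from_bytes = read_as_text(io.BytesIO(u"caf\u00e9".encode("utf-8")))
from_text = read_as_text(io.StringIO(u"caf\u00e9"))
print(from_bytes == from_text)
```

This works for any file-like object with a `read` method, including ones that have no `mode` or `encoding` attribute.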

@gromgull
Member

gromgull commented Jan 9, 2014

@gweis
Member

gweis commented Jan 9, 2014

Interesting article. Thanks for sharing.

After reading it, I think the py3 model only works if you really separate strings and bytes. Strings are something decoded and they just work, and bytes are just bytes but can be decoded to strings if required. I don't have enough experience with py3 but so far I believe this model can work nicely, and developers have to be explicit when to en/decode strings and bytes, which should help a lot to avoid UnicodeDecodeErrors. (yes it is annoying to do it manually, but I don't see another way around, and it kinda follows the principle: rather explicit than implicit).

Maybe it would be best to require that the input data is unicode or the stream has to produce unicode. If it's not unicode, then the library uses a default decoder, which could be either system default, utf-8 or even mimetype dependent if that information is available. The user of the library can easily wrap a decoder around the input data (shouldn't be more than one line I guess). This would make the parser maybe a bit simpler, and follows the py3 idea of being explicit with bytes and strings. (May well be that I have missed some use-cases here).
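
For illustration, wrapping a decoder around a byte stream on the caller's side can indeed be a one-liner with the standard library. The `BytesIO` object below is just a stand-in for something like an HTTP response body, and UTF-8 is assumed:

```python
import codecs
import io

# Hypothetical byte source; a real caller would pass its own file-like object.
raw = io.BytesIO(u"caf\u00e9".encode("utf-8"))

# The one-line wrap: reads on text_stream now return decoded text.
text_stream = codecs.getreader("utf-8")(raw)

text = text_stream.read()
print(text == u"caf\u00e9")
```

`codecs.getreader` exists on both py2 and py3, so the same wrap works on either side of the str/bytes split.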

mamash pushed a commit to TritonDataCenter/pkgsrc-wip that referenced this issue Feb 15, 2014
	2013/12/31 RELEASE 4.1
======================

This is a new minor version of RDFLib, which includes a handful of new features:

* A TriG parser was added (we already had a serializer) - it is
  up-to-date wrt. the newest spec from: http://www.w3.org/TR/trig/

* The Turtle parser was made up to date wrt. the latest Turtle spec.

* Many more tests have been added - RDFLib now has over 2000
  (passing!) tests. This is mainly thanks to the NT, Turtle, TriG,
  NQuads and SPARQL test-suites from W3C. This also included many
  fixes to the nt and nquad parsers.

* ```ConjunctiveGraph``` and ```Dataset``` now support directly adding/removing
  quads with ```add/addN/remove``` methods.

* ```rdfpipe``` command now supports datasets, and reading/writing context
  sensitive formats.

* Optional graph-tracking was added to the Store interface, allowing
  empty graphs to be tracked for Datasets. The DataSet class also saw
  a general clean-up, see: RDFLib/rdflib#309

* After long deprecation, ```BackwardCompatibleGraph``` was removed.

Minor enhancements/bugs fixed:
------------------------------

* Many code samples in the documentation were fixed thanks to @PuckCh

* The new ```IOMemory``` store was optimised a bit

* ```SPARQL(Update)Store``` has been made more generic.

* MD5 sums were never reinitialized in ```rdflib.compare```

* Correct default value for empty prefix in N3
  [#312]RDFLib/rdflib#312

* Fixed tests when running in a non UTF-8 locale
  [#344]RDFLib/rdflib#344

* Prefix in the original turtle have an impact on SPARQL query
  resolution
  [#313]RDFLib/rdflib#313

* Duplicate BNode IDs from N3 Parser
  [#305]RDFLib/rdflib#305

* Use QNames for TriG graph names
  [#330]RDFLib/rdflib#330

* \uXXXX escapes in Turtle/N3 were fixed
  [#335]RDFLib/rdflib#335

* A way to limit the number of triples retrieved from the
  ```SPARQLStore``` was added
  [#346]RDFLib/rdflib#346

* Dots in localnames in Turtle
  [#345]RDFLib/rdflib#345
  [#336]RDFLib/rdflib#336

* ```BNode``` as Graph's public ID
  [#300]RDFLib/rdflib#300

* Introduced ordering of ```QuotedGraphs```
  [#291]RDFLib/rdflib#291

2013/05/22 RELEASE 4.0.1
========================

Following RDFLib tradition, some bugs snuck into the 4.0 release.
This is a bug-fixing release:

* the new URI validation caused lots of problems, but is
  necessary to avoid "RDF injection" vulnerabilities. In the
  spirit of "be liberal in what you accept, but conservative in
  what you produce", we moved validation to serialisation time.

* the ```rdflib.tools``` package was missing from the
  ```setup.py``` script, and was therefore not included in the
  PyPI tarballs.

* RDF parser choked on empty namespace URI
  [#288](RDFLib/rdflib#288)

* Parsing from ```sys.stdin``` was broken
  [#285](RDFLib/rdflib#285)

* The new IO store had problems with concurrent modifications if
  several graphs used the same store
  [#286](RDFLib/rdflib#286)

* Moved HTML5Lib dependency to the recently released 1.0b1 which
  supports python3

2013/05/16 RELEASE 4.0
======================

This release includes several major changes:

* The new SPARQL 1.1 engine (rdflib-sparql) has been included in
  the core distribution. SPARQL 1.1 queries and updates should
  work out of the box.

  * SPARQL paths are exposed as operators on ```URIRefs```, these can
    then be used with graph.triples and friends:

    ```py
    # List names of friends of Bob:
    g.triples(( bob, FOAF.knows/FOAF.name , None ))

    # All super-classes:
    g.triples(( cls, RDFS.subClassOf * '+', None ))
    ```

  * a new ```graph.update``` method will apply SPARQL update statements

* Several RDF 1.1 features are available:
  * A new ```DataSet``` class
  * ```XMLLiteral``` and ```HTMLLiterals```
  * ```BNode``` (de)skolemization is supported through ```BNode.skolemize```,
    ```URIRef.de_skolemize```, ```Graph.skolemize``` and ```Graph.de_skolemize```

* Handling of Literal equality was split into lexical comparison
  (for normal ```==``` operator) and value space (using new ```Node.eq```
  methods). This introduces some slight backwards incompatible
  changes, but was necessary, as the old version had
  inconsistent hash and equality methods that could lead to
  literals not working correctly in dicts/sets.
  The new way is more in line with how SPARQL 1.1 works.
  For the full details, see:

  https://github.com/RDFLib/rdflib/wiki/Literal-reworking

* Iterating over ```QueryResults``` will generate ```ResultRow``` objects,
  these allow access to variable bindings as attributes or as a
  dict. I.e.

  ```py
  for row in graph.query('select ... ') :
     print row.age, row["name"]
  ```

* "Slicing" of Graphs and Resources as syntactic sugar:
  ([#271](RDFLib/rdflib#271))

  ```py
  # generator over the names of Bob's friends
  graph[bob : FOAF.knows/FOAF.name]
  ```

* The ```SPARQLStore``` and ```SPARQLUpdateStore``` are now included
  in the RDFLib core

* The documentation has been given a major overhaul, and examples
  for most features have been added.


Minor Changes:
--------------

* String operations on URIRefs return new URIRefs: ([#258](RDFLib/rdflib#258))
  ```py
  >>> URIRef('http://example.org/')+'test'
  rdflib.term.URIRef('http://example.org/test')
  ```

* Parser/Serializer plugins are also found by mime-type, not just
  by plugin name:  ([#277](RDFLib/rdflib#277))
* ```Namespace``` is no longer a subclass of ```URIRef```
* URIRefs and Literal language tags are validated on construction,
  avoiding some "RDF-injection" issues ([#266](RDFLib/rdflib#266))
* A new memory store needs much less memory when loading large
  graphs ([#268](RDFLib/rdflib#268))
* Turtle/N3 serializer now supports the base keyword correctly ([#248](RDFLib/rdflib#248))
* py2exe support was fixed ([#257](RDFLib/rdflib#257))
* Several bugs in the TriG serializer were fixed
* Several bugs in the NQuads parser were fixed
@gweis
Member

gweis commented Mar 4, 2014

squashed into commit b7fa8d6 (also see issue #367)
