Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding errors on escaped Unicode string in test for narrow build python #453

Closed
wants to merge 1 commit into from

Conversation

ymph
Copy link
Contributor

@ymph ymph commented Feb 18, 2015

Running the tests on a narrow build of python 2.7 (for example OSX python version) reveal a list of unicode decoding errors on unicode escaped string such as:

======================================================================
ERROR: test.test_trig_w3c.test_trig(RDFTest(uri=rdflib.term.URIRef(u'https://dvcs.w3.org/hg/rdf/raw-file/default/trig/tests/manifest.ttl#localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries'), name=u'localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries', comment=u'localName with assigned, NFC-normalized PN CHARS BASE character boundaries (p:AZaz\xc0\xd6\xd8\xf6\xf8...)', data=None, graphdata=None, action=u'file:///Users/ymh/dev/projects/rdflib/test/w3c/trig/localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries.trig', result=rdflib.term.URIRef(u'file:///Users/ymh/dev/projects/rdflib/test/w3c/trig/localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries.nq'), syntax=True),)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/ymh/dev/venvs/rdflib/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/ymh/dev/projects/rdflib/test/test_trig_w3c.py", line 26, in trig
    res.parse(test.result, format='nquads')
  File "/Users/ymh/dev/projects/rdflib/rdflib/graph.py", line 1460, in parse
    location=location, file=file, data=data, **args)
  File "/Users/ymh/dev/projects/rdflib/rdflib/graph.py", line 1033, in parse
    parser.parse(source, self, **args)
  File "/Users/ymh/dev/projects/rdflib/rdflib/plugins/parsers/nquads.py", line 64, in parse
    self.parseline()
  File "/Users/ymh/dev/projects/rdflib/rdflib/plugins/parsers/nquads.py", line 81, in parseline
    obj = self.object()
  File "/Users/ymh/dev/projects/rdflib/rdflib/plugins/parsers/ntriples.py", line 228, in object
    objt = self.uriref() or self.nodeid() or self.literal()
  File "/Users/ymh/dev/projects/rdflib/rdflib/plugins/parsers/ntriples.py", line 236, in uriref
    uri = unquote(uri)
  File "/Users/ymh/dev/projects/rdflib/rdflib/plugins/parsers/ntriples.py", line 63, in unquote
    s = decodeUnicodeEscape(s)
  File "/Users/ymh/dev/projects/rdflib/rdflib/py3compat.py", line 171, in decodeUnicodeEscape
    s = _unicodeExpand(s)
  File "/Users/ymh/dev/projects/rdflib/rdflib/py3compat.py", line 141, in _unicodeExpand
    return r_unicodeEscape.sub(lambda m: unichr(int(m.group(0)[2:], 16)), s)
  File "/Users/ymh/dev/projects/rdflib/rdflib/py3compat.py", line 141, in <lambda>
    return r_unicodeEscape.sub(lambda m: unichr(int(m.group(0)[2:], 16)), s)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

There is currently 15 errors of the same kind apparently affecting code in rdflib/py3compat.py (line 141) and rdflib/plugins/parsers/notation3.py (lines 1594 and 308)
These errors are not observed for python 3.4.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 61.06% when pulling 30f0f8c on IRI-Research:issue_453 into 5f41263 on RDFLib:master.

@joernhees
Copy link
Member

thanks for bringing this up, i've also frequently been annoyed by this, but let me explain why it is this way:

back when we discovered that narrow python builds don't really support unicode chars in the supplementary planes we found this highly annoying:

In [1]: u'\U0010ffff'
Out[1]: u'\U0010ffff'

In [2]: print u'\U0010ffff'

In [3]: len(u'\U0010ffff')
Out[3]: 2

In [4]: u'\U0010ffff'.encode('utf-8')
Out[4]: '\xf4\x8f\xbf\xbf'

In [5]: '\xf4\x8f\xbf\xbf'.decode('utf-8')
Out[5]: u'\U0010ffff'

In [6]: len('\xf4\x8f\xbf\xbf'.decode('utf-8'))
Out[6]: 2

In [7]: u'\U0010ffff'[0:1]
Out[7]: u'\udbff'

In [8]: u'\U0010ffff'[1:2]
Out[8]: u'\udfff'

In [9]: u'\U0010ffff'[0:1].encode('utf-8')
Out[9]: '\xed\xaf\xbf'

In [10]: u'\U0010ffff'[1:2].encode('utf-8')
Out[10]: '\xed\xbf\xbf'

So while your fix works (and is probably better than just crashing out in case you encounter a char > 0xFFFF) it is dangerous as soon as you do something with the strings like slicing, regexps based on string length, ...

Back then the decision was: UGH, use a wide python build.

Given that narrow python builds are the default on mac os x and binary packages like scipy aren't well compatible with wide python builds on these platforms, i'd now opt for: use a wide python build, but if you use a narrow build: warning to the user and best effort solution instead of crashing.

@gromgull i think you found this back then... any thoughts?

@gromgull
Copy link
Member

I don't see any point trying to make the tests pass in a narrow build - the build is "broken" :)

BUT I see lots of point in trying to fail more gracefully... not sure what I good solution is though. If you let the string through - you set yourself up for weird errors that are impossible to debug down the line.

@joernhees
Copy link
Member

would it be ok-ish if i added this to __init__.py?

try:
    unichr(0x10FFFF)
except ValueError:
    import warnings
    warnings.warn(
        'You are using a narrow Python build!\n'
        'This means that your Python does not properly support chars > 16bit.\n'
        'On your system chars like c=u"\\U0010FFFF" will have a len(c)==2.\n'
        'As this can cause hard to debug problems with string processing\n'
        '(slicing, regexp, ...) later on, we strongly advise to use a wide\n'
        'Python build in production systems.',
        UnicodeWarning
    )
    del warnings

@joernhees
Copy link
Member

(running python -W ignore) will ignore those warnings...

@joernhees joernhees added bug Something isn't working enhancement New feature or request parsing Related to a parsing. labels Feb 18, 2015
@joernhees joernhees self-assigned this Feb 18, 2015
@gromgull
Copy link
Member

I guess warnings are read by the same mythological users who read the documentation :)

@joernhees
Copy link
Member

well, this one is really annoying ^^

@joernhees
Copy link
Member

i actually think it's too annoying... maybe we should just not bother as everyone else?

@gromgull
Copy link
Member

I wonder if you should wait until we encounter a string with astral-plane unicode before warning? Many people live happily in a world where us-ascii is enough for everyone! No reason to bother them?

@joernhees
Copy link
Member

well, problem is that many developers won't ever see the error then and won't handle it... until the code ends up in a production system... at that point showing a warning to some poor user is quite pointless...

but annoying the shit out of every developer on mac os x might be is a stupid solution as well :-/

@ymph
Copy link
Contributor Author

ymph commented Feb 18, 2015

hello all,

Thank you for your kind answers and giving me the context of your choice.
Personally, I would have advocated for avoiding a crash, since in any case
I find always difficult to reliably process unicode string using methods
relying on string length.

Of course my goal was not to make the tests run ok, but they reflected some
problems I had on my dev machine when processing results from dbpedia.

Our production target is a wide build and as you say, I will just bite the
bullet and install a wide build on my dev machine (or use my "patched"
version of rdflib in dev).

regards,

ymh

On Wed, Feb 18, 2015 at 3:43 PM, Jörn Hees notifications@github.com wrote:

well, this one is really annoying ^^


Reply to this email directly or view it on GitHub
#453 (comment).

joernhees added a commit to joernhees/rdflib that referenced this pull request Feb 18, 2015
if chars > 0xFFFF are really encountered a UnicodeWarning is
issued.

On import an ImportWarning is issued. These are ignored by
default, but can be enabled if python is invoked with `-W all`,
as any good developer should do ^^.

closes RDFLib#453
@joernhees joernhees modified the milestone: rdflib 4.2.0 Feb 19, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request parsing Related to a parsing.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants