Decoding errors on escaped Unicode string in test for narrow build python #453
Thanks for bringing this up. I've also frequently been annoyed by this, but let me explain why it is this way. Back when we discovered that narrow Python builds don't really support Unicode chars in the supplementary planes, we found this highly annoying:
In [1]: u'\U0010ffff'
Out[1]: u'\U0010ffff'
In [2]: print u'\U0010ffff'
In [3]: len(u'\U0010ffff')
Out[3]: 2
In [4]: u'\U0010ffff'.encode('utf-8')
Out[4]: '\xf4\x8f\xbf\xbf'
In [5]: '\xf4\x8f\xbf\xbf'.decode('utf-8')
Out[5]: u'\U0010ffff'
In [6]: len('\xf4\x8f\xbf\xbf'.decode('utf-8'))
Out[6]: 2
In [7]: u'\U0010ffff'[0:1]
Out[7]: u'\udbff'
In [8]: u'\U0010ffff'[1:2]
Out[8]: u'\udfff'
In [9]: u'\U0010ffff'[0:1].encode('utf-8')
Out[9]: '\xed\xaf\xbf'
In [10]: u'\U0010ffff'[1:2].encode('utf-8')
Out[10]: '\xed\xbf\xbf'
So while your fix works (and is probably better than just crashing when you encounter a char > 0xFFFF), it is dangerous as soon as you do something with the strings like slicing, regexps based on string length, ... Back then the decision was: UGH, use a wide Python build. Given that narrow Python builds are the default on Mac OS X, and binary packages like scipy aren't well compatible with wide Python builds on these platforms, I'd now opt for: use a wide Python build, but if you use a narrow build, warn the user and fall back to a best-effort solution instead of crashing. @gromgull I think you found this back then... any thoughts?
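To make the session above easier to follow, here is a minimal sketch of where the two halves come from. It assumes that a narrow build stores strings as UTF-16 code units, which is what the outputs above suggest. The helper names `to_surrogate_pair` and `utf16_code_units` are made up for illustration and are not part of rdflib.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not in a supplementary plane')
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text):
    """Return the UTF-16 code units for text as a list of integers."""
    data = text.encode('utf-16-be')
    return list(struct.unpack('>%dH' % (len(data) // 2), data))


pair = to_surrogate_pair(0x10FFFF)
units = utf16_code_units(u'\U0010ffff')
print([hex(u) for u in units])  # two code units, matching Out[7] and Out[8]
```

The two code units are why `len()` reports 2 on a narrow build, and why slicing can cut a character in half.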
I don't see any point trying to make the tests pass in a narrow build - the build is "broken" :) BUT I see lots of point in trying to fail more gracefully... not sure what a good solution is though. If you let the string through, you set yourself up for weird errors that are impossible to debug down the line.
would it be ok-ish if i added this:
try:
    # unichr() raises ValueError for values > 0xFFFF on narrow builds
    unichr(0x10FFFF)
except ValueError:
    import warnings
    warnings.warn(
        'You are using a narrow Python build!\n'
        'This means that your Python does not properly support chars > 16bit.\n'
        'On your system chars like c=u"\\U0010FFFF" will have a len(c)==2.\n'
        'As this can cause hard to debug problems with string processing\n'
        '(slicing, regexp, ...) later on, we strongly advise to use a wide\n'
        'Python build in production systems.',
        UnicodeWarning
    )
    del warnings
(running |
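An alternative way to detect a narrow build, sketched here under the assumption that `sys.maxunicode` is an acceptable substitute for the `unichr()` probe: it is 0xFFFF on narrow builds and 0x10FFFF on wide builds (and on every Python 3.3+). The names `is_narrow_build` and `warn_if_narrow`, and the `narrow` override parameter, are hypothetical and exist only so the behaviour can be exercised on a wide build.

```python
import sys
import warnings

NARROW_BUILD_MESSAGE = (
    'You are using a narrow Python build! '
    'Chars > 0xFFFF are stored as two code units, so len(), slicing '
    'and regexps may behave unexpectedly.'
)


def is_narrow_build():
    """Return True if this interpreter cannot hold chars > 0xFFFF in one unit."""
    return sys.maxunicode <= 0xFFFF


def warn_if_narrow(narrow=None):
    """Issue a UnicodeWarning on narrow builds and return the detected state.

    The `narrow` argument is a hypothetical override for testing.
    """
    if narrow is None:
        narrow = is_narrow_build()
    if narrow:
        warnings.warn(NARROW_BUILD_MESSAGE, UnicodeWarning, stacklevel=2)
    return narrow


warn_if_narrow()
```

Unlike the `try`/`except` probe, this runs unchanged on Python 3, where `unichr` does not exist.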
I guess warnings are read by the same mythological users who read the documentation :)
well, this one is really annoying ^^
i actually think it's too annoying... maybe we should just not bother, like everyone else?
I wonder if you should wait until we encounter a string with astral-plane unicode before warning? Many people live happily in a world where us-ascii is enough for everyone! No reason to bother them?
well, problem is that many developers won't ever see the error then and won't handle it... until the code ends up in a production system... at that point showing a warning to some poor user is quite pointless... but annoying the shit out of every developer on mac os x might be a stupid solution as well :-/
Hello all,
Thank you for your kind answers and for giving me the context of your choice. Of course my goal was not to make the tests run OK, but they reflected some [...]
Our production target is a wide build and, as you say, I will just bite the [...]
regards, ymh
On Wed, Feb 18, 2015 at 3:43 PM, Jörn Hees notifications@github.com wrote:
If chars > 0xFFFF are actually encountered, a UnicodeWarning is issued. On import, an ImportWarning is issued. These are ignored by default, but can be enabled if Python is invoked with `-W all`, as any good developer should do ^^. closes RDFLib#453
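The "warn only when such a char is actually encountered" idea can be sketched roughly as below. This is not the code from the commit. `find_astral_chars` and `check_text` are hypothetical names, and the surrogate-range test assumes that a narrow build exposes each astral char as two surrogate halves. The `simplefilter('always')` call stands in for running Python with `-W all`.

```python
import warnings


def find_astral_chars(text):
    """Return chars outside the BMP, or surrogate halves on narrow builds."""
    return [ch for ch in text
            if ord(ch) > 0xFFFF or 0xD800 <= ord(ch) <= 0xDFFF]


def check_text(text):
    """Warn once per call if text contains chars > 0xFFFF, then return it."""
    found = find_astral_chars(text)
    if found:
        warnings.warn(
            'Encountered chars > 0xFFFF; string processing on a narrow '
            'Python build may give unexpected results.',
            UnicodeWarning,
            stacklevel=2,
        )
    return text


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')       # show every warning category
    check_text(u'plain ascii')            # stays silent
    check_text(u'astral: \U0010ffff')     # triggers the warning

print(len(caught))
```

With this approach, users who only ever handle BMP text never see a warning, which addresses the concern raised earlier in the thread.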
Running the tests on a narrow build of Python 2.7 (for example, the OS X Python version) reveals a list of Unicode decoding errors on Unicode-escaped strings such as:
There are currently 15 errors of the same kind, apparently affecting code in rdflib/py3compat.py (line 141) and rdflib/plugins/parsers/notation3.py (lines 1594 and 308).
These errors are not observed for Python 3.4.