Test failures on FreeBSD 9.1 #3360

Closed
neirbowj opened this issue Apr 14, 2013 · 92 comments
Labels
Unicode Unicode strings
Milestone

Comments

@neirbowj
Contributor

% uname -a
FreeBSD XXXX.saltant.net 9.1-STABLE FreeBSD 9.1-STABLE #0 r248078: Fri Mar  8 20:36:00 EST 2013     root@XXXX.saltant.net:/usr/obj/usr/src/sys/NIPPL  amd64
% pkg_info -xE pandas
py27-pandas-0.11.0.r1
% pkg_info -xE ^python
python27-2.7.3_6
% pkg_info -xE numpy
py27-numpy-1.6.2_1,1
% nosetests pandas.tests.test_index:TestMultiIndex.test_legacy_pickle
E
======================================================================
ERROR: test_legacy_pickle (pandas.tests.test_index.TestMultiIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/tests/test_index.py", line 1060, in test_legacy_pickle
    obj = pickle.load(open(ppath, 'r'))
  File "/usr/local/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/usr/local/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/local/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named multiarray

----------------------------------------------------------------------
Ran 1 test in 0.002s

FAILED (errors=1)
@jreback
Contributor

jreback commented Apr 14, 2013

are there other failures?
what does
test_fast.sh return?

@neirbowj
Contributor Author

I installed from the rc1 tarball, so I don't have test_fast.sh handy. Give me a moment to bring up a development environment and I'll get back to you about test_fast. In the meantime, here's the complete output from nose.

% nosetests --exe pandas

** (process:86069): WARNING **: Trying to register gtype 'GMountMountFlags' as enum when in fact it is of type 'GFlags'

** (process:86069): WARNING **: Trying to register gtype 'GDriveStartFlags' as enum when in fact it is of type 'GFlags'

** (process:86069): WARNING **: Trying to register gtype 'GSocketMsgFlags' as enum when in fact it is of type 'GFlags'
..........................SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.........................................................................................................................................................................................................................................................................................SSSSSSSSSSS..........SS...................................................................................................................................................................................S..................S..S.SSSSSSS......................................S.....SSSSSSSSSSSSSSSSSSSSSSSS..............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................S.S....................................................E................................................................................................................................................................................................................................SSS.............................................................................................................................................................................................................................
S.........................S.......................................................................................................................................................................................................................................................................................................................................E..............................................................................................................................................................S.............................................................................................................................................................S..................S...S............SSS.........S............SS...........SS.....SSSS.......................................................................................................................S......................................................................................................E.......................................S............................................................................................................
======================================================================
ERROR: test_to_string_repr_unicode (pandas.tests.test_format.TestDataFrameFormatting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/tests/test_format.py", line 237, in test_to_string_repr_unicode
    rs = repr(ser).split('\n')
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1138, in __repr__
    return str(self)
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1097, in __str__
    return self.__bytes__()
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1107, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1125, in __unicode__
    dtype=True)
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1214, in _get_repr
    result = formatter.to_string()
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/format.py", line 146, in to_string
    idx = k.ljust(pad_space + _encode_diff(k))
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/format.py", line 168, in _encode_diff
    return len(x) - len(x.decode(encoding))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 0: ordinal not in range(128)

======================================================================
ERROR: test_legacy_pickle (pandas.tests.test_index.TestMultiIndex)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/tests/test_index.py", line 1060, in test_legacy_pickle
    obj = pickle.load(open(ppath, 'r'))
  File "/usr/local/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/usr/local/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/local/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named multiarray

======================================================================
ERROR: http://docs.python.org/py3k/reference/datamodel.html#object.__repr__
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/tests/test_series.py", line 1361, in test_repr_should_return_str
    self.assertTrue(type(df.__repr__() == str))  # both py2 / 3
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1138, in __repr__
    return str(self)
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1097, in __str__
    return self.__bytes__()
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1107, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1125, in __unicode__
    dtype=True)
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1214, in _get_repr
    result = formatter.to_string()
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/format.py", line 146, in to_string
    idx = k.ljust(pad_space + _encode_diff(k))
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/format.py", line 168, in _encode_diff
    return len(x) - len(x.decode(encoding))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 0: ordinal not in range(128)

----------------------------------------------------------------------
Ran 3125 tests in 297.710s

FAILED (SKIP=115, errors=3)

I'm looking into the GFlags warnings now. TestDataFrameFormatting.test_eng_float_formatter generates them in the final call to fmt.reset_printoptions()

I'm also working on a minimal failing test for TestDataFrameFormatting.test_to_string_repr_unicode, because:

% nosetests pandas.tests.test_format:TestDataFrameFormatting
.
** (process:86110): WARNING **: Trying to register gtype 'GMountMountFlags' as enum when in fact it is of type 'GFlags'

** (process:86110): WARNING **: Trying to register gtype 'GDriveStartFlags' as enum when in fact it is of type 'GFlags'

** (process:86110): WARNING **: Trying to register gtype 'GSocketMsgFlags' as enum when in fact it is of type 'GFlags'
...............................................E................
======================================================================
ERROR: test_to_string_repr_unicode (pandas.tests.test_format.TestDataFrameFormatting)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/tests/test_format.py", line 237, in test_to_string_repr_unicode
    rs = repr(ser).split('\n')
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1138, in __repr__
    return str(self)
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1097, in __str__
    return self.__bytes__()
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1107, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1125, in __unicode__
    dtype=True)
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/series.py", line 1214, in _get_repr
    result = formatter.to_string()
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/format.py", line 146, in to_string
    idx = k.ljust(pad_space + _encode_diff(k))
  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/format.py", line 168, in _encode_diff
    return len(x) - len(x.decode(encoding))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 0: ordinal not in range(128)

----------------------------------------------------------------------
Ran 65 tests in 3.022s

FAILED (errors=1)

But...

% nosetests pandas.tests.test_format:TestDataFrameFormatting.test_to_string_repr_unicode
.
----------------------------------------------------------------------
Ran 1 test in 0.024s

OK

test_to_string_repr_unicode uses np.random.randn to generate test data, so pass/fail depends on earlier consumption of random numbers.
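
The order-dependence hypothesis above can be illustrated with a small sketch. It uses the standard-library `random` module as a stand-in for `np.random` (an assumption, made only to keep the example dependency-free), and the function names are hypothetical rather than taken from the pandas test suite.

```python
import random

def make_test_data(n=3):
    """Stand-in for a test that draws its data from the global RNG."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def earlier_test():
    """Any earlier test that consumes random numbers shifts the stream."""
    random.random()

# Run in isolation: the test sees the first draws after seeding.
random.seed(0)
alone = make_test_data()

# Run after another test: same seed, but the stream has already advanced.
random.seed(0)
earlier_test()
after_other = make_test_data()

# Reseeding immediately before generating the data restores determinism.
random.seed(0)
reseeded = make_test_data()

print(alone == after_other)  # the data differs depending on test order
print(alone == reseeded)     # the data is reproducible once reseeded
```

Seeding inside the test itself is one way to rule this kind of order dependence in or out while debugging.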

@neirbowj
Contributor Author

OK, when I

git clone --branch v0.11.0rc1 https://github.com/pydata/pandas pandas-0.11.0rc1
cd pandas-0.11.0rc1
python setup.py build_ext --inplace
./test_fast.sh

I get the warnings, the failure in pandas.tests.test_format:TestDataFrameFormatting.test_to_string_repr_unicode and the associated http://docs.python.org/py3k/reference/datamodel.html#object.__repr__ error, but pandas.tests.test_index:TestMultiIndex.test_legacy_pickle apparently passes.

I don't know what to do next, but will gladly take direction and respond to requests.

@ghost

ghost commented Apr 15, 2013

Please pull the latest master (I pushed some changes today), and post the output of ci/print_versions.py.

also, what is the value of pandas.options.display.encoding?

edit: The test name above is now fixed in master.

@neirbowj
Contributor Author

% git describe
v0.11.0rc1-27-g9c05da7
% ci/print_versions.py

INSTALLED VERSIONS
------------------
Python: 2.7.3.final.0
OS: FreeBSD 9.1-STABLE FreeBSD 9.1-STABLE #0 r248078: Fri Mar  8 20:36:00 EST 2013     root@XXXX.saltant.net:/usr/obj/usr/src/sys/NIPPL amd64

Cython: 0.17.1
Numpy: 1.6.2
Scipy: 0.11.0
statsmodels: Not installed
    patsy: Not installed
scikits.timeseries: Not installed
dateutil: 2.1
pytz: 2013b
PyTables: Not Installed
    numexpr: Not Installed
matplotlib: 1.2.0
openpyxl: Not installed
xlrd: Not installed
xlwt: Not installed
sqlalchemy: Not installed

And furthermore...

>>> import pandas
/usr/local/lib/python2.7/site-packages/pytz-2013b-py2.7.egg/pytz/__init__.py:35: UserWarning: Module pandas was already imported from pandas/__init__.pyc, but /usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg is being added to sys.path
>>> pandas.options.display.encoding
'US-ASCII'
>>>

@ghost

ghost commented Apr 15, 2013

Your terminal doesn't support utf-8/unicode, or at least doesn't report it in a way pandas
can discern (the code is in core/format.py:detect_console_encoding()).

Obviously, the test needs to fail more gracefully in such a situation, but really it's signalling
that while pandas supports unicode, your environment does not.

If you're only working with ASCII data, you should be fine in principle btw.
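
For readers unfamiliar with how a console encoding can be discovered, here is a minimal sketch. It is not the pandas `detect_console_encoding()` implementation; the function name `guess_console_encoding` and the exact fallback order are assumptions chosen for illustration, using only standard-library calls.

```python
import locale
import sys

def guess_console_encoding():
    """Best-effort guess at the encoding the terminal understands."""
    encoding = None

    # First choice: whatever the standard streams report.
    for stream in (sys.stdout, sys.stdin):
        encoding = getattr(stream, "encoding", None)
        if encoding:
            break

    # If the streams report nothing, or plain ASCII, ask the locale.
    if not encoding or "ascii" in encoding.lower():
        try:
            encoding = locale.getpreferredencoding()
        except locale.Error:
            pass

    # Last resort: the interpreter-wide default.
    if not encoding or "ascii" in encoding.lower():
        encoding = sys.getdefaultencoding()

    return encoding

print(guess_console_encoding())
```

With neither `$LC_ALL` nor `$LANG` set, the locale step typically reports an ASCII-family encoding, which is in line with the `'US-ASCII'` value shown earlier in this thread.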

@ghost

ghost commented Apr 15, 2013

But there's something else going on here.

  File "/usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/format.py", line 168, in _encode_diff
    return len(x) - len(x.decode(encoding))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 0: ordinal not in range(128)

That should have been a UnicodeEncodeError (which would have been caught and the test passed).

I expected to see:

In [7]: x=u'\u03c3'
   ...: len(x)-len(x.decode('ascii'))
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-7-33671dd44d61> in <module>()
      1 x=u'\u03c3'
----> 2 len(x)-len(x.decode('ascii'))

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 0: ordinal not in range(128)

x is a unicode string, which gets implicitly encoded to ASCII before it is decoded.
It doesn't make sense that the implicit encode would succeed and the decode immediately
following would fail.

Can you break in with pdb and print out x at the point of failure?

nosetests pandas/tests/test_format.py -m repr --pdb
<break in>
(pdb) print x

also, what are the values of the $LC_ALL and $LANG envars?

@neirbowj
Contributor Author

% nosetests pandas/tests/test_format.py -m repr --pdb
.
** (process:2949): WARNING **: Trying to register gtype 'GMountMountFlags' as enum when in fact it is of type 'GFlags'

** (process:2949): WARNING **: Trying to register gtype 'GDriveStartFlags' as enum when in fact it is of type 'GFlags'

** (process:2949): WARNING **: Trying to register gtype 'GSocketMsgFlags' as enum when in fact it is of type 'GFlags'
.............> /home/obrienjw/src/github/pandas/pandas/core/format.py(168)_encode_diff()
-> return len(x) - len(x.decode(encoding))
(Pdb) print x
*** UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 0: ordinal not in range(128)
(Pdb)

Neither $LC_ALL nor $LANG is defined in the environment.

This has been vexing me but I just breathed a sigh of relief since you just pointed out the UnicodeEncodeError vs. UnicodeDecodeError difference. I hadn't noticed it, and was racking my brain to understand why the except block wasn't catching it. D'oh! At least now there is an explanation for why this test passes sometimes and fails other times (i.e. different exceptions).
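
A short sketch of why the handler missed the error: `UnicodeEncodeError` and `UnicodeDecodeError` are sibling subclasses of `UnicodeError`, so an `except UnicodeEncodeError` clause does not catch a decode error. The `classify` helper below is hypothetical and exists only to make that visible.

```python
def classify(exc):
    """Mimic a handler that only expects UnicodeEncodeError."""
    try:
        raise exc
    except UnicodeEncodeError:
        return "caught by 'except UnicodeEncodeError'"
    except UnicodeError:
        return "only caught by the broader 'except UnicodeError'"

# Construct both exception types by hand; the arguments are
# (encoding, object, start, end, reason).
encode_err = UnicodeEncodeError("ascii", u"\u03c3", 0, 1, "ordinal not in range(128)")
decode_err = UnicodeDecodeError("ascii", b"\xcf\x83", 0, 1, "ordinal not in range(128)")

print(classify(encode_err))
print(classify(decode_err))
print(issubclass(UnicodeDecodeError, UnicodeEncodeError))
```

Catching `UnicodeError` would cover both cases, at the cost of hiding which of the two actually occurred.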

As for the UTF-8 support, I'll see about turning on support in my terminal so that I can run tests in a richer environment. I'm the pandas maintainer for FreeBSD so it's not just my data I need to worry about.

@ghost

ghost commented Apr 15, 2013

I'm pretty sure the code that's misbehaving (although the wrong exception is puzzling)
can actually be removed (#3364), it was a stopgap that's no longer necessary after other
unicode-related changes. But that's not a change I'll merge a few days before a major release.

I can't repro any of this on my box, but if you figure out why the DecodeError is happening
instead of the EncodeError that's 99% of it.

@ghost

ghost commented Apr 15, 2013

was the pickle issue a false alarm?

@neirbowj neirbowj reopened this Apr 15, 2013
@neirbowj
Contributor Author

Oops. Not sure about the pickle test. Will see if I can reproduce it tonight.

@jreback
Contributor

jreback commented Apr 15, 2013

FYI: do a clean git clone and try again. The error message says it can't find multiarray, which is a numpy module, so also make sure the correct numpy is being loaded.
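
One way to check which copy of a package the interpreter would pick up is to ask the import system where it resolves the name. This is a sketch using the standard-library `importlib.util.find_spec`; the helper name `module_origin` is hypothetical, and `json` is included only as a module that is always available.

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None

# "numpy" prints None when numpy is not installed in this environment.
for name in ("json", "numpy"):
    print(name, "->", module_origin(name))
```

If the reported path points at an unexpected site-packages directory or egg, that is a hint that two installations are shadowing each other.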

@ghost

ghost commented Apr 15, 2013

Relevant perhaps:
http://stackoverflow.com/questions/9641916/python-pandas-cant-find-numpy-core-multiarray-when-importing-pandas

might you be using a 32bit compiled numpy on an amd64 system?

@neirbowj
Contributor Author

OK, here's the reproducibility pattern I'm seeing for the test_legacy_pickle failure.

  • build from tarball (via FreeBSD math/py-pandas port with my draft patch applied) and test installed: fails
  • build from git at 0.11.0rc1 tag and test in-place: passes
  • build from git at latest master (a9f3f6d) and test in-place: passes

Next I'm going to try installing from git without assistance from the FreeBSD ports machinery and/or point FreeBSD ports at the repo instead of the tarball. Depending on the result I will probably suspect Cython or FreeBSD ports, but hopefully not both.

As for word-length, I see no possibility that pandas and numpy are mismatched on this machine.

% uname -m
amd64
% find \
    /usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg \
    /usr/local/lib/python2.7/site-packages/numpy -name "*.so" \
    | xargs file -b \
    | sort | uniq
ELF 64-bit LSB shared object, x86-64, version 1 (FreeBSD), dynamically linked, not stripped
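
The same word-length check can be sketched in Python by reading the ELF identification bytes directly. The fifth byte of an ELF file (`EI_CLASS`) is 1 for 32-bit objects and 2 for 64-bit objects; the helper names below are hypothetical.

```python
import struct

ELF_MAGIC = b"\x7fELF"
EI_CLASS_OFFSET = 4           # byte 4 of e_ident holds the file class
ELF_CLASSES = {1: 32, 2: 64}  # ELFCLASS32, ELFCLASS64

def elf_word_size(header):
    """Return 32 or 64 for an ELF header, or None if it is not ELF."""
    if len(header) <= EI_CLASS_OFFSET or not header.startswith(ELF_MAGIC):
        return None
    return ELF_CLASSES.get(header[EI_CLASS_OFFSET])

def shared_object_word_size(path):
    """Read just enough of a file on disk to classify it."""
    with open(path, "rb") as fh:
        return elf_word_size(fh.read(16))

def interpreter_word_size():
    """Pointer width of the running interpreter, in bits."""
    return struct.calcsize("P") * 8

print(interpreter_word_size())
print(elf_word_size(b"\x7fELF\x02\x01\x01"))  # a 64-bit ELF header prefix
```

Comparing `shared_object_word_size(path)` for each extension module against `interpreter_word_size()` gives the same answer as the `find | xargs file` pipeline, one file at a time.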

@jreback
Contributor

jreback commented Apr 16, 2013

so the py-pandas port is the FreeBSD packaging mechanism? looks like it installs 0.10.1?

if you install 0.10.1 from the pypi tarball does it work for you?

@neirbowj
Contributor Author

@jreback: Yes, "port" in this context is like the .spec file and associated cruft to generate an RPM. I published a working draft for 0.11.0rc1 for testing to a mailing list. After 0.11.0 is released, I will revise the patch and submit it as a pandas maintainer update. All versions published to FreeBSD ports so far have had all tests passing, including 0.10.1.

@neirbowj
Contributor Author

  • build/install from FreeBSD port + patch for 0.11.0rc1 tarball: fails (as before)
  • build/install from FreeBSD port + patch for 0.11.0rc1 github tag: passes

It's long past my bedtime, but when I get a chance tomorrow I will take a look at the diff between the tarball-provided .C files and my cython-generated .C files. That is, unless there is something else I should do first.

At this point I think I have a fallback plan. In the event that we cannot determine and resolve the root cause of this issue prior to release, I could convert the port to build from github. I would rather not do that, though, because it adds a build-time dependency (cython), and increases the number of things that could vary across user build environments.

@jreback
Contributor

jreback commented Apr 17, 2013

what's your cython version?

@neirbowj
Contributor Author

Cython 0.17.1. See above for other versions.

@ghost

ghost commented Apr 17, 2013

any new leads on the unexplained UnicodeDecodeError? that's pretty straightforward
to tackle using pdb.

@neirbowj
Contributor Author

@y-p: Sorry, but no. I've only had about an hour a day to work on this so far. I should have a good chunk of time this evening, so I hope to be able to report more substantial progress soon.

@ghost

ghost commented Apr 17, 2013

I don't see this blocking 0.11, all the evidence points at the issue being somewhere along
the freebsd pipeline. If there's a freebsd vagrant box and you post steps to reproduce,
maybe we can speed things along towards resolution.

@neirbowj
Contributor Author

@y-p: Vagrant is a good idea, but it's not working out in practice. Using xironix/freebsd-vagrant, I got to vagrant ssh before it broke. I'm not about to insert this as a prerequisite problem to solve before I get back to work on solving the pandas problem. If you feel like standing up a stock FreeBSD 9.1 system in VirtualBox, I'll gladly provide steps to reproduce.

@neirbowj
Contributor Author

@y-p: For the UnicodeDecodeError in TestDataFrameFormatting.test_to_string_repr_unicode, here's what I've learned so far with the help of pdb.

When the test passes, and the UnicodeEncodeError is correctly caught, the offending value is u'\u03c3a' and the display encoding is 'US-ASCII'. When the test fails, and the unexpected UnicodeDecodeError is raised, the offending value and display encoding are the same as they are when the test passes. Terminal captures available on request.

Next up: I need to find out what else could cause python to raise a different exception.

@ghost

ghost commented Apr 18, 2013

I think I figured out the problem, the encoding detection routine was too rigid.
Pushed 729d333 to master, please see if it resolves the unicode issues.

That also possibly solves the failed unicode support detection on your
system, you can confirm by checking whether display.encoding is now utf-8 or similar.

@ghost

ghost commented Apr 20, 2013

@neirbowj , was the fix effective for you? would be good to clear this up prior to 0.11.0

@neirbowj
Contributor Author

@y-p: Sorry for the delay. No, 729d333, when applied in situ to my tarball build does not change the exception behaviour inside of pandas.core.format:SeriesFormatter.to_string.

This is what is currently bending my brain. In addition to applying 729d333, I have instrumented to_string with pdb so that it will drop me into the debugger at essentially the same point under both passing and failing conditions.

When TestDataFrameFormatting.test_to_string_repr_unicode passes:

% nosetests -s pandas.tests.test_format:TestDataFrameFormatting.test_to_string_repr_unicode  
> /usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/format.py(152)to_string()
-> idx = k.ljust(pad_space)
(Pdb) import codecs
(Pdb) codecs.getdecoder('ascii') 
<built-in function ascii_decode>
(Pdb) shrubbery = u'\u03c3'
(Pdb) shrubbery.decode('ascii')
*** UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 0: ordinal not in range(128)
(Pdb) 

When TestDataFrameFormatting.test_to_string_repr_unicode fails:

% nosetests -s pandas.tests.test_format:TestDataFrameFormatting
.
** (process:84153): WARNING **: Trying to register gtype 'GMountMountFlags' as enum when in fact it is of type 'GFlags'

** (process:84153): WARNING **: Trying to register gtype 'GDriveStartFlags' as enum when in fact it is of type 'GFlags'

** (process:84153): WARNING **: Trying to register gtype 'GSocketMsgFlags' as enum when in fact it is of type 'GFlags'
...............................................> /usr/local/lib/python2.7/site-packages/pandas-0.11.0rc1-py2.7-freebsd-9.1-STABLE-amd64.egg/pandas/core/format.py(155)to_string()
-> result[i] = result[i] % (idx, v)
(Pdb) import codecs
(Pdb) codecs.getdecoder('ascii')
<built-in function ascii_decode>
(Pdb) shrubbery = u'\u03c3'
(Pdb) shrubbery.decode('ascii')
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 0: ordinal not in range(128)
(Pdb) 

What I think this shows is that some internal state that is non-obvious to me is affecting the behaviour of the ascii_decode builtin. Do any of the tests monkey patch something that ascii_decode uses?

@ghost

ghost commented Apr 20, 2013

There's the stdin_encoding context manager in util/testing.py, but
I don't think it's used anywhere.

you need to check the value of sys.getdefaultencoding() in either situation.
That's what python uses to coerce between string and unicode.

it's ascii on every system I've ever seen.
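
The role of the default encoding can be sketched on Python 3 by emulating what Python 2 does when `.decode()` is called on a unicode object: it first encodes with the interpreter default encoding, then decodes the resulting bytes. The helper names are hypothetical, and the emulation is an assumption-labelled stand-in rather than the interpreter's actual code path.

```python
def py2_style_unicode_decode(text, encoding, default_encoding):
    """Emulate Python 2's unicode.decode: implicit encode with the
    default encoding, followed by the requested decode."""
    return text.encode(default_encoding).decode(encoding)

def outcome(default_encoding):
    """Name of the exception raised for u'\\u03c3' decoded as ASCII."""
    try:
        py2_style_unicode_decode(u"\u03c3", "ascii", default_encoding)
    except UnicodeError as exc:
        return type(exc).__name__
    return "no error"

print(outcome("ascii"))  # the implicit encode step fails
print(outcome("utf-8"))  # the encode succeeds, then decoding byte 0xcf fails
```

The first byte of the UTF-8 encoding of `u'\u03c3'` is `0xcf`, which matches the byte named in the failing traceback. That is consistent with the default encoding being something other than ASCII in the failing runs.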

@neirbowj
Contributor Author

FreeBSD is off the hook for meddling with line endings. If you recall, test_legacy_pickle passes when I build from GH. We had considered the possibility of a Cython-related root-cause. In fact, it looks like the 0.11.0rc1 tarball may contain nothing but files with CRLF line endings, while the GH repo doesn't.

@neirbowj
Contributor Author

Options for test_legacy_pickle failure:

  1. Update the test to use obj = pickle.load(open(ppath, 'rU')) to make it insensitive to line endings.
  2. Update the test (or add a new one) to ensure that multiindex_v1.pickle is installed with correct newlines.
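
The second option could be sketched as a check that a text-format data file contains no carriage returns. The helper name `carriage_return_count` and the throwaway file names are hypothetical; the demonstration writes its own sample files so that it is self-contained.

```python
import os
import tempfile

def carriage_return_count(path):
    """Count '\\r' bytes in a file, reading it in binary mode."""
    with open(path, "rb") as fh:
        return fh.read().count(b"\r")

# Self-contained demonstration with two throwaway files.
tmpdir = tempfile.mkdtemp()
lf_path = os.path.join(tmpdir, "lf.pickle")
crlf_path = os.path.join(tmpdir, "crlf.pickle")

with open(lf_path, "wb") as fh:
    fh.write(b"(lp0\nI1\na.")        # LF line endings
with open(crlf_path, "wb") as fh:
    fh.write(b"(lp0\r\nI1\r\na.")    # the same content with CRLF endings

lf_count = carriage_return_count(lf_path)
crlf_count = carriage_return_count(crlf_path)
print(lf_count, crlf_count)
```

Such a check only makes sense for text-based files; a binary-protocol pickle can legitimately contain the byte `0x0d`.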

@ghost

ghost commented Apr 21, 2013

On MPL issues.

Changing sys.setdefaultencoding() back is a no go. might break GTK backend, and general
anti-pattern. edit: dislike changing display encoding midrun as well. see below.

Try #3409. I think it solves the problem.
Basically, the code that's failing is there to handle bytestrings, not unicode, so the
dependency on the value of sys.getdefaultencoding() is optional.

That doesn't solve the issue of MPL changing sys.getdefaultencoding, but the contract is that
your system reports its capabilities at import time, and that's what we go by.

The rest of pandas shouldn't depend on getdefaultencoding() directly, so the end result
is no unicode support when the system reports it only supports ascii, but no exceptions.
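
As a rough illustration of the idea (not the actual change in #3409), the padding adjustment from the traceback, `len(x) - len(x.decode(encoding))`, only has meaning for byte strings. A sketch that applies it to bytes and skips it for text might look like this; both function names are hypothetical.

```python
def encode_diff(value, encoding):
    """Extra bytes an encoded string occupies compared with its character
    count; already-decoded text needs no adjustment."""
    if isinstance(value, bytes):
        return len(value) - len(value.decode(encoding))
    return 0

def pad_label(value, width, encoding):
    """Left-justify value so that it spans width characters when displayed."""
    if isinstance(value, bytes):
        # bytes.ljust counts bytes, so widen by the multi-byte overhead.
        return value.ljust(width + encode_diff(value, encoding))
    return value.ljust(width)

label = u"\u03c3a"
as_bytes = label.encode("utf-8")   # 3 bytes for 2 characters

print(encode_diff(as_bytes, "utf-8"))
print(encode_diff(label, "utf-8"))
print(len(pad_label(as_bytes, 5, "utf-8").decode("utf-8")))
```

Because text values never go through `.decode()` here, the result no longer depends on the interpreter default encoding.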

@neirbowj
Contributor Author

#3409 does it for me. No more test failures.

@ghost

ghost commented Apr 21, 2013

Good, merged.

Now, about the pickle issues. @jreback what do you say?

@jreback
Contributor

jreback commented Apr 21, 2013

this code was not changed at all in 0.11; maybe the file had the embedded \r when it was generated?

@ghost

ghost commented Apr 21, 2013

@neirbowj, when you say "remember that it passes when I built from GH", do you mean that you
ran python ./setup.py sdist, got a tarball, installed it and the test passed?

@ghost

ghost commented Apr 21, 2013

Ok, Github repo clone:

λ ll pandas/tests/data/

drwxr-xr-x 2 user1 user1 4096 Apr 13 05:02 ./
drwxr-xr-x 3 user1 user1 4096 Apr 13 05:02 ../
-rw-r--r-- 1 user1 user1 4750 Jun  4  2012 iris.csv
-rw-r--r-- 1 user1 user1  670 Jun 28  2012 mindex_073.pickle
-rw-r--r-- 1 user1 user1 1249 Feb  9  2012 multiindex_v1.pickle
-rw-r--r-- 1 user1 user1 8188 Apr 10 10:02 tips.csv
-rw-r--r-- 1 user1 user1  595 Apr 26  2012 unicode_series.csv

tarball from (windows?) build box:

/tmp/pandas-0.11.0rc1  
λ ll ~/src/pandas/pandas/tests/data/
total 36
drwxr-xr-x 2 user1 user1 4096 Apr 21 16:32 ./
drwxr-xr-x 3 user1 user1 4096 Apr 21 18:29 ../
-rw-r--r-- 1 user1 user1 4600 Apr 21 16:32 iris.csv
-rw-r--r-- 1 user1 user1  670 Apr 21 16:32 mindex_073.pickle
-rw-r--r-- 1 user1 user1 1101 Apr 21 16:32 multiindex_v1.pickle
-rw-r--r-- 1 user1 user1 7943 Apr 21 16:32 tips.csv
-rw-r--r-- 1 user1 user1  577 Apr 21 16:32 unicode_series.csv

NB the changed file size(s) of the pickle file multiindex_v1.pickle
I did a setup.py sdist on linux, and the pickle files are unchanged.

The idea that windows would mangle binary files for line termination just
boggles the mind.

cc @changhiskhan
edit: those are actually reversed
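
The size differences between the two listings can be worked out directly. The sketch below copies the byte counts from the listings and labels them only as `larger` and `smaller`, given the note that the captions were swapped. The `size_after_lf_to_crlf` helper is hypothetical and shows how many bytes an LF to CRLF rewrite would add.

```python
# Sizes in bytes, copied from the two directory listings.
larger = {
    "iris.csv": 4750,
    "mindex_073.pickle": 670,
    "multiindex_v1.pickle": 1249,
    "tips.csv": 8188,
    "unicode_series.csv": 595,
}
smaller = {
    "iris.csv": 4600,
    "mindex_073.pickle": 670,
    "multiindex_v1.pickle": 1101,
    "tips.csv": 7943,
    "unicode_series.csv": 577,
}

size_delta = {name: larger[name] - smaller[name] for name in larger}

def size_after_lf_to_crlf(data):
    """Size of data after every LF is rewritten as CRLF (one extra byte each)."""
    return len(data) + data.count(b"\n")

for name in sorted(size_delta):
    print(name, size_delta[name])
```

Every text file grows by a whole number of bytes while `mindex_073.pickle` is unchanged. That pattern is consistent with one extra byte per line, although the listings alone do not prove it.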

@ghost

ghost commented Apr 21, 2013

possibly related

git config  core.autocrlf

@neirbowj
Contributor Author

@y-p: When I talk about building from a tarball, I'm referring to the published sdist. When I refer to building from GH, FreeBSD fetches a GH-produced archive, and thereafter treats it like a regular sdist (i.e. extract into a working directory, configure, build, install). The GH method does not actually use git locally, but it can be configured to ask GH for a tarball from an arbitrary ref.

@ghost

ghost commented Apr 21, 2013

Ok. Looks like there's something wonky with the build box; I don't think it's a pandas issue per se.

@changhiskhan
Contributor

I downloaded the tarball from http://pandas.pydata.org/pandas-build/dev/
this is what i get. The file sizes seem to match the github repo clone rather than the one you got from the tarball.

 ~/Downloads/pandas-0.11.0rc1/pandas/tests/data $ ll
total 28
-rw-rw-r-- 1 chang chang 4750 Jun  3  2012 iris.csv
-rw-rw-r-- 1 chang chang  670 Jun 28  2012 mindex_073.pickle
-rw-rw-r-- 1 chang chang 1249 Feb  9  2012 multiindex_v1.pickle
-rw-rw-r-- 1 chang chang 8188 Apr 10 00:02 tips.csv
-rw-rw-r-- 1 chang chang  595 Apr 26  2012 unicode_series.csv

@ghost

ghost commented Apr 22, 2013

those file sizes match what I get from the tarball as well,
but not the sizes of my repo clone, which are the sizes correlated
with passing tests.

total 36
drwxr-xr-x 2 user1 user1 4096 Apr 22 20:38 ./
drwxr-xr-x 3 user1 user1 4096 Apr 22 20:55 ../
-rw-r--r-- 1 user1 user1 4600 Apr 22 20:38 iris.csv
-rw-r--r-- 1 user1 user1  670 Apr 22 20:38 mindex_073.pickle
-rw-r--r-- 1 user1 user1 1101 Apr 22 20:38 multiindex_v1.pickle
-rw-r--r-- 1 user1 user1 7943 Apr 22 20:38 tips.csv
-rw-r--r-- 1 user1 user1  577 Apr 22 20:38 unicode_series.csv

note also

λ wget https://github.com/pydata/pandas/raw/master/pandas/tests/data/multiindex_v1.pickle
λ ll
total 2736
drwxr-xr-x  3 user1 user1    4096 Apr 22 22:28 ./
drwxrwxrwt 16 root  root    36864 Apr 22 22:26 ../
-rw-r--r--  1 user1 user1    1101 Apr 22 22:28 multiindex_v1.pickle
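
A quick sanity check on the listings above: if the larger files are LF-to-CRLF conversions of the smaller ones, each file should grow by exactly one byte per line. The sketch below (Python 3; `crlf_size` is a hypothetical helper, not part of pandas) spells out that relationship and computes the differences implied by the sizes quoted in this thread.

```python
def crlf_size(data: bytes) -> int:
    """Size of `data` after converting every LF to CRLF (assumes no CR yet)."""
    return len(data) + data.count(b"\n")

# Sizes quoted in the thread: (repo clone, downloaded tarball)
sizes = {
    "iris.csv": (4600, 4750),
    "mindex_073.pickle": (670, 670),
    "multiindex_v1.pickle": (1101, 1249),
    "tips.csv": (7943, 8188),
    "unicode_series.csv": (577, 595),
}

# Extra bytes per file; under the CRLF hypothesis this equals the LF count.
extra = {name: big - small for name, (small, big) in sizes.items()}

# The helper agrees with an actual LF -> CRLF replacement on a small sample.
sample = b"a,b\n1,2\n"
assert crlf_size(sample) == len(sample.replace(b"\n", b"\r\n"))
```

Under that assumption multiindex_v1.pickle would contain 148 newlines (1249 - 1101), while mindex_073.pickle, unchanged at 670 bytes, appears not to have been converted.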

@neirbowj
Contributor Author

You should be able to replicate the test failure on any system just by running:

import pickle
obj = pickle.load(open('multiindex_v1.pickle', 'r'))

If your pickle has '\r's this load should always fail, no matter what size it is.
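
The same effect can be reproduced without the fixture file. This is a minimal sketch for Python 3 (the thread itself is about Python 2.7), and `try_load` is a hypothetical helper: it builds a small protocol 0 pickle in memory, simulates an LF-to-CRLF conversion, and tries to load both versions.

```python
import pickle
from collections import OrderedDict

# Protocol 0 is line-oriented: a GLOBAL opcode is "c<module>\n<name>\n".
lf_pickle = pickle.dumps(OrderedDict, protocol=0)

# Simulate what an LF -> CRLF conversion does to the file contents.
crlf_pickle = lf_pickle.replace(b"\n", b"\r\n")

def try_load(data):
    """Return (True, obj) on success or (False, exception) on failure."""
    try:
        return True, pickle.loads(data)
    except Exception as exc:  # broad on purpose: we only report the failure
        return False, exc

ok_lf, obj = try_load(lf_pickle)
ok_crlf, err = try_load(crlf_pickle)
print(ok_lf, ok_crlf, type(err).__name__)
```

The unpickler strips only the trailing "\n" from each line, so the module name it tries to import ends in a stray "\r", which would explain the ImportError in the original report.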

I'm not sure it really matters for anything other than this v0 pickle file, but wouldn't it be best to follow the common practice of shipping source code archives that match the likely convention of their intended platforms (.zip:CRLF, .tar.gz:LF)? If so, there should be a test.

@ghost

ghost commented Apr 22, 2013

The contents of binary files should not change to match the platform you're on.

@neirbowj
Contributor Author

Absolutely not, but a significant majority of what's in a source code archive is not binary.

% find ./ -path "./.git*" -prune -o -type f -print | sed 's/.*\(\.[^.]*\)/\1/' | sort | uniq -c
   1 ./LICENSE
   1 ./LICENSES/NUMPY_LICENSE
   1 ./LICENSES/OTHER
   1 ./LICENSES/PSF_LICENSE
   1 ./LICENSES/SCIPY_LICENSE
   1 ./Makefile
   1 ./doc/data/fx_prices                       # binary
   1 ./doc/source/_static/stub                  # empty
   1 ./examples/data/SOURCES                    # empty
   1 ./pandas/src/parser/Makefile
   1 ./scripts/git-mrb
   1 ./vb_suite/source/_static/stub             # empty
   5 .R
   7 .bat
   5 .c
   2 .conf
   1 .coveragerc
   2 .css_t
  10 .csv
   1 .data
   2 .gitignore
  17 .h
   6 .h5                                        # binary
   2 .html
   2 .in
   2 .ini
   1 .md
  10 .pickle                                    # binary... except multiindex_v1.pickle (sort of)
   2 .png                                       # binary
   7 .pxd
 263 .py
  16 .pyx
  27 .rst
  12 .sh
   1 .table
  19 .txt
   4 .xls                                       # binary
   1 .xlsx                                      # binary
   1 .yml
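
For readers who find the sed expression opaque, its grouping rule can be sketched in Python. `suffix_key` is a hypothetical name, and the sample paths are only a handful taken from this thread, not the full tree: a file is keyed by its final ".ext", or by its whole path when its name has no dot.

```python
from collections import Counter

def suffix_key(path):
    """Return the final ".ext" of a path, or the whole path if its name has no dot."""
    name = path.rsplit("/", 1)[-1]
    _, dot, ext = name.rpartition(".")
    return "." + ext if dot else path

# A few illustrative paths; the real listing covers the whole checkout.
paths = [
    "./LICENSE",
    "./Makefile",
    "./pandas/src/parser/Makefile",
    "./pandas/tests/data/iris.csv",
    "./pandas/tests/data/tips.csv",
    "./pandas/tests/data/multiindex_v1.pickle",
]
counts = Counter(suffix_key(p) for p in paths)

for key, n in sorted(counts.items()):
    print(f"{n:4d} {key}")
```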

So there are two issues:

  1. Automation is confused by multiindex_v1.pickle: it looks like a text file, so it is subject to newline conversion, but it should be treated as binary, because pickle protocol v0 cannot handle CRLF newlines, even (I think) on platforms where CRLF is the norm. pandas should either run the test using universal newlines, so that CRLFs in the file don't cause it to fail, or add a new test as described next.
  2. In general, newline conversion to suit the target platform is good practice, so for files that really are text, pandas should test that the newlines in its text files match the convention of the host.

The former is the best way to resolve this issue. If you agree with the latter, I will open a new issue targeted to 0.12.
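
One way the first option could look, as a sketch only: Python 3 no longer allows unpickling from a text-mode file, so the "universal newlines" idea is approximated here by normalizing CRLF to LF on the raw bytes before loading. `load_text_pickle` is a hypothetical helper, not a pandas function, and it assumes the file really is a protocol 0 (ASCII) pickle in which any CRLF pair is a mangled line terminator rather than payload data.

```python
import os
import pickle
import tempfile

def load_text_pickle(path):
    """Load a protocol 0 pickle, tolerating CRLF line endings.

    Assumption: every CRLF in the file is a converted line terminator,
    which holds only for protocol 0 pickles, never for binary protocols.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    return pickle.loads(data.replace(b"\r\n", b"\n"))

# Demo: write a CRLF-mangled protocol 0 pickle and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_v0.pickle")
    mangled = pickle.dumps(["a", "b"], protocol=0).replace(b"\n", b"\r\n")
    with open(path, "wb") as fh:
        fh.write(mangled)
    restored = load_text_pickle(path)
```

Applying the same replacement to a binary-protocol pickle could silently corrupt it, so a helper like this only makes sense for a fixture known to be protocol 0.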

@jreback
Contributor

jreback commented Apr 22, 2013

FYI, I am not sure when or how multiindex_v1.pickle was generated (it is possible it was not written in binary mode),
and I think it is pretty old anyhow; I'm not even sure which version it is supposed to be testing.

we have in place pickle compat tests going forward

@changhiskhan
Contributor

@y-p oh I see. I think you flipped the file sizes in your original comparison? What I got for the sizes on linux is what you posted under windows and vice versa.

@changhiskhan
Contributor

at the very least we should switch to creating the tarball/zip on linux instead.

@ghost

ghost commented Apr 22, 2013

I sure did. sorry.

doesn't

git config  core.autocrlf false

cure it?

@wesm
Member

wesm commented Apr 23, 2013

Sorry for the trouble I caused by building the source distros on Windows, had been all linux til now =) Chang is uploading new rc1 tarballs and I'm going to work on cutting the 0.11 final now

@wesm closed this as completed Apr 23, 2013
@changhiskhan
Contributor

@y-p sorry, it did fix it and I put the new tarball and zip files up there. File sizes look alright to me now.
@neirbowj if you haven't reached your pain tolerance yet, please give it one more shot. Should be fine but can reopen the issue if it's still a problem

@neirbowj
Contributor Author

@changhiskhan You've probably noticed that I have what some might call an unhealthy tolerance for pain. I appreciate you and @y-p hanging in and resolving these test failures.

@wesm No trouble. Just another opportunity for me to learn something about something, and a new corner case that might admit a new test or two. Congrats on the latest release.

All tests passing (skipped 115) on FreeBSD 9.1-STABLE (r248078), with:

SHA256 (pandas-0.11.0rc1.tar.gz) = d7adf3cbd7febe4d3ad35cd5cd13f464c0aa9add58b5cf3a19c2444f6dbe1014

Off to update the port for 0.11.0 release.

@changhiskhan
Contributor

hooray! thank @y-p for the fix.

edit: yp -> y-p

@ghost

ghost commented Apr 23, 2013

no damn it. thank me.

@neirbowj
Contributor Author

My work here is done (for now). Good night.
