BUG: Series dtype casting to platform numeric (GH #2751) #2838

jreback · 2013-02-11T02:04:12Z

use int64/float64 defaults in construction when dtype is not specified
this removes a platform dependency issue that caused dataframe and series to
have different dtypes on 32-bit

fixes issues raised in PR #2837

stephenwlin · 2013-02-11T02:40:15Z

just curious, does this still work without overflow if the list explicitly contains longs? (i.e. df=p.DataFrame({'a' : [2**31,2**31+1]}))
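
To make the overflow concern concrete: 2**31 is one past the largest signed 32-bit integer, so a column built from these values cannot be stored as int32. A minimal standard-library sketch follows; the helper name `fits_signed` is made up for illustration and is not part of pandas.

```python
def fits_signed(value, bits):
    """Return True if `value` is representable as a signed integer of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


values = [2**31, 2**31 + 1]  # the list from the question above

# Neither value fits in a signed 32-bit integer...
print([fits_signed(v, 32) for v in values])  # [False, False]
# ...but both fit in a signed 64-bit integer.
print([fits_signed(v, 64) for v in values])  # [True, True]
```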

jreback · 2013-02-11T02:56:14Z

that first commit blew up....was always using platform int, but that's not exactly right, trying asarray (which is what DataFrame does)...

jreback · 2013-02-11T03:03:05Z

@stephenwlin

can you take a look with either of my commits? these should work
I am in theory duplicating what DataFrame([1,2]) does for DataFrame({'a' : [1,2]})
(all tests pass on 64-bit), something is forcing int64 elsewhere and so a bunch of tests fail

stephenwlin · 2013-02-11T03:06:01Z

sure. i guess you're not developing on 32-bit so you don't have an easy way to test this? i'll look into it after i finish with something else...

jreback · 2013-02-11T03:11:12Z

nope....64-bit...linux....no easy way to even put a dev build (i did it on windows, but then all kind of things are hard :)

I guess it comes down to:

should DataFrame([1,2]) be int32 on 32-bit and int64 on 64-bit, or always int64
before the dtypes change this was very clear, always int64, now not so sure

@wesm, @changhiskhan any thoughts?
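
For context, "platform int" in this discussion is taken to mean the C `long`, which NumPy historically used as its default integer type and which is 32 bits wide on 32-bit Linux and 64 bits wide on 64-bit Linux. A small standard-library sketch for checking what a given interpreter reports; the function name is hypothetical.

```python
import ctypes
import struct
import sys


def platform_widths():
    """Return bit widths of a C long and a pointer for the running interpreter."""
    return {
        "c_long_bits": ctypes.sizeof(ctypes.c_long) * 8,
        "pointer_bits": struct.calcsize("P") * 8,
    }


widths = platform_widths()
print(widths)
# sys.maxsize is the largest Py_ssize_t, which on common platforms tracks
# the pointer width (2**31 - 1 on 32-bit builds, 2**63 - 1 on 64-bit builds).
print(sys.maxsize)
```

Running this inside a 32-bit environment is a quick way to confirm that the interpreter under test really is a 32-bit build before chasing dtype differences.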

jreback · 2013-02-11T14:44:53Z

latest commit only breaks 4 tests on 32-bit!
I changed DataFrame([1,2]) to do what Series does now (which upconverts lists to int64, I think)

jreback · 2013-02-11T14:48:16Z

I need to rebase this...later..

stephenwlin · 2013-02-11T19:02:29Z

Yeah, I agree that it's better to use the same code path throughout for consistency. Not entirely sure that always upcasting to int64 is the best, but it's the path of least resistance for now.

I found some other issues in maybe_convert_objects (#2845) which I've fixed in a branch but I haven't changed that part of the behavior, pending a decision from someone else on what should be done.

jreback · 2013-02-11T19:26:18Z

ok..thanks....let me know if you have any success with this branch.....I cannot easily debug this....but I think that unless explicitly typed, ints get int64 and floats get float64 regardless of architecture (which is how numpy defaults). This, I think, is also what you get when you convert to objects and then call maybe_convert_objects. Bottom line is that DataFrame([1,2]) should be int64
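
The rule proposed above (untyped ints become int64, untyped floats become float64, independent of architecture) can be sketched in plain Python. This illustrates the rule only and is not pandas' actual inference code; `default_dtype_name` is a hypothetical helper, and falling back to "object" for out-of-range integers is a simplifying assumption of the sketch.

```python
def default_dtype_name(values):
    """Pick a platform-independent default dtype name for a list of scalars."""

    def is_int(v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, int) and not isinstance(v, bool)

    if values and all(is_int(v) for v in values):
        # Assumption of this sketch: values outside int64 fall back to object.
        in_range = all(-(2**63) <= v <= 2**63 - 1 for v in values)
        return "int64" if in_range else "object"
    if values and all(is_int(v) or isinstance(v, float) for v in values):
        return "float64"
    return "object"


print(default_dtype_name([1, 2]))              # int64
print(default_dtype_name([2**31, 2**31 + 1]))  # int64
print(default_dtype_name([1.5, 2]))            # float64
```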

ghost · 2013-02-12T21:03:16Z

@jreback, you can setup a VM for 32 bit testing with vagrant, ubuntu has
precise32 boxes all ready, just put the source tree in a shared directory and go.

jreback · 2013-02-12T21:09:42Z

oh...this looks nice....any particular box you think?

jreback · 2013-02-12T21:20:04Z

@y-p ignore my last - didn't read it

jreback · 2013-02-12T22:03:58Z

http://www.dejonghenico.be/unix/setup-vagrant-and-small-quick-start

setup was pretty easy

jreback · 2013-02-12T22:19:49Z

@y-p any chance u have a chef recipe laying around?

ghost · 2013-02-12T22:24:48Z

sorry. I think once you create the box, you can edit the shared folders
through the vbox gui, if that's what you're after.

jreback · 2013-02-12T22:27:31Z

it's more like a Travis install script to setup the base environment
no worries
thanks

ghost · 2013-02-12T22:32:06Z

I wouldn't bother with chef, just use virtualenv for 3.x, like you would on your local box.
tox used to work, but bugged out at some point during the distribute/virtualenv mess.

jreback · 2013-02-12T22:35:24Z

I just did manual installation
all worked
(pip was a little funny - I had to install everything via apt-get)
but all ok and got tests to run

now to debug!

jreback · 2013-02-13T04:15:59Z

@stephenwlin ok I fixed this up
pls check this out and lmk if anything looks weird

API change on 32-bit is that
DataFrame([1,2]) is now int64 (and not int32)
what was this in 0.10.1?

{'a' : [1,2] } will still be int64

stephenwlin · 2013-02-13T05:12:19Z

Looks good but I'm curious why you're calling _possibly_convert_objects with three non-default arguments just to end up getting to values = lib.list_to_object_array(values) and values = lib.maybe_convert_objects(values) instead of just calling the latter two directly (as was originally done in _sanitize_array). No big deal, but it just looks kind of ugly with the calls spread out over four different lines.

Also, from looking at the test it seems that now DataFrame([1, 2]) and DataFrame({'a' : [1, 2]}) yield int64 but DataFrame({'a' : 1}) yields platform int...maybe that should change too?

stephenwlin · 2013-02-13T06:30:38Z

also "platform independent manor" typo in doc/source/v0.11.0.txt

jreback · 2013-02-13T12:06:30Z

did all of these 3 cases yield int64 in 0.10.1 on 32-bit?

jreback · 2013-02-13T13:19:26Z

what's wrong with 'platform independent manor'?

jreback · 2013-02-13T16:33:32Z

getting closer!

@stephenwlin FYI I moved _dtype_from_scalar to common
I think it makes sense to combine this with your _maybe_promote (and _try_cast)
maybe needs another argument - and have to fix API

stephenwlin · 2013-02-13T16:41:07Z

platform independent "manner", right?

stephenwlin · 2013-02-13T16:45:26Z

0.10.0 good enough?

    In [1]: import pandas as p

    In [2]: p.__version__
    Out[2]: '0.10.0'

    In [3]: p.DataFrame({'a': 1}, index=[0]).dtypes
    Out[3]: a    int64

    In [4]: p.DataFrame({'a': [1, 2]}).dtypes
    Out[4]: a    int64

    In [5]: p.DataFrame([1, 2]).dtypes
    Out[5]: 0    int64

jreback · 2013-02-13T17:31:26Z

yes

ok so the behavior I am suggesting will not actually change the API - that is good

jreback · 2013-02-13T17:33:13Z

duh!

thxs

jreback · 2013-02-13T22:27:50Z

finally all tests pass!

@stephenwlin pls take a look when you have a chance....

stephenwlin · 2013-02-14T05:18:59Z

It looks great! But I changed the signature of _maybe_promote to return a (possibly modified) fill_value (it makes sense when you see it) in my "stephenwlin/opt-take-2" branch, so it'll conflict during a merge.

If you want to proactively fix it, I've already resolved the differences in "stephenwlin/dtypes_bug" (currently pointing to f74571f7a613d1f971c99cac7a53ee077b1582f6), so if you "git reset" to that you'll have all your changes merged with the _maybe_promote signature change (but not the rest of "stephenwlin/opt-take-2"), rebased appropriately so we'll merge cleanly later. (Basically, I made a minimal commit off master with just the signature change and then rebased the two branches off that, so they share the same history)

If you want to look at the differences to make sure they're ok, just take a look at the commit 05a4991f014f7ed55b6d8270f06c3c554f05189b, which shows only the differences between the current state of your branch (at 3cb91f0) and "stephenwlin/dtypes_bug", without any rebasing.

If you'd rather not bother, that's ok too...I can always do the conflict resolution again later after one of our PRs gets merged.

jreback · 2013-02-14T12:05:54Z

actually with that change I think I can get rid of _maybe_upcast
if I change a bit more (let me see)

I think we are close to merging - what do u think of this order
to minimize effort:

maybe_convert_objects (you)
dtypes_bug (me)
opt-take-2 (you - does this supersede opt-take)?
dtypes1 (me)

we each can rebase after each step
it doesn't look like there are many conflicts
in any event

jreback · 2013-02-14T12:09:19Z

@wesm any comments on these 4 branches
they all pass (and work pretty hard to avoid API issues) and lots of new tests
?

jreback · 2013-02-14T13:08:05Z

@stephenwlin also we should reverse the return values of either _maybe_promote or _infer_dtype_from_scalar

value, dtype vs dtype,value

I don't have a preference
(could actually use a named tuple)??
or is that too weird ?

stephenwlin · 2013-02-14T16:05:30Z

i prefer "dtype, value" since both of their names refer primarily to the dtype (modifying the value appropriately is just an extra benefit)
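
A sketch of the agreed return order, using the named tuple floated earlier in the thread. The function below is illustrative only: it returns dtype names as strings and is not the `_infer_dtype_from_scalar` that lives in pandas; `DtypeAndValue` is a made-up name.

```python
from collections import namedtuple

# (dtype, value): dtype first, since that is what both helpers are named after.
DtypeAndValue = namedtuple("DtypeAndValue", ["dtype", "value"])


def infer_dtype_from_scalar_sketch(val):
    """Return (dtype name, value) for a Python scalar."""
    # Check bool before int, because bool is a subclass of int.
    if isinstance(val, bool):
        return DtypeAndValue("bool", val)
    if isinstance(val, int):
        return DtypeAndValue("int64", val)
    if isinstance(val, float):
        return DtypeAndValue("float64", val)
    if isinstance(val, complex):
        return DtypeAndValue("complex128", val)
    return DtypeAndValue("object", val)


# Callers can unpack positionally or use the field names.
dtype, value = infer_dtype_from_scalar_sketch(1)
print(dtype, value)  # int64 1
result = infer_dtype_from_scalar_sketch(2.5)
print(result.dtype)  # float64
```

A named tuple keeps plain tuple unpacking working, so existing call sites written as `dtype, value = ...` would not need to change.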

jreback · 2013-02-14T16:31:16Z

great

I will change

btw - how can I pull in your change commit for _maybe_promote to mine
?

stephenwlin · 2013-02-14T18:38:42Z

hmm, at this point, probably just do

    git remote add stephenwlin https://github.com/stephenwlin/pandas.git
    git fetch stephenwlin
    git cherry-pick 05a4991f014f7ed55b6d8270f06c3c554f05189b
    git commit --amend -m "CLN: change _maybe_promote signature to return modified fill_value"

stephenwlin · 2013-02-14T19:03:10Z

I don't think it's that necessary to get rid of _maybe_upcast either...it doesn't do much but

    datacopy, fill_value = com._maybe_upcast(data, copy=True)

is pretty succinct and might be useful in more places than what I've used it for so far. (it guarantees either a straight copy or an upcasted copy, but won't copy twice in the latter case, as was being done previously where I added it)
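
The "copy once" guarantee can be sketched with the standard-library `array` module standing in for an ndarray. Everything here is hypothetical (the name, the NaN default fill value, the int-to-float promotion rule) and the signature differs from `com._maybe_upcast`; it only shows the shape of a helper that returns either a plain copy or an upcasted copy, never both.

```python
from array import array


def maybe_upcast_sketch(data, fill_value=float("nan")):
    """Return (new_array, fill_value), building the result in a single pass.

    Integer arrays (typecode 'q') cannot hold NaN, so they are rebuilt
    directly as float arrays (typecode 'd'); other arrays are simply copied.
    """
    if data.typecode == "q":
        # One pass: convert while building, with no intermediate int copy.
        return array("d", (float(x) for x in data)), fill_value
    # Assumed able to hold the fill value already: a straight copy suffices.
    return array(data.typecode, data), fill_value


ints = array("q", [1, 2, 3])
upcast, fill = maybe_upcast_sketch(ints)
print(upcast.typecode, list(upcast))  # d [1.0, 2.0, 3.0]
print(upcast is not ints)             # True
```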

jreback · 2013-02-14T19:40:26Z

all fixed up...thanks...that was a neat trick!

and I changed _infer_dtype_from_scalar return signature to match
_maybe_upcast left alone....may be useful in future...
now I think all dtype determination is done in these 2 routines (instead of in myriad places before)

once travis finishes....can prob start merging

any issue with merge order?

stephenwlin · 2013-02-14T21:11:05Z

The merge order is ok in theory but I tested it and the conflicts are kind of nasty right now because git isn't smart enough to figure out that we made more-or-less the same changes twice.

Can you do:

    git fetch stephenwlin
    git checkout dtypes_bug
    # just in case
    git branch dtypes_bug_backup dtypes_bug
    git reset stephenwlin/dtypes_bug
    # compare against dtypes_bug_backup (should be no change)
    git diff dtypes_bug_backup
    git branch -d dtypes_bug_backup

"stephenwlin/dtypes_bug" should be identical to your "dtypes_bug" except rebased against b3202ebc282eace11de3089498a5a5ea3689f9e4

EDIT: sorry, actually that doesn't help that much (most of the conflicts are still there)...never mind for now...

jreback · 2013-02-14T21:33:19Z

the branches are identical...

what is conflicting?

if i merge dtypes_bug to master...then you can rebase....if i have the changes (and you do too)
shouldn't git either ignore them, or just make you do a commit to accept them?

stephenwlin · 2013-02-14T21:38:49Z

lots of stuff in common.py conflicts if you merge "stephenwlin/opt-take-2" on top of "jreback/dtypes_bug" but I can't really figure out why. it's not a big deal because I know how they resolve but I won't be doing the actual merge into master so I figure I might as well try to proactively arrange things so that the conflicts go away, helping out whoever does do it. I just can't seem to figure out how, though: not enough git-fu yet.

it'll be fine if I rebase "stephenwlin/opt-take-2" after "jreback/dtypes_bug" is in master and resolve things then before it's merged. if you're going to do it yourself, just let me know and I'll rebase after when I get a chance. I just figured I could fix it proactively, but I guess not.

stephenwlin · 2013-02-14T21:48:22Z

(i went through the commits one by one and the two branches are basically independent after rebasing against b3202ebc282eace11de3089498a5a5ea3689f9e4 ... it's really odd that they conflict so much if you try to merge them...oh well, I'm not going to try to spend more time fixing this)

jreback · 2013-02-14T21:59:22Z

ok
If I merge your convert_objects
then dtypes_bug

you can then rebase
I don't think there will be any actual conflicts
just dups that git can't resolve

ok?

stephenwlin · 2013-02-14T22:15:28Z

yeah, there are no real conflicts, it just looks like there are.

jreback · 2013-02-14T22:18:44Z

ok merged your convert branch
dtypes_bug in a few

jreback · 2013-02-14T22:34:01Z

just waiting for travis to finish...then will merge...i had no problems rebasing (to be fair rebase only on top of convert_objects)!

jreback · 2013-02-14T23:25:46Z

@stephenwlin you are up....rebase and let me know....

stephenwlin · 2013-02-14T23:50:10Z

all fixed up

jreback added 2 commits February 12, 2013 22:16

BUG: fixup GH pandas-dev#2751; make sure that we cast to platform num…

3c345a1

…eric when a list is specified; use the Series codepath for initial list conversion (change from using DataFrame) TST: added test for overflow in df creation

DOC: RELEASE and whatsnew updated for DataFrame from lists change

37bb22a

jreback added 3 commits February 13, 2013 16:34

CLN: cleaned up _possibly_convert_platform

6cdea33

CLN: moved some functionality from series._sanitize to com._dtype_fro…

43a0102

…m_scalar

DOC: whatsnew updates

ac3cdab

jreback added 2 commits February 13, 2013 18:36

CLN: in common.py merged _dtype_from_scalar and _infer_dtype

0e7c20e

yield _infer_dtype_from_scalar

CLN: in common.py - revised _maybe_upcast to use _maybe_promote

3cb91f0

in reshape.py - removed block2d_to_block3d in favor of block2d_to_blocknd

TST: force rebuild

2ce3b56

CLN: change call signature of _maybe_promote (from stephenwlin branch)

cb56c98

and _infer_dtype_from_scalar to match (both return dtype, fill_value) Diff between 'jreback/dtypes_bug' and 'stephenwlin/dtypes_bug' Conflicts: pandas/core/common.py

jreback merged commit cb56c98 into pandas-dev:master Feb 14, 2013

jreback mentioned this pull request Feb 15, 2013

ENH: should shift return same dtype objects as input? #2761

Closed
