Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Series dtype casting to platform numeric (GH #2751) #2838

Merged
merged 9 commits into from
Feb 14, 2013

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Feb 11, 2013

use int64/float64 defaults in construction when dtype is not specified
this removes a platform dependency issue that caused dataframe and series to
have different dtypes on 32- bit

fixes issues raised in PR #2837

@stephenwlin
Copy link
Contributor

just curious, does this still work without overflow if the list explicitly contains longs? (i.e. df=p.DataFrame({'a' : [2**31,2**31+1]}))

@jreback
Copy link
Contributor Author

jreback commented Feb 11, 2013

that first commit blew up....was always using platform int, but that's not exactly right, trying asarray (which is what DataFrame does)...

@jreback
Copy link
Contributor Author

jreback commented Feb 11, 2013

@stephenwlin

can you take a look with either of my commits? these should work
I am in theory duplicating what DataFrame([1,2]) does for DataFrame({'a' : [1,2]})
(all tests pass on 64-bit), something is forcing int64 elsewhere and so a bunch of tests fails

@stephenwlin
Copy link
Contributor

sure. i guess you're not developing on 32-bit so you don't have an easy way to test this? i'll look into it after i finish with something else...

@jreback
Copy link
Contributor Author

jreback commented Feb 11, 2013

nope....64-bit...linux....no easy way to even put a dev build (i did it on windows, but then all kind of things are hard :)

I guess it comes down to:

should DataFrame([1,2]) be int32 on 32-bit and int64 on 64-bit, or always int64
before the dtypes change this was very clear, always int64, now not so sure

@wesm, @changhiskhan any thoughts?

@jreback
Copy link
Contributor Author

jreback commented Feb 11, 2013

latest commit only breaks 4 tests on 32-bit!
I changed DataFrame([1,2]) to do what series does now (which is upconverts lists to int64 I think)

@jreback
Copy link
Contributor Author

jreback commented Feb 11, 2013

I need to rebase this...later..

@stephenwlin
Copy link
Contributor

Yeah, I agree that it's better to use the same code path throughout for consistentcy. Not entirely sure that upcasting to int64 always is the best, but it's the path of least resistance for now.

I found some other issues in maybe_convert_objects (#2845) which I've fixed in a branch but I haven't changed that part of the behavior, pending a decision from someone else on what should be done.

@jreback
Copy link
Contributor Author

jreback commented Feb 11, 2013

ok..thanks....let me know if you have any success with this branch.....I cannot easily debug this....but think that unless explicty types, ints get int64 and floats get float64 regardless of architecture (which is how numpy defaults). This, I is also what you get when you convert to objects and then maybe_convert_objects. Bottom line is that DataFrame([1,2]) should be int64

@ghost
Copy link

ghost commented Feb 12, 2013

@jreback, you can setup a VM for 32 bit testing with vagrant, ubuntu has
precise32 boxes all ready, just put the source tree in a shared directory and go.

@jreback
Copy link
Contributor Author

jreback commented Feb 12, 2013

oh...this looks nice....any particular box you think?

@jreback
Copy link
Contributor Author

jreback commented Feb 12, 2013

@y-p ignore my last - didn't read it

@jreback
Copy link
Contributor Author

jreback commented Feb 12, 2013

@jreback
Copy link
Contributor Author

jreback commented Feb 12, 2013

@y-p any chance u have a chef recipe laying around?

@ghost
Copy link

ghost commented Feb 12, 2013

sorry. I think once you create the box, you can edit the shared folders
through the vbox gui, if that's what you're after.

@jreback
Copy link
Contributor Author

jreback commented Feb 12, 2013

it's more like a Travis install script to setup the base environment
no worries
thanks

@ghost
Copy link

ghost commented Feb 12, 2013

I wouldn't bother with chef, just use virtualenv for 3.x, like you would on your local box.
tox used to work, but bugged out at some point during the distribute/virtualenv mess.

@jreback
Copy link
Contributor Author

jreback commented Feb 12, 2013

I just did manual installation
all worked
(pip was a little funny - I had to install everything via apt-get)
but all ok and got tests to run

now to debug!

…eric

     when a list is specified; use the Series codepath
     for initial list conversion (change from using DataFrame)
TST: added test for overflow in df creation
@jreback
Copy link
Contributor Author

jreback commented Feb 13, 2013

@stephenwlin ok I fixed this up
pls check this out and lmk if anything looks weird

API change on 32- bit is that
DataFrame([1,2]) is now int64 (and not int 32)
what was this in 0.10.1?

{'a' : [1,2] } will still be int64

@stephenwlin
Copy link
Contributor

Looks good but I'm curious why you're calling _possibly_convert_objects with three non-default arguments just to end up getting to values = lib.list_to_object_array(values) and values = lib.maybe_convert_objects(values) instead of just calling the latter two directly (as was originally done in _sanitize_array). No big deal, but it just looks kind of ugly with the calls spread out over four different lines.

Also, from looking at the test it seems that now DataFrame([1, 2]) and DataFrame({'a' : [1, 2]}) yield int64 but DataFrame({'a' : 1}) yields platform int...maybe that should change too?

@stephenwlin
Copy link
Contributor

also "platform independent manor" typo in doc/source/v0.11.0.txt

@jreback
Copy link
Contributor Author

jreback commented Feb 13, 2013

did all of these 3 cases yield int64 in 0.10.1 on 32-bit?

@jreback
Copy link
Contributor Author

jreback commented Feb 13, 2013

what's wrong with 'platform independent manor'?

@jreback
Copy link
Contributor Author

jreback commented Feb 13, 2013

getting closer!

@stephenwlin FYI I moved _dtype_from_scalar to common
I think it makes sense to combine this with your _maybe_promote (and _try_cast)
maybe needs another argument - and have to fix API

@stephenwlin
Copy link
Contributor

platform independent "manner", right?

@stephenwlin
Copy link
Contributor

0.10.0 good enough?

In [1]: import pandas as p

In [2]: p.__version__
Out[2]: '0.10.0'

In [3]: p.DataFrame({'a': 1}, index=[0]).dtypes
Out[3]: a    int64

In [4]: p.DataFrame({'a': [1, 2]}).dtypes
Out[4]: a    int64

In [5]: p.DataFrame([1, 2]).dtypes
Out[5]: 0    int64

@jreback
Copy link
Contributor Author

jreback commented Feb 13, 2013

yes

ok so the behavior I am suggesting will not actually change the API - that is good

@jreback
Copy link
Contributor Author

jreback commented Feb 13, 2013

duh!

thxs

On Feb 13, 2013, at 11:41 AM, stephenwlin notifications@github.com wrote:

platform independent "manner", right?


Reply to this email directly or view it on GitHub.

@jreback
Copy link
Contributor Author

jreback commented Feb 13, 2013

finally all tests pass!

@stephenwlin pls take a look when you have a chance....

     in rehashpe.py - removed block2d_to_block3d in favor of block2d_to_blocknd
@stephenwlin
Copy link
Contributor

It looks great! But I changed the signature of _maybe_promote to return a (possibly modified) fill_value (it makes sense when you see it) in my "stephenwlin/opt-take-2" branch, so it'll conflict during a merge.

If you want to proactively fix it, I've already resolved the differences in "stephenwlin/dtypes_bug" (currently pointing to f74571f7a613d1f971c99cac7a53ee077b1582f6), so if you "git reset" to that you'll have all your changes merged with the _maybe_promote signature change (but not the rest of "stephenwlin/opt-take-2"), rebased appropriately so we'll merge cleanly later. (Basically, I made a minimal commit off master with just the signature change and then rebased the two branches off that, so they share the same history)

If you want to look at the differences to make sure they're ok, just take a look at the commit 05a4991f014f7ed55b6d8270f06c3c554f05189b, which shows only the differences between the current state of your branch (at 3cb91f0) and "stephenwlin/dtypes_bug", without any rebasing.

If you'd rather not bother, that's ok too...I can always do the conflict resolution again later after one of our PRs gets merged.

@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

actually with that change I think I can get rid of _maybe_upcast
if I change a bit more (let me see)

I think we are close to merging - what do u think of this order
to minimize effort:

maybe_convert_objects (you)
dypes_bug (me)
opt-take2 (you - does this supersede opt-take)?
dtypes1 (me)

we each can rebase after each step
it doesn't look like much conflicts
in any event

@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

@wesm any comments on these 4 branches
they all pass (and work pretty hard to avoid API issues) and lots of new tests
?

@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

@stephenwlin also should either reverse return values from _maybe_promote or _infer_dtype_fr_scalar

value, dtype vs dtype,value

I don't have a preference
(could actually use a named tuple)??
or is that too weird ?

@stephenwlin
Copy link
Contributor

i prefer "dtype, value" since both of their names refer primarily to the dtype (modifying the value appropriately is just an extra benefit)

@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

great

I will change

btw - how can I pull in your change commit for _maybe_promote to mine
?

@stephenwlin
Copy link
Contributor

hmm, at this point, probably just do

git remote add stephenwlin https://github.com/stephenwlin/pandas.git
git fetch stephenwlin
git cherry-pick 05a4991f014f7ed55b6d8270f06c3c554f05189b
git commit --amend -m "CLN: change _maybe_promote signature to return modified fill_value"

@stephenwlin
Copy link
Contributor

I don't think it's that necessary to get rid of _maybe_upcast either...it doesn't do much but

    datacopy, fill_value = com._maybe_upcast(data, copy=True)

is pretty succinct and might be useful in more places than I what I used it for so far. (it guarantees either a straight copy or an upcasted copy, but won't copy twice in the latter case as was being done previously where I added it)

     and _infer_dtype_from_scalar to match (both return dtype, fill_value)

Diff between 'jreback/dtypes_bug' and 'stephenwlin/dtypes_bug'

Conflicts:

	pandas/core/common.py
@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

all fixed up...thanks...that was a neat trick!

andn I changed _infer_dtype_from_scalar return signature to match
_maybe_upcast left alone....may be useful in future...
now I think all dtype determination is done in these 2 routines (instead of in myriad places before)

once travis finishes....can prob start merging

and issue with merge order?

@stephenwlin
Copy link
Contributor

The merge order is ok in theory but I tested it and the conflicts are kind of nasty right now because git isn't smart enough to figure out that we made more-or-less the same changes twice.

Can you do:

git fetch stephenwlin
git checkout dtypes_bug
# just in case
git branch dtypes_bug_backup dtypes_bug
git reset stephenwlin/dtypes_bug
# compare against dtypes_bug_backup (should be no change)
git diff dtypes_bug_backup
git branch -d dtypes_bug_backup

"stephenwlin/dtypes_bug" should be identical to your "dtypes_bug" except rebased against b3202ebc282eace11de3089498a5a5ea3689f9e4

EDIT: sorry, actually that doesn't help that much (most of the conflicts are still there)...never mind for now...

@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

the branches are identical...

what is conflicting?

if i merged dtypes_bug to master...then you can rebase....if i have the changes (and you do u)
should git either ignore, or just make you do a commit to accept them?

@stephenwlin
Copy link
Contributor

lots of stuff in common.py conflicts if you merge "stephenwlin/opt-take-2" on top of "jreback/dtypes_bug" but I can't really figure out why. it's not a big deal because I know how they resolve but I won't be doing the actual merge into master so I figure I might as well try to proactively arrange things so that the conflicts go away, helping out whoever does do it. I just can't seem to figure out how, though: not enough git-fu yet.

it'll be fine if I rebase "stephenwlin/opt-take-2" after "jreback/dtypes_bug" is in master and resolve things then before it's merged. if you're going to do it yourself, just let me know and I'll rebase after when I get a chance. I just figured I could fix it proactively, but I guess not.

@stephenwlin
Copy link
Contributor

(i went through the commits one by and and the two branches are basically independent after rebasing against b3202ebc282eace11de3089498a5a5ea3689f9e4 ... it's really odd that they conflict so much if you try to merge them...oh well, I'm not going to try to spend more time fixing this)

@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

ok
If I merge your convert_objects
then dtypes_bug

you can then rebase
I don't think there will be any actual conflicts
just dups that git can't resolve

ok?

@stephenwlin
Copy link
Contributor

yeah, there's no real conflicts it just looks like there are.

@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

ok merged your convert branch
dtypes_bug in a few

@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

just waiting for travis to finish...then will merge...i had no problems rebasing (to be fair rebase only on top of convert_objects)!

@jreback jreback merged commit cb56c98 into pandas-dev:master Feb 14, 2013
@jreback
Copy link
Contributor Author

jreback commented Feb 14, 2013

@stephenwlin you are up....rebase and let me know....

@stephenwlin
Copy link
Contributor

all fixed up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants