
pandas converts int32 to int64 #622

Closed
gdementen opened this issue Jan 13, 2012 · 14 comments

@gdementen
Contributor

Is this intended? I had hoped no copying at all would happen in that case.

In [65]: a = np.random.randint(10, size=1e6)

In [66]: a.dtype
Out[66]: dtype('int32')

In [67]: b = np.random.randint(2, size=1e6)

In [68]: df = pandas.DataFrame({'a': a, 'b': b})

In [69]: df.dtypes
Out[69]:
a int64
b int64

@adamklein
Contributor

Yes, pandas has only four dtypes right now: int64, float64, bool, and object. This is in the interest of making it user-friendly, but obviously at the expense of memory conservation. In the future it might make sense to add more, as long as it doesn't complicate the user-facing API.

@jseabold
Contributor

jseabold commented Oct 8, 2012

Just got bit by this, upcasting from float32, int8 and int16.

@cpcloud
Member

cpcloud commented Dec 11, 2012

I actually like the fact that the dtypes are simpler when using pandas. Also, if you don't use a dict, then the dtype is preserved.

In practice, is this a big deal? Maybe I'm a bit green, but I've never run into a situation using pandas where it really mattered whether I used int32 vs int64.

It matters for things like reading raw bytes from binary files, but if you're creating arrays large enough that the distinction between 32- and 64-bit numbers matters, you'd be better off just getting more RAM.

For example, even if you had 4GB of RAM on your machine and you had a 2GB array of 32-bit integers, you're still going to need another 2GB if you want to do any non-destructive arithmetic on that array, thus maxing out your system's RAM.

Point is, doesn't seem like this is a bug. Just my two cents.
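
A minimal sketch of the dict-vs-ndarray distinction described above, assuming the behaviour this thread reports for the pandas of that era (current pandas preserves int32 on both paths):

import numpy as np
import pandas as pd

a = np.arange(10, dtype='int32')

# constructing directly from an ndarray keeps the array's dtype
print(pd.DataFrame(a).dtypes)

# constructing from a dict went through a consolidation step that,
# at the time, upcast integer columns to int64
print(pd.DataFrame({'a': a}).dtypes)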

@wesm
Member

wesm commented Dec 11, 2012

I agree that the simplicity is good: you don't have to write down the dtype of a DataFrame like you do with a structured array. I think the design should be: have simple defaults, but when a data type is already set (e.g. int32), it's OK to "live and let live".

@adamsd5

adamsd5 commented Jan 17, 2013

I am new to Pandas, but would like to put in my vote for supporting all ndarray types. From my testing, Series will already support other types, but DataFrame will not. I have two arguments... memory and speed. cpcloud suggested that you can always buy more memory, which is reasonable. However, systems do have memory limits, and there are computation tasks that will use all of it (yes, even 256GB or more). Being able to fit twice as many samples on the system, regardless of how much memory you have, is a good thing.

On the speed front, I want to load binary files quickly into memory and process them with pandas. I wrote a C++ module for this purpose. I don't want to copy the memory after reading it from disk; for the processing we are doing, that would double the number of memory operations and slow things down by almost half. Unfortunately, after reading the binary into memory, I need to iterate over it and copy the int32 array into an int64 array. It is even worse than just a large memory copy, because it also must up-cast each value to int64.

I like wesm's suggestion.
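
To put rough numbers on the memory argument above, a small sketch (the array length is arbitrary):

import numpy as np

n = 10000000
print(np.zeros(n, dtype='int32').nbytes // 2**20)  # ~38 MiB
print(np.zeros(n, dtype='int64').nbytes // 2**20)  # ~76 MiB, twice the footprint for the same samples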

@cpcloud
Member

cpcloud commented Jan 17, 2013

@adamsd5 You might try numpy's memmap ndarray subclass. It allows you to treat a file like an in-memory array. Of course, if your file is not just an array then this might be tricky. You could then pass the memmap to the pandas dataframe constructor and the dtype should be preserved. I agree with you that in the long run dtype preservation is desirable. Just out of curiosity, what kind of data are you working with?
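
A minimal sketch of the memmap idea, assuming a flat binary file of int32 values ('data.bin' is a hypothetical filename; a real file's layout would need to be known):

import numpy as np
import pandas as pd

# 'data.bin' is a hypothetical flat file of native-endian int32 values
mm = np.memmap('data.bin', dtype='int32', mode='r')

# wrap the memmap in a DataFrame; whether int32 survives the constructor
# is exactly the question in this thread
df = pd.DataFrame(mm)
print(df.dtypes)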

@jreback
Contributor

jreback commented Jan 17, 2013

@adamsd5 sounds like what you really want is out-of-core computation (similar to what @cpcloud suggested).
That is, your data is represented on disk, then slices are pulled into memory as needed and computed. I know @wesm has this as a goal as well. This will let you not even worry about the memory issue at all.

HDFStore supports this now, though in a somewhat non-transparent manner.

Here's what you could do:

  1. store your data on-disk using HDFStore in a table format (could be a series of append operations from, say, csv files, or wherever you have it now)
  2. iterate over either a) a series of queries, or b) the indices of the 'mapped frame'
  3. compute and repeat

So imagine this pseudo code (this is the 2(b) part):

store = HDFStore('a_big_file.h5')

# pretend we have a store of the frame as a table 'df'

nrows = store.get_storer('df').nrows
chunk_size = 100000

for i in xrange(int(nrows / chunk_size) + 1):
    start_i = i * chunk_size
    stop_i = min((i + 1) * chunk_size, nrows)

    data_for_this_chunk = store.select('df', start = start_i, stop = stop_i)
    store.append('df_result', process_chunk(data_for_this_chunk))

This would essentially give you a transformation operation, similar to process_chunk(df), but processed in chunks. Memory use stays roughly constant no matter how large the data is, and it could easily be parallelized.
Reduction operations are even simpler (as they can be accumulated in memory).

Not that hard to create a wrapper around this...
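
For the reduction case, a sketch along the same lines, reusing store, nrows and chunk_size from the pseudo code above ('a' is a hypothetical numeric column in the stored table):

total = 0.0
count = 0
for i in xrange(int(nrows / chunk_size) + 1):
    start_i = i * chunk_size
    stop_i = min((i + 1) * chunk_size, nrows)

    chunk = store.select('df', start=start_i, stop=stop_i)
    total += chunk['a'].sum()
    count += len(chunk)

mean_a = total / count  # the reduction is accumulated entirely in memory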

@adamsd5

adamsd5 commented Jan 18, 2013

cpcloud, does pandas.DataFrame treat such memmap ndarrays differently? You've presented a technique that I might use, but I think the DataFrame will still convert all int32 into int64.

I'm not actually trying to process things out of memory. I'm happy loading the entire DataFrame into memory. However, I would like to minimize the memory operations. Once the bytes are loaded from disk (and alas, I have no control over the format they are written in), I do not want to copy them around at all (and I don't want pandas to make a copy for me either). From what I can tell, pandas will always up-convert int32 to int64, which is a slow operation.

@cpcloud
Member

cpcloud commented Jan 18, 2013

@adamsd5 A cursory glance at frame.py suggests that the dtype is preserved with instances of ndarray (isinstance tests for subclasses) that are not record arrays. You can also pass the dtype in the constructor.
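
A small sketch of the two construction paths mentioned, plain ndarray input versus an explicit dtype= argument (whether the explicit dtype was honoured for integers in that version is the open question):

import numpy as np
import pandas as pd

a = np.arange(5, dtype='int32')

df1 = pd.DataFrame(a)                        # dtype taken from the ndarray
df2 = pd.DataFrame({'a': a}, dtype='int32')  # dtype requested explicitly
print(df1.dtypes)
print(df2.dtypes)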

@cpcloud
Member

cpcloud commented Jan 20, 2013

@adamsd5 I was wrong. It seems that floating point types are preserved in the DataFrame constructor, but integer types are not. E.g.,

[screenshot: df.dtypes output showing a floating point column preserved but an int32 column reported as int64]

The issue still stands. I poked around in pandas/core/internals.py and saw that the function make_block converts any integer subtypes to int64, but preserves other types. Is there any reason to suspect that getting rid of the call to values.astype('i8') would break anything? Either way, I'll try it and report back.
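
A short reproduction of the asymmetry described above (column names are made up; the reported behaviour is that the float column survives while the int32 column comes back as int64):

import numpy as np
import pandas as pd

f = np.ones(3, dtype='float32')
i = np.ones(3, dtype='int32')

df = pd.DataFrame({'f': f, 'i': i})
print(df.dtypes)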

@jreback
Contributor

jreback commented Jan 20, 2013

@cpcloud see PR #2705. This is a bit more complicated than it first appears; the change will appear in 0.10.2. The existing implementation will upcast for most operations, e.g. even though you can create a float32 (or int32) frame, most operations will not preserve it.

@jreback
Contributor

jreback commented Jan 22, 2013

What dtypes should pandas fully support? This means all types of pad, fill, take, and diff operations; there are specific cython functions created for each of the dtypes. The following are currently supported:
float64, int64, int32, datetime64[ns], bool, object

float32 clearly should be added.
What about float16, int16, int8, uint64, uint32, uint16, uint8?

You can always store these other dtypes, but certain operations will raise (or can auto-upcast them),
e.g. say we don't support int16, we can upcast to int32 and perform the ops.

The downside of adding more fully supported dtypes is additional compile time on installation and testing,
and after a certain point we should probably move to code generation (rather than copy/paste of the functions).
Comments?
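
As a loose illustration of the upcast-to-nearest-supported-dtype idea (not pandas internals, just numpy casting rules applied to a hypothetical list of supported dtypes):

import numpy as np

SUPPORTED = [np.dtype('int32'), np.dtype('int64'),
             np.dtype('float32'), np.dtype('float64')]

def upcast_to_supported(arr):
    # find the smallest supported dtype the array can be safely cast to
    for dt in SUPPORTED:
        if np.can_cast(arr.dtype, dt):
            return arr.astype(dt)
    raise TypeError("no supported dtype for %s" % arr.dtype)

print(upcast_to_supported(np.ones(3, dtype='int16')).dtype)    # int32
print(upcast_to_supported(np.ones(3, dtype='float16')).dtype)  # float32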

@adamsd5

adamsd5 commented Jan 22, 2013

For my purposes, int32 and float32 would suffice, but I see value in the smaller types for some people. If operations mean an upcast during the operation, the value is diminished. A use case would be a huge time-series DataFrame on disk with many int8 columns (perhaps factors), where the user wants to load it, filter on timestamp, and save a sub-range of time. None of the int8 columns should be up-converted. Just my ideas, hope it is helpful.

Darryl


jreback added a commit to jreback/pandas that referenced this issue Feb 8, 2013
ENH: allow propagation and coexistence of numeric dtypes (closes GH pandas-dev#622)

     construction of multi numeric dtypes with other types in a dict
     validated get_numeric_data returns correct dtypes
     added blocks attribute (and as_blocks()) method that returns a dict of dtype -> homogeneous Frame to DataFrame
     added keyword 'raise_on_error' to astype, which can be set to false to exclude non-numeric columns
     fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in other merger)
     changed implementation of get_dtype_counts() to use .blocks
     revised DataFrame.convert_objects to use blocks to be more efficient
     added Dtype printing to show on default with a Series
     added convert_dates='coerce' option to convert_objects, to force conversions to datetime64[ns]
     where can upcast integer to float as needed (on inplace ops pandas-dev#2793)
     added fully cythonized support for int8/int16
     no support for float16 (it can exist, but no cython methods for it)

TST: fixed test in test_from_records_sequencelike (dict orders can be different on different arch!)
       NOTE: using tuples will remove dtype info from the input stream (using a record array is ok though!)
     test updates for merging (multi-dtypes)
     added tests for replace (but skipped for now, algos not set for float32/16)
     tests for astype and convert in internals
     fixes for test_excel on 32-bit
     fixed test_resample_median_bug_1688, I believe
     separated out test_from_records_dictlike
     testing of panel constructors (GH pandas-dev#797)
     where ops now have a full test suite
     allow slightly less sensitive decimal tests for less precise dtypes

BUG: fixed GH pandas-dev#2778, fillna on empty frame causes seg fault
     fixed bug in groupby where types were not being casted to original dtype
     respect the dtype of non-natural numeric (Decimal)
     don't upcast ints/bools to floats (if, say, you were aggregating on len, you can get an int)
DOC: added astype conversion examples to whatsnew and docs (dsintro)
     updated RELEASE notes
     whatsnew for 0.10.2
     added upcasting gotchas docs

CLN: updated convert_objects to be more consistent across frame/series
     moved most groupby functions out of algos.pyx to generated.pyx
     fully support cython functions for pad/bfill/take/diff/groupby for float32
     moved more block-like conversion loops from frame.py to internals.py (created apply method)
       (e.g. diff,fillna,where,shift,replace,interpolate,combining), to top-level methods in BlockManager
wesm added a commit that referenced this issue Feb 10, 2013
* jreback/dtypes:
  ENH: allow propagation and coexistence of numeric dtypes (closes GH #622)
@ghost ghost assigned jreback Feb 10, 2013
@wesm
Member

wesm commented Feb 10, 2013

Boom. resolved by #2708, merged to master today
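
With the dtype work merged, a quick check along the lines of the original report would be expected to keep int32 (a sketch; the exact behaviour depends on running 0.10.2 or later):

import numpy as np
import pandas as pd

a = np.random.randint(10, size=1000).astype('int32')
b = np.random.randint(2, size=1000).astype('int32')

df = pd.DataFrame({'a': a, 'b': b})
print(df.dtypes)  # expected: both columns remain int32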

@wesm wesm closed this as completed Feb 10, 2013
dan-nadler pushed a commit to dan-nadler/pandas that referenced this issue Sep 23, 2019
…-dev#622)

* initial implementation of default handling for pickled frames

* MDP-3767 throw exceptions instead of falling back to default pickle behaviour

* updated the strict handler check mechanism to be at the library level, and then use os.environ (if set), else disabled by default

* sanitized the tests for the strict handler checks

* clarified the decision of having the handler_supports_read_option option in the do_read of the version store instead of inside individual handlers