pandas converts int32 to int64 #622
Comments
Yes, pandas has only four dtypes right now: int64, float64, bool, and object. This is in the interest of making it user-friendly, but at the expense of memory conservation, obviously. In the future it might make sense to add more, as long as it doesn't complicate the user-facing API.
Just got bit by this, upcasting from float32, int8 and int16.
I actually like the fact that the dtypes are simpler when using pandas. In practice, is this a big deal? Maybe I'm a bit green, but I've never run into a situation using pandas where it really mattered whether I used int32 or int64. It matters for things like reading raw bytes from binary files, but if you're creating arrays large enough that the distinction between 32- and 64-bit width numbers matters, you'd be better off just getting more RAM. For example, even if you had 4 GB of RAM on your machine and you had a 2 GB array of 32-bit integers, you're still going to need another 2 GB if you want to do any non-destructive arithmetic on that array, thus maxing out your system's RAM. Point is, this doesn't seem like a bug. Just my two cents.
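For concreteness, here is a scaled-down numpy sketch of that RAM point (sizes are shrunk so it runs anywhere; nothing here is pandas-specific):

```python
import numpy as np

a = np.zeros(1000000, dtype=np.int32)
print(a.nbytes)   # 4000000 bytes of int32
b = a + 1         # non-destructive op: allocates a second array of the same size
print(b.nbytes)   # another 4000000 bytes
a += 1            # the in-place form reuses a's buffer instead
```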
I agree that the simplicity is good: you don't have to write down the dtype of a DataFrame like you do with a structured array. I think the design should be: have simple defaults, but when a data type is already set (e.g. int32), it's OK to "live and let live".
I am new to pandas, but would like to put in my vote for supporting all ndarray types. From my testing, Series will already support other types, but DataFrame will not. I have two arguments: memory and speed.

cpcloud suggested that you can always buy more memory, which is a reasonable suggestion. However, systems do have memory limits, and there are computation tasks that will use all of it (yes, even 256GB or more). Being able to fit twice as many samples on the system, regardless of how much memory you have, is a good thing.

On the speed front, I want to load binary files quickly into memory and process them with pandas. I wrote a C++ module for this purpose. I don't want to copy the memory after reading it from disk; for the processing we are doing, this would double the number of memory operations, which slows the processing down by almost half. Unfortunately, after reading the binary into memory, I need to iterate over it and copy the int32 array into an int64 array. It is even worse than a plain large memory copy, because it also must upcast each value to int64. I like wesm's suggestion.
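A sketch of that scenario; the file name samples.bin, its raw little-endian int32 layout, and the column name are made-up placeholders:

```python
import numpy as np
import pandas as pd

# Hypothetical file of raw little-endian int32 samples, no header.
raw = np.fromfile('samples.bin', dtype='<i4')
print(raw.dtype)             # int32, straight off disk in one read

# On the pandas of this era the constructor upcasts, paying an extra
# pass that copies and widens every value:
df = pd.DataFrame({'samples': raw})
print(df['samples'].dtype)   # int64
```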
@adamsd5 You might try numpy's memmap ndarray subclass. It allows you to treat a file like an in-memory array. Of course, if your file is not just an array then this might be tricky. You could then pass the memmap to the pandas DataFrame constructor, and the dtype should be preserved. I agree with you that in the long run dtype preservation is desirable. Just out of curiosity, what kind of data are you working with?
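A minimal sketch of that suggestion, reusing the hypothetical samples.bin file from above:

```python
import numpy as np

# Map the file rather than reading it: pages are faulted in lazily,
# and the array's dtype is exactly what's on disk.
mm = np.memmap('samples.bin', dtype='<i4', mode='r')
print(mm.dtype, mm[:5])
```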
@adamsd5 sounds like what you really want is out-of-core computation (similar to what @cpcloud suggested). HDFStore supports this now, though in a somewhat non-transparent manner. Here's what you could do; imagine pseudo code like this (this is the 2 b) part):
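A minimal reconstruction of that pseudo code, assuming an existing table named 'df' in input.h5; the file names, key, chunk size, and process_chunk are placeholders:

```python
import pandas as pd

def process_chunk(df):
    # stand-in for whatever per-chunk transformation you need
    return df

chunksize = 100000
in_store = pd.HDFStore('input.h5')
out_store = pd.HDFStore('output.h5')
nrows = in_store.get_storer('df').nrows
for start in range(0, nrows, chunksize):
    chunk = in_store.select('df', start=start, stop=start + chunksize)
    out_store.append('df', process_chunk(chunk))
in_store.close()
out_store.close()
```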
This would essentially give you a transformation operation, similar to process_chunk(df); not that hard to create a wrapper around this....
cpcloud, does pandas.DataFrame treat such memmap ndarrays differently? You've presented a technique that I might use, but I think the DataFrame will still convert all int32 into int64. I'm not actually trying to process things out of memory; I'm happy loading the entire DataFrame into memory. However, I would like to minimize the memory operations. Once the bytes are loaded from disk (and alas, I have no control over the format they are written in), I do not want to copy them around at all (and I don't want pandas to make a copy for me either). From what I can tell, pandas will always upconvert int32 to int64, which is a slow operation.
@adamsd5 A cursory glance at frame.py suggests that the dtype is preserved for instances of ndarray (the isinstance tests catch subclasses) that are not record arrays. You can also pass the dtype in the constructor.
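A sketch of both routes mentioned here; whether the dtype actually survives is version-dependent (see the follow-up below):

```python
import numpy as np
import pandas as pd

arr = np.arange(10, dtype=np.int32).reshape(5, 2)

df1 = pd.DataFrame(arr, columns=['a', 'b'])                  # plain ndarray input
df2 = pd.DataFrame(arr, columns=['a', 'b'], dtype=np.int32)  # dtype passed explicitly
print(df1.dtypes)
print(df2.dtypes)
```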
@adamsd5 I was wrong. It seems that floating point types are preserved in the DataFrame constructor, but integer types are not. The issue still stands. I poked around in the constructor code.
What dtypes should pandas fully support? This means all types of pad/fill/take/diff operations; there are specific Cython functions created for each of the dtypes. The following are currently supported: int64, float64, bool, and object. float32 should be added. Clearly you can always store the other dtypes, but certain operations will raise (or can auto-upcast them). The downside of adding more fully supported dtypes is additional compile time on installation and testing.
For my purposes, int32 and float32 would suffice, but I see value in the smaller types for some people. If operations mean an upcast during the operation, the value is diminished. A use case would be a huge time series DataFrame on disk that has many int8 columns (perhaps factors), where the user wants to load, filter based on time stamp, and save a sub-range of time. None of the int8 columns should be upconverted. Just my ideas, hope it is helpful. Darryl
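A sketch of that use case; the index, frequency, and column names are made up:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-21', periods=86400, freq='S')  # one day of 1-second samples
df = pd.DataFrame({'factor1': np.zeros(86400, dtype=np.int8),
                   'factor2': np.ones(86400, dtype=np.int8)}, index=idx)

# Filter a sub-range by timestamp; ideally no column gets upcast along the way.
sub = df['2013-01-21 09:00:00':'2013-01-21 10:00:00']
print(sub.dtypes)   # should stay int8 once the small int types are fully supported
```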
…ndas-dev#622)

- construction of multi numeric dtypes with other types in a dict
- validated get_numeric_data returns correct dtypes
- added blocks attribute (and as_blocks() method) to DataFrame that returns a dict of dtype -> homogeneous Frame
- added keyword 'raise_on_error' to astype, which can be set to False to exclude non-numeric columns
- fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in other merger)
- changed implementation of get_dtype_counts() to use .blocks
- revised DataFrame.convert_objects to use blocks to be more efficient
- added dtype printing to show by default with a Series
- added convert_dates='coerce' option to convert_objects, to force conversions to datetime64[ns]
- where can upcast integer to float as needed (on inplace ops pandas-dev#2793)
- added fully cythonized support for int8/int16
- no support for float16 (it can exist, but no cython methods for it)

TST:
- fixed test in test_from_records_sequencelike (dict orders can be different on different arch!); NOTE: using tuples will remove dtype info from the input stream (using a record array is ok though!)
- test updates for merging (multi-dtypes)
- added tests for replace (but skipped for now, algos not set for float32/16)
- tests for astype and convert in internals
- fixes for test_excel on 32-bit
- fixed test_resample_median_bug_1688, I believe
- separated out test_from_records_dictlike
- testing of panel constructors (GH pandas-dev#797)
- where ops now have a full test suite
- allow slightly less sensitive decimal tests for less precise dtypes

BUG:
- fixed GH pandas-dev#2778, fillna on empty frame causes seg fault
- fixed bug in groupby where types were not being cast to original dtype
- respect the dtype of non-natural numeric (Decimal)
- don't upcast ints/bools to floats (if say you were agging on len, you can get an int)

DOC:
- added astype conversion examples to whatsnew and docs (dsintro)
- updated RELEASE notes
- whatsnew for 0.10.2
- added upcasting gotchas docs

CLN:
- updated convert_objects to be more consistent across frame/series
- moved most groupby functions out of algos.pyx to generated.pyx
- fully support cython functions for pad/bfill/take/diff/groupby for float32
- moved more block-like conversion loops from frame.py to internals.py (created apply method) (e.g. diff, fillna, where, shift, replace, interpolate, combining) to top-level methods in BlockManager
* jreback/dtypes: ENH: allow propagation and coexistence of numeric dtypes (closes GH #622)
Boom. Resolved by #2708, merged to master today.
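With that change merged, construction preserves the input dtypes; a quick check (behavior as described in the merged change, on a post-0.10.2 pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.int32),
                   'b': np.array([1.0, 2.0, 3.0], dtype=np.float32)})
print(df.dtypes)   # a      int32
                   # b    float32
```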
…-dev#622)

* initial implementation of default handling for pickled frames
* MDP-3767 throw exceptions instead of falling back to default pickle behaviour
* updated the strict handler check mechanism to be at the library level, and then use os.environ (if set), else disabled by default
* sanitized the tests for the strict handler checks
* clarified the decision of having the handler_supports_read_option option in the do_read of version store instead of inside individual handlers
Is this intended? I had hoped no copying at all would happen in that case.
In [65]: a = np.random.randint(10, size=1e6)
In [66]: a.dtype
Out[66]: dtype('int32')
In [67]: b = np.random.randint(2, size=1e6)
In [68]: df = pandas.DataFrame({'a': a, 'b': b})
In [69]: df.dtypes
Out[69]:
a int64
b int64