ENH/BUG/DOC: allow propogation and coexistance of numeric dtypes #2708

jreback · 2013-01-19T21:27:44Z

Support for numeric dtype propagation and coexistence in DataFrames. Prior to 0.10.2, numeric dtypes passed to DataFrames were always cast to int64 or float64. Now, if a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste. This closes GH #622.

Other changes introduced in this PR (I moved all datetime-like issues to PR #2752, which should be merged first):

ENH:

validated get_numeric_data returns correct dtypes
added blocks attribute (and as_blocks() method) that returns a dict of dtype -> homogeneously dtyped DataFrame, analogous to the values attribute
added keyword 'raise_on_error' to astype, which can be set to False to exclude non-numeric columns
changed get_dtype_counts() to use blocks attribute
changed convert_objects() to use the internals method convert (which is block operated)
- added option convert_numeric to convert_objects to force numeric conversion (values that cannot be converted are set to np.nan); turned off by default (convert_numeric=False)
- added option convert_dates='coerce' to convert_objects to force datetime-like conversions (invalid values are set to NaT); turned off by default, returns datetime64[ns] dtype
groupby operations now respect dtype inputs wherever possible, even if intermediate casting is required (obviously, if the inputs are ints and NaNs result, the output is upcast)
all cython functions are implemented
auto generation of most groupby functions by type is now in generated_code.py
e.g. (group_add,group_mean)
added full float32/int16/int8 support for all numeric operations, including (diff, backfill, pad, take)
added dtype display to show on Series as a default
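
To illustrate the dtype -> homogeneous-frame mapping described in the list above, here is a minimal sketch. It assumes a recent pandas and uses only `DataFrame.dtypes` and column selection; `split_by_dtype` is a hypothetical helper written for this example, not the `blocks` / `as_blocks()` implementation from this PR.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.arange(3, dtype="int32"),
    "b": np.arange(3, dtype="float32"),
    "c": np.arange(3, dtype="float32"),
    "d": np.arange(3, dtype="float64"),
})


def split_by_dtype(frame):
    """Hypothetical helper: map dtype name -> sub-frame of the columns with that dtype."""
    columns_by_dtype = {}
    for column, dtype in frame.dtypes.items():
        columns_by_dtype.setdefault(str(dtype), []).append(column)
    return {name: frame[cols] for name, cols in columns_by_dtype.items()}


blocks = split_by_dtype(df)

# Counting columns per dtype falls out of the same mapping.
dtype_counts = {name: sub.shape[1] for name, sub in blocks.items()}
print(dtype_counts)
```

Each value in `blocks` holds columns of a single dtype, so per-dtype operations can work on one homogeneous frame at a time.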

BUG:

fixed up tests from from_records to use record arrays directly
NOTE: using tuples will remove dtype info from the input stream (using a record array is ok though!)
fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in other merger)
(#2623, "DataFrame.from_records incorrectly up-converts dtypes to object", can be fixed, but depends on #2752, "BUG: various bug fixes for DataFrame/Series construction")
fixed #2778 ("BUG: fillna with method segfaults on zero-length input", which fixes #2775), a bug in pad/backfill with a 0-len frame
fixed a very obscure bug in DataFrame.from_records when a dictionary and columns are passed and hash randomization is on!
integer upcasts will now happen on where when using inplace ops (#2793, "BUG: DataFrame inplace where doesn't work for mixed datatype frames")
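
Two of the points above (tuples dropping dtype info in from_records, and integer upcasting on where) can be sketched as follows. This assumes a recent pandas and NumPy; the array contents and column names are made up for the example, and it uses the non-inplace form of `where` rather than the inplace path fixed in #2793.

```python
import numpy as np
import pandas as pd

# A structured (record-style) array carries a dtype per field.
records = np.array(
    [(1, 1.5), (2, 2.5), (3, 3.5)],
    dtype=[("a", "int32"), ("b", "float32")],
)

from_array = pd.DataFrame.from_records(records)

# Plain Python tuples hold Python scalars, so the field dtypes are lost
# and pandas infers its default numeric dtypes instead.
from_tuples = pd.DataFrame.from_records(records.tolist(), columns=["a", "b"])

print(from_array.dtypes)
print(from_tuples.dtypes)

# where() must put NaN in masked-out integer slots, which forces a float result.
ints = pd.DataFrame({"x": np.array([1, -2, 3], dtype="int32")})
masked = ints.where(ints > 0)
print(masked.dtypes)
```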

TST:

tests added for merging changes, astype, convert
fixes for test_excel on 32-bit
fixed test_resample_median_bug_1688
separated out test_from_records_dictlike
added tests for GH #797 (Panel constructor ignores dtype)
added lots of tests for where

DOC:

added DataTypes section in Data Structures intro
whatsnew examples

It would be really helpful if some users could give this a test run before merging. I have put in test cases for numeric operations and for combining with DataFrame and Series, but I am sure some corner cases were missed.

In [17]: df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')

In [18]: df1
Out[18]: 
          A
0 -0.007220
1 -0.236432
2  2.427172
3 -0.998639
4 -1.039410
5  0.336029
6  0.832988
7 -0.413241

In [19]: df1.dtypes
Out[19]: A    float32

In [20]: df2 = DataFrame(dict(A = Series(randn(8), dtype='float16'),
   ....:                      B = Series(randn(8)),
   ....:                      C = Series(randn(8), dtype='uint8')))

In [22]: df2
Out[22]: 
          A         B    C
0  1.150391 -1.033296    0
1  0.123047  1.915564    0
2  0.151367 -0.489826    0
3 -0.565430 -0.734238    0
4 -0.352295 -0.451430    0
5 -0.618164  0.673102  255
6  1.554688  0.322035    0
7  0.160767  0.420718    0


In [23]: df2.dtypes
Out[23]: 
A    float16
B    float64
C      uint8

In [24]: # here you get some upcasting

In [25]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [26]: df3
Out[26]: 
          A         B    C
0  1.143170 -1.033296    0
1 -0.113385  1.915564    0
2  2.578539 -0.489826    0
3 -1.564069 -0.734238    0
4 -1.391705 -0.451430    0
5 -0.282135  0.673102  255
6  2.387676  0.322035    0
7 -0.252475  0.420718    0

In [27]: df3.dtypes
Out[27]: 
A    float32
B    float64
C    float64
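
As a rough guide to the upcasting seen above: element-wise arithmetic between two NumPy-backed columns generally follows NumPy's type-promotion rules. This is an assumption about the general rule, not a description of this PR's internals. A minimal sketch using `np.result_type`, with made-up Series:

```python
import numpy as np
import pandas as pd

# NumPy's promotion result for pairs of dtypes.
print(np.result_type(np.float32, np.float16))
print(np.result_type(np.float64, np.uint8))
print(np.result_type(np.int32, np.int64))

# The same rule shows up when two Series of different dtypes are added.
narrow = pd.Series(np.ones(4, dtype="float32"))
wide = pd.Series(np.ones(4, dtype="float64"))
total = narrow + wide
print(total.dtype)

# Combining a dtype with itself leaves it unchanged.
print((narrow + narrow).dtype)
```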

The example from #622:

In [23]: a = np.array(np.random.randint(10, size=int(1e6)),dtype='int32')

In [24]: b = np.array(np.random.randint(10, size=int(1e6)),dtype='int64')

In [25]: df = pandas.DataFrame(dict(a = a, b = b))

In [26]: df.dtypes
Out[26]: 
a    int32
b    int64

Conversion examples

# conversion of dtypes
In [81]: df3.astype('float32').dtypes
Out[81]: 
A    float32
B    float32
C    float32

# mixed type conversions
In [82]: df3['D'] = '1.'

In [83]: df3['E'] = '1'

In [84]: df3.convert_objects(convert_numeric=True).dtypes
Out[84]: 
A    float32
B    float64
C    float64
D    float64
E      int64

# same, but specific dtype conversion
In [85]: df3['D'] = df3['D'].astype('float16')

In [86]: df3['E'] = df3['E'].astype('int32')

In [87]: df3.dtypes
Out[87]: 
A    float32
B    float64
C    float64
D    float16
E      int32

# forcing date coercion
In [18]: s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1,
   ....:             Timestamp('20010104'), '20010105'],dtype='O')
   ....:

In [19]: s.convert_objects(convert_dates='coerce')
Out[19]: 
0   2001-01-01 00:00:00
1                   NaT
2                   NaT
3                   NaT
4   2001-01-04 00:00:00
5   2001-01-05 00:00:00
Dtype: datetime64[ns]
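
`convert_objects` is the API as of this PR; later pandas releases removed it. For readers on a recent pandas, the coercion behaviour shown above (unparseable values become NaN or NaT) can be approximated with `pd.to_numeric` and `pd.to_datetime` using `errors='coerce'`. This is a minimal sketch with made-up input values, not a drop-in replacement:

```python
import pandas as pd

# Strings that cannot be parsed as numbers become NaN.
numbers = pd.to_numeric(
    pd.Series(["1.", "1", "foo"], dtype="object"), errors="coerce"
)
print(numbers)

# Strings that cannot be parsed as dates become NaT.
dates = pd.to_datetime(
    pd.Series(["2001-01-01", "foo", "2001-01-05"], dtype="object"),
    errors="coerce",
)
print(dates)
```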

wesm · 2013-01-19T21:35:34Z

This is pretty great. I'm going to delay merging until post 0.10.1 (which we're sprinting on now, critical bug fixes only), but only to have a chance to beat on it some.

jreback · 2013-01-19T21:43:05Z

@wesm agreed... even though the change is not that big, this touches just about everything indirectly. There might be some weird corner cases.

jreback · 2013-02-02T06:13:06Z

Travis all green - ready 4 merging!

…ndas-dev#622)

construction of multi numeric dtypes with other types in a dict
validated get_numeric_data returns correct dtypes
added blocks attribute (and as_blocks()) method that returns a dict of dtype -> homogeneous Frame to DataFrame
added keyword 'raise_on_error' to astype, which can be set to false to exclude non-numeric columns
fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in other merger)
changed implementation of get_dtype_counts() to use .blocks
revised DataFrame.convert_objects to use blocks to be more efficient
added Dtype printing to show on default with a Series
added convert_dates='coerce' option to convert_objects, to force conversions to datetime64[ns]
where can upcast integer to float as needed (on inplace ops pandas-dev#2793)
added fully cythonized support for int8/int16
no support for float16 (it can exist, but no cython methods for it)

TST:

fixed test in test_from_records_sequencelike (dict orders can be different on different arch!)
NOTE: using tuples will remove dtype info from the input stream (using a record array is ok though!)
test updates for merging (multi-dtypes)
added tests for replace (but skipped for now, algos not set for float32/16)
tests for astype and convert in internals
fixes for test_excel on 32-bit
fixed test_resample_median_bug_1688 I believe
separated out test_from_records_dictlike
testing of panel constructors (GH pandas-dev#797)
where ops now have a full test suite
allow slightly less sensitive decimal tests for less precise dtypes

BUG:

fixed GH pandas-dev#2778, fillna on empty frame causes seg fault
fixed bug in groupby where types were not being cast to original dtype
respect the dtype of non-natural numeric (Decimal)
don't upcast ints/bools to floats (if you say we're agging on len, you can get an int)

DOC:

added astype conversion examples to whatsnew and docs (dsintro)
updated RELEASE notes
whatsnew for 0.10.2
added upcasting gotchas docs

CLN:

updated convert_objects to be more consistent across frame/series
moved most groupby functions out of algos.pyx to generated.pyx
fully support cython functions for pad/bfill/take/diff/groupby for float32
moved more block-like conversion loops from frame.py to internals.py (created apply method) (e.g. diff,fillna,where,shift,replace,interpolate,combining), to top-level methods in BlockManager

wesm · 2013-02-10T16:19:57Z

Just merged this and next release will be 0.11. Will start looking through PRs that depend on it

This was referenced Jan 21, 2013

BUG: should astype only TRY to convert string columns? #2718

Closed

BUG: HDFStore fixes #2675

Merged

ENH: PyTables Enhancements for future #2391

Closed

Integer dtype is promoted to int64 #2759

Closed

This was referenced Jan 30, 2013

GroupBy on DatetimeIndex with float32 values VERY slow #2772

Closed

BUG: fillna with method segfaults on zero-length input (fixes #2775) #2778

Closed

stephenwlin mentioned this pull request Jan 31, 2013

segmentation fault in fillna #2775

Closed

This was referenced Feb 3, 2013

BUG: DataFrame inplace where doesn't work for mixed datatype frames #2793

Closed

ENH: should boolean indexing preserve input dtypes where possible? #2794

Closed

alvorithm mentioned this pull request Feb 7, 2013

BUG: issue in HDFStore with too many selectors in a where #2755

Merged

stephenwlin mentioned this pull request Feb 8, 2013

ENH: Optimize take_*; improve non-NA fill_value support #2819

Merged

wesm merged commit 166a80d into pandas-dev:master Feb 10, 2013

wesm mentioned this pull request Feb 10, 2013

pandas converts int32 to int64 #622

Closed

This was referenced Feb 10, 2013

BUG: various bug fixes for DataFrame/Series construction #2752

Merged

TST: fix failing #2752 tests for 32-bit builds #2837

Closed

This was referenced Feb 12, 2013

pandas.tests.test_graphics.TestDataFramePlots test_unsorted_index failure #2854

Closed

ENH: should shift return same dtype objects as input? #2761

Closed

benjello mentioned this pull request Feb 28, 2014

Better handling of memory usage openfisca/openfisca-core#92

Closed
