ENH/BUG/DOC: allow propagation and coexistence of numeric dtypes #2708

Merged (1 commit) on Feb 10, 2013

jreback (Contributor) commented Jan 19, 2013

Support for numeric dtype propagation and coexistence in DataFrames. Prior to 0.10.2, numeric dtypes passed to DataFrames were always cast to int64 or float64. Now, if a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste. This closes GH #622.

Other changes introduced in this PR (I moved all datetime-like issues to PR #2752, which should be merged first):

ENH:

  • validated that get_numeric_data returns correct dtypes
  • added a blocks attribute (and an as_blocks() method) that returns a
    dict of dtype -> homogeneously dtyped DataFrame, analogous to the values attribute
  • added keyword 'raise_on_error' to astype, which can be set to False to exclude non-numeric columns
  • changed get_dtype_counts() to use the blocks attribute
  • changed convert_objects() to use the internals method convert (which operates on blocks)
    • added option convert_numeric (default False) to convert_objects to force numeric conversion, with unconvertible values set to np.nan
    • added option convert_dates='coerce' to convert_objects to force datetime-like conversion, with invalid values set to NaT; turned off by default; returns datetime64[ns] dtype
  • groupby operations respect dtype inputs wherever possible, even if intermediate casting is required (obviously, if the inputs are ints and NaNs result, the output is upcast);
    all cython functions are implemented
  • auto generation of most groupby functions by type is now in generated_code.py,
    e.g. (group_add, group_mean)
  • added full float32/int16/int8 support for all numeric operations, including (diff, backfill, pad, take)
  • added dtype display to show on Series by default
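
For readers who want to see what "a dict of dtype -> homogeneously dtyped DataFrame" amounts to, here is a minimal sketch. It does not call the blocks attribute or get_dtype_counts() from this PR; it assumes a recent pandas imported as pd and rebuilds the same idea from the public dtypes attribute. The names dtype_counts, columns_by_dtype and frames_by_dtype are hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Columns built from typed ndarrays keep their own dtypes.
df = pd.DataFrame({
    "a": np.arange(3, dtype="int32"),
    "b": np.arange(3, dtype="int64"),
    "c": np.ones(3, dtype="float32"),
    "d": np.zeros(3, dtype="float32"),
})

# Count columns per dtype (roughly the information get_dtype_counts() reports).
dtype_counts = Counter(str(dtype) for dtype in df.dtypes)

# Group column names by dtype, then slice one homogeneous frame per dtype
# (a hypothetical stand-in for the dict described for the blocks attribute).
columns_by_dtype = {}
for name, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(name)
frames_by_dtype = {dtype: df[cols] for dtype, cols in columns_by_dtype.items()}

print(dict(dtype_counts))
print({dtype: list(frame.columns) for dtype, frame in frames_by_dtype.items()})
```

Each value in frames_by_dtype holds columns of a single dtype, which is the property the PR relies on when it operates block by block.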

BUG:

TST:

  • tests added for merging changes, astype, convert
  • fixes for test_excel on 32-bit
  • fixed test_resample_median_bug_1688
  • separated out test_from_records_dictlike
  • added tests for GH #797 (Panel constructor ignores dtype)
  • added lots of tests for where

DOC:

  • added DataTypes section in Data Structures intro
  • whatsnew examples

It would be really helpful if some users could give this a test run before merging. I have put in test cases for numeric operations and for combining with DataFrame and Series, but I am sure some corner cases were missed.

In [17]: df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')

In [18]: df1
Out[18]: 
          A
0 -0.007220
1 -0.236432
2  2.427172
3 -0.998639
4 -1.039410
5  0.336029
6  0.832988
7 -0.413241

In [19]: df1.dtypes
Out[19]: A    float32

In [20]: df2 = DataFrame(dict(A = Series(randn(8), dtype='float16'),
   ....:                      B = Series(randn(8)),
   ....:                      C = Series(randn(8), dtype='uint8')))

In [22]: df2
Out[22]: 
          A         B    C
0  1.150391 -1.033296    0
1  0.123047  1.915564    0
2  0.151367 -0.489826    0
3 -0.565430 -0.734238    0
4 -0.352295 -0.451430    0
5 -0.618164  0.673102  255
6  1.554688  0.322035    0
7  0.160767  0.420718    0


In [23]: df2.dtypes
Out[23]: 
A    float16
B    float64
C      uint8

In [24]: # here you get some upcasting

In [25]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [26]: df3
Out[26]: 
          A         B    C
0  1.143170 -1.033296    0
1 -0.113385  1.915564    0
2  2.578539 -0.489826    0
3 -1.564069 -0.734238    0
4 -1.391705 -0.451430    0
5 -0.282135  0.673102  255
6  2.387676  0.322035    0
7 -0.252475  0.420718    0

In [27]: df3.dtypes
Out[27]: 
A    float32
B    float64
C    float64

the example from #622

In [23]: a = np.array(np.random.randint(10, size=10**6),dtype='int32')

In [24]: b = np.array(np.random.randint(10, size=10**6),dtype='int64')

In [25]: df = pandas.DataFrame(dict(a = a, b = b))

In [26]: df.dtypes
Out[26]: 
a    int32
b    int64

Conversion examples

# conversion of dtypes
In [81]: df3.astype('float32').dtypes
Out[81]: 
A    float32
B    float32
C    float32

# mixed type conversions
In [82]: df3['D'] = '1.'

In [83]: df3['E'] = '1'

In [84]: df3.convert_objects(convert_numeric=True).dtypes
Out[84]: 
A    float32
B    float64
C    float64
D    float64
E      int64

# same, but specific dtype conversion
In [85]: df3['D'] = df3['D'].astype('float16')

In [86]: df3['E'] = df3['E'].astype('int32')

In [87]: df3.dtypes
Out[87]: 
A    float32
B    float64
C    float64
D    float16
E      int32

# forcing date coercion
In [18]: s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1,
   ....:             Timestamp('20010104'), '20010105'],dtype='O')
   ....:

In [19]: s.convert_objects(convert_dates='coerce')
Out[19]: 
0   2001-01-01 00:00:00
1                   NaT
2                   NaT
3                   NaT
4   2001-01-04 00:00:00
5   2001-01-05 00:00:00
Dtype: datetime64[ns]
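
The coercion idea (convert what parses, mark the rest as missing) can also be sketched without convert_objects. The example below assumes a recent pandas and uses pd.to_numeric and pd.to_datetime with errors="coerce" as stand-ins; it is not the code path this PR adds. The inputs are strings only, which keeps the result independent of how mixed object inputs are interpreted.

```python
import pandas as pd

# Strings that look numeric are parsed; anything else becomes NaN.
raw_numbers = pd.Series(["1.", "1", "foo"], dtype=object)
numbers = pd.to_numeric(raw_numbers, errors="coerce")

# Parseable date strings are converted; anything else becomes NaT.
raw_dates = pd.Series(["2001-01-04", "foo", "2001-01-05"], dtype=object)
dates = pd.to_datetime(raw_dates, errors="coerce")

print(numbers)
print(dates)
```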

wesm (Member) commented Jan 19, 2013

This is pretty great. I'm going to delay merging until post 0.10.1 (which we're sprinting on now, critical bug fixes only), but only to have a chance to beat on it some.

jreback (Contributor, Author) commented Jan 19, 2013

@wesm agreed... even though the change is not that big, this touches like everything indirectly. There might be some weird corner cases.

jreback (Contributor, Author) commented Feb 2, 2013

Travis all green - ready 4 merging!

…ndas-dev#622)

     construction of multi numeric dtypes with other types in a dict
     validated get_numeric_data returns correct dtypes
     added blocks attribute (and as_blocks()) method that returns a dict of dtype -> homogeneous Frame to DataFrame
     added keyword 'raise_on_error' to astype, which can be set to False to exclude non-numeric columns
     fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in other merger)
     changed implementation of get_dtype_counts() to use .blocks
     revised DataFrame.convert_objects to use blocks to be more efficient
     added Dtype printing to show on default with a Series
     added convert_dates='coerce' option to convert_objects, to force conversions to datetime64[ns]
     where can upcast integer to float as needed (on inplace ops pandas-dev#2793)
     added fully cythonized support for int8/int16
     no support for float16 (it can exist, but no cython methods for it)

TST: fixed test in test_from_records_sequencelike (dict orders can be different on different arch!)
       NOTE: using tuples will remove dtype info from the input stream (using a record array is ok though!)
     test updates for merging (multi-dtypes)
     added tests for replace (but skipped for now, algos not set for float32/16)
     tests for astype and convert in internals
     fixes for test_excel on 32-bit
     fixed test_resample_median_bug_1688, I believe
     separated out test_from_records_dictlike
     testing of panel constructors (GH pandas-dev#797)
     where ops now have a full test suite
     allow slightly less sensitive decimal tests for less precise dtypes

BUG: fixed GH pandas-dev#2778, fillna on empty frame causes seg fault
     fixed bug in groupby where types were not being cast back to the original dtype
     respect the dtype of non-natural numeric (Decimal)
     don't upcast ints/bools to floats (e.g. if you are aggregating on len, you can get an int)
DOC: added astype conversion examples to whatsnew and docs (dsintro)
     updated RELEASE notes
     whatsnew for 0.10.2
     added upcasting gotchas docs

CLN: updated convert_objects to be more consistent across frame/series
     moved most groupby functions out of algos.pyx to generated.pyx
     fully support cython functions for pad/bfill/take/diff/groupby for float32
     moved more block-like conversion loops from frame.py to internals.py (created apply method)
       (e.g. diff,fillna,where,shift,replace,interpolate,combining), to top-level methods in BlockManager
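
The commit message line "where can upcast integer to float as needed" can be illustrated with a short sketch, assuming a recent pandas. The variable names ints and masked are hypothetical, and the example only shows the observable behavior, not the BlockManager code path.

```python
import pandas as pd

ints = pd.Series([1, 2, 3], dtype="int32")

# Entries failing the condition are replaced with NaN, which an integer
# column cannot hold, so the result is upcast to a float dtype.
masked = ints.where(ints > 1)

print(masked)
print(masked.dtype)
```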
wesm merged commit 166a80d into pandas-dev:master on Feb 10, 2013

wesm (Member) commented Feb 10, 2013

Just merged this; the next release will be 0.11. Will start looking through PRs that depend on it.
