API: .convert_objects is deprecated, do we want a .convert to replace? #11221
Comments
There is already _convert, which could be promoted.
The advantage of a well designed …
@bashtage oh I agree. The problem is with …
I was just thinking of the case where I imported data that should be numeric into a DataFrame, but it has some mixed characters, and I want just numbers or NaNs. This type of conversion is what I ultimately wanted when I started looking at …
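(As a minimal illustration of that "just numbers or NaNs" conversion, pd.to_numeric with errors='coerce' covers this case; the sample data below is made up:)

import pandas as pd

s = pd.Series(['1', '2', 'oops', '4.5'], dtype=object)
pd.to_numeric(s, errors='coerce')
# 0    1.0
# 1    2.0
# 2    NaN   <- unparseable value becomes NaN instead of raising
# 3    4.5
# dtype: float64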
but the problem is that a mixed boolean/NaN column is ambiguous (so maybe we just need to 'handle' that)
Some comments/observations: …
Maybe this could be an extra parameter to …
ok so the question is: should we un-deprecate convert_objects then? I actually think convert is a much better name and we certainly could add the options you describe to make it more useful
… tries to convert to type A … A better design would only convert a single type, which removes any ambiguity if some data is ever convertible to more than one type.
Long live convert_objects!
maybe what we need in the docs are some examples showing: …
Hi all, I currently use convert_objects in many of my scripts and I think this functionality is very useful when importing datasets that may differ every day in terms of column composition. Is it really necessary to deprecate it, or is there a chance to keep it alive? Many thanks,
I agree with @jreback - a well designed guesser with clear, simple rules and no option to coerce could be useful, but it isn't hard to write your own with your favorite set of rules.
FYI, the convert-all (errors='coerce') and ignore (errors='ignore') options in .to_numeric are a problem in data files containing columns of strings that you want to keep and columns of strings that are actually numbers expressed in scientific notation (e.g., 6.2e+15), which require 'coerce' to convert from strings to float64. The (deprecated) convert.py file has a handy soft convert function that checks whether a forced conversion produces all NaNs (such as a string column that you want to keep) and then declines to convert the whole column. A fourth error option, such as 'soft-coerce', would catch scientific-notation numbers while not forcing all strings to NaNs. At the moment, my work around is:
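(The work-around snippet itself was cut off in this thread; as a rough sketch of the idea rather than the author's actual code, with a hypothetical helper name, a soft-coerce conversion that declines to convert a column when coercion would wipe it out entirely could look like:)

import pandas as pd

def soft_to_numeric(df):
    # Coerce each object column to numeric, but keep the original values
    # when the coerced result is all-NaN (i.e. it really is a string column).
    out = df.copy()
    for col in out.select_dtypes(include=['object']).columns:
        coerced = pd.to_numeric(out[col], errors='coerce')
        if not coerced.isnull().all():
            out[col] = coerced
    return out

This keeps genuine string columns intact while still converting columns of scientific-notation strings such as '6.2e+15'.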
The great thing about … Is there any replacement for a function that will match this functionality, in particular inferring dates/datetimes?
xref #15757 (comment). I think it would be worth exposing whatever the new soft convert API is in 0.20 (I haven't looked at it in detail), referencing it in the convert_objects deprecation message, then deferring convert_objects removal to the next version, if possible. I say this because I know there are people (for example, me) who have ignored the convert_objects deprecation message in a couple of cases, in particular when working with data where you don't necessarily know the columns. Real instance:

df = pd.read_html(source)[0]  # poorly formatted table, everything inferred to object
# exact columns can vary
df.columns = df.loc[0, :]
df = df.drop(0).dropna()
df = df.convert_objects()

Looking at this again, I realize …
IF we decide to expose a 'soft convert objects', would we want this called …
xref #15550. So I think a resolution to this could be: …
Then it's easy enough to do: …
If you really, really want to actually convert things a la the original: …
And I suppose we could offer a convenience feature for this: …
I think the most useful soft conversion function would have either the ability to order the … I agree extending the …
Thanks @jreback - I like adding … For example:

df = pd.DataFrame({'num_objects': [1, 2, 3], 'num_str': ['1', '2', '3']}, dtype=object)

In [2]: df
Out[2]:
  num_objects num_str
0           1       1
1           2       2
2           3       3

In [3]: df.dtypes
Out[3]:
num_objects    object
num_str        object
dtype: object

The default behavior of convert_objects converts only the column that already holds numeric objects, whereas apply(pd.to_numeric) also converts the numeric strings:

In [4]: df.convert_objects().dtypes
Out[4]:
num_objects     int64
num_str        object
dtype: object

In [5]: df.apply(pd.to_numeric).dtypes
Out[5]:
num_objects     int64
num_str         int64
dtype: object

So is it worth adding a …
The … I would assume that a successor to …
The reason that I don't like adding the … So for me one of the reasons to have a …
ok if we resurrect this with an all-new signature. This is the current one: …
IIRC @jorisvandenbossche suggested (with a mod): …
Though if everything is changed, then maybe we should just rename this. (note the …
Sorry I'm just getting back to this. Here's a proposal for how I think this could work, open to suggestions on any piece.

First, for conversions that are simply unboxing of python objects, add a new method, infer_objects():

import datetime
import pandas as pd

df = pd.DataFrame({'a': ['a', 1, 2, 3],
                   'b': ['b', 2.0, 3.0, 4.1],
                   'c': ['c', datetime.datetime(2016, 1, 1), datetime.datetime(2016, 1, 2),
                         datetime.datetime(2016, 1, 3)]})
df = df.iloc[1:]
In [194]: df
Out[194]:
a b c
1 1 2 2016-01-01 00:00:00
2 2 3 2016-01-02 00:00:00
3 3 4.1 2016-01-03 00:00:00
In [195]: df.dtypes
Out[195]:
a object
b object
c object
dtype: object
# exactly what convert_objects does in this scenario today!
In [196]: df.infer_objects().dtypes
Out[196]:
a int64
b float64
c datetime64[ns]
dtype: object

Second, for all other conversions, add DataFrame-level .to_numeric() / .to_datetime() methods (see the example below).
Example frame, with what is needed today:

df1 = pd.DataFrame({
    'date': pd.date_range('2014-01-01', periods=3),
    'date_unconverted': ['2014-01', '2015-01', '2016-01'],
    'number': [1, 2, 3],
    'number_unconverted': ['1', '2', '3']})
In [198]: df1
Out[198]:
date date_unconverted number number_unconverted
0 2014-01-01 2014-01 1 1
1 2014-01-02 2015-01 2 2
2 2014-01-03 2016-01 3 3
In [199]: df1.dtypes
Out[199]:
date datetime64[ns]
date_unconverted object
number int64
number_unconverted object
dtype: object
In [202]: df1.convert_objects(convert_numeric=True, convert_dates='coerce').dtypes
C:\Users\chris.bartak\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
"""Entry point for launching an IPython kernel.
Out[202]:
date datetime64[ns]
date_unconverted datetime64[ns]
number int64
number_unconverted int64
dtype: object

With the new API:

In [202]: df1.to_numeric().to_datetime().dtypes
Out[202]:
date datetime64[ns]
date_unconverted datetime64[ns]
number int64
number_unconverted int64
dtype: object
And to be honest, I don't personally care much about the second API; my pushback over deprecating …
I would second … I think a function like …
Cool, yeah, the more I think about it the less I think adding …

In [251]: from pandas._libs.lib import maybe_convert_objects
In [252]: converter = lambda x: maybe_convert_objects(np.asarray(x, dtype='O'), convert_datetime=True, convert_timedelta=True)
In [253]: converter([1,2,3])
Out[253]: array([1, 2, 3], dtype=int64)
In [254]: converter([1,2,3])
Out[254]: array([1, 2, 3], dtype=int64)
In [255]: converter([1,2,'3'])
Out[255]: array([1, 2, '3'], dtype=object)
In [256]: converter([datetime.datetime(2015, 1, 1), datetime.datetime(2015, 1, 2)])
Out[256]: array(['2015-01-01T00:00:00.000000000', '2015-01-02T00:00:00.000000000'], dtype='datetime64[ns]')
In [257]: converter([datetime.datetime(2015, 1, 1), 'a'])
Out[257]: array([datetime.datetime(2015, 1, 1, 0, 0), 'a'], dtype=object)
In [258]: converter([datetime.datetime(2015, 1, 1), 1])
Out[258]: array([datetime.datetime(2015, 1, 1, 0, 0), 1], dtype=object)
In [259]: converter([datetime.timedelta(seconds=1), datetime.timedelta(seconds=1)])
Out[259]: array([1000000000, 1000000000], dtype='timedelta64[ns]')
In [260]: converter([datetime.timedelta(seconds=1), 1])
Out[260]: array([datetime.timedelta(0, 1), 1], dtype=object) |
yes, maybe_convert_objects is a soft conversion
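(For illustration, a hedged sketch of how such a soft conversion could be applied across a whole frame, building on the maybe_convert_objects calls shown in the session above; the helper name is hypothetical, and pandas._libs.lib is a private API:)

import numpy as np
import pandas as pd
from pandas._libs.lib import maybe_convert_objects

def soft_convert_frame(df):
    # Run the soft converter column-by-column; columns that cannot be
    # converted come back unchanged as object dtype.
    out = df.copy()
    for col in out.select_dtypes(include=['object']).columns:
        out[col] = maybe_convert_objects(np.asarray(out[col], dtype='O'),
                                         convert_datetime=True,
                                         convert_timedelta=True)
    return out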
I could be on board with a very simple …
could add the new function and change the message on the convert_objects deprecation to point to …
IIUC, to what extent is …
Ah right, so do you mean then that …
Ah, okay, that makes sense. I was just trying to understand and collate the comments made in this discussion in my mind.
fyi, opened #16915 for …
xref #11173
or IMHO simply replace by use of pd.to_datetime, pd.to_timedelta, pd.to_numeric. Having an auto-guesser is ok, but when you try to forcefully coerce, things can easily go awry.
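(As a minimal sketch of that per-column replacement, with made-up data; converting to numeric first and only retrying the remaining object columns as dates avoids integers being reinterpreted as epoch timestamps:)

import pandas as pd

df = pd.DataFrame({'num': ['1', '2', '3'],
                   'date': ['2014-01', '2015-01', '2016-01'],
                   'label': ['a', 'b', 'c']}, dtype=object)

# numeric first; unconvertible columns are returned unchanged
converted = df.apply(pd.to_numeric, errors='ignore')
# then try dates only on the columns that are still object dtype
obj_cols = converted.select_dtypes(include=['object']).columns
converted[obj_cols] = converted[obj_cols].apply(pd.to_datetime, errors='ignore')

converted.dtypes
# num              int64
# date    datetime64[ns]
# label           object
# dtype: object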