ER: compound dtypes - DataFrame constructor/astype #4464

Open · mamikonyan opened this issue Aug 5, 2013 · 38 comments
Labels: Constructors, Dtype Conversions, Enhancement

Comments

@mamikonyan

xref #9133, maybe allow a dict of dtypes to be passed as well

I'm trying to use the dtype argument in the DataFrame constructor to set the types of several columns, but I'm getting incorrect types. Everything works well, however, when the dtypes come from the recarray itself.

In [61]:
data = [(1,1.2), (2,2.3)]
dtype = [('a','i4'),('b','f4')]
a = np.array(data, dtype=dtype)
pd.DataFrame(a).dtypes

Out [61]:
a      int32
b    float32
dtype: object

But if I use the dtype constructor argument, I get incorrect types:

In [65]:
pd.DataFrame(data, dtype=dtype).dtypes
Out [65]:
0    object
1    object
dtype: object

The astype() method doesn't work either:

In [75]:
pd.DataFrame(data).astype(dtype)

Truncated Traceback (Use C-c C-x to view full TB):
c:\Anaconda\lib\site-packages\pandas\core\common.pyc in take_nd(arr, indexer,  axis, out, fill_value, mask_info, allow_fill)
    491         indexer = _ensure_int64(indexer)
    492         if not allow_fill:
--> 493             dtype, fill_value = arr.dtype, arr.dtype.type()
    494             mask_info = None, False
    495         else:

TypeError: function takes exactly 1 argument (0 given)

I'm using Pandas 0.11.0 from Anaconda.
Thanks in advance.

@cpcloud (Member) commented Aug 5, 2013

this will be fixed, but is there a case where you need to pass the dtype? I suppose if you have the array and the dtype separately then you might want to just pass them in, but if you have that, you can construct the recarray and then pass it to DataFrame...
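
For reference, a minimal sketch of that recarray workaround, reusing the data and dtype from the report above:

import numpy as np
import pandas as pd

data = [(1, 1.2), (2, 2.3)]
dtype = [('a', 'i4'), ('b', 'f4')]

# Build the structured array first; the DataFrame constructor already
# knows how to unpack a compound dtype from an ndarray.
arr = np.array(data, dtype=dtype)
df = pd.DataFrame(arr)
print(df.dtypes)  # a: int32, b: float32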

@jreback (Contributor) commented Aug 5, 2013

actually, this is not currently implemented

@jreback (Contributor) commented Aug 5, 2013

this should raise for now; marking as a bug for that (it expects a single dtype, not a compound one)

@cpcloud (Member) commented Aug 5, 2013

@jreback so it should raise? what's wrong with accepting a compound dtype?

@jreback (Contributor) commented Aug 5, 2013

@cpcloud nothing wrong with accepting it, but it's a NotImplementedError (until it is implemented)

@mamikonyan (Author)

OK, thanks, guys.

@jreback (Contributor) commented Aug 5, 2013

@mamikonyan going to reopen... thanks for noticing this... we don't really have any tests for this

@jreback reopened this Aug 5, 2013
@mamikonyan (Author)

So, what do you think about the second issue of astype() throwing?

@jreback (Contributor) commented Aug 5, 2013

Same issue: it's set up to deal with a single dtype (not a compound one). The purpose is to coerce your data. What is your goal here?

@mamikonyan (Author)

I had a series where the elements were compound strings,

In [130]:
pd.Series(['A:1:3.14'])

Out [130]:
0    A:1:3.14
dtype: object

So I wanted to split them into a data frame with appropriate types, but I got this garbage:

In [131]:
pd.DataFrame(_.map(lambda s: s.split(':')).tolist(), dtype=[('a','S1'),('b','i4'),('c','f4')])
Out [131]:
             0            1                  2
0  (A, 0, 0.0)  (1, 0, 0.0)  (3, 3420462, 0.0)

I'll just have to do it column by column.

@jreback (Contributor) commented Aug 5, 2013

The split creates a Series whose elements are lists.
The apply creates a frame from this.
convert_objects with convert_numeric=True coerces strings to numbers (where it can).

In [1]: s = pd.Series(['A:1:3.14'])

In [7]: s.str.split(':').apply(Series).convert_objects(convert_numeric=True)
Out[7]: 
   0  1     2
0  A  1  3.14

In [8]: s.str.split(':').apply(Series).convert_objects(convert_numeric=True).dtypes
Out[8]: 
0     object
1      int64
2    float64
dtype: object
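
convert_objects was deprecated in later pandas versions; a rough sketch of the same pipeline on modern pandas, using str.split(expand=True) and pd.to_numeric (the coerce helper is illustrative, not part of the API):

import pandas as pd

s = pd.Series(['A:1:3.14'])

# Coerce a column to numbers where possible; columns that don't parse
# (like the letter column) stay as object.
def coerce(col):
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

df = s.str.split(':', expand=True).apply(coerce)
print(df.dtypes)  # 0: object, 1: int64, 2: float64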

@mamikonyan (Author)

Fair enough, thanks. You can't set the dtype, but at least it guesses correctly. I need precise control of the dtype because I write it out to an HDF5 file (with PyTables and df.to_records()) and I'd like to have the proper dtype from the start. Also, I didn't know about the str namespace. Thanks.

@jreback (Contributor) commented Aug 5, 2013

@mamikonyan

you can certainly change it if you like (just do it one-by-one)

are you not using HDFStore (which uses PyTables under the hood)? It will be significantly faster than using to_records, btw, not to mention that it saves indices and almost all dtypes. see here: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables
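
A minimal sketch of both suggestions, with hypothetical column names and file name:

import pandas as pd

df = pd.DataFrame({'a': ['1', '2'], 'b': ['1.2', '2.3']})

# Coerce the dtypes one column at a time, since the constructor
# does not accept a compound dtype.
df['a'] = df['a'].astype('int32')
df['b'] = df['b'].astype('float32')

# Write through HDFStore (PyTables under the hood); format='table'
# keeps the file queryable and stores the dtypes with it.
df.to_hdf('data.h5', key='df', format='table')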

@mamikonyan (Author)

I understand about the HDFStore, and I use it to read files. But unfortunately, I don't like to use it for output because it creates pandas-specific HDF5 files that look quite incomprehensible to standard tools, e.g., h5ls(1). Also, I usually don't want to save details like the index. At least I remember this being the case a few months ago; I'm not sure whether there is now a way of creating clean HDF5 files.

@jreback (Contributor) commented Aug 5, 2013

Up to you. They actually are fully compatible HDF5 files, just with extra metadata (and the indices are just columns).
If you are using PyTables, you should use its tools, ptdump and ptrepack, rather than a 'standard' HDF5 tool, which doesn't even understand the PyTables metadata.

@mamikonyan (Author)

Sure, but if you're sharing data with people who use other tools and languages to read your HDF5 files, then your metadata becomes extraneous garbage. Actually, if I could put in a feature request, I'd like to ask you guys to include an option to create plain HDF5 files. I think there is an option (I don't remember which right now) that makes slightly cleaner files, but even that puts extra stuff in.

In any case, I appreciate your help.

@ehein6 commented Sep 26, 2017

Neither of those options produces the correct output; see below. The internal lists are being stored in one cell instead of being exploded into rows. I'd be happy to know if there is an idiomatic way to load this in one step.

>>> print pd.DataFrame(json_data)
         day          humidity              temp
0     Monday  [50, 60, 70, 60]  [32, 33, 34, 34]
1    Tuesday  [50, 60, 70, 60]  [32, 33, 34, 34]
2  Wednesday  [50, 60, 70, 60]               NaN
3   Thursday  [50, 60, 70, 60]  [32, 33, 34, 34]
4     Friday               NaN  [32, 33, 34, 34]

# Converting data structure to string, could skip this step in my application
>>> print pd.read_json(json.dumps(json_data))
         day          humidity              temp
0     Monday  [50, 60, 70, 60]  [32, 33, 34, 34]
1    Tuesday  [50, 60, 70, 60]  [32, 33, 34, 34]
2  Wednesday  [50, 60, 70, 60]               NaN
3   Thursday  [50, 60, 70, 60]  [32, 33, 34, 34]
4     Friday               NaN  [32, 33, 34, 34]

@jreback (Contributor) commented Sep 26, 2017

you are doing things in a very inefficient manner by storing lists

@ehein6 commented Sep 26, 2017

I don't want to store lists. I want the DataFrame structure from the previous post. Do you know of a fast, idiomatic way to create that structure from the json I provided?

@jreback (Contributor) commented Sep 26, 2017

use read_json directly

@ehein6 commented Sep 27, 2017

Like this?

>>> print pd.read_json(json_data)
Traceback (most recent call last):
  File "example.py", line 23, in <module>
    print pd.read_json(json_data)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 322, in read_json
    encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 210, in get_filepath_or_buffer
    raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type: <type 'list'>

It doesn't work because in my example code, json_data is an object, not a path or string. This is why I called json.dumps() before passing it to pd.read_json in my previous example. But that still gives the wrong output, with lists nested inside cells of the DataFrame.

Please give me a clear example of what you mean. Again, I want a fast, idiomatic way to go from this:

json_data = [
    {"day":"Monday", "temp":[32, 33, 34, 34], "humidity": [50, 60, 70, 60]},
    {"day":"Tuesday", "temp":[32, 33, 34, 34], "humidity": [50, 60, 70, 60]},
    {"day":"Wednesday", "humidity": [50, 60, 70, 60]},
    {"day":"Thursday", "temp":[32, 33, 34, 34], "humidity": [50, 60, 70, 60]},
    {"day":"Friday", "temp":[32, 33, 34, 34]},
]

to this:

         day  humidity  temp
0     Monday      50.0  32.0
1     Monday      60.0  33.0
2     Monday      70.0  34.0
3     Monday      60.0  34.0
0    Tuesday      50.0  32.0
1    Tuesday      60.0  33.0
2    Tuesday      70.0  34.0
3    Tuesday      60.0  34.0
0  Wednesday      50.0   NaN
1  Wednesday      60.0   NaN
2  Wednesday      70.0   NaN
3  Wednesday      60.0   NaN
0   Thursday      50.0  32.0
1   Thursday      60.0  33.0
2   Thursday      70.0  34.0
3   Thursday      60.0  34.0
0     Friday       NaN  32.0
1     Friday       NaN  33.0
2     Friday       NaN  34.0
3     Friday       NaN  34.0

@jreback (Contributor) commented Sep 28, 2017

In [29]: pd.concat([pd.DataFrame(l) for l in json_data])
Out[29]: 
         day  humidity  temp
0     Monday      50.0  32.0
1     Monday      60.0  33.0
2     Monday      70.0  34.0
3     Monday      60.0  34.0
0    Tuesday      50.0  32.0
1    Tuesday      60.0  33.0
2    Tuesday      70.0  34.0
3    Tuesday      60.0  34.0
0  Wednesday      50.0   NaN
1  Wednesday      60.0   NaN
2  Wednesday      70.0   NaN
3  Wednesday      60.0   NaN
0   Thursday      50.0  32.0
1   Thursday      60.0  33.0
2   Thursday      70.0  34.0
3   Thursday      60.0  34.0
0     Friday       NaN  32.0
1     Friday       NaN  33.0
2     Friday       NaN  34.0
3     Friday       NaN  34.0

@jreback (Contributor) commented Sep 28, 2017

DataFrame.from_records above has a bit of overhead. Your basic problem is that the data starts out in a Python structure (a list of dicts of lists).

@ehein6 commented Sep 29, 2017

Right, it's the creation of DataFrames in a tight loop that causes the overhead, and unfortunately it looks like this can't be avoided without changing my input data structure. My original question was whether passing a compound dtype argument to this constructor would help a little bit, but this seems unlikely now. Thanks for taking a look!

@jreback (Contributor) commented Oct 1, 2017

@ehein6 you don't need to create a DataFrame in each iteration of the loop; just do it once at the end.

@ehein6 commented Oct 2, 2017

Your latest example is calling pd.DataFrame() inside a list comprehension. Aside from the syntactic sugar, the code I originally posted is doing exactly the same thing: building up a list of DataFrames, then calling pd.concat() to join them together. If it can indeed be done without creating DataFrames in a loop, please show me how.

@jreback (Contributor) commented Oct 2, 2017

my example is not the same as yours; you are calling DataFrame.from_records, which is very different

@ehein6 commented Oct 2, 2017

In this case they are identical. Using just pd.DataFrame() instead of pd.DataFrame.from_records() doesn't change the output. It doesn't change the performance on my full dataset, either:

Best of 20 runs:
pd.DataFrame.from_records(): 3.65 seconds
pd.DataFrame(): 3.65 seconds

Either way, both examples are still creating a DataFrame in a loop.
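
For what it's worth, a sketch of one way to avoid per-record DataFrames entirely, assuming the json_data structure above: flatten to plain row dicts in Python first, then construct a single DataFrame at the end (flatten_rows is a hypothetical helper, not pandas API):

import pandas as pd

def flatten_rows(records):
    # One output row per list position; scalar fields are repeated,
    # and missing fields are simply absent, becoming NaN in the frame.
    rows = []
    for rec in records:
        n = max((len(v) for v in rec.values() if isinstance(v, list)),
                default=1)
        for i in range(n):
            rows.append({k: (v[i] if isinstance(v, list) else v)
                         for k, v in rec.items()})
    return rows

df = pd.DataFrame(flatten_rows(json_data))  # single constructor call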

@Quetzalcohuatl

Got any ideas on how one can fix this? I have a list of dictionaries, and I'm just doing pd.DataFrame(list_of_dicts), but I would prefer to specify the dtypes because I'm low on RAM.

Also, consider changing the title of this GitHub issue? Most people here and in the other linked issues are all talking about passing a dictionary of dtypes, either during construction or after the frame is already defined. Kind of a cryptic title.
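
For later readers: astype does accept a dict mapping column names to dtypes (since pandas 0.19), which covers the post-construction case; a minimal sketch with hypothetical columns:

import pandas as pd

list_of_dicts = [{'a': 1, 'b': 1.2}, {'a': 2, 'b': 2.3}]

# astype with a dict narrows several columns at once after construction,
# which trims memory even though the constructor itself still accepts
# only a single dtype.
df = pd.DataFrame(list_of_dicts).astype({'a': 'int32', 'b': 'float32'})
print(df.dtypes)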

@Raphencoder

Any news or recommendations on this subject? It has been open for 10 years...
