
Cannot put a dataframe into hdfstore *completely* #3012

Closed
simomo opened this issue Mar 11, 2013 · 3 comments · Fixed by #3013
simomo commented Mar 11, 2013

  • I load a dataframe from MySQL:
df_bugs_activity_4w = psql.read_frame('select * from bugs_activity limit 0, 40000', conn)
  • and the structure of df_bugs_activity_4w:
In[19]: df_bugs_activity_4w
Out[19]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns:
bug_id       40000  non-null values
attach_id    13  non-null values
who          40000  non-null values
bug_when     40000  non-null values
fieldid      40000  non-null values
added        40000  non-null values
removed      40000  non-null values
id           40000  non-null values
dtypes: float64(1), int64(4), object(3)
  • then I convert the object columns:
In [60]: df_bugs_activity_4w = df_bugs_activity_4w.convert_objects()
In [61]: df_bugs_activity_4w
Out[61]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns:
bug_id       40000  non-null values
attach_id    13  non-null values
who          40000  non-null values
bug_when     40000  non-null values
fieldid      40000  non-null values
added        40000  non-null values
removed      40000  non-null values
id           40000  non-null values
dtypes: datetime64[ns](1), float64(1), int64(4), object(2)
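(`convert_objects` has since been removed from pandas; a hedged sketch of the same conversion with an explicit `pd.to_datetime` call, using a made-up stand-in for `df_bugs_activity_4w`:)

```python
import pandas as pd

# Stand-in for the SQL result: bug_when arrives as an object
# column of timestamp strings.
df = pd.DataFrame({"bug_when": ["1999-04-06 11:22:11",
                                "2001-07-18 14:11:38"]})
print(df.dtypes)  # bug_when    object

# convert_objects() (2013-era pandas) has since been removed;
# an explicit conversion does the same job for this column:
df["bug_when"] = pd.to_datetime(df["bug_when"])
print(df.dtypes)  # bug_when    datetime64[ns]
```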
  • I put it into an HDFStore and then got it back out, and found that the number of dataframe entries had changed from 40,000 to 13! That's weird. It seems that the number of non-null 'attach_id' values limits the total number of rows when the dataframe is put into an HDFStore.
In [63]: %prun store.put('df_bugs_activity_4w1', df_bugs_activity_4w, table=True)

In [64]: %time store.get('df_bugs_activity_4w1')
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.01 s
Out[64]:
bug_id  attach_id   who bug_when    fieldid added   removed id
2012     301879  0   35 1999-04-06 11:22:11  16  dev     bug     2013
2014     301879  0   35 1999-04-06 11:22:12  5   para    us 2015
2835     301879  0   56 1999-05-14 15:56:12  10  clo     op  2836
31244    301879  0   207    2001-07-18 14:11:38  10  op  clo     31245
31252    301879  0   207    2001-07-18 15:40:52  10  ana     op  31253
31283    301879  0   35 2001-07-18 21:21:33  16  lui     dev     31284
31285    301879  0   35 2001-07-18 21:21:34  15  296     10  31286
31287    301879  0   35 2001-07-18 21:21:35  5   unk     para    31288
31393    301879  0   159    2001-07-19 12:41:07  16  prat    lui     31394
31472    301879  0   207    2001-07-19 17:27:31  10  ope     ana     31473
32675    301879  0   207    2001-08-02 10:09:08  10  clos    op   32676
38609    235837  0   201    2001-09-26 20:28:11  15  310-    300-3   38610
38610    235838  0   201    2001-09-26 20:28:11  15  310-    300     38611


In [66]: store
Out[66]:
<class 'pandas.io.pytables.HDFStore'>
File path: sample_no_fill.h5
/df_bugs_4w                      frame_table  (typ->appendable,nrows->40000,ncols->52,indexers->[index])
/df_bugs_4w1                     frame_table  (typ->legacy,nrows->None,ncols->0,indexers->[])           
/df_bugs_activity_4w             frame_table  (typ->appendable,nrows->13,ncols->8,indexers->[index])    
/df_bugs_activity_4w1            frame_table  (typ->appendable,nrows->13,ncols->8,indexers->[index]) 
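The symptom can be reduced to a small sketch. The HDFStore round trip itself needs PyTables, and the bug was fixed by PR #3013, so the comments only describe it; the runnable part shows a made-up frame with a mostly-NaN `attach_id` column like the one that triggered it:

```python
import numpy as np
import pandas as pd

# Sketch of a frame like df_bugs_activity_4w: 'attach_id' is
# NULL for almost every row, and the NaNs force it to float64.
n = 40000
df = pd.DataFrame({
    "bug_id": np.arange(n),
    "attach_id": np.full(n, np.nan),
})
df.loc[:12, "attach_id"] = 301879.0  # only 13 non-null values

print(len(df))                   # 40000 rows in total
print(df["attach_id"].count())   # 13 non-null values
print(df.dtypes["attach_id"])    # float64, because of the NaNs

# With the bug, store.put('key', df, table=True) followed by
# store.get('key') returned only the 13 rows where attach_id
# was non-null; after PR #3013 all 40000 rows round-trip.
```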
jreback commented Mar 11, 2013

pls post a sample of the data as it comes out of SQL, before any other operations. Post it as a text string, EXACTLY as you have it (e.g. do a df.to_csv()), with a subset of the rows.
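One way to produce such a sample (a sketch; `df` here is a hypothetical stand-in for the frame straight out of `psql.read_frame`):

```python
import pandas as pd

# Hypothetical stand-in for the raw SQL result.
df = pd.DataFrame({"bug_id": [2012, 2014, 2835],
                   "attach_id": [None, None, None]})

# to_csv() with no path argument returns the CSV as a string,
# which can be pasted into the issue verbatim.
print(df.head(10).to_csv())
```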

jreback commented Mar 11, 2013

@simomo PR #3013 should fix your 2nd issue (there was a bug), pls give it a try and let me know.

simomo commented Mar 14, 2013

This issue has been solved. Thanks!
