Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiIndex DataFrame to_csv() ignores date_format #7791

Closed
Poquaruse opened this issue Jul 18, 2014 · 7 comments
Closed

MultiIndex DataFrame to_csv() ignores date_format #7791

Poquaruse opened this issue Jul 18, 2014 · 7 comments
Labels
Bug MultiIndex Output-Formatting __repr__ of pandas objects, to_string
Milestone

Comments

@Poquaruse
Copy link

Hi all,

I'm running pandas 0.14.1:

INSTALLED VERSIONS

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: DE

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: None
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.1
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

I'm experiencing a bug which might be related to #7622

How to reproduce the bug:
At first, a new DataFrame is created

dt_rng = pd.date_range(start='2014-01-01 00:00', periods = 1000, freq='1s')
df = pd.DataFrame({'a':np.random.randn(1000), 'b': np.random.randn(1000)},index = dt_rng)
df['b'] = df['b'].round()

Then, this DataFrame is exported as csv with a custom date_format:

df.to_csv(date_format='%Y demo')

This does work: ',a,b\n2014 demo,-0.5582228932333034,1.0\n2014 demo,[...]'

Now, the other scenario. Same DataFrame but a groupby('b') and a resample:

df.groupby(df['b']).resample('1min').to_csv(date_format='%y demo')

Here, the custom date_format is not used:
'b,,a\n-4.0,2014-01-01 00:12:00,1.571130069273494[...]'

Thanks and best regards

@jreback
Copy link
Contributor

jreback commented Jul 18, 2014

  • use the format below for grouping (this is new in 0.14.0), as its more flexible (e.g. the resample could be on a column).
  • this is a bug, but I think for another reason, the index is not being constructured 100% correctly, xref BUG: multi-index with one level with a tz buggy #7793
In [5]: df.groupby(['b',pd.Grouper(freq='1h')]).mean()
Out[5]: 
                      a
b                      
-4 2014-01-01  0.236842
-3 2014-01-01 -0.196441
-2 2014-01-01  0.084048
-1 2014-01-01 -0.079776
 0 2014-01-01  0.059750
 1 2014-01-01 -0.010162
 2 2014-01-01 -0.124559
 3 2014-01-01 -0.009600

In [6]: df.groupby(['b',pd.Grouper(freq='1h')]).mean().to_csv('test.csv',mode='w',date_format='%Y')

In [7]: !cat test.csv
b,,a
-4.0,2014-01-01 00:00:00,0.23684162290873984
-3.0,2014-01-01 00:00:00,-0.1964410632489068
-2.0,2014-01-01 00:00:00,0.08404773044520912
-1.0,2014-01-01 00:00:00,-0.07977596435616328
0.0,2014-01-01 00:00:00,0.05975025407287002
1.0,2014-01-01 00:00:00,-0.010162014106808491
2.0,2014-01-01 00:00:00,-0.1245593183893703
3.0,2014-01-01 00:00:00,-0.00959968236480397

@Poquaruse
Copy link
Author

Thanks! :-)

@jreback
Copy link
Contributor

jreback commented Jul 19, 2014

ok, since #7746 fix was just merged, this is a separate issue.

I think that the multi-index to_native_types is not passing thru the date_format (and maybe other formatters) to the sub-index formatting.

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 10, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@nbonnotte
Copy link
Contributor

@jreback i've been looking at that issue, and i would not say the difficulty is "novice"...

The issue I'm struggling with is that (if i understand things correctly) a MultiIndex is a MultiIndex whatever the type of the content, while a 1D index with datetimes is a DatetimeIndex.

So when we need to format a DatetimeIndex, DatetimeIndex._format_native_types is there and does what it has to do, knowing its values are Timestamps https://github.com/pydata/pandas/blob/d0a21032be1e92d4fcb92b8a1c16dee514cb61dc/pandas/tseries/index.py#L737

On the other hand, a MultiIndex has only a generic _format_native_types method, which basically does nothing but returning its content https://github.com/pydata/pandas/blob/d0a21032be1e92d4fcb92b8a1c16dee514cb61dc/pandas/core/index.py#L4329

I could try to go through the individual values of the MultiIndex and format all Timestamp that I encounter, but I'm not sure that's such a good idea. Probably finding a way to reuse DatetimeIndex._format_native_types would be better.

Is it possible to transform a MultiIndex into a list of regular indexes (e.g. DatetimeIndex) back and forth?

@jreback
Copy link
Contributor

jreback commented Nov 8, 2015

I'll bump the difficulty up 1 :)

here's what you have to do

redefine _format_native_types for a MulitIndex

for each level, call _format_native_types, passing thru the kwargs. these are the new levels,
then reconstruct a MultiIndexIndex and call .values on it.

In [1]: mi = pd.MultiIndex.from_product([range(3),list('abc'),date_range('20130101 09:00:00',periods=3)])
In [4]: mi.levels

Out[4]: FrozenList([[0, 1, 2], [u'a', u'b', u'c'], [2013-01-01 09:00:00, 2013-01-02 09:00:00, 2013-01-03 09:00:00]])

In [5]: mi.levels[0]
Out[5]: Int64Index([0, 1, 2], dtype='int64')

In [6]: mi.levels[1]
Out[6]: Index([u'a', u'b', u'c'], dtype='object')

In [7]: mi.levels[2]
Out[7]: DatetimeIndex(['2013-01-01 09:00:00', '2013-01-02 09:00:00', '2013-01-03 09:00:00'], dtype='datetime64[ns]', freq='D')

In [8]: mi.levels[2].to_native_types()
Out[8]: array(['2013-01-01 09:00:00', '2013-01-02 09:00:00', '2013-01-03 09:00:00'], dtype=object)

In [9]: mi.levels[2].to_native_types(date_format='%Y%m%d')
Out[9]: array(['20130101', '20130102', '20130103'], dtype=object)

something like this:

In [10]: MultiIndex(levels=[ l._format_native_types() for l in mi.levels],labels=mi.labels,sortorder=mi.sortorder,verify_integrity=False)
Out[10]: 
MultiIndex(levels=[[u'0', u'1', u'2'], [u'a', u'b', u'c'], [u'2013-01-01 09:00:00', u'2013-01-02 09:00:00', u'2013-01-03 09:00:00']],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])

In [11]: MultiIndex(levels=[ l._format_native_types() for l in mi.levels],labels=mi.labels,sortorder=mi.sortorder,verify_integrity=False).values
Out[11]: 
array([('0', 'a', '2013-01-01 09:00:00'),
       ('0', 'a', '2013-01-02 09:00:00'),
       ('0', 'a', '2013-01-03 09:00:00'),
       ('0', 'b', '2013-01-01 09:00:00'),
       ('0', 'b', '2013-01-02 09:00:00'),
       ('0', 'b', '2013-01-03 09:00:00'),
       ('0', 'c', '2013-01-01 09:00:00'),
       ('0', 'c', '2013-01-02 09:00:00'),
       ('0', 'c', '2013-01-03 09:00:00'),
       ('1', 'a', '2013-01-01 09:00:00'),
       ('1', 'a', '2013-01-02 09:00:00'),
       ('1', 'a', '2013-01-03 09:00:00'),
       ('1', 'b', '2013-01-01 09:00:00'),
       ('1', 'b', '2013-01-02 09:00:00'),
       ('1', 'b', '2013-01-03 09:00:00'),
       ('1', 'c', '2013-01-01 09:00:00'),
       ('1', 'c', '2013-01-02 09:00:00'),
       ('1', 'c', '2013-01-03 09:00:00'),
       ('2', 'a', '2013-01-01 09:00:00'),
       ('2', 'a', '2013-01-02 09:00:00'),
       ('2', 'a', '2013-01-03 09:00:00'),
       ('2', 'b', '2013-01-01 09:00:00'),
       ('2', 'b', '2013-01-02 09:00:00'),
       ('2', 'b', '2013-01-03 09:00:00'),
       ('2', 'c', '2013-01-01 09:00:00'),
       ('2', 'c', '2013-01-02 09:00:00'), ('2', 'c', '2013-01-03 09:00:00')], dtype=object)

@nbonnotte
Copy link
Contributor

Nice! I didn't know the concept of "level", and this is exactly what i was missing, thx ;)

Here we go

@jreback
Copy link
Contributor

jreback commented Nov 11, 2015

closed by #11151

@jreback jreback closed this as completed Nov 11, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug MultiIndex Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants