API: df.rolling(..).corr()/cov() when pairwise=True to return MI DataFrame #15677

Closed
wants to merge 3 commits into from

Conversation

jreback
Contributor

@jreback jreback commented Mar 13, 2017

from #15601 (comment).

Unfortunately I don't see an easy way to even deprecate this, so we simply have to switch. The good news is that old code will fail fast on access, as Panels have a different access pattern (names of indices and indexing) than MI DataFrames (and another reason to remove them :>).

@jreback jreback added API Design Numeric Operations Arithmetic, Comparison, and Logical operations Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 13, 2017
@jreback jreback added this to the 0.20.0 milestone Mar 13, 2017
@jreback jreback force-pushed the corr branch 3 times, most recently from a259eb5 to db9f2c0 Compare March 13, 2017 21:44
@codecov-io

codecov-io commented Mar 13, 2017

Codecov Report

Merging #15677 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #15677      +/-   ##
==========================================
- Coverage   90.97%   90.96%   -0.01%     
==========================================
  Files         145      145              
  Lines       49483    49487       +4     
==========================================
+ Hits        45015    45018       +3     
- Misses       4468     4469       +1
Flag                       Coverage Δ
#multiple                  88.73% <100%> (-0.01%) ⬇️
#single                    40.62% <0%> (-0.01%) ⬇️

Impacted Files             Coverage Δ
pandas/core/window.py      96.2% <100%> (+0.02%) ⬆️
pandas/indexes/multi.py    96.59% <100%> (-0.01%) ⬇️
pandas/core/indexing.py    94.01% <0%> (-0.1%) ⬇️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da0523a...0f5092c. Read the comment docs.

@jreback
Contributor Author

jreback commented Mar 14, 2017

@chrisaycock thoughts?

@chrisaycock
Contributor

I didn't understand the "pairwise=True" part of the PR title, but the actual documentation is pretty straightforward. You are just returning an MI DataFrame by transposing the Panel result under the hood. That all makes sense.

@jreback jreback force-pushed the corr branch 5 times, most recently from da60531 to 0ee6303 Compare March 22, 2017 19:41
@jreback
Contributor Author

jreback commented Mar 22, 2017

any comments?

@jorisvandenbossche
Member

I find it at first sight a bit strange to get a frame with multi-indexed columns instead of a frame with multi-indexed index.
If you have a multi-index columns, accessing one correlation matrix would also be simpler, as you don't need the unstack

(never use this, so I don't speak out of experience)

@jreback
Contributor Author

jreback commented Mar 27, 2017

I find it at first sight a bit strange to get a frame with multi-indexed columns instead of a frame with multi-indexed index.
If you have a multi-index columns, accessing one correlation matrix would also be simpler, as you don't need the unstack

what's your example?

@jorisvandenbossche
Member

So example from the whatsnew:

In [3]: df = DataFrame(np.random.rand(100, 2))

In [4]: res = df.rolling(12).corr()

In [5]: res
Out[5]: 
major    0                   1     
minor    0         1         0    1
0      NaN       NaN       NaN  NaN
1      NaN       NaN       NaN  NaN
2      NaN       NaN       NaN  NaN
3      NaN       NaN       NaN  NaN
4      NaN       NaN       NaN  NaN
5      NaN       NaN       NaN  NaN
6      NaN       NaN       NaN  NaN
7      NaN       NaN       NaN  NaN
8      NaN       NaN       NaN  NaN
9      NaN       NaN       NaN  NaN
10     NaN       NaN       NaN  NaN
11     1.0  0.131988  0.131988  1.0
12     1.0  0.115938  0.115938  1.0
13     1.0  0.142035  0.142035  1.0
14     1.0  0.160646  0.160646  1.0
15     1.0 -0.011628 -0.011628  1.0
16     1.0  0.480531  0.480531  1.0
17     1.0  0.317300  0.317300  1.0
18     1.0  0.297592  0.297592  1.0

I expected something like:

          0         1     
0  0    NaN       NaN 
   1    NaN       NaN 
1  0    NaN       NaN   
   1    NaN       NaN   
2  0    NaN       NaN   
   1    NaN       NaN 
...

But as I said, don't have experience with what would be the most practical to work with further.

@jreback
Contributor Author

jreback commented Mar 27, 2017

ok same as my example.

The reason for returning it this way is that the index is the same as the original. Otherwise it's actually a transpose. I guess this is kind of arbitrary. But IMHO this makes more sense.

In [14]: np.random.seed(1234)
    ...: df = DataFrame(np.random.rand(100, 2), columns=list('AB'))
    ...: 
    ...: 

In [15]: df.head()
Out[15]: 
          A         B
0  0.191519  0.622109
1  0.437728  0.785359
2  0.779976  0.272593
3  0.276464  0.801872
4  0.958139  0.875933

In [16]: df.rolling(12).corr().head()
Out[16]: 
major   A       B    
minor   A   B   A   B
0     NaN NaN NaN NaN
1     NaN NaN NaN NaN
2     NaN NaN NaN NaN
3     NaN NaN NaN NaN
4     NaN NaN NaN NaN

@jreback
Contributor Author

jreback commented Mar 27, 2017

ahh, you want this?

In [11]: df.rolling(12).corr().stack('minor', dropna=False).head()
Out[11]: 
major     A   B
  minor        
0 A     NaN NaN
  B     NaN NaN
1 A     NaN NaN
  B     NaN NaN
2 A     NaN NaN

@jorisvandenbossche
Member

yes, that is what I had in mind (the whatsnew example is maybe a bit confusing with its integer column names)
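
The two layouts discussed above hold the same numbers and differ only in which axis carries the extra level. A minimal sketch, assuming a pandas release that includes this change (0.20.0 or later), where `df.rolling(12).corr()` returns the long layout; the variable names `long_form` and `wide_form` are illustrative only:

```python
import numpy as np
import pandas as pd

# Same shape of data as the examples in this thread.
df = pd.DataFrame(np.random.RandomState(1234).rand(100, 2), columns=list("AB"))

# Long layout: one 2x2 correlation matrix per original row,
# stacked on a two-level row index (original index, column label).
long_form = df.rolling(12).corr()

# Wide layout: move the inner row level to the columns, which gives
# one row per original row and a two-level column index.
wide_form = long_form.unstack()

print(long_form.shape)   # two rows per original row
print(wide_form.shape)   # one row per original row, four columns
```

In the wide layout the row index matches the original frame again, which is the property the first proposal was after.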

@jreback
Contributor Author

jreback commented Mar 27, 2017

ok, let me see what I can do.

@jreback
Contributor Author

jreback commented Mar 27, 2017

updated. and here's the new example

In [1]: pd.options.display.max_rows=12

In [2]:    np.random.seed(1234)
   ...:    df = DataFrame(np.random.rand(100, 2),
   ...:                  columns=['A', 'B'],
   ...:                  index=pd.date_range('20160101', periods=100, freq='D'))
   ...:    df
   ...: 
Out[2]: 
                   A         B
2016-01-01  0.191519  0.622109
2016-01-02  0.437728  0.785359
2016-01-03  0.779976  0.272593
2016-01-04  0.276464  0.801872
2016-01-05  0.958139  0.875933
2016-01-06  0.357817  0.500995
...              ...       ...
2016-04-04  0.475567  0.344417
2016-04-05  0.640880  0.126205
2016-04-06  0.171465  0.737086
2016-04-07  0.127029  0.369650
2016-04-08  0.604334  0.103104
2016-04-09  0.802374  0.945553

[100 rows x 2 columns]

In [3]: df.rolling(12).corr()
Out[3]: 
                         A         B
major      minor                    
2016-01-01 A           NaN       NaN
           B           NaN       NaN
2016-01-02 A           NaN       NaN
           B           NaN       NaN
2016-01-03 A           NaN       NaN
           B           NaN       NaN
...                    ...       ...
2016-04-07 A      1.000000 -0.132090
           B     -0.132090  1.000000
2016-04-08 A      1.000000 -0.145775
           B     -0.145775  1.000000
2016-04-09 A      1.000000  0.119645
           B      0.119645  1.000000

[200 rows x 2 columns]

@jreback
Contributor Author

jreback commented Mar 27, 2017

I think maybe we should completely zonk the index level names. I am not sure what to do with them w/o making it look weird. The problem is that the 2nd level AND the columns are named the same, which looks odd when you print it. Could name the 1st level though.

@jorisvandenbossche
Member

Yes, the 'major' and 'minor' do not necessarily make sense anymore, as this is rather Panel-specific terminology.

@chrisaycock Do you think this shape makes sense? (compared to the initial proposal in the PR?)

Unfortunately I don't see an easy way to even deprecate this and we simply have to switch.

@jreback We could easily add a keyword for switching this behaviour, and raise a deprecation warning on the default value, indicating they can change behaviour + suppress warning by specifying the keyword.
But, this approach is always a bit ugly. Not sure if it is needed in this case.

@chrisaycock
Contributor

The major/minor names are weird since they don't come from the user. It's hard to follow what those are from just looking at a code sample.

@jreback
Contributor Author

jreback commented Mar 28, 2017

so here is an easy thing to do. I have annotated names to make this explicit

In [8]:    pd.options.display.max_rows=12
   ...:    np.random.seed(1234)
   ...:    df = pd.DataFrame(np.random.rand(100, 2),
   ...:                      columns=pd.Index(['A', 'B'], name='bar'),
   ...:                      index=pd.date_range('20160101',
   ...:                                          periods=100, freq='D', name='foo'))
   ...:    df2 = df.copy()
   ...:    df2.columns = pd.Index(['A', 'B'], name='bar2')
   ...:    df2.index = pd.date_range('20160101', periods=100, freq='D', name='foo2')
   ...: 
   ...: 

In [9]: df.rolling(12).corr(df2, pairwise=True)
Out[9]: 
bar2                   A         B
bar        foo                    
2016-01-01 A         NaN       NaN
           B         NaN       NaN
2016-01-02 A         NaN       NaN
           B         NaN       NaN
2016-01-03 A         NaN       NaN
           B         NaN       NaN
...                  ...       ...
2016-04-07 A    1.000000 -0.132090
           B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[200 rows x 2 columns]

I would do this (but maybe zonk the column name). This can actually be confusing if you do the typical cross-corr.

In [10]: df.rolling(12).corr()
Out[10]: 
bar                    A         B
bar        foo                    
2016-01-01 A         NaN       NaN
           B         NaN       NaN
2016-01-02 A         NaN       NaN
           B         NaN       NaN
2016-01-03 A         NaN       NaN
           B         NaN       NaN
...                  ...       ...
2016-04-07 A    1.000000 -0.132090
           B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[200 rows x 2 columns]

@jreback
Contributor Author

jreback commented Mar 28, 2017

ok latest push gives [10].

@jorisvandenbossche
Member

In the [10] above, shouldn't the index level names 'bar' and 'foo' be switched?

@jreback
Contributor Author

jreback commented Mar 29, 2017

@jorisvandenbossche this is on latest. I had them switched (in error) before when I did that example.

result.index.names = [index name, columns name] of the source df
result.columns.name = None

In [2]: pd.options.display.max_rows=12
   ...: np.random.seed(1234)
   ...: df = pd.DataFrame(np.random.rand(100, 2),
   ...:                         columns=pd.Index(['A', 'B'], name='bar'),
   ...:                         index=pd.date_range('20160101',
   ...:                                              periods=100, freq='D', name='foo'))
   ...: df2 = df.copy()
   ...: df2.columns = pd.Index(['A', 'B'], name='bar2')
   ...: df2.index = pd.date_range('20160101', periods=100, freq='D', name='foo2')
   ...:     

In [3]: df.rolling(12).corr()
Out[3]: 
                       A         B
foo        bar                    
2016-01-01 A         NaN       NaN
           B         NaN       NaN
2016-01-02 A         NaN       NaN
           B         NaN       NaN
2016-01-03 A         NaN       NaN
           B         NaN       NaN
...                  ...       ...
2016-04-07 A    1.000000 -0.132090
           B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[200 rows x 2 columns]

In [4]: df.rolling(12).corr(df2, pairwise=True)
Out[4]: 
                       A         B
foo        bar                    
2016-01-01 A         NaN       NaN
           B         NaN       NaN
2016-01-02 A         NaN       NaN
           B         NaN       NaN
2016-01-03 A         NaN       NaN
           B         NaN       NaN
...                  ...       ...
2016-04-07 A    1.000000 -0.132090
           B   -0.132090  1.000000
2016-04-08 A    1.000000 -0.145775
           B   -0.145775  1.000000
2016-04-09 A    1.000000  0.119645
           B    0.119645  1.000000

[200 rows x 2 columns]
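
The naming rule stated in this comment can be checked directly. A minimal sketch, assuming a pandas release that includes this change; the names `foo` and `bar` are taken from the example above. The row-level names are printed and should be the source frame's index name followed by its columns name. What happens to the columns name was still being discussed in this thread, so the sketch only prints it rather than assuming a value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.RandomState(1234).rand(100, 2),
    columns=pd.Index(["A", "B"], name="bar"),
    index=pd.date_range("20160101", periods=100, freq="D", name="foo"),
)

res = df.rolling(12).corr()

# Row levels: first the source index name, then the source columns name.
index_names = list(res.index.names)
print(index_names)

# Inspect the columns name instead of relying on a particular value.
print(res.columns.name)
```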

@jreback
Contributor Author

jreback commented Apr 7, 2017

closing in favor of #15601 (which incorporates these commits)

@jreback jreback closed this Apr 7, 2017
@jorisvandenbossche
Member

@jreback In your Out[3], the columns lose their name. Keeping the name would duplicate the 'bar' in this case, but I think that is OK (losing the name seems worse?)

@jreback
Contributor Author

jreback commented Apr 7, 2017

@jreback In your Out[3], the columns lose their name. Keeping the name would duplicate the 'bar' in this case, but I think that is OK (losing the name seems worse?)

The problem is it will always duplicate if you have the same frame (e.g. in df.rolling(12).corr() it's against itself). In the case of using another frame, df.rolling(12).corr(df2), I can see this.

I am explicitly setting this to None. I'll push a change (on the deprecate PR now) with this soon.

@jorisvandenbossche
Member

jorisvandenbossche commented Apr 7, 2017

Yes, you will always get a double name in that case, but I don't think that is much worse (the actual column labels are also duplicated, so that seems only consistent).

It would also be consistent with the non-rolling corr:

In [3]: df = pd.DataFrame(np.random.randn(10,2), columns=['A', 'B'])

In [5]: df.corr()
Out[5]: 
          A         B
A  1.000000 -0.089014
B -0.089014  1.000000

In [6]: df.columns.name = 'name'

In [7]: df.corr()
Out[7]: 
name         A         B
name                    
A     1.000000 -0.089014
B    -0.089014  1.000000
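
The comparison with the non-rolling `corr` can be made concrete. A rough sketch, assuming a pandas release that includes this change: the matrix stored for the last row of the rolling result should match a plain `corr()` over the last 12 rows, and the plain result carries the columns name on both axes, as in the output above. The variable names `plain` and `rolling_last` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.RandomState(1234).rand(100, 2),
    columns=pd.Index(["A", "B"], name="name"),
)

# Non-rolling correlation over the final 12 rows.
plain = df.iloc[-12:].corr()

# The 2x2 block that the rolling result stores for the final row label.
rolling_last = df.rolling(12).corr().loc[df.index[-1]]

print(plain.index.name, plain.columns.name)
print(np.allclose(plain.values, rolling_last.values))
```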

jreback added a commit that referenced this pull request Apr 7, 2017
closes #13563
on top of #15677

Author: Jeff Reback <jeff@reback.net>

Closes #15601 from jreback/panel and squashes the following commits:

04104a7 [Jeff Reback] fine grained catching warnings in tests
f8800dc [Jeff Reback] add numpy reference for searchsorted
fa136dd [Jeff Reback] doc correction
c39453a [Jeff Reback] add perf optimization in searchsorted for FrozenNDArray
0e9c4a4 [Jeff Reback] fix docs as per review & column name changes
3df0abe [Jeff Reback] remove Panel from doc-strings, catch internal warning on Panel construction
755606d [Jeff Reback] more docs
d04db2e [Jeff Reback] add deprecate_panel section to docs
538b8e8 [Jeff Reback] pep fix
912d523 [Jeff Reback] TST: separate out test_append_to_multiple_dropna to two tests; when drop=False this is sometimes failing
a2625ba [Jeff Reback] remove most Term references in test_pytables.py
cd5b6b8 [Jeff Reback] DEPR: Panel deprecated
6b20ddc [Jeff Reback] fix names on return structure
f41d3df [Jeff Reback] API: df.rolling(..).corr()/cov() when pairwise=True to return MI DataFrame
84e788b [Jeff Reback] BUG/PERF: handle a slice correctly in get_level_indexer
@stanleyng8

stanleyng8 commented Jun 24, 2017

Apologies if this is not the right place to ask a question about the above change. Pairwise rolling correlation used to give a Panel, whose shape is three numbers. With the above change, it returns a 2-dimensional DataFrame. My question is: what is the easiest way to update legacy code? For example, panel_corr.ix[:, 0, 0] and panel_corr.ix[i, :, :]. What do they look like in DataFrame terms? I could try coming up with some arithmetic to translate from a 3-dimensional Panel to a 2-dimensional DataFrame, but that seems rather inefficient and error-prone. Can I turn the new multi-index DataFrame output into a 3-dimensional object so that the legacy code would work?

@jreback
Contributor Author

jreback commented Jun 24, 2017

you should read the whatsnew note and section on how to index
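
For readers with the same question, here is a hedged sketch of how the two legacy lookups might translate. It assumes a pandas release that includes this change, and it assumes the old Panel kept the time index on its first (items) axis, so that panel_corr.ix[i, :, :] was one correlation matrix and panel_corr.ix[:, 0, 0] was one pair followed over time. The names `matrix`, `pair`, and `cube` are illustrative, not part of pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.RandomState(1234).rand(100, 2),
    columns=["A", "B"],
    index=pd.date_range("20160101", periods=100, freq="D"),
)
res = df.rolling(12).corr()

# One full correlation matrix for a single date
# (roughly what panel_corr.ix[i, :, :] used to return).
matrix = res.loc[df.index[-1]]

# One pair of columns followed over time
# (roughly what panel_corr.ix[:, 0, 1] used to return).
pair = res.xs("A", level=1)["B"]

# A plain 3-D array, in case legacy code wants positional access.
# Rows are ordered date by date, so the reshape keeps each matrix together.
n_cols = len(df.columns)
cube = res.values.reshape(len(df), n_cols, n_cols)

print(matrix.shape, pair.shape, cube.shape)
```

The NumPy array drops all labels, so it is only a stopgap for positional code; label-based access through `.loc` and `.xs` keeps the dates and column names attached.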

5 participants