ENH: automatic rpy2 instance conversion #7385

Closed
wants to merge 1 commit into from


sinhrks
Member

@sinhrks sinhrks commented Jun 7, 2014

Derived from #7309. Create a wrapper for robjects.r in pandas.rpy.common to perform automatic pandas DataFrame and Series conversion. Series will be converted to R data.frame to preserve rownames (index).

If this looks OK, I'll modify the docs (#7309) based on the following API.

import numpy as np
import pandas as pd
import pandas.rpy.common as com

iris = com.load_data('iris')
com.r.assign('iris', iris)
returned = com.r['iris']
type(returned)
# <class 'pandas.core.frame.DataFrame'>

df = pd.DataFrame(np.random.randn(20, 5),
                  index=pd.date_range(start='2011/01/01', freq='D', periods=20))
com.r.assign('df', df)
returned = com.r['df']
type(returned)
# <class 'pandas.core.frame.DataFrame'>

s = pd.Series(np.random.randn(20), name='test')
com.r.assign('s', s)
returned = com.r['s']
type(returned)
# <class 'pandas.core.frame.DataFrame'>

def __getattribute__(self, attr):
    if attr == 'assign':
        return _assign
    return robj.r.__getattribute__(attr)
Member


It might be better to use the interface provided, i.e., instead of robj.r.__getattribute__(attr), just do getattr(robj.r, attr). Same for the methods below: just call their respective top-level functions or behavior as you would if you were a user. Sometimes Python itself performs operations on the result of a special method call. For example, for rich comparisons, if a comparison method returns NotImplemented, Python tries the reflected method on the other operand, and for == and != it falls back to comparing the identities of the two objects if both return NotImplemented. This is done internally in Python, but if you call a method like __eq__ directly you don't get this convenience.
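
A minimal sketch of the difference, for illustration only. The Meters class is hypothetical and has nothing to do with pandas or rpy2; it only shows what a direct special-method call returns compared with the operator or built-in.

```python
class Meters:
    """Hypothetical value type, used only to illustrate the point above."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Decline comparisons with unrelated types instead of guessing.
        if not isinstance(other, Meters):
            return NotImplemented
        return self.value == other.value


m = Meters(3)

# Calling the special method directly exposes the raw sentinel.
direct = m.__eq__(3)

# The operator lets Python try the reflected comparison and then fall
# back to an identity check, so the caller gets a plain bool.
via_operator = (m == 3)

# The same principle applies to attribute lookup: getattr() goes through
# the normal lookup protocol instead of calling __getattribute__ by hand.
same_attr = getattr(m, "value") == m.__getattribute__("value")
```

Here the direct call hands back NotImplemented, while the operator produces an ordinary boolean.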

Member Author


Thanks, modified.

@jorisvandenbossche
Member

See also comment of @sinhrks here: #7309 (comment):

I've briefly checked pandas2ri in rpy2 2.4.0, and found that the current pandas conversion looks better. pandas2ri doesn't convert a returned rpy2 DataFrame automatically, and may raise ValueError for a DatetimeIndex.

I think we have to decide where we want this conversion machinery to live (because now you have one in ipython magic (but that is moved to rpy2), rpy2 and pandas):

  • If we think the place is rpy2, we should rather try to improve pandas2ri:
    • maybe port some of the code in pandas.rpy.common to rpy2, so the things that are now converted better with convert_to_r_dataframe can be done with pandas2ri in the future.
    • in that regard the functionality in this PR somewhat replicates pandas2ri.activate() (at least for the pandas->r case; the other direction is indeed handled differently in rpy2)
  • If we want this in pandas, then I think this PR is certainly OK. But ideally we shouldn't implement similar functionality in both places.

@lgautier @davclark

@davclark
Contributor

davclark commented Jun 8, 2014

I'm leaving town for a week, so I'll pick this up next weekend, but wanted to let folks know that rpy2 needs to have some machinery for R -> python conversion (obviously), and so it makes the most sense to me to have the code live there, and I'm pretty sure any reasonable patch would be happily accepted.

You can see that the rmagic code (in the process of being deprecated in IPython, now living in rpy2.ipython) hands all conversion over to ro.conversion.ri2ro. So, to do this in rpy2, the idea would be to make pandas2ri.activate() set up better conversion in the dynamically patched ri2ro function.

I actually opened up an issue about this, as my memory was that things were better than they currently are! I haven't had time to go digging though:

https://bitbucket.org/lgautier/rpy2/issue/206/numpy2ri-pandas2ri-no-longer-properly

For what it's worth, I think if we have pandas installed (and invoke pandas2ri.activate()), a pandas.Series is a much better choice for conversion of R lists and vectors than a numpy object, as you get a proper index.

@lgautier
Contributor

lgautier commented Jun 8, 2014

(...)

I'm pretty sure any reasonable patch would be happily accepted.
(...)

So am I.

@sinhrks
Member Author

sinhrks commented Jun 17, 2014

Maybe interface and conversion logic should be discussed separately.

Conversion Functions

Currently the pandas conversion looks better to me. I agree the two should be merged in the future, and we should decide in which module the conversion functions are maintained. I think the conversion relies more on the type of Index and on the pandas version, so it may be better to keep the conversion logic in pandas and call it from rpy2?

Conversion Interface

In my use case, I sometimes want to handle raw rpy2 values and at other times want automatic conversion. Because pandas2ri overwrites all of rpy2's default conversion functions, I have to activate and deactivate it every time, depending on the operation. Thus I prefer wrapping r on the pandas side so that it performs automatic conversion, which I think is natural.
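
The idea of a separate, converting entry point next to the raw one can be sketched roughly as follows. Everything here is a hypothetical stand-in: ConvertingProxy is not part of pandas or rpy2, a plain dict plays the role of the raw R environment, and list() plays the role of the conversion function.

```python
class ConvertingProxy:
    """Hypothetical wrapper: converts values on the way out and leaves
    the wrapped (raw) object untouched."""

    def __init__(self, raw, convert):
        self._raw = raw
        self._convert = convert

    def __getitem__(self, name):
        # Item access goes through the conversion function.
        return self._convert(self._raw[name])

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, so everything else is
        # delegated to the raw object through the public getattr().
        return getattr(self._raw, attr)


# Stand-ins: a dict for the raw environment, list() for the converter.
raw_env = {"x": (1, 2, 3)}
converting_env = ConvertingProxy(raw_env, list)

raw_value = raw_env["x"]               # still the unconverted tuple
converted_value = converting_env["x"]  # converted on access
names = list(converting_env.keys())    # delegated to the raw object
```

With two access paths, the caller picks raw or converted values by choosing the object, without toggling any global state.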

@jreback
Contributor

jreback commented Jun 22, 2014

@sinhrks @jorisvandenbossche what's the status on this?

@davclark
Contributor

Saw this question was unanswered while checking into another issue. It should be noted that @lgautier fixed an issue with already existing code to convert pandas DataFrames automatically into rpy2 wrapped function calls.

The logic for the direction rpy2 has moved is that conversion to (other) python objects has been deprecated in favor of rpy2 proxy objects (wrapped R objects) supporting the array interface so numpy calls work directly on rpy2.robjects objects. And if you want a true numpy.array, you can just use numpy.asarray.

It's less obvious how to do that in pandas as there's nothing equivalent to the standard array / buffer API for tables of data.

The other piece is that we've been talking about moving to a generics approach to handling conversion on the rpy2 end in the future.

So, that's the state of things on the rpy2 side. Probably in any case it's good to have the code that inspects the guts of R objects live in rpy2. If folks want to coordinate, that'd be great. In particular, no one has asked for anything on the rpy2 side, right?

@jorisvandenbossche
Member

Conversion functions

@davclark Do you mean that the future of the pandas2ri module in rpy2 is uncertain? (as this does not fit in the generic approach?)
The question on the conversion functions is where this should live, in pandas or in rpy2? So in fact, that is asking something on the rpy2 side, as the current conversion functions in rpy2 are lacking in some ways and should be improved if we decide that it should live in rpy2 (or at least accepting PRs).

@sinhrks I think you could also say the conversion depends more on the internals of the rpy2 objects and so rpy2 version, and should only use public pandas API. But if more contributors of pandas are interested in keeping this up to date, it is maybe easier to do it here.

@davclark What do you think of the conversion interface issue raised by @sinhrks above?

@davclark
Contributor

The functionality of rpy2.pandas2ri.activate() should remain about the same. The infrastructure that supports it should become more robust and extensible via generics. This not-yet-implemented generic system would be a good place for pandas code to modify conversion to and from R.

My feeling is that advanced users like @sinhrks would be better served by using the conversion functions directly (pandas2ri.pandas2ri() and pandas2ri.ri2pandas()), rather than activating and deactivating (i.e., swapping functions assigned to a given symbol). Note that there is no longer a general ri2py, as one can use ri2ro to get an object that supports the array interface. From there, it is easy to do numpy.asarray(). However, it seems maybe ri2py should come back if there is strong demand.
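
As a rough illustration of the two styles contrasted above (swapping the function bound to a shared symbol versus calling a conversion function directly), here is a sketch using stand-ins only. The conversion namespace, convert_hook, activate(), deactivate(), and fetch() are all hypothetical and do not reproduce rpy2's actual module layout.

```python
import types


def identity(obj):
    return obj


def tuple_to_list(obj):
    # Stand-in for a "richer" conversion function.
    return list(obj) if isinstance(obj, tuple) else obj


# Stand-in for a module that exposes the active conversion as an attribute.
conversion = types.SimpleNamespace(convert_hook=identity)


def activate():
    # Rebinds the shared symbol: every caller is affected from now on.
    conversion.convert_hook = tuple_to_list


def deactivate():
    conversion.convert_hook = identity


def fetch(value):
    """Stand-in for library code that always routes results through the hook."""
    return conversion.convert_hook(value)


raw = (1, 2, 3)

# Style 1: toggle the global hook around the operation.
activate()
while_active = fetch(raw)
deactivate()
after_deactivate = fetch(raw)

# Style 2: leave the hook alone and convert explicitly where needed.
explicit = tuple_to_list(fetch(raw))
```

The explicit style keeps the default behaviour for everyone else, at the cost of one extra call at each conversion site.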

@sinhrks - is there a reason that simply using the functions directly doesn't work for you?

Can someone provide a conceptual diff between those pandas2ri functions and the ones in pandas.rpy? I know there are things I'd like to see in rpy2, for example multi-index support (it's not clear what the right way to do this is!). I'm not sure why a pandas.Series should be a data.frame in R, as R vectors and lists have names().

@jorisvandenbossche, sorry if I came across as snarky. Does someone want to provide a PR against rpy2? We had a strange default branch for a while, but it's been rebased onto the 2.4.x branch, and is now targeting a 2.5.x release. So default is a good place to start (equivalent of master on git). Or, an answer to the above "conceptual diff" (or issues on the rpy2 issue tracker) would be enough to get us going in the right direction.

@jorisvandenbossche
Member

@davclark Ah, I didn't interpret you as snarky! Sorry if I implied that I did :-) Your input is certainly valued!

@jreback jreback added this to the 0.15.0 milestone Jun 26, 2014
@sinhrks
Member Author

sinhrks commented Jun 26, 2014

@davclark Ah, what I meant is that I want to perform automatic conversion in different ways, sometimes to numpy and at other times to pandas, etc. And I'd rather not call each raw function like pandas2ri, or activate/deactivate every time. My idea is to prepare separate input paths (such as robjects.r and pandas.rpy.common.r), each of which performs automatic conversion in its own way. But I'm fine with whatever is possible.

And I agree that a Series should be converted to a vector; I'll fix this.

@davclark
Contributor

Thanks @sinhrks. That clarifies your concerns. It strikes me that this might be best expressed via a context manager... Can you provide the two use-cases or user models that would differentiate between the rpy2 model and the pandas model? It would be good to be clear on that as we coordinate.

@lgautier
Contributor

@sinhrks

@davclark Ah, what I meant is that I want to perform automatic conversion in different ways, sometimes to numpy and at other times to pandas, etc. And I'd rather not call each raw function like pandas2ri, or activate/deactivate every time. My idea is to prepare separate input paths (such as robjects.r and pandas.rpy.common.r), each of which performs automatic conversion in its own way. But I'm fine with whatever is possible.

And I agree that a Series should be converted to a vector; I'll fix this.

"automatic" conversion that would change its conversion logic is possible with the existing conversion infrastructure in rpy2. You just have to make your own conversion logic and register it.
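
To make "write your own conversion logic and register it" concrete, here is a sketch that uses the standard library's functools.singledispatch as a stand-in registry. This is not rpy2's registration mechanism; NamedVector and to_python are hypothetical names chosen for illustration.

```python
from functools import singledispatch


class NamedVector:
    """Hypothetical stand-in for an R vector that carries names()."""

    def __init__(self, names, values):
        self.names = names
        self.values = values


@singledispatch
def to_python(obj):
    # Default rule: pass unknown objects through unchanged.
    return obj


@to_python.register(NamedVector)
def _named_vector_to_dict(obj):
    # Custom rule registered for one specific type.
    return dict(zip(obj.names, obj.values))


@to_python.register(tuple)
def _tuple_to_list(obj):
    # Containers convert their items with whatever rules are registered.
    return [to_python(item) for item in obj]


vec = NamedVector(["a", "b"], [1.0, 2.0])
converted = to_python(vec)
nested = to_python((vec, 5))
untouched = to_python("text")
```

New rules can be added by a third-party module without editing the dispatcher itself, which is the property a registry-based design is after.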

Should you want to have your own conversion rules that disregard the existing conversion, this is also possible. As a module owner you can decide on the way it should be done: this is between you and your users. In the present case, it may be worth looking at how the existing conversion in rpy2 could address your needs, and suggesting changes where it does not.

The case of explicitly parallel and active conversion rules is not very well addressed by the current design in rpy2 (as it relies on the fact that imported modules are singletons, and the active conversion is always at rpy2.robjects.conversion.<function>).

@lgautier
Contributor

@davclark

Thanks @sinhrks. That clarifies your concerns. It strikes me that this might be best expressed via a context manager... Can you provide the two use-cases or user models that would differentiate between the rpy2 model and the pandas model? It would be good to be clear on that as we coordinate.

Using a context manager would be an elegant idea. The only potential issue would be if several threads are used, as the conversion system would be modified "globally", even if encapsulated in a context.
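
A sketch of what such a context manager could look like, under stated assumptions: use_converter(), convert(), and _active_converter are hypothetical names, not rpy2 API. The sketch stores the active converter in a contextvars.ContextVar rather than a module-level attribute, which is one possible way to avoid a single shared global; whether that fits rpy2's design is not something this thread settles.

```python
import contextlib
import contextvars


def _identity(obj):
    return obj


# Active converter, tracked per execution context instead of as a
# plain module attribute.
_active_converter = contextvars.ContextVar(
    "active_converter", default=_identity
)


@contextlib.contextmanager
def use_converter(func):
    """Temporarily make func the active converter."""
    token = _active_converter.set(func)
    try:
        yield
    finally:
        # Restore the previous converter even if the body raised.
        _active_converter.reset(token)


def convert(obj):
    return _active_converter.get()(obj)


before = convert((1, 2))
with use_converter(list):
    inside = convert((1, 2))
after = convert((1, 2))
```

The try/finally guarantees the previous converter is restored on exit, so nested or failing blocks do not leak their conversion rules.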

@davclark
Contributor

Just to touch base, I'm spending some time with @mrocklin thinking about how to do general conversion. He and some folks at Continuum are working on a project you've likely heard of called blaze, which in particular contains a simple conversion system called "into" that exercises @mrocklin's multiple dispatch mechanism. There's a related package called dynd, which we're looking at as a way to sensibly handle things like missing data for conversion to R. We're also discussing difficulties that arise with multi-indices.

But he seems willing to break out "into" as a separate project, and this could perhaps be a way to coordinate conversion between data-frame (and other) packages like pytables, pandas, R, SQL, etc.

In any case, I'd still love to hear a bit more about what kind of API people would like to see.

@jreback
Contributor

jreback commented Jan 18, 2015

@sinhrks can you rebase / update

what is the status of this?

@jreback
Contributor

jreback commented Jan 25, 2015

@sinhrks what's the status of this?

@sinhrks
Member Author

sinhrks commented Jan 31, 2015

@jreback @jorisvandenbossche Based on #9187, is direct conversion now maintained in rpy2? If so, I'll forward any requests to rpy2.

@jreback
Contributor

jreback commented Mar 8, 2015

See #9602: we are deprecating this in 0.16.0 and redirecting to rpy2 for future conversions.

@jreback jreback closed this Mar 8, 2015
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, 0.16.1 Mar 8, 2015
@sinhrks sinhrks deleted the rapi branch March 31, 2015 13:31
@sinhrks sinhrks restored the rapi branch March 31, 2015 13:31
@sinhrks sinhrks deleted the rapi branch November 7, 2015 03:43
Labels
IO Data IO issues that don't fit into a more specific label

6 participants