Conversion of DataFrame to R's data.frame #350
Comments
I've been experimenting with alternative solutions so far, but at the moment most of them, save for the dict intermediate, are horribly slow.
I think I might get some code to merge in for 0.8.0, but I'd need to adapt my unit tests (which use unittest) to pandas. Is there any document on how unit tests are handled in pandas, and are there any guidelines to follow?
Progress: this is the current implementation. It is not integrated in my pandas clone yet, as I have no idea how to handle things like MultiIndex, which I don't use in my normal workflow. The main advantage is that NaNs are now translated to R's NA:

    import numpy as np
    import rpy2.robjects as robjects
    from rpy2.robjects.packages import importr
    import rpy2.rlike.container as rlc

    def dataset_to_data_frame(dataset, strings_as_factors=True):
        base = importr("base")
        # Type casting is more efficient than rpy2's own numpy2ri.
        # np.float, np.int and np.str were aliases of the Python builtins
        # (removed in NumPy 1.24); the NumPy scalar types are listed instead.
        vectors = {np.float64: robjects.FloatVector,
                   np.float32: robjects.FloatVector,
                   np.int64: robjects.IntVector,
                   np.int32: robjects.IntVector,
                   np.object_: robjects.StrVector,
                   np.str_: robjects.StrVector}
        columns = rlc.OrdDict()
        for column in dataset:
            value = dataset[column]
            value = vectors[value.dtype.type](value)
            # These SHOULD be fast as they use vector operations
            if isinstance(value, robjects.StrVector):
                value.rx[value.ro == "nan"] = robjects.NA_Character
            else:
                value.rx[base.is_nan(value)] = robjects.NA_Logical
            if not strings_as_factors:
                value = base.I(value)
            columns[column] = value
        dataframe = robjects.DataFrame(columns)
        del columns
        dataframe.rownames = robjects.StrVector(dataset.index)
        return dataframe
I'm an rpy2 user. This is what I'm using to go between pandas and rpy2:
This:
I'm not sure if the call to
The call to "importr" can be replaced with a much faster lookup:

    I = robjects.baseenv.get("I")
    is_nan = robjects.baseenv.get("is.nan")

But yes, it is necessary if you deal with, e.g., "omics" data where you have primary identifiers (the index) and a series of non-float columns (annotation) alongside measurements (floats). If you handle strings as factors and later convert them back to Python objects, you will get (unless you're careful) a list of ints rather than of strings. Also, "pandas.notnull" doesn't work with strings (and R has an NA character type, again useful for annotations). Of course (hopefully!) this will become much simpler when numpy adopts a missing data type.
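The factor round trip mentioned above can be illustrated without R. R stores a factor as 1-based integer codes plus a vector of levels, so reading back only the codes yields ints. The following is a minimal pure-Python model of that idea, not rpy2 code; `encode_factor` and `decode_factor` are hypothetical names, and missing values are ignored for brevity.

```python
def encode_factor(values):
    """Model an R factor: sorted unique levels plus 1-based integer codes."""
    levels = sorted(set(values))
    position = {level: index + 1 for index, level in enumerate(levels)}
    codes = [position[value] for value in values]
    return codes, levels


def decode_factor(codes, levels):
    """Map the 1-based codes back to their string levels."""
    return [levels[code - 1] for code in codes]


annotation = ["kinase", "ligase", "kinase"]
codes, levels = encode_factor(annotation)

# Reading only the codes loses the strings; the levels are needed as well.
restored = decode_factor(codes, levels)
```

Wrapping the strings with "I" on the R side avoids the factor conversion entirely, which is why the `strings_as_factors` switch exists in the converter.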
Okay, fair enough on the need to use I think it's a mistake to use R's definition of missing data when converting from pandas, though -- since Finally, I'd definitely move the precomputed values (
After reviewing my own code that uses this, I noticed that It's there mostly for "historical" reasons: probably it can be substituted by pandas' own notnull.
    import numpy as np
    from pandas import notnull
    import rpy2.robjects as robjects
    import rpy2.rlike.container as rlc

    # Looked up once at import time instead of on every call
    I = robjects.baseenv.get("I")
    # np.float, np.int and np.str were aliases of the Python builtins
    # (removed in NumPy 1.24); the NumPy scalar types are listed instead.
    VECTOR_TYPES = {np.float64: robjects.FloatVector,
                    np.float32: robjects.FloatVector,
                    np.int64: robjects.IntVector,
                    np.int32: robjects.IntVector,
                    np.object_: robjects.StrVector,
                    np.str_: robjects.StrVector}

    def dataset_to_data_frame(dataset, strings_as_factors=True):
        columns = rlc.OrdDict()
        for column in dataset:
            values = dataset[column]
            value_type = values.dtype.type
            # Replace missing items with R's NA before building the vector
            values = [item if notnull(item) else robjects.NA_Logical for item in values]
            values = VECTOR_TYPES[value_type](values)
            if not strings_as_factors:
                values = I(values)
            # Fixed: this previously assigned the undefined name "value"
            columns[column] = values
        dataframe = robjects.DataFrame(columns)
        del columns
        dataframe.rownames = robjects.StrVector(dataset.index)
        return dataframe

Here's another version. I should try to port my own unit tests for this to pandas....
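The per-item missing-value replacement in the version above is the part that is easiest to unit test without an R installation. The following is a rough sketch under stated assumptions: `NA` is a hypothetical stand-in for `robjects.NA_Logical`, and `is_missing` is a simplified stand-in for the negation of `pandas.notnull` that only recognises `None` and float NaN.

```python
import math

# Hypothetical sentinel standing in for robjects.NA_Logical
NA = object()


def is_missing(item):
    """Simplified missing check: None or a float NaN (assumption, not pandas)."""
    return item is None or (isinstance(item, float) and math.isnan(item))


def replace_missing(values, na=NA):
    """Return a new list with every missing item replaced by the NA sentinel."""
    return [na if is_missing(item) else item for item in values]


floats = replace_missing([1.5, float("nan"), 3.0])
strings = replace_missing(["a", None, "c"])
```

A test for the real converter could follow the same shape, comparing each converted column against the expected R vector.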
Since I don't use stuff like MultiIndex etc.: how do those DataFrames get converted by convert_robj?
Although I have already produced code for this (see below) I'm posting this as an issue rather than a pull request to discuss the design, because there are some issues open in my code:
The code in the current form is posted below. If there is interest, I will work towards integrating it in pandas.rpy.common and add unit tests.