This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

DESIGN: Wishlist from scikit-learn, keras, tensorflow? #52

Open
wesm opened this issue Oct 18, 2016 · 1 comment

wesm commented Oct 18, 2016

What can pandas provide in the way of a C/C++/Cython API to better enable upstack ML / statistical libraries? @ogrisel @amueller, who might have some good perspectives?

amueller commented Oct 18, 2016

Thanks for reaching out :)

I think @ogrisel has thought more about this than me, and maybe @GaelVaroquaux and @jnothman too. Just thinking out loud for now.

I guess there are two main reasons why we would like to have access to dataframes on a C/Cython level:

  • We want to avoid the memory copy that comes with converting the data into a numpy array.
  • We want to use pandas features like categorical variables or missing values.

For the first one, we ideally wouldn't want to write any pandas-specific code. So if a dataframe could provide a Cython typed memoryview interface, that might solve the use case, though the question is whether that might be a lot slower than doing a copy if the memory is not aligned nicely?
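
A rough sketch of that trade-off, using a plain numpy array as a stand-in for a homogeneous float dataframe (an assumption for illustration; this is not a pandas API). A buffer-protocol view shares memory with the original, while a strided slice has to be copied to become contiguous:

```python
import numpy as np

# Stand-in for a homogeneous float dataframe: 4 rows, 3 columns.
data = np.arange(12, dtype=np.float64).reshape(4, 3)

# Zero-copy access through the buffer protocol, which is also what a
# Cython typed memoryview would consume.
view = memoryview(data)
shares_memory = np.shares_memory(np.asarray(view), data)

# Selecting the first two columns gives a strided, non-contiguous view.
col_block = data[:, :2]
is_contiguous = bool(col_block.flags["C_CONTIGUOUS"])

# Packing it into contiguous memory forces the copy we wanted to avoid.
packed = np.ascontiguousarray(col_block)
packed_is_copy = not np.shares_memory(packed, data)

print(view.format, view.shape, view.strides)
print(shares_memory, is_contiguous, packed_is_copy)
```

Whether iterating over the strided view is slower than copying first would depend on the access pattern, so it would need to be measured per estimator.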

For the second use case, I would think that writing dataframe-specific Cython (restricted to the trees for categoricals and missing values, and to imputation for missing values) would be OK. Supporting these data types directly without creating boolean masks might speed things up and make them much more convenient for the user.
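
As a minimal sketch of what "without creating boolean masks" could look like, assume a categorical column is exposed as integer codes with `-1` marking a missing value (the convention pandas uses for `Categorical` codes). The arrays and the loop below are hypothetical illustration data written in plain Python; a Cython version would follow the same shape:

```python
import numpy as np

# Hypothetical categorical column: codes index into `categories`,
# and -1 marks a missing value.
categories = np.array(["red", "green", "blue"])
codes = np.array([0, 2, -1, 1, 0, 2], dtype=np.int8)
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

sums = np.zeros(len(categories))
counts = np.zeros(len(categories), dtype=np.int64)
n_missing = 0

# Single pass over the codes: missing values are handled inline,
# so no separate boolean mask array is allocated.
for i in range(codes.shape[0]):
    c = codes[i]
    if c < 0:
        n_missing += 1
        continue
    sums[c] += y[i]
    counts[c] += 1

means = sums / counts
print(dict(zip(categories.tolist(), means.tolist())), n_missing)
```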

We don't really want a pandas dependency, but if the DataFrame API were defined in Cython (that's how it works for numpy, right?), that would probably work for us.
In that case, something like a typed memoryview with indexing, slicing, and the right data types would be enough?
That might already be possible with pandas; I don't really know the code. I guess that, apart from our limited bandwidth, what mostly kept us from working more with dataframes was that we don't want a pandas dependency and we don't want to code against the codebase, as opposed to a well-defined API.

I guess we are pretty simple in that we require homogeneous float dataframes for basically everything, apart from some corner cases where we allow more general input. The types of operations we do and the types of input we consume are pretty restricted. We're not going to consume any complex nested data structures anytime soon (or hopefully ever).
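
To make the "homogeneous float" requirement concrete, here is a small sketch of the kind of input check it implies. The helper `as_float_matrix` and the dict-of-columns input are hypothetical stand-ins, not part of scikit-learn or pandas:

```python
import numpy as np

def as_float_matrix(columns):
    """Convert a mapping of column name -> values into a 2D float64 array.

    Hypothetical helper: rejects any column that is not numeric, which is
    roughly the restriction described above.
    """
    arrays = []
    for name, values in columns.items():
        arr = np.asarray(values)
        if not np.issubdtype(arr.dtype, np.number):
            raise TypeError(f"column {name!r} has non-numeric dtype {arr.dtype}")
        arrays.append(arr)
    # Stack 1D columns side by side and force a single float dtype.
    return np.column_stack(arrays).astype(np.float64)

X = as_float_matrix({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
print(X.shape, X.dtype)

try:
    as_float_matrix({"a": [1, 2], "label": ["x", "y"]})
    rejected = False
except TypeError:
    rejected = True
```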
