Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C Functionality for Accessing DataFrame Values in Column Order #32343

Closed
wants to merge 12 commits into from

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Feb 29, 2020

This is very rough and was a little hesitant to even push here, but figured worth opening up to feedback.

The main motivation for this PR is to get block management out of the JSON serializer, but it could potentially be extended beyond that for external uses

The gist of this is that we can hold a struct PdOrderedArrays which gives the length of the DataFrame and sequential access to Numpy arrays that mirror what the user sees when dealing with a DataFrame, all by deconstructing the blocks that already exist down in some C functions

@pep8speaks
Copy link

Hello @WillAyd! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 13:5: E265 block comment should start with '# '

goto LOOP_ERROR;
}

// TODO: We should actually be grabbing an ndarray reference here
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is rather verbose, I should point out that this comment might be the most helpful in understanding what is going on here (or at least what should be)

@WillAyd WillAyd changed the title C Function for Accessing DataFrame Values in Order C Functionality for Accessing DataFrame Values in Order Feb 29, 2020
@WillAyd WillAyd changed the title C Functionality for Accessing DataFrame Values in Order C Functionality for Accessing DataFrame Values in Column Order Feb 29, 2020
@jbrockmendel
Copy link
Member

Is doing this in cython not an option? i guess including it in the json code would be troublesome?

Do you envision this being used outside of the json code? e.g. core.internals?

FWIW BlockManager.shape is surprisingly slow in some recent profilings, so optimizing it sounds good.

@WillAyd
Copy link
Member Author

WillAyd commented Feb 29, 2020

Is doing this in cython not an option? i guess including it in the json code would be troublesome?

Truthfully didn't think much about that option, but maybe. I personally find this easier and think going the Cython route would still require a lot of C "bridging" (particularly to manage memory) though I'm probably biased by the amount of time I've spent in the JSON extension recently

Do you envision this being used outside of the json code? e.g. core.internals?

For sure; would still have to evolve and wouldn't be something tackled at the outset, but I think giving third party developers a C API to the DataFrame could open up a lot of things

@jbrockmendel
Copy link
Member

going the Cython route would still require a lot of C "bridging"

Yah, ive never gotten this to work myself

@WillAyd
Copy link
Member Author

WillAyd commented Mar 4, 2020

I'll reopen later when more polished

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants