Make BitsetHandler variable width and a pandas extension #1448
…gletons to the set class
…g sets to and and or
@matt-graham I won't tag you for review just yet since there are still those bugs mentioned in the PR, but any comments you have (particularly if this was something you were expecting in here) would be appreciated 😁
Wow this is looking great @willGraham01. I've added some initial comments but no major issues sprang out at me from a quick once over.
Thanks @willGraham01 for the updates in response to my comments and for adding in the tests. This looks pretty much ready to merge to me? I think the failing check is just due to isort complaining about a missing newline.
Yeah I think everything is ready - will drop a comment on #1316 saying we can now go ahead with the plan in the issue & remove the (Bumping for review again, sorry, since new commit = new review required. Also we're hitting an error in the checks on a file that is not touched in this PR)
Looks all good to me
With regards to the failing Pylint check - I did a bit of reading and it seems there are some ongoing issues with handling definitions in
-from numpy.dtypes import BytesDType
+from numpy.dtypes import BytesDType # pylint: disable=E0611
4d65d0e to f38b15a
pylint update fixes everything - adding a reminder note here for me to press merge once CI passes.
Concerns #1316
Introduces the functionality for us to store bitsets as an extension to pandas, rather than having to use the `BitsetHandler` as an intermediary class.

`bitset` series (and thus dataframe columns) are stored as fixed-width bytestrings, where each character is represented by a `np.uint8`. The size of the bytestring (number of "characters") is chosen when the bitset is created to use the minimum number of bytes for each entry (i.e. we use 1 byte per 8 possible entries in a bitset).

The `BitsetArray` provides the necessary `ExtensionArray` instance for pandas series and dataframes. The operators like `<=`, `==`, `<`, etc. have been overloaded to perform their entry-wise `set` equivalents, so users should be able to interact with the series in the following manner:

There are some more usage examples in the `bitset_extension.py` file, which is currently my quick-and-easy live testing ground.

TODO:
Tests
The basic pandas suite for testing `DtypeExtension`s passes; however, we need some more tests for the functionality we are including on top, primarily for how we handle different inputs when assigning and comparing (allowing sets as input values upsets pandas, since they are also iterables and so can be interpreted in a wonky way).
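
For anyone skimming, here is a rough sketch of the storage idea from the description: one byte per 8 possible elements, with `<=` behaving like a subset test on each entry. It is not the code in this PR. It uses plain Python `bytes` instead of `np.uint8` and pandas so that it runs without dependencies, and `ELEMENTS`, `encode`, `decode`, `is_subset` and `union` are made-up names for illustration. The bit ordering within each byte is also an assumption.

```python
from math import ceil

# Hypothetical universe of possible elements; position in the tuple = bit index.
ELEMENTS = ("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")

# 1 byte per 8 possible elements, rounded up (10 elements -> 2 bytes).
N_BYTES = ceil(len(ELEMENTS) / 8)


def encode(members):
    """Pack a set of elements into a fixed-width bytestring."""
    buf = bytearray(N_BYTES)
    for member in members:
        idx = ELEMENTS.index(member)
        # Assumed layout: least-significant bit of byte 0 is element 0.
        buf[idx // 8] |= 1 << (idx % 8)
    return bytes(buf)


def decode(raw):
    """Recover the set of elements from a fixed-width bytestring."""
    return {
        element
        for idx, element in enumerate(ELEMENTS)
        if raw[idx // 8] & (1 << (idx % 8))
    }


def is_subset(lhs, rhs):
    """Set-style `<=`: every bit set in lhs must also be set in rhs."""
    return all((a & b) == a for a, b in zip(lhs, rhs))


def union(lhs, rhs):
    """Set-style `|`: bitwise OR of the two bytestrings."""
    return bytes(a | b for a, b in zip(lhs, rhs))


# A stand-in for a column of bitsets, compared entry-wise against {"a", "b"}.
column = [encode({"a"}), encode({"a", "i"}), encode(set())]
mask = [is_subset(entry, encode({"a", "b"})) for entry in column]
print(mask)
```

Because every entry has the same width, the set operations reduce to byte-wise `&` and `|`, which is presumably what makes the fixed-width bytestring layout convenient for vectorised comparisons.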