Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make BitsetHandler variable width and a pandas extension #1448

Merged
merged 25 commits into from
Sep 25, 2024

Conversation

willGraham01
Copy link
Collaborator

@willGraham01 willGraham01 commented Jul 31, 2024

Concerns #1316 |

Introduces the functionality for us to store bitsets as an extension to pandas, rather than having to use the BitSetHandler as an intermediary class.

bitset series (and thus dataframe columns) are stored as fixed-width bytestrings, where each character is represented by a np.uint8. The size of the bytestring (number of "characters") is chosen when the bitset is created to use the minimum number of bytes for each entry (IE we use 1 byte per 8 possible entries in a bitset).

The BitsetArray provides the necessary ExtensionArray instance for pandas series and dataframes. The operators like <=, ==, <, etc have been overloaded to perform their entry-wise set equivalents, so users should be able to interact with the series in the following manner:

bdt = BitsetDtype(0, 1, 2, 3, 4, 5, 6, 7, 8)
a_series = pd.Series([{"1"}, {"4", "8"}, {"2", "7", "3"}], dtype=big_bdt)

# Return a boolean series indicating equality
big_s == {"4", "8"}
# Return a boolean series indicating entry-wise superset-ness
# Note this is equivalent to the 'in' functionality requested in
# https://github.com/UCL/TLOmodel/issues/1316#issuecomment-2052071639,
# but the 'in' operator has a particular meaning for series so we can't use it for this context.
big_s >= {"3"}
# Add an item to an entry
big_s.loc[0] += {"0"}

There are some more usage examples in the bitset_extension.py file, which is currently my quick-and-easy live testing ground.

TODO:

Tests

The basic pandas suite for testing DtypeExtensions passes, however we need some more tests for the functionality we are including on top. Primarily tests for how we handle different inputs when assigning and comparing (since allowing sets to be input values upsets pandas, since they are also iterables and so can be interpreted in a wonky way).

@willGraham01 willGraham01 changed the title Make BitsetHandler a pandas extension Make BitsetHandler variable width and a pandas extension Jul 31, 2024
@willGraham01
Copy link
Collaborator Author

@matt-graham I won't tag you for review just yet since there's still those bugs mentioned in the PR, but any comments you have (particularly if this was something you were expecting in here) would be appreciated 😁

Copy link
Collaborator

@matt-graham matt-graham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this is looking great @willGraham01. I've added some initial comments but no major issues sprang out at me from a quick once over.

src/tlo/bitset_handler/bitset_extension.py Outdated Show resolved Hide resolved
src/tlo/bitset_handler/bitset_extension.py Outdated Show resolved Hide resolved
src/tlo/bitset_handler/bitset_extension.py Outdated Show resolved Hide resolved
src/tlo/bitset_handler/bitset_extension.py Outdated Show resolved Hide resolved
src/tlo/bitset_handler/bitset_extension.py Outdated Show resolved Hide resolved
src/tlo/bitset_handler/bitset_extension.py Outdated Show resolved Hide resolved
src/tlo/bitset_handler/bitset_extension.py Outdated Show resolved Hide resolved
src/tlo/bitset_handler/bitset_extension.py Outdated Show resolved Hide resolved
src/tlo/bitset_handler/bitset_extension.py Outdated Show resolved Hide resolved
@willGraham01 willGraham01 marked this pull request as ready for review August 22, 2024 08:42
Copy link
Collaborator

@matt-graham matt-graham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @willGraham01 for the updates in response to my comments and adding in the tests. This looks pretty much ready to merge to me? I think the checks fail is just due to a isort complaining about a missing new line.

tests/bitset_handler/test_bitset_pandas_dtype.py Outdated Show resolved Hide resolved
tests/bitset_handler/conftest.py Outdated Show resolved Hide resolved
@willGraham01
Copy link
Collaborator Author

willGraham01 commented Sep 9, 2024

Yeah I think everything is ready - will drop a comment on #1316 saying we can now go ahead with the plan in the issue & remove the BitsetHandler class in favour of direct column manipulation.

(Bumping for review again sorry since new commit = new review required. Also we're hitting an error in the checks on a file that is not touched in this PR)

Copy link
Collaborator

@matt-graham matt-graham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good to me

@matt-graham
Copy link
Collaborator

With regards to the failing Pylint check - I did a bit of reading at it seems there are some on-going issues with handling definitions in .pyi files in NumPy specifically within Pylint (pylint-dev/astroid#2375), so I would say as the error does appear to be a false positive here, the simplest solution would be to just add a comment to locally disable checking for E0611 in the relevant line, that is something like

-from numpy.dtypes import BytesDType
+from numpy.dtypes import BytesDType  # pylint: disable=E0611

@willGraham01 willGraham01 force-pushed the wgraham/bitsethandler-to-dtype branch from 4d65d0e to f38b15a Compare September 24, 2024 09:26
@willGraham01
Copy link
Collaborator Author

pylint update fixes everything - adding a reminder note here for me to press merge once CI passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants