Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse Implementation of build_coleman_forest plus a fix for bottleneck import #317

Merged

Conversation

ryanhausen
Copy link
Contributor

Reference Issues/PRs

Fixes #310

What does this implement/fix? Explain your changes.

This adds a sparse implementation of build_coleman_forest to treeple.stats. The sparse implementation relies on scipy.stats.sparse and so only works in the case of binary classification and regression. In my tests using cProfile/memray, the sparse implementation is a little over 50% faster and uses about %7 less memory, see pdf for more information. I am also attaching the profiling data
if you'd like to take a look.

This PR also includres a change to the bottleneck implementation to for the dense implementation of build_coleman_forest. Specifically there are two important changes:

  1. The check for the existence of the bottleneck package is now done using importlib.util.find_spec("bottleneck"). The old check using sys seemed to work fine and passed tests, but actually didn't seem work unless bottleneck had been imported before, which when the tests are run is true, but may not be the case in other situations
  2. the aliased and lambda functions nanmean_f and anynan_f are now defined using def so that the warning message can moved into their respective calls rather than at import.

Any other comments?

The scipy.stats sparse implementation should be swapped out for pydata.sparse when that implementation is performant enough, as it is more general.

Copy link

codecov bot commented Aug 20, 2024

Codecov Report

Attention: Patch coverage is 89.79592% with 10 lines in your changes missing coverage. Please review.

Project coverage is 80.27%. Comparing base (1d970b0) to head (d733508).
Report is 3 commits behind head on main.

Files Patch % Lines
treeple/stats/forest.py 81.39% 5 Missing and 3 partials ⚠️
treeple/stats/utils.py 96.36% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #317      +/-   ##
==========================================
+ Coverage   80.00%   80.27%   +0.26%     
==========================================
  Files          24       24              
  Lines        2221     2296      +75     
  Branches      411      422      +11     
==========================================
+ Hits         1777     1843      +66     
- Misses        312      317       +5     
- Partials      132      136       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Thanks for the profiling data and documentation. It's very convincing.

Left a few comments to tighten up the PR

treeple/stats/forest.py Outdated Show resolved Hide resolved
treeple/stats/utils.py Show resolved Hide resolved
treeple/stats/utils.py Outdated Show resolved Hide resolved
treeple/stats/utils.py Outdated Show resolved Hide resolved
treeple/stats/utils.py Outdated Show resolved Hide resolved
@ryanhausen
Copy link
Contributor Author

@adam2392 I have updated the PR, let me know if I missed anything or if you want to see some more changes.

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly lgtm. Just some small nits

@ryanhausen
Copy link
Contributor Author

@adam2392 I added some documentation to address your comments. Let me know what you think.

@adam2392
Copy link
Collaborator

I will let @sampan501 @SUKI-O @YuxinB @PSSF23 take a look as well.

@@ -316,3 +329,209 @@ def get_per_tree_oob_samples(est: BaseForest):
)
oob_samples.append(unsampled_indices)
return oob_samples


def _get_forest_preds_sparse(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I missed it somewhere, but are there unit tests for these methods anywhere?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sampan501, @adam2392 No there isn't a test for this function in test_utils.py. This function, _get_forest_preds_sparse, is a good candidate for a unit test though. I can write one up.

I worry the other functions: _parallel_build_null_forests_sparse and _compute_null_distribution_coleman_sparse, like their dense counterparts, do too much to be easily unit tested and are tested via the integration tests. However, if you would like, we can refactor the dense/sparse functions into smaller units so that they can all have tests written. Would probably be cleaner, but would take time.

What would you like to see?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I think it's fine the way it is. If we find means of re-designing the internal API to make things simpler and easier to test, then it's warranted, but I don't see a clear path. Do you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would take some thought. There is some repetition we might be able to leverage, but I am not sure if it would come at the cost of clarity. So just write up a test for _get_forest_preds_sparse?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that's sufficient then for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. Maybe we can make an issue for having such unit tests/re-designing the API so we don't forget. Otherwise, I think this PR lgtm

@ryanhausen
Copy link
Contributor Author

@sampan501 @adam2392 Hey I added a test, but it looks like I some other tests are failing. I submitted an issue, #328. I'll start looking into it, but it might go faster if you have an intuition on what the issue is.

@adam2392
Copy link
Collaborator

This is gonna be an issue due to parameter validation in scikit-learn. I will push a fix when I have time. It seems the rest of the code works tho.

I would prioritize other issues until I have the time to push up the fix

@ryanhausen
Copy link
Contributor Author

@adam2392 sounds good. Should we let this sit and update and merge after you have a fix for the other issue or merge as is? @sampan501

@sampan501
Copy link
Member

I'm happy to merge if @adam2392 is cool with it

@sampan501 sampan501 self-requested a review September 23, 2024 23:55
@adam2392
Copy link
Collaborator

@adam2392 sounds good. Should we let this sit and update and merge after you have a fix for the other issue or merge as is? @sampan501

I will merge once the fix is added and the CI is confirmed to be happy. No rush on this one.

@adam2392
Copy link
Collaborator

@ryanhausen if you update wrt main, the CIs should be fixed

@adam2392 adam2392 merged commit e1c38ad into neurodata:main Oct 11, 2024
23 of 35 checks passed
@adam2392
Copy link
Collaborator

Thanks @ryanhausen !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bottleneck should not raise a warning unless relevant function in treeple.stats is called
3 participants