Sparse Implementation of `build_coleman_forest` plus a fix for bottleneck import #317

ryanhausen · 2024-08-20T19:59:54Z

Reference Issues/PRs

Fixes #310

What does this implement/fix? Explain your changes.

This adds a sparse implementation of build_coleman_forest to treeple.stats. The sparse implementation relies on scipy.sparse and so only works in the case of binary classification and regression. In my tests using cProfile/memray, the sparse implementation is a little over 50% faster and uses about 7% less memory; see the PDF for more information. I am also attaching the profiling data if you'd like to take a look.
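To illustrate the idea, here is a minimal sketch, not the code from this PR. It assumes the per-tree predictions form a (trees x samples) matrix in which each tree only predicts for some samples, and it shows how a NaN-aware mean over trees can be reproduced from scipy.sparse sums and counts. All variable names and the toy data are hypothetical.

```python
import numpy as np
from scipy import sparse

n_trees, n_samples = 3, 4
# Hypothetical (tree, sample, prediction) triplets: each tree predicts
# only for the samples it was evaluated on.
rows = np.array([0, 0, 1, 1, 2, 2])
cols = np.array([0, 1, 1, 2, 2, 3])
vals = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])

# Dense layout: NaN marks "this tree made no prediction for this sample".
dense = np.full((n_trees, n_samples), np.nan)
dense[rows, cols] = vals
dense_mean = np.nanmean(dense, axis=0)

# Sparse layout: store only the observed entries.
preds = sparse.csr_matrix((vals, (rows, cols)), shape=(n_trees, n_samples))
# A separate indicator matrix counts observations per sample, because a
# prediction of 0.0 contributes nothing to the sum but must still be counted.
seen = sparse.csr_matrix((np.ones_like(vals), (rows, cols)), shape=(n_trees, n_samples))

sums = np.asarray(preds.sum(axis=0)).ravel()
counts = np.asarray(seen.sum(axis=0)).ravel()
sparse_mean = sums / counts  # every sample here has at least one prediction

print(sparse_mean)  # [1.  0.5 0.5 1. ]
```

The dense and sparse means agree on this toy input; whether the real implementation uses sums and counts in this way is an assumption.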

This PR also includes a change to how bottleneck is handled in the dense implementation of build_coleman_forest. Specifically, there are two important changes:

- The check for the existence of the bottleneck package is now done using importlib.util.find_spec("bottleneck"). The old check using sys seemed to work fine and passed tests, but it only worked if bottleneck had already been imported. That is true when the tests are run, but may not be the case in other situations.
- The aliased and lambda functions nanmean_f and anynan_f are now defined using def so that the warning message can be moved into their respective calls rather than being emitted at import.
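The two changes above can be sketched as follows. This is an illustration, not the code from the PR: the old check is described only as "using sys", so the `sys.modules` lookup is an assumption about what it looked like, and `nanmean_sketch` is a hypothetical pure-Python stand-in for nanmean_f.

```python
import importlib.util
import sys
import warnings


def is_imported(name: str) -> bool:
    """True only if `name` has already been imported in this process."""
    return name in sys.modules


def is_installed(name: str) -> bool:
    """True if the top-level package `name` can be found, imported or not."""
    return importlib.util.find_spec(name) is not None


def nanmean_sketch(values):
    """Hypothetical fallback that warns when called, not at import time."""
    if not is_installed("bottleneck"):
        warnings.warn("bottleneck is not installed; using a slower fallback.")
    finite = [v for v in values if v == v]  # NaN is the only value != itself
    return sum(finite) / len(finite) if finite else float("nan")
```

With a `sys.modules`-style check, an installed package that nothing has imported yet is reported as missing, which matches the behaviour described above. Defining the fallback with def means that merely importing the module stays silent.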

Any other comments?

The scipy.sparse implementation should be swapped out for pydata/sparse when that implementation is performant enough, as it is more general.

codecov · 2024-08-20T20:28:03Z

Codecov Report

Attention: Patch coverage is 89.79592% with 10 lines in your changes missing coverage. Please review.

Project coverage is 80.27%. Comparing base (1d970b0) to head (d733508).
Report is 3 commits behind head on main.

| Files | Patch % | Lines |
| --- | --- | --- |
| treeple/stats/forest.py | 81.39% | 5 Missing and 3 partials ⚠️ |
| treeple/stats/utils.py | 96.36% | 0 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #317      +/-   ##
==========================================
+ Coverage   80.00%   80.27%   +0.26%     
==========================================
  Files          24       24              
  Lines        2221     2296      +75     
  Branches      411      422      +11     
==========================================
+ Hits         1777     1843      +66     
- Misses        312      317       +5     
- Partials      132      136       +4

adam2392

Cool! Thanks for the profiling data and documentation. It's very convincing.

Left a few comments to tighten up the PR

treeple/stats/forest.py

treeple/stats/utils.py

ryanhausen · 2024-08-22T14:24:38Z

@adam2392 I have updated the PR, let me know if I missed anything or if you want to see some more changes.

treeple/stats/forest.py

treeple/stats/tests/test_forest.py

adam2392

Mostly lgtm. Just some small nits

ryanhausen · 2024-08-26T15:08:24Z

@adam2392 I added some documentation to address your comments. Let me know what you think.

adam2392 · 2024-08-26T15:50:02Z

I will let @sampan501 @SUKI-O @YuxinB @PSSF23 take a look as well.

sampan501 · 2024-09-12T17:50:50Z

treeple/stats/utils.py

@@ -316,3 +329,209 @@ def get_per_tree_oob_samples(est: BaseForest):
        )
        oob_samples.append(unsampled_indices)
    return oob_samples
+
+
+def _get_forest_preds_sparse(


Maybe I missed it somewhere, but are there unit tests for these methods anywhere?

adam2392 · 2024-09-17

@ryanhausen

ryanhausen · 2024-09-18

@sampan501, @adam2392 No, there isn't a test for this function in test_utils.py. This function, _get_forest_preds_sparse, is a good candidate for a unit test though. I can write one up.

I worry that the other functions, _parallel_build_null_forests_sparse and _compute_null_distribution_coleman_sparse, like their dense counterparts, do too much to be easily unit tested, so they are tested via the integration tests. However, if you would like, we can refactor the dense/sparse functions into smaller units so that they can all have tests written. That would probably be cleaner, but would take time.

What would you like to see?

adam2392 · 2024-09-18

For now, I think it's fine the way it is. If we find a means of re-designing the internal API to make things simpler and easier to test, then it's warranted, but I don't see a clear path. Do you?

ryanhausen · 2024-09-18

I think it would take some thought. There is some repetition we might be able to leverage, but I am not sure if it would come at the cost of clarity. So just write up a test for _get_forest_preds_sparse?

adam2392 · 2024-09-18

Yeah I think that's sufficient then for now.

sampan501 · 2024-09-18

Sounds good. Maybe we can make an issue for having such unit tests/re-designing the API so we don't forget. Otherwise, I think this PR lgtm.
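As a rough sketch of what such a unit test might look like: the actual signature of _get_forest_preds_sparse is not shown in this thread, so the two helpers below are hypothetical stand-ins, and plain functions take the place of fitted trees. The test checks that the sparse result stores exactly the positions the dense result fills in, with the same values.

```python
import numpy as np
from scipy import sparse


def forest_preds_dense(predict_fns, X, sample_idx):
    """Hypothetical dense helper: NaN wherever a tree did not see a sample."""
    out = np.full((len(predict_fns), X.shape[0]), np.nan)
    for t, (fn, idx) in enumerate(zip(predict_fns, sample_idx)):
        out[t, idx] = fn(X[idx])
    return out


def forest_preds_sparse(predict_fns, X, sample_idx):
    """Hypothetical sparse helper: store only evaluated (tree, sample) pairs."""
    rows = np.concatenate([np.full(len(idx), t) for t, idx in enumerate(sample_idx)])
    cols = np.concatenate(sample_idx)
    data = np.concatenate([fn(X[idx]) for fn, idx in zip(predict_fns, sample_idx)])
    return sparse.csr_matrix((data, (rows, cols)), shape=(len(predict_fns), X.shape[0]))


def test_sparse_matches_dense():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 2))
    # Stand-ins for fitted trees: each "tree" is a deterministic function.
    predict_fns = [lambda x: x[:, 0] + 1.0, lambda x: x[:, 1] - 1.0]
    sample_idx = [np.array([0, 2, 5]), np.array([1, 2, 3])]

    dense = forest_preds_dense(predict_fns, X, sample_idx)
    sp = forest_preds_sparse(predict_fns, X, sample_idx)

    # The same positions are populated in both layouts ...
    observed = ~np.isnan(dense)
    stored = np.zeros_like(observed)
    coo = sp.tocoo()
    stored[coo.row, coo.col] = True
    assert np.array_equal(stored, observed)
    # ... and the values at those positions agree.
    np.testing.assert_allclose(sp.toarray()[observed], dense[observed])
```

Comparing against a dense reference keeps the test independent of how the sparse matrix is built internally.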

ryanhausen · 2024-09-23T21:59:13Z

@sampan501 @adam2392 Hey, I added a test, but it looks like some other tests are failing. I submitted an issue, #328. I'll start looking into it, but it might go faster if you have an intuition on what the issue is.

adam2392 · 2024-09-23T22:21:46Z

This is gonna be an issue due to parameter validation in scikit-learn. I will push a fix when I have time. It seems the rest of the code works tho.

I would prioritize other issues until I have the time to push up the fix

ryanhausen · 2024-09-23T23:44:17Z

@adam2392 sounds good. Should we let this sit and update and merge after you have a fix for the other issue or merge as is? @sampan501

sampan501 · 2024-09-23T23:55:51Z

I'm happy to merge if @adam2392 is cool with it

adam2392 · 2024-09-24T00:05:42Z

> @adam2392 sounds good. Should we let this sit and update and merge after you have a fix for the other issue or merge as is? @sampan501

I will merge once the fix is added and the CI is confirmed to be happy. No rush on this one.

adam2392 · 2024-10-10T21:05:57Z

@ryanhausen if you update wrt main, the CIs should be fixed

adam2392 · 2024-10-11T12:50:27Z

Thanks @ryanhausen !

ryanhausen added 5 commits August 13, 2024 21:57

wip sparse prototype. fixed call bottleneck check

9f4862f

moved warning into a function so that it displays on a call rather than import

4dd682f

added a test for sparsity

1083610

Merge branch 'neurodata:main' into sparse-permutation-tests-treeple-stats

4904136

added to changelog

cfaea1b

ryanhausen mentioned this pull request Aug 20, 2024

dropped bottleneck warning message #311

Closed

adam2392 self-requested a review August 20, 2024 22:54

adam2392 reviewed Aug 21, 2024

treeple/stats/forest.py (outdated)

treeple/stats/utils.py

treeple/stats/utils.py (outdated)

treeple/stats/utils.py (outdated)

treeple/stats/utils.py (outdated)

ryanhausen and others added 6 commits August 21, 2024 11:13

Merge branch 'neurodata:main' into sparse-permutation-tests-treeple-stats

821a8b7

prefix warning comment with XXX

101bab1

Co-authored-by: Adam Li <adam2392@gmail.com>

simplified sparse matrix instantiation

d2204e4

wip PR revisions. updated docs, revised bottleneck function selection logic in test

766c6c4

removed uneeded main statement at bottom

f852fc4

added separate test for nanmean_f, reset the order of pytest fixtures

251392f

adam2392 reviewed Aug 22, 2024

treeple/stats/forest.py

adam2392 reviewed Aug 22, 2024

treeple/stats/tests/test_forest.py

adam2392 reviewed Aug 22, 2024

added clarifying comments

d733508

adam2392 approved these changes Aug 26, 2024

adam2392 requested review from PSSF23, SUKI-O, sampan501 and YuxinB August 26, 2024 15:49

sampan501 reviewed Sep 12, 2024

ryanhausen added 2 commits September 23, 2024 14:17

added a unit test for utils._get_forest_preds_sparse"

248ac3b

Merge remote-tracking branch 'upstream/main' into sparse-permutation-tests-treeple-stats

6be17a0

sampan501 self-requested a review September 23, 2024 23:55

sampan501 approved these changes Sep 23, 2024

adam2392 merged commit e1c38ad into neurodata:main Oct 11, 2024
23 of 35 checks passed
