Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add code for marron wand simulator #203

Merged
merged 15 commits into from
Feb 8, 2024
Merged

ENH: add code for marron wand simulator #203

merged 15 commits into from
Feb 8, 2024

Conversation

sampan501
Copy link
Member

N/A

Changes proposed in this pull request:

  • Added Marron and Wand 1992 simulations with multivariate normal simulations
  • Added truth for each simulation

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

@adam2392 adam2392 self-requested a review January 23, 2024 14:06
Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments that I think can result in runtime issues and hidden bugs even.

Can you also add a unit test? I think an easy thing to test is just that things run, expected error comes out for simulation and

sktree/datasets/hyppo.py Show resolved Hide resolved
sktree/datasets/hyppo.py Outdated Show resolved Hide resolved
raise ValueError(
f"Simulation must be trunk, trunk_overlap, trunk_mix, {MARRON_WAND_SIMS.keys()}"
)

y = np.concatenate((np.zeros(n_samples // 2), np.ones(n_samples // 2)))

if return_params:
return X, y, [mu_0, mu_1], [cov, cov]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, the mu and covariance are not easily returned anymore. I'm wondering if we should wrap them in an object then instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, what do you think we need to return in return_params?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we use return_params to get information about the data generating model as well as possibly parameters of the DGM.

Is it easy to wrap these as a subclass for rv_continuous class (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous)? Then we can just return the instantiated object. Though I'm not familiar with that API, and how we can recover the exact parametrization of a specific MW simulation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sampan501 if there's no way to address this, then should we just error out on return_params=True if simulation is not trunk?

I think we just want the DGM to be returned, so perhaps it would make sense to just return a callable such that
returned_model.sample(n_samples) can generate new samples of data. For rng.multivariate_normal or if we can get an API in the framework of rv_continuous, then we can do things like:

test = multivariate_normal(mean=2.5, cov=0.5)
print(test)
print(test.mean)
print(test.cov)
print(test.rvs(size=(10,)))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could just return X, Y, norm_params, and G for the Marron and Wand Sims right?

Copy link
Member Author

@sampan501 sampan501 Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing additional is G, which are the observations from the mixed Gaussian

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that sounds reasonable for now. Can you update the docstring accordingly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adam2392
Copy link
Collaborator

adam2392 commented Feb 1, 2024

Lmk when this is ready for review.

Now n_informative controls the number of informative features. You can set it to default 256. cc: #203 (comment)

Copy link

codecov bot commented Feb 7, 2024

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (18c2f45) 88.45% compared to head (7c41e75) 88.48%.
Report is 1 commits behind head on main.

Files Patch % Lines
sktree/datasets/hyppo.py 90.62% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #203      +/-   ##
==========================================
+ Coverage   88.45%   88.48%   +0.03%     
==========================================
  Files          54       54              
  Lines        4823     4891      +68     
==========================================
+ Hits         4266     4328      +62     
- Misses        557      563       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sampan501
Copy link
Member Author

@adam2392 PR is ready except for formatting. Not sure where the error is

@adam2392
Copy link
Collaborator

adam2392 commented Feb 7, 2024

You can run make pre-commit to run pre-commit linting pipelines, or setup pre-commit to automatically run before each commit.

I pushed 665dfdf which should fix the issues

Signed-off-by: Adam Li <adam2392@gmail.com>
@sampan501 sampan501 changed the title ENH: add code for marron wand simulator and true density ENH: add code for marron wand simulator Feb 7, 2024
Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we'll be using the simulations a lot, it would be great to document it well at this stage to prevent future confusion. I left a few comments on where I think it would be necessary.

sktree/datasets/hyppo.py Outdated Show resolved Hide resolved
sktree/datasets/hyppo.py Outdated Show resolved Hide resolved
sktree/datasets/hyppo.py Outdated Show resolved Hide resolved
sktree/datasets/hyppo.py Outdated Show resolved Hide resolved
@adam2392
Copy link
Collaborator

adam2392 commented Feb 7, 2024

Few other nit picks and then LGTM

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let you merge once CIs are green

@sampan501 sampan501 merged commit 945864c into main Feb 8, 2024
30 checks passed
@sampan501 sampan501 deleted the marron-wand-sims branch February 8, 2024 01:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants