Sprint 12 Task List #501

Closed
15 of 27 tasks
cristinaetrv opened this issue May 15, 2024 · 4 comments
@cristinaetrv
Collaborator

cristinaetrv commented May 15, 2024

Due date: June 6 2024

Documentation

  • Separate out annotator description/perl side including performance figures, describe every piece that repo has including Machine Learning subsection, Bioinformatics tools subsection (installation first) - 6/6/24
  • GIF of how you would use general purpose ML library - 6/6/24

Proteomics

PRS

Covariance Matrix Estimation/ML library
Goal: Make more accurate predictions, more tailored test, better control false positive rate

  • Seeing how well when we assume there is signal in the data, hypothesis test to detect spike in data - 6/6/24 - WIP Integrate covariance methods into poe analysis methods #513
  • (related to "Seeing how well...") Conservativeness tests - 6/6/24 - @IlhaH -
  • Computing p values accurately tailored to distribution that you'd expect if there is no spike - 6/6/24
  • Generate more general use cases for ML library
  • (stretch) Finish POE method - 6/6/24 - @IlhaH @austinTalbot7241993
  • (sprint 13) Austin will implement some spherical p-value tests that are a direct POIROT competitor
@cristinaetrv cristinaetrv added the .task list A checklist of smaller tasks label May 15, 2024
@cristinaetrv cristinaetrv added this to the Sprint 12 milestone May 21, 2024
@akotlar
Collaborator

akotlar commented May 21, 2024

2024-05-21

Stats methods topic meeting

Computing p values accurately tailored to distribution that you'd expect if there is no spike - 6/6/24

We have merged the random matrix theory (RMT) PR; Ilha will use it and evaluate it.
RMT works by fixing the ratio of the number of covariates to the sample size (p/n = c; c > 0) and letting both p and n go to infinity. This contrasts with classical statistical tests, which fix p and let n go to infinity. We're going to evaluate whether RMT works better.

  • In this regime you look at the distribution of the eigenvalues under the null hypothesis of an identity covariance matrix (the Marchenko-Pastur distribution), and then check whether your largest eigenvalue falls in that range.
    We will compare to Hotelling's T² test.
    Ilha will be doing that comparison.
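
A minimal sketch of the check described above, assuming mean-zero Gaussian data with unit noise variance. The function names are hypothetical and are not taken from the merged PR; a real test would also account for finite-sample fluctuations around the bulk edge (for example by simulation) rather than comparing to the asymptotic edge alone.

```python
import numpy as np


def mp_upper_edge(n_samples: int, n_features: int, sigma2: float = 1.0) -> float:
    """Upper edge of the Marchenko-Pastur bulk for aspect ratio c = p/n."""
    c = n_features / n_samples
    return sigma2 * (1.0 + np.sqrt(c)) ** 2


def largest_eigenvalue(X: np.ndarray) -> float:
    """Largest eigenvalue of the sample covariance X^T X / n.

    Assumes the columns of X already have mean zero.
    """
    n = X.shape[0]
    # Squared singular values of X / sqrt(n) are the sample covariance eigenvalues.
    s = np.linalg.svd(X / np.sqrt(n), compute_uv=False)
    return float(s[0] ** 2)


rng = np.random.default_rng(0)
n, p = 1000, 250  # c = p/n = 0.25

# Null: identity covariance.
X_null = rng.standard_normal((n, p))

# Alternative: one direction has population variance 6 (a "spike").
X_spike = X_null.copy()
X_spike[:, 0] *= np.sqrt(6.0)

edge = mp_upper_edge(n, p)
lam_null = largest_eigenvalue(X_null)
lam_spike = largest_eigenvalue(X_spike)

print(f"MP upper edge:            {edge:.3f}")
print(f"largest eigenvalue, null:  {lam_null:.3f}")
print(f"largest eigenvalue, spike: {lam_spike:.3f}")
```

Under the null the largest eigenvalue sits near the bulk edge, while a sufficiently strong spike pushes it well outside.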

Seeing how well when we assume there is signal in the data, hypothesis test to detect spike in data - 6/6/24

Ilha is making progress on this. He is running the first simulation now, trying different combinations of singular value shrinkage estimators and covariance matrix estimators, and he will evaluate them via the mean squared error of the PoE effect estimates on simulated data.
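
The shape of such a simulation can be sketched as follows. This is an illustration only: it scores covariance estimates directly rather than PoE effect estimates, and it uses a simple hard threshold at the Marchenko-Pastur edge as a stand-in for the shrinkers actually being compared. All names and parameter values here are assumptions.

```python
import numpy as np


def shrink_eigenvalues(S: np.ndarray, n_samples: int, noise_var: float = 1.0) -> np.ndarray:
    """Hard-threshold shrinker: eigenvalues at or below the MP edge are reset to the noise level."""
    p = S.shape[0]
    edge = noise_var * (1.0 + np.sqrt(p / n_samples)) ** 2
    evals, evecs = np.linalg.eigh(S)
    shrunk = np.where(evals > edge, evals, noise_var)
    # Rebuild the matrix: scale column j of evecs by shrunk[j], then multiply by evecs^T.
    return (evecs * shrunk) @ evecs.T


def simulate_mse(n: int = 400, p: int = 200, spike: float = 5.0,
                 n_reps: int = 20, seed: int = 0) -> tuple[float, float]:
    """Average entrywise squared error of the raw and shrunken covariance estimates."""
    rng = np.random.default_rng(seed)
    sigma = np.eye(p)
    sigma[0, 0] += spike  # single spiked direction
    scale = np.sqrt(np.diag(sigma))
    raw_err, shrunk_err = [], []
    for _ in range(n_reps):
        X = rng.standard_normal((n, p)) * scale
        S = X.T @ X / n
        raw_err.append(np.mean((S - sigma) ** 2))
        shrunk_err.append(np.mean((shrink_eigenvalues(S, n) - sigma) ** 2))
    return float(np.mean(raw_err)), float(np.mean(shrunk_err))


raw_mse, shrunk_mse = simulate_mse()
print(f"raw MSE: {raw_mse:.5f}  shrunken MSE: {shrunk_mse:.5f}")
```

Swapping `shrink_eigenvalues` for another shrinker, or the error metric for one on the downstream effect estimates, keeps the same loop structure.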

@akotlar
Collaborator

akotlar commented May 24, 2024

Covariance Matrix / POE

Conservativeness:

  • For some combinations of covariance matrix estimators / singular value shrinkers, we became over-conservative (bias decreases conditional on H0, but the true effect is shrunk conditional on H1).
    • Which shrinkers were most conservative?
      • Operator norm, F4, N1 were similar
  • The big problem with the parent-of-origin estimator is that it is anti-conservative: conditional on H0, it gives non-zero effects (and takes hundreds of thousands of samples to converge to 0). We are trying to build a p-value test of whether the singular value distribution of the covariance matrix is spherical. We have precise estimates of big effects, but large bias on small effects, so we are putting shrinkers on the singular value estimates to kill off the small effects. The goal is to report estimated effects only when they are big. Based on Mike Epstein's responses in the past, Austin thinks it will be easy to sell a method that gives PoE estimates, even if only for big effects.
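
One generic way to get a p-value for sphericity is to calibrate a statistic against a simulated null. The sketch below is an assumption-laden illustration, not the test under development: it uses the ratio of the largest to the mean sample eigenvalue as the statistic and assumes a Gaussian null; the function names are hypothetical.

```python
import numpy as np


def sphericity_stat(X: np.ndarray) -> float:
    """Ratio of the largest to the mean sample eigenvalue (scale invariant)."""
    Xc = X - X.mean(axis=0)
    evals = np.linalg.svd(Xc, compute_uv=False) ** 2
    return float(evals[0] / evals.mean())


def sphericity_pvalue(X: np.ndarray, n_null: int = 200, seed: int = 0) -> float:
    """Monte Carlo p-value under a spherical Gaussian null with the same shape as X."""
    rng = np.random.default_rng(seed)
    observed = sphericity_stat(X)
    null = np.array(
        [sphericity_stat(rng.standard_normal(X.shape)) for _ in range(n_null)]
    )
    # Add-one correction keeps the p-value strictly positive.
    return float((1 + np.sum(null >= observed)) / (1 + n_null))


rng = np.random.default_rng(123)
X_null = rng.standard_normal((200, 50))
X_spike = X_null.copy()
X_spike[:, 0] *= 3.0  # one direction with inflated variance

p_null = sphericity_pvalue(X_null)
p_spike = sphericity_pvalue(X_spike)
print(f"p-value (spherical data): {p_null:.3f}")
print(f"p-value (spiked data):    {p_spike:.3f}")
```

Because the null distribution is simulated at the observed n and p, the p-value is calibrated for that shape by construction, at the cost of depending on the Gaussian assumption.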

Publication Plan for Q3

  1. SPPCA - once Dave Carlson is back from vacation he'll finish giving feedback; we're targeting having this out by end of summer.
  2. PoE draft, modulo Mike Epstein's students actually completing their UKBB analysis.
  3. Platform paper - Bystro platform updates / generative AI discussion.

Sprint 13 plan update

Austin will implement some spherical p-value tests that are a direct POIROT competitor

Domain Adaptation

Test looks really successful at removing batch effects; TAMPOR does not appear to remove them, at least not entirely.

DomainAdaptationTest.ipynb.zip

@akotlar
Collaborator

akotlar commented May 31, 2024

2024-05-31

PRS

Automatically launch PRS after ancestry from API server

Pushed back to last week of sprint

Display basic PRS results in webapp (table with individuals and their score) - @akotlar - 6/6/24 - #509

Same

Add batch processing for PRS C+T workflow with dosage matrix for memory issues

Under review
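
For context, the idea of batching over the dosage matrix can be sketched as below. This is not the code under review; it assumes a variants-by-samples dosage layout and a per-variant effect-size vector, and the function name and batch size are hypothetical.

```python
import numpy as np


def prs_in_batches(dosage, betas: np.ndarray, batch_size: int = 1000) -> np.ndarray:
    """Accumulate per-sample scores over blocks of variants.

    dosage: array-like of shape (n_variants, n_samples); may be an np.memmap
            so that only one block is held in memory at a time.
    betas:  effect sizes of shape (n_variants,).
    """
    n_variants, n_samples = dosage.shape
    scores = np.zeros(n_samples)
    for start in range(0, n_variants, batch_size):
        stop = start + batch_size
        # Materialize just this block of variants as floats.
        block = np.asarray(dosage[start:stop], dtype=float)
        scores += betas[start:stop] @ block
    return scores


rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=(2500, 40))  # toy dosages in {0, 1, 2}
betas = rng.normal(scale=0.01, size=2500)

scores = prs_in_batches(dosage, betas, batch_size=700)
```

The result matches a single `betas @ dosage` product, but peak memory is bounded by the batch size rather than the full matrix.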

Need annotated AD stat summary to include ancestry

Done

Take in top hit from ancestry, convert to superpop, connect to LD map for corresponding pop for LD clump

Should be in testing by 6/6
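
A rough sketch of the first two steps, assuming the ancestry output is a mapping from 1000 Genomes population codes to probabilities. The real output format and the LD map lookup are not shown in this thread, so the function name and input shape are hypothetical.

```python
# 1000 Genomes population codes grouped by superpopulation.
SUPERPOP_BY_POP = {
    **dict.fromkeys(["CEU", "GBR", "FIN", "IBS", "TSI"], "EUR"),
    **dict.fromkeys(["YRI", "LWK", "GWD", "MSL", "ESN", "ASW", "ACB"], "AFR"),
    **dict.fromkeys(["CHB", "JPT", "CHS", "CDX", "KHV"], "EAS"),
    **dict.fromkeys(["GIH", "PJL", "BEB", "STU", "ITU"], "SAS"),
    **dict.fromkeys(["MXL", "PUR", "CLM", "PEL"], "AMR"),
}


def top_superpop(pop_probs: dict[str, float]) -> str:
    """Return the superpopulation of the most probable population."""
    top_pop = max(pop_probs, key=pop_probs.get)
    return SUPERPOP_BY_POP[top_pop]


example = {"CEU": 0.7, "YRI": 0.2, "JPT": 0.1}
superpop = top_superpop(example)
print(superpop)
```

The returned label would then select which population's LD map to use for clumping.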

Weight PRS scores by gnomAD allele frequencies for specific ancestries and the corresponding ancestry probability

Should be done by 6/6

Finish PRS-CS standard way without Langevin Dynamics

PR'd, there's a test to fix

Take in ancestry PCs as PRS-CS covariates

@akotlar and @austinTalbot7241993 will talk about this on Monday

Genotype imputation

Hypothesis testing

Guy at UC Davis has a good implementation, so we're relying on that

Covariance matrix estimation

We have geodesics in (may use in domain adaptation), and we have pyRiemann PR'd.

This week we have been working on how well the covariance matrix estimation performs, and on conservativeness. Operator norm has the best MSE and the worst conservativeness; nuclear norm has the best conservativeness and OK MSE.

The difficulty is estimating the largest singular values, which needs to be done by looking at heterozygotes; we were hurting ourselves by looking at low-frequency hets with small effect sizes.

Summary

PRS + Proteomics

Will be done by 6/6 for prototype

ML

On track, good progress

akotlar added a commit to akotlar/bystro that referenced this issue Jun 6, 2024
@akotlar
Collaborator

akotlar commented Jun 7, 2024

2024-06-07

Bystro Sync

@akotlar

  • PRS web integration will be done on Monday
  • network analysis is done (@austinTalbot7241993 will help with next steps)
  • Sprint 13: PR Dave's PRS calculator
  • Sprint 13: will include the new GIN tasks (annotation parser)

@cristinaetrv

  • Ancestry and weighting PRS scores roll over

@IlhaH

  • With a significance level of .0005 we always have more power than POIROT; at .05 POIROT sometimes has more power. Even on smaller effect sizes (like 0.1), our method usually works.
  • Sprint 13: generate plots for performance under null distribution

@austinTalbot7241993

  • Sprint 13: One last idea for POE: we've been doing shrinkage on singular values, but Austin will test element-wise shrinkage on the singular vector. If this is multivariate Gaussian, then once we've made it uncorrelated we have independence, and we can do element-wise shrinkage assuming independence.
  • We have a method to compute p-values without inflating test statistics under the null. We may develop a few more of these methods and port them into Bystro.
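
The decorrelate-then-shrink idea can be sketched as below. This assumes a known covariance for the vector and uses soft thresholding as the element-wise shrinker; both are illustrative assumptions, as are the function names.

```python
import numpy as np


def soft_threshold(x: np.ndarray, t: float) -> np.ndarray:
    """Element-wise soft thresholding: shrink each entry toward 0 by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def whiten_and_shrink(v: np.ndarray, cov: np.ndarray, threshold: float) -> np.ndarray:
    """Decorrelate v, shrink each coordinate independently, then map back.

    With cov = L L^T (Cholesky), z = L^{-1} v has identity covariance when
    v has covariance cov; for Gaussian v, uncorrelated coordinates are
    independent, so they can be shrunk one at a time.
    """
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, v)
    z_shrunk = soft_threshold(z, threshold)
    return L @ z_shrunk


v = np.array([3.0, -0.5, 0.2])
cov = np.array([[1.0, 0.3, 0.0],
                [0.3, 1.0, 0.2],
                [0.0, 0.2, 1.0]])

shrunk = whiten_and_shrink(v, cov, threshold=1.0)
print(shrunk)
```

With an identity covariance this reduces to plain soft thresholding, and with a zero threshold it returns the input unchanged.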

akotlar added a commit to akotlar/bystro that referenced this issue Jun 8, 2024
akotlar added a commit to akotlar/bystro that referenced this issue Jun 8, 2024
akotlar added a commit to akotlar/bystro that referenced this issue Jun 10, 2024
austinTalbot7241993 pushed a commit that referenced this issue Jun 11, 2024
* Updates version of pandas, numpy, ray, torch, tqdm
* tqdm update resolve dependabot low severity issue
* Finish removing `attrs` from codebase
akotlar added a commit that referenced this issue Jun 12, 2024
…json to csv/tsv (#525)

* As GIN requested, add ability to convert ancestry JSON to CSV/TSV from the command line

CLI:
```sh
bystro-api ancestry convert --input-json /home/ubuntu/bystro/python/python/bystro/api/tests/ancestry_input.json --output foobar/ancestry_test_output.tsv --format csv
```

API:
```python
from bystro.api.ancestry import ancestry_json_to_format
ancestry_json_to_format(input_json_path='../api/tests/ancestry_input.json',
                        output_path='foo/bar.tsv', output_format='tsv')
```