Sprint 12 Task List #501

cristinaetrv · 2024-05-15T18:05:19Z

Due date: June 6 2024

Documentation

Separate out the annotator description (Perl side), including performance figures; describe every piece the repo has, including a Machine Learning subsection and a Bioinformatics tools subsection (installation first) - 6/6/24
GIF of how you would use general purpose ML library - 6/6/24

Proteomics

Review and complete Dennis's PRs (somascan upload, API endpoint to submit filtering prot jobs) @akotlar - 5/20/24
Run pQTL analysis and compare domain adaptation to TAMPOR results (compare plots) @akotlar - 5/22/24 - done; we didn't run pQTL analysis because the n=300 dataset doesn't have genetic data we have been able to access, yet. However, Cutler signed off on the results looking good and wants to publish a paper on the work.
Demo notebook similar to Dec demo but updated to show updates to filtering by annotation @akotlar - 5/22/24 - https://github.com/bystrogenomics/bystro/blob/069b8d8bf7e47a071f71a11f76af97e2e2af0f58/python/python/bystro/examples/ProteomicsProxiedQuery.ipynb
Add ability to query OpenSearch indices from outside the cluster - @akotlar 5/22/24 - https://github.com/bystrogenomics/bystro/blob/069b8d8bf7e47a071f71a11f76af97e2e2af0f58/python/python/bystro/examples/ExampleQueryToDataFrame.ipynb
Add ability to filter proteomic data from outside the cluster - @akotlar 5/22/24
Hook up the filtering by annotation to queue @akotlar - 5/24/24 - update 5/23/24; we don't need this now that we have filtering outside the cluster, until we have a concrete use case in the web app.
Filtering needs to be generalized to SomaScan @akotlar - 5/24/24
Generate network analysis results using SPPCA on ~300 sample dataset @akotlar - 5/29/24

PRS

Covariance Matrix Estimation/ML library
Goal: make more accurate predictions, build more tailored tests, and better control the false positive rate

Seeing how well a hypothesis test detects a spike in the data, when we assume there is signal in the data - 6/6/24 - WIP: Integrate covariance methods into poe analysis methods #513
(related to "Seeing how well...") Conservativeness tests - 6/6/24 - @IlhaH
Computing p values accurately tailored to distribution that you'd expect if there is no spike - 6/6/24
Generate more general use cases for ML library
(stretch) Finish POE method - 6/6/24 - @IlhaH @austinTalbot7241993
(sprint 13) Austin will implement some spherical p-value tests that are a direct POIROT competitor

akotlar · 2024-05-21T18:44:49Z

2024-05-21

Stats methods topic meeting

Computing p values accurately tailored to distribution that you'd expect if there is no spike - 6/6/24

We have merged the random matrix theory (RMT) PR; Ilha will be using and evaluating it.
RMT works by setting the ratio of the number of covariates to the sample size to a fixed value (p/n = c; c > 0) and letting both p and n go to infinity. This contrasts with classical statistical tests, which fix p and let n go to infinity. We're going to evaluate whether RMT works better.

In this regime, you look at the distribution of the eigenvalues under the null hypothesis of an identity covariance matrix (the Marchenko-Pastur distribution), and then check whether your largest eigenvalue falls within that range.
We will compare to Hotelling's T² test.
Ilha will be doing that comparison.
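
A minimal sketch of the check described above, not the merged RMT PR: it simulates data with an identity covariance, computes the largest sample eigenvalue, and compares it to the upper edge of the Marchenko-Pastur bulk, (1 + sqrt(c))^2 with c = p/n. The function names, sample sizes, and the spiked variance of 5 are assumptions made for illustration.

```python
import numpy as np


def mp_upper_edge(n_samples, n_features, sigma2=1.0):
    """Upper edge of the Marchenko-Pastur bulk, sigma2 * (1 + sqrt(c))^2 with c = p/n."""
    c = n_features / n_samples
    return sigma2 * (1.0 + np.sqrt(c)) ** 2


def largest_eigenvalue(X):
    """Largest eigenvalue of the sample covariance X^T X / n (columns assumed mean zero)."""
    S = X.T @ X / X.shape[0]
    return float(np.linalg.eigvalsh(S)[-1])  # eigvalsh returns ascending order


rng = np.random.default_rng(0)
n, p = 2000, 400  # illustrative sizes, c = p/n = 0.2

X_null = rng.standard_normal((n, p))  # null: identity covariance
X_spike = X_null.copy()
X_spike[:, 0] *= np.sqrt(5.0)  # alternative: one direction with variance 5 (made up)

edge = mp_upper_edge(n, p)
lam_null = largest_eigenvalue(X_null)
lam_spike = largest_eigenvalue(X_spike)

print(f"MP upper edge: {edge:.3f}")
print(f"largest eigenvalue, null:  {lam_null:.3f}")
print(f"largest eigenvalue, spike: {lam_spike:.3f}")
```

Under the null the largest eigenvalue sits near the edge, while the spiked direction pushes it well outside the bulk. A calibrated p-value would use the Tracy-Widom fluctuations of the largest eigenvalue rather than a raw comparison to the edge.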

Seeing how well a hypothesis test detects a spike in the data, when we assume there is signal in the data - 6/6/24

Ilha is making progress on this. He is running the first simulation now, trying different combinations of singular value shrinkage estimators and covariance matrix estimators, and will evaluate them via mean squared error on the simulated POE effect estimates.
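
To make the evaluation idea concrete, here is a generic stand-in, not Ilha's simulation: a made-up rank-2 signal plus unit-variance noise, denoised with two textbook singular value shrinkers (hard and soft thresholding) and scored by mean squared error against the known signal. The shrinkers, threshold, and dimensions are all assumptions chosen for illustration.

```python
import numpy as np


def hard_threshold(s, tau):
    """Keep singular values above tau unchanged, zero the rest."""
    return np.where(s > tau, s, 0.0)


def soft_threshold(s, tau):
    """Subtract tau from every singular value, flooring at zero."""
    return np.maximum(s - tau, 0.0)


def denoise(Y, shrinker, tau):
    """Apply a shrinker to the singular values of Y and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * shrinker(s, tau)) @ Vt


rng = np.random.default_rng(1)
n, p = 300, 100

# Rank-2 signal with singular values 60 and 40 (made-up values), plus unit-variance noise
U0, _ = np.linalg.qr(rng.standard_normal((n, 2)))
V0, _ = np.linalg.qr(rng.standard_normal((p, 2)))
signal = (U0 * np.array([60.0, 40.0])) @ V0.T
Y = signal + rng.standard_normal((n, p))

tau = np.sqrt(n) + np.sqrt(p)  # approximate largest singular value of pure noise

mse = {
    "none": float(np.mean((Y - signal) ** 2)),
    "hard": float(np.mean((denoise(Y, hard_threshold, tau) - signal) ** 2)),
    "soft": float(np.mean((denoise(Y, soft_threshold, tau) - signal) ** 2)),
}
for name, value in mse.items():
    print(f"{name}: {value:.4f}")
```

Both shrinkers remove most of the noise, but soft thresholding also pulls the true singular values down, which is the kind of trade-off a shrinker comparison has to weigh.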

akotlar · 2024-05-24T19:53:10Z

Covariance Matrix / POE

Conservativeness:

For some combinations of covariance matrix estimation / singular value shrinkers, we became over-conservative (bias decreases conditional on H0, but the true effect is shrunk conditional on H1).
- Which shrinkers were most conservative?
  - Operator norm, F4, N1 were similar
The big problem with the parent-of-origin estimator is that it is anti-conservative: conditional on H0, it gives non-zero effects (and takes hundreds of thousands of samples to converge to 0). We are trying to build a p-value test of whether the singular value distribution of the covariance matrix is spherical. We have precise estimates of big effects, but large bias for small effects, so we are putting shrinkers on the singular value estimates to kill off the small effects. The aim is to report true effects only when they're big. Based on Mike Epstein's responses in the past, Austin thinks it will be easy to sell something that gives you PoE estimates, even if it only reports the big ones.
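
A rough back-of-the-envelope for why convergence under H0 can take so many samples. This is an illustration under a strong assumption, not the PoE estimator itself: it treats the spurious effect as the excess of the largest sample eigenvalue over 1 when the true covariance is the identity, using the Marchenko-Pastur edge (1 + sqrt(p/n))^2. The value p = 100 and the tolerance are made up.

```python
import math


def null_excess(p, n):
    """Asymptotic excess of the largest sample eigenvalue over 1 under an identity covariance."""
    c = p / n
    return (1.0 + math.sqrt(c)) ** 2 - 1.0


def samples_needed(p, tol):
    """Smallest n with null_excess(p, n) <= tol, from solving (1 + sqrt(p/n))^2 = 1 + tol."""
    root_c = math.sqrt(1.0 + tol) - 1.0
    return math.ceil(p / root_c ** 2)


p = 100  # illustrative number of covariates
for n in (1_000, 10_000, 100_000):
    print(n, round(null_excess(p, n), 4))
print(samples_needed(p, tol=0.05))
```

Because the excess decays roughly like 2 * sqrt(p/n), halving it requires about four times as many samples, which is consistent with needing sample sizes in the hundreds of thousands before the null estimate looks close to zero.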

Publication Plan for Q3

SPPCA - once Dave Carlson is back from vacation he'll finish giving feedback; we're targeting this to be out by end of summer.
PoE draft, modulo Mike Epstein's students actually completing their UKBB analysis.
Platform paper - Bystro platform updates / generative AI discussion.

Sprint 13 plan update

Austin will implement some spherical p-value tests that are a direct POIROT competitor

Domain Adaptation

The domain adaptation test looks really successful at removing batch effects; TAMPOR does not appear to remove them, at least not entirely.

DomainAdaptationTest.ipynb.zip

akotlar · 2024-05-31T19:31:53Z

2024-05-31

PRS

Automatically launch PRS after ancestry from API server

Pushed back to last week of sprint

Display basic PRS results in webapp (table with individuals and their score) - @akotlar - 6/6/24 - #509

Same

Add batch processing for PRS C+T workflow with dosage matrix for memory issues

Under review
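
A minimal sketch of the batching idea, not the PR under review: the score is a weighted sum over variants, so it can be accumulated one block of rows at a time without holding the full dosage matrix in memory. The function name, the (variants x samples) layout, and the batch size are assumptions.

```python
import numpy as np


def prs_in_batches(dosage, betas, batch_size=1000):
    """Accumulate per-sample scores over row blocks of a (n_variants, n_samples) dosage matrix.

    `dosage` can be any array-like that supports row slicing (for example a numpy.memmap),
    so only one block is materialized at a time.
    """
    n_variants, n_samples = dosage.shape
    scores = np.zeros(n_samples)
    for start in range(0, n_variants, batch_size):
        stop = min(start + batch_size, n_variants)
        block = np.asarray(dosage[start:stop], dtype=np.float64)
        scores += betas[start:stop] @ block  # weighted sum of this block's dosages
    return scores


# Made-up data: 2,500 variants, 7 samples, dosages in {0, 1, 2}
rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=(2500, 7)).astype(float)
betas = rng.standard_normal(2500)
print(prs_in_batches(dosage, betas, batch_size=1000))
```

The batched result matches the one-shot product `betas @ dosage` up to floating-point rounding; peak memory is governed by `batch_size` rather than the number of variants.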

Need annotated AD stat summary to include ancestry

Done

Take in top hit from ancestry, convert to superpop, connect to LD map for corresponding pop for LD clump

Should be in testing by 6/6
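
One way this step could look, as a sketch only. The population-to-superpopulation codes are the standard 1000 Genomes ones (a subset is shown); the input format (a dict of population probabilities) and the `LD_PANEL` lookup are hypothetical and do not reflect Bystro's actual ancestry output or LD map files.

```python
# 1000 Genomes population codes mapped to their superpopulations (subset shown)
POP_TO_SUPERPOP = {
    "YRI": "AFR", "LWK": "AFR", "ESN": "AFR",
    "MXL": "AMR", "PUR": "AMR", "CLM": "AMR",
    "CHB": "EAS", "JPT": "EAS", "KHV": "EAS",
    "CEU": "EUR", "GBR": "EUR", "FIN": "EUR",
    "GIH": "SAS", "PJL": "SAS", "BEB": "SAS",
}

# Hypothetical lookup from superpopulation to an LD reference panel name
LD_PANEL = {sp: f"ld_panel_{sp}" for sp in ("AFR", "AMR", "EAS", "EUR", "SAS")}


def top_superpop(pop_probs):
    """Take the most probable population and convert it to its superpopulation."""
    top_pop = max(pop_probs, key=pop_probs.get)
    return POP_TO_SUPERPOP[top_pop]


def ld_panel_for(pop_probs):
    """Pick the LD panel used for clumping from the top ancestry hit."""
    return LD_PANEL[top_superpop(pop_probs)]


example = {"CEU": 0.62, "GBR": 0.30, "YRI": 0.08}  # made-up probabilities
print(top_superpop(example), ld_panel_for(example))
```

Taking the single top hit follows the task wording; summing probabilities within each superpopulation before picking the maximum is an alternative that can give a different answer when no single population dominates.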

Weight PRS scores by gnomAD allele frequencies for specific ancestries and the corresponding ancestry probability

Should be done by 6/6
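
One possible reading of this task, sketched under assumptions: mix the per-ancestry allele frequencies using an individual's ancestry probabilities, then center each dosage on its expected value (2 x AF) before applying the effect weights. The function names, array layout, and all numbers are made up, and the actual weighting scheme may differ.

```python
import numpy as np


def expected_allele_freq(af_by_ancestry, ancestry_probs):
    """Mix per-ancestry allele frequencies using one individual's ancestry probabilities.

    af_by_ancestry: (n_variants, n_ancestries); ancestry_probs: (n_ancestries,), sums to 1.
    """
    return af_by_ancestry @ ancestry_probs


def centered_prs(dosage, betas, af_by_ancestry, ancestry_probs):
    """PRS with each dosage centered on its ancestry-weighted expectation 2 * AF."""
    expected_dosage = 2.0 * expected_allele_freq(af_by_ancestry, ancestry_probs)
    return float(betas @ (dosage - expected_dosage))


# Made-up numbers: 3 variants, 2 ancestries
af = np.array([[0.10, 0.40], [0.25, 0.05], [0.50, 0.50]])
probs = np.array([0.8, 0.2])
dosage = np.array([1.0, 0.0, 2.0])
betas = np.array([0.2, -0.1, 0.05])

print(expected_allele_freq(af, probs))
print(centered_prs(dosage, betas, af, probs))
```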

Finish PRS-CS standard way without Langevin Dynamics

PR'd, there's a test to fix

Take in ancestry PCs as PRS-CS covariates

@akotlar and @austinTalbot7241993 will talk about this on Monday

Genotype imputation

Hypothesis testing

Guy at UC Davis has a good implementation, so we're relying on that

Covariance matrix estimation

We have geodesics in (may use in domain adaptation), and we have pyRiemann PR'd.

This week we have been working on how well the covariance matrix estimation performs, and on conservativeness. Operator norm has the best MSE and worst conservativeness, while nuclear norm has the best conservativeness and OK MSE.

The difficulty is estimating the largest singular values, which needs to be done by looking at heterozygotes; we were shafting ourselves by looking at low-frequency hets with small effect sizes.

Summary

PRS + Proteomics

Will be done by 6/6 for prototype

ML

On track, good progress

akotlar · 2024-06-07T19:24:01Z

2024-06-07

Bystro Sync

@akotlar

PRS web integration will be done on Monday
Network analysis is done (@austinTalbot7241993 will help with next steps)
Sprint 13: PR Dave's PRS calculator
Sprint 13: will include the new GIN tasks (annotation parser)

@cristinaetrv

Ancestry and weighting PRS scores roll over

@IlhaH

With a significance level of 0.0005 we're always better than POIROT; at 0.05 POIROT sometimes has more power. Even on smaller effect sizes (like 0.1), our method usually works.
Sprint 13: generate plots for performance under null distribution
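
For reference, a generic sketch of how rejection rates at a given significance level can be estimated by simulation. It uses a one-sided z-test as a stand-in, not our method or POIROT; the effect size, sample size, and number of simulations are made up.

```python
import numpy as np
from statistics import NormalDist


def rejection_rate(effect, n, alpha, n_sims=20000, seed=0):
    """Fraction of simulated one-sided z-tests that reject at level alpha.

    The z statistic for the mean of n unit-variance draws is effect * sqrt(n) + N(0, 1),
    so it is simulated directly instead of drawing every observation.
    """
    rng = np.random.default_rng(seed)
    z = effect * np.sqrt(n) + rng.standard_normal(n_sims)
    critical = NormalDist().inv_cdf(1.0 - alpha)
    return float(np.mean(z > critical))


# Under the null (effect = 0) the rejection rate should be close to alpha
print("null, alpha=0.05:   ", rejection_rate(0.0, 1000, 0.05))
# Under an alternative, the same function estimates power at each level
print("power, alpha=0.05:  ", rejection_rate(0.1, 1000, 0.05))
print("power, alpha=0.0005:", rejection_rate(0.1, 1000, 0.0005))
```

Plotting the null rejection rate against alpha over a grid of levels gives a quick view of whether a test is conservative (below the diagonal) or anti-conservative (above it).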

@austinTalbot7241993

Sprint 13: One last idea for POE: we've been doing shrinkage on singular values, but he will test element-wise shrinkage on the singular vector. If this is a multivariate Gaussian, then once we've made it uncorrelated we have independence, and we can do element-wise shrinkage assuming independence.
We have a method to compute p-values without inflating test statistics under null. We may develop a few more of these methods and port them into Bystro.
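
A small sketch of the decorrelate-then-shrink idea, under assumptions: Gaussian data with a known covariance is whitened with a Cholesky factor (for a multivariate Gaussian, uncorrelated coordinates are independent), then soft thresholding is applied element by element. The covariance, threshold, and function names are made up, and this is not the planned POE implementation.

```python
import numpy as np


def whiten(X, cov):
    """Decorrelate the rows of X using the Cholesky factor of cov (cov = L L^T)."""
    L = np.linalg.cholesky(cov)
    return np.linalg.solve(L, X.T).T


def soft_threshold(z, tau):
    """Element-wise soft thresholding: shrink toward zero by tau, zeroing small entries."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


rng = np.random.default_rng(2)
cov = np.array([[1.0, 0.8, 0.3],
                [0.8, 1.0, 0.5],
                [0.3, 0.5, 1.0]])  # made-up covariance

X = rng.multivariate_normal(np.zeros(3), cov, size=20000)
Z = whiten(X, cov)

print(np.round(np.cov(Z, rowvar=False), 2))  # close to the identity after whitening
print(soft_threshold(Z[0], tau=1.0))         # element-wise shrinkage of one whitened vector
```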


…json to csv/tsv (#525)

* As GIN requested, add ability to convert ancestry json to csv/tsv from command line

CLI:
```sh
bystro-api ancestry convert --input-json /home/ubuntu/bystro/python/python/bystro/api/tests/ancestry_input.json --output foobar/ancestry_test_output.tsv --format csv
```

API:
```python
from bystro.api.ancestry import ancestry_json_to_format

ancestry_json_to_format(input_json_path='../api/tests/ancestry_input.json', output_path='foo/bar.tsv', output_format='tsv')
```

cristinaetrv added the ".task list" label (A checklist of smaller tasks) May 15, 2024

cristinaetrv added this to the Sprint 12 milestone May 21, 2024

akotlar added a commit to akotlar/bystro that referenced this issue Jun 6, 2024

Issue bystrogenomics#501: Remove s3 dependence in ancstry/model.py

c20a8c3

akotlar added a commit to akotlar/bystro that referenced this issue Jun 8, 2024

Issue bystrogenomics#501: Add simple parsing function to explode on t…

1adacc0

…ranscript

akotlar added a commit to akotlar/bystro that referenced this issue Jun 8, 2024

Issue bystrogenomics#501: Update dependencies and remove attrs

0a2b168

akotlar added a commit to akotlar/bystro that referenced this issue Jun 10, 2024

Issue bystrogenomics#501: GIN request; add converter from ancestry js…

18c9592

…on to tsv/csv

cristinaetrv closed this as completed Jun 10, 2024

austinTalbot7241993 pushed a commit that referenced this issue Jun 11, 2024

[chore] Issue #501: Update dependencies and remove attrs (#524)

9bd677f

* Updates version of pandas, numpy, ray, torch, tqdm * tqdm update resolve dependabot low severity issue * Finish removing `attrs` from codebase
