Feature Request: Pseudocounts for drop-out variants #43

Open
hezscha opened this issue Feb 16, 2021 · 2 comments
hezscha commented Feb 16, 2021

Hi,
I'm running Enrich2 on a selection MAVE and have noticed that I am unable to get scores for some poorly performing variants because they tend to drop out at later time points during the selection. My PI was wondering whether we could alleviate this by introducing pseudocounts, but only for variants that were clearly present in the initial sample and then declined. We have 3-4 time points plus initial samples and are scoring with WLS regression.
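To make the idea concrete, here is a rough sketch of the rule we have in mind; the counts, variant names, and presence threshold are all invented for illustration:

```python
import pandas as pd

# Hypothetical count table: rows are variants, columns are time points.
counts = pd.DataFrame(
    {"t0": [250, 0, 80], "t1": [40, 15, 60], "t2": [0, 9, 45], "t3": [6, 2, 30]},
    index=["varA", "varB", "varC"],
)

# Add a pseudocount of 1, but only to variants clearly present at t0.
# The threshold of 10 reads is arbitrary here.
present_at_t0 = counts["t0"] >= 10
counts.loc[present_at_t0] += 1
```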

Do you know whether this is done at all for MAVE data, or, if not, what the objections are?
And is this something you would consider adding to Enrich2?

afrubin (Member) commented Feb 16, 2021

This is a question that has come up before, but, as you said, pseudocounts are not supported by Enrich2. I'll try to explain the reasoning behind not calculating these scores and provide a possible workaround.

> Do you know whether this is done at all for MAVE data, or, if not, what the objections are?

If you are using ratio-based scores, adding a pseudocount might perform well.

The issues arise with regression-based scores computed over many time points. If a variant drops out early, does it make sense to calculate a strong negative score based on the regression line intercepting the x-axis at the time of dropout? What if the variant drops out in the middle of the experiment and is then seen again at a later time point (due to sampling issues)? The log-linear fit will be very poor and potentially misleading.
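To make the failure mode concrete, here is a toy example (all numbers invented). With a zero count, the log ratio at that time point is undefined, and even with a pseudocount, a variant that reappears after dropping out still fits a log-linear model poorly:

```python
import numpy as np

# Invented counts for one variant that drops out at t2 and reappears at t3.
variant = np.array([250.0, 40.0, 0.0, 6.0])
wild_type = np.array([10000.0, 9500.0, 9000.0, 8800.0])

with np.errstate(divide="ignore"):
    log_ratio = np.log(variant / wild_type)
print(log_ratio)  # [-3.69 -5.47 -inf -7.29]: the -inf breaks any regression fit

# With a pseudocount of 1, every point is defined, but the rebound from
# t2 to t3 still makes the log-linear assumption a poor fit to the data.
log_ratio_pc = np.log((variant + 1.0) / (wild_type + 1.0))
```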

We were not able to determine a general solution to these issues, and did not have sufficient test data to approach the problem at the time, so we went ahead and filtered out these variants.

> And is this something you would consider adding to Enrich2?

Enrich2 is no longer under active development, but I have added this feature request to the successor project.

If you would like to add a pseudocount, my suggestion is:

  1. Count the variants using Enrich2 in counts-only mode (do not calculate scores).
  2. Open the HDF5 file in a Jupyter notebook or similar environment and add the pseudocount to all relevant count tables (a sketch of this step follows the list).
  3. Re-run Enrich2 using the same configuration file, but with score calculation enabled. It will automatically detect that the counts are already present, and use the modified counts to calculate variant scores.
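Here is a minimal sketch of step 2, assuming pandas with PyTables installed. The file name and table key below are placeholders; inspect your own file with `store.keys()` to find the right tables:

```python
import pandas as pd

# Hypothetical path and key; check store.keys() against your own file.
store_path = "my_selection_sel.h5"
key = "/main/variants/counts"

with pd.HDFStore(store_path) as store:
    counts = store[key]            # variants x time points count table
    counts = counts + 1            # naive global pseudocount of 1
    store.put(key, counts, format="table")
```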

Please let me know if you need extra assistance getting this set up. There are some example notebooks in the documentation that show how to open the HDF5 files, but the code may be out of date.

hezscha (Author) commented Feb 17, 2021

Thanks for the reply, Alan!
I see what you mean about regression becoming problematic. I have tested how well ratio-based and regression-based scores correlate for our data, and the correlation was quite good, so we might use ratio-based scores and add pseudocounts to the after-selection library following your suggestion.
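For reference, this is roughly how we compared the two score sets; the file names, table key, and score column name are placeholders and should be adjusted to your own output:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical files from two scoring runs of the same selection.
wls = pd.read_hdf("run_wls_sel.h5", "/main/variants/scores")["score"]
ratios = pd.read_hdf("run_ratios_sel.h5", "/main/variants/scores")["score"]

# Compare only variants scored in both runs.
shared = wls.index.intersection(ratios.index)
rho, pval = spearmanr(wls.loc[shared], ratios.loc[shared])
print(f"Spearman rho = {rho:.2f} (n = {len(shared)})")
```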
