filters/snps_bcftools_chX.csv #15

Open
zmaroti opened this issue Jan 24, 2024 · 3 comments

Comments

@zmaroti

zmaroti commented Jan 24, 2024

Hi,

Is it intentional that not all SNPs of the 1240K marker set are included in the filter files used to restrict imputed SNPs to the 1240K marker set? During the hdf5 data import the imputed GTs are filtered by these coordinates, so you always "lose" the non-included markers even though they are present in the imputed GTs.
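
To illustrate the effect described above, here is a minimal sketch of position-based filtering. It is not the tool's actual import code: the function name `restrict_to_filter` and the toy coordinates are hypothetical and only show why a marker missing from the filter list disappears even when it was imputed.

```python
def restrict_to_filter(imputed_positions, filter_positions):
    """Keep only imputed positions that also occur in the filter list (order preserved)."""
    allowed = set(filter_positions)
    return [pos for pos in imputed_positions if pos in allowed]


# Toy coordinates, made up for illustration only.
imputed = [100, 200, 300, 400]   # sites present in the imputed GTs
filter_list = [100, 300]         # sites listed in the filter file
kept = restrict_to_filter(imputed, filter_list)
dropped = [pos for pos in imputed if pos not in set(filter_list)]
```

In this toy case, positions 200 and 400 are dropped although genotypes exist for them.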

I am aware that some SNPs included in the 1240K CHIP have lower concordance with true shotgun WGS data. However, not all data come from the CHIP, so removing these markers from a purely WGS dataset is not required. Moreover, we are talking about imputed GTs anyway, where the other markers in LD were already used to infer the diploid phased haplotypes over a large genomic chunk based on gold-standard WGS reference data (assuming the non-concordant SNPs sit at random positions, imputation should already fix their errors). Accordingly, if these files contain fewer markers in order to avoid "bad markers", this kind of marker removal should happen prior to the imputation step for CHIP data, not in the IBD identification step. That way imputation should improve for CHIP data while true shotgun WGS data is unaffected. Furthermore, that approach would not thin the markers at the IBD detection step for either WGS or CHIP data, and you should still be able to co-analyze mixed datasets.

But again, I reserve the right to be dumb/ignorant, and it may well be that I am unaware of some other valid reason to remove these markers. Could you please elaborate on why ~50k (~4.4%) autosomal 1240K markers are excluded at filtering?

Regards,
Zoltan

@hringbauer
Owner

There is no deeper reason for this - we had to choose one of the "1240k SNP sets" out there. Those 1240k SNP sets are curated and filtered to various degrees.

We picked a more filtered SNP set used by several aDNA labs to be sure that everyone has a superset in their imputed vcf (potentially already downsampled to some 1240k set).

You could also keep the "full" 1240k SNP set in the hdf5 creation but, in practice, a few percent more or less SNPs (after using all the data in imputation) should make very little difference.
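
To judge how much two SNP sets actually differ before swapping one for the other, a rough check along these lines may help. This is a sketch under stated assumptions: markers are represented as hypothetical `(chromosome, position)` tuples, the data is a toy example, and how the sets are loaded from files is left out.

```python
def marker_difference(full_set, filtered_set):
    """Return (count, fraction) of markers in full_set that are absent from filtered_set."""
    full = set(full_set)
    missing = full - set(filtered_set)
    fraction = len(missing) / len(full) if full else 0.0
    return len(missing), fraction


# Toy (chromosome, position) markers, made up for illustration only.
full = {("1", 100), ("1", 200), ("1", 300), ("2", 50)}
filtered = {("1", 100), ("1", 300), ("2", 50)}
n_missing, fraction = marker_difference(full, filtered)
```

Here one of four markers is missing from the filtered set, so `fraction` is 0.25.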

Only keep in mind that if you choose a drastically different SNP set, our "default" parameters will not be optimal anymore, and our "testing" results, including recommended thresholds, will not apply any longer.

@zmaroti
Author

zmaroti commented Jan 31, 2024 via email

@zmaroti
Author

zmaroti commented Feb 2, 2024 via email
