Significant difference in accuracy when changing the reference from CHM13 to HG38 #10

Closed
daanishmahajan opened this issue Jul 31, 2024 · 5 comments
@daanishmahajan

@canfirtina,

We ran two experiments with the following setup:

  • Target region: chromosome 21. Non-target region: chromosome 22.
  • Squiggles were simulated with Squigulator.
  • In the first experiment the squiggles were sampled from HG38, and in the second from CHM13. In each experiment, the reference index was built from the same reference that was used to sample the squiggles.

The results are as follows (TP = true positive, FN = false negative, TN = true negative, FP = false positive):

Experiment 1 (HG38):

  • TP / Total reads from chr21 = 0.0712810027938642
  • FN / Total reads from chr21 = 0.17252379066356952
  • TN / Total reads from chr22 = 0.07641977252154748
  • FP / Total reads from chr22 = 0.13858278562714

Experiment 2 (CHM13):

  • TP / Total reads from chr21 = 0.9385569022299598
  • FN / Total reads from chr21 = 0.02883090673201672
  • TN / Total reads from chr22 = 0.9253183177624721
  • FP / Total reads from chr22 = 0.044481680289140665
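
For readers who want to reproduce this kind of table, the four ratios could be computed from per-read labels roughly as follows. This is a minimal sketch under stated assumptions, not the script used in the experiments above. It assumes each read has already been reduced to a pair of (true chromosome, mapped chromosome), with `None` for an unmapped read, and that unmapped reads are counted in a separate bucket. That last assumption is only a guess at why the TP and FN ratios reported above do not sum to 1. The function name `classify_reads` and the input layout are hypothetical.

```python
from collections import Counter


def classify_reads(reads, target="chr21", non_target="chr22"):
    """Compute per-chromosome TP/FN/TN/FP ratios.

    reads: iterable of (true_chrom, mapped_chrom) pairs, where mapped_chrom
    is None for an unmapped read (assumed layout, not a RawHash format).
    """
    counts = Counter()
    totals = Counter()
    for true_chrom, mapped_chrom in reads:
        totals[true_chrom] += 1
        if mapped_chrom is None:
            # Assumption: unmapped reads are tracked separately.
            counts[(true_chrom, "unmapped")] += 1
        elif true_chrom == target:
            counts["TP" if mapped_chrom == target else "FN"] += 1
        elif true_chrom == non_target:
            counts["FP" if mapped_chrom == target else "TN"] += 1

    def ratio(key, chrom):
        # Guard against an empty class to avoid division by zero.
        return counts[key] / totals[chrom] if totals[chrom] else 0.0

    return {
        "TP": ratio("TP", target),
        "FN": ratio("FN", target),
        "TN": ratio("TN", non_target),
        "FP": ratio("FP", non_target),
        "unmapped_target": ratio((target, "unmapped"), target),
        "unmapped_non_target": ratio((non_target, "unmapped"), non_target),
    }
```

Reporting the unmapped fractions next to the four ratios makes it easier to see whether a low TP rate comes from wrong mappings or from reads that did not map at all.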

One possible reason for this large gap is the incompleteness of HG38. Can you think of any other explanation?

Thanks

@daanishmahajan
Author

Hi @canfirtina, a gentle reminder about this issue.

Thanks

@canfirtina canfirtina self-assigned this Aug 9, 2024
@canfirtina
Member

Hi @daanishmahajan,

Thanks for bringing this up. Your reasoning could be correct.

We have not tested RawHash (or RawHash2) on hg38, but I will look into this more closely in the coming days and get back to you with a more detailed analysis.

In the meantime:

  1. Could you let me know the exact version of hg38 you used (e.g., hg38p13)?
  2. On a similar issue, I remember running UNCALLED on hg38, which was too slow to complete the entire mapping process. After I started using CHM13, UNCALLED was able to finish the mapping. This could again be due to the incompleteness of the genome, which might be increasing the complexity of the algorithm behind UNCALLED. However, to resolve such potential issues, UNCALLED suggests masking repeats: https://github.com/skovaka/UNCALLED/tree/master/masking. Could you please try and see if the results change after masking repeats?

@daanishmahajan
Author

Hi @canfirtina,

  1. I am using the 2013 version downloaded from here: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
  2. Sure, I will do that and get back to you ASAP.

@daanishmahajan
Author

Hi @canfirtina,
I went through the experiment again and found an error in it: I was simulating squiggles with the R10 pore chemistry but indexing and mapping with the R9 pore chemistry.

I am really sorry for the trouble! With HG38 as the reference, I now get results comparable to those obtained with CHM13 as the reference.
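
A mismatch like this can be caught before any mapping is run by recording the pore chemistry used at each stage and checking that they agree. The sketch below is only an illustration of that idea; the function name `check_pore_consistency` and the stage labels are hypothetical and are not part of RawHash or Squigulator.

```python
def check_pore_consistency(stages):
    """Raise ValueError if the pipeline stages use different pore chemistries.

    stages: dict mapping a stage name (e.g. "simulation") to a pore
    chemistry label (e.g. "R9" or "R10"). Both are user-supplied labels.
    """
    # Normalize labels so that "r10" and " R10 " compare equal.
    normalized = {name: label.strip().upper() for name, label in stages.items()}
    if len(set(normalized.values())) > 1:
        details = ", ".join(f"{n}={l}" for n, l in sorted(normalized.items()))
        raise ValueError(f"Pore chemistry mismatch: {details}")
    # Return the shared label, or None when no stages were given.
    return next(iter(normalized.values()), None)
```

For example, calling it with `{"simulation": "R10", "indexing": "R9", "mapping": "R9"}` raises a `ValueError` that lists the chemistry recorded for each stage.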

Thanks

@canfirtina
Member

Hi @daanishmahajan,

No worries, I am glad things worked out!
