Misleading allotetraploid sturgeon k-mer histogram #8

Open
VictorHerp opened this issue Jan 14, 2022 · 1 comment

@VictorHerp

Hi Hannes,
I would like to ask for your advice regarding a problem with the k-mer histogram of a sturgeon species that is probably allotetraploid (Acipenser naccarii).

The histogram looks unusual: it does not resemble a typical allotetraploid histogram, and the Tetmer model does not fit it correctly, so the estimates of theta, T, nucleotide divergence, etc. are unexpected (adjusting the fit manually did not change anything). Although I have good coverage (89x, that is, ~22x per haploid genome), the first peak of the histogram does not separate from the contamination peak (figure 1 attached). I suspected a coverage problem, so I artificially increased the coverage to 142x by combining the reads of two individuals, but the problem persists. I wonder whether what Tetmer marks as the first peak is actually the second peak, and the true first peak is the one I circled in red in figure 2, also attached to this issue. Do you think this is a coverage problem? Or is there something else that I am missing?
Another thing to consider is the evolutionary history of this species, which could bias the analysis. It is hypothesized that this species reached octoploidy before undergoing diploidization. It is therefore currently considered tetraploid, although some loci could still be octoploid or even diploid, because diploidization proceeds at different rates across the genome.
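
One way to check which multiplicities carry peaks, independently of the model fit, is to list the local maxima of the histogram. The snippet below is only a minimal sketch: `find_peaks`, its `min_multiplicity` cutoff and its `window` size are hypothetical names and values chosen for illustration, and the spectrum is assumed to be available as (multiplicity, count) pairs. It shows where peaks are, not whether a peak is data or contamination.

```python
def find_peaks(spectrum, min_multiplicity=5, window=2):
    """Return multiplicities whose count exceeds all neighbours within `window`.

    spectrum: iterable of (multiplicity, count) pairs.
    min_multiplicity: skip the low-multiplicity error k-mers (assumed cutoff).
    """
    counts = dict(spectrum)
    peaks = []
    for mult in sorted(counts):
        if mult < min_multiplicity:
            continue
        # Counts at the surrounding multiplicities (missing bins count as 0).
        neighbours = [
            counts.get(m, 0)
            for m in range(mult - window, mult + window + 1)
            if m != mult
        ]
        if all(counts[mult] > n for n in neighbours):
            peaks.append(mult)
    return peaks
```

Real spectra are noisy, so a larger `window` or some smoothing may be needed before the result is meaningful.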

I can send you the histograms that I used with Tetmer if you want to have a look. I look forward to your answer, thank you!

Víctor Muñoz

Figure 1
Figure 2

@hannesbecher
Owner

Hi Víctor,

Thanks for getting in touch!

First of all, I don’t recommend merging k-mer data sets from multiple individuals. Unless the individuals are clones, they are likely to carry different genetic variants, which generate additional peaks in your spectrum. It is also unlikely that all samples were sequenced at exactly the same depth, so their peaks would not align. So, let’s focus on your top (single-individual) plot.
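
To illustrate the depth mismatch, here is a minimal sketch (it is not part of Tetmer). It assumes a simplified, noise-free model in which a k-mer present in `copies_a` genome copies of one individual and `copies_b` copies of the other sits at multiplicity `copies_a * depth_a + copies_b * depth_b`. The function name and the example depths are assumptions; the 53x for the second individual is only inferred as 142x minus 89x.

```python
from itertools import product


def merged_peak_positions(depth_a, depth_b, ploidy=4):
    """Expected peak positions when two individuals' reads are pooled.

    depth_a, depth_b: per-haploid sequencing depth of each individual.
    Returns the sorted distinct multiplicities over all copy-number pairs.
    """
    positions = set()
    for copies_a, copies_b in product(range(ploidy + 1), repeat=2):
        if copies_a == 0 and copies_b == 0:
            continue  # k-mer absent from both individuals
        positions.add(copies_a * depth_a + copies_b * depth_b)
    return sorted(positions)


# Equal depths: peaks stay on a regular grid of 8 positions.
print(len(merged_peak_positions(20, 20)))
# Unequal depths (89x/4 and an assumed 53x/4): 24 distinct positions.
print(len(merged_peak_positions(89 / 4, 53 / 4)))
```

With unequal depths, every copy-number combination lands at its own multiplicity, so the pooled spectrum has many more, closely spaced peaks than a single-individual spectrum.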

You have cut off the y-axis a bit low, so it is hard to see the first peak (multiplicity approx. 12). It would be good to know whether this is a data peak or contamination. One way to check would be to generate a quick and dirty assembly and to run blobtools on it. Even quicker would be to use smudgeplot. There are then two options:

The multiplicity 12 peak is due to contamination: This would be unfortunate. Because this peak overlaps with the true 1x peak, fitting parameters is unreliable. You could still try and I’d be happy to help.

The multiplicity 12 peak is a data peak: You are dealing with an octoploid spectrum. Given what you told me about the species' history, this seems plausible. Tetmer is not made for octoploids but could be extended to auto-octoploids. I’d be interested to try this. An allo-octoploid model would be too complicated because there are too many possible homology relationships among eight genomes (and in reality you are probably dealing with some intermediate state).
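
As a rough sanity check on the two options above, one can list where peaks would be expected under each ploidy. This is only a sketch: `expected_peaks` is a hypothetical helper, and it treats the quoted 89x as if it were k-mer coverage, which is an assumption (k-mer coverage is somewhat lower than read coverage).

```python
def expected_peaks(total_coverage, ploidy):
    """Expected peak multiplicities for k-mers present in 1..ploidy copies."""
    monoploid = total_coverage / ploidy  # coverage of a single genome copy
    return [monoploid * copies for copies in range(1, ploidy + 1)]


print(expected_peaks(89, 4))  # [22.25, 44.5, 66.75, 89.0]
print(expected_peaks(89, 8))  # first peak at 11.125
```

Under the octoploid reading the 1x peak would fall near multiplicity 11, which is close to the peak at approx. 12. That is suggestive only and does not rule out contamination.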

If you send your spectrum to the contact email given in the Tetmer paper, I'm happy to take a look.

All the best,
Hannes
