Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to handle Element AVITI data (16S amplicon sequencing) #1893

Closed
mniku opened this issue Feb 23, 2024 · 15 comments
Closed

How to handle Element AVITI data (16S amplicon sequencing) #1893

mniku opened this issue Feb 23, 2024 · 15 comments

Comments

@mniku
Copy link

mniku commented Feb 23, 2024

I'm processing our first Element AVITI 16S amplicon sequencing data, using dada2 in QIIME2. I'm wondering how to do this optimally, as it appears that it behaves a bit differently from MiSeq data in dada2:

The phred quality stays much higher than in MiSeq usually, but to my surprise, I still need to truncate the reads just as short to get similar % accepted in DADA2. For more details & actual statistics, see this thread in the QIIME2 forums.

I was first wondering if the AVITI phred scores are a tad optimistic, but got a potentially interesting comment from AVITI bioinformatician. They think the scoring should be quite comparable vs MiSeq BUT: ”There are differences in the distribution of q scores within a read, however, which would be relevant to DADA2 filtering. AVITI data will have greater q score variance within a read--in AVITI data you would be more likely to find a single low q score in the middle of a high quality read.”

Therefore, they recommend we try tweaking maxEE. Does this sound like a good idea? How should we evaluate/validate the results?

We have a huge number of reads to begin with so that’s no problem, and by truncating close to minimum required overlap we get comparable % of reads through vs. MiSeq. But it feels stupid to throw away high quality data just because I don’t completely understand what’s going on.

The QIIME2 guys recommended me to open an issue here, because this calls for a deep level understanding of dada2.

@benjjneb
Copy link
Owner

”There are differences in the distribution of q scores within a read, however, which would be relevant to DADA2 filtering. AVITI data will have greater q score variance within a read--in AVITI data you would be more likely to find a single low q score in the middle of a high quality read.”

That does make sense assuming it is true. maxEE is calculating the average expected errors in a read. This is different than calculating average quality score, as quality scores are first transformed into their probability of representing an error (P = 10^-(Q/10)), which makes isolated low quality scores much more impactful to a maxEE filter than an average Q-score filter. There was a paper on this approach that gives more reasoning:

Therefore, they recommend we try tweaking maxEE. Does this sound like a good idea? How should we evaluate/validate the results?

We have a huge number of reads to begin with so that’s no problem, and by truncating close to minimum required overlap we get comparable % of reads through vs. MiSeq. But it feels stupid to throw away high quality data just because I don’t completely understand what’s going on.

Throwing away lower quality data is not stupid. At the end of amplicon sequencing, the parameters we are estimating are the relative abundances -- the absolute number of reads that get through the pipeline is not important beyond how it affects the measurement of the relative abundances. So, pushing through more but lower-quality data can backfire, in that while more reads make it to the end, the measurements of the relative abundance get worse because of the increased noise. Of course there is some optimal level which will vary among datasets, but in general trying to maximize reads passing filtration is not that important. More important is to avoid any systematic loss of data (e.g. of longer amplicons due to merging, etc.). With that said, when I look at the qza's that you posted in the Q2 thread, I'm more concerned abotu the large loss at chimera removal, e.g. nearly 50% loss samples SF1_01 and ZYMO. This is beyond what is expected even for high PCR cycle numbers, and is usually (90%+) due to unremoved technical bases (primers most often, but sometimes barcodes/adapters) that have ambiguous nucleotides in them.

@mniku
Copy link
Author

mniku commented Feb 28, 2024

Many thanks Benjamin!

Yes, the chimera% is indeed high. I have verified that we did remove all the technical bases (using cutadapt with the 341F & 785R primer sequences, and checking by zgrep that this was succesful). And not especially high cycling either (14 cycles in our lab with the 16S primers, and I think 18 cycles in the sequencing core with the technical primers). So I don't quite understand what's the problem. I think we have experienced this also in MiSeq previously. But we always get very accurately the expected composition of the ZymoBiomics Microbial Community Standard (of course super simplistic).

@Avi-Til
Copy link

Avi-Til commented Jul 4, 2024

Hi @mniku,
We are also comparing MiSeq vs AVITI Dataset from running the same physical library pool on Illumina and AVITI. We also noticed a significant difference in the number of reads discarded at the Chimera removal step in the AVITI Run but not the Illumina Run. Were you able to identify the cause and a solution?
Were there any important observations from tweaking the maxEE value?

I would greatly appreciate your input. Thanks!

To add on, on observation of the DADA2 statistics, our AVITI Dataset had nearly 85-89% filter pass through, 42-44% merged reads, and only 23-27% passed Chimera removal, which I think is mildly different from your DADA2 statistics.

@mniku
Copy link
Author

mniku commented Jul 4, 2024

Good to hear (I guess) that we are not alone in this! I’m afraid we didn’t really SOLVE the issue. We just optimized the truncation carefully so that we got the most of the data and went with that (there was more than enough accepted reads anyway, as the sequencing was much deeper than with MiSeq). We didn’t try tweaking maxEE because we didn’t quite know how to validate the results of the tweaking.

@IrshadUlHaq1
Copy link

@mniku
The genomic center at our institute proposed that we use their new AVITI technology for amplicon sequencing (16S rRNA gene and the ITS). I thought it was a good proposal because the sequencing depth is a no-brainer. However, I wanted to double check if people have been using QIIME2 for AVITI datasets, and on QIIME2 forum I came across your thread and the potential caveat you mentioned while using DADA2. That brought me here, and I just need your opinion on whether I should go for the AVITI chemistry or stick with the Miseq. You mentioned that you optimized the truncation carefully to retain most of the data. Can you elaborate on that optimized truncation a little bit? I have no preference as long as I reach my goal without any hiccups and deep troubleshooting.

@mniku
Copy link
Author

mniku commented Jul 18, 2024

I’d probably go with AVITI if that is more cost effective. I wouldn’t say the quality (in terms of DADA2) was inferior to Miseq, it just seemed worse (in DADA2) than we expected based on the q scores (= more reads were discarded than expected based on the very high q scores). I’m guessing this might be (at least in part) because the DADA2 algorithm is optimized for Miseq (see above; but @benjjneb is of course the expert). I’m hoping this just means that we lose a bit more data (which might be kind of OK due to the great sequencing depth) and doesn’t generate other issues. We always include the Zymobiomics microbial community composition standard and that was fine. With the optimization of truncation, I mean that we tried several different truncation lengths for a subset of data and selected the settings which gave us the highest numbers of accepted reads. I never really did that this way previously and only now noticed that it can make quite a big difference (I mean, even within the obvious limits set by minimal overlap and the eyeballing of the q scores).

@andressamv
Copy link

Following up here, I am curious if we can follow the same steps for merging different runs (before chimera removal) for AVITI and MiSeq data. @benjjneb, do you think we can combine these datasets if the libraries are prepared the same way?

@benjjneb
Copy link
Owner

If the processed amplicons start/end at the same position, then yes the ASV tables can be combined. I would consider the sequencing technology as a batch effect when evaluating statistical models though.

@IrshadUlHaq1
Copy link

@mniku I am having the same problem. DADA2 is getting rid of most of the sequences at the chimera stage. Do you have any advice for me. I have removed the primers using cutadapt but do you know if I need more processing of the input data prior to the DADA2 step? How do you use the zgrep? Did you also use the '--p-adapter-f' and '--p-adapter-r' options in your qiime2 cutadapt script?

Following are the primers the sequencing center has used:
V4_515F_Nextera: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTGCCAGCMGCCGCGGTAA

V4_806R_Nextera: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGGACTACHVGGGTWTCTAAT

Thanks

@Avi-Til
Copy link

Avi-Til commented Nov 13, 2024

@IrshadUlHaq1 While I have not used cutadapt, I fixed my issue by using trimming parameters that I would have used if I had done an Illumina run. Maybe you could try that. I typically use (5, 220) for both forward and reverse trimming parameters. Let me know if that works.

@mniku
Copy link
Author

mniku commented Nov 13, 2024

@IrshadUlHaq1 we haven't so far found any better solution than to use truncate with more stringent phred score threshold than we would use for MiSeq data. And it's still removing a lot. But I'm afraid I can't comment on your specific data because I don't know the adaptors etc. used by your sequencing facility. I use zgrep to check for remaining primer sequences, for example like this (of course replace with your primer sequence):
zgrep -c 'CCTACGGG[ATCG]GGC[AT]GCAG' *TRIMMED_CUTADAPT_FW.gz

@benjjneb is there any new insight on dada2 with AVITI data? We recently also ran into the problem that we have far too MUCH data vs. what is actually needed for amplicon sequencing -> dada2 takes AGES even on supercomputer. How would you filter the data to make it smaller in a non-biased way? How about using a super tight maxEE?

@Avi-Til I don't think one can just blindly apply some constant truncate for any data. Depends on many factors and should be decided based on evaluation of the phred score plots. Typically reverse reads have lower quality and should therefore be truncated shorter than forward. But of course your data could be different.

@Avi-Til
Copy link

Avi-Til commented Nov 13, 2024

Hi @mniku,

I agree with you. I apologize as I did not explain my rationale sufficiently. The 515F-806R region is 291bp long including the primers. Excluding the primers of roughly 20nt each, the expected read length is 251bp. This length is fine when using an Illumina 2x250bp sequencing, but when using the AVITI 2x300bp, this is too long of a read length, and hence the adapters or primers get sequenced on each 3' end of the forward and reverse reads. As suggested cutadapt can be used to remove the 3' end primer/adapter sequences, but knowing the product size, previous experience with Illumina datasets, and having sequenced quite a few libraries on both Illumina and AVITI parallel to test for consistency across sequencing methods, I made the suggestion of using (5, 220) not as a final fix, but as a simple troubleshooting step for them to report back so I can get a better understanding of how the sequences in the dataset look like. Given the Q40 AVITI reads, if the testing with these trimming parameters shows that the sequence discarding issue was non-recurring, it could be that the issue is with cutadapt.

However I agree with you, and in the multiple Illumina datasets I've analyzed I've found (5, 249) and (5, 221) to be a generalizable starting point for forward and reverse trimming parameters for an average quality Illumina dataset (parameters determined by base at which median score drops below Q25). However, my experience with AVITI data from our sequencing center has been that it is consistently Q40 throughout the whole read, and does not have the drop in quality scores in the reverse reads as seen in Illumina. As a result, I've found no issues truncating the AVITI reads to the more shorter and stringent Illumina parameters as a means to skip using cutadapt, of course, informed by the expected amplicon size. This suggestion may not be valid for say, a V3-V4 sequencing datasets.

@IrshadUlHaq1
Copy link

IrshadUlHaq1 commented Nov 13, 2024

All,
Thank you for your input and time. I am now running dada2 with --p-trunc-len-f 265 and --p-trunc-len-r 232, which is more strict than the previous ----- f 276 and ----- r 260. I will update you if it improves the numbers. By the way the demux plots show that the forwards reads are consistently close to Q40 and Q45 but the reverse reads falls to Q30 and lower after position 230. The following plots represent demux-trimmed, where primers were trimmed using cutadapt.

image

@IrshadUlHaq1
Copy link

@Avi-Til I can confirm that ---f 265 and ----r 232 significantly increased the percentage of sequences at the 'filtered' stage that trickled down to the 'non-chimeric' stage. I wonder if I should continue more stricter iteration to optimize it further. What is the gold standard for sequences retention? 50% and above for most samples?

@mniku What criteria did you follow to settle for your final 'trunc' parameters?

@Avi-Til
Copy link

Avi-Til commented Nov 13, 2024

Since you are doing a 515F/806R amplicon, which means the expected sequence length is around 250. So any bases above 250-ish (give or take a few for variations) are basically adapters or primer sequences. So I would keep both the truncation parameters below 250bp value. I expect you will see a further massive improvement when you drop your forward truncation value from 265 to below 250. In my datasets, I was able to obtain an 85% pass-rate on average on a recent AVITI dataset (but this may vary by dataset).

Normally for Illumina our lab chooses a truncation parameter where the median quality score drops to Q30 or Q25 (worst case). For AVITI, the quality is so high that it does not seem to be a concern. If you choose to follow the Q30 guideline, your truncation parameter for both forward and reverse reads will be 250, allowing you to use the full amplicon length. Instead, you can also rationalize that AVITI has improved sequencing standards, and that warrants a revisit of this threshold, and claim your new threshold is Q40. If you choose to do that, then the truncation parameters I would choose would be 250 for forward, and 230 for reverse. I don't expect there to be a massive difference between sticking to Q30 or revising the threshold to Q40. And in our lab we take a pragmatic approach, and mention the threshold we chose in our papers.

The process of choosing a truncation parameter from the knowledge of the expected amplicon size just negates the need for cutadapt. Cutadapt is more useful for shotgun sequencing where the expected lengths may not be known. We did a bunch of parallel sequencing of the same library on both AVITI and Illumina and found that apart from the singleton species discovered by the ultra-deep sequencing of AVITI, the results were comparable, and choosing the truncation parameters from the knowledge of the amplicon size cut down on the DADA2 analysis time and memory usage, eliminated the need to use cutadapt. DADA2 seems to take the most amount of time in the chimera removal step if the truncation parameters are greater than 250

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants