-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to handle Element AVITI data (16S amplicon sequencing) #1893
Comments
That does make sense assuming it is true.
Throwing away lower quality data is not stupid. At the end of amplicon sequencing, the parameters we are estimating are the relative abundances -- the absolute number of reads that get through the pipeline is not important beyond how it affects the measurement of the relative abundances. So, pushing through more but lower-quality data can backfire, in that while more reads make it to the end, the measurements of the relative abundance get worse because of the increased noise. Of course there is some optimal level which will vary among datasets, but in general trying to maximize reads passing filtration is not that important. More important is to avoid any systematic loss of data (e.g. of longer amplicons due to merging, etc.). With that said, when I look at the qza's that you posted in the Q2 thread, I'm more concerned abotu the large loss at chimera removal, e.g. nearly 50% loss samples |
Many thanks Benjamin! Yes, the chimera% is indeed high. I have verified that we did remove all the technical bases (using cutadapt with the 341F & 785R primer sequences, and checking by zgrep that this was succesful). And not especially high cycling either (14 cycles in our lab with the 16S primers, and I think 18 cycles in the sequencing core with the technical primers). So I don't quite understand what's the problem. I think we have experienced this also in MiSeq previously. But we always get very accurately the expected composition of the ZymoBiomics Microbial Community Standard (of course super simplistic). |
Hi @mniku, I would greatly appreciate your input. Thanks! To add on, on observation of the DADA2 statistics, our AVITI Dataset had nearly 85-89% filter pass through, 42-44% merged reads, and only 23-27% passed Chimera removal, which I think is mildly different from your DADA2 statistics. |
Good to hear (I guess) that we are not alone in this! I’m afraid we didn’t really SOLVE the issue. We just optimized the truncation carefully so that we got the most of the data and went with that (there was more than enough accepted reads anyway, as the sequencing was much deeper than with MiSeq). We didn’t try tweaking maxEE because we didn’t quite know how to validate the results of the tweaking. |
@mniku |
I’d probably go with AVITI if that is more cost effective. I wouldn’t say the quality (in terms of DADA2) was inferior to Miseq, it just seemed worse (in DADA2) than we expected based on the q scores (= more reads were discarded than expected based on the very high q scores). I’m guessing this might be (at least in part) because the DADA2 algorithm is optimized for Miseq (see above; but @benjjneb is of course the expert). I’m hoping this just means that we lose a bit more data (which might be kind of OK due to the great sequencing depth) and doesn’t generate other issues. We always include the Zymobiomics microbial community composition standard and that was fine. With the optimization of truncation, I mean that we tried several different truncation lengths for a subset of data and selected the settings which gave us the highest numbers of accepted reads. I never really did that this way previously and only now noticed that it can make quite a big difference (I mean, even within the obvious limits set by minimal overlap and the eyeballing of the q scores). |
Following up here, I am curious if we can follow the same steps for merging different runs (before chimera removal) for AVITI and MiSeq data. @benjjneb, do you think we can combine these datasets if the libraries are prepared the same way? |
If the processed amplicons start/end at the same position, then yes the ASV tables can be combined. I would consider the sequencing technology as a batch effect when evaluating statistical models though. |
@mniku I am having the same problem. DADA2 is getting rid of most of the sequences at the chimera stage. Do you have any advice for me. I have removed the primers using cutadapt but do you know if I need more processing of the input data prior to the DADA2 step? How do you use the zgrep? Did you also use the '--p-adapter-f' and '--p-adapter-r' options in your qiime2 cutadapt script? Following are the primers the sequencing center has used: V4_806R_Nextera: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGGACTACHVGGGTWTCTAAT Thanks |
@IrshadUlHaq1 While I have not used cutadapt, I fixed my issue by using trimming parameters that I would have used if I had done an Illumina run. Maybe you could try that. I typically use (5, 220) for both forward and reverse trimming parameters. Let me know if that works. |
@IrshadUlHaq1 we haven't so far found any better solution than to use truncate with more stringent phred score threshold than we would use for MiSeq data. And it's still removing a lot. But I'm afraid I can't comment on your specific data because I don't know the adaptors etc. used by your sequencing facility. I use zgrep to check for remaining primer sequences, for example like this (of course replace with your primer sequence): @benjjneb is there any new insight on dada2 with AVITI data? We recently also ran into the problem that we have far too MUCH data vs. what is actually needed for amplicon sequencing -> dada2 takes AGES even on supercomputer. How would you filter the data to make it smaller in a non-biased way? How about using a super tight maxEE? @Avi-Til I don't think one can just blindly apply some constant truncate for any data. Depends on many factors and should be decided based on evaluation of the phred score plots. Typically reverse reads have lower quality and should therefore be truncated shorter than forward. But of course your data could be different. |
Hi @mniku, I agree with you. I apologize as I did not explain my rationale sufficiently. The 515F-806R region is 291bp long including the primers. Excluding the primers of roughly 20nt each, the expected read length is 251bp. This length is fine when using an Illumina 2x250bp sequencing, but when using the AVITI 2x300bp, this is too long of a read length, and hence the adapters or primers get sequenced on each 3' end of the forward and reverse reads. As suggested cutadapt can be used to remove the 3' end primer/adapter sequences, but knowing the product size, previous experience with Illumina datasets, and having sequenced quite a few libraries on both Illumina and AVITI parallel to test for consistency across sequencing methods, I made the suggestion of using (5, 220) not as a final fix, but as a simple troubleshooting step for them to report back so I can get a better understanding of how the sequences in the dataset look like. Given the Q40 AVITI reads, if the testing with these trimming parameters shows that the sequence discarding issue was non-recurring, it could be that the issue is with cutadapt. However I agree with you, and in the multiple Illumina datasets I've analyzed I've found (5, 249) and (5, 221) to be a generalizable starting point for forward and reverse trimming parameters for an average quality Illumina dataset (parameters determined by base at which median score drops below Q25). However, my experience with AVITI data from our sequencing center has been that it is consistently Q40 throughout the whole read, and does not have the drop in quality scores in the reverse reads as seen in Illumina. As a result, I've found no issues truncating the AVITI reads to the more shorter and stringent Illumina parameters as a means to skip using cutadapt, of course, informed by the expected amplicon size. This suggestion may not be valid for say, a V3-V4 sequencing datasets. |
All, |
@Avi-Til I can confirm that ---f 265 and ----r 232 significantly increased the percentage of sequences at the 'filtered' stage that trickled down to the 'non-chimeric' stage. I wonder if I should continue more stricter iteration to optimize it further. What is the gold standard for sequences retention? 50% and above for most samples? @mniku What criteria did you follow to settle for your final 'trunc' parameters? |
Since you are doing a 515F/806R amplicon, which means the expected sequence length is around 250. So any bases above 250-ish (give or take a few for variations) are basically adapters or primer sequences. So I would keep both the truncation parameters below 250bp value. I expect you will see a further massive improvement when you drop your forward truncation value from 265 to below 250. In my datasets, I was able to obtain an 85% pass-rate on average on a recent AVITI dataset (but this may vary by dataset). Normally for Illumina our lab chooses a truncation parameter where the median quality score drops to Q30 or Q25 (worst case). For AVITI, the quality is so high that it does not seem to be a concern. If you choose to follow the Q30 guideline, your truncation parameter for both forward and reverse reads will be 250, allowing you to use the full amplicon length. Instead, you can also rationalize that AVITI has improved sequencing standards, and that warrants a revisit of this threshold, and claim your new threshold is Q40. If you choose to do that, then the truncation parameters I would choose would be 250 for forward, and 230 for reverse. I don't expect there to be a massive difference between sticking to Q30 or revising the threshold to Q40. And in our lab we take a pragmatic approach, and mention the threshold we chose in our papers. The process of choosing a truncation parameter from the knowledge of the expected amplicon size just negates the need for cutadapt. Cutadapt is more useful for shotgun sequencing where the expected lengths may not be known. We did a bunch of parallel sequencing of the same library on both AVITI and Illumina and found that apart from the singleton species discovered by the ultra-deep sequencing of AVITI, the results were comparable, and choosing the truncation parameters from the knowledge of the amplicon size cut down on the DADA2 analysis time and memory usage, eliminated the need to use cutadapt. DADA2 seems to take the most amount of time in the chimera removal step if the truncation parameters are greater than 250 |
I'm processing our first Element AVITI 16S amplicon sequencing data, using dada2 in QIIME2. I'm wondering how to do this optimally, as it appears that it behaves a bit differently from MiSeq data in dada2:
The phred quality stays much higher than in MiSeq usually, but to my surprise, I still need to truncate the reads just as short to get similar % accepted in DADA2. For more details & actual statistics, see this thread in the QIIME2 forums.
I was first wondering if the AVITI phred scores are a tad optimistic, but got a potentially interesting comment from AVITI bioinformatician. They think the scoring should be quite comparable vs MiSeq BUT: ”There are differences in the distribution of q scores within a read, however, which would be relevant to DADA2 filtering. AVITI data will have greater q score variance within a read--in AVITI data you would be more likely to find a single low q score in the middle of a high quality read.”
Therefore, they recommend we try tweaking maxEE. Does this sound like a good idea? How should we evaluate/validate the results?
We have a huge number of reads to begin with so that’s no problem, and by truncating close to minimum required overlap we get comparable % of reads through vs. MiSeq. But it feels stupid to throw away high quality data just because I don’t completely understand what’s going on.
The QIIME2 guys recommended me to open an issue here, because this calls for a deep level understanding of dada2.
The text was updated successfully, but these errors were encountered: