Update README.md

KevinKuchinski · Dec 11, 2022 · 48f5afd · 48f5afd
1 parent f66de91
commit 48f5afd
Showing 1 changed file with 52 additions and 24 deletions.
diff --git a/README.md b/README.md
@@ -1,39 +1,67 @@
 # HopDropper
-Remove PCR artefacts from paired-end sequencing data
-## Background
-PCR is an integral technique is many genomic sequencing methods. Applications include: enriching target material in nucleic acid specimens, attaching barcode indices during library preparation, amplifying material captured by hybridization probes, and generating clusters on sequencer flow cells. These PCRs introduce artefacts into sequencing data that complicate downstream analysis. HopDropper concerns itself with the type of artefact called index hops.
+HopDropper is a bioinformatic tool that analyzes paired-end genomic sequencing data to: 1) identify distinct fragments of genomic material, and 2) generate pairs of consensus sequences that describe both ends of each fragment. This single-molecule level of resolution allows genomic fragments to be counted and individually characterized. As part of this analysis, HopDropper also identifies chimeric sequence artefacts generated during library construction and targeted enrichment. Removal of these artefacts improves confidence in the genomic fragments identified by HopDropper and reduces false detections.
 
-Index hops occur when chimeras form between molecules from different libraries, thus attaching barcode indices to material not derived from the specimen they represent. Unique dual indexing enables the identification of most index hops but cannot detect so-called “double hops”. These occur when chimeras form from templates that are themselves index hops, thereby reconstituting a valid pair of indices with exogenous library material inserted between them.
+HopDropper does the above using unique molecular indices (UMIs). UMIs are distinct nucleotide sequences that become associated with each end of each genomic fragment during library construction. These UMIs are introduced before amplification and target enrichment, so all subsequent molecular copies of the same template fragment will share the same pair of UMIs. When using paired-end sequencing chemistry, this allows all read pairs describing the same template fragment to be grouped together. These groups of read pairs can then be collapsed into a single pair of consensus sequences describing both ends of a single template fragment.
 
-Double hops are rare, but regularly cause false positives in applications attempting detection and characterization of rare, low-abundance target material. To make matters worse, the formation of index hops chimeras is often exacerbated by target enrichment methods commonly employed in these applications, e.g. hybridization probe capture where PCR amplification is performed on low-complexity material from pooled libraries.
+HopDropper UMIs are composed of an extrinsic portion (eUMI) and an intrinsic portion (iUMI). eUMIs are incorporated into specialized sequencing adapter oligomers that are ligated onto input genomic material during library construction. Some adapters contain random eUMI sequences while others contain one eUMI from a fixed repertoire. iUMIs are naturally created by the random fragmentation of genomic material which creates distinct sequence motifs at the ends of the fragments.
 
-HopDropper removes index hops by identifying all reads that could have been derived from the same template molecule. It ensures these reads are present in only one library, discarding reads from other libraries where they are less abundant. In cases where it is not clear which library is the true source of those reads, they are removed from all libraries. HopDropper is purposefully aggressive, subscribing to the principle that confidence in results is more important than retention of suspect data. In applications concerned with the detection and characterization of rare target material, HopDropper prefers false negatives to false positives.
+HopDropper also uses UMIs to discard chimeric molecules created during library construction and target enrichment. Each UMI should only co-occur with one other UMI (its so-called best mate). Infrequent co-occurrences between UMIs that are not each other’s best mates are removed. Similarly, HopDropper uses UMIs to discard index hops. Index hops are a special type of chimera wherein genomic material originating from one specimen acquires library barcodes from a different specimen. Each UMI should only appear in one library (its so-called best library). Infrequent appearances of a UMI in libraries other than its best library are removed by HopDropper.
 
-HopDropper identifies reads that could have been derived by the same template molecule using "intrinsic" UMIs. Intrinsic UMIs are generated by taking the first k nucleotides from a read sequence. All reads sharing the same intrinsic UMI could be copies of the same template molecule. This is premised on genomic sequences being inherently complex and input nucleic acids being randomly fragmented during library. Thus, HopDropper should not be used on amplicons!
+## HopDropper input
+HopDropper analysis is performed on a group of FASTQ files. This group contains the first and second read files (generated by paired-end sequencing) for one or multiple libraries. If multiple libraries are analyzed, they should form a group of libraries among which there was a risk of chimera formation and index hopping, e.g. libraries pooled for PCR or hybridization probe capture-based enrichment. 
 
-Another limitation of HopDropper is the possibility of “UMI collisions”, i.e. multiple template molecules sharing the same UMI through coincidence, e.g. a k-mer that repeats in the genome or two template molecules that fragment at the same position. k-mers repetition can be mitigated by choosing longer k-mers (the HopDropper default is 14-mers), but the frequency of collisions from repeated k-mers will ultimately depend on the organism being sequenced, the length of its genome, and the presence of repetitive elements.
+The FASTQ file pairs included in a HopDropper analysis are indicated in a TSV input file. This file associates each FASTQ file pair with a library name. The input TSV file should not have a header line, and each line must be ordered as follows:
 
-Collisions from coincidental fragmentation at the same position will depend on the length of genome as well, with shorter genomes more likely to produce collisions. Coincidental fragmentation will also become more likely as the number of genome copies in the specimens increases. On the one hand, a silver lining of rare target detection is that genome copies are necessarily limited. On the other hand, throughput and cost considerations may encourage extensive library multiplexing, thereby increasing genome copies.
+```library_name R1_file_name    R2_file_name```
 
-Due to the aggressive nature of the HopDropper algorithm, collisions will typically result in some amount of lost data. If one of the colliding UMIs is significantly less abundant, its reads will be removed from its library while reads from the dominant colliding UMI will be retained in the other library. Alternatively, if no colliding UMI is significantly more abundant than the others, all reads will be removed from all libraries. Once again, HopDropper prefers false negatives to false positives.
+FASTQ file pairs must be unzipped and located in the same directory as the TSV input file.
 
-Finally, intrinsic UMI collisions can be substantially mitigated using library adapters that introduce exogenous UMIs. These introduce an additional coincidence that must occur alongside k-mer repeats or fragmentation of multiple templates at the same position. If inline UMI adapters are used, the exogenous UMI sequence is already prepended to the read sequence and its intrinsic UMI. Simply run HopDropper with a longer UMI length parameter (length of exogenous UMI + length of intrinsic UMI). If adapter UMIs are instead located in barcode index regions, they must be prepended to read sequences before running HopDropper.
+## HopDropper Algorithm
+HopDropper performs the following sequence of operations on the provided FASTQ files:
+1.	HopDropper extracts a UMI from each read. The user specifies the length of eUMI present in the adapter oligo and the length of the desired iUMI. HopDropper analyzes the first <i>n</i> base quality scores of each read, where <i>n</i> is the combined eUMI and iUMI length. If all of those base quality scores exceed a minimum value (default is PHRED score of 30), the first <i>n</i> bases of the read sequence are extracted as its UMI.
 
-## The HopDropper Algorithm
-1.	Enumerate UMIs from each read. This is done by taking the first k letters of the read sequence. It requires that all base calls inside the UMI have a minimum PHRED quality score, otherwise the read and its mate are discarded.
-2.	For each UMI, count the number of times it occurs across all libraries. If it does not exceed a certain threshold, discard all reads containing that UMI and their mates. By default, HopDropper removes UMI singletons.
-3.	For each UMI, count the number of times it appears in every library separately. Identify the library where it occurs the most times (the top library) and the library where it occurs the next most times (the second library). Determine if the number of occurrences in the top library is significantly greater than the number of occurrences in the second library. If it is significantly greater, assign the top library as that UMI’s best library. If it is not significantly greater, assign no library as that UMI’s best library.
-HopDropper determines significance by whether the number of occurrences in the top library is at least twice the number of occurrences in the second library. This is premised on the doubling nature of PCR amplification and the fact that amplification of PCR artefacts will necessarily trail the templates on which they are based by at least one amplification cycle. This principle has been previously used in other chimera detection tools, including VSEARCH.
-4.	Iterate through each read pair in every library. Extract UMIs from both reads. If either UMI has no best library assigned to it, discard the read pair. If the library in which the read pair is located is not the best library for both UMIs in the pair, discard the read pair.
+2.	HopDropper assigns a best library to each UMI. A UMI’s best library is the library in which it occurs most frequently. If a UMI occurs in more than one library, the number of occurrences in its most frequent library must be significantly greater than the number of occurrences in its second most frequent library. Significance is determined by converting frequency counts into proportions then comparing the proportions of the top two libraries with the two proportion Z-test (default alpha level is 5%). If a UMI occurs in only one library, the two proportion Z-test compares the proportions 1.0 and 0.0, and the significance relies solely on the sample size, i.e. the number of times the UMI was observed. If a UMI does not have a clear best library after this analysis, it is assigned no best library and read pairs containing this UMI will be discarded when valid reads are outputted. 
+
+3.	HopDropper assigns a best UMI mate to each UMI. A UMI’s best mate is the other UMI with which it co-occurs in read pairs the most frequently. If a UMI co-occurs with more than one other UMI, the number of co-occurrences with its most frequent mate must be significantly greater than the number of co-occurrences with its second most frequent mate. Significance is determined by converting frequency counts into proportions then comparing the proportions of the top two UMI mates with the two proportion Z-test (default alpha level is 5%). If a UMI co-occurs with only one other UMI, the two proportion Z-test compares the proportions 1.0 and 0.0, and the significance relies solely on the sample size, i.e. the number of times the UMI was observed. If a UMI does not have a clear best mate after this analysis, it is assigned no best mate and read pairs containing this UMI will be discarded when valid reads are outputted.
+
+4.	HopDropper writes valid read pairs to new FASTQ files. For each read pair, HopDropper considers the UMIs for both reads. The pair is only valid if both UMIs have the same best library and that library is the one to which the read pair was assigned. The read pair is also valid only if both UMIs are each other’s best UMI mate. Finally, the read pair is only valid if its UMI pair has been observed a minimum number of times (default is 30 occurrences).
+
+5.	HopDropper creates pairs of consensus sequences representing both ends of each distinct input genomic molecule. All read pairs in a library are grouped based on their UMI pair. Each group of read pairs is randomly sub-sampled for faster computation (default is up to 200 read pairs). Low quality bases in these reads are masked with ‘N’ ambiguity characters (default threshold is a PHRED score of 30), then a multiple sequence alignment is conducted for each read in the pair. These multiple sequence alignments are used to generate consensus sequences for both reads in the pair, i.e. both ends of the input genomic molecule described by that group of read pairs. If a position in the multiple sequence alignment is not described by a sufficient number of non-ambiguous bases (default is 20 non-ambiguous bases), then it is masked with an ‘N’ ambiguity character. If the non-ambiguous bases describing a position in the multiple sequence alignment do not converge on a single nucleotide (default consensus threshold is 0.75), then it is masked with an ‘N’ ambiguity character. Consensus sequences are truncated after the first appearance of a run of ‘N’ ambiguity character (default run length is 5).
+
+6.	HopDropper writes each pair of consensus sequences to a FASTA file. Headers are formatted as follows: ```>experiment_name|library_name|fragment_number|fragment_end|fragment_copies|fragment_UMI_pair``` 
+
+Experiment_name is a name given to a HopDropper analysis at runtime. Library_name is the name of the library in which the fragment was detected.      Fragment_number is a sequential identifier for all fragments from the same library. The combination of library_name and fragment_number is unique for each fragment within experiment_name. Fragment_copies is the number of read pairs describing the fragment, i.e. the number of PCR copies of the input genomic molecule that were sequenced. Fragment_UMI_pair is the UMI pair describing the fragment. This is useful for identifying the same fragment across different experiments (when the same fragment may be assigned a different fragment_number within its library).
+
+## Important Notes
+A.	Do not trim bases from the beginning of reads! HopDropper must consider those bases as-sequenced for UMI assignment.
+
+B.	UMIs are not always unique. Sequencing adapter oligomers with fixed eUMIs are obviously limited in number. Sequencing adapter oligomers with random eUMIs have a limited number of permutations. iUMIs are limited by the number of positions in the genome where fragmentation can occur, and repetitive genome structures or simple chance may result in redundant sequence motifs at certain fragmentation locations. Taken together, this means that some input fragments may share UMIs with other input fragments on one or both of their ends by chance. This coincidental sharing of UMIs is called a UMI collision.
+The rate of UMI collisions depends on the diversity of eUMIs available on adapter molecules. It also depends on the diversity of iUMIs residing within the genomes of the organisms in the specimen which, in turn, depends on total genome length and the presence and frequency of repetitive structural elements. The UMI collision rate also depends on the number of input molecules; fewer input molecules require fewer UMIs, so there is a lower probability of collision. UMI collision rates are easily underestimated, even when the diversity of eUMIs and iUMIs seems extensive (the same cognitive bias behind the classic “birthday problem” from statistics and probability theory applies here).
+Unfortunately, UMI collisions are alike in appearance to chimeras and index hops. HopDropper does not attempt to distinguish them. If a UMI collision occurs and one of the input fragments containing that UMI is significantly more abundant, the other input fragment will be discarded. If a UMI collision occurs and none of the input fragments containing that UMI are significantly more abundant than the rest, all of these input fragments will be discarded. This means that conditions favouring UMI collisions will increase data attrition by HopDropper, potentially resulting in false negatives. Users should understand that HopDropper’s priority is to remove false positives.
+
+C.	HopDropper expects the eUMIs to be in-line with the input molecule, i.e. directly adjacent to the end of the input molecule. If the eUMI’s position within the sequencing adapter oligomer does not place it directly adjacent to the input molecule upon ligation, eUMI sequences and their corresponding base quality scores must be prepended to read entries in the FASTQ files.
+
+D.	Users may provide a list of fixed eUMIs if there is a specific repertoire of eUMI sequences in the sequencing adapter oligomers they use. If this list is provided, HopDropper will reject any read whose eUMI does not appear in the list. 
+
+E.	Users may choose not to use an eUMI, relying solely on iUMIs. This requires a library construction method where input genomic material is randomly fragmented before any adapters are added and before any amplification occurs. Thus, amplicon enrichment is not conducive for analysis by HopDropper using iUMIs alone.
+
+F.	Users may choose not to use an iUMI, relying solely on eUMIs in library adapter oligos. If there are a limited number of eUMIs provided, this may increase the rate of UMI collisions and, consequently, data attrition.
 
 ## HopDropper usage
 ```
-$ python hop_dropper.py -i <input file> -o <output name> -l <UMI length> -q <UMI quality> -m <minimum UMI count>
+$ python hop_dropper.py -i <input file> -o <output dir> -I <iUMI length> -E <eUMI length> [-q <min UMI qual> -f <fixed UMIs file> -m <min UMI count> -Q <min read qual> -s <subsample size> -d <min read depth> -c <consensus threshold> -N <max consecutive Ns>]
 ```
 <b>Arguments:</b>
-    -i : path to input file (described below)
-    -o : name to append to output files
-    -l : the number of bases to use as UMI from each read
-    -q : the minimum PHRED score of each base in the UMI required for the read to be considred
-    -m : the minimum number of times a UMI must appear to be considered
-The HopDropper input file is a TXT file listing all FASTQ files to be analyzed. Each library must be provided as a single interleaved FASTQ file.
+
+    -i : path to TSV input file
+    -o : output directory
+    -I : iUMI length
+    -E : eUMI length
+    -q : minimum PHRED quality for bases in UMI (default = 30)
+    -f : path to TXT file listing fixed eUMIs (one eUMI per line)
+    -m : minimum occurences of UMI pair for retention (UMI pairs that do not appear at least this many times are not outputted, default = 30)
+    -Q : minimum PHRED quality of bases in reads (bases below this quality are masked with Ns while generating fragment end consensus sequences, default = 30)
+    -s : sub-sample size (the maximum number of read pairs used for generating fragment end consensus sequences, default = 200)
+    -d : minimum read depth (the minimum number of reads with non-ambiguous nucleotides required for generating fragment end consensus sequences, default = 20)
+    -c : consensus threshold (the proportion of reads that must contain the same nucleotide when generating fragment end consensus sequences, default = 0.75)
+    -N : maximum consecutive Ns (reads are trimmed after this number of consecutive Ns, default = 5)