3. Sequencing
This part of the Reg-Seq protocol outlines the experimental steps relevant to the "mapping" and "barcode sequencing" of mutant DNA libraries. The term "mapping", in this context, refers to the use of Illumina sequencing to "map" unique DNA barcodes to their corresponding mutant promoter sequence. In other words, sequencing enables one to build a codex that links a barcode to a mutated regulatory region of DNA. This protocol begins with an overview of Illumina sequencing, which may be useful for new users. We then outline the steps involved in "mapping" libraries (which only requires that DNA is sequenced) and, finally, outline the steps involved in the barcode sequencing of both DNA and cDNA.
Illumina machines rely on a technology called "sequencing by synthesis". Please watch this video to learn more. Caltech has a core sequencing facility, called the Millard and Muriel Jacobs Genetics and Genomics Laboratory that provides excellent resources and technical expertise relating to Illumina and other sequencing platforms. In the past, many of our sequencing runs have been performed by the staff members of this core facility.
The principles of Illumina sequencing are somewhat straightforward. The "central" component of an Illumina sequencing run is the flow cell, a tiny chip that is coated with hybridized strands of DNA. When you load your DNA library into an Illumina machine, each end of your DNA library must contain an adaptor sequence, a stretch of DNA that recognizes and anneals with these strands of DNA, thus tethering your DNA library to the flow cell. If you load in too much of your DNA library, you will saturate the system and will receive low-quality DNA results. Conversely, if you load in too little of your DNA library, you will not make full use of the flow cell, and will receive a low number of total reads.
Illumina machines sequence many millions of DNA polymers in a single reaction. In most cases, these runs can be "multiplexed", meaning that multiple users can load their samples into the same machine and "demultiplex" the data later.
When multiple users load samples into the same Illumina machine, it is necessary for each user to prepare their DNA sequences with unique index sequences. Index sequences are added to your DNA samples via PCR, usually after initial amplification of the target region. The use of unique index sequences (which can be added to both the 5' and 3' ends of the DNA library) allows you to mix many samples together (up to 96 combinations of index sequences) and sequence them at the same time. Following sequencing, for example on an Illumina MiSeq, the software is able to identify these indexes on each sequence read and demultiplex data based on these indexes.
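As a rough illustration of what demultiplexing does, the sketch below bins reads by the index sequence attached to each one. The index sequences, sample names, and reads are hypothetical placeholders, and in practice this step is performed by the sequencer's own software rather than by hand.

```python
from collections import defaultdict

def demultiplex(reads, index_to_sample):
    """Group (index, sequence) pairs by the sample that owns each index."""
    bins = defaultdict(list)
    for index, sequence in reads:
        # Reads whose index is not recognised go to a catch-all bin.
        sample = index_to_sample.get(index, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)

# Hypothetical indexes and reads, for illustration only.
index_to_sample = {"ATCACG": "sample_A", "CGATGT": "sample_B"}
reads = [
    ("ATCACG", "ACGT"),
    ("CGATGT", "TTGA"),
    ("ATCACG", "GGCC"),
    ("NNNNNN", "AAAA"),
]
demuxed = demultiplex(reads, index_to_sample)
```

Each sample ends up with only its own reads, and any read carrying an unknown index is set aside instead of being assigned to the wrong user.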
A sequencing run that is "single index" means that only a single index sequence is present, while "dual index" indicates that an index sequence has been added to both sides of a DNA library. See this documentation for more information on index sequences, and how they fit into the total Illumina sequencing pipeline.
A final note on Illumina sequencing: one can perform either a "single-end" or a "paired-end" sequencing experiment. Single-end means that sequencing is only performed on one end of the DNA library; an experiment of this type will sequence ~150 bases from one end, and will also sequence one index. A paired-end experiment sequences both ends of the DNA, and can be either single- or dual-index.
Note: Sequencing can be done on plasmid-based libraries (conventional Reg-Seq) or on genome-integrated libraries in largely the same way.
Note: We have only performed Reg-Seq on DNA libraries expressed from plasmid; any notes on sequencing of genome-integrated libraries must be tested.
The major difference between the two is in the primers used: for plasmid-based DNA libraries, we use standard Illumina primers (e.g. primers that are already present in the reagent cartridge when you start an Illumina run), whereas, if you were to use pLibAcceptorV2 and genome-integrated libraries, the protocol would require custom primers, which must be manually "spiked" in to the Illumina machine. We provide detailed information on these primer sequences and the steps necessary to spike in custom primers at the very end of this protocol.
The first part of the sequencing protocol is to perform the "mapping run", a process which we outline, broadly, in Fig. 1, below.
Figure 1: A schematic outlining a number of experimental steps, leading up to the "mapping" sequencing run. E. coli cells expressing a mutagenized promoter, harbored on a low-copy plasmid, are grown in M9 media to a specific OD600. The DNA library is then isolated, purified, and prepared for sequencing. It is loaded onto an Illumina flow cell, and paired-end sequencing is performed. The resulting `.fastq` files are then analyzed to link each unique barcode to its corresponding promoter sequence.
In the "mapping" sequencing, the objective is to link each barcode sequence to its corresponding mutated promoter. Starting with a library of E. coli cells, each harboring a different, mutagenized plasmid with the promoter sequence of the gene / operon to be studied, we grow this cellular library to an OD600 of 0.3 in M9 media with 0.5% glucose (more on all of these experimental steps in the text that follows). Then, we isolate this DNA and perform a PCR reaction to add the index and adaptor sequences which are necessary for Illumina sequencing. Finally, we perform the sequencing on an Illumina machine and analyze the resulting data (which is exported in `.fastq` format). We now go through the specific details of this "mapping".
Recall from Part 2 of this protocol that E. coli cells are grown in M9 + 0.5% glucose media in preparation for the "mapping" run. Once the DNA library is isolated, purified, and stored away, it is time to amplify it and add the necessary adaptor and index sequences. The mapping run is the sequencing experiment that links each barcode to its unique, mutagenized promoter.
Add the adaptor and index sequences using PCR with the following settings. Set up each PCR reaction in triplicate, as this will ensure that you obtain enough DNA.
reagent | concentration | volume (μL) |
---|---|---|
DNA library | 10ng/μL | 1 |
Q5 polymerase mix | 2x stock | 25 |
Primer Fwd | 10 μM | 2.5 |
Primer Rev | 10 μM | 2.5 |
Water | N/A | 19 |
Primers:
-fwd: 5'-AATGATACGGCGACCACCGAGATCT ACACTCTTTCCCTACACG ACGCTCTTCCGATCT CAAA TTCGTCTTCACCTCGAGCAC-3'
-rev: 5'-AAGCAGAAGACGGCATACGAGATCGGT CTCGGCATTCCTGCT GAACCGCTCTTCCGATCT CACC GCAGGGGATAATATTGCCCA-3'
These primers add the adaptors, index sequence, and everything else required for Illumina sequencing. Mix these reagents carefully, while keeping everything on ice throughout the experiment.
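When setting up the triplicate reactions, it can be convenient to prepare one master mix. The sketch below scales the per-reaction volumes from the table above; the 10% pipetting overage is our assumption and not part of the original protocol, so adjust or remove it as you see fit.

```python
# Per-reaction volumes (in microliters) taken from the reagent table above.
PER_REACTION_UL = {
    "DNA library (10 ng/uL)": 1.0,
    "Q5 polymerase mix (2x)": 25.0,
    "Primer Fwd (10 uM)": 2.5,
    "Primer Rev (10 uM)": 2.5,
    "Water": 19.0,
}

def master_mix(n_reactions, overage=0.1):
    """Scale each reagent to n_reactions, plus a fractional overage (assumed 10%)."""
    factor = n_reactions * (1 + overage)
    return {name: round(vol * factor, 2) for name, vol in PER_REACTION_UL.items()}

reaction_volume = sum(PER_REACTION_UL.values())  # total volume of one reaction
triplicate_mix = master_mix(3)

for reagent, volume in triplicate_mix.items():
    print(f"{reagent}: {volume} uL")
```

The per-reaction volumes sum to 50 μL, which matches the reaction volume referred to in the gel extraction step.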
Use the following thermocycler settings with 20 cycles.
cycles | temperature | time |
---|---|---|
1 | 98°C | 30 seconds |
20 | 98°C | 10 seconds |
20 | 66°C (anneal) | 30 seconds |
20 | 72°C (extend) | 30 seconds |
1 | 72°C | 120 seconds |
Hold | 4°C | ∞ |
For more information on the thermocycler conditions for Q5 polymerase, see the NEB website.
Perform a gel extraction by adding 10 μL of 6x NEB DNA dye to each 50 μL PCR reaction. Load the full volumes on a thick, 2% agarose gel (each sample in its own well). Perform electrophoresis for 45 minutes at 120V. Use a scalpel to remove the DNA band corresponding to the amplified oligo libraries. Perform a gel extraction using one of many commercially-available kits. We have previously obtained good results with the Zymoclean Gel DNA Recovery Kit. Only extract that band which corresponds to the expected size of the amplified PCR product.
After performing the gel extraction according to the manufacturer's protocol, NanoDrop the eluted DNA and record the concentration and purity.
For the "mapping" sequencing, we have previously submitted our prepared DNA libraries directly to NGX Bio, a company in San Francisco, California. This company provides Next-Generation Sequencing (NGS) as a service.
For the relevant details to submit a sample for sequencing to NGX Bio, consult the Excel file on this webpage and click on the button that says "Sample submission form/guidelines". This Excel file specifies the concentration and amount of DNA to submit.
NGX Bio uses a Bioanalyzer instrument to test the length distribution of DNA in the sample that you submit. It is normal to find some DNA sequences with an incorrect length in the sample, which arise due to both synthesis errors in the original DNA and incorrect amplification. For high-quality sequencing results, however, getting as "sharp of a peak" (namely, having most of the DNA sequences be precisely the expected length) as possible is very important.
NGX Bio typically returns sequencing data in 3 weeks. We use the paired-end sequencing service, with 150 cycles for Read 1 and Read 2 on a Hi-Seq 2500 machine. 250 million total reads for mapping a library is more than sufficient.
Note that the Caltech sequencing center also provides this service, but the turnaround is much longer (typically ~8 weeks).
When sequencing data is exported from the Illumina machine, it is in FASTQ format, a text-based sequencing data file format that stores both raw sequence data and quality scores. Since the "mapping" run uses paired-end sequencing, in which sequencing is performed at both ends of the DNA library, we must connect each end, thus creating a single, long DNA sequence from each paired-end read. This process is called "joining", and can be performed with an open-source software called the "FLASH" tool. FLASH stands for "Fast Length Adjustment of SHort reads". The FLASH software is hosted by Johns Hopkins and a manual on FLASH can be found here. A research paper on the FLASH software can be found here.
The first step in using FLASH is to download the open-source software. FLASH can be downloaded from sourceforge. On the sourceforge webpage, click on the large, green button that says "Download Latest Version". To install FLASH, follow this documentation from the developers:
"On UNIX-compatible systems, including GNU/Linux and Mac OS X, you must compile FLASH from source. The only dependency, other than functions that are expected to be available in the C library, is the zlib data compression library. To install FLASH, download the tarball, untar it, and compile the code using the provided Makefile:
$ tar xzf FLASH-1.2.11.tar.gz
$ cd FLASH-1.2.11
$ make
The executable file that is produced is named 'flash'. To run it from the command line you must copy it to a location on your $PATH variable, or else run it with a path including a directory, such as "./flash".
FLASH also runs on Windows, and you can compile it on Windows using MinGW. However, for convenience you may instead download a standalone Windows binary from the SourceForge page (https://sourceforge.net/projects/flashpage/)."
After installing FLASH, you will be able to run it from your command line. Open up a terminal; the general usage of the flash command, with optional arguments shown in brackets, is:
./flash read1.fastq read2.fastq
[-m minOverlap] [-M maxOverlap] [-x mismatchRatio]
[-p phredOffset] [-o prefixOfOutputFiles] [-d pathToDirectoryForOutputFiles]
[-f averageFragmentLength] [-s standardDeviationOfFragments] [-r averageReadLength]
[-h displayHelp]
read1.fastq and read2.fastq are the FASTQ files of paired-end reads from the short fragment library. In the usage line above, the "flash" command joins the two FASTQ files from the Illumina machine. The arguments in brackets (e.g. -m, -x, and so forth) are optional and modify how the command runs. With these options you can, for example, specify the minimum overlap between reads: if you sequence 150 nucleotides from one end and 150 nucleotides from the other, and the total length of your library is 250, there will be ~50 bases of overlap between the reads. The full explanation of these options can be found in the FLASH Manual.
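The overlap arithmetic, and the idea behind joining, can be sketched in a few lines. This is a deliberately naive illustration that requires an exact match in the overlapping region; FLASH itself tolerates mismatches and scores candidate overlaps, so treat the function names and the toy fragment below as hypothetical.

```python
def expected_overlap(read1_len, read2_len, fragment_len):
    """Number of bases covered by both reads of a pair."""
    return max(0, read1_len + read2_len - fragment_len)

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def join_pair(read1, read2, min_overlap=10):
    """Merge read1 with the reverse complement of read2 (exact matches only)."""
    rc2 = reverse_complement(read2)
    # Try the longest possible overlap first, then shrink.
    for k in range(min(len(read1), len(rc2)), min_overlap - 1, -1):
        if read1[-k:] == rc2[:k]:
            return read1 + rc2[k:]
    return None  # no sufficiently long overlap was found

overlap = expected_overlap(150, 150, 250)

# Toy 26-base fragment sequenced with 18-base reads from each end.
fragment = "ACGTTGCAAGGCTTAACCGGTATCGA"
read1 = fragment[:18]
read2 = reverse_complement(fragment[-18:])
joined = join_pair(read1, read2)
```

For the example in the text (150 + 150 bases read from a 250-base library) the function gives 50 bases of overlap, and the toy pair joins back into the original fragment.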
After joining the reads, we next want to perform quality filtering. When an Illumina machine's camera takes images of DNA clusters, it sometimes has a hard time resolving two clusters (especially if the flow cell was loaded with too high of a DNA concentration). At the end of any Illumina run, a composite Q-score (Quality Score) is produced. An exceptional score is 30+, but any score above 20 is acceptable.
We remove low-quality reads by performing quality filtering with FastX, another open-source software. This software is managed by the Hannon lab at CSHL. For instructions on downloading and installing FastX, see this webpage.
Once FastX is installed, we perform quality filtering using the command line again and execute the following commands:
cat file_to_be_filtered.fastq | ./fastq_quality_filter -Q33 -o output_file_name.fastq
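In this command, `file_to_be_filtered.fastq` is -- you got it -- the fastq file that is to be filtered. The FastX command for quality filtering is `fastq_quality_filter`. The -Q33 flag tells FastX that the quality scores are encoded with an ASCII offset of 33 (Phred+33, the encoding used by current Illumina software); it does not set the filtering threshold. The threshold itself is controlled by the -q (minimum quality score) and -p (minimum percentage of bases that must reach that score) options, which are described in the FastX documentation. The output file name is given after -o.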
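To make the filtering step less of a black box, here is a minimal sketch of what a quality filter does with each FASTQ record. It assumes Phred+33 encoded quality strings, and the cutoffs (a minimum score of 20 in at least 90% of bases) are illustrative values we chose, not settings taken from this protocol.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def passes_filter(quality_string, min_quality=20, min_percent=90):
    """True if at least min_percent of bases have a score >= min_quality."""
    scores = phred_scores(quality_string)
    good = sum(score >= min_quality for score in scores)
    return 100 * good >= min_percent * len(scores)

def filter_fastq(lines, min_quality=20, min_percent=90):
    """Keep the 4-line FASTQ records whose quality line passes the filter."""
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    return [rec for rec in records if passes_filter(rec[3], min_quality, min_percent)]

# Two toy records: the second has two bases with a Phred score of 0.
fastq_lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "!!II",
]
kept = filter_fastq(fastq_lines)
```

Only the first record survives, because half of the bases in the second record fall below the cutoff.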
After joining the paired-end reads with FLASH and performing quality filtering using FastX, we can next perform the actual "mapping", whereby barcodes are joined to their corresponding mutagenized promoters. To perform the mapping step, we can use the `create_key` module from `regseq`. To see how the module is used, you can look at the notebook `3_1_create_key.ipynb`. Make sure that you followed the instructions on how to set up the right python environment, which you can find at the Home tab of this wiki.
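Conceptually, the mapping step builds a lookup table from barcode to promoter. The sketch below is not the `regseq` implementation; it is a simplified illustration that assumes each joined read starts with the promoter and ends with the barcode, and the lengths, the `min_reads` cutoff, and the toy reads are all hypothetical.

```python
from collections import Counter, defaultdict

def build_barcode_key(joined_reads, promoter_len, barcode_len=20, min_reads=2):
    """Map each barcode to the promoter it is most often observed with."""
    observations = defaultdict(Counter)
    for read in joined_reads:
        if len(read) < promoter_len + barcode_len:
            continue  # read is too short to contain both regions
        promoter = read[:promoter_len]   # assumed layout: promoter first
        barcode = read[-barcode_len:]    # assumed layout: barcode last
        observations[barcode][promoter] += 1

    key = {}
    for barcode, promoters in observations.items():
        promoter, count = promoters.most_common(1)[0]
        if count >= min_reads:  # drop barcodes supported by too few reads
            key[barcode] = promoter
    return key

# Toy reads with a 4-base "promoter" and a 3-base "barcode".
toy_reads = ["ACGTGGTTT", "ACGTGGTTT", "ACGAGGTTT", "CCCCGGAAA"]
barcode_key = build_barcode_key(toy_reads, promoter_len=4, barcode_len=3)
```

Requiring several reads per barcode is one simple way to keep sequencing errors from creating spurious barcode-promoter pairs.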
Reg-Seq is a powerful method, in part, because it allows one to decipher the regulatory mechanism for different promoters under variable growth conditions. Some TFs are only active in anaerobic conditions, for example, or in low glucose environments. Therefore, it is essential that one grows cell libraries in multiple environmental conditions to fully delineate why certain genes are regulated.
Cell libraries should first be grown to saturation in LB and diluted 1:10,000 into the appropriate growth media for the promoter under consideration, and grown to an optical density (OD600) of 0.3 before harvesting RNA and DNA for sequencing.
In the original Reg-Seq study, multiple growth conditions were tested, including differing carbon sources, such as growth in M9 with 0.5% Glucose, M9 with acetate (0.5%), M9 with arabinose (0.5%), M9 with Xylose (0.5%) and arabinose (0.5%), M9 with succinate (0.5%), M9 with fumarate (0.5%), M9 with Trehalose (0.5%), and LB. We also used several stress conditions such as heat shock, where cells were grown in M9 and were subjected to a heat shock of 42 °C for 5 minutes before harvesting RNA. We also grew cells in low-oxygen conditions. Cells were grown in LB in a container with minimal oxygen, although some oxygen will be present as no anaerobic chamber was used. This level of oxygen stress was still sufficient to activate FNR binding, and so activated the anaerobic metabolism. We also grew cells in M9 with Glucose and 5mM sodium salicylate. Growth with zinc was performed at a concentration of 5mM ZnCl2 and growth with iron was performed by first growing cells to an OD of 0.3 and then adding iron chloride to a concentration of 5mM and harvesting RNA after 10 minutes. Growth without cAMP was accomplished by the use of the JK10 strain which does not maintain its cAMP levels. These growth conditions were chosen so as to span a wide range of growth rates, as well as to illuminate any carbon source specific regulators.
Once cells are grown to OD 0.3, DNA and RNA must be isolated. The protocol for isolating DNA will differ, depending on whether you are using plasmid-based expression or have genome-integrated the library. In the former case, DNA can be isolated using very straightforward, off-the-shelf kits. We recommend Zymo DNA Clean&Concentrate-5 kits, as these consistently give the cleanest DNA isolations. Clean all PCR reactions with a Zymo DNA Clean and Concentrator kit, and verify libraries using an Agilent Tapestation or other method for checking the specific size of PCR amplicons prior to sequencing.
When isolating RNA using the pJK14 method, RNA should first be stabilized using Qiagen RNA Protect. Lysis of cells can be performed using lysozyme and RNA isolated using the Qiagen RNA Mini Kit. Reverse transcription should be performed using Superscript IV and a specific primer for the labeled mRNA with the sequence: 5' - GCAGGGGATAATATTGCCCA - 3'
qPCR should be performed to check the level of DNA contamination on the isolated cDNA. One creative method to multiplex the number of growth conditions that can be sequenced at once is to add short, 4-nt barcodes to DNA and RNA isolated from each growth condition during the PCR amplification steps that add the necessary indices and adaptors for Illumina sequencing. Examples of primers which add a unique, 4nt barcode and the necessary Illumina adaptors for this purpose are shown below. Short, 4-nt barcodes to indicate each growth condition are shown in bold. If you have multiple growth conditions, Forward and Reverse primers should be ordered for both cDNA and DNA, and unique, 4-nt barcodes designed. Note that barcodes are introduced with the forward primers for cDNA and DNA.
cDNA:
-fwd: AATGATACGGCGACCACCGAGATCT ACACTCTTTCCCTACACG ACGCTCTTCCGATCT **TCTA** TATTAGGCTTCTCCTCAGCG
-rev: AAGCAGAAGACGGCATACGAGATCGGT CTCGGCATTCCTGCT GAACCGCTCTTCCGATCT GACC GCAGGGGATAATATTGCCCA
DNA:
-fwd: AATGATACGGCGACCACCGAGATCT ACACTCTTTCCCTACACG ACGCTCTTCCGATCT **ATGC** TATTAGGCTTCTCCTCAGCG
-rev: AAGCAGAAGACGGCATACGAGATCGGT CTCGGCATTCCTGCT GAACCGCTCTTCCGATCT GACC GCAGGGGATAATATTGCCCA
When running qPCR on cDNA samples, the objective is to show that the sample crosses the amplification threshold many cycles before a no-RT control (a sample prepared without reverse transcriptase, so that any signal comes from contaminating DNA). Since one does not know the concentration of cDNA after preparing the sample, a good "rule of thumb" for checking cDNA quality via qPCR is to use 4 microliters of cDNA as a template. When performing the qPCR, any sample that has 5 or fewer amplification cycles of difference between the sample and the no-RT control should be discarded. Additionally, prior to sequencing, all amplified samples should be analyzed on a BioAnalyzer or Agilent Tapestation.
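The discard rule above amounts to a comparison of threshold cycles (Ct values). A minimal sketch follows; the function name and the example Ct values are hypothetical.

```python
def passes_no_rt_check(ct_sample, ct_no_rt, min_delta=5):
    """True if the sample crosses threshold MORE than min_delta cycles before the no-RT control."""
    return (ct_no_rt - ct_sample) > min_delta

# Hypothetical Ct values for three cDNA preparations and their no-RT controls.
results = {
    "prep_1": passes_no_rt_check(ct_sample=15.0, ct_no_rt=25.0),  # 10 cycles apart
    "prep_2": passes_no_rt_check(ct_sample=20.0, ct_no_rt=25.0),  # exactly 5 cycles
    "prep_3": passes_no_rt_check(ct_sample=21.5, ct_no_rt=24.0),  # 2.5 cycles
}
```

Assuming the product roughly doubles every cycle, a gap of more than 5 cycles means the contaminating DNA is more than 2^5 = 32-fold less abundant than the cDNA signal.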
Note that this section has not been performed in previous iterations of Reg-Seq. The protocol for isolating genome-integrated DNA is different. gDNA should be extracted from 5 mL cell library pellets using a Qiagen Gentra Puregene kit (#158567). All sequencing runs with pLibAcceptorV2 also require that barcoded DNA be amplified from 1 microgram of gDNA via PCR for 14 cycles using primers GU59 and GU60. This reaction should then be cleaned using a Zymo Research DNA Clean and Concentrator kit. To add sequencing adapters and indices to the library, 1 ng of this reaction should be used as template DNA for a second PCR for 8 cycles using primers GU70 and either GU63 or GU64. Clean all PCR reactions with a Zymo DNA Clean and Concentrator kit, and verify libraries using an Agilent Tapestation or other method for checking the specific size of PCR amplicons prior to sequencing.
RNA can also be extracted from 50 mL library pellets using a Qiagen RNEasy Midi kit (#75142). Use 45 micrograms of each extract and concentrate it using a Qiagen Minelute Cleanup Kit (#74204). Barcoded cDNA should then be generated from 25 micrograms of each concentrated RNA extract using Thermo Fisher SuperScript IV (#18090010) primed with GU101. The manufacturer’s protocol should be followed, aside from extending the reaction time to 1 hour at 52 °C. The cDNA reaction can then be cleaned using a Zymo Research DNA Clean and Concentrator kit (#D40140) before amplification. Barcoded cDNA should be amplified via PCR for 13 cycles using primers GU59 and GU102. This reaction must be cleaned using a Zymo Research DNA Clean and Concentrator Kit, and 1 ng of this reaction used in a second PCR for indexing and addition of flow cell adapters. The second PCR is typically 8 cycles (though the precise number of cycles should be checked via qPCR -- again, you can use 4 microliters of cDNA as the template in the qPCR experiment) and utilizes primers GU102 and either GU61 or GU62. Every primer name listed here, and its full sequence from 5' - 3', can be found in the list of Phillips lab primers by clicking on this sentence. These primer sequences are also listed below.
GU59: 5' - CATGTTGTCCACTCCAATCGGTGATGGTCCTG - 3'
GU60: 5' - GTAATAGCTAAATCCCACCCGATGCCTGCAGG - 3'
GU61: 5' - CAAGCAGAAGACGGCATACGAGAT ACTGTG CATGTTGTCCACTCCAATCG - 3'
GU62: 5' - CAAGCAGAAGACGGCATACGAGAT AGCCAT CATGTTGTCCACTCCAATCG - 3'
GU63: 5' - CAAGCAGAAGACGGCATACGAGAT ATCTCG CATGTTGTCCACTCCAATCG - 3'
GU64: 5' - CAAGCAGAAGACGGCATACGAGAT CAGTGT CATGTTGTCCACTCCAATCG - 3'
GU70: 5' - AATGATACGGCGACCACCGAGATCTACACGTAATAGCTAAATCCCACCCGATGC - 3'
GU101: 5' - AATGATACGGCGACCACCGAGATCTACACGTAATAGCTAAATCCCACCCG ATGCCTGCAGG - 3'
With DNA and clean, uncontaminated cDNA in hand, the next step is to perform Illumina sequencing for the "barcode sequencing run".
Once we have "mapped" each barcode to its corresponding, mutated promoter, we next perform the barcode sequencing process, which enables us to determine the gene expression value for each, unique promoter. We outline the general approach of this experiment in Fig. 2, below.
Figure 2: Schematic outlining experimental steps after mapping the DNA library. E. coli cells harboring the mutagenized promoters are again grown to a specific OD600, but this time they are grown in any desired media (LB, M9, anaerobic growth, and so forth). Both DNA and RNA are isolated and purified. RNA is reverse transcribed to cDNA, and both the DNA and cDNA libraries are amplified via PCR to add adaptor and index sequences, which can then be run on an Illumina machine as a single-end read. After sequencing, the resulting `.fastq` files are analyzed, and barcodes are counted in both their DNA and cDNA form to deduce the relative gene expression for each mutagenized promoter.
In the barcode sequencing experiment, we begin again with the library of E. coli cells, each harboring a different, mutagenized plasmid with the promoter sequence of the gene / operon to be studied. Each of these mutant promoters will produce a different amount of RNA -- in other words, each mutant promoter will have a different gene expression value. We can determine this value by counting how many times a barcode appears in both the DNA and cDNA forms, and then dividing (cDNA / DNA) to determine gene expression. We grow this cellular library to an OD600 of 0.3 in the selected media (this could be anaerobic growth, growth without sugar, and so forth; we outline this process in detail in the section that follows). Next, we isolate both the DNA and RNA from this library. We perform reverse transcription on the RNA to produce cDNA, and then we add index and adaptor sequences to both the DNA and cDNA before loading them into the Illumina machine. When we analyze the resulting data (which is again exported in `.fastq` format), we count how many times each barcode appears in both DNA and cDNA forms. We now go through the specific details of this protocol.
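The counting and division described above can be sketched as follows. This is only an illustration of the cDNA / DNA ratio, not the analysis code used in the Reg-Seq study; the toy barcodes and the `min_dna_count` cutoff are hypothetical.

```python
from collections import Counter

def expression_ratios(dna_barcodes, cdna_barcodes, min_dna_count=1):
    """Return cDNA count / DNA count for every barcode seen in the DNA library."""
    dna_counts = Counter(dna_barcodes)
    cdna_counts = Counter(cdna_barcodes)
    return {
        barcode: cdna_counts[barcode] / n_dna  # Counter returns 0 for unseen barcodes
        for barcode, n_dna in dna_counts.items()
        if n_dna >= min_dna_count
    }

# Toy barcode reads (real barcodes are 20 bp long).
dna_reads = ["AAA", "AAA", "CCC"]
cdna_reads = ["AAA", "AAA", "AAA", "AAA", "GGG"]
ratios = expression_ratios(dna_reads, cdna_reads)
```

Barcodes that appear only in the cDNA data (such as "GGG" here) are skipped, since a ratio cannot be computed without a DNA count.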
Once the DNA and cDNA libraries are prepared for each growth condition, we next perform the "barcode sequencing" run using the services of the Millard and Muriel Jacobs Genetics and Genomics Laboratory at Caltech. This facility uses a HiSeq 2500.
For this part of the sequencing experiment, only the region containing the random, 20 bp barcode needs to be sequenced, since the "mapping" run already linked each unique barcode to a corresponding, mutated promoter.
To prepare DNA and cDNA libraries for this "barcode sequencing" run, many of the steps are the same as in the mapping part of this protocol. First, we amplify the DNA and cDNA libraries, adding the necessary adaptor and index sequences. Add the adaptor and index sequences using PCR with the following settings. Set up each PCR reaction in triplicate, as this will ensure that you obtain enough DNA.
reagent | concentration | volume (μL) |
---|---|---|
DNA/cDNA Library | - | 1 μL (DNA) or 4 μL (cDNA) |
Q5 polymerase mix | 2x stock | 25 |
Primer Fwd | 10 μM | 2.5 |
Primer Rev | 10 μM | 2.5 |
Water | N/A | 19 μL (DNA) or 16 μL (cDNA) |
Primers:
-fwd: 5'-AATGATACGGCGACCACCGAGATCT ACACTCTTTCCCTACACG ACGCTCTTCCGATCT NNNN GCCGTCGTTTTACATGACTG-3'
-rev: 5'-AAGCAGAAGACGGCATACGAGATCGGT CTCGGCATTCCTGCT GAACCGCTCTTCCGATCT NNNN GCAGGGGATAATATTGCCCA-3'
The format for these primers is (beginning on the 5' end) Illumina adaptor, followed by a 4nt barcode (NNNN), and concludes with a 20bp overlap with the plasmid. Here, the Illumina adaptor is unindexed and 'NNNN' is a four base pair barcode that we use to identify different growth conditions for the experiments. It is a good idea to perform this PCR on biological replicates for each sequencing run, using unique, 4nt barcode sequences to specify each growth condition and biological replicate.
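When ordering primers for several growth conditions, it may help to generate the full sequences programmatically. The sketch below fills the NNNN placeholder of the forward primer shown above with a condition-specific barcode; the condition names and their barcode assignments are hypothetical examples.

```python
# Forward primer template copied from the primer list above (spaces separate its parts).
FWD_TEMPLATE = (
    "AATGATACGGCGACCACCGAGATCT ACACTCTTTCCCTACACG ACGCTCTTCCGATCT "
    "NNNN GCCGTCGTTTTACATGACTG"
)

def make_primer(template, barcode):
    """Replace the NNNN placeholder with a 4-nt barcode and remove the spaces."""
    if len(barcode) != 4 or set(barcode) - set("ACGT"):
        raise ValueError(f"barcode must be 4 nt of A/C/G/T, got {barcode!r}")
    return template.replace(" ", "").replace("NNNN", barcode)

# Hypothetical assignment of barcodes to conditions and replicates.
condition_barcodes = {"glucose_rep1": "TCTA", "glucose_rep2": "AGAG", "LB_rep1": "GCAT"}

# Every condition/replicate needs its own barcode.
all_unique = len(set(condition_barcodes.values())) == len(condition_barcodes)

primers = {name: make_primer(FWD_TEMPLATE, bc) for name, bc in condition_barcodes.items()}
```

Checking that the barcodes are unique before ordering avoids two conditions becoming indistinguishable after sequencing.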
Mix these reagents carefully, while keeping everything on ice throughout the experiment.
Use the following thermocycler settings with 19 cycles. We have historically used 19 cycles for this amplification step, but it is always a great idea to confirm that this number of cycles is appropriate by precisely following the qPCR protocol outlined in Part 2 of this Wiki.
cycles | temperature | time |
---|---|---|
1 | 98°C | 30 seconds |
~19 | 98°C | 10 seconds |
~19 | 63.8°C (anneal) | 30 seconds |
~19 | 72°C (extend) | 30 seconds |
1 | 72°C | 120 seconds |
Hold | 4°C | ∞ |
For more information on the thermocycler conditions for Q5 polymerase, see the NEB website.
Perform a gel extraction by adding 10 μL of 6x NEB DNA dye to each 50 μL PCR reaction. Load the full volumes on a thick, 2% agarose gel (each sample in its own well). Perform electrophoresis for 45 minutes at 120V. Use a scalpel to remove the DNA band corresponding to the amplified oligo libraries. Perform a gel extraction using one of many commercially-available kits. We have previously obtained good results with the Zymoclean Gel DNA Recovery Kit. Only extract that band which corresponds to the expected size of the amplified PCR product.
After performing the gel extraction according to the manufacturer's protocol, NanoDrop the eluted DNA and record the concentration and purity.
For barcode sequencing, we have historically used the Millard and Muriel Jacobs center at Caltech. They will run the samples on a Hi-Seq 2500 machine (the same as NGX Bio) and will also analyze the libraries with a Bioanalyzer machine.
For barcode sequencing, we typically use 75 million reads per growth condition, and find that this provides good results in downstream analyses (e.g. plotting of information footprints and energy matrices).
After the barcode sequencing run, we still do post-processing on the sequencing data. We use single-end reads for the barcode sequencing part of the protocol and, accordingly, there is no need to perform a "joining" step with the FLASH tool. However, we still perform quality score filtering of the `.fastq` files from the Illumina machine.
To perform quality filtering, use the FastX tool, as discussed earlier in this Wiki page. As a reminder, we remove low-quality reads by performing quality filtering with FastX, and execute the following commands:
cat file_to_be_filtered.fastq | ./fastq_quality_filter -Q33 -o output_file_name.fastq
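In this command, `file_to_be_filtered.fastq` is the `.fastq` file that is to be filtered. The FastX command for quality filtering is `fastq_quality_filter`. The -Q33 flag tells FastX that the quality scores are encoded with an ASCII offset of 33 (Phred+33); it does not set the filtering threshold, which is controlled by the -q and -p options described in the FastX documentation. The output file name is given after -o.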
After filtering the single-end sequencing data, we now have a `.fastq` file that contains only high-quality sequencing reads. With this dataset in hand, we next perform barcode splitting, a computational process that separates the sequencing `.fastq` files by growth condition, as well as by whether they come from the DNA library or are derived from the RNA extraction. Recall that each experimental condition (biological replicate, RNA vs. DNA, and growth condition) receives a unique 4-nt barcode, which helps us to identify where each library came from. It is now necessary to split the sequencing data into bins, based on these experimental conditions. We use FastX to perform barcode splitting, by running the command:
cat sequences_to_be_split.fastq | ./fastx_barcode_splitter.pl --bcfile bcfile.txt --prefix my_split_output --exact --bol
For an explanation of the barcode splitting command, see the FastX documentation.
To use this command, you will need to specify which barcode sequences you actually used in preparing your DNA libraries! In the command above, these barcode sequences are specified in the `bcfile.txt` file, and an example layout for this file is simply:
BI96 TCTA
BI97 AGAG
BI98 GCAT
BI95 CAGT
BI101 CAAG
BI105 ATGC
After running this command, you will now have a quality-filtered, split sequencing file with data for each biological replicate, DNA vs. RNA, and growth condition.
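For readers who want to see what the splitting step does, the sketch below mimics it in Python: each read is assigned to the barcode that exactly matches its first bases, which is our reading of the --exact and --bol options. The barcode names come from the example file above, while the reads are made up.

```python
def load_barcodes(text):
    """Parse 'name<whitespace>sequence' lines, as in the example bcfile.txt layout."""
    barcodes = {}
    for line in text.splitlines():
        if line.strip() and not line.startswith("#"):
            name, sequence = line.split()
            barcodes[name] = sequence
    return barcodes

def split_by_barcode(reads, barcodes):
    """Bin each read by the barcode that exactly matches its 5' end."""
    bins = {name: [] for name in barcodes}
    bins["unmatched"] = []
    for read in reads:
        for name, sequence in barcodes.items():
            if read.startswith(sequence):
                bins[name].append(read)
                break
        else:
            bins["unmatched"].append(read)  # no barcode matched this read
    return bins

bcfile_text = "BI96\tTCTA\nBI105\tATGC\n"
toy_reads = ["TCTAGGGACC", "ATGCAAATTT", "GGGGTTTCCC", "TCTACCCGGG"]
split_reads = split_by_barcode(toy_reads, load_barcodes(bcfile_text))
```

Reads that match none of the listed barcodes are kept in a separate bin so that they can be inspected later.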
Note: This protocol must be tested, as we have not previously performed Reg-Seq on genome-integrated libraries.
If performing the standard Reg-Seq protocol, in which DNA libraries are expressed from a plasmid, the sequencing primers (e.g. Primer for Read 1 and 2, and Index Reads) are already present in the Illumina cartridge. There is no need to add custom primers.
If you want to perform Reg-Seq on genome-integrated libraries (both for mapping and barcode sequencing), however, the sequencing primers required are not standard Illumina primers and, thus, are not present in the cartridge. Therefore, if sequencing a genome-integrated DNA library that was cloned using the pLibAcceptorV2 plasmid, it is necessary to "spike-in" custom primers. Accordingly, you will not be able to use the DNA sequencing facility at Caltech, as they only allow the use of standard Illumina adaptor sequences.
To spike in custom primers, you must buy a reagent kit and flow cell for the MiSeq (such as the MiSeq Reagent Kit V2 which includes all reagents and the flow cell) and perform sequencing on your own. For custom sequencing runs of this nature, we have previously used the Single-Cell Profiling and Engineering Center (SPEC) at Caltech, which is run by Matt Thomson's laboratory at Caltech. Jeff Park, a research associate at SPEC, is particularly knowledgeable about custom sequencing runs.
Illumina has extensive documentation on spiking-in custom primers. If you do decide to go this route, consult the manual, set up a meeting with somebody at SPEC, and get their advice before proceeding.
Read 1 - GU60 - BC Amp Rev (spike into Illumina Well #12 if using MiSeq v2 flow-cell kit)
5' - GTAATAGCTAAATCCCACCCGATGCCTGCAGG - 3'
Read 2 - GU79 (spike into Illumina Well #14 if using MiSeq v2 flow-cell kit)
5' - CGTGCATAGTGCCATGTTATCCCTGAAGTCGAG - 3'
Index Read - GU88 - Lib i7 (spike into Illumina Well #13 if using MiSeq v2 flow-cell kit)
5' - CTCGACTTCAGGGATAACATGGCACTATGCACG - 3'
Read 1 - GU60 - BC Amp Rev (spike into Illumina Well #12 if using MiSeq v2 flow-cell kit)
5' - GTAATAGCTAAATCCCACCCGATGCCTGCAGG - 3'
Index 1 - GU71 (spike into Illumina Well #13 if using MiSeq v2 flow-cell kit)
5' - CAGGACCATCACCGATTGGAGTGGACAACATG - 3'
In most cases, a MiSeq v2 flow-cell (up to 15 million reads) is sufficient for both mapping and barcode sequencing, but use this formula to check:
No. of unique barcodes in sample (quantified from mapping data) * no. of strains or growth conditions * 2 biological replicates for RNA * 2 biological replicates for DNA * 10x reads per barcode/sample = total reads required (at minimum).
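The formula is easy to evaluate for your own experiment. The sketch below implements it exactly as written above; the numbers in the example (50,000 barcodes and 3 growth conditions) are hypothetical.

```python
MISEQ_V2_MAX_READS = 15_000_000  # "up to 15 million reads", as stated above

def min_total_reads(n_barcodes, n_conditions, rna_replicates=2,
                    dna_replicates=2, reads_per_barcode=10):
    """Minimum total reads required, following the formula in the text."""
    return (n_barcodes * n_conditions * rna_replicates
            * dna_replicates * reads_per_barcode)

# Hypothetical experiment: 50,000 unique barcodes, 3 growth conditions.
required = min_total_reads(n_barcodes=50_000, n_conditions=3)
fits_on_miseq = required <= MISEQ_V2_MAX_READS
print(f"{required:,} reads required; fits on a MiSeq v2 flow cell: {fits_on_miseq}")
```

In this example the requirement is 6 million reads, which fits within a single MiSeq v2 flow cell; larger libraries or more conditions may not.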