-
Notifications
You must be signed in to change notification settings - Fork 6
PacBio Sequencing Data Processing
Although one might starts with the circular consensus (CCS) reads, we prefer to work with the more "raw" reads, i.e. subreads. We use the following procedure to process PacBio sequencing data.
Fastq can be generated from SMRTLink portal. The base quality is default to zero and thus making the read unusable. For now, we will default the base quality artificially to the letter "H". For instance, a .bam generated by SMRTLink Portal for Sequel data will be tweaked as follow:
export RUNTYPE=m54186_170711_202450.subreads
samtools view ${RUNTYPE}.bam \
| perl -ne 'chomp();@b=split(/\t/);$rl=length($b[9]);
printf "@%s\n%s\n+\n%s\n",$b[0], $b[9], ("H" x $rl) if ($rl>=500);' \
> ${RUNTYPE}.fastq
LAST is used to generate possible read alignemnts against a reference genome. It is thus necessary to build the reference genome database to enable the alignment.
last-755/src/lastdb -v -P 1 hg19.nomask.st.lastdb hg19.fa 2>&1 \
| tee lastdb.v755.hg19.log
NOTE: In our experience, it takes 50 min to build hg19 on a single core with 16GB RAM.
NOTE: Lastal database is build once. You only need to re-build it if the sequence has changed.
LAST is used to generate the possible read alignments against a reference genome. Picky selectRep then sieves through the read alignments (.maf file) to pick the best representative alignment.
export RUNTYPE=m54186_170711_202450.subreads
last-755/src/lastal -r1 -q1 -a0 -b2 -Q1 -P4 \
hg19.lastdb\${RUNTYPE}.fastq 2>${RUNTYPE}.lastal.log \
| ./picky.pl selectRep --thread 4 --preload 10 \
1>${RUNTYPE}.align 2>${RUNTYPE}.selectRep.log
The direct streaming of last output to picky remove the need to store the gigantic intermediate .maf file. However, if you are troubleshooting or experimenting with other lastal alignment parameters and have sufficient storage, you may keep the .maf file and use the following commands instead.
last-755/src/lastal -r1 -q1 -a0 -b2 -Q1 -P4 hg19.lastdb ${RUNTYPE}.fastq \
1>${RUNTYPE}.maf 2>${RUNTYPE}.lastal.log
cat ${RUNTYPE}.maf \
| ./picky.pl selectRep --thread 4 --preload 6 \
1>${RUNTYPE}.align 2>${RUNTYPE}.selectRep.log
NOTE: For .align format, see output documentation
Finally, the selected alignments for read are processed to identify harbored structural variants.
cat ${RUNTYPE}.align \
| ./picky.pl callSV \
--oprefix ${RUNTYPE}.sv \
--fastq ${RUNTYPE}.fastq \
--genome hg19.fa --removehomopolymerdeletion \
--exclude=chrY --exclude=chrM \
--sam 2>${RUNTYPE}.callSV.log
Each structural variants are classified into the respectively files; namely deletion (.DEL.xls), insertion (.INS.ls), indel (.INDEL.xls), inversion (.INV.xls), translocation (.TTLC.xls), tandem duplication junction (.TDSR.xls), and tandem duplication complete read (.TDC.xls). Read not harboring any structural variant is recorded in .NONE.xls. Each aligned read fragment is recorded in .xls file.
NOTE: See output documentation for details.
File | Description |
---|---|
<oprefix>.profile.DEL.xls | tab-delimited file for deletions |
<oprefix>.profile.INS.xls | tab-delimited file for insertions |
<oprefix>.profile.INDEL.xls | tab-delimited file for possible co-insertion-and-deletion |
<oprefix>.profile.INV.xls | tab-delimited file for inversions |
<oprefix>.profile.TTLC.xls | tab-delimited file for translocations |
<oprefix>.profile.TDSR.xls | tab-delimited file for tandem duplications where read span the junction |
<oprefix>.profile.TDC.xls | tab-delimited file for tandem duplications where read completely cover the duplications |
<oprefix>.profile.xls | tab-delimited file for all read alignment segments |
If you are only interested in getting the list of structural variants and NOT the sam file, you may omit the "--sam" option.
Sometime, we may not be interested in SV identified on specific chromosomes. You may use the "--exclude" option. For instance, to exclude structural variant identified on mitochrondria, use the option "--exclude=chrM".
It is possible to convert the tab-delimited SVs .xls files to VCF format.
./picky.pl xls2vcf \
--xls ${RUNTYPE}.sv.profile.DEL.xls \
> ${RUNTYPE}.DEL.vcf
By default, only SVs with at least 2 read evidence support are reported. For low coverage data, one may wish to report even SVs that have single read support only.
./picky.pl xls2vcf \
--xls ${RUNTYPE}.sv.profile.DEL.xls \
--re 1 > ${RUNTYPE}.include.Singleton.DEL.vcf
It is also possible to have all the different types of SVs in a single vcf file.
export OPREFIX=${RUNTYPE}.sv
./picky.pl xls2vcf \
--xls ${OPREFIX}.profile.DEL.xls \
--xls ${OPREFIX}.profile.INS.xls \
--xls ${OPREFIX}.profile.INDEL.xls \
--xls ${OPREFIX}.profile.INV.xls \
--xls ${OPREFIX}.profile.TTLC.xls \
--xls ${OPREFIX}.profile.TDSR.xls \
--xls ${OPREFIX}.profile.TDC.xls \
> ${RUNTYPE}.allsv.vcf