This repository contains notes on generating a DNA string alignment dataset from real sequencing data in NCBI BioProject, on Ubuntu.
First, download NCBI's SRA Toolkit, which is needed to fetch datasets from NCBI BioProject. Here we use version 3.0.0; if a newer version is available, check the sra-tools repository.
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-ubuntu64.tar.gz
tar -xvf sratoolkit.3.0.0-ubuntu64.tar.gz
cd sratoolkit.3.0.0-ubuntu64/bin
echo "export PATH=\${PATH}:$(pwd)" >> ~/.bashrc
source ~/.bashrc
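To confirm that the `PATH` change above took effect, a small check like the following may help. This is a minimal sketch: `check_tools` is a hypothetical helper name, not part of the SRA Toolkit, and it only tests whether each command can be found, not whether it runs correctly.

```shell
# check_tools: report which of the named commands are missing from PATH.
# Hypothetical helper; returns 0 if all commands are found, 1 otherwise.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (in a new shell, or after sourcing ~/.bashrc):
#   check_tools prefetch fastq-dump vdb-config
```

If any tool is reported as missing, the `export PATH` line appended to `~/.bashrc` is the first thing to inspect.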
You may want to run `vdb-config --interactive` first, before testing the installation with `prefetch`.
After opening a project page (we use PRJNA178613 as an example), find the Project Data table and click the number under Number of Links to get a list of links to runs. Click one of the links and you will see an accession ID starting with `SRR`. Copy that ID (e.g. SRR611076) and run
prefetch SRR611076
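If you script this step for several runs, it is easy to paste a project ID (such as PRJNA178613) where a run ID is expected. The sketch below guards against that. Both function names are hypothetical helpers, and the pattern assumes run accessions have the form `SRR`, `ERR`, or `DRR` followed by digits.

```shell
# is_run_accession: succeed only if the argument looks like a run accession
# (SRR, ERR, or DRR followed by digits). Hypothetical helper.
is_run_accession() {
    printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]+$'
}

# fetch_runs: call prefetch once per valid accession and skip anything else.
# Hypothetical helper; assumes prefetch is on PATH.
fetch_runs() {
    local acc
    for acc in "$@"; do
        if is_run_accession "$acc"; then
            prefetch "$acc"
        else
            echo "skipping '$acc': not a run accession" >&2
        fi
    done
}

# Example call:
#   fetch_runs SRR611076
```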
Downloading this dataset takes around 1.5 hours. Afterwards, an `.sra` file is available in `./SRR611076`. We can then convert it into FASTQ files with
cd SRR611076/
fastq-dump --split-files SRR611076.sra
We use `--split-files` because this dataset has a `PAIRED` layout. After some time, two FASTQ files are generated.
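As a quick sanity check on the two generated files, you can compare their read counts. This is a sketch under the assumption that each FASTQ record occupies exactly four lines (no wrapped sequence lines); `count_reads` and `pairs_match` are hypothetical helper names.

```shell
# count_reads: print the number of records in a FASTQ file, assuming
# four lines per record. Hypothetical helper.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# pairs_match: succeed if both mate files hold the same number of reads.
pairs_match() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Example call:
#   pairs_match SRR611076_1.fastq SRR611076_2.fastq && echo "read counts match"
```

A mismatch may indicate an interrupted conversion or a run that contains some unpaired reads, so it is worth investigating before mapping.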
We use BWA as the sequence mapper. First, download a reference genome for the species (`sequence.fasta` here), then build the index and map the reads as follows.
bwa index -p test sequence.fasta
bwa mem -M -t 1 test SRR611076_1.fastq SRR611076_2.fastq > SRR611076.sam
This creates a `.sam` file, which records the positions where each read may map to the reference. We can then use it to generate a string alignment dataset with the script `generate_dataset.sh`. To use the script, run the following:
chmod +x generate_dataset.sh
./generate_dataset.sh [sam_file] [fasta_file] > [output_directory]
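Before passing a SAM file to the script, it can be useful to see how many alignment records are actually mapped. The sketch below relies only on the SAM format: header lines start with `@`, and bit 0x4 of the FLAG column (field 2) marks an unmapped read. `count_mapped` is a hypothetical helper name. Note that it counts records rather than unique reads, so any secondary alignments in the file are included.

```shell
# count_mapped: print "<mapped> <total>" for the alignment records in a SAM file.
# Hypothetical helper; uses plain awk arithmetic so it also works with mawk.
count_mapped() {
    awk -F'\t' '
        /^@/ { next }                       # skip header lines
        { total++ }
        int($2 / 4) % 2 == 0 { mapped++ }   # 0x4 bit clear: record is mapped
        END { printf "%d %d\n", mapped, total }
    ' "$1"
}

# Example call:
#   count_mapped SRR611076.sam
```

A very low mapped count relative to the total may suggest that the reference genome does not match the sequenced species.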