Pseudogenome-based Read Compressor (PgRC) is an in-memory algorithm for compressing the DNA stream of FASTQ datasets, based on the idea of building an approximation of the shortest common superstring over high-quality reads.
The implementation supports constant-length reads limited to 255 bases.
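Because of this limit, it can be useful to verify an input file before compressing it. The following is a minimal sketch, not part of PgRC: the helper name `check_read_lengths` is hypothetical, and it assumes plain, uncompressed FASTQ with exactly four lines per record (sequence on the second line of each record).

```shell
# Hypothetical helper (not part of PgRC): checks that all reads in a FASTQ
# file have the same length and that this length is at most 255 bases.
# Assumption: 4-line FASTQ records, sequence on the 2nd line of each record.
check_read_lengths() {
    awk '
        NR % 4 == 2 {
            sub(/\r$/, "")                      # tolerate Windows line endings
            len = length($0)
            if (!seen) { expected = len; seen = 1 }
            if (len != expected) {
                printf "read %d has length %d, expected %d\n", (NR + 2) / 4, len, expected
                bad = 1; exit                   # jumps to END
            }
            if (len > 255) {
                printf "read %d has length %d, above the 255-base limit\n", (NR + 2) / 4, len
                bad = 1; exit
            }
        }
        END {
            if (bad) exit 1
            if (!seen) { print "no reads found"; exit 1 }
            printf "ok: constant read length %d\n", expected
        }
    ' "$1"
}
```

Calling `check_read_lengths in.fastq` prints a one-line summary and returns a non-zero status when the reads are of varying length or longer than 255 bases.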
The following steps create a PgRC executable.
On Linux, building PgRC requires CMake version >= 3.5 (check using cmake --version):
git clone https://github.com/kowallus/PgRC.git
cd PgRC
mkdir build
cd build
cmake ..
make PgRC
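The CMake version requirement can also be checked automatically before running the build steps above. This is a rough sketch under stated assumptions: `version_ge` and `check_cmake` are hypothetical helper names, and the comparison relies on `sort -V` (version sort), which is available in GNU coreutils but not in every `sort` implementation.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumption: `sort -V` (GNU coreutils version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: reports whether the installed cmake is >= 3.5.
check_cmake() {
    command -v cmake >/dev/null 2>&1 || { echo "cmake not found"; return 1; }
    # The first line of the output looks like "cmake version 3.22.1".
    installed=$(cmake --version | head -n 1 | awk '{print $3}')
    if version_ge "$installed" 3.5; then
        echo "cmake $installed is new enough"
    else
        echo "cmake $installed is older than 3.5"
        return 1
    fi
}
```

Running `check_cmake && cmake ..` inside the build directory would then configure the project only when the requirement is met.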
PgRC [-i <seqSrcFile> [<pairSrcFile>]] [-t <noOfThreads>] [-o] [-d] <archiveName>
-o preserve original read order information
-t number of threads used
-d decompression mode
compression of DNA stream in order non-preserving regime (SE mode):
./PgRC -i in.fastq comp.pgrc
compression of DNA stream in order preserving regime (SE_ORD mode):
./PgRC -o -i in.fastq comp.pgrc
compression of paired-end DNA stream in order non-preserving regime (PE mode):
./PgRC -i in1.fastq in2.fastq comp.pgrc
compression of paired-end DNA stream in order preserving regime (PE_ORD mode):
./PgRC -o -i in1.fastq in2.fastq comp.pgrc
decompression of DNA stream to the current folder:
./PgRC -d comp.pgrc