_____ _ __ ___ _ | __ \ | | \ \ / (_) | | /\ | |__) |_ _ _ __ __ _| | ___ __ \ \ /\ / / _ ______ _ _ __ __| | / \ | ___/ _` | '__/ _` | |/ _ \ / _` \ \/ \/ / | |_ / _` | '__/ _` | | | | | | (_| | | | (_| | | (_) | (_| |\ /\ / | |/ / (_| | | | (_| | --:'''':-- |_| \__,_|_| \__,_|_|\___/ \__, | \/ \/ |_/___\__,_|_| \__,_| :'_' : |___/ _:"":\___ ' ' ____.' ::: '._ . *=====<<=) \ : . ' '-'-'\_ /'._.' \====:_ "" _ ___ _____ ___ .' \\ /_\ / __| |_ _| / __| : : / _ \ | (__ | | | (_ | / : \ /_/ \_\ \___| |_| \___| : . '. ,. _ : : : : . .-. .-. .-. .-. .-. .-. . '-' _ ). :__:-:__.;--' |\ /|||\|||\ /|||\|||\ /|||\|||\ /| ( _| _ ) '-' '-' ||\|||/ \|||\|||/ \|||\|||/ \|||\|| ( - _| |_| -_ -~ `-~ `-~ `-` `-~ `-` `-~ `- ( _| |_ |_ ) '- |_ -
- Python 3.6 or later
- BIOPYTHON 1.77 or later
- BLAST command line tools 2.2.30 or later
- NumPy
- Scikit-learn
- SciPy
- Matplotlib
- MAFFT 6.9 or later
- FastTree 2.0.0 or later
- BLAT
- SPAdes
- GNU Parallel
- BWA
- samtools 1.2 or later
- pandas
Working_directory |_____Data_folder | |_____10deduplicated_reads | |_____20assemblies | |_____30raw_contigs | |_____31exonic_contigs | |_____40aln_orth_par | |_____41detected_par | |_____50pslx | |_____60mafft | |_____70concatenated_exon_alignments |_____ParalogWizard |_____ParalogWizard.py |_____probes_concatenated_exons.fasta |_____probes_separated_exons.fasta
- Trimmed, filtered and deduplicated pair-end reads are stored in 10deduplicated_reads folder together with list of samples called
samples_list.txt
. All names and IDs shoud contain only letters and numbers.
10deduplicated_reads |_____Genus1-species1_ID.R1.fastq |_____Genus1-species1_ID.R2.fastq |_____Genus2-species1_ID.R1.fastq |_____Genus2-species1_ID.R2.fastq |_____Genus2-species2_ID.R1.fastq |_____Genus2-species2_ID.R2.fastq ... |_____samples_list.txt
List of samples must have names of the samples which correspond to fastq files, one per each line with empy line in the end.
Genus1-species1_ID
Genus2-species1_ID
Genus2-species2_ID
...
- Two probe files: one containing sequences which are concatendated exons, another - with the same sequences separated to exons. Both files should have same names for the same gene. All names shoud contain only letters and numbers.
>Representative1-Gene1
>Representative1-Gene2
>Representative2-Gene3
...
>Representative1-Gene1_exon_1
>Representative1-Gene1_exon_2
>Representative1-Gene2_exon_1
>Representative2-Gene3_exon_2
>Representative2-Gene3_exon_4
...
python3 ParalogWizard.py cast_assemlble -d -pr <bait file> [-np <executing command without GNU Parallel>]
-d
— folder with data within working directory
This part of PW is a simplified code of HybPiper pipeline and includes only mapping with bwa and assembly with spades. Takes de-duplicated and filtered reads from 10deduplicated_reads
and saves assemblies to 20assemblies
with original HybPipers' sub-folder structure.
-pr
— reference file with concatenated exons, stored in working directory
-np
— option to use assembly step without GNU parallel
python3 ParalogWizard.py cast_retrieve -d <folder with data> -pe <probe file with separated exons> [-c <collect/recollect raw contigs][-l threshold for length cover of BLAST hits] [-s <threshold for k-mer cover of contigs assembled by SPAdes>] [-nc <number of cores>]
Collects contigs assembled by SPAdes to folder 30raw_contigs
. Matches retrieved contigs to the probe file with individual separated exons with BLAST, extracts exonic contigs according to hits and stores them in folder 31exonic_contigs
within the main folder with ParalogWizards results. Hit tables and statistics with assumed number of copies for each locus per sample are saved alongside.
-d
— folder with data within working directory
-pe
— reference file with split exons, stored in working directory
-c
— option for collecting/recollecting assemblies from 20assemblies
to 30raw_contigs
, after each successful assembly run can be done only once, contigs need to be recollected only when assembly step is modified
-l
— blast hit length filter (in %) for BLAST step. Default 75.
-s
— k-mer cover filter (float) for assembled contigs. Default is 5.
-nc
— option to indicated number of used cores. Default 1.
python3 ParalogWizard.py cast_analyze -d <folder with data> [-b <list of taxa excluded from paralog divergence estimation>] [-nc <number of cores>]
Builds alignments for each exon with all exonic contigs retrieved for this exon and saves them to 40aln_orth_par
, then calculates divergences between alternative contigs within each samples and combines these divergencies to individual exon distributions. Individual exon distributions are then clustered to peaks and mean values from each peak is taken to global distribution.
-b
— option to exclude samples from paralog divergence estimation and creation of a custom reference with or without paralogs. Names of samples should be provided exactly as in the list of samples in 10deduplicated_reads
. When used, at least one sample has to be provided.
-nc
— option to indicated number of used cores. Default 1.
python3 ParalogWizard.py cast_detect -d <folder with data> -pe <probe file with separated exons> [-b <list of taxa excluded from new reference creation>] [-p -mi <minimum paralog divergence> -ma <maximum paralog divergence>]
-d
— folder with data within working directory
-pe
— reference file with split exons, stored in working directory
-b
— option to exclude samples from paralog divergence estimation and creation of a custom reference with or without paralogs. Names of samples should be provided exactly as in the list of samples in 10deduplicated_reads
. When used, at least one sample has to be provided.
-p
— option to work with paralogs
-mi
— minimum divergence between copies for paralogs search. Required only with -p
.
-ma
— maximum divergence between copies for paralogs search. Required only with -p
.
python3 ParalogWizard.py cast_separate -d <folder with data> -pc <probe file with separated paralogs> -i <minimum identity for BLAT> [-r <list of taxa excluded from paralogs separation, ie included in all alignments in case of >]
-d
— folder with data within working directory
-pc
— path to customized reference generated by cast_detect
and saved in 41detected_par
or 41without_par
within working directory
-i
— minimum identity for BLAT step.
-r
— option to exclude samples supposedly diverged before the event of WGD and, therefore, originally lacking paralogs associated with WGD; if single copy, sequences of samples from redlist are included into both alignments of separated paralogs (i.e., identical sequences present in both copies' alignments) in contrast to cases with gene loss after WGD (such sequences are included into only one alignment where they fit better).
[server] |_____home/[logname] |_____Working_directory | |_____Data_folder | | |_____10deduplicated_reads | | |_____20assemblies | | |_____30raw_contigs | | |_____31exonic_contigs | | |_____40aln_orth_par | | |_____41detected_par | | |_____50pslx | | |_____60mafft | | |_____70concatenated_exon_alignments | | | |_____ParalogWizard_1a_CastSubmitAssemble.sh | |_____ParalogWizard_1a_CastSubmitAssemble_submitter.sh | |_____ParalogWizard_1b_CastSubmitRetrieve.sh | |_____ParalogWizard_2_CastSubmitAnalyze.sh | |_____ParalogWizard_3_CastSubmitSeparate.sh | |_____ParalogWizard_Settings.cfg | |_____HybSeqSource |_____ParalogWizard |_____ParalogWizard.py |_____probes_concatenated_exons.fasta |_____probes_separated_exons.fasta |_____custom_reference_with_separated_paralogs.fasta
data
— path to data directory as [Working directory]/[Data directory] (see data structure above) within home directory on a particular server
probe_exons_split
— name of the reference file with split exons; value for -pe
option of cast_retrieve
command
probe_exons_concat
— name of the reference file with concatenated exons; value for -pr
option of cast_assemble
command
exon_length
— minimum exon length accepted to analysis
server
— Metacentrum server with all used directories
collect_contigs
— "yes"
if needed to collect assemblies from 20assemblies
to 30raw_contigs
otherwise "no"
or empty, after each assembly run can be done only once, contigs need to be recollected only when assembly step is modified; value for -c
option of cast_retrieve
command
length_cut
— blast hit length filter (in %) during retrieving step ; value for -l
option of cast_retrieve
command
spades_cover_cut
— k-mer cover filter (int) during retrieving step; value for -s
option of cast_retrieve
command
blocklist
— list of samples excluded from paralog divergence estimation and creation of a custom reference with or without paralogs separated by space and enclosed in quotation marks; value for -b
option of cast_analyze
and cast_detect
commands
redlist
— list of samples supposedly diverged before the event of WGD and, therefore, originally lacking paralogs associated with WGD; if single copy, sequences of samples from redlist are included into both alignments of separated paralogs (i.e. identical sequences present in both copies' alignments) in contrast to cases with gene loss after WGD (such sequences are included into only one alignment where they fit better); value for -r
option of cast_retrieve
command
paralogs
— "yes"
for work with paralogs, "no"
will generate customized reference to produce correct orthologous alignments; value for -p
option of cast_detect
command
paralog_min_divergence
— minimum divergence between copies for paralogs search; value for -mi
option of cast_detect
command
paralog_max_divergence
— maximum divergence between copies for paralogs search;value for -ma
option of cast_detect
command
probes
— customized reference generated by cast_detect
and saved in 41detected_par
or 41without_par
; value for -pc
option of cast_separate
command
minident
— minimum identity for BLAT during separation step; value for -i
option of cast_separate
command
ParalogWizard_1a_CastSubmitAssemble_subitter.sh
— submits ParalogWizard_1a_CastSubmitAssemble.sh
for each sample as a separate job according to the sample list in 10deduplicated_reads
. Used as executable.
ParalogWizard_1a_CastSubmitAssemble.sh
— executes cast_assemble
command with options according to settings set in ParalogWizard_Settings.cfg
. Normally is not used directly. For execution over individual samples use -v sample="[sample]"
.
qsub -v sample="[sample]" ParalogWizard_1a_CastSubmitAssemble.sh
ParalogWizard_1b_CastSubmitRetrieve.sh
— executes cast_retrieve
command with options according to settings set in ParalogWizard_Settings.cfg
. Used with qsub
.
ParalogWizard_2a_CastSubmitAnalyze.sh
— executes cast_analyze
command with options according to settings set in ParalogWizard_Settings.cfg
. Used with qsub
.
This step can be skipped in case of working without paralogs (paralogs="no")
ParalogWizard_2b_CastSubmitDetect.sh
— executes cast_detect
command with options according to settings set in ParalogWizard_Settings.cfg
. Used with qsub
.
In case of work without paralogs, this step will create customized reference without paralogs separated.
ParalogWizard_2b_CastSubmitSeparate.sh
— executes cast_separate
command with options according to settings set in ParalogWizard_Settings.cfg
. Used with qsub
.
Before this step, customized reference with or without separated paralogs must be copied to HybSeqSource folder