Cassava Genomics project

Collection of scripts used in the Cassava Genomics Project

Script	Version	Source	Cite
clean_genomic_fasta.py: Clean contig identifiers to avoid incompatibility issues	0.15	https://github.com/bpucker/GenomeAssembly/	10.1101/2023.06.27.546741
contig_stats.py: Calculate contig statistics	1.31	https://github.com/bpucker/script_collection/	10.1371/journal.pone.0164321
genetic_map_to_fasta.py: Create input file for ALLMAPS merge command by mapping genetic markers to assembly contigs	0.2	-	10.1101/2024.09.30.615795
cov_plot.py: Create assembly coverage plot from coverage file.	0.2	https://github.com/bpucker/At7	10.1371/journal.pone.0164321
coverage_te_plot.py: Adjusted to create a coverage plot including density of TE repeats for M. esculenta.	0.5	-	10.1371/journal.pone.0164321
RNAseq_cov_analysis.py: Analyse coverage of predicted polypeptide sequences by RNAseq data	0.1	https://github.com/bpucker/GenomeAssembly/	10.1101/2023.06.27.546741
TE_repeat_analysis.R: Analyse repeat density of EDTA results	0.1	-	10.1038/s41597-023-02800-0

genetic_map_to_fasta.py

python genetic_map_to_fasta.py \
--map <FULL_PATH_TO_GENETIC_MAP_FILE> \
--contigs <FULL_PATH_TO_CONTIGS_FILE> \
--output <BASE_PATH_TO_OUTPUT_FILE>
[--sim <MINIMUM_SIMILARITY_BEST_HIT>]
[--score <MINIMUM_SCORE_BEST_HIT>]

Mapping of genetic markers (--map) to assembly contigs (--contigs). Genetic markers are expected in the format of the composite genetic map from Manihot esculenta Crantz by the ICGMC (File S2, https://doi.org/10.1534/g3.114.015008).

Input:

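--map: Genetic markers
--contigs: FASTA file containing assembly contigs
--output: Base path to output files (without extension)
--sim: Minimum similarity of the best hit (optional)
--score: Minimum score of the best hit (optional)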

Output:

<output>.fasta: FASTA file containing marker sequences
<output>_blastn_marker_contig_mapping.txt: BLASTN output
<output>_mapped_contigs.csv: Mapped markers, compatible with ALLMAPS merge command
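
The filtering implied by --sim and --score can be sketched as follows. This is a minimal illustration, not the code of genetic_map_to_fasta.py: it assumes that the BLASTN output uses the standard 12-column tabular layout (outfmt 6), that similarity means percent identity, and that score means bit score. The function names, marker IDs, and contig IDs are hypothetical.

```python
import csv

def best_hits(blast_lines, min_sim=0.0, min_score=0.0):
    """Return the highest-scoring hit per marker that passes both thresholds.

    Assumes BLAST tabular columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip empty lines and comment lines
        marker, contig = row[0], row[1]
        similarity = float(row[2])  # percent identity
        position = int(row[8])      # start of the hit on the contig
        score = float(row[11])      # bit score
        if similarity < min_sim or score < min_score:
            continue
        if marker not in best or score > best[marker]["score"]:
            best[marker] = {"contig": contig, "pos": position, "score": score}
    return best

def to_allmaps_rows(hits, genetic_positions):
    """Combine best hits with (linkage group, genetic position) per marker."""
    rows = []
    for marker, hit in sorted(hits.items()):
        if marker in genetic_positions:
            linkage_group, genetic_pos = genetic_positions[marker]
            rows.append([hit["contig"], hit["pos"], linkage_group, genetic_pos])
    return rows

# Hypothetical example data
example_blast = [
    "m1\tcontig_1\t99.0\t100\t1\t0\t1\t100\t5000\t5099\t1e-40\t180.0",
    "m1\tcontig_2\t91.0\t100\t9\t0\t1\t100\t700\t799\t1e-20\t120.0",
    "m2\tcontig_1\t85.0\t100\t15\t0\t1\t100\t9000\t9099\t1e-10\t90.0",
]
example_map = {"m1": ("LG1", 12.5), "m2": ("LG1", 30.0)}

hits = best_hits(example_blast, min_sim=90.0, min_score=100.0)
rows = to_allmaps_rows(hits, example_map)
print(rows)
```

Each row holds contig, position on the contig, linkage group, and genetic position. The exact column header expected by ALLMAPS should be checked against its documentation before writing such rows to a CSV file.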

coverage_te_plot.py

python coverage_te_plot.py \
--coverage_file <FULL_PATH_TO_COVERAGE_FILE> \
--te_file <FULL_PATH_TO_TE_FILE> \
--out <FULL_PATH_TO_OUTPUT_FILE> \
--cov <AVERAGE_COVERAGE>
[--res <RESOLUTION, WINDOW_SIZE_FOR_COVERAGE_CALCULATION> 1000]
[--sat <SATURATION, CUTOFF_FOR_MAX_COVERAGE_VALUE> 100.0]
[--num_contigs <NUMBER_OF_CONTIGS_TO_PLOT> 18]
[--max_cov <MAXIMUM_COVERAGE> 600]
[--max_chromosome <MAXIMUM_CHROMOSOME_SIZE_BP> 55000000]

Creates a coverage plot showing the average coverage in blocks of resolution size (--res). The maximum displayed coverage is defined by the saturation (--sat). Each chromosome is plotted separately, and the average coverage is marked by a red line. For each chromosome, a histogram is created showing the coverage distribution.

Input:

--coverage_file: Coverage file
--te_file: Repeats TSV created with Circos genomicDensity function
--cov: Average coverage
--out: Base path to output files (without extension)

Output:

<out>.png: Coverage plot
<out>_<contig>.png: Histogram of coverage for each contig/chromosome
<out>_coverage_resolution<res>.tsv: Average coverage per block
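
The interplay of --res and --sat described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of coverage_te_plot.py: it assumes per-base depth values for one contig are already loaded into a list and that the saturation caps each block average. The function name and the example values are hypothetical.

```python
def windowed_coverage(depths, resolution=1000, saturation=100.0):
    """Average per-base depths in consecutive blocks and cap each average.

    depths:     list of per-base coverage values for one contig
    resolution: block size in bp (corresponds to --res)
    saturation: maximum value kept for display (corresponds to --sat)
    """
    averages = []
    for start in range(0, len(depths), resolution):
        block = depths[start:start + resolution]  # last block may be shorter
        mean = sum(block) / len(block)
        averages.append(min(mean, saturation))
    return averages

# Hypothetical depths: a normal region, a collapsed repeat, a short tail
example_depths = [10] * 4 + [300] * 4 + [20] * 2
blocks = windowed_coverage(example_depths, resolution=4, saturation=100.0)
print(blocks)
```

A smaller resolution yields a more detailed but noisier plot, and capping at the saturation keeps a few extreme blocks from compressing the rest of the y-axis.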

TE_repeat_analysis.R

Rscript TE_repeat_analysis.R \
--repeat_gff3 <FULL_PATH_TO_REPEAT_GFF3_FILE> \
--repeat_gff3_intact <FULL_PATH_TO_INTACT_REPEAT_GFF3_FILE> \
--gene_gff3 <FULL_PATH_TO_GENE_GFF3_FILE> \
--chr_length_A <FULL_PATH_TO_CHR_LENGTH_A_FILE> \
--chr_length_B <FULL_PATH_TO_CHR_LENGTH_B_FILE> \
--output_dir <FULL_PATH_TO_OUTPUT_DIRECTORY>

Analyzes EDTA annotation results by classifying repeats. A bar plot is generated to show the distribution of transposable element (TE) families across chromosomes for each haplophase separately. Additionally, a circos density plot is produced, displaying the genomic density of TE repeats, intact TE repeats, and predicted coding sequences. The inner track of the plot includes a rainfall (rainbow) plot that illustrates the minimal distance between neighboring repeats for each TE family.

Input:

--repeat_gff3: Path to repeat GFF3 file from EDTA results
--repeat_gff3_intact: Path to intact repeat GFF3 file from EDTA results
--gene_gff3: Path to GFF3 file annotating predicted coding sequences
--chr_length_A: Path to TSV mapping chromosome IDs to chromosome length for haplophase A
--chr_length_B: Path to TSV mapping chromosome IDs to chromosome length for haplophase B
--output_dir: Directory for output files

Output:

<output_dir>/tables/TEs_barplot.png: Barplot showing distribution of TE families for each chromosome
<output_dir>/tables/density_repeats_A.tsv: Density of repeats for haplophase A
<output_dir>/tables/density_repeats_B.tsv: Density of repeats for haplophase B
<output_dir>/plots/circos_genomic_density_A.png: Circos plot for haplophase A
<output_dir>/plots/circos_genomic_density_B.png: Circos plot for haplophase B
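
The minimal distance between neighboring repeats, which the rainfall track displays, can be sketched in Python as follows. This only illustrates the idea; the computation in TE_repeat_analysis.R may differ in detail. The input tuples, the function name, and the family labels are hypothetical, and overlapping repeats are assigned a distance of 0.

```python
from collections import defaultdict

def min_neighbor_distances(repeats):
    """For each repeat, return the distance to its closest neighbor.

    repeats: iterable of (chrom, start, end, family) tuples.
    Neighbors are searched within the same family and chromosome.
    Returns {(family, chrom): [distance per repeat, sorted by start]}.
    A repeat without any neighbor gets None.
    """
    grouped = defaultdict(list)
    for chrom, start, end, family in repeats:
        grouped[(family, chrom)].append((start, end))

    result = {}
    for key, intervals in grouped.items():
        intervals.sort()
        # gap between each repeat and the next one; overlaps count as 0
        gaps = [
            max(0, intervals[i + 1][0] - intervals[i][1])
            for i in range(len(intervals) - 1)
        ]
        distances = []
        for i in range(len(intervals)):
            candidates = []
            if i > 0:
                candidates.append(gaps[i - 1])  # distance to previous repeat
            if i < len(gaps):
                candidates.append(gaps[i])      # distance to next repeat
            distances.append(min(candidates) if candidates else None)
        result[key] = distances
    return result

# Hypothetical example repeats
example_repeats = [
    ("chr1", 100, 200, "LTR/Gypsy"),
    ("chr1", 250, 300, "LTR/Gypsy"),
    ("chr1", 1000, 1100, "LTR/Gypsy"),
    ("chr1", 500, 600, "DNA/Helitron"),
]
distances = min_neighbor_distances(example_repeats)
print(distances)
```

Small distances indicate clusters of repeats from the same family, which show up as dips in a rainfall plot.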
