loulanomics/Misc_Bioinformatics

 
 

Miscellaneous bioinformatics

Navigating common challenges in microbial ecology.

Multiplexing

Protocol. How to sort barcoded Illumina reads into individual FASTQ files: an easy way to taxonomically identify microbial isolates. Includes a program (demultiplexFASTQ.py) and a four-sample dataset.
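
The core idea of demultiplexing can be sketched in a few lines of Python. This is not the code in demultiplexFASTQ.py. It is a minimal illustration that assumes each read starts with an inline barcode and that barcodes must match exactly. The function names, the barcode-to-sample mapping, and the toy reads are all hypothetical.

```python
from collections import defaultdict


def parse_fastq(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1], lines[i + 3]


def demultiplex(records, barcodes):
    """Group reads by the barcode found at the start of each sequence.

    records  : iterable of (header, sequence, quality) tuples
    barcodes : dict mapping barcode -> sample name (assumed layout)
    Reads without a known barcode are collected under "unassigned".
    """
    by_sample = defaultdict(list)
    for header, seq, qual in records:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                n = len(barcode)
                # Trim the barcode from both the sequence and the quality string.
                by_sample[sample].append((header, seq[n:], qual[n:]))
                break
        else:
            by_sample["unassigned"].append((header, seq, qual))
    return dict(by_sample)


# Toy input: three reads, two known barcodes (all values are made up).
reads = [
    "@read1", "ACGTTTGGCC", "+", "IIIIIIIIII",
    "@read2", "TGCAAACCGG", "+", "IIIIIIIIII",
    "@read3", "GGGGAACCTT", "+", "IIIIIIIIII",
]
barcodes = {"ACGT": "sampleA", "TGCA": "sampleB"}
result = demultiplex(parse_fastq(reads), barcodes)
```

Each key of `result` would then be written to its own FASTQ file. Real data usually calls for mismatch tolerance and paired-read handling, which this sketch leaves out.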

exactMatching

Protocol. We often want to find exact matches between sequences in two FASTA files. When the files are large, we don't always need or want the full BLAST algorithm. This Perl program is fast, light, and easy to use.
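
To show why exact matching can be so much lighter than BLAST, here is a Python sketch of the general approach: index one file by sequence, then look up every sequence from the other. It is an illustration only, not a port of the Perl program, and the function names and the choice to ignore case are assumptions.

```python
def read_fasta(lines):
    """Return {header: sequence}, joining wrapped sequence lines."""
    parts = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            # Upper-case so that matching ignores case (an assumption).
            parts[header].append(line.upper())
    return {h: "".join(chunks) for h, chunks in parts.items()}


def exact_matches(query, subject):
    """Return (query_header, subject_header) pairs with identical sequences."""
    index = {}
    for header, seq in subject.items():
        index.setdefault(seq, []).append(header)
    # Each query costs one dictionary lookup, so no alignment is needed.
    return [(q, s) for q, seq in query.items() for s in index.get(seq, [])]


query = read_fasta([">q1", "ACGT", "ACGT", ">q2", "TTTT"])
subject = read_fasta([">s1", "acgtacgt", ">s2", "GGGG"])
pairs = exact_matches(query, subject)
```

With real files, pass open file handles to `read_fasta` instead of lists of strings.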

Cutadapt

Protocol. How to trim primer sequences from reads generated by Illumina. Includes primers for several common targets:

  1. microbial V3-V4 16S rRNA
  2. microbial V4 16S rRNA
  3. bacterial V4-V5 16S rRNA
  4. microbial V1-V9 16S rRNA
  5. microbial V1-ITS 16S rRNA
  6. fungal 18S rRNA
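
Primers for these targets are often degenerate, so trimming has to match IUPAC ambiguity codes. The Python sketch below shows that matching step only. It is not Cutadapt and does not reproduce its error-tolerant alignment. The primer and read are made up for the example, and the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def trim_forward_primer(read, primer):
    """Remove the primer and everything before it.

    Returns None when the primer is not found. No mismatches are allowed,
    which is a simplification.
    """
    match = primer_regex(primer).search(read)
    if match is None:
        return None
    return read[match.end():]


primer = "GTGYCAGCMG"        # made-up degenerate primer
read = "AAGTGCCAGCAGTTTACG"  # made-up read containing one primer variant
trimmed = trim_forward_primer(read, primer)
```

For real data, use the primer sequences given in the protocol and let Cutadapt handle mismatches, reverse primers, and paired reads.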

DADA2

  1. filterAndTrim_bigData.R. At the filterAndTrim step, process samples in groups, one group at a time, instead of all samples simultaneously. Saves time and computing power, and avoids crashes and headaches.

  2. merge_ASV_tables.R. Helpful when you have many ASV tables from DADA2 and want to merge them by unique FASTA sequences.
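
Both ideas are simple enough to sketch outside of R. The Python below is an illustration, not a translation of the scripts above: `batches` splits a sample list into groups, and `merge_asv_tables` combines tables keyed by ASV sequence, filling in zeros for samples where an ASV was not observed. The data layout (one dictionary of sample counts per sequence) and all names are assumptions.

```python
def batches(items, size):
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_asv_tables(tables):
    """Merge ASV tables on their sequences.

    tables : list of dicts shaped like {sequence: {sample: count}}
    Returns one dict of the same shape that covers every sample,
    with 0 where an ASV was not observed.
    """
    samples = []
    for table in tables:
        for counts in table.values():
            for sample in counts:
                if sample not in samples:
                    samples.append(sample)

    merged = {}
    for table in tables:
        for seq, counts in table.items():
            row = merged.setdefault(seq, dict.fromkeys(samples, 0))
            for sample, count in counts.items():
                row[sample] += count
    return merged


# Two toy tables from separate runs (all values are made up).
run1 = {"ACGT": {"s1": 10, "s2": 0}, "TTGA": {"s1": 3, "s2": 7}}
run2 = {"ACGT": {"s3": 5}, "GGCC": {"s3": 2}}
merged = merge_asv_tables([run1, run2])
groups = list(batches(["s1", "s2", "s3", "s4", "s5"], 2))
```

Merging on the sequence itself works because ASVs are exact sequences, so the same string in two tables refers to the same variant.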

NCBI

  1. removeLineBreakFASTA.sh. Contigs downloaded from NCBI contain line breaks at 800 bp. This script removes them.

  2. downloadMultipleSRA_series.sh. Download multiple files from Sequence Read Archive. Use when you're interested in runs that are named as a series of numbers, which is typical for BioProjects (e.g., runs in project PRJNA597057 range from SRR10755563 to SRR10755886).

  3. downloadMultipleSRA_text.sh. Download multiple files from Sequence Read Archive. Use when you're interested in runs that are not named in a series. Create a text file called "runs.txt" with all desired runs.

  4. ncbiTaxDB_scrape.sh. Given a list of NCBI IDs, scrape the taxonomy database webpage associated with each one, keeping only the taxonomy paths (Kingdom, Phylum, etc.) in the resulting file.

  5. ncbiAssemblyDB_scrape.sh. Same idea, but here we scrape the NCBI Assembly database for associated BioSamples.
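
Two of the steps above are plain text manipulation and can be sketched in Python. These are illustrations rather than the shell scripts themselves, and the function names are hypothetical. `sra_series` expands a first and last accession into the full list of runs, and `unwrap_fasta` joins each record's sequence onto a single line. The sketch does no downloading.

```python
def sra_series(first, last):
    """List run accessions from `first` to `last`, inclusive."""
    prefix = first.rstrip("0123456789")
    if not last.startswith(prefix):
        raise ValueError("accessions must share the same prefix")
    width = len(first) - len(prefix)
    start, end = int(first[len(prefix):]), int(last[len(prefix):])
    # Zero-pad to the original width so the accession format is preserved.
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


def unwrap_fasta(lines):
    """Yield FASTA lines with each sequence joined onto one line."""
    seq = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if seq:
                yield "".join(seq)
                seq = []
            yield line
        elif line:
            seq.append(line)
    if seq:
        yield "".join(seq)


# The run range quoted above for project PRJNA597057.
runs = sra_series("SRR10755563", "SRR10755886")
unwrapped = list(unwrap_fasta([">c1\n", "ACGT\n", "TTGG\n", ">c2\n", "GGCC\n"]))
```

Writing `runs` to a file, one accession per line, gives the kind of list that a "runs.txt" workflow expects.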

navigateFASTQ-A

  1. catFASTQ.sh. Concatenate FASTQ files with identical names. Its original purpose was to combine files from two sequencing runs (on full and nano Illumina flow cells) on the same samples.

  2. calculateRPKM.py. Count the number of bases in a FASTA file and convert to reads per kilobase per million mapped reads (RPKM), a metric used in metatranscriptomics.

  3. subsetFASTQ.sh. Subset a large FASTQ into smaller ones. Was helpful when learning error rates on a large dataset in DADA2.

  4. fastaToCSV.sh. Have a FASTA file? Want to work with it in Excel or R? Use this. The result is a spreadsheet with two columns, "Headers" and "FASTA."

  5. splitFA. Perl program that splits large FASTAs into a user-specified number of smaller FASTAs.
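
As a rough guide to what two of these helpers compute, here is a Python sketch. It assumes the standard RPKM definition (reads mapped to a feature, divided by the feature length in kilobases and by the total mapped reads in millions) and the two-column layout described for fastaToCSV.sh. It is not the code from the repository, and the function names are hypothetical.

```python
import csv
import io


def rpkm(read_count, length_bp, total_reads):
    """Reads per kilobase per million mapped reads.

    Equivalent to read_count / (length_bp / 1e3) / (total_reads / 1e6).
    """
    return read_count * 1e9 / (length_bp * total_reads)


def fasta_to_csv(lines, out):
    """Write one CSV row per FASTA record under "Headers" and "FASTA"."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Headers", "FASTA"])
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                writer.writerow([header, "".join(seq)])
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        writer.writerow([header, "".join(seq)])


# Toy values: 500 reads on a 2,000 bp gene, 10 million mapped reads in total.
value = rpkm(500, 2000, 10_000_000)

buffer = io.StringIO()
fasta_to_csv([">a desc", "ACGT", "TT", ">b", "GG"], buffer)
table = buffer.getvalue()
```

Passing an open file instead of the `io.StringIO` buffer writes the table to disk, where Excel or R can read it.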

toolsAndPipelines

  1. rgiFASTA.sh. Mine antibiotic resistance genes (ARGs) from FASTAs in a directory with CARD's Resistance Gene Identifier (RGI).

  2. deepARG_organize.R. Load and organize results from the deepARG online tool.

  3. metaxa2_[fastq/fasta].sh. Assess taxonomy in assembled or unassembled metagenomes with Metaxa2.

  4. integronFinder.sh. Mine integron sequences from contigs with Integron Finder.

  5. mobileOG-db.sh. Mine mobile genetic elements from the mobileOG database.

Languages

  • Shell 49.1%
  • Python 23.9%
  • R 18.9%
  • Perl 8.1%