4CAC is a four-class classifier designed to classify contigs in mixed metagenome assemblies as phages, plasmids, prokaryotes (bacteria and archaea), microeukaryote (fungi and protozoa) or uncertain. 4CAC generates an initial four-way classification using XGBoost algorithms and further improves the classification using the assembly graph.
4CAC requires python 3.7 or above to run. You will need the following python dependencies to run 4CAC and related support scripts.
- scikit-learn
- xgboost
To run 4CAC, please download the respository as follows.
git clone https://github.com/Shamir-Lab/4CAC.git
cd 4CAC
4CAC requires the following input files:
(1) Contig file in "fasta" format: a set of contigs to be classified.
(2) Assembly grah file in "gfa" format: the assembly graph generated by metaSPAdes or metaFlye when assembling reads to generate the input contigs.
(3) A path file has path information for each contig, such as scaffolds.path
in metaSPAdes assembly and assembly_info.txt
in metaFlye assembly.
For contigs assembled from short reads by metaSPAdes, files scaffolds.fasta
, assembly_graph_with_scaffolds.gfa
, and scaffolds.path
can be used as input.
For contigs assembled from long reads by metaFlye, files assembly.fasta
, assembly_graph.gfa
, assembly_info.txt
can be used as input.
4CAC generates an initial four-way classification using XGBoost algorithms.
python classify_xgb.py -f <contig file>
The command line options for this script are:
-f/--fasta
: The contig file to be classified
-o/--outfile
: The name of the output file. If not specified, <input filename>.probs_xgb_4class.out
-p/--num_processes
: The number of processes to use. Default=8
4CAC further improves the XGBoost classification using the assembly graph.
python classify_4CAC.py --assembler <metaSPAdes/metaFlye> --asmdir <output directory of assembly graph and path file>
The command line options for this script are:
--assembler
: The assembler used to generate the contig file. metaSPAdes or metaFlye
--asmdir
: The directory of the assembly graph and the path file.
--classdir
: The directory of the XGBoost four-class classification file. If not specified, default = --asmdir
--outdir
: The output directory. If not specified, default = --asmdir
A small test dataset could be found under the test_data
folder. Note that the ground_truth_class.fasta
file under test_data is not necessary for running 4CAC. It can be used to compare the XGBoost classification and 4CAC classification with the ground truth if users are interested.
In case of any questions or suggestions please feel free to contact lianrong.pu@gmail.com