Benchmarking scripts built primarily to test our inverse weighted Hamming distance approach to Neighbor Joining.
To generate the plot, run the following commands:
from src.plot_stressor_regimes import plot_stressor_regimes
plot_stressor_regimes()
Alternatively, this code is available in plot.ipynb.
- To change parameters, pass them as arguments to plot_stressor_regimes().
- The implementation currently solves each tree at runtime, but there is an option to use pre-calculated scores (RF and triplets correct) to replicate the plot previously made by deprecated code. To do this, open src/plot_stressor_regimes.py and, in the plot_stressor_regimes function, make the following changes:
  - Comment out the section named "Uncomment to Run Cascade"
  - Uncomment the section named "Uncomment to Use Cached Scores"
- The implementation currently uses pre-simulated ground-truth trees on Richard's account. To generate your own trees, open src/benchmark.py and override the get_gt_tree function (and enable caching).
- To add a new solver, make the following changes:
  - Open src/benchmark.py and, under get_solver_by_name(), add the solver name and instance to the elif cascade.
  - When calling plot_stressor_regimes(), add the solver name to the solver_names list and its corresponding color to solver_plot_params.
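As a rough illustration of the elif cascade mentioned above, here is a minimal sketch. The solver classes and names (GreedySolver, MyNewSolver, "greedy", "my_new_solver") are hypothetical stand-ins, and the shape of solver_plot_params is an assumption; the real get_solver_by_name() in src/benchmark.py may look different.

```python
class GreedySolver:
    """Hypothetical stand-in for a solver that is already registered."""


class MyNewSolver:
    """Hypothetical stand-in for the solver being added."""


def get_solver_by_name(solver_name):
    """Return a solver instance for the given name (illustrative only)."""
    if solver_name == "greedy":
        return GreedySolver()
    elif solver_name == "my_new_solver":  # new branch for the new solver
        return MyNewSolver()
    else:
        raise ValueError(f"Unknown solver name: {solver_name}")


# Assumed shapes: a list of solver names and a dict of per-solver plot options.
solver_names = ["greedy", "my_new_solver"]
solver_plot_params = {
    "greedy": {"color": "tab:blue"},
    "my_new_solver": {"color": "tab:orange"},
}
```

Raising on an unknown name, rather than returning None, makes a typo in solver_names fail immediately instead of partway through a benchmark run.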
Located in solver_benchmarking_single/, the Solver Benchmarking.ipynb notebook runs low-throughput benchmarks of a single solver on a single set of trees. Included is a BenchmarkModule class that manages the various input/output files while allowing heavy user modification of specific elements (solvers, character matrices, custom distance functions, etc.) through subclassing.
Note: This code is not written with the caching decorator, but it already has a caching mechanism built in.
Located in the folder solver_benchmarking_whole_regime/, the scripts allow the reconstruction, scoring, and plotting of multiple algorithms (solvers), stressors, and parameters simultaneously. The usages are as follows:
Given ground-truth trees with specified fitness regimes, cell counts, stressors, and priors, use a custom solver to reconstruct each tree from its character matrix alone. The output is one file per tree, containing the topology in Newick format.
To set up this section, make these changes in 1-reconstruct_new.py:
- Ensure the directories for ground-truth trees and new reconstructed trees are correct.
- Under "GT Tree Params", fill in the conditions to run the solver through, while ensuring each condition is accompanied by a corresponding set of ground-truth trees.
- Under "Recon Tree Params", fill in the priors type(s) to use and algorithm(s) to run.
- The script can be run with python 1-reconstruct_new.py -t <t>, where t is the array ID to use. The array ID determines which combination of conditions and tree number is fed into the solver.
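One way such an array ID could be decoded into a condition and a tree number is sketched below. This is an assumption about the scheme, not the script's actual logic; the condition values, trees_per_condition, and decode_array_id are all hypothetical.

```python
from itertools import product

# Hypothetical condition values; the real ones are set under "GT Tree Params".
fitness_regimes = ["no_fit", "high_fit"]
stressors = ["char10", "char40"]
trees_per_condition = 50  # assumed number of ground-truth trees per condition

# Every combination of conditions, in a fixed and reproducible order.
conditions = list(product(fitness_regimes, stressors))


def decode_array_id(t):
    """Map a flat array ID to (condition, tree_number)."""
    condition_index, tree_number = divmod(t, trees_per_condition)
    if not 0 <= condition_index < len(conditions):
        raise IndexError(f"Array ID {t} is out of range")
    return conditions[condition_index], tree_number
```

With these assumed values, IDs 0 to 49 cover the first condition, 50 to 99 the second, and so on, so each SLURM task handles exactly one tree.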
Then apply these changes to 1-reconstruct_new.sbatch.sh:
- At line 15, make sure the conda environment name is correct.
- At line 28, decide whether to enable caching. If it is enabled, existing files will not be overwritten.
To run the script, do the following:
sbatch 1-reconstruct_new.sbatch.sh <ARRAY_OFFSET>
where ARRAY_OFFSET is the number to add to ARRAY_ID, since the server limits ARRAY_ID values to at most 999. To run, for example, 2100 trees, you would run the command three times, with ARRAY_OFFSET values of 0, 1000, and 2000 (and caching enabled).
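The offsets for a given number of trees can be worked out as in the sketch below. It assumes array IDs start at 0, so each submission covers 1000 trees; array_offsets is a hypothetical helper, not part of the repository.

```python
MAX_ARRAY_ID = 999  # largest array ID the server accepts, per the text above


def array_offsets(num_trees, batch_size=MAX_ARRAY_ID + 1):
    """Return the ARRAY_OFFSET value for each sbatch submission."""
    return list(range(0, num_trees, batch_size))


for offset in array_offsets(2100):
    # Print the command to run for each batch of up to 1000 trees.
    print(f"sbatch 1-reconstruct_new.sbatch.sh {offset}")
```

For 2100 trees this prints three commands, with offsets 0, 1000, and 2000, matching the example above.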
N/A
This section is for generating trees and doing distance analysis. It does not use the caching decorator yet. The pipeline involves four steps:
1. simulate_topologies.ipynb: Simulate tree topologies given a set of tree parameters.
2. simulate_trees: From each topology, create a set of trees given a different set of tree parameters.
   - 2-simulate_trees.sbatch.sh: Run this script to use SLURM to parallelize the tree generation.
   - 2-simulate_trees.solo.py: Contains the function to generate one tree under a single set of tree parameters, given an array ID -t. This is used in the sbatch script above.
   - 2-simulate_trees.all.py (Deprecated): A runnable script to generate all the trees without parallelization.
3. compute_distance: A parallelization setup to calculate the true distance and the weighted Hamming distance for every pair of leaves in every tree in the specified dataset.
   - 3-compute_distances.sbatch.sh: Run this script to use SLURM to parallelize the distance calculation.
   - 3-compute_distances.py: Contains the function to compute the distances of one tree, given an array ID -t.
   - 3-compute_distances.zen.sh (Deprecated): An alternative parallelization script using zen instead of sbatch.
4. distance_analysis.ipynb: Analyze the distance data and generate plots.
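To make the per-pair distance computation concrete, here is a minimal sketch of a weighted Hamming distance evaluated over every pair of leaves. The weighting scheme used by this project is not described here, so the per-character weights below are purely hypothetical, as are the function names and the toy character matrix.

```python
from itertools import combinations


def weighted_hamming(a, b, weights=None):
    """Sum the weights of the positions where two character rows differ."""
    if len(a) != len(b):
        raise ValueError("Character rows must have the same length")
    if weights is None:
        weights = [1.0] * len(a)  # unweighted Hamming distance by default
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def pairwise_distances(character_matrix, weights=None):
    """Compute the distance for every unordered pair of leaves."""
    return {
        (u, v): weighted_hamming(character_matrix[u], character_matrix[v], weights)
        for u, v in combinations(sorted(character_matrix), 2)
    }


# Toy character matrix: leaf name -> character states (hypothetical data).
character_matrix = {
    "leaf_a": [0, 1, 2],
    "leaf_b": [0, 1, 0],
    "leaf_c": [3, 1, 0],
}
distances = pairwise_distances(character_matrix, weights=[1.0, 1.0, 0.5])
```

Sorting the leaf names before pairing keeps the dictionary keys deterministic, which helps when results from separately run array tasks are merged later.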
The dataset parameters are collected in config.json and shared across the scripts.