Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pruned FastTree graphic #250

Closed
lauradoepker opened this issue Sep 21, 2018 · 12 comments
Closed

Add pruned FastTree graphic #250

lauradoepker opened this issue Sep 21, 2018 · 12 comments

Comments

@lauradoepker
Copy link

It's helpful to view this FastTree phylogeny that depicts which sequences were chosen for the prune.py step for lineage analyses. This would also be helpful for the minadcl pruning too. I've pasted @dunleavy005 's SCons code from Ecgtheow that achieves this - I hope I grabbed the right part of the code to show this (?). Example .png of what I'm talking about (red leaves were kept, black leaves were pruned away):

pruned_cluster_fasttree

    @w.add_target()
    def pruned_cluster_seqids(outdir, c):
        pruned_cluster_seqids = env.Command(
            path.join(outdir, "pruned_cluster_seqids.txt"),
            c["cluster_fasttree"],
            "lib/cft/bin/prune.py --naive naive --seed " + options["seed"] + " $SOURCE -n " + str(c["nprune"]["value"]) + " --strategy seed_lineage $TARGET")
        env.Depends(pruned_cluster_seqids, "lib/cft/bin/prune.py")
        return pruned_cluster_seqids

    @w.add_target()
    def pruned_cluster_seqs(outdir, c):
        return env.Command(
            path.join(outdir, "pruned_cluster_seqs.fasta"),
            [c["pruned_cluster_seqids"], c["cluster_seqs"]],
            "seqmagick convert --include-from-file $SOURCES $TARGET")

    @w.add_target()
    def input_seqs(outdir, c):
        return env.Command(
            path.join(outdir, "input_seqs.fasta"),
            c["pruned_cluster_seqs"],
            "awk \'/^[^>]/ {gsub(\"N\", \"-\", $0)} {print}\' < $SOURCE > temp.fasta;" + \
            "seqmagick mogrify --squeeze temp.fasta;" + \
            "awk \'/^[^>]/ {gsub(\"-\", \"N\", $0)} {print}\' < temp.fasta > $TARGET;" + \
            "rm temp.fasta")

    @w.add_target()
    def pruned_cluster_fasttree_png(outdir, c):
        pruned_cluster_fasttree_png = env.Command(
            path.join(outdir, "pruned_cluster_fasttree.png"),
            [c["cluster_fasttree"], c["pruned_cluster_seqids"]],
            "xvfb-run -a python/annotate_fasttree_tree.py $SOURCES --naive naive --seed " + options["seed"] + " --output-path $TARGET")
        env.Depends(pruned_cluster_fasttree_png, "python/annotate_fasttree_tree.py")
        return pruned_cluster_fasttree_png

@lauradoepker
Copy link
Author

In Ecgtheow, I am tweaking the nprune value around manually and checking the FastTrees. In CFT, a variety of nprunes may also be helpful (not just always 100), but am I right in assuming that all the ML trees would have to be pre-computed with just a handful of nprune values?

@metasoarous
Copy link
Member

Right now things are set up in CFT so that different reconstructions can have different prune counts, but there's no "upstream control mechanism" for which counts you might want to perform for a particular clonal family.

We could do it so that when you run the build you specify the prune count, and can therefrom create a separate dataset that has been run with alternate prune counts. But this isn't really optimal for comparisons, and is sort of clunky on the build side having to switch datasets to look at different prune counts.

So the better thing to do (and part of why I organized the way I did around "reconstructions" in the first place) would be to make it so that for a given clonal family and prune strategy, you can have multiple reconstructions. So maybe when you build at the command line, you can specify --prune-counts=100,300 to compare?

On the far end of the spectrum in terms of control, it might be possible to specify somehow in the input yaml file that cft consumes which clonal families should be run at what prune counts. Moving forward in this direction dovetails with getting us to the point where it would be easier for you to select clonal families from the viz and have ASR trees or ecgtheow analyses computed from them.

@metasoarous
Copy link
Member

@lauradoepker Regarding this annotate_fasttree_tree.py script, we should be able to move that into the pipeline easily enough. Just out of curiosity though, how well does this script handle for big trees?

@lauradoepker
Copy link
Author

@metasoarous I'm not sure. We haven't ever had problems with it. Some of my biggest clonal families have a couple thousand members.

@metasoarous
Copy link
Member

Hey @eharkins; Do you want to help with this? Would basically mean plugging something along the lines of the above into the SConstruct.

@eharkins
Copy link
Contributor

Happy to help, sorry for the delay here. Will read more closely and let you know if I have questions tomorrow.

@metasoarous
Copy link
Member

@dunleavy005 @lauradoepker - @eharkins and I are taking a look at this and wondering what all went into the crazy awk step in your code snippet above. Is that something that is needed for your script to function properly or should we be able to safely ignore?

@dunleavy005
Copy link

dunleavy005 commented Sep 28, 2018

lol I don't quite remember, BUT either one of RevBayes/BEAST was trimming off columns with full N's so to make the two programs consistent, we removed sequence positions from the FASTA that had 100% N's in the column. Unfortunately, seqmagick --squeeze only trims off -'s, so the awk nonsense is just converting everything to dashes, trimming, then back to N's.

Basically not needed for you guys :)

P.S. my memory aint so bad!!! fhcrc/seqmagick#72

metasoarous pushed a commit that referenced this issue Oct 8, 2018
* added height calculation for evenly spaces leaves

* floating point division

* adding nt sequences for Olmsted#17

* github.com/matsengrp/olmsted/issues/13

* including multiplicity for Olmsted(11)

* tabs -> spaces

* nt seqs dict using tripl lookup instead of fasta parser

* comment white space

* first try on #250; using ecgtheow script to color pruned nodes

* set prune_count back to default
@eharkins
Copy link
Contributor

eharkins commented Oct 8, 2018

@lauradoepker closing this, it seems to work for cft now with a few changes. Let me know if you have any issues.

@eharkins eharkins closed this as completed Oct 8, 2018
@metasoarous
Copy link
Member

This script is choking on the bigger trees. I've tried to take steps to prevent it from becoming an issue by only having it run on the smaller trees, but for some reason this hasn't been working as I'd hoped. Hopefully I'm able to fix this, but if not, I may have to make it so that these targets wouldn't get built unless you specified a special flag (or some such).

@metasoarous metasoarous reopened this Oct 25, 2018
@lauradoepker
Copy link
Author

Yeah, it fails to render the image (or whatever) when the clonal family is 20,000 or 30,000 members big. Your fix sounds good @metasoarous. cc @dunleavy005

@metasoarous
Copy link
Member

Not sure how I never came back around to close this issue, but it's now the case that if you run scons with the --fasttree-png flag, it should build out the fasttree images when it can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants