Update README.md
justincbagley authored Mar 14, 2017
1 parent 1a9c601 commit 62b39e9
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -37,7 +37,7 @@ The DOI for MissingDataFX v0.1.0, via [Zenodo](https://zenodo.org), is coming so
Missing data is an important consideration in the theory and practice of phylogenetic systematics, as well as the design of phylogenetics studies (Wiens 2003; Wiens et al. 2005; Wiens 2006; Pyron 2011). This software automates exploratory analyses calculating the amount of missing data in phylogenetic datasets (NEXUS character partitions), as well as testing the form and potential significance of relationships between missing characters and phylogenetic tree parameters. Parts of MissingDataFX were inspired by, and recreate, correlational/exploratory analyses of the effects of missing data on phylogenies used previously by Wiens et al. (2005) and Pyron (2011). I wrote MissingDataFX code to help me automate these and related analyses for a recent project on molecular phylogenetics and Bayesian total-evidence dating, as well as the effects of missing data on phylogenetic results, in 'sucker' fishes of the family Catostomidae (Bagley et al., in revision).

The current version of this software focuses on developing **MissingDataFX.sh**, a shell script that looks at potential effects of missing data on phylogenetic analyses in two basic ways: 1) by characterizing the contents of data blocks in a NEXUS input file, including proportions of data versus missing data for each partition, and 2) by performing a customized R analysis. Three main operations are performed in the R environment: a) extracting tree parameters (terminal branch lengths, terminal/subtending branch heights, posterior support) from BEAST or MrBayes trees generated from the input NEXUS; b) using appropriate standard or nonparametric tests for correlations between missing data proportions and the tree parameters; and c) plotting relationships among variables. More details are given below.
The current version of this software focuses on developing **MissingDataFX.sh**, a shell script that looks at potential effects of missing data on phylogenetic analyses in two basic ways: **1)** by characterizing the contents of data blocks in a NEXUS input file, including proportions of data versus missing data for each partition, and **2)** by performing a customized R analysis. **Three main operations are performed in the R environment:** **a)** extracting tree parameters (terminal branch lengths, terminal/subtending branch heights, posterior support) from BEAST or MrBayes trees generated from the input NEXUS; **b)** using appropriate standard or nonparametric tests for correlations between missing data proportions and the tree parameters; and **c)** plotting relationships among variables. More details are given below.
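
The first step, summarizing data versus missing data, can be sketched in the shell roughly as follows. This is only an illustration, not the code in MissingDataFX.sh: the function name `missing_by_taxon` is hypothetical, the input is assumed to be plain `name sequence` matrix rows from a single partition, and both `?` and `-` are assumed to count as missing.

```shell
#!/bin/sh
# Hypothetical helper (not part of MissingDataFX.sh): for each
# "name sequence" row, print taxon, total characters, missing
# characters, and the proportion missing (tab-separated).
# Assumption: '?' and '-' both count as missing data.
missing_by_taxon() {
    awk 'NF == 2 {
        seq = $2
        total = length(seq)
        missing = gsub(/[?-]/, "", seq)   # gsub returns the number of matches replaced
        if (total > 0)
            printf "%s\t%d\t%d\t%.3f\n", $1, total, missing, missing / total
    }'
}

# Example: three taxa with ten characters each.
printf '%s\n' \
    'taxonA ACGT??ACGT' \
    'taxonB ACGTACGTAC' \
    'taxonC ??????--GT' | missing_by_taxon
```

In this toy input, `taxonC` has 8 missing characters out of 10, so its proportion is reported as 0.800. Per-taxon proportions like these are the kind of values that can later be compared against tree parameters for terminal taxa.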

As in the case of the author's software package for phylogenetic and phylogeographic inference, [PIrANHA](https://github.com/justincbagley/PIrANHA), the MissingDataFX package is fully command line-based and is available as open-source software according to the license.

@@ -97,7 +97,7 @@ For simplicity, all filenames supplied to MissingDataFX for a given analysis sho
We use this naming convention so that the tree filenames can be linked to the original NEXUS input file(s) without conflicting with the NEXUS filenames used in other procedures used in the shell script.

### What happens in R?
Following several steps summarizing the NEXUS input file using operations in the shell, **MissingDataFX.sh** creates and runs a customized R script that loads the 'Interfaces to Phylogenetic Software in R' or 'ips' R package (plus my fixed version of one of its functions) and related phylogenetics packages, and then reads in the tree, plots the tree (and saves to file), extracts the node and branch length annotations into a matrix (also saved to file). Next, [Shapiro-Wilk tests](https://en.wikipedia.org/wiki/Shapiro–Wilk_test) are conducted to evaluate whether the log-transformed data and tree parameters meet normality criteria for subsequent analyses. Then MissingDataFX tests for correlations between the proportions of data or missing data and (y-axis/independent var.:) 1) posterior support for terminal taxa and 2) length of terminal branch (same as height estimate for terminal taxon tmrca/node, in case of BEAST chronograms). If the data are normal (Shapiro-Wilk tests were non-significant at the alpha = 0.05 level), regular [Pearson correlations](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [generalized linear modeling](https://en.wikipedia.org/wiki/Generalized_linear_model) analyses are conducted; however, if the data are non-normal (Shapiro-Wilk tests were significant), then correlations are conducted using nonparametric [Spearman's rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) and linear relationships among log-transformed variables are plotted, but not given trendlines. Results are output to file in text or PDF (graphics) files.
Following several steps summarizing the NEXUS input file using operations in the shell, **MissingDataFX.sh** creates and runs a customized R script that loads the 'Interfaces to Phylogenetic Software in R' or 'ips' R package (plus my fixed version of one of its functions) and related phylogenetics packages, and then reads in the tree, plots the tree (and saves to file), extracts the node and branch length annotations into a matrix (also saved to file). Next, [Shapiro-Wilk tests](https://en.wikipedia.org/wiki/Shapiro–Wilk_test) are conducted to evaluate whether the log-transformed data and tree parameters meet normality criteria for subsequent analyses. Then MissingDataFX tests for correlations between the proportions of data or missing data and (y-axis/independent var.:) posterior support for terminal taxa and length of terminal branch (same as height estimate for terminal taxon tmrca/node, in case of BEAST chronograms). If the data are *normal* (Shapiro-Wilk tests were non-significant at the alpha = 0.05 level), regular [Pearson correlations](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [generalized linear modeling](https://en.wikipedia.org/wiki/Generalized_linear_model) analyses are conducted. However, if the data are *non-normal* (Shapiro-Wilk tests were significant), then correlations are conducted using nonparametric [Spearman's rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) and linear relationships among log-transformed variables are plotted, but not given trendlines. Results are output to file in text or PDF (graphics) files.
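
The test-selection logic described above can be sketched as a shell function that writes a small R script, in the same spirit as the script creating and running its own R code. Everything named here is an assumption for illustration: the function `write_corr_script`, the file names `corr_test.R` and `params.tsv`, and the column names `prop_missing` and `branch_length` are hypothetical, and the actual R code generated by MissingDataFX.sh may differ. The R calls used (`shapiro.test`, `cor.test`, `glm`) are standard functions from base R's stats package.

```shell
#!/bin/sh
# Hypothetical sketch: write an R script that picks Pearson or Spearman
# correlation based on Shapiro-Wilk tests of the log-transformed variables.
# The quoted 'EOF' delimiter keeps the shell from expanding '$' in the R code.
write_corr_script() {
    cat > "$1" <<'EOF'
# Assumed input: tab-separated table with columns prop_missing and
# branch_length, one row per terminal taxon, all values strictly positive
# (log of zero is not finite) and at least 3 rows (shapiro.test minimum).
args <- commandArgs(trailingOnly = TRUE)
dat <- read.table(args[1], header = TRUE, sep = "\t")
x <- log(dat$prop_missing)
y <- log(dat$branch_length)
alpha <- 0.05
# Treat the data as normal only if both Shapiro-Wilk tests are non-significant.
normal <- shapiro.test(x)$p.value > alpha && shapiro.test(y)$p.value > alpha
method <- if (normal) "pearson" else "spearman"
print(cor.test(x, y, method = method))
# Fit a (Gaussian) generalized linear model only in the normal case.
if (normal) print(summary(glm(y ~ x)))
EOF
}

write_corr_script corr_test.R

# Run only if R is installed and the hypothetical input table exists.
if command -v Rscript >/dev/null 2>&1 && [ -f params.tsv ]; then
    Rscript corr_test.R params.tsv
fi
```

Keeping the decision in a single `method` variable means the same `cor.test` call serves both branches, and only the model fit depends on the normality result.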

### Usage
It's so easy to use, the MissingDataFX.sh script doesn't display any sophisticated Usage or help flag info yet, because it doesn't need to. Everything you need to know is given here in the README! Assuming that you have installed the dependencies and the repo and followed guidelines for input files above, you may run MissingDataFX by simply entering the script name at the command line and hitting return!