Running stability selection on your data

jklynch edited this page Sep 25, 2016 · 2 revisions

Running hominid_stability_selection on your data

What does hominid_stability_selection do?

Once you have found the SNPs that have a significant correlation with microbiome abundances, you need to determine which taxa (or other covariates) most consistently and robustly contribute to the correlation.

Lasso regression is sensitive to small variations in the covariates, so it is common to use a resampling method such as stability selection to choose relevant covariates. (For more details, see the supplemental methods.)

hominid_stability_selection runs on a single processor.

hominid_stability_selection skips SNPs whose 95% confidence interval for R² includes zero.

hominid_stability_selection command-line arguments

The command-line arguments are all required and are expected in this order:

  1. Output file from hominid, with unprocessed SNPs removed.
    • SNPs that were not processed by hominid are those that have "NA" in columns 7 to 21 (starting with column "rsq_mean" and ending with "cv_kurtosis"). Delete the rows corresponding to these SNPs.
  2. OTU/taxon table: Use the same input file that was used with hominid.
  3. Output file name
  4. The lowest α coefficient: During stability selection, the Lasso tuning parameter, α, is varied between 0.3 αmax and αmax. You can change the range to, say, 0.5 αmax and αmax by setting this argument to 0.5.
  5. Transformation of the input abundance data: Use the same value as was used in hominid.
  6. Number of SNPs to test: To run on all input SNPs, set this value to -1.
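
As a rough illustration of argument 4, the sketch below builds a grid of α values between a chosen fraction of αmax and αmax. The function name alpha_range, the n_alphas parameter, and the even spacing are assumptions made for illustration only; they are not taken from the hominid source, which may choose its α values differently.

```python
def alpha_range(alpha_max, lowest_fraction, n_alphas=5):
    """Return n_alphas evenly spaced values from lowest_fraction * alpha_max to alpha_max.

    Hypothetical helper: the grid size and spacing are assumptions, not hominid's.
    """
    if not 0.0 < lowest_fraction <= 1.0:
        raise ValueError("lowest_fraction must be in (0, 1]")
    if n_alphas < 2:
        return [alpha_max]
    low = lowest_fraction * alpha_max
    step = (alpha_max - low) / (n_alphas - 1)
    return [low + i * step for i in range(n_alphas)]


# Setting argument 4 to 0.5 corresponds to a range of 0.5 * alpha_max to alpha_max.
print(alpha_range(alpha_max=2.0, lowest_fraction=0.5, n_alphas=3))
```

A larger value for argument 4 narrows the range toward αmax, so only stronger regularization is explored.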

For a sample hominid_stability_selection command, see test_stability_selection.sh.
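
Before running the command, the row removal described for argument 1 can be scripted rather than done by hand. The following is a minimal sketch that assumes the hominid output is tab-delimited with a header row containing the column names "rsq_mean" and "cv_kurtosis", and that an unprocessed SNP has "NA" in every column of that range. The function name drop_unprocessed_snps and the file names are hypothetical.

```python
import csv


def drop_unprocessed_snps(in_path, out_path, first_col="rsq_mean",
                          last_col="cv_kurtosis", delimiter="\t"):
    """Copy in_path to out_path, skipping rows that are "NA" from first_col to last_col.

    Assumes a delimited text file with a header row (an assumption about the
    hominid output format). Returns (rows_kept, rows_dropped).
    """
    kept = dropped = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
        header = next(reader)
        start = header.index(first_col)      # column 7 in the description above
        stop = header.index(last_col) + 1    # column 21, inclusive
        writer.writerow(header)
        for row in reader:
            if not row:
                continue  # ignore blank lines
            if all(value == "NA" for value in row[start:stop]):
                dropped += 1
            else:
                writer.writerow(row)
                kept += 1
    return kept, dropped


# Example call with hypothetical file names:
# kept, dropped = drop_unprocessed_snps("hominid_output.txt", "hominid_output_filtered.txt")
```

Locating the columns by header name rather than by position keeps the sketch working if the column order differs from what is described here.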

Output file format

hominid_stability_selection takes the file produced by hominid and appends columns on the right side of the table. Each added column is named for an input taxon/covariate, and its values are the stability scores for that taxon/covariate.
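
One way to inspect the result is to list the highest-scoring taxa for each row. The sketch below assumes the output file is tab-delimited with a header row, and that "cv_kurtosis" is the last column written by hominid, so every column after it holds stability scores. Both points are assumptions about the file layout, and the function name top_stable_taxa is hypothetical.

```python
import csv


def top_stable_taxa(path, last_hominid_col="cv_kurtosis", n_top=3, delimiter="\t"):
    """Return, for each data row, the n_top (taxon, score) pairs with the highest scores.

    Assumes every column after last_hominid_col is a stability score column.
    """
    results = []
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        first_score = header.index(last_hominid_col) + 1
        taxa = header[first_score:]
        for row in reader:
            scores = []
            for taxon, value in zip(taxa, row[first_score:]):
                try:
                    scores.append((taxon, float(value)))
                except ValueError:
                    continue  # skip non-numeric fields such as "NA"
            scores.sort(key=lambda pair: pair[1], reverse=True)
            results.append(scores[:n_top])
    return results


# Example call with a hypothetical file name:
# for ranked in top_stable_taxa("stability_selection_output.txt"):
#     print(ranked)
```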