Efficient multivariate correlation measure / high contrast subspace miner. This is an implementation of the HiCS algorithm presented in our paper:
HiCS: High Contrast Subspaces for Density-Based Outlier Ranking.
Note: This is not the implementation used in the original publication. This implementation is currently work-in-progress. The implementation does not include the outlier detection component, i.e., it implements only the subspace search step.
HiCS can either be used as a Nim library or via a command line frontend. As a convenience for Windows users I have prepared ready-to-use binaries of the command line frontend. There is both a 32 bit and a 64 bit version. If you are running Linux, compiling from source should probably not be a big problem -- see instructions below.
Basically, HiCS takes a CSV file as input, and will return the subspace search results in an output CSV.
Usage:
HiCS [options] --csvIn <FILE>
Options:
--csvIn <FILE> File name of CSV input data (requires delimiter: ';')
--csvOut <FILE> File name of subspace mining output data
--hasHeader Specifies if input CSV has a header row
--numRuns <INT> Number of Monte Carlo iterations to perform (default: 100)
--numCandidates <INT> Number of candidates in subspace beam search (default: 500)
--maxOutputSpaces <INT> Maximum number of subspaces in the final output (default: 1000)
--alpha <DOUBLE> Size of the test statistic specified as a fraction (default 0.1)
--onlySubspace <INT,INT,...>] Using this command will not perform a subspace search.
Instead, HiCS will only compute the contrast of the specified
subspace. Note: The list of INTs must not contain whitespace and
indexing starts at zero. For example: 0,3,12,51
-h, --help Print this help message.
-v, --version Print version information.
-s, --silent Disables debug output on stdout.
There is also a mode in which HiCS does not perform a complete subspace search, but instead quanfifies the contrast in a given subspace. This mode is enabled when using the --onlySubspace
command line option.
In order to compile from source, you need Nim and its package manager Nimble. The required steps are:
- Install Nim (ideally by cloning and building the devel version of Nim + adding the
nim
binary to yourPATH
) - Install Nimble (ideally by cloning and building Nimble + adding the
nimble
binary to yourPATH
) - Clone this repository and run
nimble build
If you are running Windows you can also install the latest stable version of Nim (which should also include Nimble). I try to maintain this code in a way that it always compiles with both the latest stable and devel versions of Ni.m
Apart from its goal as a subspace search technique, HiCS can be seen as a multivariate correlation measure. One of its most important features is that its "contrast notion" is defined consistently for variable sets of any cardinality >2, i.e., in general its focus is on the case of multivariate spaces. However, it is also interesting to look at the simple, bivariate case to illustrate its properties. The following examples show typical bivarite relationships, including a comparison to other correlation analysis approaches: Pearson and Spearman correlation coefficients, Mutual Information (MI), and the Maximal Information Coefficient (MIC).
First some perfect functional dependencies with a 1:1 mapping between X and Y:
An interesting property of HiCS is that it is sensitive to the multiplicity in the mapping from X to Y (and vice versa). For instance here we have cases where 1 value can map to 2 values in the other domain. This is reflected in a lower correlation result. It is also possible to differentiate between mappings of multiplicities 1:2, 1:3, etc.
This is also visible in oscillating functional dependencies...
... step functions ...
... or other distribution like these:
Some more noiseless manifolds:
And now some noisy distributions:
And finally some fully uncorrelated distribution. Note that MIC has a tendency to overestimate in such cases and may fail to detect independence: