-
Notifications
You must be signed in to change notification settings - Fork 6
Elastic principal graph algorithm
Here the most concise description of the ElPiGraph algorithm is given. Read this text for a more complete still minimal description necessary to understand the algorithm in details.
X is the data matrix. Rows correspond to data points, columns to variables.
N is number of points.
M is dimension of data vectors.
R is robustness radius (finite robustness radius corresponds to the local robust version of the principal graphs, infinite value corresponds to the standard global version)
EP is the penalty coeffiecient for the total sum of squared graph edge lengths
ER is the penalty coefficient for the deviation from the harmonic graph embedding
Accuracy is required accuracy of approximation to stop
maximumNumberOfNodes is the maximum number of nodes in the graph after which the algorithm should stop.
Settings for the graph grammar set
A set of graph grammars must be defined, each of which contains a set of graph grammar operations. Each operation is a transformation of the graph, which can be parametrized by the current graph embedding G and X.
For constructing elastic principal trees, one grammar "Grow" or two grammars "Grow" and "Shrink" can be used.
Grow grammar consists of two operations:
- “Add a node and an edge to an existing node”
- “Add a node by bisecting an existing edge”.
Shrink grammar consists of two operations:
- “Remove a hanging node”
- “Merge two nodes in one and delete the connecting edge”.
GrammarSet is a list of grammars
For the principal tree it contains
nGrow instances of "Grow" grammar (default - 1-tree, 2-treeWithPruning )
nShrink instances of "Shrink" grammar at each step (default - 0-tree, 1-treeWithPruning).
Nodes is a matrix with M columns and nNodes rows. nNodes is the number of nodes which changes (grows) when algorithm proceeds.
Links is an array with nNodes elements each of which is an array with indices of nodes which are linked with current node. It means that if exists such a j that Links(i)(j)=k then nodes with numbers i and k are linked.
Taxons is an array with nNodes elements each of which is an array with indices of data points which are associated to the graph node (e.g., by proximity).
- Initializing the current graph embedding G
Choice between several strategies is possible, among them are
1.1. Some pre-defined graph configuration
1.2. Initialize with 2 nodes and one edge
1.2.1. Two close data points within the most dense region of the data space
1.2.2. Two nodes positioned on a fragment of the first principal component
1.2.3. Two nodes connecting the mean point of the dataset and the closest to the mean data point
After choosing the intial configuration the graph embedding G is optimized.
2.Main loop of the growing principal graph.
2.1. Step <- 1
2.2. For each grammar Grammar from the GrammarSet
2.2.1. Apply all possible transformations of G using all possible operations from Grammar to the current graph embedding G. This gives a set of k tentative graph embeddings {GTentative}.
2.2.2. Optimize all k tentative graph embeddings
2.2.3. Select the one GTentativeOptimal from {GTentative} having the minimum energy value Energy(GTentativeOptimal)=MSE+elastic energy penalty -> min among all {GTentative}.
2.2.4. Assign G<-GTentativeOptimal
2.3. Step <- Step+1
2.4. Calculate fraction of unexplained variance: FVU = fvu(Nodes,Links,X).
2.5. If FVU<Accuracy or Number of Nodes>=maximumNumberOfNodes then stop algorithm.