Skip to content

Human Genome Diversity Project, visualizing SNP profiles of human genomes

Andrei Zinovyev edited this page Nov 6, 2017 · 14 revisions

For this example we use already reduced by Principal Component Analysis data from HGDP containig information on 1043 individuals from 53 historically native population from 7 large geographical regions (Africa, Near East, Europe, South Central Asia, East Asia, America, Oceania).

The objective of this example is to show application of a hybrid learning of an elastic principal tree in several epochs, such that the result of learning in one epoch serves as initialization of the graph for the next epoch. Here we will use combination of the standard algorithm of elastic principal trees which will capture the global structure of the data. Aftwerwards we will apply robust elastic principal tree algorithm in order to better learn the local details of data distribution.

hgdptable = importdata('test_data/HGDP_SNP/hgdp_PC3_annotated.txt');
hgdp = hgdptable.data; 
labelRegions = (hgdptable.textdata(2:end,3))';
labelNations = (hgdptable.textdata(2:end,2))';

Now we will prepare such a coloring of data points that different geographical regions will be distinguished by very different colors while nations within the region will be distinguished by aspects of the same color.

lcm = containers.Map;
for i=1:size(labelNations,2)
     region = char(labelRegions(i)); label = char(labelNations(i));
 if strcmp(region,'AFRICA') lcm(label) = [0.5 0 0]; end
 if strcmp(region,'SOUTH_CENTRAL_ASIA') lcm(label) = [1 0.5 0]; end
 if strcmp(region,'SOUTH_AMERICA') lcm(label) = [1 0 0]; end
 if strcmp(region,'OCEANIA') lcm(label) = [0 1 1]; end
 if strcmp(region,'EAST_ASIA') lcm(label) = [1 1 0]; end
 if strcmp(region,'EUROPE') lcm(label) = [0 1 0]; end
 if strcmp(region,'NEAR_EAST') lcm(label) = [0.75 0.75 0.75]; end
end
lcm('Sardinian') = [0 1 0]; lcm('Tuscan') = [0.1 1 0.1]; lcm('Italian') = [0.3 1 0.1]; 
lcm('Basque') = [0 0.9 0]; lcm('French') = [0 0.8 0]; lcm('Orcadian') = [0 0.6 0];
lcm('Russian') = [0.1 1 0]; lcm('Adygei') = [0 0.8 0.3];

These data are usually shown in projection onto the first two principal components:

for i=1:size(hgdp,1) 
  if ~strcmp(labelNations(i),'_') 
    plot(hgdp(i,1),-hgdp(i,2),'ks','MarkerFaceColor',lcm(char(labelNations(i)))); hold on; 
  end; 
end;

Now, we construct the standard elastic principal tree and visualize it using Metro Map Layout. We explicitly specify the values of elasticity coefficients Lambda (for stretching) and Mu (for deviation from harmonicity).

[NodePositions,Edges] = computeElasticPrincipalGraph(hgdp,30,'Lambda',0.01,'Mu',0.1);
close all;
NodesMM = computeMetroMapLayout(NodePositions,Edges);
drawGraph2D(NodesMM,Edges,'ShowClusterNumbers',0);
drawPieChartsMetroMap(NodePositions,Edges,hgdp,labelNations,NodesMM,'LabelColorMap',lcm);

The result of this elastic tree construction is not satisfactory: for higher elasticity coefficients the tree does not trace the local details of the data distribution, learning only the global structure corresponding roughly to geographical regions; for smaller elasticity coefficients, the tree starts to be unconstrained and forms extensive branching. Below are several runs of the elastic principal tree construction with various elasticity parameters:

Now let us try to use the robust version of the elastic principal graph.

[NodePositions,Edges] = computeElasticPrincipalGraph(hgdp,30,'TrimmingRadius',0.25);
figure;
NodesMM = computeMetroMapLayout(NodePositions,Edges);
drawGraph2D(NodesMM,Edges,'ShowClusterNumbers',0);
drawPieChartsMetroMap(NodePositions,Edges,hgdp,labelNations,NodesMM,'LabelColorMap',lcm);

The result of this ElPiGraph application is also not satisfactory. Though the principal tree now is able to follow the local structure of the data distribution, it describes it only in one particular region, where the density of the data is the highest. The robust elastic principal tree is not able to pass the gaps in the data distribution larger than the trimming radius, because the distant points do not influence the change of the node positions and not visible to the tree.

Now let us combine three strategies (curve, global and local robust) in one algorithm, constructing first a principal curve following the major genotype changes, secondly the standard principal tree with relatively high elasticity coefficients, and finally use this configuration as the initial position to learn the local details of the data distribution by robust version of the principal graph.

 [NodePositions,Edges] = computeElasticPrincipalCurve(hgdp,20,'Plots',2); 
 [NodePositions,Edges] = computeElasticPrincipalGraph(hgdp,40,'Plots',2,'Lambda',0.01,'Mu',0.1,'InitGraph',struct('InitNodes',NodePositions,'InitEdges',Edges));
 [NodePositions,Edges] =  computeElasticPrincipalGraph(hgdp,60,'InitGraph',struct('InitNodes',NodePositions,'InitEdges',Edges),'Lambda',0.0002,'Mu',0.0001,'TrimmingRadius',0.25,'Plots',2);
 NodesMM = computeMetroMapLayout(NodePositions,Edges);
 figure;
 drawGraph2D(NodesMM,Edges,'ShowClusterNumbers',0);
 drawPieChartsMetroMap(NodePositions,Edges,hgdp,labelNations,NodesMM,'LabelColorMap',lcm);

This time the principal tree fits both the global distribution of the data (geographical regions) and also approximates the local details of the data distribution.

In order to better see the local details of the data distribution, one can scale the pie-charts on the graph:

 drawGraph2D(NodesMM,Edges,'ShowClusterNumbers',0);
 drawPieChartsMetroMap(NodePositions,Edges,hgdp,labelNations,NodesMM,'LabelColorMap',lcm,'ScaleCharts',0.3);

Finally, let us project the principal tree in the PCA plot together with data points.

[LabelColorMap,partition] = drawPieChartsMetroMap(NodePositions,Edges,hgdp,labelNations,NodesMM); close all; drawPieChartsProc(NodePositions,partition,labelNations,NodesMM,'LabelColorMap',lcm,'ScaleCharts',0.8); drawGraph2D(NodePositions,Edges,'NodeSizes',zeros(size(NodePositions,1),1),'ShowClusterNumbers',0); for i=1:size(hgdp,1) if ~strcmp(labelNations(i),'_') plot(hgdp(i,1),hgdp(i,2),'ks','MarkerFaceColor',lcm(char(labelNations(i))),'MarkerSize',6); hold on; end; end;