Demonstration of the UPGMA hierarchical clustering algorithm in Pandas, Seaborn, and SciPy.
The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm is a bottom-up (agglomerative) hierarchical clustering algorithm commonly applied to genetic distance matrices. At each step it merges the two closest clusters and sets the distance from the new cluster to every other cluster to the arithmetic mean of the pairwise distances between their members. The sequence of merges is typically drawn as a dendrogram. The code in this repository uses Pandas and Seaborn for vectorized operations and data visualization.
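As a rough illustration of the merge procedure, here is a minimal sketch that uses SciPy's `linkage` with `method="average"`, which is SciPy's name for UPGMA. The four taxa `A` to `D` and their distances are made up for this example, and this is not necessarily how upgma.py computes its results. Halving each merge distance to get a branch height is one common convention and is an assumption here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical taxa and symmetric distance matrix (illustrative values only).
labels = ["A", "B", "C", "D"]
dist = np.array([
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 6.0, 10.0],
    [6.0, 6.0, 0.0, 10.0],
    [10.0, 10.0, 10.0, 0.0],
])

# linkage() expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(dist), method="average")

# Rebuild nested-tuple cluster names from the linkage rows.
# Row i merges clusters `left` and `right` into a new cluster with index n + i.
clusters = list(labels)
merges = {}
for left, right, distance, _count in Z:
    pair = (clusters[int(left)], clusters[int(right)])
    merges[pair] = distance / 2  # assumed convention: height = half the distance
    clusters.append(pair)

for pair, height in merges.items():
    print(pair, height)
```

The keys of `merges` are nested tuples, similar in shape to the keys shown in this README, although the left/right order inside each tuple follows SciPy's indexing rather than this repository's.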
In this repository, UPGMA is deterministic: the same input produces the same results on every run. In addition, as long as the data themselves are unchanged, the rows and columns may be arranged in any order without affecting the results.
{('Man', 'Monkey'): 0.5,
('Turtle', 'Chicken'): 4.0,
(('Man', 'Monkey'), 'Dog'): 6.25,
(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')): 7.875,
((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna'): 14.1875,
(((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna'), 'Moth'): 18.21875}
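The order-independence claim can be checked on a small example. The sketch below is an assumption-laden illustration rather than code from this repository: it uses SciPy's average linkage, a made-up four-taxon matrix, and a hypothetical helper `merge_heights`. Clusters are stored as frozensets so that the comparison ignores left/right ordering. The toy matrix has no tied distances; ties are not covered here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def merge_heights(labels, dist):
    """Map each merged cluster (a frozenset of leaf labels) to its height."""
    Z = linkage(squareform(dist), method="average")
    members = [frozenset([name]) for name in labels]
    heights = {}
    for left, right, distance, _count in Z:
        merged = members[int(left)] | members[int(right)]
        heights[merged] = distance / 2  # assumed convention: half the distance
        members.append(merged)
    return heights


# Hypothetical taxa and distances (illustrative values only).
labels = ["A", "B", "C", "D"]
dist = np.array([
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 6.0, 10.0],
    [6.0, 6.0, 0.0, 10.0],
    [10.0, 10.0, 10.0, 0.0],
])

# Reorder rows and columns together, keeping labels aligned with the data.
order = [3, 1, 0, 2]
shuffled_labels = [labels[i] for i in order]
shuffled_dist = dist[np.ix_(order, order)]

original = merge_heights(labels, dist)
shuffled = merge_heights(shuffled_labels, shuffled_dist)

# Same clusters, same heights, regardless of input order.
same = original.keys() == shuffled.keys() and all(
    np.isclose(original[key], shuffled[key]) for key in original
)
print(same)
```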
- python3-numpy
- python3-pandas
- python3-scipy
- python3-seaborn
Execute the upgma.py file in an IPython environment.
Tables may be viewed by running commands such as:
upgma.upgma_records[('Man', 'Monkey')]
upgma.upgma_records[(((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna'), 'Moth')]
The phylogenetic distances may be viewed by running:
upgma.phylogeny
- The Pandas styler contains a bug that affects one of the intermediate steps of this program. When the index is [((('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna')], the original dataframe cannot be styled properly.
See the created issue: pandas-dev/pandas#24687
ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
FIX: The tuples are now converted to strings, which avoids this unpredictable behavior. However, the error may point to a deeper problem in the pandas Cython code base.
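For illustration, one way to stringify tuple labels before they reach a dataframe is sketched below. The labels and values are hypothetical, and whether upgma.py performs the conversion in exactly this way is an assumption.

```python
import pandas as pd

# Hypothetical cluster labels: one merged pair and one remaining leaf.
labels = [("A", "B"), "C"]

# Convert every label to its string form so the index holds only plain strings.
str_labels = [str(label) for label in labels]

# Illustrative distance table indexed by the stringified labels.
table = pd.DataFrame(
    [[0.0, 6.0], [6.0, 0.0]],
    index=str_labels,
    columns=str_labels,
)
print(table)
```

Because the index and columns contain only strings, lookups use the string form of the tuple, for example `table.loc["('A', 'B')", "C"]`.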