---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# hierarchicalSets
<!-- badges: start -->
[![CRAN\_Release\_Badge](http://www.r-pkg.org/badges/version-ago/hierarchicalSets)](https://CRAN.R-project.org/package=hierarchicalSets) [![CRAN\_Download\_Badge](http://cranlogs.r-pkg.org/badges/hierarchicalSets)](https://CRAN.R-project.org/package=hierarchicalSets)
[![R-CMD-check](https://github.com/thomasp85/hierarchicalSets/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/thomasp85/hierarchicalSets/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
## What is this?
This is a package that facilitates hierarchical set analysis on large
collections of sets.
### OK, so what is Hierarchical Set Analysis?
*Hierarchical Set Analysis* is a way to investigate large numbers of sets. It
consists of two things: a novel hierarchical clustering algorithm for sets and
a range of visualizations that build on top of the resulting clustering.
**The clustering**, in contrast to more traditional approaches to hierarchical
clustering, does not rely on a derived distance measure between the sets, but
instead works directly with the set data itself using set algebra. This has two
results: First, it is much easier to reason about the result in a set context,
and second, you are not forced to provide a distance if none exists (e.g. when
sets are completely independent). The clustering is based on a generalization of
Jaccard similarity, called *Set Family Homogeneity*, that, in its simplest form,
is defined as the size of the intersection of the sets in the family divided by
the size of the union of the sets in the family. The clustering progresses by
iteratively merging the pair of set families that shows the highest set family
homogeneity and is terminated once all remaining set family pairs have a
homogeneity of 0. Note that this means that the clustering does not necessarily
end with all sets in one overall cluster, but may be split into several
hierarchies - this is intentional.
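To make the description above concrete, here is a minimal base R sketch of the
simplest form of set family homogeneity and of a naive greedy merge loop. It is
an illustration only and not the implementation used by the package:
`family_homogeneity()`, `naive_hierarchy()` and `toy_sets` are hypothetical
names invented for this example.

```{r}
# Hypothetical helper (not part of the package API):
# homogeneity = size of the intersection / size of the union of a set family.
family_homogeneity <- function(sets) {
  common <- Reduce(intersect, sets)
  total <- Reduce(union, sets)
  if (length(total) == 0) return(0)
  length(common) / length(total)
}

# Naive greedy clustering: repeatedly merge the two families whose combined
# family has the highest homogeneity, and stop once every pair scores 0.
naive_hierarchy <- function(sets) {
  families <- as.list(names(sets)) # each family is a vector of set names
  merges <- list()
  while (length(families) > 1) {
    pairs <- combn(length(families), 2)
    scores <- apply(pairs, 2, function(p) {
      family_homogeneity(sets[unlist(families[p])])
    })
    if (max(scores) == 0) break
    best <- pairs[, which.max(scores)]
    merged <- unlist(families[best])
    merges[[length(merges) + 1]] <- list(
      members = merged,
      homogeneity = max(scores)
    )
    families <- c(families[-best], list(merged))
  }
  list(merges = merges, roots = families)
}

toy_sets <- list(
  a = c(1, 2, 3, 4),
  b = c(2, 3, 4, 5),
  c = c(3, 4, 9),
  d = c(10, 11)
)

family_homogeneity(toy_sets[c("a", "b")]) # 3 shared elements out of 5
toy_result <- naive_hierarchy(toy_sets)
length(toy_result$merges) # number of merges performed
length(toy_result$roots)  # number of separate hierarchies left at the end
```

In this toy example `d` shares nothing with the other sets, so the loop stops
with two separate hierarchies rather than a single root, which is the behaviour
described above.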
**The visualizations** use the derived clustering as a scaffold to show e.g.
intersection and union sizes of set combinations. By narrowing the number of
set families to visualize to the branch points of the hierarchy, the number of
data points to show is linearly related to the number of sets under
investigation, instead of growing exponentially as it would if we chose to show
everything. This means that hierarchical set visualizations are much more
scalable than e.g. UpSet and Radial Sets, at the cost of only showing
combinations of the progressively most similar sets. Apart from intersection and
union sizes there is a secondary analysis built into hierarchical sets that
detects the elements not conforming to the imposed hierarchy. These *outlying
elements* define a subset of the universe that deviates from the rest and can be
quite interesting - visualizations to investigate these are of course also
provided.
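As a rough illustration of the scaling argument, the following sketch compares
the two counts for a few collection sizes. It assumes that every merge joins
exactly two set families, so that a hierarchy over `n` sets has at most `n - 1`
branch points, and it counts "everything" as all combinations of two or more
sets. `n_sets` and `combination_counts` are names made up for this example.

```{r}
n_sets <- c(5, 10, 20, 100)

combination_counts <- data.frame(
  sets = n_sets,
  # upper bound on branch points when merges are pairwise
  max_branch_points = n_sets - 1,
  # all subsets of the collection with at least two sets
  all_combinations = 2^n_sets - n_sets - 1
)
combination_counts
```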
Hierarchical set analysis is obviously sensible for collections of sets where a
hierarchical structure makes sense, but even for set collections that do not
obviously support a hierarchy, it can be interesting to look at how the
different sets do, or do not, relate to each other.
## How do I get it?
The stable version is available on CRAN with
`install.packages('hierarchicalSets')`. Alternatively the development version
can be obtained from GitHub using devtools:
```{r, eval=FALSE}
# install.packages('devtools')
devtools::install_github('thomasp85/hierarchicalSets')
```
## How do I use it?
hierarchicalSets comes with a toy dataset containing the followers of 100
prolific anonymous Twitter users. To create the hierarchical clustering you use
the `create_hierarchy()` function.
```{r}
library(hierarchicalSets)
data('twitter')
twitSet <- create_hierarchy(twitter)
twitSet
```
To simply have a look at the hierarchy you plot it:
```{r}
plot(twitSet)
```
Here the x axis is encoded with the *Set Family Heterogeneity* which is the
inverse of the homogeneity. It can thus be interpreted as the ratio of union to
intersection size.
Usually we are interested in the direct numbers, which can be provided with
another plot type - the intersection stack:
```{r}
plot(twitSet, type = 'intersectStack', showHierarchy = TRUE)
```
We see that four sets in particular are very similar, incidentally four of the
largest sets. The rightmost cluster is interesting as well, as we see it is held
together by a very small overlap and that each individual set contains a lot of
unique followers.
While it is expected that followers share some of the same patterns in terms of
who they follow, it cannot be expected that they fully adhere to a single
hierarchy. We can look at how sets are connected across the hierarchy by
counting the number of elements they have in common that are not part of their
closest shared set family. This can be shown using hierarchical edge bundles:
```{r}
plot(twitSet, type = 'outlyingElements', quantiles = 0.8, alpha = 0.2)
```
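The pairwise count behind this plot can be sketched in base R as follows. This
is my reading of the description above and not code from the package: it
assumes that "part of the closest shared set family" means being in the
intersection of the smallest family that contains both sets.
`outlying_between()` and `demo_sets` are hypothetical names.

```{r}
# Hypothetical helper (not part of the package API): elements shared by sets
# `a` and `b` that are missing from the intersection of `family`, the closest
# set family containing both of them.
outlying_between <- function(sets, a, b, family) {
  shared <- intersect(sets[[a]], sets[[b]])
  core <- Reduce(intersect, sets[family])
  setdiff(shared, core)
}

demo_sets <- list(
  a = c(1, 2, 3, 4, 7),
  b = c(1, 2, 3, 4),
  c = c(3, 4, 7, 9)
)

# Suppose the clustering merged a and b first and then added c, so the
# closest family containing both a and c is {a, b, c}.
outliers_ac <- outlying_between(demo_sets, "a", "c", family = c("a", "b", "c"))
outliers_ac
length(outliers_ac) # the count that would link a and c
```

Here `a` and `c` both contain 7, but `b` does not, so 7 is shared by the pair
without being captured by the family they belong to.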
It seems our four sets again draw attention to themselves by having strong
connections to a range of other sets distant in terms of the clustering. Also,
it is obvious that there are two separate groups of sets in terms of their
deviation profile. There might be a secondary structure hidden within the
outlying elements. Let's create a hierarchical set clustering based only on the
outlying elements:
```{r}
twitSet2 <- outlier_hierarchy(twitSet)
twitSet2
```
In this way it is possible to gradually shave off hierarchical structures,
revealing the uncaptured relations of the prior analysis... Happy investigation!
## Future work
Hierarchical Sets is static on purpose. I firmly believe that effective static
plots are the foundation for any good visualization. You can always augment a
static visualization with interactivity, but not all interactive visualizations
can be used in a static way. That being said, it could be fun to experiment with
said augmentation within a shiny app. Also, the implementation begins to
struggle (just slow down actually) when being used with thousands of sets and
millions of elements - some C++ wizardry might be warranted for such huge
datasets.
Another nice idea I have in mind is to be able to keep set and element metadata
within the HierarchicalSet object and seamlessly use it for plotting.