title | tags | authors | affiliations | date | bibliography | aas-doi |
---|---|---|---|---|---|---|
`plotastic`: Bridging Plotting and Statistics in Python | | | | 11.11.2023 | paper.bib | 10.3847/xxxxx |
`plotastic` addresses the challenges of transitioning from exploratory data analysis to hypothesis testing in Python's data science ecosystem. Bridging the gap between `seaborn` and `pingouin`, this library offers a unified environment for plotting and statistical analysis. It simplifies the workflow with user-friendly syntax and seamless integration with familiar `seaborn` parameters (y, x, hue, row, col). Inspired by `seaborn`'s consistency, `plotastic` utilizes a `DataAnalysis` object to intelligently pass parameters to `pingouin` statistical functions. Hence, statistics and plotting are performed on the same set of parameters, so that the strength of `seaborn` in visualizing multidimensional data is extended to statistical analysis. In essence, `plotastic` translates `seaborn` parameters into statistical terms, configures statistical protocols based on intuitive plotting syntax, and returns a `matplotlib` figure that supports the usual customization options. This approach streamlines data analysis, allowing researchers to focus on correct statistical testing and worry less about specific syntax and implementations.
Python's data science ecosystem provides powerful tools for both visualization and statistical testing. However, the transition from exploratory data analysis to hypothesis testing can be cumbersome, requiring users to switch between libraries and adapt to different syntaxes. `seaborn` has become a popular choice for plotting in Python, offering an intuitive interface. Its statistical functionality focuses on descriptive plots and bootstrapped confidence intervals [@waskomSeabornStatisticalData2021]. The library `pingouin` offers an extensive set of statistical tests, but it lacks integration with common plotting capabilities [@vallatPingouinStatisticsPython2018]. `statannotations` integrates statistical testing with plot annotations, but uses a complex interface and is limited to pairwise comparisons [@charlierTrevismdStatannotationsV02022].

`plotastic` addresses this gap by offering a unified environment for plotting and statistical analysis. With an emphasis on user-friendly syntax and integration of familiar `seaborn` parameters, it simplifies the process for users already comfortable with `seaborn`. The library ensures a smooth workflow, from data import to hypothesis testing and visualization.
The following code demonstrates how `plotastic` analyzes the example dataset "fmri", similar to @waskomSeabornStatisticalData2021 (\autoref{fig:examplefmri}).
```python
# IMPORT PLOTASTIC
import plotastic as plst

# IMPORT EXAMPLE DATA
DF, _dims = plst.load_dataset("fmri", verbose=False)

# EXPLICITLY DEFINE DIMENSIONS TO FACET BY
dims = dict(
    y="signal",     # y-axis, dependent variable
    x="timepoint",  # x-axis, independent variable (within-subject factor)
    hue="event",    # color, independent variable (within-subject factor)
    col="region",   # axes, grouping variable
)

# INITIALIZE DATAANALYSIS OBJECT
DA = plst.DataAnalysis(
    data=DF,            # Dataframe, long format
    dims=dims,          # Dictionary with y, x, hue, col, row
    subject="subject",  # Datapoints are paired by subject (optional)
    verbose=False,      # Print out info about the Data (optional)
)

# STATISTICAL TESTS
DA.check_normality()   # Check Normality
DA.check_sphericity()  # Check Sphericity
DA.omnibus_rm_anova()  # Perform RM-ANOVA
DA.test_pairwise()     # Perform Posthoc Analysis

# PLOTTING
(DA
 .plot_box_strip()     # Pre-built plotting function initializes plot
 .annotate_pairwise(   # Annotate results from DA.test_pairwise()
     include="__HUE"   # Use only significant pairs across each hue
 )
)
```
: Results from `DA.check_sphericity()`. `plotastic` assesses sphericity after grouping the data by all grouping dimensions (hue, row, col). For example, `DA.check_sphericity()` grouped the 'fmri' dataset by "region" (col) and "event" (hue), performing four subsequent sphericity tests for four datasets. \label{tab:sphericity}

'region', 'event' | spher | W | chi2 | dof | pval | group count | n per group |
---|---|---|---|---|---|---|---|
'frontal', 'cue' | True | 3.26e+20 | -462.7 | 44 | 1 | 10 | [14] |
'frontal', 'stim' | True | 2.45e+17 | -392.2 | 44 | 1 | 10 | [14] |
'parietal', 'cue' | True | 1.20e+20 | -452.9 | 44 | 1 | 10 | [14] |
'parietal', 'stim' | True | 2.44e+13 | -301.9 | 44 | 1 | 10 | [14] |
: Results of `DA.omnibus_rm_anova()`. `plotastic` performs one two-factor RM-ANOVA per axis (grouping the data by row and col dimensions) using x and hue as the within-factors. For this example, `DA.omnibus_rm_anova()` grouped the 'fmri' dataset by "region" (col), performing two subsequent two-factor RM-ANOVAs. Within-factors are "timepoint" (x) and "event" (hue). For conciseness, GG-Correction and effect sizes are not shown. \label{tab:RMANOVA}

'region' | Source | SS | ddof1 | ddof2 | MS | F | p-unc | stars |
---|---|---|---|---|---|---|---|---|
'parietal' | timepoint | 1.583 | 9 | 117 | 0.175 | 26.20 | 3.40e-24 | **** |
'parietal' | event | 0.770 | 1 | 13 | 0.770 | 85.31 | 4.48e-07 | **** |
'parietal' | timepoint * event | 0.623 | 9 | 117 | 0.069 | 29.54 | 3.26e-26 | **** |
'frontal' | timepoint | 0.686 | 9 | 117 | 0.076 | 15.98 | 8.28e-17 | **** |
'frontal' | event | 0.240 | 1 | 13 | 0.240 | 23.44 | 3.21e-04 | *** |
'frontal' | timepoint * event | 0.242 | 9 | 117 | 0.026 | 13.031 | 3.23e-14 | **** |
The functionality of `plotastic` revolves around a seamless integration of statistical analysis and plotting, leveraging the capabilities of `pingouin`, `seaborn`, `matplotlib` and `statannotations` [@vallatPingouinStatisticsPython2018; @waskomSeabornStatisticalData2021; @hunterMatplotlib2DGraphics2007; @charlierTrevismdStatannotationsV02022]. It utilizes long-format `pandas` `DataFrames` as its primary input, aligning with the conventions of `seaborn` and ensuring compatibility with existing data structures [@wickhamTidyData2014a; @reback2020pandas; @mckinneyDataStructuresStatistical2010].

`plotastic` was inspired by `seaborn` and uses the same set of intuitive and consistent parameters (y, x, hue, row, col) found in each of its plotting functions [@waskomSeabornStatisticalData2021]. These parameters intuitively delineate the data dimensions plotted, yielding 'faceted' subplots, each presenting y against x. This allows for rapid and insightful exploration of multidimensional relationships. `plotastic` extends this principle to statistical analysis by storing these `seaborn` parameters (referred to as dimensions) in a `DataAnalysis` object and intelligently passing them to statistical functions of the `pingouin` library. This approach is based on the impression that most decisions during statistical analysis can be derived from how the user decides to arrange the data in a plot. It also prevents code repetition and streamlines statistical analysis. For example, the subject keyword is specified only once during `DataAnalysis` initialization, and `plotastic` selects the appropriate paired or unpaired version of the test. Using `pingouin` alone requires the user to manually pick the correct test and to repeatedly specify the subject keyword in each testing function.

In essence, `plotastic` translates plotting parameters into their statistical counterparts. This translation minimizes user input and also ensures a coherent and logical connection between plotting and statistical analysis. The goal is to allow the user to focus on choosing the correct statistical test (e.g. parametric vs. non-parametric) and worry less about specific implementations.
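
The translation from plotting parameters to statistical roles can be pictured with a minimal sketch. It is not `plotastic`'s internal code: the names `StatRoles` and `translate_dims` are hypothetical and introduced only for illustration, and the mapping simply follows the description above (y as dependent variable, x and hue as factors, row and col as grouping variables, and a subject column implying paired data).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StatRoles:
    """Hypothetical container for the statistical reading of plot dimensions."""

    dv: str                 # dependent variable (taken from y)
    factors: List[str]      # factors analyzed within each axis (x, hue)
    facets: List[str]       # grouping variables, one analysis per axis (row, col)
    subject: Optional[str]  # subject column, if datapoints are paired
    paired: bool            # True when a subject column was given


def translate_dims(dims: dict, subject: Optional[str] = None) -> StatRoles:
    """Map seaborn-style dimensions onto statistical roles (illustrative only)."""
    factors = [dims[k] for k in ("x", "hue") if dims.get(k)]
    facets = [dims[k] for k in ("row", "col") if dims.get(k)]
    return StatRoles(
        dv=dims["y"],
        factors=factors,
        facets=facets,
        subject=subject,
        paired=subject is not None,
    )


# Dimensions from the "fmri" example
dims = dict(y="signal", x="timepoint", hue="event", col="region")
roles = translate_dims(dims, subject="subject")
print(roles)
```

Because the subject column is stored once alongside the dimensions, every later test can read the paired or unpaired setting from the same object instead of asking the user again.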
At its core, `plotastic` employs iterators to systematically group data based on various dimensions, aligning the analysis with the distinct requirements of tests and plots. Normality testing is performed on each individual sample, which is achieved by splitting the data by all grouping dimensions and also the x-axis (hue, row, col, x). Sphericity and homoscedasticity testing is performed on a complete sample set listed on the x-axis, which is achieved by splitting the data by all grouping dimensions (hue, row, col) (\autoref{tab:sphericity}). For omnibus and posthoc analyses, data is grouped by the row and col dimensions in parallel to the `matplotlib` axes, before performing one two-factor analysis per axis using x and hue as the within/between-factors (\autoref{tab:RMANOVA}).
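
The three grouping levels described above can be sketched with plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper `iter_groups` is hypothetical, and the toy records only mimic the layout of the "fmri" example (two regions, two events, three timepoints, four subjects) with placeholder signal values.

```python
from collections import defaultdict
from itertools import product

# Toy long-format records standing in for the "fmri" example (values are placeholders)
regions = ["frontal", "parietal"]
events = ["cue", "stim"]
timepoints = [0, 1, 2]
subjects = ["s0", "s1", "s2", "s3"]
records = [
    {"region": r, "event": e, "timepoint": t, "subject": s, "signal": 0.0}
    for r, e, t, s in product(regions, events, timepoints, subjects)
]

dims = dict(y="signal", x="timepoint", hue="event", col="region")


def iter_groups(rows, dims, keys):
    """Yield (group_key, group_rows) after splitting rows by the given dimensions.

    Dimensions that are not set in `dims` (here: row) are skipped.
    """
    columns = [dims[k] for k in keys if dims.get(k)]
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in columns)].append(row)
    yield from groups.items()


# Normality: every individual sample (hue, row, col, x)
normality_groups = list(iter_groups(records, dims, ("hue", "row", "col", "x")))
# Sphericity / homoscedasticity: one sample set per (hue, row, col)
sphericity_groups = list(iter_groups(records, dims, ("hue", "row", "col")))
# Omnibus / posthoc: one analysis per axis (row, col)
omnibus_groups = list(iter_groups(records, dims, ("row", "col")))

print(len(normality_groups), len(sphericity_groups), len(omnibus_groups))
# prints: 12 4 2
```

With two regions and two events, the sketch yields four sphericity groups and two omnibus groups, matching the number of tests reported in the two tables for the "fmri" example.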
`DataAnalysis` visualizes data through predefined plotting functions designed for drawing multi-layered plots. `plotastic` places particular emphasis on showing individual datapoints alongside aggregated means or medians. In detail, each plotting function initializes a `matplotlib` figure and axes using `plt.subplots()` while returning a `DataAnalysis` object for method chaining. Axes are populated by `seaborn` plotting functions (e.g., `sns.boxplot()`), leveraging automated aggregation and error bar displays. Keyword arguments are passed to these `seaborn` functions, ensuring the same degree of customization. Users can further customize plots by chaining `DataAnalysis` methods or by applying common `matplotlib` code to override `plotastic` settings. Figures are exported using `plt.savefig()`.
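
The method-chaining pattern mentioned above can be illustrated with a small stand-in class. Everything here is hypothetical: `ChainableFigure` and its methods are not part of `plotastic`, and the class only records layers instead of drawing them. The point of the sketch is that each method returns the object itself, so layers and customizations can be stacked in one expression.

```python
class ChainableFigure:
    """Hypothetical stand-in that records plot layers instead of drawing them."""

    def __init__(self, dims):
        self.dims = dims
        self.layers = []  # ordered list of (layer_name, keyword_arguments)

    def _add_layer(self, name, **kws):
        self.layers.append((name, kws))
        return self  # returning self is what enables method chaining

    def add_box(self, **kws):
        # Aggregated layer; keyword arguments would be forwarded to the plotting call
        return self._add_layer("box", **kws)

    def add_strip(self, **kws):
        # Individual datapoints drawn on top of the aggregated layer
        return self._add_layer("strip", **kws)

    def set_title(self, title):
        return self._add_layer("title", text=title)


fig = (
    ChainableFigure(dims=dict(y="signal", x="timepoint", hue="event", col="region"))
    .add_box(showfliers=False)
    .add_strip(alpha=0.5)
    .set_title("fmri")
)
print([name for name, _ in fig.layers])
# prints: ['box', 'strip', 'title']
```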
`plotastic` also focuses on annotating statistical information within plots, seamlessly incorporating p-values from pairwise comparisons using `statannotations` [@charlierTrevismdStatannotationsV02022]. This integration simplifies the interface and enables options for pair selection in multidimensional plots, enhancing both user experience and interpretability.

For statistics, `plotastic` integrates with the `pingouin` library to support classical assumption and hypothesis testing, covering parametric/non-parametric and paired/non-paired variants. Assumptions such as normality, homoscedasticity, and sphericity are tested. Omnibus tests include two-factor RM-ANOVA, ANOVA, Friedman, and Kruskal-Wallis. Posthoc tests are implemented through `pingouin.pairwise_tests()`, offering (paired) t-tests, Wilcoxon, and Mann-Whitney-U.
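
How the listed tests line up along the parametric/non-parametric and paired/non-paired axes can be summarized as a lookup table. This is a sketch of the standard classification of these tests, not code taken from `plotastic` or `pingouin`; the function `select_tests` and the two dictionaries are hypothetical names, and pairing is assumed to follow from whether a subject column was provided.

```python
# (parametric, paired) -> omnibus test named in the text
OMNIBUS = {
    (True, True): "RM-ANOVA",
    (True, False): "ANOVA",
    (False, True): "Friedman",
    (False, False): "Kruskal-Wallis",
}

# (parametric, paired) -> posthoc pairwise test named in the text
POSTHOC = {
    (True, True): "paired t-test",
    (True, False): "t-test",
    (False, True): "Wilcoxon",
    (False, False): "Mann-Whitney-U",
}


def select_tests(parametric, subject=None):
    """Return (omnibus, posthoc) test names; data count as paired if subject is set."""
    key = (parametric, subject is not None)
    return OMNIBUS[key], POSTHOC[key]


print(select_tests(parametric=True, subject="subject"))
# prints: ('RM-ANOVA', 'paired t-test')
print(select_tests(parametric=False))
# prints: ('Kruskal-Wallis', 'Mann-Whitney-U')
```

The user's remaining decision is the `parametric` flag, which is the choice the assumption checks (normality, homoscedasticity, sphericity) are meant to inform.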
To sum up, `plotastic` is a unified and user-friendly solution for researchers and data scientists that integrates statistical analysis with plotting in Python. It streamlines the workflow, translates `seaborn` parameters into statistical terms, and supports extensive customization options for both analysis and visualization.
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) SPP microBONE grants EB 447/10-1 (491715122), JA 504/17-1, and HO 4462/1-1 (401358321). We thank the Elite Netzwerk Bayern and the Graduate School of Life Sciences of the University of Würzburg.