This is the repository for the 17 October Stirling Coding Club tutorial on the right way to do data science, with an example using the Tidyverse.
In this session, Matt Guy and Anna Deasey will lead a discussion on the process of doing data science. We will talk about what constitutes best practice for managing research data - including how to use the tidyverse suite of R packages to progress from raw data to powerful data visualisation and manipulation quickly, and in a reproducible way.
The Tidyverse is a collection of R packages developed in large part by RStudio's chief scientist Hadley Wickham. A unifying philosophy underlying the development of many of these packages is that of 'tidy data', whereby each variable is in its own column and each observation is in its own row. We will discuss 'tidy data' and its applicability to best practice in research data management.
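As a rough illustration of the 'tidy data' idea, the sketch below reshapes a small made-up table into tidy form. The data frame `untidy` and its column names are hypothetical, not taken from the session materials, and `pivot_longer()` assumes tidyr 1.0.0 or later (older versions would use `gather()` instead).

```r
library(tidyr)

# Hypothetical "untidy" data: one row per site, one column per year.
untidy <- data.frame(
  site = c("A", "B"),
  count_2016 = c(12, 7),
  count_2017 = c(15, 9)
)

# Reshape so that each variable (site, year, count) has its own column
# and each observation (one site in one year) has its own row.
tidy <- pivot_longer(
  untidy,
  cols = starts_with("count_"),
  names_to = "year",
  names_prefix = "count_",
  values_to = "count"
)

# The year was stored in the column names as text; convert it to a number.
tidy$year <- as.integer(tidy$year)

print(tidy)
```

The result has one row per site and year, which is the shape most tidyverse functions expect as input.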
In particular, the 'dplyr' and 'ggplot2' packages from the tidyverse offer powerful data manipulation and visualisation capabilities, and have made it easier for new R learners to get to grips with data analysis in R quickly. This has caused some to call for these packages and the 'tidyverse way' to be taught to new learners before 'base' R. In this session we will give an example of 'cleaning' and 'wrangling' a raw dataset to produce 'analysis' data, then plotting with 'ggplot2' to visualise, ask, and answer questions about your data.
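A minimal sketch of that raw-to-analysis-to-plot workflow might look like the following. The dataset `raw`, its columns, and the cleaning steps are invented for illustration and are not the example used in the session.

```r
library(dplyr)
library(ggplot2)

# Hypothetical raw data with inconsistent labels and a missing value.
raw <- data.frame(
  species = c("robin", "Robin", "wren", "wren", "robin", "wren"),
  mass_g  = c(18.2, 19.1, NA, 9.8, 17.6, 10.4)
)

# Clean: standardise the species labels and drop rows with missing mass.
analysis <- raw %>%
  mutate(species = tolower(species)) %>%
  filter(!is.na(mass_g))

# Wrangle: summarise the analysis data by species.
summary_tbl <- analysis %>%
  group_by(species) %>%
  summarise(mean_mass_g = mean(mass_g), n = n())

print(summary_tbl)

# Visualise: compare the distribution of mass between species.
p <- ggplot(analysis, aes(x = species, y = mass_g)) +
  geom_boxplot() +
  labs(x = "Species", y = "Mass (g)")

print(p)
```

Keeping the cleaning steps in a script like this, rather than editing the raw file by hand, is what makes the path from raw data to figure reproducible.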
In this repo you will find documents and links relating to:
- data management best practices
- what the 'tidyverse' is and how it relates to data management
- tutorials on the basics of core tidyverse packages (dplyr, ggplot2, ...)
Please use the issues section of this repo to raise any questions and/or thoughts on these topics.