-
Notifications
You must be signed in to change notification settings - Fork 0
/
ReproducibleResearchIsRSE.Rmd
164 lines (112 loc) · 5.87 KB
/
ReproducibleResearchIsRSE.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
---
title: "Reproducible Research <p><b>is</b></p> <p>Research Software Engineering</p>"
author: "David Mawdsley, Robert Haines and Caroline Jay"
date: "Research IT Club<p><img src='http://assets.manchester.ac.uk/logos/hi-res/TAB_UNI_MAIN_logo/White_backgrounds/TAB_col_white_background.png' style='border:0px solid black' width='50%'></p>"
output:
revealjs::revealjs_presentation:
transition: fade
theme: solarize
fig_height: 6
self_contained: true
reveal_options:
controls: false
---
## Reproducible research
![](figs/WorkflowPipeline.png)
[](Abstract below - this won't appear in the presentation)
[](Good software engineering practice improves the robustness of the toolchain. A logical extension of this is to treat the manuscript itself as an integrated part of the software project. Tools such as Knitr allow us to include the code that produces the results of our analysis in the LaTeX source of the manuscript.)
[](Through a case study, we explain how we have done this, using a combination of Makefiles, Docker images and Knitr. We will discuss some of the challenges in this approach, such as managing complex software dependencies and scripting our analysis steps.)
## Publishing reproducible research
* How it changes the publication model
* Why this is a good thing for research
* What it means for the skills pipeline
## Self-contained reproducible research papers
* We can write our paper in LaTeX or Markdown, including R code as required
* Everything is in R; the only external dependency is the data
+ In principle (pretty much) a solved problem
* Some friction points:
+ software versions → Docker or Packrat
[](Docker would be safer, since some packages call external libraries)
+ formatting, e.g. tables
+ collaboration; e.g. working with Overleaf
##
![](example/ExamplePaperSource.png)
##
![](example/ExamplePaper.png)
## What if you can't do everything in R?
* Complex dependencies
* Time consuming-analyses
* Long pipelines
[](Good figures to include: Heath Robinson or Wallis and Gromit contraption?)
## Our approach
* Make modular by containerising each step using Docker
+ Reusable, reproducible
+ The final module makes the paper
* Join outputs of containers with Makefile
+ Or a workflow management tool
## Example - IDInteraction
- Automate the coding of behaviours
- This is _really_ slow and tedious to do by hand.
![](figs/P07ScreenTablet.jpg)
##
![](figs/WorkflowPipeline.png)
## Docker images
- Each module contains its own Makefile
- Example: object tracking
[](Discuss aim of experiment, data source, processing workflow)
![](figs/ObjectTracking0.png)
## Docker images {data-transition="none"}
- Each module contains its own Makefile
- Example: object tracking
[](Discuss aim of experiment, data source, processing workflow)
![](figs/ObjectTracking1.png)
## Docker images {data-transition="none"}
- Each module contains its own Makefile
- Example: object tracking
![](figs/ObjectTracking2.png)
##
![](figs/WorkflowContainers.png)
[](Each module is self-contained; can use different versions of libraries, e.g. OpenCV)
## Challenges
* Additional complexity
+ Pipeline can be difficult to debug
* Requires Docker
* Error handling
* Top-level Makefile can become unwieldy
+ Have started using [Nextflow](https://www.nextflow.io) for parts of the process
## Benefits
* Transparency
* Allows others to re-run and extend our analyses
+ Re-run, to verify
+ Re-run, to modify (e.g. bounding box)
+ Re-run, to extend (e.g. new data, new tracking methods)
* Moves away from the static "one-shot" publication
## A new publication model {data-transition="none"}
* Improves reliability
## A new publication model {data-transition="none"}
* Improves reliability
* Improves efficiency
## A new publication model {data-transition="none"}
* Improves reliability
* Improves efficiency
* Improves effectiveness
## A new publication model {data-transition="none"}
* Improves reliability
* Improves efficiency
* Improves effectiveness
* Accelerates progress
## Reproducible research / research software engineering {data-transition="none"}
* Narrative remains important
+ paper construction is becoming less a literary work, and more a software engineering project
## Reproducible research / research software engineering {data-transition="none"}
* Narrative remains important
+ paper construction is becoming less a literary work, and more a software engineering project
* RSEs are integral to this process
[](Notes: Finish technical bit with a description of what this allows other people to do - allows other people to re-run some of our analysis. This is how we've made the paper - what does this mean for another researcher who comes to this work - they can rerun these aspects completely. But bbox placed by a human - someone could explore how variation in placement of bbox affected results, could rerun and compare. A new experiment, using our paper, updating our paper. Another potential novel contribution, from the same tool chain.)
[](Don't need to talk about the value of containerisation - talk more about the value of making the paper.)
[](Under the current publishing model - the publication is static. One of the potential benefits is moving away from one-shot papers.)
[](Gives us extra stuff, but how many reviewers will build pipeline. Probably not many. The value is improving the research (incremental) process; causes us to reflect on the publication model. What we can't do with (most) standard journals is publish an almost identical looking paper, but with a different data-set.)
[](Doing this because it's the _right_ thing to do, but what's the value in it?)
[](Not possible to build these papers without RSEs)
[](Old model - one thinks of something, writes software. Stops. Write paper)
[](This approach blends this through; no gap between software and paper - same process. The paper _is_ almost software.)