---
title: "Validation Report"
output: pdf_document
---
<!-- README -->
<!-- This is the guided file. It gives you most of the structure you need to get through -->
<!-- the assignments, but leaves pieces of the coding up to you. We recommend this file -->
<!-- if you're a novice at Python and trying to learn. -->
```{r rsetup, include=FALSE}
# import needed R packages, these stay imported in future chunks
library(tidyverse)
library(magrittr)
library(knitr)
library(kableExtra)
library(reticulate)
# Set up Python for reticulate
# This path is using the miniconda installation, which is located
# inside of the home directory based on how the setup script installed it
reticulate::use_python("~/miniconda/bin/python3", required = TRUE)
```
```{python pysetup, include=FALSE}
# Import needed python packages, these stay imported in future chunks
# This only has to be done once for the life of your session
# The Python session stays alive between blocks of code
import os
import re
import pandas as pd
```
<!-- begin breakout one -->
<!-- Goals: -->
<!-- - Report the run date/time and user (R or Python - your preference) -->
<!-- - Store the absolute paths to the programs, outputs, and metadata directories in Python -->
<!-- - Report each of these paths inline in bold markdown text -->
```{r usercheck, include=FALSE}
# get system information (user, date, time)
currentuser <- Sys.getenv("USER")
currentdate <- format(Sys.time(), "%Y-%m-%d")
# Format the current time. You might want to look here:
# https://www.r-bloggers.com/2013/08/date-formats-in-r/
currenttime <- <your code here>
```
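<!-- Hint (not evaluated): if you prefer Python for this step, the sketch below shows one
way to capture the user, date, and time. It is a minimal illustration rather than the
expected answer. The names run_user, run_date, and run_time are made up for this example,
and reading the USER environment variable assumes a Unix-like machine.

```python
import os
from datetime import datetime

# Capture one timestamp so the date and time strings agree with each other
now = datetime.now()

# Fall back to a placeholder if the USER environment variable is not set
run_user = os.environ.get("USER", "unknown")
run_date = now.strftime("%Y-%m-%d")   # four-digit year, month, day
run_time = now.strftime("%H:%M:%S")   # 24-hour clock with seconds

print(f"Run by {run_user} on {run_date} at {run_time}")
```

The strftime codes used here are the same ones R's format() accepts.
-->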
<!-- Report the user, date, and time -->
<!-- We got this started - now plug in the other variables -->
This report was run by user **`r currentuser`** on **`r <what date?>`** at <!-- what time? -->.
# Report Synopsis
This is a validation report for the CDISC Replication Pilot Project. The project directory is laid out as shown below, with separate folders holding the project's files. The files used in the project are also checked for consistency and completeness.
<!-- Python code blocks start just like R codeblocks -->
<!-- But just specify `python` instead of `r` -->
```{python directorychecks, echo=FALSE}
# get current working directory
wd = os.getcwd()
# create paths to the subfolders containing programs, outputs, and metadata
metadir = os.path.join(wd, "metadata")
# Look at the files tab - where's the programs directory?
progdir = <your code here>
# Where's the outputs directory?
outputdir = <your code here>
```
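<!-- Hint (not evaluated): a small sketch of how os.path.join and os.path.isdir behave,
using a temporary directory. The folder names alpha, beta, and gamma are hypothetical
and are not part of this project.

```python
import os
import tempfile

# Hypothetical project root with two made-up subfolders, built in a temp directory
with tempfile.TemporaryDirectory() as root:
    for name in ("alpha", "beta"):
        os.mkdir(os.path.join(root, name))

    # os.path.join inserts the right separator for the operating system
    alpha_dir = os.path.join(root, "alpha")
    gamma_dir = os.path.join(root, "gamma")   # this folder is never created

    # Checks must run while the temporary directory still exists
    alpha_found = os.path.isdir(alpha_dir)
    gamma_found = os.path.isdir(gamma_dir)
    alpha_is_absolute = os.path.isabs(alpha_dir)

print(alpha_found, gamma_found, alpha_is_absolute)
```

Joining onto an absolute root such as os.getcwd() gives an absolute path, which is
what the goals for this breakout ask for.
-->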
# Directory Outline
The metadata files are located in the following folder:
<!-- Python variables are accessible inside the `py` variable in your R session -->
**`r py$metadir`**
<!-- Do these next two on your own -->
<!-- What Python variables did we store the programs and outputs directories in? -->
The program files are located in the following folder:
<your code here>
The output files are located in the following folder:
<your code here>
<!-- end breakout one -->
<!-- begin breakout two -->
<!-- Goals: -->
<!-- - Gather names of files in the programs and outputs folders -->
<!-- - Read in the metadata file metadata/metadata.csv. Review this file and familiarize -->
<!-- yourself with the contents -->
<!-- - Bidirectionally check the metadata and directories for programs and outputs -->
<!-- - Create Boolean variables for -->
<!-- - Any programs missing? -->
<!-- - Programs in folder not in metadata? -->
<!-- - Any outputs missing? -->
<!-- - Outputs in folder not in metadata? -->
<!-- Use one Python code block to do all of this -->
# File Checks
```{python filechecks, echo=FALSE}
# import metadata file into data frame
# enter the metadata file name
metacsv = os.path.join(metadir, <your code here>)
metadf = pd.read_csv(metacsv)
# iterate through the metadata data frame to check if programs and outputs exist
# Remember to review the metadata data frame to familiarize yourself with the contents
# iterrows will loop over each row of the data frame. SAS Programmer? This is similar
# to a data step. R Programmer? This is similar to dplyr::rowwise()
for index, row in metadf.iterrows():
    # check to see if program exists
    program = os.path.join(progdir, row.ProgramName) # Get the path to the file
    programexist = os.path.exists(program) # Test if it exists
    # check to see if output exists
    # Get the path to the file
    output = <your code here>
    # Test if it exists
    outputexist = <your code here>
    # assign variables created to columns in the dataframe
    # DataFrame.at takes the index and a variable name,
    # and assigns a value to that variable at the specified index
    metadf.at[index, 'ProgramExists'] = programexist
    # What variable did we store whether an output exists?
    metadf.at[index, 'OutputExists'] = <your code here>
# get contents of programs and outputs sub directories
proglist = os.listdir(progdir)
# What's the outputs directory path and how do we list the contents?
outputlist = <your code here>
# Create lists of programs and outputs in the respective directory but not in the metadata file
# List comprehensions are similar to lapply in R, in that they return a list object.
# They're written like for loops. Here we're returning a list of files in the list variable
# proglist if the file is NOT in the metadf column ProgramName, and if they're .R files.
# The pattern r"\.R$" matches a literal ".R" at the end of the file name.
progsindirnotmeta = [file for file in proglist if file not in metadf.ProgramName.tolist() and re.search(r"\.R$", file)]
# And we'll do the same thing for the output files
# This should look very similar to the line above
outputsindirnotmeta = <your code here>
# Create a True/False variable to indicate if all programs in the metadata exist
# The all() function tests if all elements of an object are True
programsallgood = all(metadf.ProgramExists)
# Create a True/False variable to indicate if all outputs in the metadata exist
# What variable in metadf tells us if an output exists?
outputsallgood = <your code here>
# Create a True/False variable to indicate if there are programs in the directory not in the metadata
# The way we did the list comprehensions above, the list will be empty if all files
# in the directory are in the metadata, and therefore the length will be 0.
programsmetaallgood = (len(progsindirnotmeta) == 0)
# Create a True/False variable to indicate if there are outputs in the directory not in the metadata
# What variable did we create to tell us if a file is in the directory but not in the metadata?
outputsmetaallgood = <your code here>
```
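<!-- Hint (not evaluated): a self-contained sketch of a bidirectional check between a list
of expected file names and a folder. Everything in it is hypothetical: the file names,
the expected list (which stands in for a metadata column), and the variable names. It
is meant to show the pattern, not the answer for this project.

```python
import os
import re
import tempfile

# Hypothetical expected file names, standing in for a metadata column
expected = ["t_demo.R", "t_ae.R", "t_vitals.R"]

with tempfile.TemporaryDirectory() as folder:
    # Create two expected files, one unexpected .R file, and one non-.R file
    for name in ("t_demo.R", "t_ae.R", "scratch.R", "notes.Rmd"):
        open(os.path.join(folder, name), "w").close()

    on_disk = os.listdir(folder)

    # Direction 1: expected files that are missing from the folder
    missing = [name for name in expected
               if not os.path.exists(os.path.join(folder, name))]

    # Direction 2: .R files in the folder that are not expected.
    # r"\.R$" only matches a literal ".R" at the end of the name, so
    # "notes.Rmd" is skipped; an unanchored ".R" pattern would match it.
    extras = [name for name in on_disk
              if name not in expected and re.search(r"\.R$", name)]

# Empty lists mean nothing is wrong in that direction
all_expected_present = (len(missing) == 0)
no_unexpected_files = (len(extras) == 0)

print(missing, extras)
```
-->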
<!-- end breakout two -->
<!-- begin breakout three -->
<!-- - Use the 4 variables created earlier to output conditional markdown text -->
<!-- - Some/No programs missing -->
<!-- - Some/No programs present in folder not in metadata -->
<!-- - Some/No outputs missing -->
<!-- - Some/No outputs present in folder not in metadata -->
<!-- - Create code blocks to output the detailed issues in a nicely presented table -->
## Program File Check
The programs in the metadata document were checked against the programs folder. The results are as follows:
<!-- Use inline R code to write different text depending on whether any programs -->
<!-- listed in the metadata document are missing from the programs folder. If the if() condition -->
<!-- is met, the text will populate. Otherwise, nothing will be written -->
<!-- Enter your code in the second if() condition -->
`r if(py$programsallgood){"**All programs present in the metadata document are present in the programs folder.**"}`
<!-- What's the opposite of the if condition above? -->
`r if(<your code here>){"**Some programs present in the metadata document are not present in the programs folder.** The programs listed below are not present in the programs folder."}`
<!-- Display this table only if programs are in the metadata but not directory -->
<!-- Note the eval option in the code block below - you can use variables in your session -->
<!-- to control if the block is evaluated or not. Here, we can control whether or not a -->
<!-- table is written. -->
```{r badprograms, echo=FALSE, eval=!py$programsallgood}
# Nicely write out the names of programs that are in the metadata but not directory
# Note - the python data frames are successfully coerced to R data frames! This
# happens automatically with no extra effort on your part.
# Here we use dplyr::select and dplyr::filter to filter the pandas data frame
kable(select(filter(py$metadf, ProgramExists != TRUE), ProgramName), booktabs = T) %>%
kable_styling(latex_options = "striped", position = "left")
```
<!-- Same idea as above, but now we're reporting programs in the folder that weren't in the metadata -->
<!-- Complete the if conditions in the two lines below -->
`r if(<your code here>){"**All programs present in the programs folder are present in the metadata document.**"}`
`r if(<your code here>){"**Some programs present in the programs folder are not present in the metadata document.** The programs listed below are not present in the metadata document."}`
<!-- Display this table only if programs are in the directory but not metadata -->
<!-- Set the eval option in the block below -->
```{r badprogrammeta, echo=FALSE, eval=<your code here>}
# Nicely write out the names of programs that are in the directory but not metadata
# Kable can easily present a list as a table
kable(py$progsindirnotmeta, col.names="ProgramName", booktabs = T) %>%
kable_styling(latex_options = "striped", position = "left")
```
## Output File Check
<!-- The lines below repeat all of the same checks, but now for the outputs folder -->
The output files in the metadata document were checked against the outputs folder. The results are as follows:
`r if( ){"**All outputs present in the metadata document are present in the outputs folder.**"}`
`r if( ){"**Some outputs present in the metadata document are not present in the outputs folder.** The outputs listed below are not present in the outputs folder."}`
<!-- Display this table only if outputs are in the metadata but not directory -->
<!-- Don't forget to grab the logical variable that tells us whether or not to -->
<!-- display this table -->
```{r badoutputs, echo=FALSE , eval=<your code here> }
# nicely write out the names of outputs that are in the metadata but not directory
# Filter the dataframe to just where OutputExists is NOT True
kable(<your code here>, booktabs = T) %>%
kable_styling(latex_options = "striped", position = "left")
```
<!-- Same idea as above, but now we're reporting outputs in the folder that weren't in the metadata -->
`r if(<your code here>){"**All outputs present in the outputs folder are present in the metadata document.**"}`
<!-- Write the second statement by yourself, based on the way the check was done for programs -->
<your code here>
<!-- Display this table only if outputs are in the directory but not metadata -->
<!-- Don't forget to grab the logical variable that tells us whether or not to -->
<!-- display this table -->
```{r badoutputmeta, echo=FALSE , eval=<your code here>}
# nicely write out the names of outputs that are in the directory but not metadata
# Do this one yourself! Display the outputsindirnotmeta as a table
```
<!-- end breakout three -->
<!-- begin breakout four -->
<!-- - Pass over the metadata data frame from before and check if the source file name is in the output file -->
<!-- - Create a new data frame of just records where the output is in both the metadata and the outputs directory -->
<!-- - Create a variable that's True if all the source file names are in their outputs, and False if not -->
<!-- - Create a section that dynamically tells us if something is wrong and outputs issues in a nicely presented data frame -->
# Output Contents Checks
The text of the output files present in both the metadata document and outputs folder were checked for consistency with the source program files provided.
<!-- In this block, we're checking the output RTF document to look for a line containing the -->
<!-- program name that should have created this output document. This is a quick and dirty way -->
<!-- to check if there was a footnote that contained the executing program name. -->
```{python outputsourcecheck, echo=FALSE}
# Iterate through the metadata data frame to check the content of output files
for index, row in metadf.iterrows():
    # Get the file path for the output document
    output = <your code here>
    # Build the source string to search for; the check below turns the result into a boolean.
    # We're giving this part to you because the RTF files weren't created on this
    # machine, so we're only going to check for the programs/<filename> part.
    source = "programs/" + row.ProgramName
    # Check if the output existed (which we tested earlier)
    if row.OutputExists == True:
        # Open the file as a variable named `file`
        with open(output, 'r') as file:
            # Grab the file text as a string
            filetext = str(file.read())
            # Search the file text for the source program name
            # re.escape makes the "." in the file name match only a literal dot
            sourcecheck = bool(re.search(re.escape(source), filetext))
    # If the file doesn't exist, then just store text telling us that
    else:
        sourcecheck = "Output does not exist"
    # Assign variable created to a column in the metadf data frame
    # Write this yourself! - the variable should be named `SourceCorrect`
    # Use the methods we taught you earlier.
    <your code here>
# Create a dataframe of all output files both in the metadata and the directory
goodoutput = metadf[metadf.OutputExists == True]
# Create a True/False variable that we'll use to trigger text output depending on if the sources were correct
sourcesallgood = <your code here>
```
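<!-- Hint (not evaluated): a standalone sketch of searching a file's text for a program
path. The program name, the file contents, and the variable names are all hypothetical.
It also shows why escaping matters when the search string contains a dot.

```python
import os
import re
import tempfile

# Hypothetical program name and output text, for illustration only
program_name = "t_demo.R"
needle = "programs/" + program_name

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "t_demo.rtf")
    # Write a made-up output file that mentions the program in a footnote
    with open(path, "w") as handle:
        handle.write("Table 1\nSource: /home/user/programs/t_demo.R\n")

    # Read the whole file back as one string
    with open(path, "r") as handle:
        text = handle.read()

# re.escape makes the "." in the file name match only a literal dot
found = bool(re.search(re.escape(needle), text))

# Without escaping, "." matches any character, so a near miss also matches
loose_hit = bool(re.search(needle, "programs/t_demoXR"))
strict_hit = bool(re.search(re.escape(needle), "programs/t_demoXR"))

print(found, loose_hit, strict_hit)
```
-->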
## Sources
<!-- Similar to before, we're writing different text based on whether or not there were any problems with the sources -->
<!-- Write this however you want, but vary the text based on the sourcesallgood variable -->
<your code here>
<!-- Display a table of outputs with issues in the source footnotes based on the dataframe created above -->
```{r badsources, echo=FALSE, eval=<your code here>}
<your code here>
```
<!-- end breakout four -->