docs: #84 final touches on functions
bms63 committed Jun 7, 2023
1 parent 6c373e7 commit 6d004b8
Showing 1 changed file with 67 additions and 39 deletions.
106 changes: 67 additions & 39 deletions vignettes/deepdive.Rmd
@@ -168,16 +168,18 @@ adsl %>%

For the next six sections, we are going to explore the warning and error messages generated by the `{xportr}` core functions. To better explore these, we will manipulate either the ADaM dataset or the specification file to showcase the ability of the `{xportr}` functions to detect issues.

**NOTE:** We have made the ADSL, `adsl`, and Specification File, `var_spec` available in this package. Users can find additional datasets and specification files on our [repo](https://github.com/atorus-research/xportr) in the `example_data_specs` folder. This is to keep the package to a minimum size.
**NOTE:** We have made the ADSL, `adsl`, and Specification File, `var_spec`, available in this package. Users can find additional datasets and specification files on our [repo](https://github.com/atorus-research/xportr) in the `example_data_specs` folder. We have not included these in our package to help keep the package to a minimum size.

### Setting up our metadata object

First, let's read in the specification file and call it `var_spec`. Note that we are not using `options()` here. We will do some slight manipulation to the column names by making them all lower case and changing `Data Type` to `type`. You can also use `options()` for this step. The `var_spec` object has five dataset specification files stacked on top of each other. We will make use of the `ADSL` section. You can make use of the Search field above the dataset column to subset the specification file for `ADSL`.
First, let's read in the specification file and call it `var_spec`. Note that we are not using `options()` here. We will do some slight manipulation to the column names by making them all lower case, changing `Data Type` to `type` and making the Order column numeric. You can also use `options()` for this step. The `var_spec` object has five dataset specification files stacked on top of each other. We will make use of the `ADSL` subset of `var_spec`. You can make use of the Search field above the dataset column to subset the specification file for `ADSL`.

```{r}
var_spec <- var_spec %>%
dplyr::rename(type = "Data Type") %>%
rlang::set_names(tolower)
rlang::set_names(tolower) %>%
dplyr::mutate(order = as.numeric(order))
```
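As a rough illustration of what this chunk does, the sketch below applies the same three steps in base R to a small, made-up specification table. The `toy_spec` data frame and its values are hypothetical; only the column names `Data Type` and `Order` come from the text above.

```r
# Hypothetical two-row specification table; the values are invented for illustration.
toy_spec <- data.frame(
  Dataset = c("ADSL", "ADSL"),
  Variable = c("STUDYID", "TRTSDT"),
  `Data Type` = c("text", "date"),
  Order = c("1", "2"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Rename `Data Type` to `type`, then lower-case every column name.
names(toy_spec)[names(toy_spec) == "Data Type"] <- "type"
names(toy_spec) <- tolower(names(toy_spec))

# Store the order column as numeric rather than character.
toy_spec$order <- as.numeric(toy_spec$order)

str(toy_spec)
```

Keeping `order` numeric matters because character values sort alphabetically, so "10" would come before "2".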

```{r, echo = FALSE}
@@ -200,7 +202,7 @@ datatable(

## `xportr_type()`

We are going to explore the type column in the metadata object. A submission to a Health Authority should only have character and numeric types in the data. In the `ADSL` data we have several columns that are in the Date type: `TRTSDT`, `TRTEDT`, `DISONSDT`, `VISIT1DT` and `RFENDT`. We will change one variable type to a [factor variable](https://forcats.tidyverse.org/), which is a common data structure in R.
We are going to explore the type column in the metadata object. A submission to a Health Authority should only have character and numeric types in the data. In the `ADSL` data we have several columns that are in the Date type: `TRTSDT`, `TRTEDT`, `DISONSDT`, `VISIT1DT` and `RFENDT`. We will change one variable type to a [factor variable](https://forcats.tidyverse.org/), which is a common data structure in R, to give us some educational opportunities.

```{r}
adsl_fct <- adsl %>%
@@ -231,7 +233,7 @@ Success! As we can see below the `xportr_type()` function applied the types from
glimpse(adsl_type_glimpse)
```

Note that `xportr_type(verbose = "warn")` was set, so the function has provided feedback as a warning message, which would show up in the console, on which variables were converted. However, you can set `verbose = 'stop'` so that the types are not applied when the data does not match what is in the specification file. Using `verbose = 'stop'` will instantly stop the processing of this function and not create the object.
Note that `xportr_type(verbose = "warn")` was set, so the function has provided feedback as a warning message, which would show up in the console, on which variables were converted. However, you can set `verbose = 'stop'` so that the types are not applied when the data does not match what is in the specification file. Using `verbose = 'stop'` will instantly stop the processing of this function and not create the object. A user will need to alter the variables in their R script before using `xportr_type()`.

```{r, echo = TRUE, error = TRUE}
adsl_type <- xportr_type(.df = adsl, metadata = var_spec, domain = "ADSL", verbose = "stop")
@@ -245,14 +247,14 @@ Next we will use `xportr_length()` to apply the length column of the _metadata o
str(adsl)
```

TODO: There is no warning around the length in the metadata being greater than 200.
TODO: There is no warning around the length in the metadata being greater than 200.
TODO: There is no message to users about how many lengths were applied to the dataframe.

```{r, echo = TRUE}
adsl_length <- xportr_length(.df = adsl, metadata = var_spec, domain = "ADSL", verbose = "warn")
```

Using the `xportr_length()` function with `verbose = 'warn'` we can apply the length column to all the columns in the dataset. The function detects that two variables, `TRTDUR` and `DCREASCD`, are missing from the metadata file. Note that the variables have slight spelling differences between the dataset and metadata, which is a great catch!
Using the `xportr_length()` function with `verbose = 'warn'` we can apply the length column to all the columns in the dataset. The function detects that two variables, `TRTDUR` and `DCREASCD`, are missing from the metadata file. Note that the variables have slight spelling differences between the dataset and metadata, which is a great catch! However, lengths are still applied, with `TRTDUR` being given a length of 8 and `DCREASCD` a length of 200.
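
The kind of mismatch described here can be sketched in base R with `setdiff()`. This is only an illustration of the idea, not how `{xportr}` implements its check; the two name vectors, including the metadata spellings, are invented for the example.

```r
# Hypothetical dataset column names and metadata variable names.
dataset_vars <- c("STUDYID", "USUBJID", "TRTDUR", "DCREASCD")
metadata_vars <- c("STUDYID", "USUBJID", "TRTDURD", "DCSREAS")

# Variables present in the dataset but absent from the metadata.
not_in_metadata <- setdiff(dataset_vars, metadata_vars)

# Report the mismatch without stopping, similar in spirit to a message or warning.
if (length(not_in_metadata) > 0) {
  message("Variables not found in metadata: ", paste(not_in_metadata, collapse = ", "))
}
```

Running a check like this before applying metadata makes small spelling differences visible early.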

Using the `str()` function, you can see below that the `xportr_length()` function successfully applied all the lengths of the variable to the variables in the dataset.

@@ -302,64 +304,73 @@ Using `xportr_label()` we will apply all the labels from our metadata to the dat
adsl_lbl <- xportr_label(.df = adsl_lbl, metadata = var_spec_lbl, domain = "ADSL", verbose = "warn")
```

Success! All labels that are present in both the metadata and the dataset have been applied. However, please note that the `TRTSDT` variable has a label with more than 40 characters.
Success! All labels that are present in both the metadata and the dataset have been applied. However, please note that the `TRTSDT` variable has had a label with more than 40 characters **applied** to the dataset, and `TRTDUR` and `DCREASCD` have empty variable labels.

```{r, max.height='300px', attr.output='.numberLines', echo = FALSE}
str(adsl_lbl)
```

Just like we did for the other functions, setting `verbose = 'stop'` immediately stops R from processing the labels. Here the function detects the missing variables and labels greater than 40 characters and will not apply any labels to the dataset until corrective action is applied.
Just like we did for the other functions, setting `verbose = 'stop'` immediately stops R from processing the labels. Here the function detects the mismatches between the variables and labels as well as the label that is greater than 40 characters. As this stops the process, none of the labels will be applied to the dataset until corrective action is applied.

```{r, echo = TRUE, error = TRUE}
adsl_label <- xportr_label(.df = adsl_lbl, metadata = var_spec_lbl, domain = "ADSL", verbose = "stop")
```
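
The 40-character limit on labels can be checked ahead of time with `nchar()`. The sketch below uses a hypothetical named vector of labels, one of which is deliberately too long; it illustrates the rule and is not the check that `xportr_label()` itself performs.

```r
# Hypothetical variable labels; the second one is deliberately longer than 40 characters.
labels <- c(
  STUDYID = "Study Identifier",
  TRTSDT = "Length of variable label must be 40 characters or less"
)

# Count the characters in each label and keep the names of those over the limit.
label_lengths <- nchar(labels)
too_long <- names(labels)[label_lengths > 40]

too_long
```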


## `xportr_order()`

The order of the dataset can greatly increase readability of the dataset for downstream stakeholders. For example, having all the treatment related variables or analysis variables grouped together can help with inspection and understanding of the dataset. `xportr_order()` can take the order information from the metadata and apply it to your dataset.

```{r}
library(dplyr)
adsl_ord <- xportr_order(adsl, var_spec, "ADSL", verbose = "warn")
```

var_spec_ord <- var_spec %>%
mutate(order = as.numeric(order))
Readers are encouraged to inspect the dataset and metadata to see the past order and updated order after calling the function. Note the messaging from `xportr_order()`:

* Variables not in the metadata are moved to the end
* Variables not in order are re-ordered and a message is printed out on which ones were re-ordered.

adsl_ord <- xportr_order(adsl, var_spec_ord, "ADSL", verbose = "warn")
```

```{r, echo = TRUE, error = TRUE}
adsl_ord <- xportr_order(.df = adsl, metadata = var_spec, domain = "ADSL", verbose = "stop")
```

```{r}
glimpse(adsl_ord)
```

TODO: I think there is something wrong with `xportr_order` as it is reordering the entire dataframe to something I don't fully understand.

TODO: What about a check on have a non-numeric value in the ordering column? I put an X in there and it did not care.
Just like we did for the other functions, setting `verbose = 'stop'` immediately stops R from processing the order. For this function, if variables are missing from either the dataset or the metadata, the function will not process until corrective action is performed.
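
The ordering behaviour described in this section, where variables known to the metadata come first in metadata order and the rest are moved to the end, can be sketched in base R. The `toy_df` and `toy_meta` objects are hypothetical, and this is not the `{xportr}` implementation.

```r
# Hypothetical dataset with columns out of order plus one column unknown to the metadata.
toy_df <- data.frame(
  AGE = 70,
  EXTRA = "x",
  USUBJID = "01-701",
  STUDYID = "CDISCPILOT01",
  stringsAsFactors = FALSE
)

# Hypothetical metadata giving the desired position of each known variable.
toy_meta <- data.frame(
  variable = c("STUDYID", "USUBJID", "AGE"),
  order = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Known variables sorted by their metadata order; unknown variables appended at the end.
known <- toy_meta$variable[order(toy_meta$order)]
known <- known[known %in% names(toy_df)]
unknown <- setdiff(names(toy_df), known)
toy_ordered <- toy_df[, c(known, unknown), drop = FALSE]

names(toy_ordered)
```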

## `xportr_format()`

TODO: No warning issue for incorrect format type. I put in a "DATA" format and it applied the format even though it is not a valid one.
Formats play an important role in the SAS language and have a column in specification files. Being able to easily apply formats to your `xpt` file will allow downstream users of SAS to quickly format the data appropriately when reading into a SAS-based system. `xportr_format()` can take these formats and apply them. Please reference `xportr_length()` or `xportr_label()` to note the missing `attr()` for formats in our `ADSL` dataset.

```{r}
var_spec_fmt <- var_spec %>%
mutate(format = if_else(variable == "TRTSDT", "DATA", format))
This example is slightly different from previous examples. You will need to use `xportr_type()` to coerce R Date variables and other types to character or numeric. Only then can you use `xportr_format()` to apply the format column to the dataset.

```{r, echo = TRUE}
adsl_fmt <- adsl %>%
xportr_type(metadata = var_spec, domain = "ADSL", verbose = "warn") %>%
xportr_format(metadata = var_spec, domain = "ADSL", verbose = "warn")
```

Success! We have taken the metadata formats and applied them to the dataset. Please inspect variables like `TRTSDT` or `DISONSDT` to see the `DATE9.` format being applied.

adsl_fmt <- xportr_format(adsl, var_spec_fmt, "ADSL", verbose = "warn")
```{r, max.height='300px', attr.output='.numberLines', echo = FALSE}
str(adsl_fmt)
```
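
To make the idea of a format carried as an attribute more concrete, here is a small base R sketch. It assumes the format is stored under an attribute named `format.sas`; that name is an assumption for this illustration, so please confirm it against the `str()` output above. The date value is invented.

```r
# A hypothetical treatment start date, coerced from an R Date to numeric
# (R stores Dates as days since 1970-01-01).
trtsdt <- as.numeric(as.Date("2014-01-02"))

# Attach a SAS format as an attribute; the attribute name is an assumption.
attr(trtsdt, "format.sas") <- "DATE9."

# Inspect the attributes now carried by the vector.
attributes(trtsdt)
```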

## `xportr_write()`

Finally, we want to
```{r, echo = TRUE, error = TRUE}
var_spec_fmt <- var_spec %>%
mutate(format = if_else(variable == "TRTSDT",
"NARNAR", format
))
TODO: xpt_validate catches my DATA format, but `xportr_format()` does not catch it.
TODO: I don't think `xportr_write()` works in the README and Get Started
adsl_fmt <- xportr_format(.df = adsl, metadata = var_spec_fmt, domain = "ADSL", verbose = "stop")
```

TODO: No information on bad formats or how many formats are applied to a dataset.

## `xportr_write()`

Finally, we want to write out an `xpt` dataset with all our metadata applied.

We will make use of our `xportr_metadata()` function to reduce repetitive calls to the metadata object and domain. We will use the default option for `verbose`, which is just `message`, and so will not set anything for `verbose`. In the `xportr_write()` function we will specify the path, which will just be our current working directory, set the dataset label and toggle `strict_checks` to be `FALSE`.

```{r, echo = TRUE, error = TRUE}
adsl %>%
@@ -372,20 +383,37 @@ adsl %>%
xportr_write(path = "adsl.xpt", label = "Subject-Level Analysis Dataset", strict_checks = FALSE)
```

Success! We have applied types, lengths, labels, ordering and formats to our dataset. Note the messages written out to the console. Remember that `TRTDUR` and `DCREASCD` are present in the dataset but not in the metadata. This impacts the messaging for lengths and labels, where `{xportr}` is printing out some feedback to us on the two issues. 5 types are also coerced and 36 variables re-ordered. Note that `strict_checks` was set to `FALSE`.

The next two examples showcase the `strict_checks = TRUE` option in `xportr_write()` where we will look at formats and labels.

```{r, echo = TRUE, error = TRUE}
adsl %>%
xportr_write(path = "adsl.xpt", label = "Subject-Level Analysis Dataset", strict_checks = TRUE)
```


As there are several `---DT` type variables, `xportr_write()` detects the lack of formats being applied. To correct this, remember you can use `xportr_type()` and `xportr_format()` to apply formats to your xpt dataset.

Below we have manipulated the labels to again be greater than 40 characters for `TRTSDT`. We have left the `xportr_label()` verbose option at its default so that it only produces a message. However, the `xportr_write()` function with `strict_checks = TRUE` will error out, as this is one of the many `xpt_validate()` checks going on behind the scenes.

```{r, echo = TRUE, error = TRUE}
var_spec_lbl <- var_spec %>%
mutate(label = if_else(variable == "TRTSDT",
"Length of variable label must be 40 characters or less", label
))
adsl %>%
xportr_metadata(var_spec, "ADSL") %>%
xportr_type() %>%
xportr_length() %>%
xportr_metadata(var_spec_lbl, "ADSL") %>%
xportr_label() %>%
xportr_order() %>%
xportr_type() %>%
xportr_format() %>%
xportr_write(path = "adsl.xpt", label = "Subject-Level Analysis Dataset", strict_checks = TRUE)
```


## Warnings around label length

## Future Work

* Using `{xportr}` to bulk process multiple datasets.
* Preparing xpt files for upload to a validation software.
`{xportr}` is still undergoing development. We hope to produce more vignettes and functions that will allow users to bulk process multiple datasets, as well as examples of piping `xpt` files and related documentation to a validation software service. As always, please let us know of any feature requests, documentation updates or bugs on our GitHub repo.
