Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testdata #142

Merged
merged 10 commits into from
Oct 20, 2022
Merged

Testdata #142

merged 10 commits into from
Oct 20, 2022

Conversation

cristinamullin
Copy link
Collaborator

No description provided.

cristinamullin and others added 6 commits October 18, 2022 11:11
added importFrom(rlang,":=") to depth profile function. Added rlang as import in description file
added test as well for changing "meters" to "m"
@cristinamullin
Copy link
Collaborator Author

cristinamullin commented Oct 18, 2022

checking for non-standard things in the check directory ... NOTE
Found the following files/directories:
'cli' 'commonmark' 'curl' 'data.table' 'digest' 'jsonlite' 'purrr'
'vctrs' 'yaml'

TADAProfileClean1 <- DepthProfileData(TADAProfile, unit = "m", transform = TRUE)
Warning messages:
1: In [<-.data.frame(*tmp*, targetUnit, value = list(WQX.Depth.TargetUnit.x = c(NA, :
provided 2 variables to replace 1 variables
2: In [<-.data.frame(*tmp*, targetUnit, value = list(WQX.Depth.TargetUnit.x = c(NA, :
provided 2 variables to replace 1 variables

possibly lines 409-445

@cristinamullin
Copy link
Collaborator Author

@jbousquin Here is what I have so far, it runs and serves expected results now, but throws a new warning when I run TADAProfile1 in the vignette (see: #142). I don't see this warning when I run it on your test data though. I plan to try and troubleshoot a bit more & then we can merge after.

I also addressed the “meters” and “m” issue as part of the autoclean function which runs within TADAdataRetrieval. Not sure if that is the best place, but for now I deal with a few specific USGS/EPA data compatibility issues there within autoclean.

FYI - I provided you with write access to the TADA repo. You can tag people for review now & do other things like merge a pull request in after reviewer approval (e.g. if you wanted you could merge my pull request in after review since I tagged you as a reviewer). Since I’ve mostly been working on the repo alone the last few months, I’ve been bypassing this review requirement for myself.

@cristinamullin
Copy link
Collaborator Author

cristinamullin commented Oct 19, 2022

Note: the original code removed the original unit information & conversion factor columns when transform = TRUE. I commented out that code for now so the fields will show up in the final dataset. Not sure what users would prefer - any thoughts? Note: I added ActivityEndTime.TimeZoneCode to cols checked

In this example:

mock data frame

ActivityDepthHeightMeasure.MeasureValue <- c(2.0, 1)
ActivityDepthHeightMeasure.MeasureUnitCode <- c("m", "ft")
ActivityTopDepthHeightMeasure.MeasureValue <- c(NaN, NaN)
ActivityTopDepthHeightMeasure.MeasureUnitCode <- c(NaN, NaN)
ActivityBottomDepthHeightMeasure.MeasureValue <- c(NaN, NaN)
ActivityBottomDepthHeightMeasure.MeasureUnitCode <- c(NaN, NaN)
ResultDepthHeightMeasure.MeasureValue <- c(NaN, NaN)
ResultDepthHeightMeasure.MeasureUnitCode <- c(NaN, NaN)
ActivityEndTime.TimeZoneCode <- c(NaN, NaN)
TADAProfile <- data.frame(ActivityDepthHeightMeasure.MeasureValue,
ActivityDepthHeightMeasure.MeasureUnitCode,
ActivityTopDepthHeightMeasure.MeasureValue,
ActivityTopDepthHeightMeasure.MeasureUnitCode,
ActivityBottomDepthHeightMeasure.MeasureValue,
ActivityBottomDepthHeightMeasure.MeasureUnitCode,
ResultDepthHeightMeasure.MeasureValue,
ResultDepthHeightMeasure.MeasureUnitCode,
ActivityEndTime.TimeZoneCode)
x = DepthProfileData(TADAProfile)

These 3 columns would now be added:
"WQX.ActDepth.ConversionFactor"
"ActivityDepthHeightMeasure.MeasureUnitCode.Original"
"WQX.Depth.TargetUnit"

dplyr::relocate("ActivityDepthHeightMeasure.MeasureUnitCode",
.after = "ActivityDepthHeightMeasure.MeasureValue"
)
# uncoment below to delete ActDepth.Conversion.Unit column
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to uncomment to remove ActDepth.Conversion.Unit from clean dataset

@jbousquin
Copy link
Collaborator

On Note: the original code removed the original unit information & conversion factor columns when transform = TRUE.

Three pieces of information (1) Original Units (2) Updated Units, and (3) Conversion Factor

I see three options (with some combinations thereof):

  1. Replace the Original Units and delete Conversion Factor (results in fewest columns)
  2. Keep everything (best from a data management perspective)
  3. Give the user the option what to keep (flexible but don't want too many args)

What I did was to keep the original, and then intermediate columns (equivalent to conversion factor) had an optional argument where the default was to delete it but the user could change that to keep the column (mainly for debugging). At the end of my workflow I moved the original unit column to a secondary table anyway, so I figured keeping it small at this stage didn't matter (endpoint for TADA has all columns so this might be different).

@jbousquin
Copy link
Collaborator

jbousquin commented Oct 19, 2022

autoclean sounds a lot like the pre-processing of units I did (e.g., '%' -> 'percent', 'deg C' -> 'degC') so that the units package would recognize them correctly. Looks like the change is specific, i.e., avoids 'centimeter' -> 'centim'. Will it catch 'Meters'? Only caution I might add is do you want them characteristic specific (e.g., does 'meters' ever not mean 'm' in some other context). I know there are some units libraries in R but I'm not familiar with them, might be able to leverage some of those somewhere between autoclean and conversions in the table for more basic unit recognition issues like this.

Good catch on the rlang warning (importFrom(rlang,":=")), totally open to something else if there is another way to pass the column as a string to dplyr

check_autoclean_meters_works <- TADAdataRetrieval(statecode = "UT",
characteristicName = c("Ammonia", "Nitrate", "Nitrogen"),
startDate = "01-01-2021")
expect_equal(check_autoclean_meters_works$ActivityDepthHeightMeasure.MeasureUnitCode[975], "m"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this right you're running the function on the data retrieved by TADAdataRetrieval function. If data is ever added that will be included in this query result could the results at index 975 change? One way around that is to have the current result as a static file in the test data and run the function on that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realized because this is testing autoclean and that is part of the retrieval process its a bit more complicated... so may have to just keep an eye out if this test fails that it may be an index change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. USGS is also planning to actually change their data in the future (~summer 2023) or consistency with WQX, with includes things like using "m" instead of "meters". When that happen, it would would render this specific piece of code within autoclean and this test unnecessary --so in the future this can probably be deleted.

This is all helpful to keep in mind so I'll try to track this discussion somewhere, maybe in a new issue. It might end up being helpful to track if an index changes on retrieved data in certain cases too, because that would mean the backend data profile or data itself is changing for some reason (e.g. could be a sign of updates to WQX 3.0 profiles or USGS actually changing their data in the future, etc.).

@jbousquin
Copy link
Collaborator

@jbousquin Here is what I have so far, it runs and serves expected results now, but throws a new warning when I run TADAProfile1 in the vignette (see: #142). I don't see this warning when I run it on your test data though. I plan to try and troubleshoot a bit more & then we can merge after.

I also addressed the “meters” and “m” issue as part of the autoclean function which runs within TADAdataRetrieval. Not sure if that is the best place, but for now I deal with a few specific USGS/EPA data compatibility issues there within autoclean.

FYI - I provided you with write access to the TADA repo. You can tag people for review now & do other things like merge a pull request in after reviewer approval (e.g. if you wanted you could merge my pull request in after review since I tagged you as a reviewer). Since I’ve mostly been working on the repo alone the last few months, I’ve been bypassing this review requirement for myself.

Are you still seeing the warning referenced here when running the vignette? (I tried to recreate but didn't see the warning)

@cristinamullin
Copy link
Collaborator Author

cristinamullin commented Oct 19, 2022

@jbousquin I spent some time troubleshooting the warning this afternoon & was able to fix it. The current pull request is now up to date with my commits from today. Everything should be running smoothly now with no warnings or errors. The changes I made actually removed the rlang,":=" dependency too, so I removed that dependency from the package for now.

I think the only thing left to do is decide which columns to keep when transform = TRUE vs. FALSE.
I think I want to adjust it so:
A. We keep everything (i.e., the three pieces of information (1) Original Units (2) Updated Units, and (3) Conversion Factor) when transform = FALSE. That way a user can review it, but...
B. When transform = TRUE, it would Replace the Original Units and delete Conversion Factor (results in fewest columns)

What do you think?

@jbousquin
Copy link
Collaborator

Sounds good to me - that way users have access to the intermediates by setting transform = False, but the default is what we anticipate most will want (fewer columns).

@cristinamullin cristinamullin merged commit 946f6d3 into develop Oct 20, 2022
@cristinamullin cristinamullin deleted the testdata branch October 20, 2022 18:28
@cristinamullin cristinamullin restored the testdata branch October 20, 2022 18:33
@cristinamullin cristinamullin deleted the testdata branch October 20, 2022 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants