Update the storms dataset #6320

Merged
merged 16 commits into from
Aug 18, 2022

Conversation

steveharoz
Contributor

@steveharoz steveharoz commented Jul 5, 2022

(closes #6319)

  1. A bug in the code to reformat the data was causing some storms to be dropped. That's been fixed.
  2. Data for 2021 storms has been added.
  3. I've added data for earlier storms (1852-1974).

Point 3 might be worth discussing. Whoever originally added the dataset to dplyr dropped storms before 1975. I've been doing the same since I've been updating it, but I haven't seen a clear rationale. Considerations for adding the early data:

  • PRO: More data
  • CON: Bigger data file 42k -> 88k
  • CON: The early data may be less useful. E.g., many storms from the 1800s have only a single data point.
  • PRO: This is supposed to be an educational dataset. Filtering out the less useful data is a simple realistic exercise for learning dplyr.
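
The filtering exercise mentioned in the last bullet could look roughly like the sketch below. It uses base R and a small made-up data frame (`obs`, with invented names and wind values) rather than the real `storms` data, so the numbers are illustrative only.

```r
# Toy stand-in for the storms data: one row per observation (values are made up).
obs <- data.frame(
  name = c("AL011852", "AL021852", "AL021852", "AMY", "AMY", "AMY"),
  year = c(1852, 1852, 1852, 1975, 1975, 1975),
  wind = c(70, 50, 60, 25, 35, 60)
)

# Count observations per storm, keyed on name and year together.
counts <- aggregate(wind ~ name + year, data = obs, FUN = length)
names(counts)[names(counts) == "wind"] <- "n_obs"

# Storms that were recorded at only a single point in time.
single_point <- counts[counts$n_obs == 1, ]

# Keep only the modern era, mirroring the 1975 cutoff discussed above.
modern <- obs[obs$year >= 1975, ]
```

With dplyr the same steps would map onto `count()` and `filter()`; base R is used here only to keep the sketch dependency-free.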

@mydatacz

Hi Steve!

Quick suggestion: add the other status codes to the line below. That way, if someone wants to keep the filter commented out, they won't have to go back and clean up the status for those storms.

Thanks again!

status = factor(recode(status, "HU" = "hurricane", "TS" = "tropical storm", "TD" = "tropical depression"))
EX – extratropical
SD – subtropical depression
SS – subtropical storm
LO – low
WV – tropical wave
DB – disturbance
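
One way to cover all nine codes is a named lookup vector, sketched below in base R. The first three labels come from the existing `recode()` call and the rest from the list above; `raw_status` is a hypothetical input, not the parser's actual output.

```r
# Lookup from HURDAT2 two-letter codes to readable labels.
status_labels <- c(
  HU = "hurricane",
  TS = "tropical storm",
  TD = "tropical depression",
  EX = "extratropical",
  SD = "subtropical depression",
  SS = "subtropical storm",
  LO = "low",
  WV = "tropical wave",
  DB = "disturbance"
)

# Hypothetical raw codes as they might come out of the parser.
raw_status <- c("HU", "EX", "SS", "TD", "LO")

# Index the lookup by code, then convert to a factor as the original line does.
status <- factor(unname(status_labels[raw_status]))
```

The same code/label pairs could instead be passed as extra arguments to the existing `recode()` call.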

@steveharoz steveharoz marked this pull request as draft July 11, 2022 04:33
@steveharoz
Contributor Author

steveharoz commented Jul 11, 2022

Trying to figure out what's going on with storm categorization. Bug in my parser? Or inconsistency in NOAA's categorization? These records should be classified as hurricanes (winds > 64 knots) but are subtropical storms, tropical storms, or other lows:

> storms %>% 
+     filter(category > 0, !(status %in% c("hurricane", "EX"))) %>% 
+     select(name, year, month, day, hour, lat, long, status, wind, pressure)
# A tibble: 6 × 10
  name      year month   day  hour   lat  long status          wind pressure
  <chr>    <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>          <int>    <int>
1 AL091968  1968     9    20    12  35.5 -49.5 SS                75      976
2 AL091968  1968     9    21    12  39.6 -44.7 SS                65      982
3 AL181979  1979    10    24    18  40.5 -62   SS                65      985
4 EMILY     2005     7    20    18  25   -98.7 tropical storm    70      975
5 DORIAN    2019     9     7    18  42.8 -64.6 LO                80      954
6 DORIAN    2019     9     8     0  45.2 -62.9 LO                80      956

@mydatacz

mydatacz commented Jul 11, 2022

I don't think it's either issue. The data is likely correct.

Subtropical, Extratropical, Lows and Disturbances can all have high wind intensity, but that doesn't mean they are hurricanes. A storm needs to be determined to be a tropical cyclone before it can rise to the level of a hurricane (based on wind speed/intensity). Definitions of types of storms here: https://www.nhc.noaa.gov/aboutgloss.shtml

https://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atl-1851-2021.pdf

HU (Spaces 20-21, before 4th comma) – Status of system. Options are:
TD – Tropical cyclone of tropical depression intensity (< 34 knots)
TS – Tropical cyclone of tropical storm intensity (34-63 knots)
HU – Tropical cyclone of hurricane intensity (> 64 knots)
EX – Extratropical cyclone (of any intensity)
SD – Subtropical cyclone of subtropical depression intensity (< 34 knots)
SS – Subtropical cyclone of subtropical storm intensity (> 34 knots)
LO – A low that is neither a tropical cyclone, a subtropical cyclone, nor an extratropical cyclone (of any intensity)
WV – Tropical Wave (of any intensity)
DB – Disturbance (of any intensity)

So if a storm does not meet the requirements to be classified as a tropical cyclone, regardless of wind speed, it will never have a status of hurricane.
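
A rough consistency check along these lines is sketched below in base R. It only tests the three tropical cyclone statuses against the wind ranges quoted above and treats every other status as consistent. The function name `wind_matches_status` is made up for illustration, and the hurricane bound is taken as 64 knots or more (an assumption, since the quoted text says "> 64" while the tropical storm range ends at 63).

```r
# Returns TRUE where the wind speed (knots) is consistent with the status.
# Non-tropical statuses (EX, SD, SS, LO, WV, DB) are never flagged here.
wind_matches_status <- function(status, wind) {
  ok <- rep(TRUE, length(status))
  td <- status == "tropical depression"
  ts <- status == "tropical storm"
  hu <- status == "hurricane"
  ok[td] <- wind[td] < 34
  ok[ts] <- wind[ts] >= 34 & wind[ts] <= 63
  ok[hu] <- wind[hu] >= 64  # assumption: 64 kt counts as hurricane intensity
  ok
}

# Status and wind values taken from rows quoted in this thread.
check <- wind_matches_status(
  status = c("tropical storm", "LO", "SS", "hurricane"),
  wind   = c(70, 80, 75, 95)
)
```

Here only the first element (the EMILY row, a tropical storm at 70 knots) is flagged as inconsistent; the LO and SS rows pass because those statuses are not tied to hurricane wind thresholds.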

For example, you've probably experienced winds of 34-47 knots. That is simply a gale, unless the wind was associated with a storm already determined to be a tropical cyclone (based on criteria other than wind).

Or you may have been in a winter Nor'easter (extratropical storm that can have winds over 65 knots but isn't a hurricane). I don't know the measurements behind it, but reading the definition and the link below it appears there are multiple characteristics/metrics used to determine if a storm is a tropical cyclone.

https://www.weather.gov/source/zhu/ZHU_Training_Page/tropical_stuff/sub_extra_tropical/subtropical.htm

Category > 0 had me stumped also at first, but then I realized the categories are based on a wind scale. Non-tropical cyclone storms also get assigned categories based on wind in the data. At first glance they appear to coincide with the Saffir-Simpson Wind scale, but I don't know that for sure. Having followed weather reports closely as a sailor, though, I've never heard NOAA refer to a category 1 gale or category 1 nor'easter, so I think NOAA only uses categories to describe hurricanes. https://www.nhc.noaa.gov/aboutsshws.php (see my second comment below - just realized the category data did not come from the original file).

luis.df <- storms %>%
  filter(name == "Luis") %>%
  select(year, name, category, status, wind)

47 | 1995 | Luis | 2 | hurricane | 95
48 | 1995 | Luis | 2 | hurricane | 90

53 | 1995 | Luis | 2 | EX | 85
54 | 1995 | Luis | 2 | EX | 95
55 | 1995 | Luis | 3 | EX | 105
56 | 1995 | Luis | 2 | EX | 90
57 | 1995 | Luis | 1 | EX | 75
58 | 1995 | Luis | 0 | EX | 60

Hurricane Luis, for example, was an extratropical storm at some point, with wind categories 1, 2, and 3. The other characteristics of the storm at that time no longer met the criteria for a tropical cyclone, even though the winds were often higher than when it was a tropical cyclone.

So I guess the take-away would be not to filter the data based on category thinking you are only going to get hurricanes.
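
To make the difference concrete, here is a small base R sketch using only the Luis rows shown above (status, category, and wind copied from that output; the data frame name `luis` is just for illustration).

```r
# The eight Luis rows quoted above.
luis <- data.frame(
  status   = c("hurricane", "hurricane", "EX", "EX", "EX", "EX", "EX", "EX"),
  category = c(2, 2, 2, 2, 3, 2, 1, 0),
  wind     = c(95, 90, 85, 95, 105, 90, 75, 60)
)

# Filtering on category also picks up the extratropical rows.
by_category <- luis[luis$category > 0, ]

# Filtering on status keeps only rows where the storm was a hurricane.
by_status <- luis[luis$status == "hurricane", ]
```

On these rows, the category filter keeps seven observations while the status filter keeps two.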

@mydatacz

Actually, I just realized when looking at the original hurdat2 format file linked above that category is not a column coming from NOAA. Since it is a column calculated as part of the dplyr file, maybe only add the category for the tropical depressions, tropical storms and hurricanes, and leave the other storm types as NA? I'm not sure what the purpose of the -1 and 0 categories is for tropical depressions and tropical storms, since the Saffir-Simpson Wind scale doesn't start until 1 for hurricanes. If you decide to use category for only hurricanes, all the other storm status categories could be 0.

@steveharoz
Contributor Author

Yes, category is calculated from wind speed. I've made that clearer in the docs and set it to NA for everything that's not a hurricane.
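
As a rough illustration of that rule (not the actual code in data-raw/storms.R), the sketch below derives a category from wind speed in base R. The breakpoints are the Saffir-Simpson lower bounds in knots (64, 83, 96, 113, 137); the function name `hurricane_category` is hypothetical.

```r
# Lower bounds (knots) of Saffir-Simpson categories 1 to 5.
ss_breaks <- c(64, 83, 96, 113, 137)

hurricane_category <- function(status, wind) {
  # findInterval() gives 0 below 64 kt, then 1 to 5 for each band.
  band <- findInterval(wind, ss_breaks)
  # Only hurricanes get a category; everything else is NA.
  ifelse(status == "hurricane" & band > 0, band, NA_integer_)
}

categories <- hurricane_category(
  status = c("hurricane", "hurricane", "EX", "tropical storm"),
  wind   = c(95, 140, 105, 60)
)
```

The extratropical row at 105 knots gets NA rather than 3, which is the behaviour described in the comment above.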

@steveharoz steveharoz marked this pull request as ready for review July 12, 2022 19:01
R/data-storms.R Outdated
#' }
#' @examples
#'
#' # show a plot of the storm paths
#' # show a plot of the storm paths in 1975 or later
Member

Why 1975?

Contributor Author

The facets get too squished in the figure if too many years are included. Is there a way to make the figure bigger in the docs? https://dplyr.tidyverse.org/reference/storms.html

Contributor Author

Added a little comment explaining that in the example

R/data-storms.R Outdated
#' ggplot(storms) +
#' storms %>%
#' filter(year >= 1975) %>%
#' ggplot() +
Member

Suggested change
#' ggplot() +
#' ggplot() +

Member

And maybe put the aes() in the ggplot() call?

Contributor Author

The aes on its own line is just my one-thing-per-line code style. Feel free to change it though. Commit coming soon...

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
@hadley
Member

hadley commented Aug 9, 2022

@steveharoz do you want to finish this off?

@steveharoz
Contributor Author

@hadley Yeah. I'll finish it later this week.

steveharoz and others added 3 commits August 17, 2022 16:41
Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
Co-authored-by: Davis Vaughan <davis@rstudio.com>
@hadley
Member

hadley commented Aug 17, 2022

Thanks for the update! I think the last question to resolve is whether it's worthwhile to include the rows prior to 1975. I'm worried that this has a high likelihood of breaking existing graphics for little additional gain. I think it's probably safer to not include the historical data here.

@steveharoz
Contributor Author

@hadley Yeah, I see the benefit of only having the clean and more complete data.

@hadley
Member

hadley commented Aug 18, 2022

Thanks! I did a couple more docs tweaks because I realised that this is the perfect place to use inline R code.

@steveharoz
Contributor Author

Good call on the inline R!

@hadley hadley merged commit e7512e0 into tidyverse:main Aug 18, 2022
Successfully merging this pull request may close these issues.

Update Storms dataset
4 participants