weather uses two timezones - not clear which matches flights #19

Closed
garrettgman opened this issue Jan 1, 2017 · 9 comments
Labels
bug an unexpected problem or unintended behavior wip work in progress

Comments

@garrettgman
Member

In weather, the time_hour variable is offset by five hours from the time displayed across the year, month, day, and hour variables.

[Image: screen shot 2017-01-01 at 12 46 53 pm]

It is not clear which time matches the times in flights (where year, month, day, hour, and time_hour all agree). Given the offset, it is possible that time_hour is in the America/New_York timezone and the other variables are in UTC.

@rmcd1024
Contributor

The weather data is in UTC, which I think is standard. If you look at the code generating "time_hour" in weather.r, the time zone is not specified, which makes me think that it acquires whatever local time zone offset was in place when the package was built (from ?ISOdatetime: '"" is the current time zone').
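To illustrate the point about an unspecified time zone, here is a minimal base R sketch. It assumes the session time zone can be changed through the TZ environment variable, and the choice of America/Chicago as the "build machine" zone is purely hypothetical. It shows that the same clock reading passed to ISOdatetime() lands on a different instant depending on whether tz is left at its default.

```r
# Base R only. Temporarily change the session time zone to show what the
# default tz = "" picks up. "America/Chicago" is a hypothetical build-machine zone.
old_tz <- Sys.getenv("TZ", unset = NA)
Sys.setenv(TZ = "America/Chicago")

implicit <- ISOdatetime(2013, 1, 1, 0, 0, 0)              # tz = "": session zone
explicit <- ISOdatetime(2013, 1, 1, 0, 0, 0, tz = "UTC")  # zone stated outright

# Same clock reading, different instants: the gap is the session's UTC offset.
offset_hours <- as.numeric(difftime(implicit, explicit, units = "hours"))
print(offset_hours)

# Restore the previous setting so the rest of the session is unaffected.
if (is.na(old_tz)) Sys.unsetenv("TZ") else Sys.setenv(TZ = old_tz)
```

Passing tz explicitly removes the dependence on whichever machine runs the code.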

@ltierney

It looks like year/month/day/hour are in America/New_York in flights and in UTC in weather. That doesn't match time_hour, which is America/Chicago for weather and UTC for flights. These don't match up properly. Recalculating the time_hour variables as

library(dplyr); library(lubridate)  # mutate() and make_datetime()
flights <- mutate(flights, time_hour = make_datetime(year, month, day, hour, tz = "America/New_York"))
weather <- mutate(weather, time_hour = make_datetime(year, month, day, hour, tz = "UTC"))

produces values that seem to match up properly on a join.
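A short sketch of why values built this way can line up: date-times compare by instant, not by the clock reading they display. This uses base R's as.POSIXct() rather than make_datetime(), and the sample timestamps are chosen only for illustration.

```r
# The same instant written in two zones.
ny  <- as.POSIXct("2013-01-01 05:00:00", tz = "America/New_York")
utc <- as.POSIXct("2013-01-01 10:00:00", tz = "UTC")

# Comparison uses the underlying seconds since the epoch, so these are equal.
same_instant <- ny == utc
print(same_instant)

# A local clock reading mislabelled as UTC is a different instant altogether.
mislabelled <- as.POSIXct("2013-01-01 05:00:00", tz = "UTC")
gap_hours <- as.numeric(difftime(ny, mislabelled, units = "hours"))
print(gap_hours)
```

The second half mirrors the situation where a column holds local clock readings but carries a UTC label: the values look right when printed, yet they sit several hours away from the instants they are meant to represent.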

@hadley added the bug and wip labels on Jun 7, 2018
@rudeboybert

rudeboybert commented Jun 7, 2018

Let's put everything on the table with the following reprex:

# devtools::install_github("hadley/nycflights13")
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(nycflights13))

flights %>% 
  select(year, month, day, hour, time_hour) %>% 
  slice(1)
#> # A tibble: 1 x 5
#>    year month   day  hour time_hour          
#>   <int> <int> <int> <dbl> <dttm>             
#> 1  2013     1     1     5 2013-01-01 05:00:00
flights$time_hour[1]
#> [1] "2013-01-01 05:00:00 UTC"

weather %>% 
  select(year, month, day, hour, time_hour) %>% 
  slice(1)
#> # A tibble: 1 x 5
#>    year month   day  hour time_hour          
#>   <dbl> <dbl> <int> <int> <dttm>             
#> 1  2013     1     1     0 2012-12-31 19:00:00
weather$time_hour[1]
#> [1] "2012-12-31 19:00:00 EST"
  • flights
    • year/month/day/hour match time_hour's date and time, but the latter's time zone is off.
  • weather
    • year/month/day/hour is ahead of time_hour's time by 5 hours.
    • At the very least, the timezone output should be of the form 20XX-XX-XX XX:XX:XX America/New_York and not 20XX-XX-XX XX:XX:XX EST, as per @rmcd1024's comment in Fixed weather$time_hour timezone and added to documentation for fligh… #23 on possible Eastern Standard Time (EST) vs Eastern Daylight Time (EDT) confusion.
    • Based on the visualization below of June EWR temperatures, and a cursory scientific investigation suggesting that 3pm is the hottest time of the day in the summer, year/month/day/hour do appear to be in UTC (shifted 4 to 5 hours), as @ltierney suggested. This suggests that either these 4 variables need to be corrected or all weather measurements need to be shifted to match the correct year/month/day/hour.

weather %>% 
  filter(origin == "EWR", month == 6) %>%
  ggplot(aes(x = hour, y = temp)) +
  geom_point() +
  geom_smooth() +
  labs(title = "June 2013 hourly temperatures at EWR") + 
  geom_vline(xintercept = 15, col = "red", size = 1)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2018-06-06 by the reprex package (v0.2.0).
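The "4 to 5 hours" shift can be checked with a small base R sketch. The specific dates below are arbitrary picks for illustration. It converts a 3pm New York clock reading to UTC and prints the zone's UTC offset in summer and in winter.

```r
# 3pm local time at a New York area airport on an arbitrary June day.
peak_local <- as.POSIXct("2013-06-15 15:00:00", tz = "America/New_York")

# The same instant expressed as an hour of the day in UTC.
peak_utc_hour <- format(peak_local, "%H", tz = "UTC")
print(peak_utc_hour)

# UTC offset in effect in June (daylight time) and in January (standard time).
june_offset <- format(peak_local, "%z")
jan_offset  <- format(as.POSIXct("2013-01-15 15:00:00", tz = "America/New_York"), "%z")
print(c(june_offset, jan_offset))
```

If the hour column were recorded in UTC, a mid-afternoon local peak would therefore be expected several hours later on the plot's x-axis than the red line at 15.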

Question for @hadley: What do you think the timezone/output of flights$time_hour[1] should be? (The same idea applies to weather$time_hour[1])

  1. 2013-01-01 10:00:00 UTC. @rmcd1024 favors this as it is industry convention.
  2. 2013-01-01 05:00:00 America/New_York. I favor this as it will cause users of this package less confusion and makes joining easier.
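The two options describe the same instant and differ only in how it is displayed. A minimal base R sketch of that, using the tzone attribute to switch the display zone:

```r
# Option 1: the instant displayed in UTC.
x <- as.POSIXct("2013-01-01 10:00:00", tz = "UTC")
opt1 <- format(x, "%Y-%m-%d %H:%M:%S %Z")

# Option 2: the same instant, with only the display zone changed.
y <- x
attr(y, "tzone") <- "America/New_York"
opt2 <- format(y, "%Y-%m-%d %H:%M:%S %Z")

print(c(opt1, opt2))

# The stored value (seconds since the epoch) is untouched.
print(identical(as.numeric(x), as.numeric(y)))
```

Note that printing shows the abbreviation in effect on that date (EST here) even though the attribute holds the full zone name America/New_York.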

@hadley
Member

hadley commented Jun 7, 2018

I think the time zones for both datasets should be America/New_York, since this makes joining and interpretation easier.

@rudeboybert do you want to apply @ltierney's fix in your PR? Or should we close that one and start anew?

@rudeboybert

I'll close #23 and start anew sometime next week.

@hadley
Member

hadley commented Jun 20, 2018

library(nycflights13)

flights[1, c("year", "month", "day", "hour", "time_hour")]
#> # A tibble: 1 x 5
#>    year month   day  hour time_hour          
#>   <int> <int> <int> <dbl> <dttm>             
#> 1  2013     1     1     5 2013-01-01 05:00:00
flights$time_hour[1]
#> [1] "2013-01-01 05:00:00 EST"
attr(flights$time_hour, "tzone")
#> [1] "America/New_York"

weather[1, c("year", "month", "day", "hour", "time_hour")]
#> # A tibble: 1 x 5
#>    year month   day  hour time_hour          
#>   <dbl> <dbl> <int> <int> <dttm>             
#> 1  2013     1     1     1 2013-01-01 01:00:00
weather$time_hour[1]
#> [1] "2013-01-01 01:00:00 EST"
attr(weather$time_hour, "tzone")
#> [1] "America/New_York"

Created on 2018-06-20 by the reprex package (v0.2.0).
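One way to check this kind of agreement across every row, rather than just the first, is sketched below. components_match() is a hypothetical helper, not part of nycflights13, and the two small data frames are hand-built stand-ins for single rows of the real tables.

```r
# Hypothetical helper: TRUE for rows whose year/month/day/hour columns agree
# with the local clock reading of time_hour (in time_hour's own time zone).
components_match <- function(df) {
  lt <- as.POSIXlt(df$time_hour)  # honours the column's tzone attribute
  df$year == lt$year + 1900 &
    df$month == lt$mon + 1 &
    df$day == lt$mday &
    df$hour == lt$hour
}

# A consistent row: components and time_hour tell the same story.
ok <- data.frame(year = 2013, month = 1, day = 1, hour = 5)
ok$time_hour <- as.POSIXct("2013-01-01 05:00:00", tz = "America/New_York")

# An inconsistent row, modelled on the earlier weather output in this thread.
bad <- data.frame(year = 2013, month = 1, day = 1, hour = 0)
bad$time_hour <- as.POSIXct("2012-12-31 19:00:00", tz = "America/New_York")

print(components_match(ok))
print(components_match(bad))
```

On the real tables, something like all(components_match(flights)) would be the analogous check, assuming the package is loaded.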

@hadley closed this as completed in d14b058 on Jun 20, 2018
@hadley
Member

hadley commented Jun 20, 2018

I plan to submit to CRAN in one week (June 27), so I'd really appreciate it if someone could double check my work and let me know if I've missed anything.

@rudeboybert

Looks good on my end. Thanks!

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(nycflights13))

# Correct EST Standard vs EDT Daylight Savings (2013-03-10 thru 2013-11-03) for weather
weather$time_hour[1]
#> [1] "2013-01-01 01:00:00 EST"
weather$time_hour[13000]
#> [1] "2013-06-29 06:00:00 EDT"

# Correct EST Standard vs EDT Daylight Savings (2013-03-10 thru 2013-11-03) for flights
flights$time_hour[1]
#> [1] "2013-01-01 05:00:00 EST"
flights$time_hour[150000]
#> [1] "2013-03-15 17:00:00 EDT"

# Roughly hottest point of the day corresponds to 3pm
weather %>% 
  filter(origin == "EWR", month == 6) %>%
  ggplot(aes(x = hour, y = temp)) +
  geom_point() +
  geom_smooth() +
  labs(title = "June 2013 hourly temperatures at EWR") + 
  geom_vline(xintercept = 15, col = "red", size = 1)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2018-06-20 by the reprex package (v0.2.0).
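For reference, the EST/EDT switch on the two 2013 transition dates can be seen directly with an hourly sequence in base R. This is a standalone sketch and does not touch the package data.

```r
ny <- "America/New_York"

# Four consecutive hours starting at midnight on the spring-forward date.
spring <- seq(as.POSIXct("2013-03-10 00:00:00", tz = ny), by = "hour", length.out = 4)
spring_labels <- format(spring, "%H:%M %Z")
print(spring_labels)  # "00:00 EST" "01:00 EST" "03:00 EDT" "04:00 EDT"

# Four consecutive hours starting at midnight on the fall-back date.
fall <- seq(as.POSIXct("2013-11-03 00:00:00", tz = ny), by = "hour", length.out = 4)
fall_labels <- format(fall, "%H:%M %Z")
print(fall_labels)    # "00:00 EDT" "01:00 EDT" "01:00 EST" "02:00 EST"
```

The 02:00 hour does not exist on 2013-03-10 and the 01:00 hour occurs twice on 2013-11-03, which is worth keeping in mind when joining hourly data on local clock components alone.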

@rmcd1024
Contributor

This looks good to me, thanks for revising this.


5 participants