---
title: "Weather data pipeline for R"
author: "By Elliot Cohen, Michael Piccirelli and James McCreight "
date: "January 8, 2015"
output:
html_document:
toc: yes
pdf_document:
toc: yes
toc_depth: 3
theme: spacelab
---
## Motivation
Long observational records of high-resolution weather data are instrumental to a wide range of research and development applications. Such data are essential to energy demand forecasts (Segal et al. 1992; Sailor 2001; Crowley et al. 2003; Thatcher 2007; Cohen et al. 2014), building energy models (Deru et al. 2011), renewable energy feasibility and siting assessments (Lambert et al. 2005), crop insurance programs (Hellmuth et al. 2009), emergency preparedness and early warning systems (Rogers and Tsirkunov 2011), and much more.
Globally, national weather services and climate information centers such as the U.S. National Oceanic and Atmospheric Administration (NOAA), Britain's Met Office, and the Indian Institute of Tropical Meteorology (IITM) collect, curate, analyze, and publish meteorological data from thousands of weather stations. NOAA's National Climatic Data Center (NCDC) is the world's largest archive of weather data.
NOAA scientists offer a wealth of meteorological/climatological information through the NCDC data portal. The data is available on the [Online Climate Data Directory](http://www.ncdc.noaa.gov/cdo-web/) website. It is also available via FTP, which is more efficient for batch queries, and is the method we will use here.
The following article, and the accompanying `weatheR` library built for the statistical computing language R, is intended to facilitate geo-referenced, quality-controlled batch queries of NOAA's National Climatic Data Center (NCDC)--the world's largest active archive of weather data. Our work builds on "Downloading and processing NOAA hourly weather station data" by Delameter et al. (YEAR). Our goal is to make meteorological data more accessible to the wide range of scientists, engineers, and practitioners who are familiar with data analysis in R, but less familiar with handling large meteorological datasets in particular. The weatheR library introduces new functionality and automation previously unavailable:
* Identification and selection of nearest-neighbor weather stations for virtually any place on Earth. Locations of interest are geo-referenced via the Google Maps API.
* Batch query with a simple character string of location names (e.g. "New York City", "Tiananmen Square", "Abuja, Nigeria")
* Built-in quality controls that select the "best" data based on geographic proximity and completeness of the data record.
* Quick visual inspection to make sure you're getting the right data.
## The weatheR library
### Open-Source and Available to All!
We created the `weatheR` package with all the necessary functions to download, transform, and plot NOAA weather station data. The package depends on `httr`, `ggmap`, `ggplot2`, and `grid`; the code walked through below also uses `plyr` (for `ddply` and `count`) and `data.table` (for `rbindlist`). The package is available on GitHub at [https://github.com/mpiccirilli/weatheR](https://github.com/mpiccirilli/weatheR). To start using the `weatheR` package immediately, all you need are the following three lines of code:
```{r weatheR, eval=FALSE, echo=TRUE}
## install weatheR library (beta version from github)
require(devtools)
install_github("mpiccirilli/weatheR")
require(weatheR)
```
### Functions
This package has **utility** functions and **data** functions. The **data** functions allow users to download data from the NOAA database, whereas the **utility** functions primarily facilitate the data download process. We'll first define each function, then go through an example of downloading data with several of the **data** functions, explaining the mechanics of each utility function along the way.
#### Utility Functions:
- `allStations`: allows the user to download the full list of Integrated Surface Database (ISD) weather stations. A snapshot of this list is included in the package and can be loaded into the session by calling `data(stations)`. This function exists so that, if the list is updated, the user does not have to rely solely on the snapshot bundled with the package.
- `kNStations`: this function is the heart of the package and is used by every data-downloading function. Using the great-circle distance calculation, it finds the k-nearest weather stations to a given input location. The distance between a reference point and each station is reported in kilometers.
- `plotStations`: plots each reference point and its respective k-nearest stations, providing a quick visual inspection to make sure you're getting the right data.
- `interpolateData`: a narrowly scoped function whose sole purpose is to interpolate missing weather station data so that subsequent analyses can run on a complete series. It takes a list of dataframes, which is the output format of all the data-downloading functions.
- `plotDailyMax`: an experimental function for plotting daily maximum temperatures; it may be removed in a future version.
#### Data Functions:
There are four data functions, each offering a different degree of automation. At the lowest level, you specify the exact weather station of interest by its NOAA identifier. At the highest level, the package returns the single 'best' station based on completeness of data and proximity to a given location, and linearly interpolates missing temperature values. From the lowest degree of automation to the highest:
1. Return a specific station, found by its WBAN ID
- `getStationsByID`: This allows users to specify a weather station of their choice. The user can either look up stations before using the package, or use the `kNStations` function to pre-select a station of interest.
1. Return k-nearest stations to a location
- `getStationsByLocation`: This is the primary driver of the package. It allows the user to specify any location, whether a city or any landmark that the Google Maps API can identify. Locations are first geo-referenced via the Google Maps API, and the `kNStations` function then finds the k-nearest stations to each location.
1. Return the "best" station to a given location
- `getFilteredStations`: This builds on the prior function by applying customizable quality controls, based on geographic proximity to the geo-referenced location and completeness of the weather station's data record, to return the "best" weather station for each location. The filtering options are discussed more thoroughly later.
1. Return the "best" station to a given location and interpolate missing data
- `getInterpolatedData`: This function goes one step beyond the previous one. It makes the same station selection based on proximity and completeness of data, then linearly interpolates all missing temperature values.
Below is a table of all the functions mentioned above:
```{r package functions, message=FALSE, eval=TRUE, echo=FALSE}
require(knitr)
pkgFunctions <- data.frame(Utilities = c("allStations", "kNStations",
                                         "plotStations", "interpolateData"),
                           Data = c("getStationsByID", "getStationsByLocation",
                                    "getFilteredStations", "getInterpolatedData"))
kable(pkgFunctions, format = "markdown")
```
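To make these levels concrete, here is a sketch of what calls at the second and fourth levels might look like. The argument names are assumptions that mirror the implementations developed later in this article; consult the package documentation for the exact signatures.

```{r data function sketch, echo=TRUE, eval=FALSE}
data(stations)  # bundled ISD station list

# Level 2: the five nearest stations to a city (argument names assumed)
near <- getStationsByLocation("Accra, Ghana", stations, k = 5,
                              begin = 2012, end = 2013)

# Level 4: one 'best' station, with missing temperatures linearly
# interpolated (thresholds here are illustrative)
best <- getInterpolatedData("Accra, Ghana", stations, k = 5,
                            begin = 2012, end = 2013, distance = 100,
                            hourly_interval = 1, tolerance = 0.05)
```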
### Dependencies
Before we walk through the mechanics of each function, we need to load the package dependencies mentioned above.
```{r loadLibraries, echo=TRUE, message=FALSE, warning=FALSE}
# require(weatheR)
# plyr and data.table are used later for ddply()/count() and rbindlist()
packages <- c("httr", "ggmap", "ggplot2", "grid", "plyr", "data.table")
pkgCheck <- function(packages)
{
  # Try to load each package; print the name of any that fails to load
  sapply(packages, function(x){
    pkg <- suppressWarnings( library(x, logical.return = TRUE, character.only = TRUE) )
    if (!pkg) print(x)
    return(pkg)
  })
}
pkgCheck(packages)
```
### Station List
Once we load the package into memory, our first step is to download the full list of ISD weather stations from the NOAA FTP server: [ftp://ftp.ncdc.noaa.gov/pub/data/noaa/](ftp://ftp.ncdc.noaa.gov/pub/data/noaa/). The station list in that directory is named **`isd-history.txt`** or **`isd-history.csv`**.
As mentioned above, we have included this dataset in the package (subsetted to exclude stations without LAT/LON coordinates); it can be loaded into memory by calling `data(stations)`. However, we built this function to download the list from the database in case it is updated in the future.
```{r NOAA: allStations, echo=TRUE, eval=FALSE}
allStations <- function()
{
# Full list of ISD weather stations:
isd <- "ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv"
#
response <- suppressWarnings(GET(isd))
all <- read.csv(text=content(response, "text"), header = TRUE)
colnames(all)[c(3, 9)] <- c("NAME", "ELEV") # Renaming for ease and consistency
all <- all[!is.na(all$LAT) & !is.na(all$LON),] # Removing stations without LAT/LON data
return(all)
}
stations <- allStations() # Assigns our subsetted list of ISD stations to variable 'stations'
head(stations)
```
```{r noaa: allstations noeval, echo=FALSE, eval=FALSE}
data(stations)
kable(head(stations), format = "markdown")
```
It should be noted that the amount of data lost by removing stations without LAT/LON coordinates is minimal: approximately 1,300 out of 29,000 weather stations.
### Places of Interest
Now that we have the full list of weather stations, let's narrow it down to the stations we actually need. As mentioned above, the k-nearest stations are found relative to geo-referenced locations obtained via the Google Maps API. For a broad location such as a state or country, Google Maps returns the centroid; for a specific location such as a city or landmark, the geocode will be more precise.
If you want the geocode of a city, it is advisable to specify the state or country it is in, to eliminate the possibility of a mismatched result. Similarly, if you query multiple cities that share a name, you must specify the state or country as well; otherwise the Google Maps API will return a single location for all of the repeated cities.
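For example, you can geocode a single location directly to see what the Google Maps API returns (this sends a live query; the coordinates shown are approximate):

```{r geocode example, echo=TRUE, eval=FALSE}
suppressMessages(geocode("Nairobi, Kenya"))
##     lon   lat
## 1 36.82 -1.29
```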
For demonstration, we'll use a few cities from the table of the world's fastest-growing cities mentioned in [cite other paper].
```{r NOAA: cities, echo=TRUE, eval=FALSE}
cities.of.interest <- c("Nairobi, Kenya", "Tema, Ghana", "Accra, Ghana", "Abidjan, Ivory Coast")
```
### Nearest Neighbor Search
Before we can find the k-nearest stations, we need an accurate metric for the distance between a location's reference point and each station. The commonly accepted measure of distance between two points on the earth's surface is the great-circle distance.
#### The Great Circle
The great-circle distance is the distance between two points measured along the surface of a sphere. This differs from the Euclidean distance, which would measure a straight line through the sphere's interior. Below are the functions used to calculate the great-circle distance; `gcd.slc` implements the spherical law of cosines.
```{r great circle, echo=TRUE, eval=TRUE}
deg2rad <- function(deg) return(deg*pi/180) # convert degrees to radians
# Great-circle distance via the spherical law of cosines
gcd.slc <- function(long1, lat1, long2, lat2) {
long1 <- deg2rad(long1)
lat1 <- deg2rad(lat1)
long2 <- deg2rad(long2)
lat2 <- deg2rad(lat2)
R <- 6371 # Earth mean radius in km
d <- acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
return(d) # Distance in km
}
```
Note: this calculation does not take elevation into account; we will look into adding that feature in future versions of the package.
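As a quick sanity check, a minimal example using approximate city-centre coordinates:

```{r great circle example, echo=TRUE, eval=TRUE}
# New York City to London: roughly 5,570 km
gcd.slc(-74.006, 40.713, -0.128, 51.507)
```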
#### k-Nearest Stations
We can now use this distance metric to find the k-nearest stations to a given location. Each input location is passed to the `geocode` function in the `ggmap` package, which sends a query to the Google Maps API and returns a pair of latitude and longitude coordinates. Once we have LAT/LON coordinates for each location, we calculate the distance from that location to every station, sort by distance, and return only the *k* nearest stations.
```{r NOAA: kNStations, echo=TRUE, eval=FALSE}
kNStations <- function(city.list, station.list, k = 5)
{
# Find geocodes of each city
coords <- suppressMessages(geocode(city.list))
# Loop through each location, finding the k-nearest stations from the main list:
kns.ix <- NULL # create a variable to track row index the stations
kns.dist <- NULL # create a distance variable
for(i in 1:length(city.list))
{
# Great cirle distance calculator:
dist <- gcd.slc(coords$lon[i], coords$lat[i], station.list$LON, station.list$LAT)
distSort <- sort(dist, index.return=TRUE) # returns sorted values ($x) and their indices ($ix)
tmp.ix <- distSort$ix[1:k] # temporary index variable for the loop
tmp.dist <- distSort$x[1:k] # temporary distance variable for the loop
kns.ix <- c(kns.ix, tmp.ix) # append row index of stations for each location
kns.dist <- c(kns.dist, tmp.dist) # append distances of stations for each location
}
st <- station.list[kns.ix,] # Subset the full list with k-nearest stations
st$city <- rep(city.list, each=k) # Insert reference City
st$Ref_Lat <- rep(coords$lat,each=k) # Insert reference Latitude
st$Ref_Lon <- rep(coords$lon, each=k) # Insert reference Longitude
st$kilo_distance <- kns.dist # Insert distance into result
st <- st[with(st,order(city, kilo_distance)),]
st$rank <- rep(1:k,length(city.list)) # Rank closest to farthest (1 to k)
# Queries are made to NOAA database by year:
st$BEGIN_Year <- as.numeric(substr(st$BEGIN,1,4)) # Start Year
st$END_Year <- as.numeric(substr(st$END, 1, 4)) # End Year
return(st)
}
```
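For example, to find the five nearest stations to each of the cities defined earlier and inspect the closest matches:

```{r kNStations example, echo=TRUE, eval=FALSE}
# Five nearest stations per city; results are sorted by distance within each city
kns <- kNStations(cities.of.interest, stations, k = 5)
head(kns[, c("city", "NAME", "kilo_distance", "rank")])
```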
### Visual Inspection of Nearest Neighbors
We now have everything needed to begin downloading data, but before we do, we can visualize the stations around each city to understand how far each one is from the given location. The code below for the `multiplot` function was developed by -- insert reference -- and is used in the _____ package. However, rather than increase the number of dependencies of this package, we copied the code for internal use.
```{r NOAA: multiplot, echo=TRUE, eval=FALSE}
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL)
{
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout))
{
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1){
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
```
Next, we'll create the primary function to plot the stations for each city. The inputs are the list of cities (which gets passed to the `kNStations` function we built above), the full list of NOAA ISD weather stations, and k, the number of stations to return. The output is a map for each city with one black dot and up to *k* numbered red dots. The black dot is the city's reference point; the red dots are the weather stations, labeled 1 to *k*, ordered from closest to farthest from the reference point.
```{r NOAA: plotStations, echo=TRUE, eval=FALSE, message=FALSE, warning=FALSE}
plotStations <- function(city.list, station.list, k)
{
kns <- kNStations(city.list, station.list, k)
nc <- length(city.list)
plots <- list()
for (i in 1:nc)
{
map <- suppressMessages(get_map(location = city.list[i], zoom = 10))
p1 <- suppressMessages(ggmap(map) +
geom_point(aes(x = LON, y = LAT),
data = kns[kns$city==city.list[i],],
colour="red", size=7, alpha=.5) +
geom_text(aes(x = LON, y = LAT, label=rank),
data = kns[kns$city==city.list[i],]) +
geom_point(aes(x = lon, y = lat),
data = geocode(city.list[i]),
colour="black", size=7, alpha=.5)) +
labs(title=city.list[i]) + theme(plot.margin=unit(c(0,0,0,0),"mm"))
plots[[i]] <- p1
}
if (nc == 1) plot(p1) else multiplot(plotlist = plots, cols = round(sqrt(nc)))
}
# Run the function:
plotStations(cities.of.interest[1:4], stations, 7)
```
We can see that some cities have many weather stations nearby, whereas others have only a few. Now that we've confirmed there should be at least one station close to each city, let's create a function to download the data.
### Downloading Data via FTP
The data we'll be downloading are fixed-width files, saved on the FTP server in `.gz` format. The widths and column names of the station data are as follows.
```{r NOAA: file formats, echo=TRUE, eval=FALSE}
col.width <- c(4, 6, 5, 4, 2, 2, 2, 2, 1, 6, 7, 5, 5, 5, 4, 3, 1,
1, 4, 1, 5, 1, 1, 1, 6, 1, 1, 1, 5, 1, 5, 1, 5, 1)
col.names <- c("CHARS", "USAFID", "WBAN", "YR", "M", "D", "HR", "MIN",
"DATE.FLAG", "LAT", "LONG", "TYPE.CODE", "ELEV", "CALL.LETTER",
"QLTY", "WIND.DIR", "WIND.DIR.QLTY", "WIND.CODE",
"WIND.SPD", "WIND.SPD.QLTY", "CEILING.HEIGHT", "CEILING.HEIGHT.QLTY",
"CEILING.HEIGHT.DETERM", "CEILING.HEIGHT.CAVOK", "VIS.DISTANCE",
"VIS.DISTANCE.QLTY", "VIS.CODE", "VIS.CODE.QLTY",
"TEMP", "TEMP.QLTY", "DEW.POINT", "DEW.POINT.QLTY",
"ATM.PRES", "ATM.PRES.QLTY")
```
Now we can build a function that downloads data over a range of years for each of the k-nearest stations found by `kNStations`. The function takes three inputs: the result of `kNStations`, a beginning year, and an ending year (both in 4-digit format).
```{r dlStationData, echo=TRUE, eval=FALSE}
dlStationData <- function(kns, beg, end)
{
base.url <- "ftp://ftp.ncdc.noaa.gov/pub/data/noaa/"
nstations <- nrow(kns)
yrs <- seq(beg,end,1)
nyrs <- end-beg+1
usaf <- as.numeric(kns$USAF)
wban <- as.numeric(kns$WBAN)
# The following dataframe will be printed to show
# which files were successfully downloaded
status <- data.frame()
temp <- as.data.frame(matrix(NA,nstations,5))
names(temp) <- c("File","Status", "City", "rank", "kilo_distance")
temp$City <- kns$city
temp$rank <- kns$rank
temp$kilo_distance <- kns$kilo_distance
# Setup for the list of data
temp.list <- df.list <- list()
city.names <- unlist(lapply(strsplit(kns$city,", "), function(x) x[1])) # City Name
df.names <- paste(city.names, kns$USAF, sep="_") # City Name_USAF#
# Download the desired stations into a list (does not save to disk)
for (i in 1:nyrs)
{
temp$Status <- NA # reset the status flags for each new year
temp.list <- list() # reset so a failed download can't re-append the prior year's data
for (j in 1:nstations)
{
# Create file name
temp[j,1] <- paste(usaf[j],"-",wban[j],"-", yrs[i], ".gz", sep = "")
tryCatch({
# Create connect to the .gz file
gz.url <- paste(base.url, yrs[i], "/", temp[j, 1], sep="")
con <- gzcon(url(gz.url))
raw <- textConnection(readLines(con))
# Read the .gz file directly into R without saving to disk
temp.list[[j]] <- read.fwf(raw, col.width)
close(raw); close(con) # close both connections
# Some housekeeping:
names(temp.list)[j] <- df.names[j]
names(temp.list[[j]]) <- col.names
temp.list[[j]]$LAT <- temp.list[[j]]$LAT/1000
temp.list[[j]]$LONG <- temp.list[[j]]$LONG/1000
temp.list[[j]]$WIND.SPD <- temp.list[[j]]$WIND.SPD/10
temp.list[[j]]$TEMP <- temp.list[[j]]$TEMP/10
temp.list[[j]]$DEW.POINT <- temp.list[[j]]$DEW.POINT/10
temp.list[[j]]$ATM.PRES <- temp.list[[j]]$ATM.PRES/10
temp.list[[j]]$city <- city.names[j]
temp.list[[j]]$distance <- kns$kilo_distance[j]
temp.list[[j]]$rank <- kns$rank[j]
temp[j,2] <- "Success"
},
error=function(cond)
{
NA # download failed; the 'finally' block below records the failure
},
finally={ # if the file didn't download successfully, label it as such
if (is.na(temp[j,2])) temp[j,2] <- "Failed"
})
}
# Combine each year's status and list
status <- rbind(status, temp)
status <- status[order(status[,3], status[,4], status[,1]),]
df.list <- append(df.list, temp.list)
}
output.list <- list(status, df.list)
names(output.list) <- c("dl_status", "station_data")
return(output.list)
}
```
The output of the above function is a list of two items:
1) The download status for each year of each station, shown as either "Success" or "Failed".
2) A list of dataframes, one for each year of each station.
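For example, a sketch of downloading two years of data and inspecting both items (the year range is illustrative):

```{r dlStationData example, echo=TRUE, eval=FALSE}
kns <- kNStations(cities.of.interest, stations, k = 5)
raw <- dlStationData(kns, beg = 2012, end = 2013)
raw$dl_status             # one "Success"/"Failed" row per station-year
length(raw$station_data)  # one data frame per downloaded station-year
```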
The next step is to reduce the number of dataframes we need to work with by combining all years of data for the same station.
### Data Compilation
```{r NOAA: combineWeatherDFs, echo=TRUE, eval=FALSE}
# rbindlist() below comes from the data.table package
combineWeatherDFs <- function(dfList)
{
combined.list <- list()
keys <- unique(names(dfList$station_data))
keys <- keys[keys!=""]
nkeys <- length(keys)
for (i in 1:nkeys)
{
track <- which(names(dfList$station_data)==keys[i])
combined.list[[i]] <- as.data.frame(rbindlist(dfList$station_data[track]))
names(combined.list)[i] <- keys[i]
}
output.list <- list(dfList$dl_status, combined.list)
names(output.list) <- c("dl_status", "station_data")
return(output.list)
}
```
We're now ready to put this all together into one function:
```{r NOAA getStationByCity, echo=TRUE, eval=FALSE}
getStationsByCity <- function(city.list, station.list, k, begin, end)
{
kns <- kNStations(city.list, station.list, k)
weatherDFs <- dlStationData(kns, begin, end)
combined.list <- combineWeatherDFs(weatherDFs)
return(combined.list)
}
```
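For example (the year range is illustrative):

```{r getStationsByCity example, echo=TRUE, eval=FALSE}
# Raw hourly data for up to k = 5 stations per city, 2012 through 2013
wx <- getStationsByCity(cities.of.interest, stations, k = 5,
                        begin = 2012, end = 2013)
names(wx$station_data)  # one combined data frame per station ("City_USAF#")
```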
### Data Filters
This is essentially the guts of the package; however, we need to perform two more steps before analysis. First, we only need one weather station per city. So far we have not filtered out any stations, so the output of the function above can contain up to k stations per city. We will establish minimum standards for each dataset, evaluate each station's data, and then select the station containing the best data. Second, we need observations at every hour of every year. No matter how good a weather station is, it is bound to have some missing observations, so we will use linear interpolation to estimate them.
These two steps will be add-ons to the function above, so let's start by creating a filtering mechanism. We have four steps in our filtering/selection process:
1) Remove stations with little to no data
2) Remove stations that exceed a maximum distance from each city's reference point
3) Remove stations that exceed a threshold of missing data, including NA values
4) Select the closest remaining station for each city, as all remaining stations are deemed adequate
```{r NOAA: filterStationData, echo=TRUE, eval=FALSE}
# count() near the end of this function comes from the plyr package
filterStationData <- function(comb.list, distance, hourly_interval, tolerance, begin, end)
{
dlStatus <- comb.list$dl_status
comb.list <- comb.list$station_data
city.names <- unlist(lapply(comb.list, function(x) unique(x$city)))
# 1) remove stations with little to no data at all
rm.junk <- names(comb.list[which(sapply(comb.list, function(x) dim(x)[1] <= 10))])
comb.list <- comb.list[which(sapply(comb.list, function(x) dim(x)[1] > 10))]
# 2) remove stations that exceed maximum distance
rm.dist <- names(comb.list[which(sapply(comb.list, function(x) max(x["distance"])) > distance)])
comb.list <- comb.list[which(sapply(comb.list, function(x) max(x["distance"])) <= distance)] # <= so the threshold itself is kept
# keep track of which stations have been removed
rm.tmp <- unique(c(rm.junk, rm.dist))
# 3.a) remove stations that exceed threshold of missing data,
# start with counting the 999s:
cl <- c("TEMP", "DEW.POINT") # Additional columns can be added
ix.ct <- as.data.frame(matrix(nrow=length(comb.list), ncol=length(cl)))
colnames(ix.ct) <- cl
rownames(ix.ct) <- names(comb.list)
for (L in 1:length(comb.list))
{
ix <- NULL # reset per station; otherwise rows flagged in one station carry over to the next
for (i in 1:length(cl))
{
ix.tmp <- which(comb.list[[L]][cl[i]]==999.9 | comb.list[[L]][cl[i]]==999 |
comb.list[[L]][cl[i]]==99.99)
ix.ct[L,i] <- length(ix.tmp)
ix <- union(ix,ix.tmp)
}
# only drop rows if any were flagged; x[-ix,] with an empty ix would drop every row
if (length(ix) > 0) comb.list[[L]] <- comb.list[[L]][-ix,]
}
ix.ct$temp_pct <- ix.ct[,cl[1]]/unlist(lapply(comb.list,nrow))
ix.ct$dew_pct <- ix.ct[,cl[2]]/unlist(lapply(comb.list,nrow))
# print(ix.ct)
# ms.obs <- (ix.ct$TEMP)+(ix.ct$DEW.POINT)
# 3.b) set a minimum number of observations and remove stations that do not
# meet requirement
nyrs <- end-begin+1
min.obs <- (24/hourly_interval)*365*nyrs*(1-tolerance) # minimum acceptable number of observations
obs.ix <- which(sapply(comb.list, nrow) < min.obs)
rm.obs <- names(comb.list[which(sapply(comb.list, nrow) < min.obs)])
if(length(rm.obs)==0) comb.list <- comb.list else comb.list <- comb.list[-obs.ix]
# update removed stations
rm.all <- unique(c(rm.tmp, rm.obs))
# 4) All remaining stations are assumed to be adequate,
# so we take the closest one to each reference point
kept.names <- substr(names(comb.list),1,nchar(names(comb.list))-7)
kept.ranks <- unname(unlist(lapply(comb.list, function(x) x["rank"][1,1])))
f.df <- data.frame(location=kept.names, ranks=kept.ranks)
kp.ix <- as.numeric(rownames(f.df[which(ave(f.df$ranks,f.df$location,FUN=function(x) x==min(x))==1),]))
final.list <- comb.list[kp.ix]
# Show what was removed during the filtering process:
kept <- names(comb.list)
st.df <- data.frame(count(city.names))
rm.df <- count(substr(rm.all, 1, nchar(rm.all)-7))
kept.df <- count(substr(kept, 1, nchar(kept)-7))
df.list <- list(st.df, rm.df, kept.df)
mg.df <- Reduce(function(...) merge(..., by="x", all=T), df.list)
suppressWarnings(mg.df[is.na(mg.df)] <- 0)
colnames(mg.df) <- c("city", "stations", "removed", "kept")
filterStatus <- mg.df
# Show the stations that will be in the final output:
finalStations <- names(final.list)
# Create a list for output
finalOutput <- list(dlStatus, filterStatus, finalStations, final.list)
names(finalOutput) <- c("dl_status", "removed_rows", "station_names_final", "station_data")
return(finalOutput)
}
```
This function returns a list of four items: the download status (the same as in prior functions); a dataframe showing, for each city, how many candidate stations were downloaded, how many the filters removed, and how many were kept; the names of the stations selected in the final step; and the data for those stations. Note that the fourth item is a list of dataframes.
At this point you could add a call to `filterStationData` inside `getStationsByCity`; however, we still need to interpolate missing values. This intermediate step is included as a separate function in the package, but we will skip it here.
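For example, using the combined output `wx` from the earlier sketch (the thresholds are illustrative):

```{r filterStationData example, echo=TRUE, eval=FALSE}
# Keep stations within 100 km whose record is missing at most 5% of the
# expected hourly observations
filtered <- filterStationData(wx, distance = 100, hourly_interval = 1,
                              tolerance = 0.05, begin = 2012, end = 2013)
filtered$removed_rows         # stations found / removed / kept, per city
filtered$station_names_final  # the station selected for each city
```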
### Data Interpolation
The final step is to create a function to interpolate all the missing values in each station's dataset.
```{r NOAA: interpolateData, echo=TRUE, eval=FALSE}
# ddply() and summarise() come from the plyr package
interpolateData <- function(wx.list)
{
clean.list <- lapply(wx.list, function(x){
ddply(x, .(city, USAFID, distance, rank, YR, M, D, HR), summarise,
LAT=mean(LAT), LONG=mean(LONG), ELEV=mean(ELEV),
TEMP=mean(TEMP), DEW.POINT=mean(DEW.POINT))})
# Create a column with the full posix date for each hour
for (i in 1:length(clean.list))
{
clean.list[[i]]$dates <- as.POSIXct(paste(paste(clean.list[[i]]$YR,"-",clean.list[[i]]$M,
"-",clean.list[[i]]$D, " ",clean.list[[i]]$HR,sep=""),
":",0,":",0,sep=""),"%Y-%m-%d %H:%M:%S", tz="UTC")}
# Create a list of dataframes of each hour
hourly.list <- list()
for (i in 1:length(clean.list))
{
hourly.list[[i]] <- data.frame(hours=seq(
from=as.POSIXct(paste(min(clean.list[[i]]$YR),"-1-1 0:00", sep=""), tz="UTC"),
to=as.POSIXct(paste(max(clean.list[[i]]$YR),"-12-31 23:00", sep=""), tz="UTC"),
by="hour"))
}
wx.df <- data.frame()
for (i in 1:length(clean.list))
{
temp.df <- merge(hourly.list[[i]], clean.list[[i]], by.x="hours", by.y="dates", all.x=TRUE)
temp.df$city <- unique(na.omit(temp.df$city))[1]
temp.df$USAFID <- unique(na.omit(temp.df$USAFID))[1]
temp.df$distance <- unique(na.omit(temp.df$distance))[1]
temp.df$rank <- unique(na.omit(temp.df$rank))[1]
temp.df$LAT <- unique(na.omit(temp.df$LAT))[1]
temp.df$LONG <- unique(na.omit(temp.df$LONG))[1]
temp.df$ELEV <- unique(na.omit(temp.df$ELEV))[1]
temp.df$YR <- as.numeric(format(temp.df$hours,"%Y"))
temp.df$M <- as.numeric(format(temp.df$hours,"%m"))
temp.df$D <- as.numeric(format(temp.df$hours,"%d"))
temp.df$HR <- as.numeric(format(temp.df$hours,"%H"))
# Interpolation
temp.int <- approx(x=temp.df$hours, y=temp.df$TEMP, xout=temp.df$hours)
temp.df$TEMP <- temp.int$y
dew.int <- approx(x = temp.df$hours, y = temp.df$DEW.POINT, xout = temp.df$hours)
temp.df$DEW.POINT <- dew.int$y
# Merge the dataframes together
wx.df <- rbind(wx.df, temp.df)
}
return(wx.df)
}
```
The output of this function is a single large dataframe with hourly observations from the station with the 'best' data for each city.
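The interpolation step relies on base R's `approx`, which linearly interpolates between the nearest non-missing observations; it drops NA points before fitting and, with the default `rule = 1`, leaves points outside the observed range as NA. A toy example of its behavior:

```{r approx example, echo=TRUE, eval=TRUE}
hrs <- 1:6
temp <- c(NA, 20.1, NA, 23.4, 25.0, NA)
approx(x = hrs, y = temp, xout = hrs)$y
## NA 20.10 21.75 23.40 25.00 NA  -- interior gap filled; ends remain NA
```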
### Putting It All Together
Putting it all together, we end up with the following function, which will feed data into the next step of our analysis.
```{r NOAA: getInterpolatedDataByCity, echo=TRUE, eval=FALSE}
getInterpolatedDataByCity <- function(city.list, station.list, k, begin, end, distance, hourly_interval, tolerance)
{
kns <- kNStations(city.list, station.list, k)
weatherDFs <- dlStationData(kns, begin, end)
combined.list <- combineWeatherDFs(weatherDFs)
filteredData <- filterStationData(combined.list, distance, hourly_interval, tolerance, begin, end)
interpolation <- interpolateData(filteredData$station_data)
return(interpolation)
}
```
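For example, an end-to-end call for our four cities (the year range and thresholds are illustrative):

```{r end-to-end example, echo=TRUE, eval=FALSE}
wx.final <- getInterpolatedDataByCity(cities.of.interest, stations, k = 5,
                                      begin = 2012, end = 2013,
                                      distance = 100, hourly_interval = 1,
                                      tolerance = 0.05)
head(wx.final[, c("city", "hours", "TEMP", "DEW.POINT")])
```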
## References
Segal, M., H. Shafir, M. Mandel, P. Alpert and Y. Balmor (1992). Climatic-related Evaluations of the Summer Peak-hours' Electric Load in Israel. *Journal of Applied Meteorology* 31: 1492-1498.
Thatcher, M. J. (2007). Modelling Changes to Electricity Demand Load Duration Curves as a Consequence of Predicted Climate Change for Australia. *Energy* 32: 1647-1659.
Hellmuth, M.E., D.E. Osgood, U. Hess, A. Moorhead and H. Bhojwani (eds) (2009). *Index Insurance and Climate Risk: Prospects for Development and Disaster Management*. Climate and Society No. 2. International Research Institute for Climate and Society (IRI), Columbia University, New York, USA.
Rogers, David and Vladimir Tsirkunov (2011). *Implementing Hazard Early Warning Systems*. Global Facility for Disaster Reduction and Recovery (GFDRR) in support of strengthening Weather and Climate Information and Decision Support Systems (WCIDS).
Lambert, T., P. Gilman and P. Lilienthal (2005). Micropower System Modeling with HOMER. In: Farret, F.A. and M.G. Simões (eds), *Integration of Alternative Sources of Energy*. John Wiley & Sons. ISBN 0471712329.
Lilienthal, P.D., T.W. Lambert and P. Gilman (2004). Computer Modeling of Renewable Power Systems. In: Cleveland, C.J. (ed), *Encyclopedia of Energy*, Volume 1, pp. 633-647. Elsevier. NREL Report No. CH-710-36771.