---
title: "Initial Analysis Walkthrough"
#output: html_notebook
---
# Overview
This is a walkthrough of the Eastie Data Science code created during the summer and fall of 2020. It covers downloading, cleaning, and performing a high-level analysis of the data. At the end, we export the cleaned datasets, which can be used as inputs to the following two walkthroughs: regime definitions and co-location.
# Downloading and Importing
We're going to be using two datasets: QuantAQ pollutant concentration data from the [QuantAQ website](https://www.quant-aq.com/) and meteorology data from the [Iowa Environmental Mesonet](https://mesonet.agron.iastate.edu/request/download.phtml?network=MA_ASOS).
Note: The paragraph above and the following section describe the downloading, storing, and importing process for the normal workflow. That workflow is not always available, however. After the downloading, storing, and importing sections below, there is a section that describes how to do these steps when you need to use the QuantAQ data from the [DropBox folder](https://app.box.com/s/3rs3dasqitqtrwxgtem0ef96jhkbnqtm) instead.
## QuantAQ Data
Download the following QuantAQ files:
* SN45: final and raw data taken from 9/7/2019 to 3/8/2020
* SN49: final and raw data taken from 9/7/2019 to 3/8/2020
* SN62: final and raw data taken from 9/7/2019 to 3/8/2020
* SN67: final and raw data taken from 9/7/2019 to 3/8/2020
* SN72: final and raw data taken from 9/7/2019 to 3/8/2020
## Meteorology Data
Select the following station:
* [BOS] BOSTON/LOGAN INTL
Select the following variables:
* Wind Direction
* Wind Speed [mph]
* Temperature (C)
Select the date range:
* 9/7/2019 to 3/8/2021 (this dataset is not inclusive of the last date)
Select this timezone:
* America/New_York
Use the following download options:
* csv
* no latitude/longitude vectors
* represent missing data with blank string
* denote trace with blank string
## Storing the data
In order to easily import the data, we'll store the data in the following way:
* Store all the QuantAQ data in the folder called "quant_aq", which is a subfolder of the "/data" folder of this workspace. When saving, make sure that the file name starts with the sensor name (e.g., sn45) and contains either "raw" or "final", to indicate which type of data was downloaded. For example: "sn45_final.csv".
* Store the meteorology data in the "/data" folder in this workspace. Store the file as "metdata.csv"
## Importing Data
### Importing Necessary Libraries
Our import and analysis rely on functions from a few packages, which must be loaded for the code to run. The first step is to install each of the following packages with the install.packages("...") command, for example install.packages("lubridate"). Run these in the console and the packages will be installed locally, so that you can load them here. You might get a message that certain functions are "masked" or "overwritten". This means you loaded libraries that have functions with the same name. If you know that these functions do different things from package to package and want to use a specific one, you can specify the function's package with package::function, for example openair::polarPlot().
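As a rough sketch of that one-time setup (the package list simply mirrors the library() calls in the next chunk):
```{r, eval = FALSE}
# one-time installation of the packages loaded in the next chunk (run once per machine)
install.packages(c("lubridate", "data.table", "dplyr", "tidyr",
                   "openair", "stringr", "baseline"))
```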
```{r, warning= FALSE, error = FALSE}
library(lubridate) #date and time functions
library(data.table) #to use the data.table variable type
library(dplyr) #this library allows you to use the %>% operator
library(tidyr) #this library lets you use the complete function to account for time syncing
library(openair) #for plotting and analysis
library(stringr)
library(baseline) # for baseline correction
```
################################################################################
### Import QuantAQ data : Method 1
If we want to import only certain variables, we can define what we want to import ahead of time.
```{r}
# # #defining which variables we'll keep
# final_vars <- c( "timestamp","timestamp_local","temp_box", "temp_manifold", "rh_manifold", "pressure",
# "noise", "solar", "wind_dir", "wind_speed", "co", "no" , "no2", "o3" , "pm1" , "pm25",
# "pm10", "co2" ) #final variables to keep
#
# # temp_box temp_manifold rh_manifold pressure noise solar wind_dir wind_speed
# raw_vars <- c("timestamp","bin0", "bin1", "bin2", "bin3", "bin4", "bin5", "no_ae", "co_ae", "no2_ae") #raw variables to keep
```
Next, we'll import all the final and raw datafiles separately. We do this by finding all the csv's with "final" or "raw" in the name, and then applying data.table's fread() to each of them. If you prefer to import all variables, comment out the "sapply" calls that are currently in use and uncomment the ones below them. The difference is that the active calls use the "select" parameter and the commented ones don't.
```{r}
# #finding all final datasets
# temp = list.files(path = "./data/quant_aq_March2021", pattern = "^.*_final.*\\.csv", full.names = TRUE)
# finalfiles = sapply(temp, FUN = function(x) fread(file=x, data.table = TRUE, select = final_vars), simplify = FALSE,USE.NAMES = TRUE)
# #finalfiles = sapply(temp, function(x) fread(file=x , data.table = TRUE) , simplify = FALSE,USE.NAMES = TRUE)
#
# #finding all raw datasets
# temp = list.files(path = "./data/quant_aq_March2021", pattern = "^.*_raw.*\\.csv", full.names = TRUE)
# rawfiles = sapply(temp, function(x) fread(file=x, data.table = TRUE, select = raw_vars) , simplify = FALSE,USE.NAMES = TRUE)
# #rawfiles = sapply(temp, function(x) fread(file=x, data.table = TRUE) , simplify = FALSE,USE.NAMES = TRUE)
```
The output of the above functions is a list of data tables. Because we included the sapply parameter USE.NAMES, each data.table in the list is named after its file. And because we set data.table = TRUE, we can be sure the data is stored in the data.table format, which has a more efficient runtime for large datasets.
Let's use an easier naming convention. Since we used a specific naming scheme and a specific path, we can extract just the part of each path that identifies the dataset: "sn" plus the sensor number, for example "sn45" or "sn46".
```{r}
# names(finalfiles)<- lapply(names(finalfiles), function(x) substr(x, 9+18, 9+21)) #extract the real sensor names from the current names used in the list
# names(rawfiles)<- lapply(names(rawfiles), function(x) substr(x, 10+17, 10+20))
# names(finalfiles)
# names(rawfiles)
```
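As an aside, a regex-based rename is less brittle than the fixed substr() offsets above if the folder path ever changes. Here is a sketch, kept commented like the rest of Method 1 and assuming filenames such as "sn45_final.csv":
```{r}
# # regex alternative to the fixed substr() offsets: pull "sn" plus digits out of each path
# # (assumes the filenames contain the sensor name, e.g. "sn45_final.csv")
# names(finalfiles) <- str_extract(names(finalfiles), "sn[0-9]+")
# names(rawfiles)   <- str_extract(names(rawfiles), "sn[0-9]+")
```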
Next, we want to combine the final and raw files for a given sensor. However, it will be beneficial to keep the sensors as a list of dataframes, as we'll see later.
Since the final and raw datasets could have different numbers of rows, we'll merge them on the "timestamp" column.
3/17 edit: In the newest version of the data I pulled from the QuantAQ website, the timestamps weren't all synced to the same seconds. Therefore, we have to put all the data in ymd_hms format first and round it to the nearest minute.
```{r}
# clean_sensor_data <- function(sensor){
# sensor$date <- ymd_hms(sensor$timestamp, tz="UTC") #parse datetime
# sensor <- mutate(sensor, originaldate = date) #keeping original times for comparing to flight data
# sensor$date <- round_date(ymd_hms(sensor$date, tz="America/New_York"), unit="minute") #round date for merging
# sensor<- sensor[order(sensor$originaldate),] #put it in chronological order
# return(sensor)
# }
```
```{r}
# rawfiles <- lapply(rawfiles, function(x) clean_sensor_data(x))
# rawfiles <- lapply(rawfiles, function(x) unique(x))
# finalfiles <- lapply(finalfiles, function(x) clean_sensor_data(x))
# finalfiles <- lapply(finalfiles, function(x) unique(x))
```
```{r}
# snfiles <- sapply(names(rawfiles), function(k) merge.data.table(finalfiles[[k]], rawfiles[[k]], by="date", all = FALSE), simplify = FALSE,USE.NAMES = TRUE)
```
#### Proof that Combining Datasets works
In case you aren't following the normal workflow, here is proof that binding the final and raw files still works.
Let's download SN45 and SN46 data from the QuantAQ website. Download a final and a raw datafile for each, from 10/23 to 10/30. Save these files as "sn45_trial_final.csv", "sn45_trial_raw.csv", and so on, and put them in the "quant_aq" data folder.
Import those files here.
```{r}
# sn45_raw <- fread("./data/quant_aq/sn45_trial_raw.csv", data.table = TRUE) # import raw sn45 data
# sn45_final <- fread("./data/quant_aq/sn45_trial_final.csv", data.table = TRUE) # import final sn45 data
#
# sn46_raw <- fread("./data/quant_aq/sn46_trial_raw.csv", data.table = TRUE) # import raw sn46 data
# sn46_final <- fread("./data/quant_aq/sn46_trial_final.csv", data.table = TRUE) # import final sn46 data
```
Let's try merging SN45 together based on the "timestamp" vector.
```{r}
# merge(sn45_raw, sn45_final, by = "timestamp", all = FALSE) #merge both sn45 files
```
Now let's simulate the list of final dataframes and raw dataframes.
```{r}
# finalfiles <- list("sn45" = sn45_final,"sn46" = sn46_final) # create list of final dataframes
# rawfiles <- list("sn45" = sn45_raw, "sn46" = sn46_raw) # create list of raw dataframes
```
We'll bind the appropriate files together in an lapply statement.
```{r}
# snfiles_trial <- sapply(names(finalfiles), function(k) merge.data.table(finalfiles[[k]], rawfiles[[k]], by="timestamp", all = FALSE), simplify = FALSE,USE.NAMES = TRUE) #bind lists of final and raw dataframes
```
This is what's contained in the new dataframes:
```{r}
# colnames(snfiles_trial$sn45)
```
################################################################################
# Downloading, Saving and Importing using Dropbox workflow (Importing Method 2)
## Downloading
Click "Download" in the top right corner of the Dropbox screen. This will download all of the contents in the DropBox folder.
## Saving
The data will download as a zip file. Extract the contents of the zip file, and place them in the "quant_aq" subfolder of the "data" folder in this workspace. Make sure to save the files directly to that folder, and not in a "Final_Customer" subfolder.
## Importing
These datafiles aren't divided into final and raw, so we don't have to combine files. Instead, we can create one list of variables we want to extract (instead of two, like in the normal workflow).
```{r}
# #defining which variables we'll keep
# final_vars <- c( "timestamp","timestamp_local","temp_box", "temp_manifold", "rh_manifold", "pressure",
# "noise", "solar", "wind_dir", "wind_speed", "co", "no" , "no2", "o3" , "pm1" , "pm25",
# "pm10", "co2" , "no_ae", "co_ae", "no2_ae", "bin0", "bin1", "bin2", "bin3", "bin4", "bin5")
```
The data files can be categorized based on which sensor they represent.
Initially, the data files were categorized by whether they are "f0" or "f1" (both categories are represented in the filename); the difference is that one has had filters applied and the other hasn't.
Currently, there are only "f0" files, so we'll specify we're looking at datafiles with "f0" in the name.
```{r}
# temp = list.files(path = "./data/quant_aq/", pattern = "^.*-f0.*\\.csv", full.names = TRUE) #specify files we're looking for
# snfiles = sapply(temp, FUN = function(x) fread(file=x, data.table = TRUE, select = final_vars), simplify = FALSE,USE.NAMES = TRUE) #importing them
```
Like in the normal workflow, we can reformat the names. Again, we'll only keep the relevant parts of the filenames.
```{r}
# new_names <- lapply(names(snfiles), function(x) substr(x, 17, 26)) #extract the real sensor names from the current names used in the list
# names(snfiles) <- new_names #rename list names
# names(snfiles)
```
As we see, these are in a different format than the ones imported from the QuantAQ website. We can fix this with a string replacement: "sn" instead of "Output-0".
```{r}
# names(snfiles) <- lapply(names(snfiles), function(x) str_replace(x, "Output-0", "sn"))
# names(snfiles)
```
################################################################################
# Importing data : Method 3
On 3/19, we decided to merge two dataframes (the QuantAQ data and the box data) in order to get a more complete dataset. We did this in the merging_sensor_data.Rmd file. Now, we're going to import those files.
We saved those merged files locally; the chunk below imports them directly from the "./data/joined/" folder, without needing to remove columns.
```{r}
nm <- list.files(path = "./data/joined/", pattern = "*.csv", full.names = TRUE)
snfiles <- sapply(nm, FUN = function(x) fread(file=x, data.table = TRUE), simplify = FALSE,USE.NAMES = TRUE)
```
```{r}
names(snfiles) <- lapply(names(snfiles), function(x) substr(x, 15, 18)) #extract the real sensor names from the current names used in the list
```
```{r}
set_vector <- function(snfile){
snfile$date <- ymd_hms(snfile$date, tz = "UTC")
return(snfile)
}
```
```{r}
snfiles <- lapply(snfiles, function(x) set_vector(x))
```
################################################################################
# Import Meteorology data
First, we import the file. We can indicate that there is a header.
```{r}
metdata <- fread("data/metdata_temp.csv", header=TRUE, data.table = TRUE) #import meteorology file
```
If you look at the file, you'll see that the datetime is not in the standard datetime format. We can standardize it to the QuantAQ data format using R's built-in as.POSIXct() function, which parses and converts datetime formats. We can also specify which timezone the data is already in. Be sure that the timezone you specify here matches the timezone you originally chose when downloading the data.
```{r}
metdata$date <- as.POSIXct(metdata$valid, format = "%Y-%m-%d %H:%M", tz = "UTC") #parse datetime, treating the timestamps as UTC
#metdata$date <- as.POSIXct(metdata$valid, format = "%Y-%m-%d %H:%M") #alternative: parse datetime in the system's local (East Coast) timezone
```
Next, if we look at the data columns, we see that the column names are unrepresentative of the data. If we look at the date column in particular, we see that the data is not in one-minute intervals like the sensor data, but in five-minute intervals. To solve this second concern, we can add one-minute intervals in between and fill them with the value from the preceding five-minute reading. For example, if wind direction is 0 at 11:00 and 120 at 11:05, the new data would be: 11:00 - 0, 11:01 - 0, 11:02 - 0, 11:03 - 0, 11:04 - 0, 11:05 - 120, 11:06 - 120, ..., 11:09 - 120.
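Here is a minimal, self-contained sketch of that fill-forward behavior with toy values (not the real met data):
```{r}
# toy 5-minute wind-direction readings, mirroring the example above
toy_met <- data.frame(
  date = as.POSIXct(c("2020-01-01 11:00", "2020-01-01 11:05", "2020-01-01 11:10"), tz = "UTC"),
  wd   = c(0, 120, 240)
)
toy_met %>%
  complete(date = seq(from = min(date), to = max(date), by = "1 min")) %>% #insert the missing 1-minute rows
  fill(wd) #carry each 5-minute value forward into the new rows
```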
We can address both of these concerns at once using the dplyr pipe operator (%>%):
```{r}
metdata <- metdata %>%
# if only wind speed and wind direction matter
setnames(old = c("drct", "sped", "valid"), new = c("wd", "ws", "original_met_time")) %>% #rename
na.omit("date") %>% #if any dates are NA, the following function won't work
complete(date = seq(from = min(date), to= max(date), by = "1 min")) %>%#make 1 minute intervals
fill(c("wd", "ws", "tmpc")) #fill those new 1 minute interval rows
#if the entire met dataset matters
# setnames(old = c("drct", "sped", "valid", "relh", "mslp"), new = c("wd", "ws", "original_met_time", "rh", "press")) %>% #rename
# na.omit("date") %>% #if any dates are NA, the following function won't work
# complete(date = seq(from = min(date), to= max(date), by = "1 min")) %>%#make 1 minute intervals
# fill(c("wd", "ws", "rh", "press", "sknt", "tmpc", "press", "dwpc")) #fill those new 1 minute interval rows
```
We specified on the Iowa Mesonet website that missing values should show up as a blank string. When we import into R, these translate to NA values. Since the missing entries are NAs, wind speed and wind direction are imported as numeric vectors; if the missing data had been encoded as "null", for example, the vectors would come in as character strings.
```{r}
# class(metdata$ws)
# class(metdata$wd)
# class(metdata$rh)
# class(metdata$press)
# class(metdata$sknt)
# class(metdata$tmpc)
# class(metdata$press)
# class(metdata$dwpc)
# class(metdata$date)
```
On the website, the wind speed was given in mph. However, we would prefer m/s, so let's convert ws from mph to m/s.
```{r}
metdata$ws <- metdata$ws * (1609/3600) #converting to m/s, 1609 meters per mile, 3600 seconds per hr
```
Lastly, let's remove variables that we won't use:
```{r}
metdata[,c( "station")] <- list(NULL) #getting rid of unnecessary variables
```
```{r}
metdata <- unique(metdata)
```
#################################################################################
# Formatting and Merging Data
## Formatting QuantAQ sensor data
Similar to the meteorology data, we can identify what needs to be reformatted in the sensor data.
Let's walk through it step-by-step for SN45.
### Formatting Walkthrough for SN45
First, let's standardize the date. The ymd_hms function parses datetimes into the Y-M-D H:M:S format, so we'll use it. It also has a parameter for defining the timezone; we know the timestamp data is UTC, so we use that. Afterwards, we can shift the datetime vector into the EST/EDT timezone with lubridate's with_tz() function. We're using timestamp instead of timestamp_local because QuantAQ said it was more reliable, and because we can check the formatted datetime against the local timestamp afterwards.
We'll print a date vector at each step, to show what the functions are doing.
```{r}
# sprintf("unmodified timestamp : %s, timezone : %s", snfiles$sn45$timestamp[1], tz(snfiles$sn45$timestamp[1]))
#
# snfiles$sn45$date <- lubridate::ymd_hms(snfiles$sn45$timestamp, tz="UTC") #parse datetime
#
# sprintf("modified timestamp : %s , timezone : %s", snfiles$sn45$date[1] , tz(snfiles$sn45$timestamp[1]))
#
# snfiles$sn45$date <- lubridate::with_tz(snfiles$sn45$date, "America/New_York") #change timezone
#
# sprintf("modified timezone: %s , timezone : %s ", snfiles$sn45$date[1], tz(snfiles$sn45$date[1]))
# sprintf("compared to local timestamp: %s", snfiles$sn45$timestamp_local[1])
```
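Since the chunk above is commented out in this workflow, here is a small standalone sketch of the same parse-then-shift pattern on a made-up timestamp:
```{r}
# made-up UTC timestamp, just to show the two lubridate steps
ts_utc   <- ymd_hms("2019-09-07 14:30:00", tz = "UTC")   # parse as UTC
ts_local <- with_tz(ts_utc, "America/New_York")          # same instant, displayed in EST/EDT
ts_utc
ts_local
```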
Next, we want to round the data to the nearest minute so that we can align it with the meteorological dataset. We'll preserve the original datetime vector by saving it as a separate column. Then we'll round the dates, so 11:30:29 rounds to 11:30:00 and 11:30:31 rounds to 11:31:00.
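As a quick illustration with the two example times from the sentence above:
```{r}
# rounding behavior of the two example timestamps mentioned above
round_date(ymd_hms("2020-01-01 11:30:29", tz = "UTC"), unit = "minute") # rounds down to 11:30:00
round_date(ymd_hms("2020-01-01 11:30:31", tz = "UTC"), unit = "minute") # rounds up to 11:31:00
```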
```{r}
# snfiles$sn45 <- mutate(snfiles$sn45, originaldate = date) #keeping original times for comparing to flight data
# sprintf("new date object : %s, old date object : %s", snfiles$sn45$date[1], snfiles$sn45$originaldate[1])
# snfiles$sn45$date <- round_date(snfiles$sn45$date, unit="minute") #round date for merging
# sprintf("new date object : %s, old date object : %s", snfiles$sn45$date[1], snfiles$sn45$originaldate[1])
```
Lastly, we'll reorder the data so it's chronological, instead of reverse chronological.
```{r}
# sprintf("first date entry : %s", snfiles$sn45$date[1])
# snfiles$sn45 <- snfiles$sn45[order(snfiles$sn45$originaldate),] #put it in chronological order
# sprintf("first date entry : %s", snfiles$sn45$date[1])
```
################################################################################
### Formatting the rest of the QuantAQ files
We'll do what we did for sn45 for all of the QuantAQ files.
In order to easily apply the cleaning to all the datasets, we're going to create a function to apply to the list of data frames. Below is the function.
```{r}
# clean_sensor_data <- function(sensor){
# sensor$date <- ymd_hms(sensor$timestamp, tz="UTC") #parse datetime
# sensor$date <- with_tz(sensor$date, "America/New_York") #change the time according to shifted timezone
# sensor <- mutate(sensor, originaldate = date) #keeping original times for comparing to flight data
# sensor$date <- round_date(ymd_hms(sensor$date, tz="America/New_York"), unit="minute") #round date for merging
# sensor<- sensor[order(sensor$originaldate),] #put it in chronological order
# return(sensor)
# }
```
In order to apply it to every dataframe, we can run lapply (list apply) which will apply a certain function to every element in a list. We're going to run it on every element of the snfiles list, except the first (sn45), since we did that above.
```{r}
# snfiles[c(-1)] <- lapply(snfiles[c(-1)], function(x) clean_sensor_data(x))
```
#### Removing Unwanted Time Intervals
There are time periods in the ambient data that we know we don't want to keep. For example, we don't want to keep any data from periods of calibration. In this section we will be removing time intervals from the data that we don't want.
To show how this is done, we will remove the period [5/20/2020, 7/31/2020] from sn45.
```{r}
rows_to_remove <- snfiles$sn45[date %between% c("2020-05-20 00:00:00 EDT", "2020-07-31 23:59:00 EDT")]
snfiles$sn45 <- anti_join(snfiles$sn45, rows_to_remove)
```
And show that those dates are gone:
```{r}
print( snfiles$sn45[date > "2020-05-10 00:00:00 EDT" & date < "2020-08-10 23:59:00 EDT"])
```
And now we can make the above operation into a function and apply it to all the dataframes.
```{r}
remove_dates <- function(dataset, startdate, enddate){
rows_to_remove <- dataset[date %between% c(startdate, enddate)]
dataset <- anti_join(dataset, rows_to_remove)
return(dataset)
}
```
```{r}
snfiles[c(-1)] <- lapply(snfiles[c(-1)], function(x) remove_dates(x,"2020-05-20 00:00:00 EDT","2020-07-31 23:59:00 EDT"))
```
We can also make sure that all the dataframes start at the same date:
```{r}
snfiles <- lapply(snfiles, function(x) x[date > "2019-09-07 00:00:00 EST"])
```
## Merging
From trial and error, I've found that joining datasets with data.table's keyed join works best. Below is the code to implement it. First, we make sure that each dataset is a data.table with a key column of "date"; this lets us use data.table's join syntax. We then join the meteorological and sensor datasets by date with nomatch = 0, which keeps only the timestamps present in both datasets (an inner join).
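To see what nomatch = 0 does in isolation, here is a tiny sketch with made-up keyed tables (not the real sensor or met data):
```{r}
# two keyed toy tables: only 11:01 appears in both
dt_sensor <- data.table(date = as.POSIXct(c("2020-01-01 11:00", "2020-01-01 11:01"), tz = "UTC"),
                        co   = c(250, 300), key = "date")
dt_met    <- data.table(date = as.POSIXct(c("2020-01-01 11:01", "2020-01-01 11:02"), tz = "UTC"),
                        ws   = c(3.1, 2.8), key = "date")
dt_sensor[dt_met, nomatch = 0] # inner join: only the shared 11:01 timestamp survives
```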
First, we'll implement it for SN45.
```{r}
setDT(metdata, key = "date") # make the meteorological data a data.table
setkey(snfiles$sn45, "date") #set the "key" column to date
sprintf("Original number of sensor data rows: %d", nrow(snfiles$sn45))
sprintf("Original number of met data rows: %d", nrow(metdata))
#sprintf("The number of dates that are the same in the sensor and metdata datetimes: %d", length(intersect(snfiles$sn45$date, metdata$date)))
snfiles$sn45 <- snfiles$sn45[metdata, nomatch = 0] #merge
sprintf("New number of rows in merged data: %d", nrow(snfiles$sn45))
```
Let's see what the vector looks like now:
```{r}
colnames(snfiles$sn45)
```
As you can see, there is only one date column, all of the sensor columns, wind direction, wind speed, and the original meteorological datetime vector.
Now, let's implement it for the rest of the data frames. Again, we'll create a merge function and apply it to every element of the snfiles object (in other words, to every dataset).
```{r}
merge_metdata <- function(sensor, metdata){
setkey(sensor, "date")
sensor <- sensor[metdata, nomatch = 0]
return(sensor)
}
```
```{r}
snfiles[c(-1)] <- lapply(snfiles[c(-1)], function(x) merge_metdata(x, metdata))
```
As before, we exclude the first dataframe, since we already merged it above.
################################################################################
# Data Cleaning
## Checking the data
First, we're going to look at statistics of the data to make sure there isn't a large problem with it (e.g., everything is NA). We'll do this by creating a function that prints out how many values are negative, zero, and NA, and then using lapply to apply it to all the datasets.
```{r}
data_check_test <- function(sensor){
temp_df <- sensor[, c("co", "no" , "correctedNO", "no2", "o3" , "co2", "pm1" , "bin0") ] #define which variables to inspect
na_count <- apply(temp_df, 2, function(x) sum(is.na(x))) #count the number of NA values
negative_count <-apply(temp_df, 2, function(x) sum(x<0, na.rm = TRUE)) #count the number of negative values
zero_count <- apply(temp_df, 2, function(x) sum(x==0, na.rm = TRUE)) #count the number of zero values
return(data.frame(na_count, negative_count, zero_count))
}
```
We'll apply it to our list of dataframes:
```{r}
lapply(snfiles, function(x) data_check_test(x))
```
We'll also test for the mean, 25th percentile and 75th percentile of each variable. This will tell us if the data is in a reasonable range.
We'll create two functions: one which computes the percentiles and one which computes the mean.
```{r}
data_check_percentiles <- function(sensor){
temp_df <- sensor[, c( "co", "no" , "correctedNO", "no2", "o3" , "co2", "pm1" , "bin0") ] #define which variables to inspect
percentiles <-apply(temp_df, 2, function(x) quantile(x, probs =c(0.25, 0.75), na.rm = TRUE)) # apply the quantile function, which calculates percentiles
return(percentiles)
}
data_check_mean <- function(sensor){
temp_df <- sensor[, c( "co", "no" , "correctedNO", "no2", "o3" , "co2", "pm1" , "bin0") ] #define which variables to inspect
mean_vals <-apply(temp_df, 2, function(x) mean(x, na.rm = TRUE)) #apply the mean function
return(mean_vals)
}
```
We'll apply these functions to our datasets:
```{r}
lapply(snfiles, function(x) data_check_percentiles(x))
lapply(snfiles, function(x) data_check_mean(x))
```
## Clean Duplicate Rows
#### Data by Row
There are some rows that are exact duplicates.
```{r}
sum(duplicated(snfiles$sn45) == 1) #checking the number of duplicate entries
```
Let's see what they look like:
```{r}
snfiles$sn45[duplicated(snfiles$sn45) | duplicated(snfiles$sn45, fromLast=TRUE)] #print all duplicates, including the first one
```
These rows are entries that are exactly the same (all the columns match, even timestamp). Since the duplicates carry no additional information, we can remove them:
```{r}
snfiles <- lapply(snfiles, function(x) unique(x)) #keep only unique rows in the dataset
```
We have to remove these first, because they would otherwise skew our pollutant concentration analysis.
Let's make sure they were removed:
```{r}
snfiles$sn45[duplicated(snfiles$sn45) | duplicated(snfiles$sn45, fromLast=TRUE)] #print all duplicates, including the first one
```
## Removing Negative Values
First, we'll remove negative values. We simply find all the points in the pollutant vectors that are less than 0 and replace them: negative O3 values are set to 0, and negative values of the other pollutants are set to NA. More filters can be added with another replace statement in the same format.
```{r}
negative_value_filter <- function(sensor){
sensor$o3 <- replace(sensor$o3, sensor$o3 < 0, 0)
sensor$co <- replace(sensor$co, sensor$co <0, NA)
sensor$co2 <- replace(sensor$co2, sensor$co2 <0, NA)
sensor$no2 <- replace(sensor$no2, sensor$no2 <0, NA)
sensor$bin0 <- replace(sensor$bin0, sensor$bin0 <0, NA)
sensor$pm1 <- replace(sensor$pm1, sensor$pm1 <0, NA)
return(sensor)
}
```
We'll apply this function to all the datasets:
```{r}
snfiles <- lapply(snfiles, function(x) negative_value_filter(x))
```
## Removing NO Negative Values
NO has a changing baseline. It can't be filtered like the pollutants above, because then we would lose real data. Therefore, we're going to apply a baseline correction algorithm to the NO data that was originally developed for Raman spectroscopy. Since the baseline function wasn't made for this application, we have to prepare the data and feed it into the baseline function incrementally. We'll do this in the no_baseline_filter wrapper function below. The wrapper function was developed during summer 2020, and there is a document describing its components on the shared drive.
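To make the call shape clearer before the wrapper, here is a standalone sketch of baseline() on a synthetic signal; the parameters mirror the ones used below, and the data is made up:
```{r}
# synthetic drifting signal with a few spikes, standing in for one day of NO data
synthetic_no <- sin(seq(0, 6, length.out = 600)) * 5 + rnorm(600, sd = 0.2)
synthetic_no[c(150, 300, 450)] <- synthetic_no[c(150, 300, 450)] + 20

fit <- baseline(t(synthetic_no), method = "peakDetection",
                left = 50, right = 50, lwin = 10, rwin = 10) # same settings as the wrapper below
corrected_no <- getCorrected(fit)[1, ] # baseline-subtracted signal
```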
```{r}
no_baseline_filter <- function(sensor){
# create day column
sensor[, day := as.Date(date, format="%Y-%m-%d", tz = "America/New_York")]
# create corrected column
sensor[, correctedNO := seq(0,0,length.out= length(sensor$no))]
sensor$correctedNO[sensor$correctedNO == 0] <- NA #set them actually to NA
dropNAsensor<- sensor[!is.na(sensor$no), ] # drop NO NAs
unique_days <- c(unique(dropNAsensor$day, na.rm=TRUE)) #get all of the unique days in the sensor
for (i in 2:(length(unique_days)-1)){ #for all days
temp <- subset(dropNAsensor, day %in% unique_days[i], c("day", "no", "date")) #create temp dataset
if (nrow(temp) > 550){
wholebase.peakDetection <- baseline(t(temp$no), method='peakDetection',left=50, right=50, lwin=10, rwin=10) #baseline correction
#replace the correctedNO column values with the baseline correction from earlier
dropNAsensor$correctedNO[which(dropNAsensor$date == temp$date[1]): which(dropNAsensor$date == tail(temp$date, n=1))] <- c(getCorrected(wholebase.peakDetection))
}
else{
if (sum(temp$no < 0, na.rm = TRUE) / nrow(temp) < 0.25){
dropNAsensor$correctedNO[which(dropNAsensor$date == temp$date[1]): which(dropNAsensor$date == tail(temp$date, n=1))] <-
sensor$no[which(sensor$date == temp$date[1]): which(sensor$date == tail(temp$date, n=1))]
}
}
}
sensor$correctedNO[which(sensor$date %in% dropNAsensor$date)] <- dropNAsensor$correctedNO # replace values based on date
return(sensor)
}
```
We'll apply this function to all the datasets:
```{r}
snfiles <- lapply(snfiles, function(x) no_baseline_filter(x))
```
## Removing Values Based on Rate of Change of Auxiliary Electrode
Rapid changes in an auxiliary electrode (ae) signal indicate that the corresponding pollutant data could be incorrect. Because of that, we'll filter out values whose ae readings change unexpectedly.
In the future, we could add more auxiliary electrode checks, so we'll create a general auxiliary electrode function that takes in a pollutant and its auxiliary electrode. The function computes the point-to-point change in the ae values and sets the pollutant to NA wherever the absolute change exceeds a user-set multiple of its standard deviation. The function is below:
```{r}
ae_cleaning <- function(sensor, pollutant, ae_val, sd_val){
ae_derivative <- c(0, diff(sensor[[ae_val]])) #create vector of point-to-point changes (diff() ignores na.rm, so it's omitted)
sensor[which(abs(ae_derivative) > sd_val*(sd(abs(ae_derivative), na.rm=TRUE))), pollutant] <- NA #compare sd
return(sensor)
}
```
For now, we're interested in NO. We can filter out NO values where the change in no_ae is more than 2.5 standard deviations:
```{r}
#snfiles_no_test <- lapply(snfiles, function(x) ae_cleaning(x, "no", "no_ae", 2.5))
```
In the future, we can use this function to filter for multiple ae values at once by storing the parameters in a list and looping through them. Below, I provide an example:
```{r}
# parameter_list = list(c("no", "no_ae", 2.5), c("no2", "no2_ae", 2))
#
# for(parameters in parameter_list){
# snfiles <- lapply(snfiles, function(x) ae_cleaning(x, parameters[1], parameters[2], as.numeric(parameters[3])))
# }
```
## Removing High Values without flags
There are unrealistically high values which we know can't be real and we don't have to check. We'll filter those out here.
```{r}
o3_filter <- function(sensor){ #create o3 filter function
sensor$o3 <- replace(sensor$o3, sensor$o3 > 300, NA)
return(sensor)
}
```
```{r}
#sn45 <- o3_filter(sn45)
snfiles <- lapply(snfiles, function(x) o3_filter(x))
```
## Remove CO spikes
First, we look for periods where there are large time gaps between data points.
```{r}
trial <- snfiles$sn45
trial$timediff <- c(NA, difftime(trial$date[-1],
trial$date[-nrow(trial)],
units="mins"))
```
Next, we find the rows where the time difference is more than an hour.
```{r}
large_jumps <- trial[which(trial$timediff > 60), ]
jump_indices <- which(trial$timediff > 60)
large_jumps$date
```
Then, wherever these jumps happen, we mark the following 20 minutes of CO data for removal.
```{r}
trial$removeCO <- "FALSE"
for(jump_index in jump_indices){
#trial$removeCO[jump_index:jump_index+20] := "TRUE"
trial[jump_index:(jump_index+20), removeCO := "TRUE"]
#set(trial,jump_index:jump_index+20, 1:40 ,NA_character_)
}
#trial <- complete.cases(trial)
```
```{r}
# trial2 <- trial[complete.cases(trial)]
# data6 <- filter(trial, rowSums(is.na(trial)) != ncol(trial))
# trial4 <- subset(trial, removeCO != "TRUE")
# trial5 <- trial[removeCO == "FALSE"]
trial2 = trial
trial2$co[trial2$removeCO == "TRUE"] <- NA
```
```{r}
trial[which(trial$timediff > 60), removeCO]
```
```{r}
removeCOspikes <- function(sensor){
#calculate how much time elapses between data points
sensor$timediff <- c(NA, difftime(sensor$date[-1],
sensor$date[-nrow(sensor)],
units="mins"))
#find where there are jumps of an hour or more
jump_indices <- which(sensor$timediff > 60)
# set up which indices should be removed from the data
sensor$removeCO <- "FALSE"
for(jump_index in jump_indices){
sensor[jump_index:(jump_index+20), removeCO := "TRUE"]
}
#remove CO
sensor$co[sensor$removeCO == "TRUE"] <- NA
return(sensor)
}
```
```{r}
snfiles <- lapply(snfiles, function(x) removeCOspikes(x))
```
## Removing High Values with flags
The following function applies flags to the data. Each flag value corresponds to a pollutant that surpassed its limit. In order to change the limit, you can change the limit value directly inside of the function.
```{r}
apply_flags <- function(sensor){
# creating a flags column to flag high values
# flag number to pollutant definition given below:
# co - 1
# co2 - 2
# no2 - 3
# bin0 - 4
# pm1 - 5
# no - 6
sensor$flag <- replicate(nrow(sensor), c(), simplify = FALSE) #create flags column which has a list for every data point
sensor$flag[which(sensor$co > 10000)] <- lapply(sensor$flag[which(sensor$co > 10000)], function(x) c(x, 1)) #add 1 to the flag list for data points with high CO values
sensor$flag[which(sensor$co2 > 2000)] <- lapply(sensor$flag[which(sensor$co2 > 2000)], function(x) c(x, 2)) #add 2 to the flag list for data points with high CO2 values
sensor$flag[which(sensor$no2 > 1000)] <- lapply(sensor$flag[which(sensor$no2 > 1000)], function(x) c(x, 3)) #add 3 to the flag list for data points with high NO2 values
sensor$flag[which(sensor$no > 300)] <- lapply(sensor$flag[which(sensor$no > 300)], function(x) c(x, 6)) #add 6 to the flag list for data points with high NO values
return(sensor)
}
```
The next set of functions plots the flagged values for a given variable.
```{r}
check_flags <- function(sensor, flagval, var_name){
#plots flagged intervals
index_df = create_interval_vector(sensor, flagval) #generate dataframe of flagged intervals
for(pair in 1:nrow(index_df)){ # for each interval
start_indx = index_df[pair,1] #get the start index of the interval
end_indx = index_df[pair,2] #get the end index
if(!is.null(start_indx) & !is.null(end_indx) & length(is.na(start_indx)) != 0 & length(is.na(end_indx)) != 0 & length(difftime(sensor$date[end_indx], sensor$date[start_indx], units="mins") < 2)!=0){ #if the interval lasts less than 2 minutes, set to NA
sensor[[var_name]][start_indx:end_indx] <- NA
}
else{ #otherwise plot it
if(length(is.na(start_indx)) != 0 & length(is.na(end_indx)) != 0){
sprintf("start and stop : %f, %f", start_indx, end_indx)
subset_sensor <- sensor[start_indx:end_indx,]
#cat(sprintf("start and end intervals: %s - %s \n", as.character(sensor$date[start_indx]), as.character(sensor$date[end_indx] )))
timePlot(subset_sensor, pollutant = var_name, key = FALSE, main = paste0(as.character(sensor$date[start_indx]) ," to ", as.character(sensor$date[end_indx] )))
}
}
}
}
print_flag_intervals <- function(sensor, flagval){
#plots flagged intervals
index_df = create_interval_vector(sensor, flagval) #generate dataframe of flagged intervals
if(nrow(index_df) == 0){
print("No flagged data")
}
else{
mod_start_indx <- index_df[,1]
mod_end_indx <- index_df[,2]
rows_to_keep <- which( difftime(sensor$date[mod_end_indx], sensor$date[mod_start_indx], units="mins") > 2)
sprintf("Number of intervals %f", rows_to_keep)
}
}
create_interval_vector <- function(sensor, flagval){
# input a sensor dataframe and a flag (as numeric) to plot
# returns a dataframe that has intervals where the high values start and end
flagindxs <- ifelse(
(
(sensor$flag %like% flagval )
),
1, # if condition is met, put 1
0 # else put 0
)
start_indx <- which(diff(c(0L, flagindxs)) == 1L)
end_indx <- which(diff(c(flagindxs, 0L)) == -1L)
mod_start_indx <- start_indx - 5 #adding 5/10 minutes before, depending on whether data is taken at 1 min or 2 min intervals
mod_end_indx <- end_indx + 5 #adding 5/10 minutes after
index_df <- data.frame(mod_start_indx,mod_end_indx) #create dataframe with corresponding start and end indices where high values happen
#clean dataframe for intervals that start before 0
if(any(index_df$mod_start_indx < 1)){
index_df$mod_start_indx <- replace(index_df$mod_start_indx, which(index_df$mod_start_indx <1), 1 )
}
#clean dataframe for intervals that start after the end of the pollutant vector
if(any(index_df$mod_start_indx > nrow(sensor))){
indxs_pairs <- which(index_df$mod_start_indx > nrow(sensor))
index_df <- index_df[-indxs_pairs, ] #drop interval pairs that start past the end of the data
}
#clean dataframe for intervals that end after the end of the pollutant vector
if(any(index_df$mod_end_indx > nrow(sensor))){
index_df$mod_end_indx <-replace(index_df$mod_end_indx, which(index_df$mod_end_indx > nrow(sensor)), nrow(sensor) )
}
#index_df <- index_df[!duplicated(index_df$mod_end_indx)]
return(index_df)
}
```
We can apply the flags to the sn dataframes:
```{r}
snfiles <- lapply(snfiles, function(x) apply_flags(x))
```
We can check if these flags capture outlying values well by printing out plots where flags occur. First, however, we'll check how many intervals of flagged values there are, to make sure that we're not generating hundreds of graphs.
```{r}
# for(var in 1:6){
# lapply(snfiles, function(x) print_flag_intervals(x, var))
# }
```
If the number of intervals is too high, it might be good to raise the limit.
Otherwise, we can print the plots below. There are two output options: plot directly in R, or print to PDF. To generate the plots, uncomment the "lapply" line; to send them to PDF, also uncomment the "pdf" and "dev.off" lines.
```{r}
flags_list = list(c(1,"co"), c(2,"co2"), c(3,"no2"), c(6, "no"))
for(i in flags_list){
#pdf(paste("check_flags_", i, ".pdf", sep = ""))
#lapply(snfiles, function(x) check_flags(x, i[1], i[2]))
#dev.off()
}
```
The following function will remove flagged pollutant values. The user inputs which flag they want to remove values for, and the function will remove the values that correspond to that flag (as given in the function that creates flags).
```{r}
remove_flags <- function(sensor, flag_val, pollutant){
sensor[which(sensor$flag %like% flag_val), pollutant] <- NA
return(sensor)
}
```
```{r}
flags_list = list(c(1,"co"), c(2,"co2"), c(3,"no2"), c(6, "no"))
for(i in flags_list){
snfiles <- lapply(snfiles, function(x) remove_flags(x, i[1], i[2]))
}
```
################################################################################
# Igor Quality Assurance
Next, we'll export the data to csv, put it into Igor and do a quality inspection.
First, we remove the flags column, since it's a list column and can't be exported:
```{r}
snfiles$sn45$flag <- NULL
snfiles$sn46$flag <- NULL
snfiles$sn49$flag <- NULL
snfiles$sn62$flag <- NULL
snfiles$sn67$flag <- NULL
snfiles$sn72$flag <- NULL
```
Then, we'll convert the date vector to local time.
```{r}
get_localtime <- function(sensor){
##sensor$date_local = sensor$date
sensor$date_local <- with_tz(sensor$date, tzone = "America/New_York")
return(sensor)
}
```
```{r}
snfiles <- lapply(snfiles, function(x) get_localtime(x))
```
We can check that the date is actually local by comparing it against QuantAQ's local timestamp, as in the sketch below.
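A minimal spot check along those lines, left commented because it assumes a timestamp_local column survived the earlier merges (adjust the column name if yours differs):
```{r}
# # compare the converted local date against QuantAQ's own local timestamp for SN45
# head(data.frame(converted_local = snfiles$sn45$date_local,
#                 quantaq_local   = snfiles$sn45$timestamp_local))
```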
Next, we'll make an Igor date column, to make importing into Igor faster:
```{r}
create_date_for_igor <- function(dataset){
#create igor date column
ymd_col <- c(as.character(format(dataset$date, "%Y-%m-%d")) )
middle <- rep(" ", length(ymd_col))
hms_col <- c(as.character(format(dataset$date, "%H:%M:%S")))
dataset$igor_date <- paste(ymd_col, middle, hms_col)
#create igor local date column
ymd_col <- c(as.character(format(dataset$date_local, "%Y-%m-%d")) )
middle <- rep(" ", length(ymd_col))
hms_col <- c(as.character(format(dataset$date_local, "%H:%M:%S")))
dataset$igor_date_local <- paste(ymd_col, middle, hms_col)
return(dataset)
}
```
```{r}
snfiles <- lapply(snfiles, function(x) create_date_for_igor(x))
```