---
title: "Initial Analysis Walkthrough"
#output: html_notebook
---
# Overview
This is a walkthrough of the Eastie Data Science code created during the summer and fall of 2020. It covers downloading, cleaning, and performing a high-level analysis of the data. At the end, we export the cleaned datasets, which can be used as inputs to the following two walkthroughs: regime definitions and co-location.
# Downloading and Importing
We're going to be using two datasets: QuantAQ pollutant concentration data from the [QuantAQ website](https://www.quant-aq.com/) and meteorology data from the [Iowa Environmental Mesonet](https://mesonet.agron.iastate.edu/request/download.phtml?network=MA_ASOS).
Note: The paragraph above and the following section describe the downloading, storing, and importing process for the normal workflow. That workflow is not always available, however. After the downloading, storing, and importing sections below, there is a section that describes how to do these steps when you need to use the QuantAQ data from the [DropBox folder](https://app.box.com/s/3rs3dasqitqtrwxgtem0ef96jhkbnqtm) instead.
## QuantAQ Data
Download the following QuantAQ files:
* SN45: final and raw data taken from 9/7/2019 to 3/8/2020
* SN49: final and raw data taken from 9/7/2019 to 3/8/2020
* SN62: final and raw data taken from 9/7/2019 to 3/8/2020
* SN67: final and raw data taken from 9/7/2019 to 3/8/2020
* SN72: final and raw data taken from 9/7/2019 to 3/8/2020
## Meteorology Data
Select the following station:
* [BOS] BOSTON/LOGAN INTL
Select the following variables:
* Wind Direction
* Wind Speed [mph]
* Temperature (C)
Select the date range:
* 9/7/2019 to 3/8/2021 (this dataset is not inclusive of the last date)
Select this timezone:
* America/New_York
Use the following download options:
* csv
* no latitude/longitude vectors
* represent missing data with blank string
* denote trace with blank string
## Storing the data
In order to easily import the data, we'll store the data in the following way:
* Store all the QuantAQ data in the folder called "quant_aq", which is a subfolder of the "/data" folder of this workspace. When saving, make sure that the file name starts with the sensor name (e.g., sn45) and contains either "raw" or "final", to indicate which type of data was downloaded. For example: "sn45_final.csv".
* Store the meteorology data in the "/data" folder in this workspace. Store the file as "metdata.csv"
## Importing Data
### Importing Necessary Libraries
Our import and analysis rely on functions from a few packages, which must be loaded for the code to run. The first step is to install each of the following packages with the install.packages("...") command, for example install.packages("lubridate"). Run these in the console and the packages will be installed locally, so that you can load them here. You might get a message that certain functions are "masked" or "overwritten". This means you loaded libraries that have functions with the same name. If you know that these functions do different things from package to package and want to use a specific one, you can specify the function's package with package::function, for example openair::polarPlot().
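As a rough sketch of that one-time setup (the package list simply mirrors the library() calls in the next chunk):
```{r, eval = FALSE}
# one-time installation of the packages loaded in the next chunk (run once per machine)
install.packages(c("lubridate", "data.table", "dplyr", "tidyr",
                   "openair", "stringr", "baseline"))
```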
```{r, warning= FALSE, error = FALSE}
library(lubridate) #date and time functions
library(data.table) #to use the data.table variable type
library(dplyr) #this library allows you to use the %>% operator
library(tidyr) #this library lets you use the complete function to account for time syncing
library(openair) #for plotting and analysis
library(stringr)
library(baseline) # for baseline correction
```
################################################################################
### Import QuantAQ data : Method 1
If we want to import only certain variables, we can define what we want to import ahead of time.
```{r}
# # #defining which variables we'll keep
# final_vars <- c( "timestamp","timestamp_local","temp_box", "temp_manifold", "rh_manifold", "pressure",
# "noise", "solar", "wind_dir", "wind_speed", "co", "no" , "no2", "o3" , "pm1" , "pm25",
# "pm10", "co2" ) #final variables to keep
#
# # temp_box temp_manifold rh_manifold pressure noise solar wind_dir wind_speed
# raw_vars <- c("timestamp","bin0", "bin1", "bin2", "bin3", "bin4", "bin5", "no_ae", "co_ae", "no2_ae") #raw variables to keep
```
Next, we'll import all the final and raw datafiles separately. We do this by finding all the csv's with "final" or "raw" in the name, and then applying data.table's fread() to each of them. If you prefer to import all variables, comment out the "sapply" calls that are currently in use and uncomment the ones below them. The difference is that the active calls use the "select" parameter and the commented ones don't.
```{r}
# #finding all final datasets
# temp = list.files(path = "./data/quant_aq_March2021", pattern = "^.*_final.*\\.csv", full.names = TRUE)
# finalfiles = sapply(temp, FUN = function(x) fread(file=x, data.table = TRUE, select = final_vars), simplify = FALSE,USE.NAMES = TRUE)
# #finalfiles = sapply(temp, function(x) fread(file=x , data.table = TRUE) , simplify = FALSE,USE.NAMES = TRUE)
#
# #finding all raw datasets
# temp = list.files(path = "./data/quant_aq_March2021", pattern = "^.*_raw.*\\.csv", full.names = TRUE)
# rawfiles = sapply(temp, function(x) fread(file=x, data.table = TRUE, select = raw_vars) , simplify = FALSE,USE.NAMES = TRUE)
# #rawfiles = sapply(temp, function(x) fread(file=x, data.table = TRUE) , simplify = FALSE,USE.NAMES = TRUE)
```
The output of the above functions is a list of data tables. Because we included the sapply parameter USE.NAMES, each data.table in the list is named after its file. And because we set data.table = TRUE, we can be sure the data is stored in the data.table format, which has a more efficient runtime for large datasets.
Let's use an easier naming convention. Since we used a specific naming scheme and a specific path, we can extract just the part of each path that identifies the dataset: "sn" plus the sensor number, for example "sn45" or "sn46".
```{r}
# names(finalfiles)<- lapply(names(finalfiles), function(x) substr(x, 9+18, 9+21)) #extract the real sensor names from the current names used in the list
# names(rawfiles)<- lapply(names(rawfiles), function(x) substr(x, 10+17, 10+20))
# names(finalfiles)
# names(rawfiles)
```
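As an aside, a regex-based rename is less brittle than the fixed substr() offsets above if the folder path ever changes. Here is a sketch, kept commented like the rest of Method 1 and assuming filenames such as "sn45_final.csv":
```{r}
# # regex alternative to the fixed substr() offsets: pull "sn" plus digits out of each path
# # (assumes the filenames contain the sensor name, e.g. "sn45_final.csv")
# names(finalfiles) <- str_extract(names(finalfiles), "sn[0-9]+")
# names(rawfiles)   <- str_extract(names(rawfiles), "sn[0-9]+")
```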
Next, we want to combine the final and raw files for a given sensor. However, it will be beneficial to keep the sensors as a list of dataframes, as we'll see later.
Since the final and raw datasets could have different numbers of rows, we'll merge them on the "timestamp" column.
3/17 edit: In the newest version of the data I pulled from the QuantAQ website, the timestamps weren't all synced to the same seconds. Therefore, we have to put all the data in ymd_hms format first and round it to the nearest minute.
```{r}
# clean_sensor_data <- function(sensor){
# sensor$date <- ymd_hms(sensor$timestamp, tz="UTC") #parse datetime
# sensor <- mutate(sensor, originaldate = date) #keeping original times for comparing to flight data
# sensor$date <- round_date(ymd_hms(sensor$date, tz="America/New_York"), unit="minute") #round date for merging
# sensor<- sensor[order(sensor$originaldate),] #put it in chronological order
# return(sensor)
# }
```
```{r}
# rawfiles <- lapply(rawfiles, function(x) clean_sensor_data(x))
# rawfiles <- lapply(rawfiles, function(x) unique(x))
# finalfiles <- lapply(finalfiles, function(x) clean_sensor_data(x))
# finalfiles <- lapply(finalfiles, function(x) unique(x))
```
```{r}
# snfiles <- sapply(names(rawfiles), function(k) merge.data.table(finalfiles[[k]], rawfiles[[k]], by="date", all = FALSE), simplify = FALSE,USE.NAMES = TRUE)
```
#### Proof that Combining Datasets works
In case you aren't following the normal workflow, here is proof that binding the final and raw files still works.
Let's download SN45 and SN46 data from the QuantAQ website. Download a final and a raw datafile for each, from 10/23 to 10/30. Save these files as "sn45_trial_final.csv", "sn45_trial_raw.csv", and so on, and put them in the "quant_aq" data folder.
Import those files here.
```{r}
# sn45_raw <- fread("./data/quant_aq/sn45_trial_raw.csv", data.table = TRUE) # import raw sn45 data
# sn45_final <- fread("./data/quant_aq/sn45_trial_final.csv", data.table = TRUE) # import final sn45 data
#
# sn46_raw <- fread("./data/quant_aq/sn46_trial_raw.csv", data.table = TRUE) # import raw sn46 data
# sn46_final <- fread("./data/quant_aq/sn46_trial_final.csv", data.table = TRUE) # import final sn46 data
```
Let's try merging SN45 together based on the "timestamp" vector.
```{r}
# merge(sn45_raw, sn45_final, by = "timestamp", all = FALSE) #merge both sn45 files
```
Now let's simulate the list of final dataframes and raw dataframes.
```{r}
# finalfiles <- list("sn45" = sn45_final,"sn46" = sn46_final) # create list of final dataframes
# rawfiles <- list("sn45" = sn45_raw, "sn46" = sn46_raw) # create list of raw dataframes
```
We'll bind the appropriate files together in an lapply statement.
```{r}
# snfiles_trial <- sapply(names(finalfiles), function(k) merge.data.table(finalfiles[[k]], rawfiles[[k]], by="timestamp", all = FALSE), simplify = FALSE,USE.NAMES = TRUE) #bind lists of final and raw dataframes
```
This is what's contained in the new dataframes:
```{r}
# colnames(snfiles_trial$sn45)
```
################################################################################
# Downloading, Saving and Importing using Dropbox workflow (Importing Method 2)
## Downloading
Click "Download" in the top right corner of the Dropbox screen. This will download all of the contents in the DropBox folder.
## Saving
The data will download as a zip file. Extract the contents of the zip file, and place them in the "quant_aq" subfolder of the "data" folder in this workspace. Make sure to save the files directly to that folder, and not in a "Final_Customer" subfolder.
## Importing
These datafiles aren't divided into final and raw, so we don't have to combine files. Instead, we can create one list of variables we want to extract (instead of two, like in the normal workflow).
```{r}
# #defining which variables we'll keep
# final_vars <- c( "timestamp","timestamp_local","temp_box", "temp_manifold", "rh_manifold", "pressure",
# "noise", "solar", "wind_dir", "wind_speed", "co", "no" , "no2", "o3" , "pm1" , "pm25",
# "pm10", "co2" , "no_ae", "co_ae", "no2_ae", "bin0", "bin1", "bin2", "bin3", "bin4", "bin5")
```
The data files can be categorized based on which sensor they represent.
Initially, the data files were categorized by whether they are "f0" or "f1" (both categories are represented in the filename); the difference is that one has had filters applied and the other hasn't.
Currently, there are only "f0" files, so we'll specify we're looking at datafiles with "f0" in the name.
```{r}
# temp = list.files(path = "./data/quant_aq/", pattern = "^.*-f0.*\\.csv", full.names = TRUE) #specify files we're looking for
# snfiles = sapply(temp, FUN = function(x) fread(file=x, data.table = TRUE, select = final_vars), simplify = FALSE,USE.NAMES = TRUE) #importing them
```
Like in the normal workflow, we can reformat the names. Again, we'll only keep the relevant parts of the filenames.
```{r}
# new_names <- lapply(names(snfiles), function(x) substr(x, 17, 26)) #extract the real sensor names from the current names used in the list
# names(snfiles) <- new_names #rename list names
# names(snfiles)
```
As we see, these are in a different format than the ones imported from the QuantAQ website. We can fix this with a string replacement: "sn" instead of "Output-0".
```{r}
# names(snfiles) <- lapply(names(snfiles), function(x) str_replace(x, "Output-0", "sn"))
# names(snfiles)
```
################################################################################
# Importing data : Method 3
On 3/19, we decided to merge two dataframes (the QuantAQ data and the box data) in order to get a more complete dataset. We did this in the merging_sensor_data.Rmd file. Now, we're going to import those files.
We saved those merged files locally; the chunk below imports them directly from the "./data/joined/" folder, without needing to remove columns.
```{r}
nm <- list.files(path = "./data/joined/", pattern = "*.csv", full.names = TRUE)
snfiles <- sapply(nm, FUN = function(x) fread(file=x, data.table = TRUE), simplify = FALSE,USE.NAMES = TRUE)
```
```{r}
names(snfiles) <- lapply(names(snfiles), function(x) substr(x, 15, 18)) #extract the real sensor names from the current names used in the list
```
```{r}
set_vector <- function(snfile){
snfile$date <- ymd_hms(snfile$date, tz = "UTC")
return(snfile)
}
```
```{r}
snfiles <- lapply(snfiles, function(x) set_vector(x))
```
################################################################################
# Import Meteorology data
First, we import the file. We can indicate that there is a header.
```{r}
metdata <- fread("data/metdata_temp.csv", header=TRUE, data.table = TRUE) #import meteorology file
```
If you look at the file, you'll see that the datetime is not in the standard datetime format. We can standardize it to the QuantAQ data format using R's built-in as.POSIXct() function, which parses and converts datetime formats. We can also specify which timezone the data is already in. Be sure that the timezone you specify here matches the timezone you originally chose when downloading the data.
```{r}
metdata$date <- as.POSIXct(metdata$valid, format = "%Y-%m-%d %H:%M", tz = "UTC") #parse datetime, treating the timestamps as UTC
#metdata$date <- as.POSIXct(metdata$valid, format = "%Y-%m-%d %H:%M") #alternative: parse datetime in the system's local (East Coast) timezone
```
Next, if we look at the data columns, we see that the column names are unrepresentative of the data. If we look at the date column in particular, we see that the data is not in one-minute intervals like the sensor data, but in five-minute intervals. To solve this second concern, we can add one-minute intervals in between and fill them with the value from the preceding five-minute reading. For example, if wind direction is 0 at 11:00 and 120 at 11:05, the new data would be: 11:00 - 0, 11:01 - 0, 11:02 - 0, 11:03 - 0, 11:04 - 0, 11:05 - 120, 11:06 - 120, ..., 11:09 - 120.
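Here is a minimal, self-contained sketch of that fill-forward behavior with toy values (not the real met data):
```{r}
# toy 5-minute wind-direction readings, mirroring the example above
toy_met <- data.frame(
  date = as.POSIXct(c("2020-01-01 11:00", "2020-01-01 11:05", "2020-01-01 11:10"), tz = "UTC"),
  wd   = c(0, 120, 240)
)
toy_met %>%
  complete(date = seq(from = min(date), to = max(date), by = "1 min")) %>% #insert the missing 1-minute rows
  fill(wd) #carry each 5-minute value forward into the new rows
```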
We can address both of these concerns at once using the dplyr pipe operator (%>%):
```{r}
metdata <- metdata %>%
# if only wind speed and wind direction matter
setnames(old = c("drct", "sped", "valid"), new = c("wd", "ws", "original_met_time")) %>% #rename
na.omit("date") %>% #if any dates are NA, the following function won't work
complete(date = seq(from = min(date), to= max(date), by = "1 min")) %>%#make 1 minute intervals
fill(c("wd", "ws", "tmpc")) #fill those new 1 minute interval rows
#if the entire met dataset matters
# setnames(old = c("drct", "sped", "valid", "relh", "mslp"), new = c("wd", "ws", "original_met_time", "rh", "press")) %>% #rename
# na.omit("date") %>% #if any dates are NA, the following function won't work
# complete(date = seq(from = min(date), to= max(date), by = "1 min")) %>%#make 1 minute intervals
# fill(c("wd", "ws", "rh", "press", "sknt", "tmpc", "press", "dwpc")) #fill those new 1 minute interval rows
```
We specified on the Iowa Mesonet website that missing values should show up as a blank string. When we import into R, these translate to NA values. Since the missing entries are NAs, wind speed and wind direction are imported as numeric vectors; if the missing data had been encoded as "null", for example, the vectors would come in as character strings.
```{r}
# class(metdata$ws)
# class(metdata$wd)
# class(metdata$rh)
# class(metdata$press)
# class(metdata$sknt)
# class(metdata$tmpc)
# class(metdata$press)
# class(metdata$dwpc)
# class(metdata$date)
```
On the website, the wind speed was given in mph. However, we would prefer m/s, so let's convert ws from mph to m/s.
```{r}
metdata$ws <- metdata$ws * (1609/3600) #converting to m/s, 1609 meters per mile, 3600 seconds per hr
```
Lastly, let's remove variables that we won't use:
```{r}
metdata[,c( "station")] <- list(NULL) #getting rid of unnecessary variables
```
```{r}
metdata <- unique(metdata)
```
#################################################################################
# Formatting and Merging Data
## Formatting QuantAQ sensor data
Similar to the meteorology data, we can identify what needs to be reformatted in the sensor data.
Let's walk through it step-by-step for SN45.
### Formatting Walkthrough for SN45
First, let's standardize the date. The ymd_hms function parses datetimes into the Y-M-D H:M:S format, so we'll use it. It also has a parameter for defining the timezone; we know the timestamp data is UTC, so we use that. Afterwards, we can shift the datetime vector into the EST/EDT timezone with lubridate's with_tz() function. We're using timestamp instead of timestamp_local because QuantAQ said it was more reliable, and because we can check the formatted datetime against the local timestamp afterwards.
We'll print a date vector at each step, to show what the functions are doing.
```{r}
# sprintf("unmodified timestamp : %s, timezone : %s", snfiles$sn45$timestamp[1], tz(snfiles$sn45$timestamp[1]))
#
# snfiles$sn45$date <- lubridate::ymd_hms(snfiles$sn45$timestamp, tz="UTC") #parse datetime
#
# sprintf("modified timestamp : %s , timezone : %s", snfiles$sn45$date[1] , tz(snfiles$sn45$timestamp[1]))
#
# snfiles$sn45$date <- lubridate::with_tz(snfiles$sn45$date, "America/New_York") #change timezone
#
# sprintf("modified timezone: %s , timezone : %s ", snfiles$sn45$date[1], tz(snfiles$sn45$date[1]))
# sprintf("compared to local timestamp: %s", snfiles$sn45$timestamp_local[1])
```
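Since the chunk above is commented out in this workflow, here is a small standalone sketch of the same parse-then-shift pattern on a made-up timestamp:
```{r}
# made-up UTC timestamp, just to show the two lubridate steps
ts_utc   <- ymd_hms("2019-09-07 14:30:00", tz = "UTC")   # parse as UTC
ts_local <- with_tz(ts_utc, "America/New_York")          # same instant, displayed in EST/EDT
ts_utc
ts_local
```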
Next, we want to round the data to the nearest minute so that we can align it with the meteorological dataset. We'll preserve the original datetime vector by saving it as a separate column. Then we'll round the dates, so 11:30:29 rounds to 11:30:00 and 11:30:31 rounds to 11:31:00.
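As a quick illustration with the two example times from the sentence above:
```{r}
# rounding behavior of the two example timestamps mentioned above
round_date(ymd_hms("2020-01-01 11:30:29", tz = "UTC"), unit = "minute") # rounds down to 11:30:00
round_date(ymd_hms("2020-01-01 11:30:31", tz = "UTC"), unit = "minute") # rounds up to 11:31:00
```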
```{r}
# snfiles$sn45 <- mutate(snfiles$sn45, originaldate = date) #keeping original times for comparing to flight data
# sprintf("new date object : %s, old date object : %s", snfiles$sn45$date[1], snfiles$sn45$originaldate[1])
# snfiles$sn45$date <- round_date(snfiles$sn45$date, unit="minute") #round date for merging
# sprintf("new date object : %s, old date object : %s", snfiles$sn45$date[1], snfiles$sn45$originaldate[1])
```
Lastly, we'll reorder the data so it's chronological, instead of reverse chronological.
```{r}
# sprintf("first date entry : %s", snfiles$sn45$date[1])
# snfiles$sn45 <- snfiles$sn45[order(snfiles$sn45$originaldate),] #put it in chronological order
# sprintf("first date entry : %s", snfiles$sn45$date[1])
```
################################################################################
### Formatting the rest of the QuantAQ files
We'll do what we did for sn45 for all of the QuantAQ files.
In order to easily apply the cleaning to all the datasets, we're going to create a function to apply to the list of data frames. Below is the function.
```{r}
# clean_sensor_data <- function(sensor){
# sensor$date <- ymd_hms(sensor$timestamp, tz="UTC") #parse datetime
# sensor$date <- with_tz(sensor$date, "America/New_York") #change the time according to shifted timezone
# sensor <- mutate(sensor, originaldate = date) #keeping original times for comparing to flight data
# sensor$date <- round_date(ymd_hms(sensor$date, tz="America/New_York"), unit="minute") #round date for merging
# sensor<- sensor[order(sensor$originaldate),] #put it in chronological order
# return(sensor)
# }
```
In order to apply it to every dataframe, we can run lapply (list apply) which will apply a certain function to every element in a list. We're going to run it on every element of the snfiles list, except the first (sn45), since we did that above.
```{r}
# snfiles[c(-1)] <- lapply(snfiles[c(-1)], function(x) clean_sensor_data(x))
```
#### Removing Unwanted Time Intervals
There are time periods in the ambient data that we know we don't want to keep. For example, we don't want to keep any data from periods of calibration. In this section we will be removing time intervals from the data that we don't want.
To show how this is done, we will remove the period [5/20/2020, 7/31/2020] from sn45.
```{r}
rows_to_remove <- snfiles$sn45[date %between% c("2020-05-20 00:00:00 EDT", "2020-07-31 23:59:00 EDT")]
snfiles$sn45 <- anti_join(snfiles$sn45, rows_to_remove)
```
And show that those dates are gone:
```{r}
print( snfiles$sn45[date > "2020-05-10 00:00:00 EDT" & date < "2020-08-10 23:59:00 EDT"])
```
And now we can make the above operation into a function and apply it to all the dataframes.
```{r}
remove_dates <- function(dataset, startdate, enddate){
rows_to_remove <- dataset[date %between% c(startdate, enddate)]
dataset <- anti_join(dataset, rows_to_remove)
return(dataset)
}
```
```{r}
snfiles[c(-1)] <- lapply(snfiles[c(-1)], function(x) remove_dates(x,"2020-05-20 00:00:00 EDT","2020-07-31 23:59:00 EDT"))
```
We can also make sure that all the dataframes start at the same date:
```{r}
snfiles <- lapply(snfiles, function(x) x[date > "2019-09-07 00:00:00 EST"])
```
## Merging
From trial and error, I've found that joining datasets with data.table's keyed join works best. Below is the code to implement it. First, we make sure that each dataset is a data.table with a key column of "date"; this lets us use data.table's join syntax. We then join the meteorological and sensor datasets by date with nomatch = 0, which keeps only the timestamps present in both datasets (an inner join).
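To see what nomatch = 0 does in isolation, here is a tiny sketch with made-up keyed tables (not the real sensor or met data):
```{r}
# two keyed toy tables: only 11:01 appears in both
dt_sensor <- data.table(date = as.POSIXct(c("2020-01-01 11:00", "2020-01-01 11:01"), tz = "UTC"),
                        co   = c(250, 300), key = "date")
dt_met    <- data.table(date = as.POSIXct(c("2020-01-01 11:01", "2020-01-01 11:02"), tz = "UTC"),
                        ws   = c(3.1, 2.8), key = "date")
dt_sensor[dt_met, nomatch = 0] # inner join: only the shared 11:01 timestamp survives
```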
First, we'll implement it for SN45.
```{r}
setDT(metdata, key = "date") # make the meteorological data a data.table
setkey(snfiles$sn45, "date") #set the "key" column to date
sprintf("Original number of sensor data rows: %d", nrow(snfiles$sn45))
sprintf("Original number of met data rows: %d", nrow(metdata))
#sprintf("The number of dates that are the same in the sensor and metdata datetimes: %d", length(intersect(snfiles$sn45$date, metdata$date)))
snfiles$sn45 <- snfiles$sn45[metdata, nomatch = 0] #merge
sprintf("New number of rows in merged data: %d", nrow(snfiles$sn45))
```
Let's see what the vector looks like now:
```{r}
colnames(snfiles$sn45)
```
As you can see, there is only one date column, all of the sensor columns, wind direction, wind speed, and the original meteorological datetime vector.
Now, let's implement it for the rest of the data frames. Again, we'll create a merge function and apply it to every element of the snfiles object (in other words, to every dataset).
```{r}
merge_metdata <- function(sensor, metdata){
setkey(sensor, "date")
sensor <- sensor[metdata, nomatch = 0]
return(sensor)
}
```
```{r}
snfiles[c(-1)] <- lapply(snfiles[c(-1)], function(x) merge_metdata(x, metdata))
```
As before, we exclude the first dataframe, since we already merged it above.
################################################################################
# Data Cleaning
## Checking the data
First, we're going to look at statistics of the data to make sure there isn't a large problem with it (e.g., everything is NA). We'll do this by creating a function that prints out how many values are negative, zero, and NA, and then using lapply to apply it to all the datasets.
```{r}
data_check_test <- function(sensor){
temp_df <- sensor[, c("co", "no" , "correctedNO", "no2", "o3" , "co2", "pm1" , "bin0") ] #define which variables to inspect
na_count <- apply(temp_df, 2, function(x) sum(is.na(x))) #count the number of NA values
negative_count <-apply(temp_df, 2, function(x) sum(x<0, na.rm = TRUE)) #count the number of negative values
zero_count <- apply(temp_df, 2, function(x) sum(x==0, na.rm = TRUE)) #count the number of zero values
return(data.frame(na_count, negative_count, zero_count))
}
```
We'll apply it to our list of dataframes:
```{r}
lapply(snfiles, function(x) data_check_test(x))
```
We'll also test for the mean, 25th percentile and 75th percentile of each variable. This will tell us if the data is in a reasonable range.
We'll create two functions: one which computes the percentiles and one which computes the mean.
```{r}
data_check_percentiles <- function(sensor){
temp_df <- sensor[, c( "co", "no" , "correctedNO", "no2", "o3" , "co2", "pm1" , "bin0") ] #define which variables to inspect
percentiles <-apply(temp_df, 2, function(x) quantile(x, probs =c(0.25, 0.75), na.rm = TRUE)) # apply the quantile function, which calculates percentiles
return(percentiles)
}
data_check_mean <- function(sensor){
temp_df <- sensor[, c( "co", "no" , "correctedNO", "no2", "o3" , "co2", "pm1" , "bin0") ] #define which variables to inspect
mean_vals <-apply(temp_df, 2, function(x) mean(x, na.rm = TRUE)) #apply the mean function
return(mean_vals)
}
```
We'll apply these functions to our datasets:
```{r}
lapply(snfiles, function(x) data_check_percentiles(x))
lapply(snfiles, function(x) data_check_mean(x))
```
## Clean Duplicate Rows
#### Data by Row
There are some rows that are exact duplicates.
```{r}
sum(duplicated(snfiles$sn45) == 1) #checking the number of duplicate entries
```
Let's see what they look like:
```{r}
snfiles$sn45[duplicated(snfiles$sn45) | duplicated(snfiles$sn45, fromLast=TRUE)] #print all duplicates, including the first one
```
These rows are entries that are exactly the same (all the columns match, even timestamp). Since the duplicates carry no additional information, we can remove them:
```{r}
snfiles <- lapply(snfiles, function(x) unique(x)) #keep only unique rows in the dataset
```
We have to remove these first, because they would otherwise skew our pollutant concentration analysis.
Let's make sure they were removed:
```{r}
snfiles$sn45[duplicated(snfiles$sn45) | duplicated(snfiles$sn45, fromLast=TRUE)] #print all duplicates, including the first one
```
## Removing Negative Values
First, we'll remove negative values. We simply find all the points in the pollutant vectors that are less than 0 and replace them: negative O3 values are set to 0, and negative values of the other pollutants are set to NA. More filters can be added with another replace statement in the same format.
```{r}
negative_value_filter <- function(sensor){
sensor$o3 <- replace(sensor$o3, sensor$o3 < 0, 0)
sensor$co <- replace(sensor$co, sensor$co <0, NA)
sensor$co2 <- replace(sensor$co2, sensor$co2 <0, NA)
sensor$no2 <- replace(sensor$no2, sensor$no2 <0, NA)
sensor$bin0 <- replace(sensor$bin0, sensor$bin0 <0, NA)
sensor$pm1 <- replace(sensor$pm1, sensor$pm1 <0, NA)
return(sensor)
}
```
We'll apply this function to all the datasets:
```{r}
snfiles <- lapply(snfiles, function(x) negative_value_filter(x))
```
## Removing NO Negative Values
NO has a changing baseline. It can't be filtered like the pollutants above, because then we would lose real data. Therefore, we're going to apply a baseline correction algorithm to the NO data that was originally developed for Raman spectroscopy. Since the baseline function wasn't made for this application, we have to prepare the data and feed it into the baseline function incrementally. We'll do this in the no_baseline_filter wrapper function below. The wrapper function was developed during summer 2020, and there is a document describing its components on the shared drive.
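To make the call shape clearer before the wrapper, here is a standalone sketch of baseline() on a synthetic signal; the parameters mirror the ones used below, and the data is made up:
```{r}
# synthetic drifting signal with a few spikes, standing in for one day of NO data
synthetic_no <- sin(seq(0, 6, length.out = 600)) * 5 + rnorm(600, sd = 0.2)
synthetic_no[c(150, 300, 450)] <- synthetic_no[c(150, 300, 450)] + 20

fit <- baseline(t(synthetic_no), method = "peakDetection",
                left = 50, right = 50, lwin = 10, rwin = 10) # same settings as the wrapper below
corrected_no <- getCorrected(fit)[1, ] # baseline-subtracted signal
```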
```{r}
no_baseline_filter <- function(sensor){
# create day column
sensor[, day := as.Date(date, format="%Y-%m-%d", tz = "America/New_York")]
# create corrected column
sensor[, correctedNO := seq(0,0,length.out= length(sensor$no))]
sensor$correctedNO[sensor$correctedNO == 0] <- NA #set them actually to NA
dropNAsensor<- sensor[!is.na(sensor$no), ] # drop NO NAs
unique_days <- c(unique(dropNAsensor$day, na.rm=TRUE)) #get all of the unique days in the sensor
for (i in 2:(length(unique_days)-1)){ #for all days
temp <- subset(dropNAsensor, day %in% unique_days[i], c("day", "no", "date")) #create temp dataset
if (nrow(temp) > 550){
wholebase.peakDetection <- baseline(t(temp$no), method='peakDetection',left=50, right=50, lwin=10, rwin=10) #baseline correction
#replace the correctedNO column values with the baseline correction from earlier
dropNAsensor$correctedNO[which(dropNAsensor$date == temp$date[1]): which(dropNAsensor$date == tail(temp$date, n=1))] <- c(getCorrected(wholebase.peakDetection))
}
else{
if (sum(temp$no < 0, na.rm = TRUE) / nrow(temp) < 0.25){
dropNAsensor$correctedNO[which(dropNAsensor$date == temp$date[1]): which(dropNAsensor$date == tail(temp$date, n=1))] <-
sensor$no[which(sensor$date == temp$date[1]): which(sensor$date == tail(temp$date, n=1))]
}
}
}
sensor$correctedNO[which(sensor$date %in% dropNAsensor$date)] <- dropNAsensor$correctedNO # replace values based on date
return(sensor)
}
```
We'll apply this function to all the datasets:
```{r}
snfiles <- lapply(snfiles, function(x) no_baseline_filter(x))
```
## Removing Values Based on Rate of Change of Auxiliary Electrode
Rapid changes in an auxiliary electrode (ae) signal indicate that the corresponding pollutant data could be incorrect. Because of that, we'll filter out values whose ae readings change unexpectedly.
In the future, we could add more auxiliary electrode checks, so we'll create a general auxiliary electrode function that takes in a pollutant and its auxiliary electrode. The function computes the point-to-point change in the ae values and sets the pollutant to NA wherever the absolute change exceeds a user-set multiple of its standard deviation. The function is below:
```{r}
ae_cleaning <- function(sensor, pollutant, ae_val, sd_val){
ae_derivative <- c(0, diff(sensor[[ae_val]])) #create vector of point-to-point changes (diff() ignores na.rm, so it's omitted)
sensor[which(abs(ae_derivative) > sd_val*(sd(abs(ae_derivative), na.rm=TRUE))), pollutant] <- NA #compare sd
return(sensor)
}
```
For now, we're interested in NO. We can filter out NO values where the change in no_ae is more than 2.5 standard deviations:
```{r}
#snfiles_no_test <- lapply(snfiles, function(x) ae_cleaning(x, "no", "no_ae", 2.5))
```
In the future, we can use this function to filter for multiple ae values at once by storing the parameters in a list and looping through them. Below, I provide an example:
```{r}
# parameter_list = list(c("no", "no_ae", 2.5), c("no2", "no2_ae", 2))
#
# for(parameters in parameter_list){
# snfiles <- lapply(snfiles, function(x) ae_cleaning(x, parameters[1], parameters[2], as.numeric(parameters[3])))
# }
```
## Removing High Values without flags
There are unrealistically high values which we know can't be real and we don't have to check. We'll filter those out here.
```{r}
o3_filter <- function(sensor){ #create o3 filter function
sensor$o3 <- replace(sensor$o3, sensor$o3 > 300, NA)
return(sensor)
}
```
```{r}
#sn45 <- o3_filter(sn45)
snfiles <- lapply(snfiles, function(x) o3_filter(x))
```
## Remove CO spikes
First, we look for periods where there are large time gaps between data points.
```{r}
trial <- snfiles$sn45
trial$timediff <- c(NA, difftime(trial$date[-1],
trial$date[-nrow(trial)],
units="mins"))
```
Next, we find the rows where the time difference is more than an hour.
```{r}
large_jumps <- trial[which(trial$timediff > 60), ]
jump_indices <- which(trial$timediff > 60)
large_jumps$date
```
Then, wherever these jumps happen, we mark the following 20 minutes of CO data for removal.
```{r}
trial$removeCO <- "FALSE"
for(jump_index in jump_indices){
#trial$removeCO[jump_index:jump_index+20] := "TRUE"
trial[jump_index:(jump_index+20), removeCO := "TRUE"]
#set(trial,jump_index:jump_index+20, 1:40 ,NA_character_)
}
#trial <- complete.cases(trial)
```
```{r}
# trial2 <- trial[complete.cases(trial)]
# data6 <- filter(trial, rowSums(is.na(trial)) != ncol(trial))
# trial4 <- subset(trial, removeCO != "TRUE")
# trial5 <- trial[removeCO == "FALSE"]
trial2 = trial
trial2$co[trial2$removeCO == "TRUE"] <- NA
```
```{r}
trial[which(trial$timediff > 60), removeCO]
```
```{r}
removeCOspikes <- function(sensor){
#calculate how much time elapses between data points
sensor$timediff <- c(NA, difftime(sensor$date[-1],
sensor$date[-nrow(sensor)],
units="mins"))
#find where there are jumps of an hour or more
jump_indices <- which(sensor$timediff > 60)
# set up which indices should be removed from the data
sensor$removeCO <- "FALSE"
for(jump_index in jump_indices){
sensor[jump_index:(jump_index+20), removeCO := "TRUE"]
}
#remove CO
sensor$co[sensor$removeCO == "TRUE"] <- NA
return(sensor)
}
```
```{r}
snfiles <- lapply(snfiles, function(x) removeCOspikes(x))
```
## Removing High Values with flags
The following function applies flags to the data. Each flag value corresponds to a pollutant that surpassed its limit. In order to change the limit, you can change the limit value directly inside of the function.
```{r}
apply_flags <- function(sensor){
# creating a flags column to flag high values
# flag number to pollutant definition given below:
# co - 1
# co2 - 2
# no2 - 3
# bin0 - 4
# pm1 - 5
# no - 6
sensor$flag <- replicate(nrow(sensor), c(), simplify = FALSE) #create flags column which has a list for every data point
sensor$flag[which(sensor$co > 10000)] <- lapply(sensor$flag[which(sensor$co > 10000)], function(x) c(x, 1)) #add 1 to the flag list for data points with high CO values
sensor$flag[which(sensor$co2 > 2000)] <- lapply(sensor$flag[which(sensor$co2 > 2000)], function(x) c(x, 2)) #add 2 to the flag list for data points with high CO2 values
sensor$flag[which(sensor$no2 > 1000)] <- lapply(sensor$flag[which(sensor$no2 > 1000)], function(x) c(x, 3)) #add 3 to the flag list for data points with high NO2 values
sensor$flag[which(sensor$no > 300)] <- lapply(sensor$flag[which(sensor$no > 300)], function(x) c(x, 6)) #add 6 to the flag list for data points with high NO values
return(sensor)
}
```
The next set of functions plots the flagged values for a given variable.
```{r}
check_flags <- function(sensor, flagval, var_name){
#plots flagged intervals
index_df = create_interval_vector(sensor, flagval) #generate dataframe of flagged intervals
for(pair in 1:nrow(index_df)){ # for each interval
start_indx = index_df[pair,1] #get the start index of the interval
end_indx = index_df[pair,2] #get the end index
if(!is.null(start_indx) & !is.null(end_indx) & length(is.na(start_indx)) != 0 & length(is.na(end_indx)) != 0 & length(difftime(sensor$date[end_indx], sensor$date[start_indx], units="mins") < 2)!=0){ #if the interval lasts less than 2 minutes, set to NA
sensor[[var_name]][start_indx:end_indx] <- NA
}
else{ #otherwise plot it
if(length(is.na(start_indx)) != 0 & length(is.na(end_indx)) != 0){
sprintf("start and stop : %f, %f", start_indx, end_indx)
subset_sensor <- sensor[start_indx:end_indx,]
#cat(sprintf("start and end intervals: %s - %s \n", as.character(sensor$date[start_indx]), as.character(sensor$date[end_indx] )))
timePlot(subset_sensor, pollutant = var_name, key = FALSE, main = paste0(as.character(sensor$date[start_indx]) ," to ", as.character(sensor$date[end_indx] )))
}
}
}
}
print_flag_intervals <- function(sensor, flagval){
#plots flagged intervals
index_df = create_interval_vector(sensor, flagval) #generate dataframe of flagged intervals
if(nrow(index_df) == 0){
print("No flagged data")
}
else{
mod_start_indx <- index_df[,1]
mod_end_indx <- index_df[,2]
rows_to_keep <- which( difftime(sensor$date[mod_end_indx], sensor$date[mod_start_indx], units="mins") > 2)
sprintf("Number of intervals %f", rows_to_keep)
}
}
create_interval_vector <- function(sensor, flagval){
# input a sensor dataframe and a flag (as numeric) to plot
# returns a dataframe that has intervals where the high values start and end
flagindxs <- ifelse(
(
(sensor$flag %like% flagval )
),
1, # if condition is met, put 1
0 # else put 0
)
start_indx <- which(diff(c(0L, flagindxs)) == 1L)
end_indx <- which(diff(c(flagindxs, 0L)) == -1L)
mod_start_indx <- start_indx - 5 #adding 5/10 minutes before, depending on whether data is taken at 1 min or 2 min intervals
mod_end_indx <- end_indx + 5 #adding 5/10 minutes after
index_df <- data.frame(mod_start_indx,mod_end_indx) #create dataframe with corresponding start and end indices where high values happen
#clean dataframe for intervals that start before 0
if(any(index_df$mod_start_indx < 1)){
index_df$mod_start_indx <- replace(index_df$mod_start_indx, which(index_df$mod_start_indx <1), 1 )
}
#clean dataframe for intervals that start after the end of the pollutant vector
if(any(index_df$mod_start_indx > nrow(sensor))){
indxs_pairs <- which(index_df$mod_start_indx > nrow(sensor))
index_df <- index_df[-indxs_pairs, ] #drop interval pairs that start past the end of the data
}
#clean dataframe for intervals that end after the end of the pollutant vector
if(any(index_df$mod_end_indx > nrow(sensor))){
index_df$mod_end_indx <-replace(index_df$mod_end_indx, which(index_df$mod_end_indx > nrow(sensor)), nrow(sensor) )
}
#index_df <- index_df[!duplicated(index_df$mod_end_indx)]
return(index_df)
}
```
We can apply the flags to the sn dataframes:
```{r}
snfiles <- lapply(snfiles, function(x) apply_flags(x))
```
We can check if these flags capture outlying values well by printing out plots where flags occur. First, however, we'll check how many intervals of flagged values there are, to make sure that we're not generating hundreds of graphs.
```{r}
# for(var in 1:6){
# lapply(snfiles, function(x) print_flag_intervals(x, var))
# }
```
If the number of intervals is too high, it might be good to raise the limit.
Otherwise, we can print the plots below. There are two output options: plot directly in R, or print to PDF. To generate the plots, uncomment the "lapply" line; to send them to PDF, also uncomment the "pdf" and "dev.off" lines.
```{r}
flags_list = list(c(1,"co"), c(2,"co2"), c(3,"no2"), c(6, "no"))
for(i in flags_list){
#pdf(paste("check_flags_", i, ".pdf", sep = ""))
#lapply(snfiles, function(x) check_flags(x, i[1], i[2]))
#dev.off()
}
```
The following function will remove flagged pollutant values. The user inputs which flag they want to remove values for, and the function will remove the values that correspond to that flag (as given in the function that creates flags).
```{r}
remove_flags <- function(sensor, flag_val, pollutant){
sensor[which(sensor$flag %like% flag_val), pollutant] <- NA
return(sensor)
}
```
```{r}
flags_list = list(c(1,"co"), c(2,"co2"), c(3,"no2"), c(6, "no"))
for(i in flags_list){
snfiles <- lapply(snfiles, function(x) remove_flags(x, i[1], i[2]))
}
```
################################################################################
# Igor Quality Assurance
Next, we'll export the data to csv, put it into Igor and do a quality inspection.
First, we remove the flags column, since it's a list column and can't be exported:
```{r}
snfiles$sn45$flag <- NULL
snfiles$sn46$flag <- NULL
snfiles$sn49$flag <- NULL
snfiles$sn62$flag <- NULL
snfiles$sn67$flag <- NULL
snfiles$sn72$flag <- NULL
```
Then, we'll convert the date vector to local time.
```{r}
get_localtime <- function(sensor){
##sensor$date_local = sensor$date
sensor$date_local <- with_tz(sensor$date, tzone = "America/New_York")
return(sensor)
}
```
```{r}
snfiles <- lapply(snfiles, function(x) get_localtime(x))
```
We can check that the date is actually local by comparing it against QuantAQ's local timestamp, as in the sketch below.
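A minimal spot check along those lines, left commented because it assumes a timestamp_local column survived the earlier merges (adjust the column name if yours differs):
```{r}
# # compare the converted local date against QuantAQ's own local timestamp for SN45
# head(data.frame(converted_local = snfiles$sn45$date_local,
#                 quantaq_local   = snfiles$sn45$timestamp_local))
```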
Next, we'll make an Igor date column, to make importing into Igor faster:
```{r}
create_date_for_igor <- function(dataset){
#create igor date column
ymd_col <- c(as.character(format(dataset$date, "%Y-%m-%d")) )
middle <- rep(" ", length(ymd_col))
hms_col <- c(as.character(format(dataset$date, "%H:%M:%S")))
dataset$igor_date <- paste(ymd_col, middle, hms_col)
#create igor local date column
ymd_col <- c(as.character(format(dataset$date_local, "%Y-%m-%d")) )
middle <- rep(" ", length(ymd_col))
hms_col <- c(as.character(format(dataset$date_local, "%H:%M:%S")))
dataset$igor_date_local <- paste(ymd_col, middle, hms_col)
return(dataset)
}
```
```{r}
snfiles <- lapply(snfiles, function(x) create_date_for_igor(x))
```