- If you are using the src files without cloning, make sure:
- You create a plt folder one level above the code:
mkdir ../plt
- You NEED the src/helper.r file in addition to the four src/plot[1-4].r files
- A future feature could be to create the folder automatically
- Clone the repo and run src/plot[1-4].r
- These files rely on the src/helper.r file, which does all the heavy lifting to get and clean the data
- The entire project relies on just two add-on packages, which are auto-installed if missing
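The setup steps above (creating the plt folder and installing missing packages) could be handled by a few small helpers. This is only a sketch: the function names `missing_packages`, `install_if_missing`, and `ensure_dir` are hypothetical and are not taken from src/helper.r, and the package list is left to the caller.

```r
# Hypothetical setup helpers; not the actual contents of src/helper.r.

# Return the subset of `pkgs` that is not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only the packages that are missing (needs network access).
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Create `path` if it does not exist yet; returns TRUE when it exists afterwards.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(dir.exists(path))
}
```

Calling `ensure_dir("../plt")` from the src directory would then stand in for the manual `mkdir ../plt` step, and `install_if_missing("sqldf")` would cover the sqldf dependency.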
Replicate 4 plots using publicly available data from the UC Irvine Machine Learning Repository, specifically the "Individual household electric power consumption Data Set", which is available on the Coursera course web site.
The following is heavily borrowed/modified from the original assignment source:
- Dataset: Electric power consumption [20Mb]
- Description: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
- Variables: The descriptions of the 9 variables in the dataset can be located at the UCI web site
The assignment description warns:
- The dataset has 2,075,259 rows and 9 columns. First calculate a rough estimate of how much memory the dataset will require in memory before reading into R.
So I decided to read only the rows I needed into a data frame, using the sqldf package.
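The rough estimate the assignment asks for can be worked out in a couple of lines. The sketch below assumes every value is stored as an 8-byte numeric; that is a simplification, since the date and time columns are text, so the real footprint will differ somewhat.

```r
# Back-of-the-envelope memory estimate for the full dataset.
# Assumption: each cell costs 8 bytes (an R double).
n_rows <- 2075259
n_cols <- 9
bytes_per_value <- 8

total_bytes <- n_rows * n_cols * bytes_per_value
total_mib <- total_bytes / 2^20

# Report the estimate rounded to one decimal place.
cat(sprintf("Estimated size: %.1f MiB\n", total_mib))
```

Under that assumption the full table needs roughly 140 MiB, which is manageable on most machines, but reading only the two required days keeps the working data frame far smaller.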
To filter for the targeted data, the where clause in the SQL-like statement focuses on the "Date" variable (unpadded "d/m/yyyy" format, NOT zero-padded "dd/mm/yyyy" as claimed). We are given two dates to select, and rather than doing extra transformations we can treat them as strings and check for equality (faster than testing whether a date falls within a range).
So, assuming we have opened a file handle on the needed data set, a SQL query is generated from the two dates, and the result of the query is a data frame. A few extra parameters are passed in to handle the header and separator:
sqlstatement <- sprintf("select * from fhandle where Date = '%s' or Date = '%s'", sDate, eDate)
tdf <- sqldf(sqlstatement, file.format = list(header=TRUE, sep = ";"))
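Because the query compares dates as plain strings, the two date values have to match the file's text exactly. The following sketch shows one way to build them, assuming the file stores dates as unpadded day/month/year strings (for example 1/2/2007 for 1 February 2007). The helpers `to_dmy` and `build_query` are hypothetical names used only for illustration.

```r
# Format a Date as an unpadded "d/m/yyyy" string,
# e.g. as.Date("2007-02-01") becomes "1/2/2007".
to_dmy <- function(d) {
  lt <- as.POSIXlt(d)
  sprintf("%d/%d/%d", lt$mday, lt$mon + 1L, lt$year + 1900L)
}

# Build the SQL-like statement that selects rows for exactly two dates.
# `table` is the name of the file handle variable that sqldf will query.
build_query <- function(table, start, end) {
  sprintf("select * from %s where Date = '%s' or Date = '%s'",
          table, to_dmy(start), to_dmy(end))
}

sDate <- as.Date("2007-02-01")
eDate <- as.Date("2007-02-02")
sqlstatement <- build_query("fhandle", sDate, eDate)
print(sqlstatement)
```

The resulting string can be passed to `sqldf()` in the same way as the statement above. A zero-padded value such as 01/02/2007 would match no rows under this assumption, which is why the formatting step matters.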
- In each section the requested plot is followed by the one created
- The plots generated in this assignment are in the "plt" directory