A package for easy learning of univariate relationships in R using splines.
This package provides functionality to fit a piecewise cubic spline to model the relationship between a univariate X time series and a univariate Y time series. The output is akin to linear regression but has higher "capacity" in that it can learn more complicated relationships despite leveraging only a single predictor (independent variable) time series. It can thus learn very reasonable relationships between X and Y with minimal effort. The package also provides functionality for using the fitted spline models to make predictions based on new X data. Examples of model fitting and prediction workflows are provided in the usage section.
The canonical intended use case is predicting electric demand (Y) from temperature (X). There can be complicated, non-linear relationships between these two variables, but these can be reasonably approximated by the sum of: 1) an "aggregate" relationship between daily average temperature and daily average demand, and 2) intra-day patterns represented as hourly offsets from the daily average load. The spline functionality is used for learning a robust "aggregate" relationship without the inconvenience of manual feature engineering. More details on the modeling approach are provided in the modeling details section.
The model fitting workhorse is R's built-in `smooth.spline`. This package provides a suite of options for controlling how `smooth.spline` fits a spline to the data. In particular, it is designed to return a model with a single critical point (i.e., a single location along the support where the first derivative is zero). This ensures that predicted electric demand increases monotonically as temperatures become more extreme, which helps to prevent overfitting, particularly when the input data has few observations or carries significant uncertainty from other sources (e.g., small geographies).
## Installation

- Install R (and, optionally, RStudio) on your machine.
- In your preferred R environment, install `devtools` via `install.packages("devtools")`.
- In your preferred R environment, install this package via `devtools::install_github(repo="NREL/adaptive-splines")`.
## Setup

- Create a project directory where you will perform your analysis.
- In your project directory, create a `data` directory with `x` and `y` sub-directories, a `fits` directory, and a `plots` directory (a convenience sketch for creating this layout appears after this list).
- Place your input X and Y data into the relevant `data` sub-directories (i.e., `data/x` and `data/y`). See the input data requirements section for details.
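This layout can also be created from within R; the following is a minimal sketch (any method of creating the directories works equally well):

```r
# Create the expected project layout (data/x, data/y, fits, plots).
setwd("/path/to/project")
for (d in c("data/x", "data/y", "fits", "plots")) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
```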
## Usage

Perform your desired analysis by loading the package, setting R's working directory to your project directory, defining your input data parameters, and then running the top-level functions exposed to the user as shown below. Separate code snippets demonstrate the model fitting functionality and the prediction functionality. See the input data requirements and explanation of user-accessible settings sections for further details.

Model fitting should be performed using one or (ideally) multiple full years of X and Y data.
```r
# load the package
library(adaptivesplines)

# set working directory
setwd("/path/to/project")

# define data loading args for model fitting (modify to fit your data, but note that 'yyyy' is required)
x.root <- "hourly_temperature_by_geog_yyyy.csv"
y.root <- "hourly_load_by_geog_yyyy.csv"
data.years <- 2007:2013

# load data
df.list <- load.and.check.data.fxn(x.root, y.root, data.years)

# fit models
results.list <- fit.geogs.fxn(df.list, data.years) # optional args: c(lam.mult, lam.mult.base, degrees.cold, degrees.hot, degrees.to.fit)

# plot modeled "aggregate" relationships
plot.results.fxn(results.list$daily.avg, data.years) # optional args: c(plot.weekdays, plot.weekends)

# write out models
write.results.fxn(results.list) # optional args: c(write.daily.avg, write.intra.day)
```
The prediction workflow can be run immediately after fitting the models (in which case the in-memory `results.list` can be used directly) or in a separate R session leveraging models previously written to CSV. In contrast to the model fitting workflow, the expected use case for prediction is to load one or more days of forecast X observations and use the fitted models to predict each corresponding Y value. Note that `first.date` is used both to read in the correct array of forecast X observations and to determine the appropriate models to employ in generating each hour's predictions.
```r
# load the package
library(adaptivesplines)

# set working directory
setwd("/path/to/project")

# define data loading args for forecasting (modify to fit your data, but note that 'yyyymmdd' is required)
x.preds.root <- "hourly_temperature_by_geog_yyyymmdd.csv"
first.date <- "20500101" # format as 'yyyymmdd'

# load forecast data
x.preds.df <- load.data.fxn("x", x.preds.root, first.date, patt="yyyymmdd")

# make predictions (two approaches available, choose one)
y.preds.df <- predict.fxn(x.preds.df, first.date) # from CSV
y.preds.df <- predict.fxn(x.preds.df, first.date, from.csv=FALSE, fitted.models=results.list) # from memory

# write out predictions
write.preds.fxn(y.preds.df, first.date)
```
The prediction workflow can be readily looped. Each iteration of the loop will result in a set of predictions written to CSV.
```r
# loop example
first.dates <- c("20500101", "20500102", "20500103")
for (first.date in first.dates) {
  print(first.date)
  x.preds.df <- load.data.fxn("x", x.preds.root, first.date, patt="yyyymmdd")
  y.preds.df <- predict.fxn(x.preds.df, first.date)
  write.preds.fxn(y.preds.df, first.date)
}
```
## Input data requirements

All Y arrays should retain their true weekday/weekend/holiday schedule rather than being adjusted to that of a "target year" (if any). This holds regardless of whether the values in the arrays have been normalized (e.g., to account for changes in infrastructure over time), which is the typical load modeling use case at NREL.
In addition, input files are required to satisfy the following (a toy example of a conforming model-fitting file is sketched after this list):

- All X and Y files used in model fitting represent full calendar years of hourly data with size (8760 x `n.geographies`), excluding row names and column names; X files used for prediction can contain any number of complete days of hourly data.
- All arrays have row names (ideally datetimes) and column names (identifying geographies).
- All time series are in the local time zone of each respective geography.
- Row names must be consistent across the X and Y data used for model fitting.
- Only the intersection of the X and Y column names will be retained for model fitting; similarly, only the intersection of column names in the forecast X data and the fitted models will be retained for prediction.
- The three sets of input data files are each named in an internally consistent manner, differing only in their dates (with the date placeholder taking the form of "yyyy" for model fitting and "yyyymmdd" for prediction).
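For concreteness, here is a toy example (illustrative only; the geography names and values are made up) of writing a model-fitting X file that satisfies these requirements:

```r
# Toy model-fitting X file: 8760 hourly rows (one non-leap calendar year),
# datetime row names, geography column names.
# (In practice, timestamps should be in each geography's local time zone.)
datetimes <- format(seq(as.POSIXct("2007-01-01 00:00", tz = "UTC"),
                        by = "hour", length.out = 8760))
x.df <- data.frame(geog_a = rnorm(8760, mean = 15, sd = 10), # synthetic temperatures
                   geog_b = rnorm(8760, mean = 20, sd = 8),
                   row.names = datetimes)
write.csv(x.df, "data/x/hourly_temperature_by_geog_2007.csv")
```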
## Explanation of user-accessible settings

Arguments to `load.and.check.data.fxn`:

- `x.root`: string indicating the structure of the filenames for the independent variable X (e.g., "hourly_temperature_by_geog_yyyy.csv"). Note the "yyyy", which is a placeholder for the four-digit year in each filename.
- `y.root`: string indicating the structure of the filenames for the dependent variable Y (e.g., "hourly_load_by_geog_yyyy.csv"). Note the "yyyy", which is a placeholder for the four-digit year in each filename.
- `data.years`: vector of integers indicating the calendar years of the data.
Arguments to `fit.geogs.fxn`:

- `df.list`: list containing the arrays generated by `load.and.check.data.fxn`.
- `data.years`: described above.
- `lam.mult` (optional): a multiplier (>1) indicating how much to stiffen the `smooth.spline` in each fitting iteration (default: 2).
- `lam.mult.base` (optional): a multiplier (≥1) indicating how much to stiffen the `smooth.spline` on the initial fit, with 1 corresponding to the lambda obtained via leave-one-out cross-validation (default: 2).
- `degrees.cold` (optional): how many degrees to extrapolate the empirical support for cold temperatures (default: 5).
- `degrees.hot` (optional): how many degrees to extrapolate the empirical support for hot temperatures (default: 5).
- `degrees.to.fit` (optional): how many degrees to use when fitting the linear regressions used to extrapolate the empirical support (default: 5; see the sketch after this list).
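To illustrate how the three `degrees.*` settings interact, here is a minimal sketch of the hot-side support extrapolation (assumed logic, not the package's exact internals; `extrapolate.hot` is a hypothetical helper). The cold side is analogous, mirrored at the minimum of the empirical support.

```r
# Extend the fitted spline degrees.hot degrees past the hottest observed
# daily average temperature, using a linear regression fit to the spline's
# predictions over the outermost degrees.to.fit degrees.
extrapolate.hot <- function(fit, x.max, degrees.to.fit = 5, degrees.hot = 5) {
  x.fit <- seq(x.max - degrees.to.fit, x.max, by = 1)
  y.fit <- predict(fit, x.fit)$y                     # spline predictions near the hot edge
  lin <- lm(y.fit ~ x.fit)                           # linear fit over the edge region
  x.new <- seq(x.max + 1, x.max + degrees.hot, by = 1)
  predict(lin, newdata = data.frame(x.fit = x.new))  # extrapolated predictions
}
```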
Arguments to `plot.results.fxn`:

- `daily.average.list`: list containing the "daily average" model fits; a sub-list of the output generated by `fit.geogs.fxn`.
- `data.years`: described above.
- `plot.weekdays` (optional): whether to plot the "weekday" model fits (default: `TRUE`).
- `plot.weekends` (optional): whether to plot the "weekend" model fits (default: `TRUE`).
Arguments to `write.results.fxn` (a usage example follows this list):

- `results.list`: the output generated by `fit.geogs.fxn`.
- `write.daily.avg` (optional): whether to write out the "daily average" model fits (default: `TRUE`).
- `write.intra.day` (optional): whether to write out the "intra-day" model fits (default: `TRUE`).
Arguments to `load.data.fxn`:

- `x.preds.root`: string indicating the structure of the filenames for forecast values of the independent variable X (e.g., "hourly_temperature_by_geog_yyyymmdd.csv"). Note the "yyyymmdd", which is a placeholder for the first prediction date contained in the file.
- `first.date`: the first day to forecast, specified as "yyyymmdd".
- `patt`: string indicating the structure of the date in the input filepath; must match the structure of `first.date` (i.e., "yyyymmdd").
Arguments to `predict.fxn`:

- `x.preds.df`: data frame of forecast X values generated by `load.data.fxn`.
- `first.date`: described above.
- `from.csv` (optional): whether to read the fitted models from CSV files (default: `TRUE`).
- `fitted.models` (optional): list containing the fitted models; required when `from.csv` is `FALSE` (default: `NULL`).
Arguments to `write.preds.fxn`:

- `y.preds.df`: data frame of Y predictions generated by `predict.fxn`.
- `first.date`: described above.
## Modeling details

### Daily average component

This component learns the "aggregate" relationship between X (e.g., temperature) and Y (e.g., electric demand) via R's built-in `smooth.spline`, using all provided years of input data as defined by `x.root`, `y.root`, and `data.years`. The relationship is learned separately for weekdays and weekends, since electric demand is typically higher on weekdays. If a given geography's spline has to be stiffened in order to satisfy the stopping criterion, its final "multiplier count" is printed to the console. Each geography ends up with a daily average Y prediction for each integer daily average X value spanning its historical range (plus predictions for extrapolated daily average X values beyond the historical range). Weekdays and weekends are written out as separate CSV files, each with as many rows as unique temperatures (including extrapolated temperatures, plus a row for the column names, indicating the geography) and as many columns as geographies (plus a column for the row names, indicating the temperature). `NA` is used to populate cells where a given geography did not experience a given temperature.
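As a rough illustration of the stiffening logic described above, here is a minimal sketch (assumed logic, not the package's exact internals; `fit.single.critical.point` is a hypothetical helper, and the package's actual stopping criterion may differ):

```r
# Stiffen a smoothing spline until it has at most one critical point
# (i.e., at most one sign change in the first derivative across the support).
fit.single.critical.point <- function(x, y, lam.mult = 2, lam.mult.base = 2) {
  init <- smooth.spline(x, y, cv = TRUE)   # lambda chosen by leave-one-out CV
  lambda <- init$lambda * lam.mult.base    # stiffen the initial fit
  grid <- seq(min(x), max(x), length.out = 200)
  n.mult <- 0                              # the "multiplier count"
  repeat {
    fit <- smooth.spline(x, y, lambda = lambda)
    d1 <- predict(fit, grid, deriv = 1)$y  # first derivative along the support
    if (sum(diff(sign(d1)) != 0) <= 1) break
    lambda <- lambda * lam.mult            # stiffen further and refit
    n.mult <- n.mult + 1
  }
  list(fit = fit, n.mult = n.mult)
}
```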
### Intra-day component

This component learns separate models of the typical hour-of-day offset for Y for each calendar month x weekday/weekend "level" (i.e., 24 models: 12 months x 2 day types) via simple averaging. Weekdays and weekends are written out as separate CSV files, each with 288 (=12*24) rows (plus a row for the column names, indicating the geography) and as many columns as geographies (plus a column for the row names, indicating the hour-of-day x month combination); the first 24 rows represent the hourly offsets for January.
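A minimal sketch of the offset computation for a single geography and a single weekday/weekend level (assumed logic, not the package's exact internals; `intra.day.offsets` is a hypothetical helper):

```r
# Average hourly deviation from each day's mean load, grouped by
# month and hour-of-day; returns a 12 x 24 (month x hour) matrix.
intra.day.offsets <- function(y, datetimes) {
  day <- as.Date(datetimes)
  offset <- y - ave(y, day, FUN = mean)          # deviation from the daily mean
  month <- as.integer(format(datetimes, "%m"))
  hour <- as.integer(format(datetimes, "%H"))
  tapply(offset, list(month, hour), mean)
}
```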
## Troubleshooting

Guidance for warnings and errors users may encounter:

- `spar-finding: non-finite value inf; using BIG value`: indicates that `smooth.spline` is bounding non-finite values to large finite values for safety when fitting the spline model. This is a warning rather than an error, and a fitted model should still be obtained for the affected geography. Best practice is to view the plot for the affected geography to confirm that a reasonable model has been learned. Note that additional years of data may improve the model fit.
## References

Shalizi, C. (n.d.). Splines. Chapter 7 in *Advanced Data Analysis from an Elementary Point of View*.
Funding for initial capability development provided by the U.S. Department of Energy (DOE) Office of Electricity. Funding for package creation provided by the DOE Solar Energy Technologies Office.
Comments are welcome; please email me.
## Future work

- Treat federal holidays as weekends
- Handle leap years