Skip to content

Commit

Permalink
Merge pull request #95 from ModelOriented/add_data
Browse files Browse the repository at this point in the history
adding vignette on geo components
  • Loading branch information
mayer79 authored Jun 7, 2023
2 parents c5a008d + 89f1570 commit 9e3574f
Show file tree
Hide file tree
Showing 11 changed files with 273 additions and 18 deletions.
3 changes: 2 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Package: shapviz
Title: SHAP Visualizations
Version: 0.8.1
Version: 0.9.0
Authors@R: c(
person("Michael", "Mayer", , "mayermichael79@gmail.com", role = c("aut", "cre")),
person("Adrian", "Stando", , "adrian.j.stando@gmail.com", role = "ctb")
Expand Down Expand Up @@ -45,3 +45,4 @@ VignetteBuilder: knitr
Config/testthat/edition: 3
URL: https://github.com/ModelOriented/shapviz
BugReports: https://github.com/ModelOriented/shapviz/issues
LazyData: true
6 changes: 4 additions & 2 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,15 @@
# shapviz 0.8.1
# shapviz 0.9.0

## New features

- New plot function `sv_dependence2D()`: x and y are two features, while their summed SHAP values are shown on the color scale. If `interaction = TRUE`, SHAP interaction values are shown on the color scale instead. The function is vectorized in `x` and/or `y`.
- New plot function `sv_dependence2D()`: x and y coordinates are two features, while their summed SHAP values are shown on the color scale. If `interaction = TRUE`, SHAP interaction values are shown on the color scale instead. The function is vectorized in `x` and/or `y`. This visualization is especially useful for models with geographic components.
- `split(x, f)` splits a "shapviz" object `x` into a "mshapviz" object.

## Documentation

- Slight improvements in help/docu.
- New vignette on models with geographic components.
- Added a fantastic house price dataset with about 14,000 houses sold in Miami-Date County, thanks Steven C. Bourassa.

## API improvements

Expand Down
32 changes: 32 additions & 0 deletions R/data.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
#' Miami-Dade County House Prices
#'
#' @description
#' The dataset contains information on 13,932 single-family homes sold in
#' Miami-Dade County in 2016.
#' Besides publicly available information, the dataset creator Steven C. Bourassa has
#' added distance variables, aviation noise as well as latitude and longitude.
#'
#' More information can be found open-access on <https://www.mdpi.com/1595920>.
#'
#' The dataset can also be downloaded via `miami <- OpenML::getOMLDataSet(43093)$data`.
#'
#' @format A data frame with 13,932 rows and 17 columns:
#' \describe{
#' \item{PARCELNO}{unique identifier for each property. About 1% appear multiple times.}
#' \item{SALE_PRC}{sale price ($)}
#' \item{LND_SQFOOT}{land area (square feet)}
#' \item{TOT_LVG_AREA}{floor area (square feet)}
#' \item{SPEC_FEAT_VAL}{value of special features (e.g., swimming pools) ($)}
#' \item{RAIL_DIST}{distance to the nearest rail line (an indicator of noise) (feet)}
#' \item{OCEAN_DIST}{distance to the ocean (feet)}
#' \item{WATER_DIST}{distance to the nearest body of water (feet)}
#' \item{CNTR_DIST}{distance to the Miami central business district (feet)}
#' \item{SUBCNTR_DI}{distance to the nearest subcenter (feet)}
#' \item{HWY_DIST}{distance to the nearest highway (an indicator of noise) (feet)}
#' \item{age}{age of the structure}
#' \item{avno60plus}{dummy variable for airplane noise exceeding an acceptable level}
#' \item{structure_quality}{quality of the structure}
#' \item{month_sold}{sale month in 2016 (1 = jan)}
#' \item{LATITUDE, LONGITUDE}{Coordinates}
#' }
"miami"
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -172,7 +172,7 @@ sv_dependence2D(shp, x = "carat", y = c("clarity", "color"), alpha = 0.5)

![](man/figures/README-dep2D.png)

This is especially interesting for geographic components in a model.
This is especially interesting for geographic components in a model, see corresponding vignette.

### Interactions

Expand Down
15 changes: 8 additions & 7 deletions cran-comments.md
Original file line number Diff line number Diff line change
@@ -1,19 +1,20 @@
# {shapviz} 0.8.0
# shapviz 0.9.0

Hello CRAN team

- {shapviz} will work more smoothly for upcoming {fastshap} versions. I did this change together with the {fastshap} maintainer (Brandon Greenwell).
- Better help files
- Switched from "import ggplot2" to "ggplot2::function" code style
- I have added a beautiful house price dataset with 14000 transactions from Miami in 2016. A friend of mine has created it (Prof. Steven Bourassa).
- There is a new plot `sv_dependence2D()` that is ideal for visualization of geographic components.
- Added new vignette on SHAP analyses for models with geographic components.

## Checks look good

### check(manual = TRUE, cran = TRUE)

* checking for future file timestamps ... NOTE
unable to verify current time
> checking data for ASCII and uncompressed saves ... OK
WARNING
'qpdf' is needed for checks on size reduction of PDFs

* checking HTML version of manual ... NOTE
> checking HTML version of manual ... NOTE
Skipping checking HTML validation: no command 'tidy' found

### check_rhub(): Some usual notes...
Expand Down
Binary file added data/miami.rda
Binary file not shown.
41 changes: 41 additions & 0 deletions man/miami.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 5 additions & 5 deletions packaging.R
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ library(usethis)
use_description(
fields = list(
Title = "SHAP Visualizations",
Version = "0.8.1",
Version = "0.9.0",
Description = "Visualizations for SHAP (SHapley Additive exPlanations),
such as waterfall plots, force plots, various types of importance plots,
dependence plots, and interaction plots.
Expand All @@ -30,8 +30,7 @@ use_description(
`Authors@R` =
"c(person('Michael', family = 'Mayer', role = c('aut', 'cre'), email = 'mayermichael79@gmail.com'),
person('Adrian', family = 'Stando', role = 'ctb', email = 'adrian.j.stando@gmail.com'))",
Depends = "R (>= 3.6.0)",
LazyData = NULL
Depends = "R (>= 3.6.0)"
),
roxygen = TRUE
)
Expand Down Expand Up @@ -60,7 +59,8 @@ use_build_ignore(c("^packaging.R$", "[.]Rproj$", "^logo.png$"), escape = FALSE)
# use_pipe()

# If your package contains data. Google how to document
# use_data()
# miami <- OpenML::getOMLDataSet(43093)$data
use_data(miami)

# Add short docu in Markdown (without running R code)
use_readme_md()
Expand Down Expand Up @@ -101,7 +101,7 @@ library(devtools)

document()
test()
check(manual = TRUE, cran = TRUE, vignettes = FALSE)
check(manual = TRUE, cran = TRUE, vignettes = TRUE)
build()
# build(binary = TRUE)
install(upgrade = FALSE)
Expand Down
3 changes: 1 addition & 2 deletions vignettes/basic_use.Rmd
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
title: "Using {shapviz}"
title: "Using 'shapviz'"
bibliography: "biblio.bib"
link-citations: true
output: rmarkdown::html_vignette
Expand All @@ -9,7 +9,6 @@ vignette: >
%\VignetteEncoding{UTF-8}
---


```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
Expand Down
16 changes: 16 additions & 0 deletions vignettes/biblio.bib
Original file line number Diff line number Diff line change
Expand Up @@ -33,3 +33,19 @@ @article{lundberg2020
year={2020},
publisher={Nature Publishing Group}
}

@article{Mayer2022MachineLA,
title={Machine Learning Applications to Land and Structure Valuation},
author={Michael Mayer and Steven C. Bourassa and Martin Hoesli and Donato Flavio Scognamiglio},
journal={Journal of Risk and Financial Management},
year={2022}
}

@misc{mayer2022shap,
title={SHAP for additively modeled features in a boosted trees model},
author={Michael Mayer},
year={2022},
eprint={2207.14490},
archivePrefix={arXiv},
primaryClass={stat.ML}
}
163 changes: 163 additions & 0 deletions vignettes/geographic.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,163 @@
---
title: "Geographic Components"
bibliography: "biblio.bib"
link-citations: true
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Geographic Components}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
warning = FALSE,
message = FALSE,
fig.height = 5,
fig.width = 8,
fig.align = "center"
)
```

## Setting

In a model with geographic components, we want to express a functional $T$ (usually the expectation or a quantile) of a response $Y$ as a function $f$ of a set of geographic features (latitude/longitude and/or postal code and/or other features varying with location):

$$
T(Y \mid X^\textrm{geo}, X^\textrm{other}) \approx f(X^\textrm{geo}, X^\textrm{other})
$$
Like any feature, the effect of a single geographic feature $X^{\textrm{geo}, j}$ can be described using SHAP dependence plots. However, studying the effect of latitude (or any other location dependent feature) alone is often not very illuminating - simply due to strong interaction effects and correlations with other geographic features.

That's where the additivity of SHAP values comes into play: The sum of SHAP values of all geographic components represent the total effect of $X^\textrm{geo}$, and this sum can be visualized as a heatmap or 3D scatterplot against latitude/longitude (or any other geographic representation)

## A first example

For illustration, we will use a beautiful house price dataset containing information on about 14'000 houses sold in 2016 in Miami-Date County. Some of the columns are as follows:

- **SALE_PRC**: Sale price in USD: Its logarithm will be our model response.
- *LATITUDE*, *LONGITUDE*: Coordinates
- *CNTR_DIST*: Distance to central business district
- *OCEAN_DIST*: Distance (ft) to the ocean
- *RAIL_DIST*: Distance (ft) to the next railway track
- *HWY_DIST*: Distance (ft) to next highway
- TOT_LVG_AREA: Living area in square feet
- LND_SQFOOT: Land area in square feet
- structure_quality: Measure of building quality (1: worst to 5: best)
- age: Age of the building in years

(Italic features are geographic components.) For more background on this dataset, see @Mayer2022MachineLA.

We will fit an XGBoost model to explain log(price) as a function of lat/long, size, and quality/age.

```{r}
library(xgboost)
library(ggplot2)
library(shapviz)
head(miami)
x_coord <- c("LATITUDE", "LONGITUDE")
x_nongeo <- c("TOT_LVG_AREA", "LND_SQFOOT", "structure_quality", "age")
x <- c(x_coord, x_nongeo)
# Train/valid split
set.seed(1)
ix <- sample(nrow(miami), 0.8 * nrow(miami))
X_train <- data.matrix(miami[ix, x])
X_valid <- data.matrix(miami[-ix, x])
y_train <- log(miami$SALE_PRC[ix])
y_valid <- log(miami$SALE_PRC[-ix])
# Fit XGBoost model with early stopping
dtrain <- xgb.DMatrix(X_train, label = y_train)
dvalid <- xgb.DMatrix(X_valid, label = y_valid)
params <- list(learning_rate = 0.2, objective = "reg:squarederror", max_depth = 5)
fit <- xgb.train(
params = params,
data = dtrain,
watchlist = list(valid = dvalid),
early_stopping_rounds = 20,
nrounds = 1000,
callbacks = list(cb.print.evaluation(period = 100))
)
```

Let's first study selected SHAP dependence plots, evaluated on the validation dataset with around 2800 observations. Note that we could as well use the training data for this purpose, but it is a bit too large.

```{r}
sv <- shapviz(fit, X_pred = X_valid)
sv_dependence(
sv,
v = c("TOT_LVG_AREA", "structure_quality", "LONGITUDE", "LATITUDE"),
alpha = 0.2
)
# And now the two-dimensional plot of the sum of SHAP values
sv_dependence2D(sv, x = "LONGITUDE", y = "LATITUDE") +
coord_equal()
```

The last plot gives a good impression on price levels.

Notes:

1. Since we have modeled logarithmic prices, the effects are on relative scale (0.1 means about 10% above average).
2. Due to interaction effects with non-geographic components, the location effects might depend on features like living area. This is not visible in above plot. We will modify the model now in this respect.

## Two modifications

We will now change above model in two ways, not unlike the model in @Mayer2022MachineLA:

1. We will use additional geographic features like distance to railway track or to the ocean.
2. We will use interaction constraints to allow only interactions between geographic features.

The second step leads to a model that is additive in each non-geographic component and also additive in the combined location effect. According to the technical report @mayer2022shap, SHAP dependence plots of additive components in a boosted trees model are shifted versions of corresponding partial dependence plots (evaluated at observed values). This allows a "Ceteris Paribus" interpretation of SHAP dependence plots of corresponding components.

```{r}
# Extend the feature set
more_geo <- c("CNTR_DIST", "OCEAN_DIST", "RAIL_DIST", "HWY_DIST")
x2 <- c(x, more_geo)
X_train2 <- data.matrix(miami[ix, x2])
X_valid2 <- data.matrix(miami[-ix, x2])
dtrain2 <- xgb.DMatrix(X_train2, label = y_train)
dvalid2 <- xgb.DMatrix(X_valid2, label = y_valid)
# Build interaction constraint vector
ic <- c(
list(which(x2 %in% c(x_coord, more_geo)) - 1),
as.list(which(x2 %in% x_nongeo) - 1)
)
# Modify parameters
params$interaction_constraints <- ic
fit2 <- xgb.train(
params = params,
data = dtrain2,
watchlist = list(valid = dvalid2),
early_stopping_rounds = 20,
nrounds = 1000,
callbacks = list(cb.print.evaluation(period = 100))
)
# SHAP analysis
sv2 <- shapviz(fit2, X_pred = X_valid2)
# Two selected features: Thanks to additivity, structure_quality can be read as
# Ceteris Paribus
sv_dependence(sv2, v = c("structure_quality", "LONGITUDE"), alpha = 0.2)
# Total geographic effect (Ceteris Paribus thanks to additivity)
sv_dependence2D(sv2, x = "LONGITUDE", y = "LATITUDE", add_vars = more_geo) +
coord_equal()
```

Again, the resulting total geographic effect looks reasonable. Note that, unlike in the first example, there are no interactions to non-geographic components, leading to a Ceteris Paribus interpretation.

## References

0 comments on commit 9e3574f

Please sign in to comment.