
Commit

Added the final wildfire post, plus now using the min css versions of the coding themes.
MughilM committed Feb 4, 2024
1 parent aa9ee35 commit 5830c03
Showing 5 changed files with 330 additions and 5 deletions.
12 changes: 9 additions & 3 deletions _includes/my-scripts.html
@@ -18,8 +18,14 @@
</script>

<!-- Highlight.js -->
<link rel="stylesheet" href="/assets/highlight/styles/night-owl.css" media="(prefers-color-scheme: dark), (prefers-color-scheme: no-preference)">
<link rel="stylesheet" href="/assets/highlight/styles/docco.css" media="(prefers-color-scheme: light)">
<link rel="stylesheet" href="/assets/highlight/styles/night-owl.min.css" media="(prefers-color-scheme: dark), (prefers-color-scheme: no-preference)">
<link rel="stylesheet" href="/assets/highlight/styles/docco.min.css" media="(prefers-color-scheme: light)">
<script src="/assets/highlight/highlight.min.js"></script>

<script>hljs.highlightAll()</script>
<!--<script>hljs.highlightAll()</script>-->

<script>
document.getElementById('_pushState').addEventListener('hy-push-state-load', function () {
hljs.highlightAll()
});
</script>
7 changes: 7 additions & 0 deletions _sass/my-style.scss
@@ -45,6 +45,13 @@
}
}

div.highlight {
button {
border-bottom: none;
border-left: none;
}
}

code, pre {
font-family: "Recursive", monospace;

11 changes: 10 additions & 1 deletion wildfires/_posts/2020-05-15-wildfire-preprocessing.md
@@ -14,9 +14,13 @@ image:
7952w: /assets/img/projects/wildfire_orig.jpg
accent_color: '#e25822'
theme_color: '#e25822'
related_posts:
- wildfires/_posts/2020-05-17-wildfire-regression.md
- wildfires/_posts/2020-05-19-wildfire-network.md
---

# Wildfire Incidence Prediction - Part I - Preprocessing

This series of blog posts will go over our methods for predicting wildfires across the United States
given historical wildfire incidences, land usage data, and high-resolution climatology data.
Please refer to the [Wildfire Incidence Prediction (CU)](/_projects/wildfire.md){:.heading.flip-title} to learn more about an
@@ -27,6 +31,7 @@ standalone runnable script, and a notebook containing similar language and some

* toc
{:toc}
* * *

## Introduction
Before building any model, it is important to go over the data on hand, and do any preprocessing needed.
@@ -704,4 +709,8 @@ for res, divs, (LAT_MIN_IDX, LON_MIN_IDX, LAT_MAX_IDX, LON_MAX_IDX) in (
## Conclusion
With this, we have preprocessed all the data we have on hand. The next step is finally to move forward with the
spatial logistic regression. There will actually be extra processing needed in order to format the matrix of values
for model ingestion. Like with this notebook, we will work our way up from the most coarse resolution to the finest.


Continue with [Wildfire Incidence II - Spatial Logistic Regression](2020-05-17-wildfire-regression.md){:.heading.flip-title}
{:.read-more}
9 changes: 8 additions & 1 deletion wildfires/_posts/2020-05-17-wildfire-regression.md
@@ -14,16 +14,21 @@ image:
7952w: /assets/img/projects/wildfire_orig.jpg
accent_color: '#e25822'
theme_color: '#e25822'
related_posts:
- wildfires/_posts/2020-05-15-wildfire-preprocessing.md
- wildfires/_posts/2020-05-19-wildfire-network.md
---

# Wildfire Incidence Prediction - Part II - Spatial Logistic Regression
This is the second in a series of posts detailing how to predict the incidence of wildfires. For input data, we use
historical wildfires, land usage data, and climatology data. In terms of preprocessing, we force all datasets to
have the same 702 x 306 shape. The fire matrix contains a count of how many fires have occurred in that location.
Please see ...part 1 post here... for a detailed code overview of the actual preprocessing.
Please see [Wildfire Incidence I - Preprocessing](/wildfires/_posts/2020-05-15-wildfire-preprocessing.md){:.heading.flip-title} for a detailed
code overview of the actual preprocessing.

* toc
{:toc}
* * *

## Introduction
In this post, I will go over how we perform our spatial logistic regression. It is "spatial" because we use
@@ -756,3 +761,5 @@ Another thing you may have noticed is that the regression model was unable to c

In the next notebook, we will actually apply neural networks to see if we can squeeze even more additional performance out of our data.

Continue with [Wildfire Incidence III - Basic Neural Network](2020-05-19-wildfire-network.md){:.heading.flip-title}
{:.read-more}
296 changes: 296 additions & 0 deletions wildfires/_posts/2020-05-19-wildfire-network.md
@@ -0,0 +1,296 @@
---
layout: post
title: Wildfire Incidence III - Basic Neural Network
description: >
Part 3 of predicting wildfire incidence. I show a basic example of using
a neural network on the same data for prediction.
date: 19 May 2020
image:
path: /assets/img/projects/wildfire_medium.jpg
srcset:
640w: /assets/img/projects/wildfire_small.jpg
1920w: /assets/img/projects/wildfire_medium.jpg
2400w: /assets/img/projects/wildfire_large.jpg
7952w: /assets/img/projects/wildfire_orig.jpg
accent_color: '#e25822'
theme_color: '#e25822'
related_posts:
- wildfires/_posts/2020-05-15-wildfire-preprocessing.md
- wildfires/_posts/2020-05-17-wildfire-regression.md
---

# Wildfire Incidence III - Basic Neural Network

We have seen how to properly preprocess the 3 data sources (historical wildfire incidence, climatology,
and land usage) to get them ready to push through a logistic regression model. However, the beauty of machine
learning is that there is often more than one tool for the job.
{:.lead}

* toc
{:toc}


## Introduction
The previous notebook saved the post-processed data into a CSV so we can easily read it in this notebook.
Due to the relatively limited number of input variables, the network will not be very big or computationally
intensive. Moreover, the network structure allows for non-linear relationships to be built *between* inputs.

It should be noted that this input format is not very well suited to the neural network structure, but we
will press forward regardless.

## Import Packages
As always, we import all the packages we will use. To build the network, we will be using **Tensorflow**.
However, there is one caveat if you are using Windows. On native Windows, the maximum usable Python
version is 3.10, because Tensorflow 2.10 is the last version **that allows for native GPU support.**
To use a newer version of Tensorflow, you need to switch to WSL2 instead. Using Python 3.10 does not
affect the functionality of the other required packages in any way.
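If you are unsure whether your environment actually picked up a GPU, a quick check like the following
(a small sketch, not part of the original notebook) can confirm the Tensorflow version and visible
devices before training:

```python
import tensorflow as tf

# Print the installed Tensorflow version and any GPUs it can see.
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))
```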

```python
import os

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.compose import make_column_transformer
from sklearn.metrics import classification_report

import tensorflow.keras.layers as layers
from tensorflow.keras.models import Sequential

# IMPORTANT, this is the same seed as the logistic regression,
# so the same train test split is generated, and metrics are comparable.
np.random.seed(7145)
ROOT = '../'
```
Notice we used the same seed as the logistic regression. When we perform a train test split, we want to ensure
the same training and testing sets are produced, for a much cleaner comparison.
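As a quick illustration (a sketch, not from the notebook itself), seeding NumPy's global state before each
call to `train_test_split` is exactly what makes two runs produce identical splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Re-seeding before each split makes the otherwise random shuffles identical.
np.random.seed(7145)
a_train, a_test = train_test_split(np.arange(100), test_size=0.2)
np.random.seed(7145)
b_train, b_test = train_test_split(np.arange(100), test_size=0.2)
assert np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
```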

## Metadata and Data Reading
This should be familiar by now, but we will create variables for directory locations, and other convenient
variables. Importantly, we read in the **spatialReg** CSV file that was generated during the logistic
regression notebook after all the lagged variables were calculated. This will save us a lot of time and code.
```python
LAG = 5
RES = '10m'
PREFIX = os.path.join(ROOT, 'data')
PROCESSED_PATH = os.path.join(PREFIX, 'processed', RES)

spatial_reg = pd.read_csv(os.path.join(PROCESSED_PATH, 'spatialReg.csv'))
spatial_reg.head()
```

| | Month | fires\_L1\_% | tavg\_L1\_avg | prec_L1_avg | srad_L1_avg | wind_L1_avg | vapr_L1_avg | LC11_L1_% | LC12_L1_% | LC21_L1_% | ... | LC41_L5_% | LC42_L5_% | LC43_L5_% | LC52_L5_% | LC71_L5_% | LC81_L5_% | LC82_L5_% | LC90_L5_% | LC95_L5_% | fireCenter |
|---|-------|--------------|---------------|-------------|-------------|-------------|-------------|-----------|-----------|-----------|-----|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|------------|
| 0 | Jan | 0.0 | -17.215083 | 18.0 | 4447.666667 | 3.445333 | 0.165133 | 0.666667 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.285714 | 0.714286 | 0.000000 | 0.0 |
| 1 | Jan | 0.0 | -17.008937 | 18.5 | 4425.500000 | 3.405188 | 0.165156 | 0.750000 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.142857 | 0.428571 | 0.428571 | 0.0 |
| 2 | Jan | 0.0 | -16.799875 | 17.5 | 4429.000000 | 3.361625 | 0.168513 | 1.000000 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.857143 | 0.142857 | 0.0 |
| 3 | Jan | 0.0 | 4.225424 | 117.0 | 3243.000000 | 3.450847 | 0.680000 | 1.000000 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.200000 | 0.200000 | 0.000000 | 0.000000 | 0.0 |
| 4 | Jan | 0.0 | 3.318625 | 144.5 | 3287.000000 | 3.139708 | 0.643800 | 0.500000 | 0.0 | 0.0 | ... | 0.0 | 0.142857 | 0.0 | 0.142857 | 0.0 | 0.142857 | 0.142857 | 0.142857 | 0.000000 | 0.0 |

## Preprocessing the data
We need to follow **exactly the same processing steps** we used in the logistic regression. This involves
doing class balancing of no fires vs. fires in the training set, and appropriately transforming the climatology
metrics. Average temperature, solar radiation, and wind speeds are standardized using their mean and standard
deviation, while the precipitation and humidity are power transformed using the Yeo-Johnson method.
All standardization parameters need to be calculated from the training set, and then applied to the testing set.

The code block is fairly long, so for additional details, please refer to the logistic regression notebook.

```python
# First handle the multicollinearity.
for lag in range(1, LAG + 1):
    spatial_reg.drop(columns=f'LC95_L{lag}_%', inplace=True)
# Get rid of rows with NaNs in them...
spatial_reg = spatial_reg.dropna()
# Do a train test split
train_indices, test_indices = train_test_split(np.arange(spatial_reg.shape[0]), test_size=0.2)
train_set = spatial_reg.iloc[train_indices]
test_set = spatial_reg.iloc[test_indices]

# Split the input from output
X_train = train_set.iloc[:, 1:-1]
y_train = train_set['fireCenter']
X_test = test_set.iloc[:, 1:-1]
y_test = test_set['fireCenter']

# Equalize the number of positive and negative samples in the training set...
no_fire_samples = train_set[train_set['fireCenter'] == 0]
fire_samples = train_set[train_set['fireCenter'] == 1]
# Randomly choose the number of 1 samples we have from the no fire samples
chosen_no_fire_samples = no_fire_samples.sample(n=fire_samples.shape[0])
# Concatenate both sets together, and shuffle with .sample(frac=1)
train_set = pd.concat((chosen_no_fire_samples, fire_samples), axis=0, ignore_index=True).sample(frac=1)
print('New number of records:', train_set.shape[0])
print('New proportion of fires:', np.mean(train_set['fireCenter']))

# Split off X and y train again
X_train = train_set.iloc[:, 1:-1]
y_train = train_set['fireCenter']

standard_cols = [col for col in spatial_reg.columns if any(col.startswith(metric)
for metric in ['tavg', 'srad', 'wind'])]
power_cols = [col for col in spatial_reg.columns if any(col.startswith(metric)
for metric in ['prec', 'vapr'])]

transform = make_column_transformer(
(StandardScaler(), standard_cols),
(PowerTransformer(method='yeo-johnson'), power_cols),
remainder='passthrough', # To avoid dropping columns we DON'T transform
verbose_feature_names_out=False
# To get the final mapping of input to output columns without original transformer name.
)
transform.fit(X_train)

# Create a transformed DataFrame, with the transformed data, and the new column ordering
X_transform = pd.DataFrame(data=transform.transform(X_train),
columns=transform.get_feature_names_out(transform.feature_names_in_))
# Now, find the new index ordering
col_index_ordering = [X_transform.columns.get_loc(orig_col) for orig_col in X_train.columns]
# Reindexing into the column list with the new indices will automatically reorder them!
X_transform = X_transform[X_transform.columns[col_index_ordering]]

X_test_transform = pd.DataFrame(data=transform.transform(X_test),
columns=transform.get_feature_names_out(transform.feature_names_in_))
X_test_transform = X_test_transform[X_test_transform.columns[col_index_ordering]]
```
```text
New number of records: 140198
New proportion of fires: 0.5
```

## Building the model
We are ready to build our network. However, because we have comparatively few features (~100),
the network will not have too many parameters. But we can still stack a few layers to make a prediction.
Starting with 105 input features, we will have 3 layers with 75, 50, and 25 output neurons respectively.
The last layer will just consist of a single neuron for the binary prediction. We will use ReLU for
intermediate layer activations, and sigmoid for the final layer, since the output is required to be between 0
and 1.

```python
in_features = X_transform.shape[1]

model = Sequential()
model.add(layers.Dense(75, activation='relu', input_shape=(in_features,)))
model.add(layers.Dense(50, activation='relu'))
model.add(layers.Dense(25, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid')) # Output!

model.summary()
```
```text
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 75)                7950
dense_1 (Dense)              (None, 50)                3800
dense_2 (Dense)              (None, 25)                1275
dense_3 (Dense)              (None, 1)                 26
=================================================================
Total params: 13,051
Trainable params: 13,051
Non-trainable params: 0
_________________________________________________________________
```
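As a quick sanity check on the numbers above, each `Dense` layer has (inputs + 1 bias) × outputs parameters:
(105 + 1) × 75 = 7,950, then (75 + 1) × 50 = 3,800, (50 + 1) × 25 = 1,275, and (25 + 1) × 1 = 26, which
indeed sums to 13,051.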

## Training!
For training, we simply pass in our transformed training data, and provide the transformed test data as
"validation". This way we can see the accuracy as the model trains. We will use the standard
Adam optimizer, and the binary cross entropy loss function.

```python
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_transform, y_train, batch_size=64, epochs=30, validation_data=(X_test_transform, y_test))
```
```text
Epoch 1/30
2191/2191 [==============================] - 12s 5ms/step - loss: 0.4129 - acc: 0.8122 - val_loss: 0.4090 - val_acc: 0.8127
Epoch 2/30
2191/2191 [==============================] - 12s 5ms/step - loss: 0.4034 - acc: 0.8165 - val_loss: 0.4069 - val_acc: 0.8139
Epoch 3/30
2191/2191 [==============================] - 12s 5ms/step - loss: 0.4014 - acc: 0.8172 - val_loss: 0.4147 - val_acc: 0.8102
Epoch 4/30
2191/2191 [==============================] - 10s 4ms/step - loss: 0.4004 - acc: 0.8176 - val_loss: 0.4125 - val_acc: 0.8127
Epoch 5/30
2191/2191 [==============================] - 11s 5ms/step - loss: 0.3991 - acc: 0.8183 - val_loss: 0.4064 - val_acc: 0.8131
[...]
Epoch 24/30
2191/2191 [==============================] - 11s 5ms/step - loss: 0.3864 - acc: 0.8245 - val_loss: 0.4037 - val_acc: 0.8117
Epoch 25/30
2191/2191 [==============================] - 11s 5ms/step - loss: 0.3855 - acc: 0.8244 - val_loss: 0.4050 - val_acc: 0.8112
Epoch 26/30
2191/2191 [==============================] - 11s 5ms/step - loss: 0.3845 - acc: 0.8256 - val_loss: 0.4113 - val_acc: 0.8151
Epoch 27/30
2191/2191 [==============================] - 11s 5ms/step - loss: 0.3844 - acc: 0.8253 - val_loss: 0.3950 - val_acc: 0.8144
Epoch 28/30
2191/2191 [==============================] - 11s 5ms/step - loss: 0.3832 - acc: 0.8260 - val_loss: 0.4182 - val_acc: 0.8119
Epoch 29/30
2191/2191 [==============================] - 11s 5ms/step - loss: 0.3825 - acc: 0.8263 - val_loss: 0.3987 - val_acc: 0.8138
Epoch 30/30
2191/2191 [==============================] - 11s 5ms/step - loss: 0.3819 - acc: 0.8271 - val_loss: 0.4046 - val_acc: 0.8126
<keras.callbacks.History at 0x1db91442f20>
```
On paper, the accuracy is 81%, but in our case, the accuracy doesn't tell the whole story.
We'll show a classification report, just like we did for the logistic regression.

## Evaluation
The evaluation process is straightforward: we grab the predictions with `model.predict` and use a
threshold of 0.5. For any probability greater than or equal to this amount, the model predicts a fire;
anything less, no fire. This is the same threshold the logistic regression operated under, so we can
make as close a comparison as possible.
```python
threshold = 0.5

predictions = model.predict(X_test_transform).ravel()
predictions[predictions < threshold] = 0
predictions[predictions >= threshold] = 1

print(classification_report(y_true=y_test, y_pred=predictions))
```
```text
2289/2289 [==============================] - 4s 2ms/step
              precision    recall  f1-score   support

           0       0.93      0.81      0.87     55761
           1       0.58      0.82      0.67     17475

    accuracy                           0.81     73236
   macro avg       0.75      0.81      0.77     73236
weighted avg       0.85      0.81      0.82     73236
```

Compared with our logistic regression, our metrics are about 1-2 points lower across the board.
Perhaps the fully connected neural network did not work as well as we hoped. There are other neural
network architectures we could try.

## Conclusion
This relatively small post went over how to build a neural network solution to our wildfire incidence problem.
This network is extremely barebones, as it simply took our input data and fed it directly through a
fully connected network. To extract even more performance, different processing of the dataset will be
needed.

It might be tempting to apply an image-recognition CNN to the fire, climatology, and land usage
images themselves, but a conventional CNN may not be viable here. Each pixel in the image is
treated as both an input **and** output. Conventional CNNs produce a single label given a single image.
However, here, we are basically classifying each pixel in the image.

This particular class of image problems is called **image segmentation**, which requires an entirely different
network architecture (a minimal sketch follows below). However, the downside is that we do not have that many
images to work with, since the original climatology images were only separated by month. Additional data would
need to be collected at a finer temporal resolution.
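For illustration only, a per-pixel (segmentation-style) model could look something like the sketch below.
The channel count of 8 is a made-up placeholder, and this is not part of the project's pipeline:

```python
import tensorflow.keras.layers as layers
from tensorflow.keras.models import Sequential

# A fully convolutional sketch: the output keeps the 702 x 306 spatial shape,
# producing one fire/no-fire probability per grid cell.
seg_model = Sequential()
seg_model.add(layers.Conv2D(16, 3, padding='same', activation='relu',
                            input_shape=(702, 306, 8)))  # 8 input channels is hypothetical
seg_model.add(layers.Conv2D(16, 3, padding='same', activation='relu'))
seg_model.add(layers.Conv2D(1, 1, activation='sigmoid'))  # per-pixel probability map
seg_model.compile(optimizer='adam', loss='binary_crossentropy')
```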

Neural networks should only be used as a last resort for most machine learning problems. In fact, we did
not even test very basic solutions. For example, we could have simply assigned fire incidence with a simple
rule (sketched below): if a majority of neighboring cells had a fire, then the central spot had a fire.
This does not require any training, nor does it require the climatology or land usage data. This could also
easily be tested at various lag values.
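A rough sketch of that baseline might look like the following; it is illustrative only, and `fire_grid` is a
hypothetical binary array of the previous period's fire incidence rather than a variable from these notebooks:

```python
import numpy as np
from scipy.ndimage import convolve

def majority_neighbor_baseline(fire_grid: np.ndarray) -> np.ndarray:
    """Predict a fire wherever more than half of the 8 surrounding cells burned."""
    kernel = np.ones((3, 3))
    kernel[1, 1] = 0  # do not count the center cell itself
    neighbor_fires = convolve(fire_grid.astype(float), kernel, mode='constant', cval=0.0)
    return (neighbor_fires > 4).astype(int)
```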

So it is important to stay grounded when choosing a solution, ensuring it is not overly complicated
and that an algorithm is not being forced onto a problem it does not fit.

