Matthew Perry edited this page Oct 16, 2015 · 2 revisions

General process

This is not pyimpute-specific; it is just an overview of how one would generally approach these problems.

  1. Loading spatial data

    • Easiest method: import a table containing training observations with both explanatory and response variables.
    • Alternate method: perform stratified random sampling on the response data and generate training data from the rasters.
    • Another alternate method: use python-raster-stats to grab the data from the explanatory rasters using point data.
  2. Fit a classification or regression model

    • Leverage scikit-learn classifiers or regressors
    • Fit to training data
    • optional: scale the data and reduce its dimensionality
    • optional: tune the model using grid search cross-validation to find optimal parameters
    • optional: create your own ensemble [2]
    • Evaluate:
      • cross-validation (average score over k folds)
      • train/test split
      • metrics [3]
      • confusion matrix
      • compare to dummy estimators
      • identify most informative features [1]
  3. Generate spatial prediction from target data

    • use scikit-learn classifiers to make predictions and generate certainty estimates
    • use GDAL to write the array of predicted classes to a new raster
    • write a raster of the prediction probability for each pixel
    • write rasters (one for each class) with the probability of that class over space
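
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not pyimpute's implementation: the training table and the target "raster" are synthetic NumPy arrays, the sizes (300 samples, 3 bands, a 20 x 30 grid) are made up for the example, and writing the resulting arrays to disk with GDAL is left out.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(0)

# Step 1 (assumed data): a synthetic training table with 3 explanatory
# variables and a 3-class response driven by the first two variables.
n_samples, n_bands = 300, 3
train_xs = rng.rand(n_samples, n_bands)
train_y = (train_xs[:, 0] > 0.5).astype(int) + (train_xs[:, 1] > 0.5).astype(int)

# Step 2: evaluate a classifier with k-fold cross-validation and compare
# it to a dummy estimator that always predicts the most frequent class.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
cv_scores = cross_val_score(clf, train_xs, train_y, cv=5)
dummy_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), train_xs, train_y, cv=5
)

# Train/test split, confusion matrix, and feature importances.
X_train, X_test, y_train, y_test = train_test_split(
    train_xs, train_y, test_size=0.25, random_state=0
)
clf.fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test))
importances = clf.feature_importances_

# Step 3: predict over a target "raster" shaped (bands, rows, cols).
rows, cols = 20, 30
target = rng.rand(n_bands, rows, cols)
pixels = target.reshape(n_bands, -1).T  # one row per pixel, one column per band

clf.fit(train_xs, train_y)  # refit on all training data before predicting
proba = clf.predict_proba(pixels)  # shape (n_pixels, n_classes)

# Predicted class, certainty (highest class probability), and one
# probability layer per class, all reshaped back to the raster grid.
predicted_classes = clf.classes_[proba.argmax(axis=1)].reshape(rows, cols)
certainty = proba.max(axis=1).reshape(rows, cols)
class_probability = proba.T.reshape(len(clf.classes_), rows, cols)
```

Each of the three output arrays could then be written to its own raster (or raster band) using the georeferencing of the explanatory rasters.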

Some resources

[1] http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers?rq=1

[2] http://stackoverflow.com/questions/21506128/best-way-to-combine-probabilistic-classifiers-in-scikit-learn/21544196#21544196

[3] http://scikit-learn.org/stable/modules/model_evaluation.html#prediction-error-metrics

[4] http://scikit-learn.org/stable/auto_examples/plot_classification_probability.html

Example

Please check out the examples

This example walks through the main steps of loading training data, setting up and evaluating a classifier, and using it to predict a raster of the response variable.

[Figure: the resulting prediction raster]

[Figure: the certainty estimates]

Note about performance and memory limitations

Depending on the classifier you use, memory and/or time might become a limiting factor.

The impute method takes an optional argument, linechunk, which tunes performance. Specifically, it determines how many lines (rows) of the raster file are processed at once.

tl;dr: set linechunk as high as possible without exceeding your memory capacity.

In this example, I use the RandomForest classifier. Other classifiers may behave differently but, in general, there is a tradeoff between speed and memory: as you increase linechunk, memory use increases linearly, while performance increases exponentially.

[Figure: memory use as a function of linechunk]

[Figure: performance as a function of linechunk]
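
To make the role of linechunk more concrete, here is a rough sketch of row-chunked prediction. It is not the code of the impute method: `predict_in_chunks` is a hypothetical helper written for this page, the raster is an in-memory NumPy array rather than a file, and only the parameter name linechunk comes from the text above. Only linechunk rows are turned into a feature matrix at any one time, so peak memory grows with linechunk while the number of calls to the classifier shrinks.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def predict_in_chunks(clf, target, linechunk=100):
    """Predict classes for a (bands, rows, cols) array, linechunk rows at a time.

    Hypothetical helper for illustration; returns the (rows, cols) class
    array and the number of predict calls that were needed.
    """
    n_bands, rows, cols = target.shape
    out = np.empty((rows, cols), dtype=clf.classes_.dtype)
    n_calls = 0
    for start in range(0, rows, linechunk):
        stop = min(start + linechunk, rows)
        # Only this block of rows is flattened into a feature matrix.
        block = target[:, start:stop, :]
        pixels = block.reshape(n_bands, -1).T
        out[start:stop, :] = clf.predict(pixels).reshape(stop - start, cols)
        n_calls += 1
    return out, n_calls


# Synthetic training data and target array (assumed, for illustration only).
rng = np.random.RandomState(0)
train_xs = rng.rand(200, 2)
train_y = (train_xs[:, 0] > 0.5).astype(int)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(train_xs, train_y)

target = rng.rand(2, 50, 40)
small, calls_small = predict_in_chunks(clf, target, linechunk=7)
large, calls_large = predict_in_chunks(clf, target, linechunk=50)
```

Both calls produce the same prediction array; the small chunk size needs more passes through the classifier, while the large one holds the whole array in memory at once.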

Note about geostatistics

While kriging and other geostatistical techniques are technically "geospatial prediction", they rely on spatial dependence between observations. The problems for which this module was built are landscape scale and rarely suited to such approaches. There is great potential to meld the two approaches (i.e. consider spatial autocorrelation between training data as well as explanatory variables) but this is currently outside the scope of this module.

The naming and core concept are based on the yaImpute R package.