Skip to content

Latest commit

 

History

History
657 lines (446 loc) · 20 KB

Raster-in-45-min-julia.md

File metadata and controls

657 lines (446 loc) · 20 KB

Raster analysis in Julia in 45 minutes

This tutorial is based on the xarray in 45 minutes https://tutorial.xarray.dev/overview/xarray-in-45-min.html

In this lesson, we discuss cover the basics of Raster analysis data structures. By the end of the lesson, we will be able to:

  • Understand the basic data structures in Julia
  • Inspect DimArray and DimStack objects.
  • Read and write netCDF files using Rasters.jl.
  • Understand that there are many packages that build on top of xarray

We'll start by reviewing the various components of the Xarray data model, represented here visually:

As example data, we use the data in the xarray-data repository.

Here we'll use air temperature from the National Center for Environmental Prediction. Xarray objects have convenient HTML representations to give an overview of what we're working with:

Before we can start we need to activate the current environment.

using Pkg
Pkg.activate(@__DIR__)
using Rasters
using NCDatasets

path = "data/air_temperature.nc"
# First we download the data locally if needed.
if !isfile(path)
    path = download("https://github.com/pydata/xarray-data/blob/master/air_temperature.nc", "air_temperature.nc")
end

# Now we can open the data as a RasterStack
ds = RasterStack(path)

What's in a DimStack?

Many DimArrays!

DimStacks are dictionary-like containers of "DimArray"s. They are a mapping of variable name to DimArray. The RasterStack is a special case of a DimStack with some geospatial information. The DimArray s in the DimStack can share dimensions butdon't have all the same dimensionality. If layers share a dimension name, this dimension is only stored once for the whole DimStack.

# pull out "air" dataarray with dictionary syntax
ds["air"]

You can save some typing by using the "attribute" or "dot" notation and using tab completion.

# pull out dataarray using dot notation
ds.air

What's in a DimArray?

data + (a lot of) metadata

Name (optional)

da = ds.air

Rasters.name(da)

Named dimensions

dims(da) correspond to the axes of your data.

In this case we have 2 spatial dimensions (X and Y) and one temporal dimension (Ti).

dims(da)

You can also extract a single dimension.

# extracting coordinate variables
dims(da, X)

It is useful to think of the values in the dimensions as axis "labels" such as "tick labels" in a figure. These are coordinate locations on a grid at which you have data.

Arbitrary attributes

metadata is a metadata object that can hold arbitrary attributes that describe the underlying data.

metadata(da)
# assign your own attributes!
metadata(da)["myattrs"] = "Mine"

Underlying data

A DimensionalData data structures wrap underlying simpler array-like data structures. These arrays have to fit into the Julia Array interface but can be either in memory arrays or DiskArray.jl arrays for lazy access or part of Xarray is quite extensible allowing for distributed array, GPU arrays, sparse arrays, arrays with units etc. We'll briefly look at this later in this tutorial.

To access the underlying data use the parent function:

parent(da)
# what is the type of the underlying data
typeof(parent(da))

We can change the underlying data type by using the lazy keyword for opening the data. This is especially helpful for very large data or data that is hosted online where we would not want to download the whole dataset before starting the analysis.

dsl = RasterStack(path, lazy=true)
dal = dsl.air
typeof(parent(dal))

Review

DimensionalData provides two main data structures:

  1. DimArrays that wrap underlying data containers and contain associated metadata in the
  2. DimStacks that are dictionary-like containers of DataArrays

DimArrays contain underlying arrays and associated metadata:

  1. Name
  2. Dimension names
  3. Lookup values
  4. Metadata

Why named dimensions and labeled dimensions?

Metadata provides context and provides code that is more legible. This reduces the likelihood of errors from typos and makes analysis more intuitive and fun!

Analysis without named dimensions:

# plot the first timestep
lon = ds.air.dims[1].val.data  # Vector
lat = ds.air.dims[2].val.data  # Vector
temp = parent(da)  # vector
using GLMakie
heatmap(lon, lat, temp[1, :, :])
using Statistics
mean(temp, dims=3)# On what dimensions did we apply the reduction? I can't tell by looking at this line.

Analysis with DimensionalData

How readable is this code?

plot(ds.air[Ti=1])

Use dimension names instead of axis numbers

plot((mean(ds.air, dims=Ti)[Ti=1]))

Extracting data or "indexing"

DimensionalData supports

  • label-based indexing using Selectors
  • position-based indexing using Integer

See the Documentation about Selectors for more.

Label-based indexing

DimensionalData implements label based indexing where you can use the name of the dimension and also the labels for the entries in the dimension.

# here's what our dataset looks like
ds
# We can extract the Time dimension
dims(ds, Ti)
# pull out data for all of 2013-May
ds[Ti=Where(x->yearmonth(x) == (2013, 5))]
# demonstrate slicing, extract all time slices between to given dates
ds[Ti=Date(2013,5,1)..Date(2013,8,1)]
# demonstrate "nearest" indexing
ds[X=Near(240.2)]
# "nearest indexing at multiple points"
ds[X=Near([240.125, 234]), Y=Near([40.1, 50.1])]

These selectors can be mixed for different dimensions. So that we could have a Where selection for time and a nearest neighbor selection in space.

Position-based indexing

This is similar to usual array indexing array[1, 2, 3] but with the power of named dimensions!

# pull out time index 0, lat index 2, and lon index 3
ds.air[Ti=1, Y=2, X=3]  #  much better than ds.air[3, 2, 1]
# demonstrate slicing
ds.air[X=1:10]

Concepts for computation

Consider calculating the mean air temperature per unit surface area for this dataset. Because latitude and longitude correspond to spherical coordinates for Earth's surface, each 2.5x2.5 degree grid cell actually has a different surface area as you move away from the equator! This is because latitudinal length is fixed ($ \delta Lat = R \delta \phi $), but longitudinal length varies with latitude ($ \delta Lon = R \delta \lambda \cos(\phi) $)

So the area element for lat-lon coordinates is

$$ \delta A = R^2 \delta\phi , \delta\lambda \cos(\phi) $$

where $\phi$ is latitude, $\delta \phi$ is the spacing of the points in latitude, $\delta \lambda$ is the spacing of the points in longitude, and $R$ is Earth's radius. (In this formula, $\phi$ and $\lambda$ are measured in radians)

TODO: mention the Rasters.coverage function that does this for you.

lon
# Earth's average radius in meters
R = 6.371e6

# Coordinate spacing for this dataset is 2.5 x 2.5 degrees
dϕ = deg2rad(2.5)
dλ = deg2rad(2.5)

dlat = fill(R * dϕ, dims(ds,X))
dlonval = R .* dλ .* cos.(deg2rad.(dims(ds.air, Y)))
dlon = DimArray(reshape(dlonval, (1, length(dlonval))), (X,dims(ds, Y)))
cell_area = dlat .* dlon

You can apply functions like cos and deg2rad elementwise on array types by using broadcasting in Julia.

Broadcasting: expanding data

In DimensionalData the broadcast does not automatically expand the data therefore we have to reshape the underlying data into a row vector to get a 2D array from the two 1D vectors.

cell_area = dlon .* dlat
cell_area

Alignment: putting data on the same grid

When broadcasting arithmetic operations DimensionalData automatically "aligns" i.e. puts the data on the same grid. In this case cell_area and ds.air are at the same lat, lon points we end up with a result with the same shape (25x53):

ds.air[Ti=1] ./ cell_area

DimensionalData only compares the dimension name and would currently broadcast two arrays with the same size along each dimension together.

# make a copy of cell_area
# then add 1e-5 degrees to latitude
cell_area_bad = similar(cell_area)
set(cell_area_bad, set(dims(cell_area_bad, X), 1:53))
cell_area_bad .* ds.air[Ti=1]

This results in the same values but the dimensions are different.


High level computation

(groupby, resample, rolling, coarsen, weighted)

Rasters has some very useful high level objects that let you do common computations:

  1. groupby : Bin data in to groups and reduce
  2. resample : Groupby specialized for time axes. Either downsample or upsample your data.
  3. rolling : Operate on rolling windows of your data e.g. running mean
  4. coarsen : Downsample your data
  5. weighted : Weight your data before reducing

Below we quickly demonstrate these patterns. See the user guide links above and the tutorial for more.

groupby

# here's ds
ds
groups = groupby(ds, Ti=>seasons())
# make a seasonal mean
seasonal_mean = mean.(groups, dims=Ti)
seasonal_mean
#TODO: Plot the mean map for every season

resample

# Reduce the time dimension to monthly means 
monthlymeans = dropdims.(mean.(groupby(ds.air, Ti=>yearmonth), dims=Ti), dims=Ti)

weighted

# weight by cell_area and take mean over (time, lon)
#ds.weighted(cell_area).mean(["lon", "time"]).air.plot(y="lat");
weightedmean = dropdims(mean(ds.air, dims=(X, Ti)), dims=(X,Ti))
weightedmean
y = dims(weightedmean, Y)
@show y
nx = X(reverse(Float32[75.0, 72.5, 70.0, 67.5, 65.0, 62.5, 60.0, 57.5, 55.0, 52.5, 50.0, 47.5, 45.0, 42.5, 40.0, 37.5, 35.0, 32.5, 30.0, 27.5, 25.0, 22.5, 20.0, 17.5, 15.0]))
ny = Y(reverse(Float32[75.0, 72.5, 70.0, 67.5, 65.0, 62.5, 60.0, 57.5, 55.0, 52.5, 50.0, 47.5, 45.0, 42.5, 40.0, 37.5, 35.0, 32.5, 30.0, 27.5, 25.0, 22.5, 20.0, 17.5, 15.0]))

narr = DimArray(rand(25,25), (ny, nx))
fig, ax, pl = plot(narr)
nsingle = DimArray(rand(25), ny)
fig, ax, pl = plot(nsingle)
ax.finallimits

Visualization

(.plot)

We have seen very simple plots earlier. Xarray also lets you easily visualize 3D and 4D datasets by presenting multiple facets (or panels or subplots) showing variations across rows and/or columns.

# facet the seasonal_mean
seasonal_mean.air.plot(col="season", col_wrap=2);
# contours
seasonal_mean.air.plot.contour(col="season", levels=20, add_colorbar=True);
# line plots too? wut
seasonal_mean.air.mean("lon").plot.line(hue="season", y="lat");

For more see the user guide, the gallery, and the tutorial material.


Reading and writing files

Rasters supports many disk formats. Below is a small example using netCDF. For more see the documentation.

Writing is done through the standard Base interface.

# write to netCDF
write("my-example-dataset.nc", ds)

!!! note To avoid the SerializationWarning you can assign a _FillValue for any NaNs in 'air' array by adding the keyword argument encoding=dict(air={_FillValue=-9999}) TODO: change this to Rasters.jl syntax

# read from disk
fromdisk = Raster("my-example-dataset.nc")

You can also read these in lazily via the lazy=true keyword.

# check that the two are identical
ds == fromdisk

!!! tip A common use case to read datasets that are a collection of many netCDF files. See the documentation for how to handle that.

Finally to read other file formats, you might find yourself reading in the data using a different library and then creating a DataArray(docs, tutorial) from scratch. For example, you might use h5py to open an HDF5 file and then create a Dataset from that. For MATLAB files you might use scipy.io.loadmat or h5py depending on the version of MATLAB file you're opening and then construct a Dataset.


The scientific python ecosystem

Xarray ties in to the larger scientific python ecosystem and in turn many packages build on top of xarray. A long list of such packages is here: https://docs.xarray.dev/en/stable/related-projects.html.

Now we will demonstrate some cool features.

Pandas: tabular data structures

You can easily convert between xarray and pandas structures. This allows you to conveniently use the extensive pandas ecosystem of packages (like seaborn) for your work.

# convert to pandas dataframe
df = ds.isel(time=slice(10)).to_dataframe()
df
# convert dataframe to xarray
df.to_xarray()

Alternative array types

This notebook has focused on Numpy arrays. Xarray can wrap other array types! For example:

distributed parallel arrays & Xarray user guide on Dask

pydata/sparse : sparse arrays

GPU arrays & cupy-xarray

pint : unit-aware arrays & pint-xarray

Dask

Dask cuts up NumPy arrays into blocks and parallelizes your analysis code across these blocks

# demonstrate dask dataset
dasky = xr.tutorial.open_dataset(
    "air_temperature",
    chunks={"time": 10},  # 10 time steps in each block
)

dasky.air

All computations with dask-backed xarray objects are lazy, allowing you to build up a complicated chain of analysis steps quickly

# demonstrate lazy mean
dasky.air.mean("lat")

To get concrete values, call .compute or .load

# "compute" the mean
dasky.air.mean("lat").compute()

HoloViz

Quickly generate interactive plots from your data!

The hvplot package attaches itself to all xarray objects under the .hvplot namespace. So instead of using .plot use .hvplot

import hvplot.xarray

ds.air.hvplot(groupby="time", clim=(270, 300), widget_location='bottom')
The time slider will only work if you're executing the notebook, rather than viewing the website

cf_xarray

cf_xarray is a project that tries to let you make use of other CF attributes that xarray ignores. It attaches itself to all xarray objects under the .cf namespace.

Where xarray allows you to specify dimension names for analysis, cf_xarray lets you specify logical names like "latitude" or "longitude" instead as long as the appropriate CF attributes are set.

For example, the "longitude" dimension in different files might be labelled as: (lon, LON, long, x…), but cf_xarray let's you always refer to the logical name "longitude" in your code:

import cf_xarray
# describe cf attributes in dataset
ds.air.cf

The following mean operation will work with any dataset that has appropriate attributes set that allow detection of the "latitude" variable (e.g. units: "degress_north" or standard_name: "latitude")

# demonstrate equivalent of .mean("lat")
ds.air.cf.mean("latitude")
# demonstrate indexing
ds.air.cf.sel(longitude=242.5, method="nearest")

Other cool packages

  • xgcm : grid-aware operations with xarray objects
  • xrft : fourier transforms with xarray
  • xclim : calculating climate indices with xarray objects
  • intake-xarray : forget about file paths
  • rioxarray : raster files and xarray
  • xesmf : regrid using ESMF
  • MetPy : tools for working with weather data

Check the Xarray Ecosystem page and this tutorial for even more packages and demonstrations.

Next

  1. Read the tutorial material and user guide
  2. See the description of common terms used in the xarray documentation:
  3. Answers to common questions on "how to do X" with Xarray are here
  4. Ryan Abernathey has a book on data analysis with a chapter on Xarray
  5. Project Pythia has foundational and more advanced material on Xarray. Pythia also aggregates other Python learning resources.
  6. The Xarray Github Discussions and Pangeo Discourse are good places to ask questions.
  7. Tell your friends! Tweet!

Welcome!

DimensionalData and the whole Julia data analysis ecosystem is an open-source project and gladly welcomes all kinds of contributions. This could include reporting bugs, discussing new enhancements, contributing code, helping answer user questions, contributing documentation (even small edits like fixing spelling mistakes or rewording to make the text clearer). Welcome!