A new data interface

Chris Beaumont edited this page Feb 18, 2014 · 3 revisions

The following page hashes out a new interface for accessing data with Glue. The idea is to enable Glue to visualize data which are inconvenient or infeasible to load into memory.

The interface described below is pseudocode, and not yet an actual API. It describes the functionality that Glue requires from a data backend.

Basic operations on Data

data.components: Lists what attributes exist in the data. Each item is a ComponentID

data.aggregate(one_or_more_component_id, bin_edges, reducer): Generalized histogram/heatmap interface. To make a histogram of a particular component, call data.aggregate(component_id, [1, 2, 3]). This counts the number of records where component_id falls in each of the bins [1, 2] and [2, 3]. Likewise, passing two ComponentIDs, each with its own bin edges, would compute a heat map. Optionally, a reduction function could compute something other than a count, such as the average of another quantity in each bin.
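As a rough illustration of what aggregate could do, here is a minimal pure-Python sketch. It is not Glue's API: the function signature, the `target` parameter, the sparse dict return value, and the bin-edge convention (half-open bins, with the last bin closed) are all assumptions made for the example, and plain lists stand in for whatever storage a real backend would use.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def bin_index(value, edges):
    """Return the bin containing `value`, or None if it lies outside the edges.

    Assumed convention: bins are half-open [lo, hi), except the last bin,
    which also includes its upper edge. NaN fails the range check and is skipped.
    """
    if not (edges[0] <= value <= edges[-1]):
        return None
    return min(bisect_right(edges, value) - 1, len(edges) - 2)


def aggregate(columns, bin_edges, reducer=len, target=None):
    """Hypothetical stand-in for data.aggregate.

    columns   -- one sequence of values per component being binned
    bin_edges -- one sorted list of edges per component
    reducer   -- function applied to the records collected in each bin
    target    -- optional values to reduce instead of counting records
    Returns a dict mapping bin-index tuples to results; empty bins are omitted.
    """
    groups = defaultdict(list)
    for i, record in enumerate(zip(*columns)):
        idx = tuple(bin_index(v, e) for v, e in zip(record, bin_edges))
        if None in idx:
            continue  # record falls outside the binned range
        groups[idx].append(record if target is None else target[i])
    return {idx: reducer(vals) for idx, vals in groups.items()}


x = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
y = [0, 1, 0, 1, 0, 1]
flux = [10, 20, 30, 40, 50, 60]

histogram = aggregate([x], [[1, 2, 3]])                 # counts per x bin
heatmap = aggregate([x, y], [[1, 2, 3], [0, 0.5, 1]])   # counts per (x, y) bin
mean_flux = aggregate([x], [[1, 2, 3]], reducer=mean, target=flux)
```

With one component the result is a histogram, with two it is a heat map, and the reducer case covers "average of another quantity for each bin". A backend such as a database could implement the same contract with a GROUP BY rather than a Python loop.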

data.slice(component_id, slice_info): Computes a fixed-resolution, cropped, 2D slice through a >=2D dataset. slice_info contains information about the orientation of the slice, view limits, and resolution. Similar to the concept of a fixed resolution buffer in yt.

data.slab(component_id, slab_info, aggregation_func): Similar to slice, but with a range of values on the dimensions perpendicular to the slice. The slab is collapsed using aggregation_func (e.g., a max projection, sum projection, ...) to produce a slice.

data.stats(component_id): Summary statistics for a particular component, including min, max, median, std, mean, 5th and 95th percentiles, and the number of finite records.
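A minimal sketch of stats for an in-memory component follows. The dictionary keys, the choice of population standard deviation, and linear-interpolation percentiles are assumptions; the page does not specify them. Non-finite values (NaN, infinities) are dropped before anything is computed.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def stats(values):
    """Hypothetical stand-in for data.stats on one component."""
    finite = sorted(v for v in values if math.isfinite(v))
    if not finite:
        return {"n_finite": 0}
    return {
        "min": finite[0],
        "max": finite[-1],
        "median": statistics.median(finite),
        "mean": statistics.fmean(finite),
        "std": statistics.pstdev(finite),   # population std: an assumption
        "p5": percentile(finite, 5),
        "p95": percentile(finite, 95),
        "n_finite": len(finite),
    }


summary = stats([1, 2, 3, 4, 5, float("nan"), float("inf")])
```

Exact medians and percentiles require sorting or a full pass over the data, so a backend for very large datasets might return approximate quantiles instead.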

Not all of these features are needed to do everything Glue does at the moment:

  • the reducer function for aggregate isn't needed
  • slab isn't needed

Basic operations on subsets

Subsets would also provide the aggregate and stats functions.

How are subsets defined?

The complexity of a subset determines how difficult it is to perform aggregations and stats in realtime. This list enumerates the various ways subsets can be defined, in increasing order of complexity:

  • A simple inequality on one component: x > 5
  • Boolean combinations of the above: (x > 5) & (y < 10)
  • 2D polygon constraint on 2 components
  • Boolean combinations of the above
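The four kinds of subset definition listed above can be sketched as composable predicates over records. This is only an illustration: the helper names (`gt`, `lt`, `and_`, `or_`, `in_polygon`), the dict-per-record layout, and the use of ray casting for the polygon test are assumptions, not part of the proposed interface.

```python
def gt(name, threshold):
    """Simple inequality: component `name` > threshold."""
    return lambda rec: rec[name] > threshold


def lt(name, threshold):
    """Simple inequality: component `name` < threshold."""
    return lambda rec: rec[name] < threshold


def and_(*preds):
    return lambda rec: all(p(rec) for p in preds)


def or_(*preds):
    return lambda rec: any(p(rec) for p in preds)


def point_in_polygon(px, py, vertices):
    """Ray casting: count polygon edges crossed by a ray going in +x from the point."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > py) != (y2 > py):          # edge straddles the ray's y
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def in_polygon(xname, yname, vertices):
    """2D polygon constraint on two components."""
    return lambda rec: point_in_polygon(rec[xname], rec[yname], vertices)


def mask(subset, records):
    """Evaluate a subset definition into one boolean per record."""
    return [subset(rec) for rec in records]


records = [{"x": 6, "y": 3}, {"x": 6, "y": 12}, {"x": 2, "y": 2}, {"x": 1, "y": 20}]
square = [(0, 0), (4, 0), (4, 4), (0, 4)]

simple = gt("x", 5)
combined = and_(gt("x", 5), lt("y", 10))
polygon = in_polygon("x", "y", square)
mixed = or_(combined, polygon)
```

The ordering by complexity shows up in what a backend can do with each form: inequalities and their combinations translate directly into index lookups or WHERE clauses, while a polygon test generally needs per-record evaluation (possibly after a bounding-box prefilter).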

Note also that some components in a dataset are derived fields, created by passing one or more other components through an arbitrary transformation function.
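To make the consequence concrete, here is a small sketch of a derived field. The `DerivedComponent` class and the column-dict layout are hypothetical. Because the transformation is an arbitrary Python function, a subset such as r > 6 can only be evaluated by first computing r from its parents, which a remote backend may not be able to do on its own.

```python
import math


class DerivedComponent:
    """Hypothetical derived field: func applied record-by-record to parent components."""

    def __init__(self, parents, func):
        self.parents = parents   # names of the components fed into func
        self.func = func         # arbitrary transformation function

    def compute(self, columns):
        """Evaluate the derived values from a dict of parent columns."""
        parent_columns = [columns[name] for name in self.parents]
        return [self.func(*args) for args in zip(*parent_columns)]


columns = {"x": [3.0, 6.0], "y": [4.0, 8.0]}
radius = DerivedComponent(("x", "y"), math.hypot)

r = radius.compute(columns)
subset_mask = [value > 6 for value in r]   # inequality on the derived field
```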

Other ideas

For expensive operations, you can imagine yielding a sequence of increasingly-accurate results. Glue could render the "first impression" of a histogram immediately, and then improve the rendering over time as long as the user wants to wait.
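One way to picture this is a generator that processes the data in chunks and yields a rescaled estimate of the full histogram after each one. This is a sketch under assumptions: the function name and `chunk_size` parameter are made up, the bin convention is half-open with a closed last bin, and the early estimates are only representative if the records arrive in random order.

```python
from bisect import bisect_right


def progressive_histogram(values, bin_edges, chunk_size=1000):
    """Yield successively better estimates of the histogram of `values`.

    Each estimate scales the counts seen so far up to the full dataset size.
    The final yielded result uses every record and is therefore exact.
    """
    nbins = len(bin_edges) - 1
    counts = [0] * nbins
    total = len(values)
    seen = 0
    for start in range(0, total, chunk_size):
        chunk = values[start:start + chunk_size]
        for v in chunk:
            if bin_edges[0] <= v <= bin_edges[-1]:
                b = min(bisect_right(bin_edges, v) - 1, nbins - 1)
                counts[b] += 1
        seen += len(chunk)
        scale = total / seen
        yield [c * scale for c in counts]


values = [1.5, 1.5, 1.5, 2.5]
estimates = list(progressive_histogram(values, [1, 2, 3], chunk_size=2))
```

A viewer could redraw after each yielded estimate and stop consuming the generator as soon as the user changes the view, so no work is wasted on results that will never be shown.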