Rename data cube indexing to pre-aggregation. (uwdata#566)
* feat!: Rename DataCubeIndexer to PreAggregator.

* filterIndexable -> optimizable, and update some more mentions of indexes

* optimizable -> filterStable

---------

Co-authored-by: Dominik Moritz <domoritz@gmail.com>
jheer and domoritz authored Oct 24, 2024
1 parent 6c50738 commit 56756b0
Showing 34 changed files with 217 additions and 194 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -38,7 +38,7 @@ _Note_: For convenience, the `vgplot` package re-exports much of the `mosaic-cor

### Core Components

- [`mosaic-core`](https://github.com/uwdata/mosaic/tree/main/packages/core): The core Mosaic components. A central coordinator, parameters and selections for linking scalar values or query predicates (respectively) across Mosaic clients, and filter groups with optimized index management. The Mosaic coordinator can send queries either over the network to a backing server (`socket` and `rest` clients) or to a client-side [DuckDB-WASM](https://github.com/duckdb/duckdb-wasm) instance (`wasm` client).
- [`mosaic-core`](https://github.com/uwdata/mosaic/tree/main/packages/core): The core Mosaic components. A central coordinator, parameters and selections for linking scalar values or query predicates (respectively) across Mosaic clients, and filter groups with materialized views of pre-aggregated data. The Mosaic coordinator can send queries either over the network to a backing server (`socket` and `rest` clients) or to a client-side [DuckDB-WASM](https://github.com/duckdb/duckdb-wasm) instance (`wasm` client).
- [`mosaic-sql`](https://github.com/uwdata/mosaic/tree/main/packages/sql): An API for convenient construction and analysis of SQL queries. Query objects then coerce to SQL query strings.
- [`mosaic-inputs`](https://github.com/uwdata/mosaic/tree/main/packages/inputs): Standalone data-driven components such as input menus, text search boxes, and sortable, load-on-scroll data tables.
- [`mosaic-plot`](https://github.com/uwdata/mosaic/tree/main/packages/plot): An interactive grammar of graphics implemented on top of [Observable Plot](https://github.com/observablehq/plot). Marks (plot layers) serve as individual Mosaic clients. These marks can push data processing (binning, hex binning, regression) and optimizations (such as M4 for line/area charts) down to the database. This package also provides interactors for linked selection, filtering, and highlighting using Mosaic Params and Selections.
36 changes: 18 additions & 18 deletions dev/index.html
@@ -85,20 +85,20 @@
<input id="query-log" type="checkbox" />
</div>
<div>
Query Cache:
Cache Queries:
<input id="cache" type="checkbox" checked />
</div>
<div>
Query Consolidation:
Consolidate Queries:
<input id="consolidate" type="checkbox" checked />
</div>
<div>
Data Cube Indexes:
<input id="index" type="checkbox" checked />
Pre-aggregate:
<input id="preagg" type="checkbox" checked />
</div>
<div>
Active Index State:
<button id="index-state">Log</button>
Pre-aggregate State:
<button id="preagg-state">Log</button>
</div>
</div>
</details>
@@ -115,32 +115,32 @@
const qlogToggle = document.querySelector('#query-log');
const cacheToggle = document.querySelector('#cache');
const consolidateToggle = document.querySelector('#consolidate');
const indexToggle = document.querySelector('#index');
const indexState = document.querySelector('#index-state');
const preaggToggle = document.querySelector('#preagg');
const preaggState = document.querySelector('#preagg-state');

connectorMenu.addEventListener('change', setConnector);
exampleMenu.addEventListener('change', reload);
sourceMenu.addEventListener('change', reload);
qlogToggle.addEventListener('input', setQueryLog);
cacheToggle.addEventListener('input', setCache);
consolidateToggle.addEventListener('input', setConsolidate);
indexToggle.addEventListener('input', setIndex);
indexState.addEventListener('click', () => {
const { indexes } = vg.coordinator().dataCubeIndexer || {};
if (indexes) {
preaggToggle.addEventListener('input', setPreAggregate);
preaggState.addEventListener('click', () => {
const { entries } = vg.coordinator().preaggregator || {};
if (entries) {
console.warn(
'Data Cube Index Entries',
Array.from(indexes.values())
'Pre-aggregate Entries',
Array.from(entries.values())
);
} else {
console.warn('No Active Data Cube Index');
console.warn('No Pre-aggregate Entries');
}
});

setQueryLog();
setCache();
setConsolidate();
setIndex();
setPreAggregate();
setConnector();

async function setConnector() {
@@ -160,8 +160,8 @@
vg.coordinator().manager.consolidate(consolidateToggle.checked);
}

function setIndex() {
vg.coordinator().dataCubeIndexer.enabled = indexToggle.checked;
function setPreAggregate() {
vg.coordinator().preaggregator.enabled = preaggToggle.checked;
}

function reload() {
6 changes: 3 additions & 3 deletions docs/api/core/client.md
@@ -19,11 +19,11 @@ Create a new client instance. If provided, the [Selection](./selection)-valued _
Property getter for the Selection that should filter this client.
The [coordinator](./coordinator) uses this property to provide automatic updates to the client upon selection changes.

## filterIndexable
## filterStable

`client.filterIndexable`
`client.filterStable`

Property getter for a Boolean value indicating if the client query can be safely indexed using a pre-aggregated data cube.
Property getter for a Boolean value indicating if the client query can be safely optimized using a pre-aggregated materialized view.
This property should return true if changes to the `filterBy` selection do not change the groupby (e.g., binning) values of the client query.

The `MosaicClient` base class will always return `true`.
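
As a rough illustration of the `filterStable` contract described above, the sketch below uses a stand-in base class rather than the real `MosaicClient`. The names `BaseClient`, `FixedBinHistogram`, and `AdaptiveBinHistogram` are hypothetical and exist only for this example; the single behavior taken from the documentation is that the base getter returns `true`, and that a client whose groupby values shift with the selection should return `false`.

```js
// Stand-in for the MosaicClient base class (hypothetical, illustration only).
class BaseClient {
  constructor(filterBy) {
    this._filterBy = filterBy;
  }

  // The Selection that should filter this client.
  get filterBy() {
    return this._filterBy;
  }

  // Per the documentation above, the base class always reports true.
  get filterStable() {
    return true;
  }
}

// Hypothetical client with fixed bin edges: its groupby values do not
// depend on the selection, so it inherits the default of true.
class FixedBinHistogram extends BaseClient {}

// Hypothetical client that re-bins to the extent of the filtered data.
// Its groupby values change with the selection, so it opts out.
class AdaptiveBinHistogram extends BaseClient {
  get filterStable() {
    return false;
  }
}

const selection = { name: 'brush' }; // placeholder for a Selection instance
const fixed = new FixedBinHistogram(selection);
const adaptive = new AdaptiveBinHistogram(selection);

console.log(fixed.filterStable, adaptive.filterStable);
```

Code that still overrides the old `filterIndexable` getter would need to be renamed to `filterStable` to match this commit.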
2 changes: 1 addition & 1 deletion docs/api/core/coordinator.md
@@ -20,7 +20,7 @@ Create a new Mosaic Coordinator to manage all database communication for clients
* _logger_: The logger to use, defaults to `console`.
* _cache_: Boolean flag to enable/disable query caching (default `true`).
* _consolidate_ Boolean flag to enable/disable query consolidation (default `true`).
* _indexes_: Data cube indexer options object. The _enabled_ flag (default `true`) determines if data cube indexes should be used when possible. The _schema_ option (default `'mosaic'`) indicates the database schema in which data cube index tables should be created.
* _preagg_: Pre-aggregation options object. The _enabled_ flag (default `true`) determines if pre-aggregation optimizations should be used when possible. The _schema_ option (default `'mosaic'`) indicates the database schema in which materialized view tables should be created for pre-aggregated data.
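
The shape of the renamed _preagg_ option can be sketched as a plain options object. The helper `resolvePreagg` and the schema name `'scratch'` are hypothetical and only show how the documented defaults (`enabled: true`, `schema: 'mosaic'`) might be filled in; how the Coordinator merges options internally is not shown in this commit.

```js
// Defaults as documented above: enabled = true, schema = 'mosaic'.
const PREAGG_DEFAULTS = { enabled: true, schema: 'mosaic' };

// Hypothetical helper (not part of Mosaic): fill in the documented defaults
// for any pre-aggregation option the caller leaves out.
function resolvePreagg(preagg = {}) {
  return { ...PREAGG_DEFAULTS, ...preagg };
}

// An options object using the new `preagg` key (formerly `indexes`).
const coordinatorOptions = {
  cache: true,
  consolidate: true,
  preagg: resolvePreagg({ schema: 'scratch' }) // 'scratch' is an example name
};

console.log(coordinatorOptions.preagg);
```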

## databaseConnector

2 changes: 1 addition & 1 deletion docs/core/index.md
@@ -37,7 +37,7 @@ Finally, clients may expose a `filterBy` Selection property. The predicates prov
The _coordinator_ is responsible for managing client data needs. Clients are registered via the coordinator `connect(client)` method, and similarly removed using `disconnect()`. Upon registration, the event lifecycle begins.
In addition to the `fields` and `query` calls described above, the coordinator checks if a client exposes a `filterBy` property, and if so, adds the client to a _filter group_: a set of clients that share the same `filterBy` selection.
Upon changes to this selection (e.g., due to interactions such as brushing or zooming), the coordinator collects updated queries for all corresponding clients, queries the data source, and updates clients in turn.
The Coordinator additionally performs optimizations including caching and data cube indexing.
The Coordinator additionally performs optimizations including caching and pre-aggregation.

[Coordinator API Reference](/api/core/coordinator)

2 changes: 1 addition & 1 deletion docs/examples/flights-10m.md
@@ -6,7 +6,7 @@
# Cross-Filter Flights (10M)

Histograms showing arrival delay, departure time, and distance flown for 10 million flights.
Once loaded, automatically-generated indexes enable efficient cross-filtered selections.
Once loaded, automatic pre-aggregation optimizations enable efficient cross-filtered selections.

_You may need to wait a few seconds for the dataset to load._

2 changes: 1 addition & 1 deletion docs/examples/linear-regression-10m.md
@@ -5,7 +5,7 @@

# Linear Regression 10M

A linear regression plot predicting flight arrival delay based on the time of departure, over 10 million flight records. Regression computation is performed in the database, with optimized selection updates using data cube indexes. The area around a regression line shows a 95% confidence interval. Select a region to view regression results for a data subset.
A linear regression plot predicting flight arrival delay based on the time of departure, over 10 million flight records. Regression computation is performed in the database, with optimized selection updates using pre-aggregated materialized views. The area around a regression line shows a 95% confidence interval. Select a region to view regression results for a data subset.

<Example spec="/specs/yaml/linear-regression-10m.yaml" />

2 changes: 1 addition & 1 deletion docs/public/specs/json/flights-10m.json
@@ -1,7 +1,7 @@
{
"meta": {
"title": "Cross-Filter Flights (10M)",
"description": "Histograms showing arrival delay, departure time, and distance flown for 10 million flights.\nOnce loaded, automatically-generated indexes enable efficient cross-filtered selections.\n\n_You may need to wait a few seconds for the dataset to load._\n"
"description": "Histograms showing arrival delay, departure time, and distance flown for 10 million flights.\nOnce loaded, automatic pre-aggregation optimizations enable efficient cross-filtered selections.\n\n_You may need to wait a few seconds for the dataset to load._\n"
},
"data": {
"flights10m": "SELECT GREATEST(-60, LEAST(ARR_DELAY, 180))::DOUBLE AS delay, DISTANCE AS distance, DEP_TIME AS time FROM 'https://idl.uw.edu/mosaic-datasets/data/flights-10m.parquet'"
2 changes: 1 addition & 1 deletion docs/public/specs/json/linear-regression-10m.json
@@ -1,7 +1,7 @@
{
"meta": {
"title": "Linear Regression 10M",
"description": "A linear regression plot predicting flight arrival delay based on the time of departure, over 10 million flight records. Regression computation is performed in the database, with optimized selection updates using data cube indexes. The area around a regression line shows a 95% confidence interval. Select a region to view regression results for a data subset.\n"
"description": "A linear regression plot predicting flight arrival delay based on the time of departure, over 10 million flight records. Regression computation is performed in the database, with optimized selection updates using pre-aggregated materialized views. The area around a regression line shows a 95% confidence interval. Select a region to view regression results for a data subset.\n"
},
"data": {
"flights10m": "SELECT GREATEST(-60, LEAST(ARR_DELAY, 180))::DOUBLE AS delay, DISTANCE AS distance, DEP_TIME AS time FROM 'https://idl.uw.edu/mosaic-datasets/data/flights-10m.parquet'"
2 changes: 1 addition & 1 deletion docs/public/specs/yaml/flights-10m.yaml
@@ -2,7 +2,7 @@ meta:
title: Cross-Filter Flights (10M)
description: |
Histograms showing arrival delay, departure time, and distance flown for 10 million flights.
Once loaded, automatically-generated indexes enable efficient cross-filtered selections.
Once loaded, automatic pre-aggregation optimizations enable efficient cross-filtered selections.
_You may need to wait a few seconds for the dataset to load._
data:
2 changes: 1 addition & 1 deletion docs/public/specs/yaml/linear-regression-10m.yaml
@@ -4,7 +4,7 @@ meta:
A linear regression plot predicting flight arrival delay based on
the time of departure, over 10 million flight records.
Regression computation is performed in the database, with optimized
selection updates using data cube indexes.
selection updates using pre-aggregated materialized views.
The area around a regression line shows a 95% confidence interval.
Select a region to view regression results for a data subset.
data:
2 changes: 1 addition & 1 deletion docs/what-is-mosaic/index.md
@@ -42,7 +42,7 @@ Next let's visualize over 200,000 flight records. The first histogram shows flig

<Example spec="/specs/yaml/crossfilter.yaml" />

When the selection changes we need to filter the data and recount the number of records in each bin. The Mosaic coordinator analyzes these queries and automatically optimizes updates by building indexes of pre-aggregated data ("data cubes") in the database, binned at the level of input pixels for the currently active view.
When the selection changes we need to filter the data and recount the number of records in each bin. The Mosaic coordinator analyzes these queries and automatically optimizes updates by building tables (["materialized views"](https://en.wikipedia.org/wiki/Materialized_view)) of pre-aggregated data in the database, binned at the level of input pixels for the currently active view.
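
To make the idea concrete, here is a minimal in-memory sketch of pixel-level pre-aggregation. It is not Mosaic's implementation, which builds tables in the database; the toy records, the constants `X_MAX`, `PIXELS`, and `Y_BIN`, and all function names are assumptions made for this example. Counts are grouped once by (brush pixel, histogram bin), after which each brush update only sums rows of the small pre-aggregated table instead of re-scanning the raw records.

```js
// Toy records: x is the brushed (active) dimension, y is the histogram dimension.
const records = [
  { x: 3, y: 10 }, { x: 12, y: 25 }, { x: 47, y: 10 },
  { x: 51, y: 31 }, { x: 88, y: 25 }, { x: 95, y: 39 }
];

const X_MAX = 100; // assumed data extent of x: [0, 100)
const PIXELS = 20; // assumed width of the active view, in pixels
const Y_BIN = 10;  // assumed histogram bin width for y

const pixelOf = x => Math.floor((x / X_MAX) * PIXELS);
const binOf = y => Math.floor(y / Y_BIN) * Y_BIN;

// Build once: counts grouped by (x pixel, y bin).
function preAggregate(rows) {
  const table = new Map();
  for (const { x, y } of rows) {
    const key = `${pixelOf(x)}|${binOf(y)}`;
    table.set(key, (table.get(key) || 0) + 1);
  }
  return table;
}

// Per interaction: sum pre-aggregated counts whose pixel lies inside the brush.
function queryPreAggregated(table, px0, px1) {
  const counts = new Map();
  for (const [key, n] of table) {
    const [px, bin] = key.split('|').map(Number);
    if (px >= px0 && px <= px1) {
      counts.set(bin, (counts.get(bin) || 0) + n);
    }
  }
  return counts;
}

// Baseline: re-scan the raw records for the same pixel-aligned brush.
function queryDirect(rows, px0, px1) {
  const counts = new Map();
  for (const { x, y } of rows) {
    const px = pixelOf(x);
    if (px >= px0 && px <= px1) {
      const bin = binOf(y);
      counts.set(bin, (counts.get(bin) || 0) + 1);
    }
  }
  return counts;
}

const table = preAggregate(records);
const brushed = queryPreAggregated(table, 2, 10); // brush spanning pixels 2 to 10
const direct = queryDirect(records, 2, 10);

console.log([...brushed].sort((a, b) => a[0] - b[0]));
```

Because the brush is resolved at pixel granularity, both query paths return the same counts; the pre-aggregated table simply has at most one row per (pixel, bin) pair regardless of how many raw records there are.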

While 200,000 points will stress many web-based visualization tools, Mosaic doesn't break a sweat. Now go ahead and try this with [10 million records](/examples/flights-10m)!

8 changes: 4 additions & 4 deletions docs/why-mosaic/index.md
@@ -129,8 +129,8 @@ DuckDB-WASM in the browser fares well, though is limited (compared to a DuckDB s
<div style="display: flex; flex-flow: row nowrap; justify-content: flex-start; align-items: flex-start;"><span style="display: inline-block; width: 35px;"></span><div class="legend"><div class="plot-why-swatches plot-why-swatches-wrap"><span class="plot-why-swatch"><svg width="15" height="15" fill="#e15759"><rect width="100%" height="100%"></rect></svg>Vega(-Lite)</span><span class="plot-why-swatch"><svg width="15" height="15" fill="#ff9da6"><rect width="100%" height="100%"></rect></svg>VegaFusion</span><span class="plot-why-swatch"><svg width="15" height="15" fill="#f28e2c"><rect width="100%" height="100%"></rect></svg>Observable Plot</span><span class="plot-why-swatch"><svg width="15" height="15" fill="#4e79a7"><rect width="100%" height="100%"></rect></svg>Mosaic WASM</span><span class="plot-why-swatch"><svg width="15" height="15" fill="#76b7b2"><rect width="100%" height="100%"></rect></svg>Mosaic Local</span></div></div></div>

When it comes to interaction, Mosaic really shines!
For many forms of aggregated data, the coordinator will automatically pre-aggregate data into smaller "data cube" indexes to support real-time interaction with billion+ element databases.
The figure below shows benchmark results for index-optimized interactive updates.
For many forms of aggregated data, the coordinator will automatically pre-aggregate data into smaller tables ("materialized views") to support real-time interaction with billion+ element databases.
The figure below shows benchmark results for optimized interactive updates.
Even with billions of rows, Mosaic with a server-side DuckDB instance maintains interactive response rates.

<svg xmlns="http://www.w3.org/2000/svg" class="plot-why" fill="currentColor" font-family="system-ui, sans-serif" font-size="10" text-anchor="middle" width="420" height="115" viewBox="0 0 420 115">
@@ -173,8 +173,8 @@ Even with billions of rows, Mosaic with a server-side DuckDB instance maintains
</svg>
<div style="display: flex; flex-flow: row nowrap; justify-content: flex-start; align-items: flex-start;"><span style="display: inline-block; width: 40px;"></span><div class="legend"><div class="plot-why-swatches plot-why-swatches-wrap"><span class="plot-why-swatch"><svg width="15" height="15" fill="#ff9da6"><rect width="100%" height="100%"></rect></svg>VegaFusion</span><span class="plot-why-swatch"><svg width="15" height="15" fill="#4e79a7"><rect width="100%" height="100%"></rect></svg>Mosaic WASM</span><span class="plot-why-swatch"><svg width="15" height="15" fill="#76b7b2"><rect width="100%" height="100%"></rect></svg>Mosaic Local</span><span class="plot-why-swatch"><svg width="15" height="15" fill="#59a14f"><rect width="100%" height="100%"></rect></svg>Mosaic Remote</span></div></div></div>

If not already present, Mosaic will create data cube index tables when the mouse cursor enters a view.
For very large data sets with longer data cube construction times, precomputation and server-side caching are supported.
If not already present, Mosaic will build pre-aggregated data tables when the mouse cursor enters a view.
For very large data sets with longer pre-aggregation times, precomputation and server-side caching are supported.

Other tasks, like changing a color encoding or adjusting a smoothing parameter, can be carried out quickly in the browser alone, including over aggregated data. Mosaic clients have the flexibility of choosing what works best.

2 changes: 1 addition & 1 deletion packages/core/README.md
@@ -1,5 +1,5 @@
# mosaic-core

The core Mosaic components: a central coordinator, parameters (`Param`) and selections (`Selection`) for linking scalar values or query predicates (respectively) across Mosaic clients, and filter groups with optimized index management. The Mosaic coordinator can send queries either over the network to a backing server (`socket` and `rest` clients) or to a client-side [DuckDB-WASM](https://github.com/duckdb/duckdb-wasm) instance (`wasm` client).
The core Mosaic components: a central coordinator, parameters (`Param`) and selections (`Selection`) for linking scalar values or query predicates (respectively) across Mosaic clients, and filter groups with materialized views of pre-aggregated data. The Mosaic coordinator can send queries either over the network to a backing server (`socket` and `rest` clients) or to a client-side [DuckDB-WASM](https://github.com/duckdb/duckdb-wasm) instance (`wasm` client).

The `mosaic-core` facilities are included as part of the [vgplot](https://github.com/uwdata/mosaic/tree/main/packages/vgplot) API.