On partition filters, parameters and use cases. #421

agorajek · 2023-03-29T19:00:31Z

agorajek
Mar 29, 2023
Maintainer

My goal here is to get us thinking about the ideas in the title and how useful or necessary they are in the early days of DJ. I considered adding my thoughts to #407, since the two are related, but I wanted to highlight the filter/parameter side of things here.

Use cases:

Full or incremental (any cadence) materialization of nodes.
Custom materialization with partition filters.
Querying of nodes with partition filters.

Let's consider a transform node:

node:
  type: transform
  name: t1
  sql: ...

  schema:
    user_id: int
    group_id: int
    date_int: int
    ...

  partitions:
    group_id:
      temporal: false
      required: true
    date_int:
      temporal: true
  
  materialization:
    schedule: daily|hourly|weekly|etc  # or cron spec
    partitions:
      # partition_column: DJ query or expression
      date_int: CAST(FORMAT(NOW(), "YYYYMMDD") AS INT)
      group_id: SELECT some_column FROM some_node WHERE ... # or simply `node.column`

  availability_state:
    vtts: ...
    min_partition: ...
    max_partition: ...
    partitions:
      # keyed on a list of all required partitions
      ["12345"]: (<vtts>, <min_partition>, <max_partition>) 
      ["54321"]: (<vtts>, <min_partition>, <max_partition>)
      ...

The above model has all we need to build materialization jobs and to require the query builder to provide (or ask the user for) values for the required partition filters.
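To illustrate the last point, here is a minimal Python sketch of how a query builder might detect required partition filters that were not supplied. `PartitionSpec` and `missing_required_filters` are hypothetical names made up for this example, not part of DJ; they only mirror the `partitions:` section of the node above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionSpec:
    """Hypothetical mirror of one entry under `partitions:` in the node spec."""
    temporal: bool = False
    required: bool = False


def missing_required_filters(partitions, filters):
    """Return the required partition columns that have no filter value."""
    return sorted(
        name
        for name, spec in partitions.items()
        if spec.required and filters.get(name) is None
    )


# Partitions of the example node `t1` above.
t1_partitions = {
    "group_id": PartitionSpec(temporal=False, required=True),
    "date_int": PartitionSpec(temporal=True),
}

# Filtering only on date_int leaves the required group_id unset.
print(missing_required_filters(t1_partitions, {"date_int": 20230329}))
# Supplying group_id satisfies every required partition.
print(missing_required_filters(t1_partitions, {"group_id": 12345}))
```

A non-empty result is the point where the query builder would either reject the query or ask the user for the missing values.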

I could imagine an additional parameters section at the node level that is not tied to the partition columns in any way, but this seems like a nice-to-have option that can be added later on (if at all at this level).

  parameters:
    <parameter-name>:
      type: int

Some related questions:

What other use cases should be considered here?
A partitioning strategy is usually common across all or most tables in a given system. It would be good to provide global default settings for partitions, materialization, and other aspects of the nodes. This would be not only a convenience feature but also a good standardization strategy.
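
One way the global-defaults idea could work, sketched in Python: node-level settings are overlaid on system-wide defaults, so a node only spells out what differs. The `apply_defaults` helper and the `GLOBAL_DEFAULTS` contents are assumptions for illustration; nothing like this is specified for DJ yet.

```python
from copy import deepcopy


def apply_defaults(defaults, node):
    """Recursively overlay node settings on top of global defaults."""
    merged = deepcopy(defaults)
    for key, value in node.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = apply_defaults(merged[key], value)
        else:
            # Scalars and lists on the node replace the default outright.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical system-wide defaults.
GLOBAL_DEFAULTS = {
    "partitions": {"date_int": {"temporal": True}},
    "materialization": {"schedule": "daily"},
}

# The node only declares what is specific to it.
node = {
    "name": "t1",
    "partitions": {"group_id": {"temporal": False, "required": True}},
    "materialization": {"schedule": "hourly"},
}

resolved = apply_defaults(GLOBAL_DEFAULTS, node)
print(sorted(resolved["partitions"]))
print(resolved["materialization"]["schedule"])
```

Here the node inherits the `date_int` partition from the defaults and overrides only the schedule.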

agorajek · 2023-05-19T17:55:09Z

agorajek
May 19, 2023
Maintainer Author

After some discussion with @shangyian this morning we concluded that (in the short term) we should focus on the materialization use case wrt parameters and partitions. The use case of live querying data is separate and we can address that later.

We concluded that:

Parameters should only be defined on partition columns.
Initially, we should only allow simple parameter/partition conditions (value, list, range), which the user would specify when scheduling a materialization call. For example:

  materialization:
    schedule: "@daily"
    partitions:
      date_int: 
        value: CAST(FORMAT(NOW(), "YYYYMMDD") AS INT) # must evaluate to <same type as date_int>
        list: None
        range: (None, None)
      group_id: 
        value: None
        list: [123, 234, 345, 456] # must evaluate to ARRAY[<same type as group_id>]
        range: (None, None)
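
A minimal Python sketch of the three condition forms, to make the shape concrete. The `PartitionCondition` class is hypothetical. Two things here are assumptions that the discussion does not settle: that exactly one of value/list/range may be set per partition, and that range bounds are inclusive.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class PartitionCondition:
    """Hypothetical model of one partition entry in a materialization config."""
    value: Optional[object] = None
    values: Optional[Sequence[object]] = None  # `list` in the config above
    bounds: Tuple[Optional[object], Optional[object]] = (None, None)  # `range`

    def kind(self):
        """Return which form is set. Assumption: exactly one must be set."""
        candidates = (
            ("value", self.value is not None),
            ("list", self.values is not None),
            ("range", any(bound is not None for bound in self.bounds)),
        )
        set_forms = [name for name, is_set in candidates if is_set]
        if len(set_forms) != 1:
            raise ValueError(f"expected exactly one of value/list/range, got {set_forms}")
        return set_forms[0]

    def to_predicate(self, column):
        """Render the condition as a SQL-like filter on `column`."""
        kind = self.kind()
        if kind == "value":
            return f"{column} = {self.value}"
        if kind == "list":
            return f"{column} IN ({', '.join(str(v) for v in self.values)})"
        low, high = self.bounds
        parts = []
        if low is not None:
            parts.append(f"{column} >= {low}")  # assumption: inclusive lower bound
        if high is not None:
            parts.append(f"{column} <= {high}")  # assumption: inclusive upper bound
        return " AND ".join(parts)


print(PartitionCondition(values=[123, 234, 345, 456]).to_predicate("group_id"))
print(PartitionCondition(bounds=(20230501, 20230519)).to_predicate("date_int"))
```

The predicates are built by plain string formatting only to keep the sketch short; a real implementation would go through the query builder rather than concatenating SQL.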

A few more thoughts on this:

This makes me think that the materialization schedule could potentially be defined with a list of specs, not just one spec. And we can make sure (down the road) during the scheduling API call that the values are non-overlapping for the non-temporal partitions.
This pattern should also allow running a backfill-in-place and keeping track of it, whether it was created by a user or by the Materialization Service (note that backfills-on-the-side are always doable anyway).
Since materialization params are tied to partitions, we don't have the problem of tracking multiple tables.
About Druid... we already consider generating cubes as a step-after-materialization-in-iceberg (correct?) so with that in mind we could add a special (or similar) availability state on the Cube nodes to represent the sets of partitions/dimensions materialized and copied to Druid. Let's talk about it...
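
For the first thought above (several specs per node, with non-overlapping values for the non-temporal partitions), the check at scheduling time could look roughly like this Python sketch. `overlapping_specs` and the spec names are hypothetical, and the sketch only covers the `list` form of a condition.

```python
from itertools import combinations


def overlapping_specs(specs, column):
    """Return (spec_a, spec_b, shared_values) for specs whose lists intersect."""
    conflicts = []
    for name_a, name_b in combinations(sorted(specs), 2):
        shared = set(specs[name_a][column]) & set(specs[name_b][column])
        if shared:
            conflicts.append((name_a, name_b, sorted(shared)))
    return conflicts


# Three hypothetical materialization specs for the same node,
# each listing the group_id values it is responsible for.
specs = {
    "daily_a": {"group_id": [123, 234]},
    "daily_b": {"group_id": [345, 456]},
    "daily_c": {"group_id": [234, 999]},
}

print(overlapping_specs(specs, "group_id"))
```

An empty result would mean the specs can be scheduled side by side; here `daily_a` and `daily_c` both claim `234`, so the scheduling call could reject the request.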


shangyian · 2023-05-23T07:50:12Z

shangyian
May 23, 2023
Collaborator

> Parameters should only be defined on partition columns.

@agorajek do you think this statement can be restricted further, with parameters only being defined on a single partition column rather than potentially multiple columns?

A few other things we talked about yesterday (just jotting this down so that we don't forget):

Materialization configs should have names, with a sensible default provided by DJ based on the config's schedule.
Having names allows us to let users update their materialization configs. This is more sensible than expecting them to delete and recreate.
The materialization service needs to be able to generate the URL (or some id) of the scheduled job that's associated with a particular materialization config + node revision. This allows users to see the status of the current scheduled job.
When a node that already has materialization configs is updated, the logic should be as follows:
- If a node changes but its query does not change, we'll create a new node revision and just copy over the old materialization configs without creating any new scheduled jobs.
- If a node's query changes, we'll create a new node revision, copy over the old materialization configs, and reschedule new jobs using the Materialization Service, based on the new query.
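
The two update rules above can be sketched in Python as follows. `NodeRevision`, `update_node`, and the `reschedule` callback (standing in for a call to the Materialization Service) are hypothetical names for illustration, not DJ's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRevision:
    """Hypothetical, heavily simplified node revision."""
    version: int
    query: str
    description: str = ""
    materialization_configs: tuple = ()  # config names only, for brevity


def update_node(current, reschedule, query=None, description=None):
    """Create a new revision; reschedule jobs only if the query changed."""
    new_query = current.query if query is None else query
    new_description = current.description if description is None else description
    revision = NodeRevision(
        version=current.version + 1,
        query=new_query,
        description=new_description,
        # Old configs are always copied over to the new revision.
        materialization_configs=current.materialization_configs,
    )
    if new_query != current.query:
        for config_name in revision.materialization_configs:
            reschedule(revision, config_name)
    return revision


scheduled = []


def record(revision, config_name):
    """Stand-in for the Materialization Service: just record the request."""
    scheduled.append((revision.version, config_name))


v1 = NodeRevision(version=1, query="SELECT 1", materialization_configs=("daily",))
v2 = update_node(v1, record, description="docs only")  # query unchanged
v3 = update_node(v2, record, query="SELECT 2")         # query changed
print(scheduled)
```

Only the second update triggers a reschedule, and both new revisions keep the `daily` config.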


agorajek · 2023-05-23T15:33:55Z

agorajek
May 23, 2023
Maintainer Author

> with parameters only being defined on a single partition column rather than potentially multiple columns?

@shangyian yes, I think so. Moreover, I think we may not need to define any new parameter attributes (#fingers-crossed) and can simply use the partitions as the guardrails of the materialization. In the current world of partitioned tables, the two are really equivalent. And in the future world of row-level table modifications, we can simply treat partitions as parameters.

> Materialization configs should have names, with a sensible default provided by DJ based on the config's schedule.
> Having names allows us to let users update their materialization configs. This is more sensible than expecting them to delete and recreate. The materialization service needs to be able to generate the URL (or some id) of the scheduled job that's associated with a particular materialization config + node revision. This allows users to see the status of the current scheduled job.

^ +1

> When a node that already has materialization configs is updated, the logic should be as follows:
>
> If a node changes but its query does not change, we'll create a new node revision and just copy over the old materialization configs without creating any new scheduled jobs.

^ +1

> If a node's query changes, we'll create a new node revision, copy over the old materialization configs, and reschedule new jobs using the Materialization Service, based on the new query.

I think that eventually we'll need to support two modes:

large table / DE mode: when the query changes, DJ provides a shadow copy of the changed node, lets the shadow materialization catch up with the current node, and at some point lets the owner flip the switch between them. This is for cases where the risk of having any holes in the materialization is too big, or where the new data still needs to be vetted. (This is a nice project on its own.)
small table / AE mode: what you described above, with no risk of downstream nodes or readers seeing the new data with holes.

