Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle BigQuery partitions when event_timestamp is not the partition column, deal with event_timestamp columns of DATE type #2530

Open
rory-nia opened this issue Apr 13, 2022 · 1 comment
Labels
Community Contribution Needed We want community to contribute good first issue Good for newcomers keep-open kind/feature New feature or request priority/p1

Comments

@rory-nia
Copy link

Is your feature request related to a problem? Please describe.
At a high level, I'd like to be able to install Feast, describe my sources and FVs, and immediately start working to get historical data and materialize features. However, by not being partition-aware and by not accounting for non-timestamp date columns, a large number of our tables are not usable without modification.

This feature request is specific to the GCP provider and BigQuery offline store. It has two parts: 1) relating to respecting partition when constructing date ranges in SQL, 2) relating to source tables with "timestamp" column not of TIMESTAMP type.

  1. In any table were the partition column is not the same column as the event_timestamp, Feast will ignore partitions and do a full table scan on each query. For large tables, this causes BQ to fail due to "insufficient resources" - not a slot/quote issue, but having to do with needing to marshal too much transient data. A common example with BQ is when doing DAY partitioning using ingestion time, the magic column _PARTITIONTIME is the partition column. Each record will contain its own actual event_timestamp column, but this column is not the partition column.

  2. For tables with an "event_timestamp" that is not a TIMESTAMP type, the SQL can break when there's a type mismatch, e.g. when the column is a DATE type. A common use case is a DAY partitioned summary table that performs aggregations over the day or possibly just takes the final value for each entity/date. These tables have a DATE column on each record. In this case, the DATE column is usually the partition column, but it is not the correct format for the date conditions in the Feast template SQL. I would prefer not to have to create a new column or a view on each table just to make Feast work.

Describe the solution you'd like
I'd like to be able to specify information about the partition column such that Feast could formulate SQL optimized to my table definition taking advantage of partitioning. This information would probably need to include the partition column name, type, and partition type at a minimum. I believe optimized SQL would not be much different from the existing queries but would require, in cases where the partition column is not the same as the event_timestamp column, to add an additional range condition using the partition column.

I'd also like to be able to indicate that my "event_timestamp" column is not a TIMESTAMP type - and have Feast adjust the SQL accordingly. Specifically, I'd like to specify in the Feast BigQuerySource definition that the "event_timestamp" column is the partition column and is of DATE type and have Feast make the corresponding adjustments to the templatized SQL.

Describe alternatives you've considered
To deal with (1), the only alternatives I can think of are a) creating a view that changes the partition column to the event_timestamp column for every table, b) creating a duplicate of each table using the event_timestamp column.

To deal with (2), similarly, either a) create a view with a new timestamp typed column to use as the event_timestamp, b) add a new timestamp column to the existing tables.

All these solutions require a good amount of work and/or creating duplicate tables/views. I would really like Feast to just do the right thing after I create a Source.

Additional context
Happy to provide any further details if the above is not clear.

@stale
Copy link

stale bot commented Sep 22, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 22, 2022
@stale stale bot closed this as completed Oct 1, 2022
@adchia adchia reopened this Oct 1, 2022
@stale stale bot removed the wontfix This will not be worked on label Oct 1, 2022
@adchia adchia added wontfix This will not be worked on keep-open and removed wontfix This will not be worked on labels Oct 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Contribution Needed We want community to contribute good first issue Good for newcomers keep-open kind/feature New feature or request priority/p1
Projects
Status: Done
Development

No branches or pull requests

3 participants