
Feast BigQuery offline store gets wrong argument types for event_timestamp #2532

Closed
KarolisKont opened this issue Apr 13, 2022 · 5 comments

@KarolisKont

Expected Behavior

Training data is retrieved from the BigQuery offline store by using the get_historical_features method from the Python SDK.
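For reference, the retrieval call looks roughly like this (the repo path and entity rows are placeholders; the feature refs match the traceback below):

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # placeholder: path to the feature repo

# Placeholder entity rows; item_user_id is the entity join key.
entity_df = pd.DataFrame(
    {
        "item_user_id": [111, 222],
        "event_timestamp": pd.to_datetime(["2022-04-01", "2022-04-08"]),
    }
)

training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "item_user_stats_v1:total_items_sold",
        "item_user_stats_v1:is_high_risk_seller",
    ],
).to_df()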

Current Behavior

I get this error when running get_historical_features:

Traceback (most recent call last):
  File "get_offline_data.py", line 22, in <module>
    "item_user_stats_v1:is_high_risk_seller",
  File "/usr/local/lib/python3.7/site-packages/feast/infra/offline_stores/offline_store.py", line 77, in to_df
    features_df = self._to_df_internal()
  File "/usr/local/lib/python3.7/site-packages/feast/infra/offline_stores/bigquery.py", line 290, in _to_df_internal
    df = self._execute_query(query).to_dataframe(create_bqstorage_client=True)
  File "/usr/local/lib/python3.7/site-packages/feast/usage.py", line 280, in wrapper
    raise exc.with_traceback(traceback)
  File "/usr/local/lib/python3.7/site-packages/feast/usage.py", line 269, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/feast/infra/offline_stores/bigquery.py", line 357, in _execute_query
    block_until_done(client=self.client, bq_job=bq_job, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/feast/infra/offline_stores/bigquery.py", line 415, in block_until_done
    raise bq_job.exception()
google.api_core.exceptions.BadRequest: 400 No matching signature for operator <= for argument types: TIMESTAMP, DATETIME. Supported signature: ANY <= ANY at [78:13]

The error happens in the generated query where the source table's event_timestamp (a TIMESTAMP) is compared against the uploaded entity dataframe's entity_timestamp, which BigQuery has typed as DATETIME (last line):

item_user_stats_v1__base AS (
   SELECT
       subquery.*,
       entity_dataframe.entity_timestamp,
       entity_dataframe.item_user_stats_v1__entity_row_unique_id
   FROM item_user_stats_v1__subquery AS subquery
   INNER JOIN item_user_stats_v1__entity_dataframe AS entity_dataframe
   ON TRUE
       AND subquery.event_timestamp <= entity_dataframe.entity_timestamp
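If the cause is the client library's pandas type mapping, it can be demonstrated outside Feast. The sketch below (with a placeholder table ID, not from the issue) loads a timezone-naive datetime column; on google-cloud-bigquery 3.x this appears to produce a DATETIME column, where 2.x produced TIMESTAMP:

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

# A timezone-naive datetime64[ns] column.
df = pd.DataFrame({"entity_timestamp": pd.to_datetime(["2022-04-01"])})

# Placeholder destination; the dataset must already exist. With 3.x the naive
# column seems to load as DATETIME; with 2.x the same column loaded as TIMESTAMP.
job = client.load_table_from_dataframe(df, "your-project.your_dataset.tmp_entity_df")
job.result()
print(client.get_table("your-project.your_dataset.tmp_entity_df").schema)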

Steps to reproduce

Use a Feast project with the GCP provider, where the offline store is BigQuery. After installing the feast[gcp] (v0.19.4) package you should get the google-cloud-bigquery==3.0.1 dependency. I think it should also break with newer versions of google-cloud-bigquery when they become available.

Specifications

  • Version: v0.19.4
  • Platform: GCP
  • Subsystem:

Possible Solution

Quick solution: add an upper bound of <3.0.0 here.

A better solution: find what is causing this event timestamp type mismatch and fix it 😊.

@achals (Member) commented Apr 13, 2022

@KarolisKont could you also share your feature view definition and the schema of the offline store bigquery table?

@KarolisKont (Author)

> @KarolisKont could you also share your feature view definition and the schema of the offline store bigquery table?

Of course I can, @achals, but there's one issue: this BigQuery table has 109 fields 😅. We don't use all of them in the feature view. Would it be enough for you to see only the fields that are used in the feature view?

Also, I perhaps forgot to mention that when I dropped the google-cloud-bigquery version below 3.0.0, the Python SDK returns offline data and the jobs in BigQuery complete without errors.
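If the root cause is a changed naive-datetime mapping in 3.x, an alternative workaround might be to make the entity dataframe's timestamp column timezone-aware before calling get_historical_features (a sketch, under that assumption):

import pandas as pd

# Placeholder entity dataframe; item_user_id is the entity join key.
entity_df = pd.DataFrame(
    {
        "item_user_id": [111, 222],
        "event_timestamp": pd.to_datetime(["2022-04-01", "2022-04-08"]),
    }
)

# A tz-aware (datetime64[ns, UTC]) column should load as TIMESTAMP rather than
# DATETIME, avoiding the TIMESTAMP <= DATETIME comparison error.
entity_df["event_timestamp"] = entity_df["event_timestamp"].dt.tz_localize("UTC")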

@achals (Member) commented Apr 13, 2022

That's very interesting. Yeah, I just need the fields used in the feature view. Can you also provide the output of the table schema after getting the table using the Python API? See this code snippet for an example: https://googleapis.dev/python/bigquery/latest/usage/tables.html#getting-a-table
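Following that doc, something like this (the table ID is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table ID; use the table backing the feature view's batch source.
table = client.get_table("your-project.your_dataset.item_user_stats_v1")
print(f"Table schema: {table.schema}")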

@KarolisKont (Author)

Sorry for the late response 😊, but here is the info.
The feature view definition:

from datetime import timedelta

from feast import BigQuerySource, Feature, FeatureView, ValueType

from marketplace_sandbox.config import BQ_POC_RERANKER_TABLE_REFERENCE
from marketplace_sandbox.entities import item_user
from marketplace_sandbox.utils import generate_data_bq_query

FEATURE_VIEW_NAME = "item_user_stats_v1"

entities = [item_user]

features = [
    Feature(
        name="total_items_sold",
        dtype=ValueType.INT32,
        labels={"owner": "VMIP"},
    ),
    Feature(
        name="is_high_risk_seller",
        dtype=ValueType.INT32,
        labels={"owner": "VMIP"},
    ),
]

batch_source = BigQuerySource(
    name=f"{BQ_POC_RERANKER_TABLE_REFERENCE}.{FEATURE_VIEW_NAME}",
    query=generate_data_bq_query(
        features=features,
        entities=entities,
        tables=[BQ_POC_RERANKER_TABLE_REFERENCE],
    ),
    event_timestamp_column="event_time",
)

item_user_stats_v1_fv = FeatureView(
    name=FEATURE_VIEW_NAME,
    entities=[entity.name for entity in entities],
    ttl=timedelta(weeks=4),
    features=features,
    batch_source=batch_source,
    owner="VMIP",
)

The entity definition used in this feature view:

from feast import Entity, ValueType

item_user = Entity(
    name="item_user",
    join_key="item_user_id",
    value_type=ValueType.INT64,
    owner="VMIP",
)

The helper function used in this feature view to construct the query:

from typing import Iterable

from feast import Entity, Feature


def generate_data_bq_query(
    features: Iterable[Feature],
    entities: Iterable[Entity],
    tables: Iterable[str],
) -> str:
    """BigQuery SQL query to execute to generate data for feature view.

    Args:
        features (Iterable[Feature]): all used features in feature view.
        entities (Iterable[Entity]): all used entities in feature view.
        tables (Iterable[str]): all BigQuery tables that contain used features
            and entities.

    Returns:
        str: BigQuery SQL query.
    """
    return (
        f"SELECT event_time, {', '.join(f.name for f in features)}, "
        f"{', '.join(entity.join_key for entity in entities)} "
        f"FROM `{', '.join(table for table in tables)}`"
    )
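With the definitions above, the helper produces a query along these lines (assuming BQ_POC_RERANKER_TABLE_REFERENCE is a single table reference):

# generate_data_bq_query(features=features, entities=entities,
#                        tables=[BQ_POC_RERANKER_TABLE_REFERENCE])
# -> SELECT event_time, total_items_sold, is_high_risk_seller, item_user_id
#    FROM `<BQ_POC_RERANKER_TABLE_REFERENCE>`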

The feature view's table schema from the BigQuery UI:

Field name | Type | Mode
item_user_id | INTEGER | NULLABLE
total_items_sold | INTEGER | NULLABLE
is_high_risk_seller | INTEGER | NULLABLE
event_time | TIMESTAMP | NULLABLE

The output of the table schema using the Python API (google-cloud-bigquery==3.0.1):

Table schema: [SchemaField('item_user_id', 'INTEGER', 'NULLABLE', None, (), None), SchemaField('event_time', 'TIMESTAMP', 'NULLABLE', None, (), None), SchemaField('total_items_sold', 'INTEGER', 'NULLABLE', None, (), None), SchemaField('is_high_risk_seller', 'INTEGER', 'NULLABLE', None, (), None)]

The output of the table schema using the Python API (google-cloud-bigquery==2.34.2):

Table schema: [SchemaField('item_user_id', 'INTEGER', 'NULLABLE', None, (), None), SchemaField('event_time', 'TIMESTAMP', 'NULLABLE', None, (), None), SchemaField('total_items_sold', 'INTEGER', 'NULLABLE', None, (), None), SchemaField('is_high_risk_seller', 'INTEGER', 'NULLABLE', None, (), None)]

@KarolisKont (Author)

OK, it seems that my proposed solution was valid for this particular issue. Thanks @achals 🙇
