Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add offline_store config #1552

Merged
merged 5 commits into from
May 19, 2021

Conversation

tsotnet
Copy link
Collaborator

@tsotnet tsotnet commented May 10, 2021

What this PR does / why we need it: This PR adds offline_store config to feature_store.yaml. Specifically, it makes it possible to override temporary entity dataframe bigquery dataset name.

Here is an example yaml file:

project: master_kitten
registry: data/registry.db
provider: gcp
offline_store:
    type: bigquery
    dataset: foo

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

Add offline_store config to feature_store.yaml

Signed-off-by: Tsotne Tabidze <tsotne@tecton.ai>
Copy link
Collaborator

@jklegar jklegar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern here is that the local provider works with both local and BQ offline stores, as does the GCP provider, so having the offline_store type option might be confusing to users especially since it doesn't affect interaction with the offline store. Not sure if that's already been discussed, and not sure what the better alternative would be though. Thoughts?

@woop
Copy link
Member

woop commented May 13, 2021

oes the GCP provider, so having the offline_store type option might be confusing to users especially since it doesn't affect interaction with the offline store. Not sure if that's already been discussed, and not sure what the better alternative would be though. Thoughts?

Ultimately it seems like we have to support decoupling storage (offline, online) from the provider if we want to have reuse, even if the provider still has to opt-in (or can override) the storage implementation. So at first glance, it seems natural to have the offline_store object to me.

since it doesn't affect interaction with the offline store

What is the "it" here, the "offline_store" object, or the "provider" option?

@jparthasarthy
Copy link
Collaborator

+1 to Willem's question, I am not sure what the "it" here is. This seems like a fine change to our API, but I might be missing something here:

  1. You choose a provider. Providers opt into supporting different offline stores
  2. You choose an offline store with the type parameter. Right now, there is only one offline store, BQ. There should be one default offline store per provider.
  3. Additional configs to the offline store go in additional keyword parameters.

@tsotnet, is there a default offline store type for each provider? I just want to check, but I don't think users should have to specify the type field in the happy path case. There is only one right now after all.

@jklegar
Copy link
Collaborator

jklegar commented May 14, 2021

"it" here is the offline_store type - users can set the type to be bigquery and then use file sources or both file and bigquery sources, or vice versa

@tsotnet
Copy link
Collaborator Author

tsotnet commented May 14, 2021

@tsotnet, is there a default offline store type for each provider? I just want to check, but I don't think users should have to specify the type field in the happy path case. There is only one right now after all.

@jay there is a default offline store type. As you'd expect FileOfflineStore for local provider and BigqueryOfflineStore for gcp provider.

"it" here is the offline_store type - users can set the type to be bigquery and then use file sources or both file and bigquery sources, or vice versa

@jklegar There are 2 directions we can go from here:

  1. During feast apply validate that all data sources are of the expected type.
  2. Enable multiple offline stores to be used in one GCP project. In this case we'll need to make offline_store field a list instead of struct in feature_store.yaml. So for example:
project: master_kitten
registry: data/registry.db
provider: gcp
offline_store:
    - type: bigquery
      entity_dataset_name: foo
    - type: local
      imaginary_config: imaginary_value

I think in a long term we should support multiple offline stores (and even multiple online stores). But it's not as simple. Currently we partially support this: we can have different data sources in a project, but when getting a historical data all requested features need to be from the same offline store. I recommend a middle ground: enable defining multiple offline store configs, but the behavior of getting a historical data stays the same for now.

Thoughts?

Tsotne Tabidze added 3 commits May 18, 2021 16:04
Signed-off-by: Tsotne Tabidze <tsotne@tecton.ai>
…to dataset

Signed-off-by: Tsotne Tabidze <tsotne@tecton.ai>
Signed-off-by: Tsotne Tabidze <tsotne@tecton.ai>
@codecov-commenter
Copy link

codecov-commenter commented May 19, 2021

Codecov Report

Merging #1552 (27128b7) into master (f050976) will decrease coverage by 0.12%.
The diff coverage is 87.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1552      +/-   ##
==========================================
- Coverage   83.71%   83.59%   -0.13%     
==========================================
  Files          65       65              
  Lines        5632     5680      +48     
==========================================
+ Hits         4715     4748      +33     
- Misses        917      932      +15     
Flag Coverage Δ
integrationtests 83.51% <87.91%> (-0.13%) ⬇️
unittests 77.32% <76.47%> (-1.34%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdk/python/feast/infra/provider.py 83.62% <ø> (-0.14%) ⬇️
sdk/python/tests/test_cli_local.py 100.00% <ø> (ø)
sdk/python/tests/test_e2e_local.py 100.00% <ø> (ø)
sdk/python/feast/repo_operations.py 30.10% <25.00%> (-0.12%) ⬇️
sdk/python/feast/infra/offline_stores/helpers.py 75.00% <60.00%> (-16.67%) ⬇️
sdk/python/feast/errors.py 62.85% <66.66%> (+0.35%) ⬆️
sdk/python/feast/repo_config.py 95.14% <93.47%> (-2.08%) ⬇️
sdk/python/feast/infra/gcp.py 82.25% <100.00%> (+1.34%) ⬆️
sdk/python/feast/infra/local.py 96.87% <100.00%> (-0.10%) ⬇️
sdk/python/feast/infra/offline_stores/bigquery.py 92.19% <100.00%> (+0.05%) ⬆️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f050976...27128b7. Read the comment docs.

Signed-off-by: Tsotne Tabidze <tsotne@tecton.ai>
@woop woop added the kind/feature New feature or request label May 19, 2021
@woop
Copy link
Member

woop commented May 19, 2021

/lgtm

@feast-ci-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tsotnet, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit f0c8521 into feast-dev:master May 19, 2021
woop pushed a commit that referenced this pull request May 19, 2021
* Add offline_store config

Signed-off-by: Tsotne Tabidze <tsotne@tecton.ai>

* Enforce offline_store during feast apply, rename entity_dataset_name to dataset

Signed-off-by: Tsotne Tabidze <tsotne@tecton.ai>

* Remove ugly getattr since it's unnecessary anymore

Signed-off-by: Tsotne Tabidze <tsotne@tecton.ai>

* Rename Bigquery to BigQuery

Signed-off-by: Tsotne Tabidze <tsotne@tecton.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants