Ability to provide a custom feature_store.yaml during CLI operations #1556

Closed
MattDelac opened this issue May 12, 2021 · 11 comments
Labels
keep-open, kind/feature (New feature or request)

Comments

@MattDelac
Collaborator

Is your feature request related to a problem? Please describe.
We often want to run feast apply (or other CLI operations) on different GCP projects.

Therefore it would be nice if we could point to a specific "feature_store.yaml" when we use the CLI

Describe the solution you'd like
Something easy like feast apply --conf feature_store_prod.yaml. By default --conf would be feature_store.yaml
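
A minimal sketch of the requested behaviour, using Python's standard `argparse` rather than Feast's actual CLI code. The program name `feast-sketch` and the helper names are hypothetical; only the `--conf` flag and its `feature_store.yaml` default come from the request above.

```python
import argparse

# Default taken from the request: --conf falls back to feature_store.yaml.
DEFAULT_CONFIG = "feature_store.yaml"


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in parser; this is not how Feast builds its CLI.
    parser = argparse.ArgumentParser(prog="feast-sketch")
    parser.add_argument("command", choices=["apply"])
    parser.add_argument(
        "--conf",
        default=DEFAULT_CONFIG,
        help="Path to the feature store configuration file.",
    )
    return parser


def parse(argv):
    """Parse a list of CLI arguments and return the resulting namespace."""
    return build_parser().parse_args(argv)
```

With this sketch, `parse(["apply"])` keeps the default file, while `parse(["apply", "--conf", "feature_store_prod.yaml"])` selects the production config.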

Describe alternatives you've considered
Copying a specific yaml to feature_store.yaml when we need to perform CLI operations on different environments

Additional context
Add any other context or screenshots about the feature request here.

@woop
Member

woop commented May 12, 2021

Hey @MattDelac

Do you think #1509 (with separate repositories) would address this problem?

@MattDelac
Collaborator Author

> Hey @MattDelac
>
> Do you think #1509 (with separate repositories) would address this problem?

I don't think so as you are staying in the same repository.

It's just that we would like the flexibility to have multiple configurations

conf/
conf/feature_store_adhoc.yaml
conf/feature_store_prod.yaml
conf/feature_store_local.yaml

For example, this would be useful to let the user confirm that their new FeatureView is properly applied by doing feast apply --conf conf/feature_store_local.yaml

@woop
Member

woop commented May 25, 2021

@MattDelac Some options that I can imagine

Option 1: One repo, one config

This is what we have today. The idea is that the feature_store.yaml anchors a configuration repository. It tells us where to scan for all feature definitions within the repository (and what the root folder is). If you have multiple environments then the idea is that you will have something like

.
├── prod
│   ├── feature_store.yaml
│   └── my_feature_def.py
└── staging
    ├── feature_store.yaml
    └── my_feature_def.py

but I think it's possible to do

.
├── common
│   └── my_feature_def.py
├── prod
│   ├── feature_store.yaml
│   └── my_feature_def.py
└── staging
    ├── feature_store.yaml
    └── my_feature_def.py

so your prod and staging would pull definitions from common or another python package.

Option 2: One repo, many configs

Alternatively, we could make it possible to specify a remote configuration file. My main concern with that is that it could be unintuitive how it would function. Would we still consider it to be the root of a feature repo?

When I see a command like feast apply --conf conf/feature_store_local.yaml then I don't think there is anything special about the --conf file. But in reality, that conf file location is important since we will use its location to scan for feature definition files.
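
The concern can be made concrete with a small sketch. It assumes, as described above, that the directory containing the config file is treated as the repository root unless something else is given. The function names and the optional `repo_root` parameter are hypothetical and are not Feast's API.

```python
from pathlib import Path
from typing import List, Optional


def resolve_repo_root(conf: Path, repo_root: Optional[Path] = None) -> Path:
    """Return the directory that would be scanned for feature definitions."""
    if repo_root is not None:
        # An explicit root decouples the scan location from the config location.
        return repo_root
    # Implicit behaviour: the config file anchors the repository.
    return conf.parent


def definition_files(root: Path) -> List[Path]:
    """List Python files under the root, sorted for a stable order."""
    return sorted(root.rglob("*.py"))
```

Under these assumptions, `conf/feature_store_local.yaml` would make `conf/` the root, so definitions stored in a sibling folder would not be found unless the root is passed separately.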

@shihgianlee

We tried to organize our code with common which is suggested in Option 1. Personally, I like to group the relevant modules under a package/folder, i.e. prod, staging. We have GCP projects created for dev, qa and prod. In our repo, we have dev, qa and prod that point to the corresponding GCP projects. From CLI, we should be able to execute feast apply -c dev/.

I may not have a good understanding of the problem statement. What benefit does one repo with multiple feature store definitions give us if we can structure our repo to match GCP projects?

@MattDelac
Collaborator Author

> We tried to organize our code with common which is suggested in Option 1. Personally, I like to group the relevant modules under a package/folder, i.e. prod, staging. We have GCP projects created for dev, qa and prod. In our repo, we have dev, qa and prod that point to the corresponding GCP projects. From CLI, we should be able to execute feast apply -c dev/.

Same thing on our side!

We basically have

.
├── config
│   ├── feature_store_production.yaml
│   └── feature_store_dev.yaml
└── features
    ├── my_feature_def.py
    └── my_feature_def_2.py

Then once we merge a new PR, our CD tool is going to spin two jobs that basically do

update_registry_in_prod:
  run:
    - cp config/feature_store_production.yaml ./feature_store.yaml
    - feast apply

update_registry_in_dev:
  run:
    - cp config/feature_store_dev.yaml ./feature_store.yaml
    - feast apply

That's where I should be able to not copy the files and directly do feast apply -c config/feature_store_production.yaml
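
For reference, the copy-then-apply workaround can be sketched as a small Python helper. This is an illustration of the two CD steps, not part of Feast: `apply_with_config` and the injectable `run` parameter are hypothetical names, and the only external command used is `feast apply`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def _run(cmd: Sequence[str], cwd: Path) -> None:
    # Default runner: execute the command and fail loudly on a non-zero exit.
    subprocess.run(list(cmd), cwd=cwd, check=True)


def apply_with_config(
    config: Path,
    repo: Path,
    run: Callable[[Sequence[str], Path], None] = _run,
) -> None:
    """Copy the chosen config to <repo>/feature_store.yaml, then apply."""
    shutil.copyfile(config, repo / "feature_store.yaml")
    run(["feast", "apply"], repo)
```

The runner is a parameter so the helper can be exercised without Feast installed. The copy step is exactly what a dedicated config option would make unnecessary.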

Also to give you more details, in our code we change the GCP project of our table_ref based on the config (if it's prod or development)

We have something like

table_ref: str = f"{get_bigquery_project()}.{BIGQUERY_SCHEMA}.{entity}_{feature_view}"

So the two registries (prod & dev) do not contain exactly the same information (as the table_ref will be different)
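
A minimal sketch of how such a helper might look. The thread does not show how `get_bigquery_project()` is implemented, so everything here is an assumption: the `FEAST_ENV` environment variable, the project ids, and the dataset name are hypothetical placeholders.

```python
import os

# Hypothetical dataset name; the real BIGQUERY_SCHEMA value is not shown above.
BIGQUERY_SCHEMA = "feast"

# Hypothetical mapping from environment name to GCP project id.
_PROJECTS = {"prod": "my-company-prod", "dev": "my-company-dev"}


def get_bigquery_project() -> str:
    # Assumption: the environment is selected with an environment variable.
    env = os.environ.get("FEAST_ENV", "dev")
    return _PROJECTS[env]


def table_ref(entity: str, feature_view: str) -> str:
    """Build the fully qualified table reference for the active environment."""
    return f"{get_bigquery_project()}.{BIGQUERY_SCHEMA}.{entity}_{feature_view}"
```

Because the project is resolved when the definition module is evaluated, applying the same repository in two environments writes different table references into each registry.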

@woop
Member

woop commented Jul 5, 2021

So more tangibly @MattDelac, are you suggesting that all parameterization should happen in the feature_store.yaml and that you would only have a single feature repo, and that feature repo would then have conditional logic based on this configuration?

I'm just trying to figure out what the most natural approach is here for users.

@MattDelac
Collaborator Author

> So more tangibly @MattDelac, are you suggesting that all parameterization should happen in the feature_store.yaml and that you would only have a single feature repo, and that feature repo would then have conditional logic based on this configuration?

Yes

@woop
Member

woop commented Jul 5, 2021

Digression warning

One of the things I have been thinking about a lot is the philosophy behind Black. The idea is basically that we should stop thinking about formatting and just let a tool handle it. The reason I think this may apply to Feast is because we could also let Feast take a more opinionated approach to managing a feature repository.

Let's take feature inferencing for instance. Today, you have something like

driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=timedelta(days=1),
    input=driver_hourly_stats
)

after which you should run

feast apply

which infers features and adds them to the registry. The repo itself is generalized and lightweight. At first glance this sounds great, but I have been thinking about whether this is actually a good practice. How does a user constrain the schema of a feature view? They should add specific features to the features argument, but then why don't we follow the same approach for inferencing? I think it may make more sense to do something like

feast discover

which infers schemas for defined feature views and updates them in the repository like

driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="conv_rate", dtype=ValueType.FLOAT),
        Feature(name="acc_rate", dtype=ValueType.FLOAT),
        Feature(name="avg_daily_trips", dtype=ValueType.INT64),
    ],
    input=driver_hourly_stats
)

A benefit of this approach is that we can version control all schema changes in git, and we have a consistent way to define features (all of it is in the repo, as opposed to some in the repo and some of them inferred).
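
To illustrate the idea only: `feast discover` is a proposal in this comment, not an existing command, and the sketch below does not use Feast at all. It assumes the source schema is available as a column-to-type mapping and that the timestamp column is named `event_timestamp`. Both are hypothetical, as are the function names.

```python
from typing import Dict, List, Sequence, Tuple


def infer_features(
    columns: Dict[str, str],
    entities: Sequence[str],
    timestamp_column: str,
) -> List[Tuple[str, str]]:
    """Keep every column that is neither an entity key nor the timestamp."""
    skip = set(entities) | {timestamp_column}
    return [(name, dtype) for name, dtype in columns.items() if name not in skip]


def render(features: List[Tuple[str, str]]) -> str:
    """Render the inferred features as lines to write back into the repo."""
    return "\n".join(
        f'Feature(name="{name}", dtype=ValueType.{dtype}),'
        for name, dtype in features
    )


# Hypothetical source schema for the driver_hourly_stats example.
SCHEMA = {
    "driver_id": "INT64",
    "event_timestamp": "UNIX_TIMESTAMP",
    "conv_rate": "FLOAT",
    "acc_rate": "FLOAT",
    "avg_daily_trips": "INT64",
}
```

Calling `render(infer_features(SCHEMA, ["driver_id"], "event_timestamp"))` produces the three `Feature(...)` lines from the example, which could then be committed and diffed like any other source change.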

How does this relate to this particular issue? Well if we have a single repo then the user probably has conditional logic within their FeatureView, meaning Feast will probably have trouble updating/adding the FeatureView in the repo. Also, if we go the single repo (or folder) route, then it's not possible to easily diff different commits and see the changes over time, either inside a single environment (prod/staging) or across environments.

Don't feel too strongly, but just some things on my mind.

@MattDelac
Collaborator Author

> One of the things I have been thinking about a lot is the philosophy behind Black. The idea is basically that we should stop thinking about formatting and just let a tool handle it. The reason I think this may apply to Feast is because we could also let Feast take a more opinionated approach to managing a feature repository.

I am clearly not against a more opinionated approach. It might be hard though, as Feast is trying to be a tool that lets users connect an OfflineStore and an OnlineStore (through a Provider).
The big difference with a linter is that, for a linter, it makes sense to have one config for a given repo. Feast is here to help manage data, thus I believe that the flexibility of different environments (prod, adhoc, staging, etc.) is very important.

> which infers schemas for defined feature views and updates them in the repository like

Oh, I see what you mean here, and it might be a good approach. The problem (at least for us) is that our FS repo is also our source of truth about which features are published and which are not. Moreover, we add extra information like

  • Team owning a FeatureView
  • Description of a Feature

I don't know if we could easily infer the description of the Features with an OfflineStore other than BigQuery (e.g. Presto). Even if all OfflineStores support it, it means that it's the responsibility of the upstream pipeline to properly document a FeatureView. This will be harder to enforce as we would need to create this logic in all of our upstream tools.

> Well if we have a single repo then the user probably has conditional logic within their FeatureView, meaning Feast will probably have trouble updating/adding the FeatureView in the repo

I mean it depends on how we can save metadata. It sounds like adding tags to FeatureViews gives a lot of flexibility to the user. This gives them the creativity to "tweak" Feast to make it work in a specific environment (each company is different). Keeping track of those tags (or another form of metadata) should be trivial, I believe, and is key.
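
As a rough illustration of tracking such metadata, the sketch below checks that a tag dictionary carries the two pieces of information mentioned earlier (owning team and description). The tag keys and the function name are hypothetical conventions, not something Feast defines.

```python
from typing import Dict, List

# Hypothetical tag keys a team might require on every FeatureView.
REQUIRED_TAGS = ("team", "description")


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Return the required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]
```

A check like this could run in CI before the registry is updated, so that metadata stays in the repo rather than depending on upstream pipelines.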

> Also, if we go the single repo (or folder) route, then it's not possible to easily diff different commits and see the changes over time.

I don't understand what you mean here

> Don't feel too strongly, but just some things on my mind.

Same on my side. I really enjoy this chat as it helps me think out of the box 🙂

@stale

stale bot commented Nov 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Nov 6, 2021
stale bot closed this as completed on Nov 14, 2021
woop reopened this on Nov 17, 2021
stale bot removed the wontfix label on Nov 17, 2021
woop added the keep-open label on Nov 17, 2021
adchia added the kind/feature label on Jan 7, 2022
@achals
Member

achals commented Aug 26, 2022

Closed by #3077

@achals achals closed this as completed Aug 26, 2022