Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running feast materialize-incremental for an end date far in future breaks incremental materializations up to that date #4222

Open
samhallam-reverb opened this issue May 23, 2024 · 9 comments

Comments

@samhallam-reverb
Copy link

Expected Behavior

When running feast materialize-incremental it should not be possible for a materialize to run starting from a date that hasn't occurred yet. Would expect materialize-incremental to choose min(now, most_recent_materialization_date)

Current Behavior

When running feast materialize-incremental for a future date, say 2025-04-08T00:00:00, it will set that date as the next start date. This breaks all future materialize-incremental commands up until that date, if you run daily materializations they will no longer function until the actual date goes beyond this erroneously entered start date. Because the data is stored as a serialized protobuf in the SQL-based registry this is also non-trivial to change.

You can see code here showing that there is no validation checks for the start date, merely the latest date that has ever run.

Steps to reproduce

Run a materialize-incremental command for a date far in the future.

Run a second materialize-incremental command for a date less than the last start date. Watch Feast always find no date to materialize.

Specifications

  • Version: 0.34.1
  • Platform: Mac OSX
  • Subsystem: Sonoma 14.5

Possible Solution

Refactor the above function to choose min(now, most_recent_materialization_date). Happy to contribute to this fix.

@shuchu
Copy link
Collaborator

shuchu commented May 24, 2024

@samhallam-reverb did you try the 0.37.0 version?

also, welcome to help us fix this issue since you almost know the fix.

@tokoko
Copy link
Collaborator

tokoko commented May 24, 2024

I'm not sure about this... Currently materialize commands only take into account values that are in event_timestamp column, never a clock time. Bringing now into this equation means we'll have to handle timezone differences between the two. Also, isn't there a possibility someone might want to prepopulate values into the future in the offline store, maybe you're calculating monthly feature values that you want to take effect from the start of the next month, idk...

To me, this feels like something that needs to be handled by the user, the user passes the end date after all and can always cap it with the current time beforehand.

@samhallam-reverb
Copy link
Author

@shuchu Confirmed this happens in both 0.37.0 and 0.38.0.

@tokoko Understand the concerns completely, both on clock time and the possible use case for future-dated features.

The main issue with the current setup is there isn't an easy way to override the start date if you did typo the previous end date, which is how we found the issue.

Currently, if you accidentally materialize for say, a year ahead, all of your future materializations with correct end dates can't work. There's no real way to solve this other than blowing away the registry and starting again AFAICT.

Wouldn't necessarily have to compare with clock time, could simply check that start_date is not less than end_date for the materialization, there's no situation where a materialization could run in that scenario. In that case you could prompt for a new start_date value, or just generally support it as an optional arg to materialize-incremental.

@tokoko
Copy link
Collaborator

tokoko commented May 25, 2024

@samhallam-reverb Sure, looks like it always looks for a max value in it's history of materialization timestamps, so it seems impossible to be rectified by normal means. Registry API also only supports apply_materialization that adds to the list of timestamp pairs, never removes them.

tbh, I'm not sure I got what you meant in the last paragraph. Unless you either look at a clock time or peek into the data, it will be hard to come up with any sort of logic and I'm not sure what that logic should be either.

What if we do away with storing materialization_intervals history in the registry altogether? I'll have to double check, but pretty sure it's only used for querying the max end_date timestamp that has occurred, so we might as well switch to storing just that. In that case, apply_materialization would simply overwrite the erroneous value and allow for the problem to be solved.

P.S. There is another issue somewhere where a user did a very frequent materialize-incremental calls (once an hour, i think) and ended up bloating the registry as every materialize call added to the Feature View proto size in the registry. Switching to storing a single value would be a fix for that issue as well.

@shuchu
Copy link
Collaborator

shuchu commented May 26, 2024

@samhallam-reverb it seems 0.37 and 0.38 have the same behavior as 0.34

@samhallam-reverb
Copy link
Author

@tokoko That makes sense to me, I don't know the context as to why the current schema was designed but it also seems that storing the materialization runs as individual rows would solve the problem too.

Happy to help contribute to either of those solutions if they seem acceptable.

@tokoko
Copy link
Collaborator

tokoko commented Jun 1, 2024

After thinking about it some more, I'm between these 2 approaches:

  • As I outlined above, we mostly keep the current workflow, but switch from storing materialization_intervals to storing just a single timestamp designating that last ingested value. This fixes the issue in this ticket and also protects us from registry objects growing huge when a user runs materialize-incremental frequently.
  • We can also scrap materialization info from the registry altogether and instead move the responsibility for storing those values to OnlineStore API (add 2 new methods for storing and retrieving last_materialized timestamp). One benefit of this is that registries will become less critical as long as you have object definitions intact in the git repo. Even if you nuke the registry, you can always rebuild with a feast apply. Another big benefit is that this will simplify supporting multiple online stores at the same time as the materialization information will be owned by individual online stores themselves. The obvious downside is that it's breaking pluggable OnlineStore API, but imho that's bound to happen before we reach 1.0 release, so now is the best time to do it.

@franciscojavierarceo @HaoXuAI @jeremyary wdyt?

@franciscojavierarceo
Copy link
Member

@tokoko I think keeping the current registry is the right thing as that metadata is important for reproducibility. I'd rather have to deal with the complexities of implementing timezones. From reviewing the code, the start date and end date are already timezone aware.

        for feature_view in feature_views_to_materialize:
            start_date = feature_view.most_recent_end_time
            ....
            start_date = utils.make_tzaware(start_date)
            end_date = utils.make_tzaware(end_date)

From the feature_store.py file

@franciscojavierarceo
Copy link
Member

And it looks like it already uses datetime.utcnow() in the ttl.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants