Add Cassandra Store #360
Conversation
Hi @smadarasmi. Thanks for your PR. I'm waiting for a gojek member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
References #335. We're tracking BEAM-6911 to upgrade Beam's CassandraIO to the latest generation of the driver, which drops the Guava dependency altogether 🎉 It also has a new DAO API which I think may open the door to enhancing CassandraIO in ways that could avoid the hardcoding. Assuming it gets done, that'd come along with a future Beam release, so Feast in turn can reap the benefits later. Using Guava 26 in Feast may run a risk of problems integrating with Beam, so the downgrade here might be prudent anyway. This is obviously a big PR. We're happy to discuss any questions further by whatever medium is efficient.
@smadarasmi @ches super impressed! Please give us a little bit of time to process it. We are currently working on project namespacing. Would it make sense to do the review of this PR soonish, but only merge it in after project namespacing? I'd feel a little bit more comfortable with you updating the Cassandra code than us. Which reminds me, we can add you as codeowners for the parts of the codebase that you created. That would allow you to make changes without having to wait for our approval.
Ok, this is great. I wonder if we can nudge anyone from the Beam team to look at this sooner.
/ok-to-test
@woop Do you know why the checks fail, but when I open the link it shows all tests passing? I don't have permission to view the build logs.
/retest
@smadarasmi yeah, the failures are not propagating properly; in the end-to-end tests, core fails to start up. My guess is that you are missing a …
I have created a branch, v0.3-branch, from which we will continue to cut patches. We will officially stop doing that from master. @ches, since you indicated that you will be staying on 0.3 for a while, can I suggest that we target this Cassandra release to the v0.3 branch? The rest of the folks will continue on master for async job management and project namespacing. I am updating the merge target; please let me know if you have concerns. If time permits, we can try to get Cassandra updated to the changes for 0.4, but I don't really want to hold up 0.4 for this yet. I think we can always bring it in later as a patch (not ideal) or in 0.5. By the way, @voonhous will be spiking this PR to see if it's functional and get it merged in.
@zhilingc Seems like they are passing now; let me know if you still think there are issues that need to be fixed. @woop Yes, I think we can target this for 0.3 for now. Updating this for 0.4 shouldn't be much work at all: since the Cassandra key currently consists of the concatenated feature set name (and version) plus the entities, I think we just need to add the project namespace to the key.
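For illustration, the key scheme described above (feature set name and version concatenated with the entity values, with a project namespace to be prepended for 0.4) could be sketched roughly as follows. This is not the PR's actual implementation; the delimiters, field order, and all names here are hypothetical:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the Cassandra row key described above: feature set
// name and version concatenated with the entity values. Entity names are
// sorted so the key is deterministic regardless of input order; a project
// namespace could later be prepended as one more segment.
class CassandraKeySketch {
    static String key(String featureSet, int version, Map<String, String> entities) {
        StringBuilder sb = new StringBuilder(featureSet).append(':').append(version);
        for (Map.Entry<String, String> e : new TreeMap<>(entities).entrySet()) {
            sb.append(':').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

Sorting the entity names (via `TreeMap`) is one way to keep the key stable when callers supply entities in different orders.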
for (FeatureSetSpec spec : specs.values()) {
  String featureSetName = spec.getName() + ":" + spec.getVersion();
  if (spec.getMaxAge() != null && spec.getMaxAge().getSeconds() > 0) {
    maxAges.put(featureSetName, Math.toIntExact(spec.getMaxAge().getSeconds()));
  }
}
Just one thing: even though it's a field in the feature set, our intention is for maxAge to be a query-time concept (one that can change depending on the user's use case) rather than an ingestion-time one, so I'm not entirely sure that using it as the TTL is the best choice.
It makes sense that maxAge is a query-time concept, where each team using the same features may have different expiration values.
How would you propose we do this? Redis currently sets TTL to null (no TTL). Do you have any thoughts or work in progress regarding this?
> It makes sense about maxAge being a query time concept, where each team using the same features may have different expiration values.
> How would you propose we do this? Redis currently sets TTL to null (no TTL). Do you have any thoughts or work in progress regarding this?
Seems like two options:
- Don't expire (this will accumulate bad state over time)
- Expire, but force users to reingest
Is it possible to leave this up to whoever sets up the infra, or does it have to be defined as part of schemas?
We configure our Redis deployments as LRU caches in lieu of a TTL, but afaik there isn't such an option for Cassandra :/ There isn't really an alternative here, I guess. It shouldn't be a problem as long as it's documented that the feature set's default max age is used as the TTL when using this store.
I'll think about where to add the documentation for this.
I'm not sure it's a good idea to leave the TTL to whoever sets up the infra. Some features, such as embeddings, may not want expiration for a long time.
I've added some documentation in Store.proto. I think it makes sense to put it there, as users of the Cassandra store would check the proto file to see the required and supported configurations anyway.
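The TTL fallback agreed on above (the feature set's maxAge when present, otherwise a store-level default expiration from CassandraConfig) can be sketched roughly like this; all names are hypothetical, not the PR's actual signatures:

```java
import java.time.Duration;

// Hypothetical sketch of the TTL selection discussed above: prefer the
// feature set's maxAge when it is set, otherwise fall back to a default
// expiration configured on the store. In Cassandra, a TTL of 0 means the
// row never expires.
class TtlSketch {
    static int ttlSeconds(Duration maxAge, int defaultExpirySeconds) {
        if (maxAge != null && maxAge.getSeconds() > 0) {
            return Math.toIntExact(maxAge.getSeconds());
        }
        return defaultExpirySeconds;
    }
}
```

Keeping the fallback in one place makes the "feature set max age wins, store default otherwise" rule easy to document alongside Store.proto.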
/retest
ingestion/src/test/java/feast/ingestion/util/CassandraStoreUtilIT.java
* Downgraded Guava to 25
  * Beam 2.16 uses Cassandra 3.4.0, so we cannot use Cassandra 4.x, which shades Guava.
  * Cassandra 3.4.0 uses Guava 16.0 but has a compatibility check that uses a different class when the Guava version is > 19.0.
  * Guava 26 (the version previously used) has a breaking change to the method used in that compatibility check, hence version 25.
* Using Cassandra's internal `writetime` field to handle out-of-order arrivals: when older records whose primary key already exists in Cassandra are ingested, they are set as tombstones in Cassandra and ignored on retrieval.
  * Aware that this way of handling out-of-order arrivals is specific to Cassandra, but until we have a general way to handle them we need to do it this way.
* Cassandra's object mapper requires stating the table name along with the `@Table` annotation.
  * `table_name` is still part of CassandraConfig for use in the serving module.
  * If a user registers CassandraConfig with a table name other than "feature_store", this will throw an exception.
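The `writetime`-based handling of out-of-order arrivals amounts to last-write-wins keyed on the event timestamp (in CQL terms, writing with an explicit `USING TIMESTAMP`). The following in-memory simulation, not actual driver code, illustrates the effect; note that real Cassandra breaks ties on equal timestamps by comparing cell values, which this sketch does not model:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory illustration (not driver code) of the last-write-wins behaviour
// the PR relies on: each write carries the record's event timestamp, and a
// late-arriving record with an older timestamp never replaces a newer one.
class LastWriteWinsSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> writeTimes = new HashMap<>();

    void write(String key, String value, long eventTimestamp) {
        Long current = writeTimes.get(key);
        if (current == null || eventTimestamp > current) { // strictly newer wins
            writeTimes.put(key, eventTimestamp);
            values.put(key, value);
        } // an older arrival is effectively ignored on retrieval
    }

    String read(String key) {
        return values.get(key);
    }
}
```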
* Abstracted OnlineServingService for a common implementation of online serving stores.
  * Complete tests remain in RedisServingServiceTest, while the Cassandra tests only contain basic tests for writes and some implementation details specific to Cassandra.
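The abstraction could look roughly like the template-method sketch below (all names are hypothetical, not the PR's actual signatures): shared serving logic lives in the abstract base class, and each store only implements the raw lookup, which is why the complete tests against the base-class behaviour need to live in only one subclass's suite.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the OnlineServingService abstraction: common
// serving logic in the abstract base, store-specific retrieval in the
// subclasses (Redis, Cassandra, or here an in-memory stub for demonstration).
abstract class OnlineServingServiceSketch {
    // Store-specific lookup of a serialized feature row by key.
    protected abstract Optional<String> retrieve(String key);

    // Shared logic: fall back to a default when the key is absent.
    String getFeature(String key, String defaultValue) {
        return retrieve(key).orElse(defaultValue);
    }
}

class InMemoryServingService extends OnlineServingServiceSketch {
    private final Map<String, String> data = new HashMap<>();

    void put(String key, String value) { data.put(key, value); }

    @Override
    protected Optional<String> retrieve(String key) {
        return Optional.ofNullable(data.get(key));
    }
}
```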
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: smadarasmi, woop. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
* create cassandra store for registration and ingestion
* Downgraded Guava to 25
  * Beam 2.16 uses Cassandra 3.4.0, so we cannot use Cassandra 4.x, which shades Guava.
  * Cassandra 3.4.0 uses Guava 16.0 but has a compatibility check that uses a different class when the Guava version is > 19.0.
  * Guava 26 (the version previously used) has a breaking change to the method used in that compatibility check, hence version 25.
* Using Cassandra's internal `writetime` field to handle out-of-order arrivals: when older records whose primary key already exists in Cassandra are ingested, they are set as tombstones in Cassandra and ignored on retrieval.
  * Aware that this way of handling out-of-order arrivals is specific to Cassandra, but until we have a general way to handle them we need to do it this way.
* Cassandra's object mapper requires stating the table name along with the `@Table` annotation.
  * `table_name` is still part of CassandraConfig for use in the serving module.
  * If a user registers CassandraConfig with a table name other than "feature_store", this will throw an exception.
* add cassandra serving service
* Abstracted OnlineServingService for a common implementation of online serving stores.
  * Complete tests remain in RedisServingServiceTest, while the Cassandra tests only contain basic tests for writes and some implementation details specific to Cassandra.
* update documentation to reflect current API and add cassandra store to docs
* add default expiration to cassandra config for when a feature set does not have a max age
* docs update, spotless check, and bug fix on cassandra schema
Notes/Comments:
* The Redis implementation will have complete tests, but Cassandra will not, as those should already be covered in Redis since both extend `OnlineServingService`.
* `Session` and `ResultSet` will be a pain.

/cc @ches