Materialize rehydration of Kafka/Kinesis from S3? #9154
-
Example scenario: I have data stored in Kafka or Kinesis for only 7 days before it is automatically purged, but I also copy all messages into S3 for archival in close to real time. Would I be able to tell Materialize that I want to read from Kafka/Kinesis while it can, but if it were to restart, backfill the >7-day-old messages from S3 without pulling all of the current info from S3? Specifically, I want to avoid joining in S3 data that is less than 7 days old, or data we had already gotten from Kinesis/Kafka (i.e. older than 7 days, but we have not restarted, so we still have the relevant Kafka/Kinesis data in memory). Put another way, I only want to rehydrate from S3 if we restart. If this is not clear I apologize, let me know how I can clarify!
-
Hi there,

This should be doable with a `UNION ALL` statement where you would filter out the latest records from the S3 bucket and only pull the S3 data that is not available in Kinesis/Kafka. The `SELECT` statement would look something like this:

```sql
SELECT * FROM s3_source
WHERE event_timestamp < (SELECT min(event_timestamp) FROM kinesis_source)
UNION ALL
SELECT * FROM kinesis_source
```

Here we use `UNION ALL` to union the S3 data with all currently available records from Kinesis. That way, even if you restart, the historical data will be pulled from the S3 bucket, but the latest records will come from Kinesis.

Here is a quite useful case study on the same topic: https://github.com/dbanalyticsco/materialize-backfill

I believe that we are planning to post a tutorial based on the above case study. Hope that this helps! Let me know if you have any questions.
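For context, here is a minimal sketch of how that query might be wrapped in a materialized view so downstream queries see one continuous relation. The source names `s3_source` and `kinesis_source` and the `event_timestamp` column are carried over from the example above; the view name `combined_events` is hypothetical, and this assumes both sources expose a comparable `event_timestamp` column.

```sql
-- Minimal sketch (assumptions noted above): a single view serving both
-- historical and live data. s3_source and kinesis_source are assumed to
-- already exist as Materialize sources.
CREATE MATERIALIZED VIEW combined_events AS
    SELECT * FROM s3_source
    -- Only take S3 rows older than anything the stream still retains,
    -- so the archive never duplicates rows Kinesis can still supply.
    WHERE event_timestamp < (SELECT min(event_timestamp) FROM kinesis_source)
    UNION ALL
    SELECT * FROM kinesis_source;
```

On a fresh start the Kinesis source only holds the last 7 days, so the `WHERE` clause lets the S3 side fill in everything older; once running, the subquery's minimum keeps the two halves from overlapping.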