
Add PubSub Notification System for Dataset Updates #227

Closed
not-Karot opened this issue Jun 12, 2023 · 5 comments

Comments

@not-Karot

Would be great to implement a Publish-Subscribe notification system (Apache Kafka) to enable users to subscribe to their datasets of interest. This would facilitate real-time notifications when the subscribed datasets are updated or when new data becomes available.

@TomAugspurger

Agreed! If you don't mind my asking a few questions:

  1. Are you specifically interested in getting notifications every time new data is added to a dataset? Would you be interested in setting some kind of filters to only receive notifications if some condition is met?
  2. What kind of information would you like in the notification? A STAC item? Anything else?
  3. What kind of delivery destinations would you like? HTTP webhooks, Kafka specifically, ...?

@not-Karot
Author

  1. Yes, notifications for new data additions are crucial. The minimum requirement would be to subscribe to a specific dataset using its catalog ID and to select an area of interest. Additional filters, such as cloud-cover filtering or the full set of filters already provided by the STAC search, would be appreciated but might not be essential in the initial phase. By the way, I don't know how new data is uploaded to the catalog internally, so more info or filters might be needed to make it work properly.
  2. It's essential that the notification include the STAC information of the new data, to bypass the need for a separate search (in the case that items have already been filtered when subscribing), enabling immediate access and analysis. If filters are not yet implemented in the subscription phase, consumers will need an efficient way to preview the newly added data and double-check that it fits their requirements (e.g. new data may be added, but the cloud-coverage percentage is higher than my threshold, so it isn't interesting for the use case). Information about lost or missing data would be appreciated as well.
  3. Kafka is the preferred delivery option due to its efficiency in handling streaming data. Additionally, HTTP webhooks and custom triggers would be valuable alternatives for more flexibility in integration and data consumption.

Just to present the scenario better: right now, to check for new data you need to run two searches and compare the results. A nice added feature would be a more efficient way to be notified.
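For context, the two-search workaround boils down to diffing result sets by item ID. A minimal sketch (using plain dicts in place of real STAC Items, which a client such as pystac-client would return; the data here is entirely hypothetical):

```python
def find_new_items(previous, current):
    """Return items in the current search results that were absent
    from the previous ones, keyed by STAC item id."""
    seen = {item["id"] for item in previous}
    return [item for item in current if item["id"] not in seen]

# Hypothetical results of two searches run at different times.
previous = [{"id": "S2A_20230601"}, {"id": "S2A_20230605"}]
current = previous + [{"id": "S2A_20230610"}]

new_items = find_new_items(previous, current)
```

A push-based notification would replace this polling-and-diffing loop entirely.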

@TomAugspurger

> and to select an area of interest.

That was the main thing I was interested in getting feedback on. Glad to hear that you consider it important :)

> Kafka is a preferred delivery option due to its efficiency in handling streaming data.

Also good to know.


Right now, the way we're thinking about this is that users would register a STAC search with a hypothetical notifications endpoint. The registration would include both the search criteria and some endpoint to deliver the notification to when a new item matches the search criteria.

Whenever a new STAC item is ingested, we'd check if it matches your search criteria. If it does, we'd deliver the STAC item (plus a bit of extra context, like the ID of the search that it matched) as an event to your system.

This is all just in discussion stage at the moment, but feel free to add additional suggestions here. I'll update this thread if / when we get this implemented.
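The register-a-search-then-match-on-ingest flow described above could be sketched roughly as follows. This is not the Planetary Computer API, just an illustration of the idea; the `Subscription` shape, field names, and bbox-only matching are all assumptions (a real implementation would evaluate the full STAC search criteria):

```python
from dataclasses import dataclass

@dataclass
class Subscription:
    search_id: str   # ID of the registered search, echoed back in events
    collection: str  # dataset / catalog ID to watch
    bbox: tuple      # (west, south, east, north) area of interest
    endpoint: str    # where matching items get delivered

def bboxes_intersect(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def matches(sub, item):
    """Check a newly ingested item against a registered search."""
    return (item["collection"] == sub.collection
            and bboxes_intersect(sub.bbox, item["bbox"]))

def build_event(sub, item):
    # Deliver the STAC item plus a bit of context: the search it matched.
    return {"search_id": sub.search_id, "item": item}

sub = Subscription("sub-001", "sentinel-2-l2a",
                   (10.0, 44.0, 12.0, 46.0), "https://example.com/hook")
item = {"id": "S2B_x", "collection": "sentinel-2-l2a",
        "bbox": (11.0, 45.0, 11.5, 45.5)}
event = build_event(sub, item) if matches(sub, item) else None
```

On ingest, each new item would be checked against all registered subscriptions, and an event posted to each matching subscription's endpoint.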

@not-Karot
Author

Thank you for sharing the approach that's currently under discussion.

Registering the STAC search with notifications sounds like the ultimate solution for me, but I have some concerns/questions about registering an endpoint:

  • Utilizing an endpoint could create coupling between the systems. If the communication is synchronous, there could be potential delays or bottlenecks, especially with high volumes of data. Asynchronous communication might alleviate this issue, but it would still require careful consideration for scalability and fault tolerance. What happens if I want to add more workers listening to the topic?
  • Additionally, in a scenario with a large number of notifications, the burden of scaling requests and handling distribution would fall on the user. With a Kafka-based system, these aspects would be more transparent to the user.

Of course, users could potentially add an extra layer themselves by implementing a publisher behind the endpoint and managing distribution on their own, which might cover more use cases on the Planetary Computer side. However, I initially envisioned a slightly different approach, and I'm open to further discussion and willing to explore various possibilities. My main aim is to ensure an efficient, scalable solution that benefits the community.
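To make the "more workers listening to the topic" point concrete: in Kafka, messages are keyed (e.g. by collection), keys hash to a fixed set of partitions, and a consumer group rebalance divides those partitions among however many workers join, so scaling is transparent to the subscriber. A toy simulation of that mechanism (using CRC32 as a stand-in for Kafka's murmur2 partitioner; partition count and round-robin assignment are illustrative assumptions):

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    """Map a message key to a partition, as a Kafka producer would.
    CRC32 stands in for Kafka's murmur2-based default partitioner."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def assign_partitions(num_workers: int):
    """Divide partitions among workers round-robin, roughly what a
    consumer group rebalance does when workers join or leave."""
    return {w: [p for p in range(NUM_PARTITIONS) if p % num_workers == w]
            for w in range(num_workers)}
```

Adding a worker just triggers a new assignment; no subscriber-side routing code changes, which is the transparency argument above.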

@ghidalgo3

Closed due to inactivity, feel free to reopen if you would like to continue this discussion.
