
Add PubSub Notification System for Dataset Updates #227

Closed
not-Karot opened this issue Jun 12, 2023 · 5 comments

Comments

@not-Karot

Would be great to implement a Publish-Subscribe notification system (Apache Kafka) to enable users to subscribe to their datasets of interest. This would facilitate real-time notifications when the subscribed datasets are updated or when new data becomes available.

@TomAugspurger

Agreed! If you don't mind my asking a few questions:

  1. Are you specifically interested in getting notifications every time new data is added to a dataset? Would you be interested in setting some kind of filters to only receive notifications if some condition is met?
  2. What kind of information would you like in the notification? A STAC item? Anything else?
  3. What kind of delivery destinations would you like? HTTP webhooks, Kafka specifically, ...?

@not-Karot
Author

  1. Yes, notifications for new data additions are crucial. The minimum requirement would be to subscribe to a specific dataset using its catalog ID and to select an area of interest. Additional filters, such as cloud-cover filtering or the full set of filters already provided by the STAC search, would be appreciated but might not be essential in the initial phase. By the way, I don't know how new data is uploaded to the catalog internally, so more info or filters might be needed to make it work properly.
  2. It's essential that the notification include the STAC information of the new data, to bypass the need for a separate search (in the case that items have already been filtered when subscribing), enabling immediate access and analysis. If filters are not yet implemented in the subscription phase, consumers will need an efficient way to preview the newly added data and double-check that it fits their requirements (e.g. new data may be added, but the cloud-coverage percentage is higher than my threshold, so it isn't interesting for the use case). Information about lost or missing data would be appreciated as well.
  3. Kafka is the preferred delivery option due to its efficiency in handling streaming data. Additionally, HTTP webhooks and custom triggers would be valuable alternatives for more flexibility in integration and data consumption.

Just to present the scenario better: right now, to check for new data you need to run two searches and compare the results. A nice added feature would be a more efficient way to be notified.
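For context, the two-search workaround boils down to diffing result sets by item ID. A minimal sketch (using plain dicts in place of real STAC Items, which a client such as pystac-client would return; the data here is entirely hypothetical):

```python
def find_new_items(previous, current):
    """Return items in the current search results that were absent
    from the previous ones, keyed by STAC item id."""
    seen = {item["id"] for item in previous}
    return [item for item in current if item["id"] not in seen]

# Hypothetical results of two searches run at different times.
previous = [{"id": "S2A_20230601"}, {"id": "S2A_20230605"}]
current = previous + [{"id": "S2A_20230610"}]

new_items = find_new_items(previous, current)
```

A push-based notification would replace this polling-and-diffing loop entirely.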

@TomAugspurger

> and to select an area of interest.

That was the main thing I was interested in getting feedback on. Glad to hear that you consider it important :)

> Kafka is a preferred delivery option due to its efficiency in handling streaming data.

Also good to know.


Right now, the way we're thinking about this is that users would register a STAC search with a hypothetical notifications endpoint. The registration would include both the search criteria and some endpoint to deliver the notification to when a new item matches the search criteria.

Whenever a new STAC item is ingested, we'd check if it matches your search criteria. If it does, we'd deliver the STAC item (plus a bit of extra context, like the ID of the search that it matched) as an event to your system.

This is all just in discussion stage at the moment, but feel free to add additional suggestions here. I'll update this thread if / when we get this implemented.
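The register-a-search-then-match-on-ingest flow described above could be sketched roughly as follows. This is not the Planetary Computer API, just an illustration of the idea; the `Subscription` shape, field names, and bbox-only matching are all assumptions (a real implementation would evaluate the full STAC search criteria):

```python
from dataclasses import dataclass

@dataclass
class Subscription:
    search_id: str   # ID of the registered search, echoed back in events
    collection: str  # dataset / catalog ID to watch
    bbox: tuple      # (west, south, east, north) area of interest
    endpoint: str    # where matching items get delivered

def bboxes_intersect(a, b):
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def matches(sub, item):
    """Check a newly ingested item against a registered search."""
    return (item["collection"] == sub.collection
            and bboxes_intersect(sub.bbox, item["bbox"]))

def build_event(sub, item):
    # Deliver the STAC item plus a bit of context: the search it matched.
    return {"search_id": sub.search_id, "item": item}

sub = Subscription("sub-001", "sentinel-2-l2a",
                   (10.0, 44.0, 12.0, 46.0), "https://example.com/hook")
item = {"id": "S2B_x", "collection": "sentinel-2-l2a",
        "bbox": (11.0, 45.0, 11.5, 45.5)}
event = build_event(sub, item) if matches(sub, item) else None
```

On ingest, each new item would be checked against all registered subscriptions, and an event posted to each matching subscription's endpoint.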

@not-Karot
Author

Thank you for sharing the approach that's currently under discussion.

Registering the STAC search with notifications sounds like the ultimate solution for me, but I have some concerns/questions about registering an endpoint:

  • Utilizing an endpoint could create coupling between the systems. If the communication is synchronous, there could be potential delays or bottlenecks, especially with high volumes of data. Asynchronous communication might alleviate this issue, but it would still require careful consideration for scalability and fault tolerance. What happens if I want to add more workers listening to the topic?
  • Additionally, in a scenario with a large number of notifications, the burden of scaling requests and handling distribution would fall on the user. With a Kafka-based system, these aspects would be more transparent to the user.

Of course, users could potentially add an extra layer themselves by implementing a publisher behind the endpoint and managing distribution on their own, which might cover more use cases on the Planetary Computer side. However, I initially envisioned a slightly different approach, and I'm open to further discussion and willing to explore various possibilities. My main aim is to ensure an efficient, scalable solution that benefits the community.
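To make the "more workers listening to the topic" point concrete: in Kafka, messages are keyed (e.g. by collection), keys hash to a fixed set of partitions, and a consumer group rebalance divides those partitions among however many workers join, so scaling is transparent to the subscriber. A toy simulation of that mechanism (using CRC32 as a stand-in for Kafka's murmur2 partitioner; partition count and round-robin assignment are illustrative assumptions):

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    """Map a message key to a partition, as a Kafka producer would.
    CRC32 stands in for Kafka's murmur2-based default partitioner."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def assign_partitions(num_workers: int):
    """Divide partitions among workers round-robin, roughly what a
    consumer group rebalance does when workers join or leave."""
    return {w: [p for p in range(NUM_PARTITIONS) if p % num_workers == w]
            for w in range(num_workers)}
```

Adding a worker just triggers a new assignment; no subscriber-side routing code changes, which is the transparency argument above.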

@ghidalgo3

Closed due to inactivity, feel free to reopen if you would like to continue this discussion.
