[Enhancement] allow Kafka Producers and configs to be stored in Egeria #5804

Closed
2 tasks done
davidradl opened this issue Oct 6, 2021 · 13 comments
Labels
enhancement New feature or request no-issue-activity Issues automatically marked as stale because they have not had recent activity. triage New bug/issue which needs checking & assigning

Comments

@davidradl
Member

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Kafka Producers cannot be stored in Egeria

Expected Behavior

We currently have https://egeria-project.org/open-metadata-publication/website/open-metadata-types/0223-events-and-logs

The SubscriberList would be a list of consumers.

I propose:

  • we deprecate SubscriberList and add TopicSubscriber, since the type name should be singular
  • deprecate the SubscriberTopic relationship and introduce a new one whose end is TopicSubscriber. Maybe call it SubscriberTopicLink
  • add a new TopicPublisher and TopicPublisherLink
  • consider changing the cardinality so that there is only one subscriber for a Topic, unless subscribing to multiple topics is likely to be needed
  • consider what identifies a consumer or publisher. I see group.id as a property. Is this the identifier that is important?
  • consider adding properties to the TopicSubscriber and TopicPublisher to hold the Kafka properties associated with the consumer and producer, or do these all go in as additional properties?
  • consider whether we need to be aware of consumer groups and partitions

Alternatives

no

Any Further Information?

no

Would you be prepared to be assigned this issue to work on?

  • I can work on this
@davidradl davidradl added enhancement New feature or request triage New bug/issue which needs checking & assigning labels Oct 6, 2021
@davidradl davidradl self-assigned this Oct 6, 2021
@davidradl
Member Author

@mandy-chessell @cmgrote @lpalashevski any feedback?

@cmgrote
Member

cmgrote commented Oct 7, 2021

As discussed on our call, some initial thoughts:

  • Retain the existing structure and cardinality (many-to-many), but probably a good idea to rename SubscriberList entity to Subscriber
  • We may want to introduce a more abstract type between DataSet and Topic -- something like EventStream, of which Topic would be a subtype as well as perhaps others (like Queue)
  • Then probably we'd have the relationship between Subscriber and EventStream rather than directly at the Topic level
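
The hierarchy suggested in these bullets could be sketched roughly as follows. This is only an illustration in Python, not Egeria type definitions: EventStream and Queue are the hypothetical types named above, and the `links` list, the helper functions and the sample names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """Stand-in for the open metadata DataSet type (illustrative only)."""
    qualified_name: str


@dataclass(frozen=True)
class EventStream(DataSet):
    """Hypothetical abstract type sitting between DataSet and Topic."""


@dataclass(frozen=True)
class Topic(EventStream):
    """Topic becomes one subtype of EventStream."""


@dataclass(frozen=True)
class Queue(EventStream):
    """Another possible subtype of EventStream."""


@dataclass(frozen=True)
class Subscriber:
    """SubscriberList renamed to Subscriber."""
    name: str


# Many-to-many relationship, held at the EventStream level rather than on Topic.
links = []


def subscribe(subscriber, stream):
    """Link a subscriber to any kind of event stream."""
    links.append((subscriber, stream))


def streams_for(subscriber):
    """All event streams a subscriber is linked to, in insertion order."""
    return [stream for sub, stream in links if sub == subscriber]


orders = Topic("kafka:orders")
audit = Queue("mq:audit")
billing = Subscriber("billing-service")
subscribe(billing, orders)
subscribe(billing, audit)
```

Because the link targets EventStream, the same relationship covers both the Topic and the Queue without a type-specific variant.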

@mandy-chessell
Contributor

mandy-chessell commented Oct 7, 2021

There are a number of misunderstandings described above and so this post is going to need to contradict the statements above.

Firstly, Kafka producers can be represented in the open metadata types. They are some form of SoftwareServerCapability within a software server. For example, it may be an Application or a SoftwareService. Here are some examples:

image

image

If the Kafka consumer is a specific type of software server capability that is not currently represented then we should add a new subtype.

The Topic is an Asset. The relationship to link the Topic to the kafka producer's SoftwareServerCapability is ServerAssetUse:

image

For a Kafka producer the useType is set to MAINTAINS.
For a Kafka consumer the useType is set to USES.
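
As a rough illustration of this, here is a minimal Python sketch of ServerAssetUse relationships between capabilities and a Topic. It is a toy in-memory model, not the Egeria API: the class layout, the `capabilities_for` helper and the qualified names are assumptions, and only the two useType values mentioned here are modelled.

```python
from dataclasses import dataclass
from enum import Enum


class UseType(Enum):
    # Only the two values named in this comment are modelled.
    MAINTAINS = "MAINTAINS"  # set for a Kafka producer
    USES = "USES"            # set for a Kafka consumer


@dataclass(frozen=True)
class ServerAssetUse:
    capability: str    # qualified name of the SoftwareServerCapability (invented)
    asset: str         # qualified name of the Topic asset (invented)
    use_type: UseType


relationships = [
    ServerAssetUse("app:order-entry", "topic:orders", UseType.MAINTAINS),
    ServerAssetUse("app:billing", "topic:orders", UseType.USES),
    ServerAssetUse("app:shipping", "topic:orders", UseType.USES),
]


def capabilities_for(topic, use_type):
    """Capabilities linked to the topic with the given useType."""
    return [
        r.capability
        for r in relationships
        if r.asset == topic and r.use_type is use_type
    ]


producers = capabilities_for("topic:orders", UseType.MAINTAINS)
consumers = capabilities_for("topic:orders", UseType.USES)
```

Producers and consumers are then found with the same query, differing only in the useType filter.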

The SubscriberList is a DataSet that lists the subscribers as a data/properties file. This is for a system that pushes events to its list of subscribers and it shows where this information is located. This is not the way that Kafka works but I would like to keep it along with TopicSubscribers for systems that operate in this way.

@davidradl
Member Author

@mandy-chessell thank you for the clarifications - that makes sense.

@juergenhemelt

I think that a SubscriberList, as a group of subscribers, is not assigned to a Topic. Rather, a Subscriber subscribes to 0..n Topics and a Topic is subscribed to by 0..m Subscribers.
Furthermore, I don't think a Subscriber is a kind of DataSet. Typically a subscriber would be an active component which processes the events it gets from the subscribed topics. A DataSet, from my point of view, is a passive component.

@davidradl
Member Author

@xgadjhe thank you for your comments on this issue. It looks like I was not correct in talking about SubscriberList, as this is not used by Egeria to represent Kafka subscription. I did not realise that when we talked earlier. Do you think that populating the SoftwareServerCapability as above would be sufficient for you, or do you think it needs augmenting in some way?

@mandy-chessell
Contributor

Sorry for the confusion @xgadjhe @davidradl @cmgrote. I did not think deeply enough about the context of David's question when I led him to the 0223 model. Here is some more information.

Topics, Processes, DeployedAPIs and DataSets are all types of Assets. We can link them together to form the lineage flow. However this does not describe the agent that is actively working with these assets. The SoftwareServerCapability is the definition of this agent when it is software. (Other agents could be people of course.) The SoftwareServerCapability is linked to the asset using the ServerAssetUse relationship to describe its role in working with the asset. This is expressed in the useType property.

The SoftwareServerCapability has an important role in linking the assets to the infrastructure that is supporting them through the SoftwareServerSupportedCapability relationship (0042).
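
To make the chain from asset to infrastructure concrete, here is a small Python sketch that walks from a Topic to the servers behind it. The tuples stand in for ServerAssetUse and SoftwareServerSupportedCapability relationships; all names and the `servers_working_with` helper are invented for illustration and are not Egeria calls.

```python
# (capability, asset, useType) triples standing in for ServerAssetUse.
server_asset_use = [
    ("cap:order-entry", "topic:orders", "MAINTAINS"),
    ("cap:billing", "topic:orders", "USES"),
]

# (server, capability) pairs standing in for SoftwareServerSupportedCapability.
supported_capability = [
    ("server:app-01", "cap:order-entry"),
    ("server:app-02", "cap:billing"),
]


def servers_working_with(asset):
    """Return sorted (server, capability, useType) rows for one asset.

    Capabilities that are not hosted on any known server are skipped.
    """
    hosted_on = {cap: server for server, cap in supported_capability}
    rows = [
        (hosted_on[cap], cap, use)
        for cap, linked_asset, use in server_asset_use
        if linked_asset == asset and cap in hosted_on
    ]
    return sorted(rows)


trace = servers_working_with("topic:orders")
```

Each row shows which server hosts the agent and what role that agent plays for the asset.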

@davidradl
Member Author

@mandy-chessell thanks for the explanation. I am wondering whether we need a subclass of SoftwareServerCapability to be more explicit about its purpose, maybe a subscriptionService and a publishedService, or do we think the fact that it is related to the topic with the useType is enough?

@mandy-chessell
Contributor

A subclass does not seem to work because software server capabilities are not just Kafka producers or consumers. For example, many of the Egeria registered services are both Kafka producers and consumers, similar to OMRS. We model them as subtypes of SoftwareService. They also support APIs and access data sources. What type should they be? Would we need 2 subtypes for every current subtype of SoftwareServerCapability in case they access an event service? And do we make similar changes in case they have an API? Then what about the combinations?
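
The growth in subtypes can be counted with a short Python sketch. The two base types come from the examples earlier in this thread; the four role names are made up to stand for "produces events", "consumes events", "has an API" and "accesses a data source", so the numbers only illustrate the shape of the problem.

```python
from itertools import combinations

# Two existing subtypes used as examples earlier in the thread.
base_types = ["Application", "SoftwareService"]

# Hypothetical role names, invented for this illustration.
roles = ["EventProducer", "EventConsumer", "APIProvider", "DataSourceUser"]

# Every non-empty combination of roles a capability might play.
role_sets = [
    combo
    for size in range(1, len(roles) + 1)
    for combo in combinations(roles, size)
]

# Encoding the roles in the type name needs one subtype per base type and role set.
subtypes_needed = len(base_types) * len(role_sets)

# Encoding the roles on the relationship needs no new entity types at all.
relationship_types_needed = 1
```

With four roles there are 15 non-empty combinations, so two base types already need 30 subtypes, and each extra role roughly doubles that figure.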

It could be possible to add new relationship types between asset and software server capability to show producers and consumers but I would not recommend that unless there are specific properties that need to be stored. I would recommend that even if these relationships were added, the ServerAssetUse relationship was still established to prevent the need for special case code in the governance/lineage modules just for event systems.

@juergenhemelt

juergenhemelt commented Oct 8, 2021

@mandy-chessell I agree that SoftwareServerCapability could be the correct place to represent consumers, producers and owners of topics, but I think the model should be more precise when it comes to the relationships. A topic can have many producers and many consumers but only one owner. So we need to have 3 relationships: 2 with a many-to-many cardinality and one (the owning relationship) with a one-to-many cardinality between SoftwareServerCapability and Topic. I think the relationships should not be too generic or you end up with a model consisting of things having relationships to things.

@mandy-chessell
Contributor

@xgadjhe The ServerAssetUse relationship is many-to-many. The useType parameter in that relationship distinguishes between the producer, consumer and owner. So if a topic has 2 producers, 1 consumer and 1 owner there are 4 relationships of type ServerAssetUse with the useType set as follows:

image
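
In plain data, that case might look like the Python sketch below. MAINTAINS and USES are the values given earlier in this thread for producers and consumers; OWNS for the owner is an assumption, since the screenshot is not reproduced here, and the capability names and helper functions are invented.

```python
from collections import Counter

# (capability, topic, useType) triples standing in for ServerAssetUse.
uses = [
    ("cap:order-entry", "topic:orders", "MAINTAINS"),   # producer 1
    ("cap:web-shop", "topic:orders", "MAINTAINS"),      # producer 2
    ("cap:billing", "topic:orders", "USES"),            # consumer
    ("cap:event-platform", "topic:orders", "OWNS"),     # owner (assumed value)
]


def use_type_counts(topic):
    """Count the relationships attached to a topic, grouped by useType."""
    return Counter(use for _, linked_topic, use in uses if linked_topic == topic)


def has_at_most_one_owner(topic):
    """Check the 'only one owner' rule by filtering on useType."""
    return use_type_counts(topic)["OWNS"] <= 1


counts = use_type_counts("topic:orders")
```

A rule such as "only one owner per topic" then becomes a check on the useType values rather than a separate relationship type.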

@davidradl
Member Author

@mandy-chessell I see that the Asset Manager OMAS has ServerAssetUseType. I cannot see it being used. I cannot see an OMAS API that would allow the consumers and producers to be added and queried in Egeria. I assume this would be the Asset Manager OMAS, rather than the Data Manager OMAS, which is handling topics and schemas.

The AssetManagerElement in the Asset Manager OMAS contains SoftwareServerCapability properties. It looks like we might need to extend this OMAS so that AssetManagerElement contains this relationship content, by either:
1- adding CRUD calls for the relationship
or
2- extending AssetManagerElement to contain an array of cut-down topic objects, something like
{
qualifiedName:
guid:
assetUse:
}
and maybe adding more granular CRUD calls for the contents of the array.

The 2nd option looks more consumable - as the caller could then query the asset manager element and see the software server capability properties and the consumer and producer information.
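
A rough Python sketch of what the 2nd option could look like to a caller is shown below. This is not the existing Asset Manager OMAS bean: the class layout, the `topics` field name, the `topics_with_use` helper and the sample values are all hypothetical, and only the three property names come from the outline above.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TopicUse:
    """Cut-down topic object with the three properties outlined above."""
    qualifiedName: str
    guid: str
    assetUse: str


@dataclass
class AssetManagerElement:
    """Hypothetical extended element: capability properties plus its topics."""
    qualifiedName: str
    topics: list = field(default_factory=list)


def topics_with_use(element, asset_use):
    """Qualified names of the topics linked with the given assetUse value."""
    return [t.qualifiedName for t in element.topics if t.assetUse == asset_use]


element = AssetManagerElement(
    qualifiedName="cap:billing",
    topics=[
        TopicUse("topic:orders", "guid-placeholder-1", "USES"),
        TopicUse("topic:invoices", "guid-placeholder-2", "MAINTAINS"),
    ],
)

# What a caller might receive when the element is serialised.
payload = asdict(element)
```

One query on the element would then return both the capability properties and its producer and consumer links.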

Would it be reasonable for the data manager to be able to see the consumers and producers associated with the topic? I am not sure whether the persona would ever need this information.

Can you confirm this makes sense please, @mandy-chessell?

@davidradl davidradl changed the title from "[Enhancement] allow Kafka Producer to be stored in Egeria" to "[Enhancement] allow Kafka Producers and configs to be stored in Egeria" Nov 21, 2021
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label Jan 21, 2022