Introduce pulsar replicator #1582

Closed
rdhabalia opened this issue Apr 15, 2018 · 5 comments

@rdhabalia
Contributor

Motivation

Pulsar already supports geo-replication, which persists messages across multiple clusters of a Pulsar instance. A client can set the replication clusters for a topic, and the Pulsar broker internally takes care of replicating to all of them. However, an application may sometimes want to replicate the same published messages to external systems that are not part of the Pulsar ecosystem, such as AWS Kinesis or DynamoDB. Right now, the client application has to take on the extra burden of publishing the same messages to both Pulsar and the external systems.
It would therefore be useful to introduce server-side replication that can replicate Pulsar messages to an external system without client intervention. Server-side replication should also be extensible, providing a plugin mechanism for adding replicators that support message replication to different external systems.

Requirement

  • Replicate Pulsar messages to an external system (e.g. Kinesis, Kafka, another Pulsar cluster)
  • Easy onboarding: a client should be able to add replicator configuration through the CLI/Admin API, and the appropriate replicators should start automatically.
  • Isolation from core message bus
  • Pluggable framework, extensible to support multiple external systems
  • New connectors should be developed with minimal effort
  • Operability and monitoring
    • API to start and stop individual replicators
    • API to get replicator stats
  • Security
    • The replicator framework should provide a pluggable mechanism for plugging in a KeyStore implementation that can store and fetch the client’s credentials required to connect to the external system.
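
To make the requirements above concrete, here is a minimal sketch of how a pluggable replicator registry with start/stop control, stats, and a credential-store hook might fit together. It is written in Python for brevity and is not the PIP-18 design or any Pulsar API: every name in it (`CredentialStore`, `Replicator`, `ListReplicator`, `ReplicatorRegistry`) is a hypothetical stand-in chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple


class CredentialStore(ABC):
    """Hypothetical pluggable key store for external-system credentials."""

    @abstractmethod
    def fetch(self, topic: str) -> Dict[str, str]:
        ...


class InMemoryCredentialStore(CredentialStore):
    """Toy implementation; a real one would call a secrets service."""

    def __init__(self, secrets: Dict[str, Dict[str, str]]) -> None:
        self._secrets = secrets

    def fetch(self, topic: str) -> Dict[str, str]:
        return self._secrets[topic]


class Replicator(ABC):
    """Hypothetical plugin contract: one instance per (topic, system)."""

    def __init__(self, topic: str, credentials: Dict[str, str]) -> None:
        self.topic = topic
        self.credentials = credentials
        self.running = False
        self.sent = 0

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def replicate(self, payload: bytes) -> bool:
        # Messages are dropped (not queued) while stopped; a real
        # replicator would resume from a cursor instead.
        if not self.running:
            return False
        self.send(payload)
        self.sent += 1
        return True

    @abstractmethod
    def send(self, payload: bytes) -> None:
        ...

    def stats(self) -> Dict[str, object]:
        return {"topic": self.topic, "running": self.running, "sent": self.sent}


class ListReplicator(Replicator):
    """Stand-in for a real connector: collects payloads in memory."""

    def __init__(self, topic: str, credentials: Dict[str, str]) -> None:
        super().__init__(topic, credentials)
        self.delivered: List[bytes] = []

    def send(self, payload: bytes) -> None:
        self.delivered.append(payload)


Factory = Callable[[str, Dict[str, str]], Replicator]


class ReplicatorRegistry:
    """Maps external-system names to factories and tracks live replicators."""

    def __init__(self, store: CredentialStore) -> None:
        self._store = store
        self._factories: Dict[str, Factory] = {}
        self._active: Dict[Tuple[str, str], Replicator] = {}

    def register(self, system: str, factory: Factory) -> None:
        self._factories[system] = factory

    def add(self, topic: str, system: str) -> Replicator:
        # Onboarding: look up credentials, build the plugin, auto-start it.
        replicator = self._factories[system](topic, self._store.fetch(topic))
        replicator.start()
        self._active[(topic, system)] = replicator
        return replicator

    def stop(self, topic: str, system: str) -> None:
        self._active[(topic, system)].stop()

    def stats(self, topic: str, system: str) -> Dict[str, object]:
        return self._active[(topic, system)].stats()
```

In this sketch, supporting a new external system only means subclassing `Replicator` and registering a factory, which is one way to read the "minimal effort" requirement; isolation from the broker is not modeled here.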

PIP:
https://github.com/apache/incubator-pulsar/wiki/PIP-18:-Pulsar-Replicator

@rdhabalia
Contributor Author

@srkukarni @merlimat
As discussed, we want the Pulsar replicator to use pulsar-connector to replicate messages to external systems. The ReplicatorProducer API in the replicator framework appears similar to the Sink API, so the Sink API should not need any significant change, and the replicator can use it with only small modifications.
Right now, the Sink API accepts a different Message type than the Pulsar Message, which forces the replicator to create an additional, unnecessary object. So I created PR #1632 to use the Pulsar Message in the Sink API.

Pulsar replicator implementation mainly touches 3 things:

  1. Onboarding API
  2. Replicator function that initializes connectors
  3. Pulsar connector that actually publishes messages to the external system

#1594 covers all of the above functionality. Once the pulsar-connector PR is merged, I will rebase the replicator PR on top of it, and later, in a separate PR, we can merge the replicator and connector into one module.

Can you please let us know your thoughts?

@OneCricketeer

Not to start a tech fight here or anything, but do you think having such similar naming to Confluent Replicator (aka Kafka Connect) would be an issue for "Pulsar Replicator / Connect"?

Second point: at least for Kafka, the Connect API covers the interaction points with external systems (such as Dynamo or Kinesis), while Confluent Replicator is a closed-source version of that API between Kafka clusters.

@merlimat
Contributor

@Cricket007 The naming here was indeed a bit misleading, since this is more about integrating heterogeneous systems with Pulsar.

Pulsar has always had "replicator" functionalities, in a much more advanced form compared to MirrorMaker or other proprietary solutions (http://pulsar.apache.org/docs/latest/admin/GeoReplication/).

Geo-replication targets replication between Pulsar clusters. Because both sides are Pulsar brokers that speak the native Pulsar protocol, we can achieve a lot of efficiencies.

Regarding the changes for this PR, the consensus has been to focus the efforts on a single "connector" framework, named "Pulsar-IO", which is scheduled for the 2.1 release.

The work on the Pulsar-IO framework addresses the problem of getting data in and out of Pulsar in the simplest possible way from a user standpoint:

  • No need to write code: just specify the configuration for sources and sinks
  • No need to "run" the connector: the system takes care of managing the connectors' runtime
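
As a rough illustration of the configuration-driven idea in the two points above, the sketch below shows a tiny runtime that picks a sink implementation from a configuration dictionary and feeds it messages, so the user supplies only configuration. This is not the Pulsar-IO API: `ConsoleSink`, `SINK_TYPES`, `run_sink`, and the `type`/`settings` keys are all hypothetical names assumed for the example.

```python
from typing import Dict, Iterable, List


class ConsoleSink:
    """Hypothetical sink that records formatted lines instead of printing."""

    def __init__(self, settings: Dict[str, str]) -> None:
        self.prefix = settings.get("prefix", "")
        self.lines: List[str] = []

    def write(self, payload: bytes) -> None:
        self.lines.append(self.prefix + payload.decode("utf-8"))


# Hypothetical lookup table: configuration names a type, the runtime
# resolves it to an implementation.
SINK_TYPES = {"console": ConsoleSink}


def run_sink(config: Dict[str, object], messages: Iterable[bytes]) -> ConsoleSink:
    """Instantiate the configured sink and push every message through it."""
    sink_cls = SINK_TYPES[config["type"]]
    sink = sink_cls(config.get("settings", {}))
    for message in messages:
        sink.write(message)
    return sink
```

A caller would pass something like `{"type": "console", "settings": {"prefix": "out: "}}` together with the messages; in a real system the runtime, not the user, would own the loop and the lifecycle of the sink.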

If you're interested, you can check out the work in progress: https://github.com/apache/incubator-pulsar/tree/master/pulsar-io . There's also a PR with some in-progress documentation: #1749

@sijie
Member

sijie commented Jun 13, 2018

@rdhabalia what is the status of this task?

@rdhabalia
Contributor Author

@sijie I think we can close this one, as we will be trying out the pulsar-io framework here.
