-
Notifications
You must be signed in to change notification settings - Fork 643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AWS Kinesis KCL streams support #1667
Conversation
Check it out @julianhowarth |
FTP test are failing. All clear in Kinesis. |
I've improved the shard termination, as discussed in aserrallerios/kcl-akka-stream#13 (comment) with @julianhowarth. Please, take a careful look at this commit in particular. Summary: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for offering to move "back home" to Alpakka and sorry for letting you wait for some feedback.
I know too little about Kinesis so I have a few questions...
kinesis/src/main/scala/akka/stream/alpakka/kinesis/KinesisSchedulerSettings.scala
Outdated
Show resolved
Hide resolved
kinesis/src/main/scala/akka/stream/alpakka/kinesis/ShardProcessor.scala
Outdated
Show resolved
Hide resolved
kinesis/src/main/scala/akka/stream/alpakka/kinesis/scaladsl/KinesisSchedulerSource.scala
Outdated
Show resolved
Hide resolved
kinesis/src/main/scala/akka/stream/alpakka/kinesis/scaladsl/KinesisSchedulerSource.scala
Outdated
Show resolved
Hide resolved
kinesis/src/main/scala/akka/stream/alpakka/kinesis/scaladsl/KinesisSchedulerSource.scala
Outdated
Show resolved
Hide resolved
kinesis/src/main/scala/akka/stream/alpakka/kinesis/CommittableRecord.scala
Outdated
Show resolved
Hide resolved
kinesis/src/main/scala/akka/stream/alpakka/kinesis/CommittableRecord.scala
Outdated
Show resolved
Hide resolved
kinesis/src/main/scala/akka/stream/alpakka/kinesis/scaladsl/KinesisSchedulerSource.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dependency trouble... But Alpakka 1.1 will not be a long time away.
project/Dependencies.scala
Outdated
"software.amazon.awssdk" % "kinesis" % AwsSdk2Version, // ApacheV2 | ||
"software.amazon.awssdk" % "dynamodb" % AwsSdk2Version, // ApacheV2 | ||
"software.amazon.awssdk" % "cloudwatch" % AwsSdk2Version, // ApacheV2 | ||
"software.amazon.kinesis" % "amazon-kinesis-client" % "2.2.0", // Amazon Software License |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is where it all started... They changed to Apache 2 now https://github.com/awslabs/amazon-kinesis-client/releases
Even https://github.com/awslabs/amazon-kinesis-client/blob/master/LICENSE.txt
The dynamodb
and cloudwatch
dependencies are transitive to amazon-kinesis-client
so they don't need to be listed.
I'm more worried about is pulling in protobuf-java
as well. We can not have Alpakka Kinesis pulling it in in a patch version, so this PR needs to wait for Alpakka 1.1, I'm afraid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great if you can pick this up again.
kinesis/src/main/scala/akka/stream/alpakka/kinesis/impl/SetupStage.scala
Outdated
Show resolved
Hide resolved
87579eb
to
9a1ebee
Compare
Moved to akka's |
1e0fe60
to
0f1b0be
Compare
I had to do a change: After the usage of In Scala 2.12 I can use So, I will drop the MV (use |
mqtt/jms tests failed. |
kinesis/src/main/scala/akka/stream/alpakka/kinesis/scaladsl/KinesisSchedulerSource.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't think of it before, but we'd want to move Kinesis to AWS SDK 2 before pushing this forward.
kinesis/src/main/scala/akka/stream/alpakka/kinesis/Errors.scala
Outdated
Show resolved
Hide resolved
Anything I can do to get this merged? |
Thank you for the ping. |
f6cbd08
to
f1d343c
Compare
Done. Build fails elsewhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I'd like to see a bit more about the differences of Kinesis Data Streams and KCL in the doc page.
Maybe borrow something from https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-kcl.html
|
||
import scala.concurrent.duration._ | ||
|
||
final class KinesisSchedulerSourceSettings(val bufferSize: Int, val backpressureTimeout: FiniteDuration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final class KinesisSchedulerSourceSettings(val bufferSize: Int, val backpressureTimeout: FiniteDuration) { | |
final class KinesisSchedulerSourceSettings private (val bufferSize: Int, val backpressureTimeout: FiniteDuration) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check changes in KinesisSchedulerSettings.
kinesis/src/main/scala/akka/stream/alpakka/kinesis/KinesisSchedulerSettings.scala
Outdated
Show resolved
Hide resolved
val retrievalConfig = | ||
configsBuilder.retrievalConfig | ||
.retrievalFactory( | ||
new SynchronousBlockingRetrievalFactory(streamName, kinesisClient, new SimpleRecordsFetcherFactory, 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constructor is deprecated, the new version takes kinesisRequestTimeout
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed unnecessary code from snippets.
implicit val executor = | ||
ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(1000)) | ||
|
||
KinesisSchedulerSource(builder, schedulerSourceSettings).to(Sink.ignore) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example looks a bit disturbing, would it actually do anything useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a some stream logic.
kinesis/src/test/scala/akka/stream/alpakka/kinesis/KinesisSchedulerSourceSpec.scala
Outdated
Show resolved
Hide resolved
kinesis/src/main/scala/akka/stream/alpakka/kinesis/impl/KinesisSchedulerSourceStage.scala
Outdated
Show resolved
Hide resolved
kinesis/src/main/scala/akka/stream/alpakka/kinesis/impl/KinesisSchedulerSourceStage.scala
Outdated
Show resolved
Hide resolved
|
||
object KinesisSchedulerSourceSettings { | ||
|
||
val defaultInstance = new KinesisSchedulerSourceSettings(1000, 1.minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return the default from parameterless apply
and create
methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add a create
method in each companion obj here?
// This implementation will try to checkpoint every Record with the original | ||
// checkpointer. Other option would be to keep a reference of the latest | ||
// checkpointer passed to this instance using any of these methods: | ||
// * processRecords | ||
// * shutdownRequested | ||
// * shardEnded | ||
val checkpoint = (record: KinesisClientRecord) => | ||
processRecordsInput.checkpointer().checkpoint(record.sequenceNumber(), record.subSequenceNumber()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that be something the user would want to decide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean using the "original" checkpointer vs use the "latest" checkpointer?
I think it's a low-level decission that the library should take for the user. It looks to work both ways, although that can change in the future. I'll try to improve the code a bit btw.
kinesis/src/main/scala/akka/stream/alpakka/kinesis/KinesisSchedulerSettings.scala
Show resolved
Hide resolved
a681c70
to
b8115cc
Compare
Please have a look now. The documentation improvements are still missing. |
b8115cc
to
df101e6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting there, I dug a bit deeper and came up with more suggestions.
(FYI: I'll be off for the next 3 weeks.)
extends CommittableRecord(record, batchData, shardData) { | ||
private def checkpoint(): Unit = | ||
checkpointer.checkpoint(record.sequenceNumber(), record.subSequenceNumber()) | ||
private def checkpointAndRelease(): Unit = { checkpoint(); semaphore.release() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private def checkpointAndRelease(): Unit = { checkpoint(); semaphore.release() } | |
private def checkpointAndRelease(): Unit = { | |
checkpoint() | |
semaphore.release() | |
} |
Would it make sense to move the if (isLatestRecord)
into a single checkpoint()
method?
new InternalCommittableRecord( | ||
record, | ||
batchData, | ||
isLatestRecord = processRecordsInput.isAtShardEnd && index + 1 == numberOfRecords |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds more like the "last" record.
callback: CommittableRecord => Unit | ||
) extends ShardRecordProcessor { | ||
|
||
private val semaphore = new Semaphore(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment about what the semaphore is blocking and why. Which dispatcher will it run on?
kinesis/src/main/scala/akka/stream/alpakka/kinesis/impl/KinesisSchedulerSourceStage.scala
Show resolved
Hide resolved
self = getStageActor(awaitingRecords) | ||
val newRecordCallback: CommittableRecord => Unit = { | ||
semaphore.tryAcquire(backpressureTimeout.length, backpressureTimeout.unit) | ||
self.ref ! NewRecord(_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is OK, but most other Alpakka connectors use async callbacks instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the async callbacks logic now, is a bit less fair than before as I turned some of the intra-stage calls synchronous. I expect this to have higher performance.
This can still be changed to everything asynchronous to maximize fairness.
*/ | ||
def apply( | ||
schedulerBuilder: ShardRecordProcessorFactory => Scheduler, | ||
settings: KinesisSchedulerSourceSettings = KinesisSchedulerSourceSettings.defaultInstance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default parameters hinder bin-compatible evolvement. Passing the defaults explicitly is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we want to remove default parameters? They are all over the Kinesis API...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please remove them on these new methods.
Flow[CommittableRecord] | ||
.map { | ||
case record if record.canBeCheckpointed => | ||
record | ||
.tryToCheckpoint() | ||
.recover({ | ||
case _: ShutdownException => Done | ||
}) | ||
.get | ||
case _ => Done | ||
} | ||
.addAttributes(Attributes(ActorAttributes.IODispatcher)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you move this to a ShardProcessor
object, the semaphore and the IO dispatcher show together and are easier to understand.
|
||
split | ||
.out(0) | ||
.map(_.max) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to point to the ordering defined in CommittableRecord
.
} | ||
.addAttributes(Attributes(ActorAttributes.IODispatcher)) | ||
) ~> join.in0 | ||
split.out(1) ~> join.in1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you need to drop down into the DSL, use it all the way.
split.out(0) ~> checkpoint ~> join.in0
split.out(1) ~> join.in1
join.out ~> flatten ~> result
Flow[CommittableRecord] | ||
.groupBy(MAX_KINESIS_SHARDS, _.processorData.shardId) | ||
.groupedWithin(settings.maxBatchSize, settings.maxBatchWait) | ||
.via(GraphDSL.create() { implicit b => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put the DSL in a val
so you can give it a describing name.
The checkpoint coordination is tricky. Let me explain. The issue is that when the
As we are checkpointing in an asynchronous way, that needs to be coordinated using another method: we lock the I'll put a comment in semaphore, but the An alternative implementation to that is to give a grace period, to the final I'm open to suggestions here. |
9a8bfac
to
b24934b
Compare
Made all the checkpoint stuff a bit more explicit. Added comments and documentation to methods. Only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding more context about the required blocking.
*/ | ||
def apply( | ||
schedulerBuilder: ShardRecordProcessorFactory => Scheduler, | ||
settings: KinesisSchedulerSourceSettings = KinesisSchedulerSourceSettings.defaultInstance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please remove them on these new methods.
/** | ||
* Java API | ||
*/ | ||
def create(maxBatchSize: Int, maxBatchWait: FiniteDuration): KinesisSchedulerCheckpointSettings = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as below.
def apply(bufferSize: Int, backpressureTimeout: java.time.Duration): KinesisSchedulerSourceSettings = | ||
KinesisSchedulerSourceSettings(bufferSize, FiniteDuration.apply(backpressureTimeout.toMillis, MILLISECONDS)) | ||
def apply: KinesisSchedulerSourceSettings = KinesisSchedulerSourceSettings(1000, 1.minute) | ||
|
||
val defaultInstance: KinesisSchedulerSourceSettings = KinesisSchedulerSourceSettings.apply | ||
|
||
/** | ||
* Java API | ||
*/ | ||
def create(bufferSize: Int, backpressureTimeout: FiniteDuration): KinesisSchedulerSourceSettings = | ||
new KinesisSchedulerSourceSettings(bufferSize, backpressureTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason for java.time.Duration
in apply
, but in the Java API.
You should create the instance in val defaultInstance
and return that from apply
.
kinesis/src/main/scala/akka/stream/alpakka/kinesis/impl/ShardProcessor.scala
Show resolved
Hide resolved
def create( | ||
schedulerBuilder: SchedulerBuilder | ||
): Source[CommittableRecord, CompletionStage[Scheduler]] = | ||
create(schedulerBuilder, KinesisSchedulerSourceSettings.defaultInstance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is the same as the default arguments in the Scala DSL. I think it is better to make passing the default instance look as nice as possible (maybe call it just defaults
?).
@aserrallerios Any chance to push this the last bit? |
Yes, let me see if I can get through it today. |
@ennru can you please remind me the suggestions regarding the |
project/Dependencies.scala
Outdated
) ++ Seq( | ||
"software.amazon.awssdk" % "kinesis" % AwsSdk2Version, // ApacheV2 | ||
"software.amazon.awssdk" % "firehose" % AwsSdk2Version, // ApacheV2 | ||
"software.amazon.awssdk" % "dynamodb" % AwsSdk2Version, // ApacheV2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DynamoDB is only used in tests, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is used by the KCL library, it's a transitive dependency. Probably I should remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK.
Might be better to pull it transitively, so please remove it.
I believe the docs are OK. |
a554894
to
ad21619
Compare
Rebased and squashed (I can push the old history if needed). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Long-time coming! Thank you for your work on this. |
Hi @aserrallerios. We're observing a transiently failing error for a Kinesis test case. You can find runtime details here in #2219. The test in question utilizes a lot of |
Will have a look later today. Thanks! |
Thanks a lot @aserrallerios! |
Adds the KCL Source and record checkpointer Flow/Sink.
Updated version of the old PR: #434