
AWS Kinesis KCL streams support #1667

Merged 3 commits into akka:master on Jan 20, 2020
Conversation

@aserrallerios (Contributor)

Adds the KCL Source and record checkpointer Flow/Sink.

Updated version of the old PR: #434

@aserrallerios (Contributor, Author)

Check it out @julianhowarth

@aserrallerios (Contributor, Author)

FTP tests are failing. All clear in Kinesis.

@aserrallerios (Contributor, Author) commented May 3, 2019

I've improved the shard termination, as discussed with @julianhowarth in aserrallerios/kcl-akka-stream#13 (comment).

Please take a careful look at this commit in particular.

Summary: I removed the grace period, so the ShardProcessor will never commit (and close) the shard on shardEnded before the latest record emitted by that ShardProcessor has been committed. The only additional synchronization point is in the ShardProcessor. Committing a record never blocks (though of course it may fail if the lease is lost or there is any other unexpected failure).

@ennru (Member) left a comment

Thank you for offering to move "back home" to Alpakka, and sorry for keeping you waiting for feedback.
I know too little about Kinesis, so I have a few questions...

@ennru (Member) left a comment

Dependency trouble... But Alpakka 1.1 is not far away.

"software.amazon.awssdk" % "kinesis" % AwsSdk2Version, // ApacheV2
"software.amazon.awssdk" % "dynamodb" % AwsSdk2Version, // ApacheV2
"software.amazon.awssdk" % "cloudwatch" % AwsSdk2Version, // ApacheV2
"software.amazon.kinesis" % "amazon-kinesis-client" % "2.2.0", // Amazon Software License
Member:

The comment is where it all started... They have changed to Apache 2 now: https://github.com/awslabs/amazon-kinesis-client/releases
Even https://github.com/awslabs/amazon-kinesis-client/blob/master/LICENSE.txt

The dynamodb and cloudwatch dependencies are transitive dependencies of amazon-kinesis-client, so they don't need to be listed.

What I'm more worried about is that this pulls in protobuf-java as well. We cannot have Alpakka Kinesis start pulling that in in a patch version, so this PR needs to wait for Alpakka 1.1, I'm afraid.

@ennru added the dependency-change label (for PRs changing the version of a dependency) on May 15, 2019
@ennru added this to the 1.1.0 milestone on May 15, 2019
@ennru modified the milestones: 1.1.0 → 2.0.0 on Jun 28, 2019
@ennru (Member) left a comment

It would be great if you could pick this up again.

@aserrallerios (Contributor, Author) commented Aug 10, 2019

Moved to Akka's Source.setup and removed the code that called Await.result on every queue.offer.

@aserrallerios (Contributor, Author)

I had to make a change:

After the switch to Source.setup, the inner materialized value is wrapped in a Future[_]. As the inner MV was already a Future[Scheduler], the final MV now becomes Future[Future[Scheduler]].

In Scala 2.12 I can use Future.flatten to flatten it without an ExecutionContext, but in older versions I cannot.

So I will drop the MV (use NotUsed). If someone finds a way to reconcile a Future[_] MV with SetupStage, please tell me and I will add the MV back.
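A minimal sketch of the nesting in question (the inner source is a stand-in; Source.setup wraps whatever it materializes in a Future):

```scala
import akka.stream.scaladsl.Source
import scala.concurrent.Future
import software.amazon.kinesis.coordinator.Scheduler
import software.amazon.kinesis.retrieval.KinesisClientRecord

// Stand-in for the real source, which materializes to a Future[Scheduler]:
def innerSource: Source[KinesisClientRecord, Future[Scheduler]] = ???

// Source.setup wraps the inner materialized value in another Future:
val outer: Source[KinesisClientRecord, Future[Future[Scheduler]]] =
  Source.setup((_, _) => innerSource)

// On Scala 2.12+ the nesting can be flattened without an ExecutionContext;
// Scala 2.11 has no Future.flatten, hence the problem described above:
val flattened: Source[KinesisClientRecord, Future[Scheduler]] =
  outer.mapMaterializedValue(_.flatten)
```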

@aserrallerios (Contributor, Author)

mqtt/jms tests failed.

@ennru (Member) left a comment

Didn't think of it before, but we'd want to move Kinesis to AWS SDK 2 before pushing this forward.

docs/src/main/paradox/kinesis.md
project/Dependencies.scala
@aserrallerios (Contributor, Author)

Anything I can do to get this merged?

@ennru (Member) commented Dec 17, 2019

Thank you for the ping.
Now that the whole Kinesis connector is on AWS SDK v2, it would be great to warm this up.
Please resolve the conflicts and I'll have another look.

@aserrallerios (Contributor, Author)

Done. Build fails elsewhere.

@ennru (Member) left a comment

Looks good. I'd like to see a bit more about the differences between Kinesis Data Streams and the KCL on the doc page.
Maybe borrow something from https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-kcl.html


```scala
import scala.concurrent.duration._

final class KinesisSchedulerSourceSettings(val bufferSize: Int, val backpressureTimeout: FiniteDuration) {
```
Member:

Suggested change:
```diff
- final class KinesisSchedulerSourceSettings(val bufferSize: Int, val backpressureTimeout: FiniteDuration) {
+ final class KinesisSchedulerSourceSettings private (val bufferSize: Int, val backpressureTimeout: FiniteDuration) {
```

Contributor Author:

Please check changes in KinesisSchedulerSettings.

```scala
val retrievalConfig =
  configsBuilder.retrievalConfig
    .retrievalFactory(
      new SynchronousBlockingRetrievalFactory(streamName, kinesisClient, new SimpleRecordsFetcherFactory, 1000)
```
Member:

This constructor is deprecated, the new version takes kinesisRequestTimeout.

Contributor Author:

I removed unnecessary code from snippets.

```scala
implicit val executor =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(1000))

KinesisSchedulerSource(builder, schedulerSourceSettings).to(Sink.ignore)
```
Member:

This example looks a bit disturbing, would it actually do anything useful?

Contributor Author:

Added some stream logic.
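For reference, stream logic of this shape looks roughly like the sketch below against the connector's Scala DSL. Package paths, the `defaults` settings, and `checkpointRecordsFlow` are taken from the connector as merged and should be treated as assumptions; the scheduler builder is application-specific.

```scala
import akka.Done
import akka.actor.ActorSystem
import akka.stream.alpakka.kinesis.{KinesisSchedulerCheckpointSettings, KinesisSchedulerSourceSettings}
import akka.stream.alpakka.kinesis.scaladsl.KinesisSchedulerSource
import akka.stream.scaladsl.Sink
import scala.concurrent.Future
import software.amazon.kinesis.coordinator.Scheduler
import software.amazon.kinesis.processor.ShardRecordProcessorFactory

implicit val system: ActorSystem = ActorSystem()

// Building the KCL Scheduler is application-specific; see the KCL docs.
def schedulerBuilder(factory: ShardRecordProcessorFactory): Scheduler = ???

val done: Future[Done] =
  KinesisSchedulerSource(schedulerBuilder, KinesisSchedulerSourceSettings.defaults)
    .map { record =>
      // application logic on each CommittableRecord goes here
      record
    }
    .via(KinesisSchedulerSource.checkpointRecordsFlow(KinesisSchedulerCheckpointSettings.defaults))
    .runWith(Sink.ignore)
```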


```scala
object KinesisSchedulerSourceSettings {

  val defaultInstance = new KinesisSchedulerSourceSettings(1000, 1.minute)
```
Member:

Return the default from parameterless apply and create methods.

Contributor Author:

Should I add a create method in each companion object here?

Comment on lines 41 to 48
```scala
// This implementation will try to checkpoint every Record with the original
// checkpointer. Other option would be to keep a reference of the latest
// checkpointer passed to this instance using any of these methods:
// * processRecords
// * shutdownRequested
// * shardEnded
val checkpoint = (record: KinesisClientRecord) =>
  processRecordsInput.checkpointer().checkpoint(record.sequenceNumber(), record.subSequenceNumber())
```
Member:

Would that be something the user would want to decide?

Contributor Author:

Do you mean using the "original" checkpointer vs. using the "latest" checkpointer?

I think it's a low-level decision that the library should make for the user. It seems to work both ways, although that could change in the future. I'll try to improve the code a bit, by the way.

@aserrallerios (Contributor, Author)

Please have a look now. The documentation improvements are still missing.

@ennru (Member) left a comment

Getting there, I dug a bit deeper and came up with more suggestions.
(FYI: I'll be off for the next 3 weeks.)

```scala
    extends CommittableRecord(record, batchData, shardData) {
  private def checkpoint(): Unit =
    checkpointer.checkpoint(record.sequenceNumber(), record.subSequenceNumber())
  private def checkpointAndRelease(): Unit = { checkpoint(); semaphore.release() }
```
Member:

Suggested change:
```diff
- private def checkpointAndRelease(): Unit = { checkpoint(); semaphore.release() }
+ private def checkpointAndRelease(): Unit = {
+   checkpoint()
+   semaphore.release()
+ }
```

Would it make sense to move the if (isLatestRecord) into a single checkpoint() method?

```scala
new InternalCommittableRecord(
  record,
  batchData,
  isLatestRecord = processRecordsInput.isAtShardEnd && index + 1 == numberOfRecords
```
Member:

This sounds more like the "last" record.

```scala
  callback: CommittableRecord => Unit
) extends ShardRecordProcessor {

  private val semaphore = new Semaphore(1)
```
Member:

Please add a comment about what the semaphore is blocking and why. Which dispatcher will it run on?

```scala
self = getStageActor(awaitingRecords)
val newRecordCallback: CommittableRecord => Unit = {
  semaphore.tryAcquire(backpressureTimeout.length, backpressureTimeout.unit)
  self.ref ! NewRecord(_)
```
Member:

This is OK, but most other Alpakka connectors use async callbacks instead.
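For context, the async-callback pattern the reviewer refers to looks roughly like this. A minimal, self-contained sketch; the stage name, the buffering policy, and how the callback is handed to the record processor are illustrative, not the PR's code:

```scala
import akka.stream.{Attributes, Outlet, SourceShape}
import akka.stream.stage.{AsyncCallback, GraphStage, GraphStageLogic, OutHandler}
import scala.collection.mutable

// Sketch of a callback-driven source stage: external (KCL) threads call
// `invoke` on the AsyncCallback; the handler body runs inside the stage.
final class CallbackSourceSketch[T] extends GraphStage[SourceShape[T]] {
  val out: Outlet[T] = Outlet("CallbackSourceSketch.out")
  override val shape: SourceShape[T] = SourceShape(out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with OutHandler {
      private val buffer = mutable.Queue.empty[T]

      // Runs on the stage's dispatcher, so it may safely touch stage state.
      val newRecord: AsyncCallback[T] = getAsyncCallback[T] { record =>
        buffer.enqueue(record)
        if (isAvailable(out)) push(out, buffer.dequeue())
      }
      // A real stage would hand `newRecord.invoke _` to the record processor
      // factory in preStart(), instead of a stage actor reference.

      override def onPull(): Unit =
        if (buffer.nonEmpty) push(out, buffer.dequeue())

      setHandler(out, this)
    }
}
```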

Contributor Author:

Please check the async-callback logic now; it's a bit less fair than before, as I turned some of the intra-stage calls synchronous. I expect this to give higher performance.

This can still be changed to be fully asynchronous to maximize fairness.

```scala
 */
def apply(
    schedulerBuilder: ShardRecordProcessorFactory => Scheduler,
    settings: KinesisSchedulerSourceSettings = KinesisSchedulerSourceSettings.defaultInstance
```
Member:

Default parameters hinder binary-compatible evolution. Passing the defaults explicitly is better.
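To illustrate the concern with a generic sketch (all names hypothetical): a Scala default argument compiles to a synthetic `method$default$n` member that compiled callers link against, so explicit overloads evolve more safely.

```scala
object OverloadSketch {
  final case class Settings(bufferSize: Int)
  val defaults: Settings = Settings(1000)

  // Instead of: def source(settings: Settings = defaults): String
  // A default argument would compile to a synthetic `source$default$1`
  // method; explicit overloads keep both entry points binary-stable.
  def source(): String = source(defaults)
  def source(settings: Settings): String =
    s"source with buffer=${settings.bufferSize}"
}
```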

Contributor Author:

Are you sure we want to remove default parameters? They are all over the Kinesis API...

Member:

Yes, please remove them on these new methods.

Comment on lines 74 to 79
```scala
Flow[CommittableRecord]
  .map {
    case record if record.canBeCheckpointed =>
      record
        .tryToCheckpoint()
        .recover({
          case _: ShutdownException => Done
        })
        .get
    case _ => Done
  }
  .addAttributes(Attributes(ActorAttributes.IODispatcher))
```
Member:

If you move this to a ShardProcessor object, the semaphore and the IO dispatcher show together and are easier to understand.


```scala
split
  .out(0)
  .map(_.max)
```
Member:

Might be good to point to the ordering defined in CommittableRecord.

```scala
  }
  .addAttributes(Attributes(ActorAttributes.IODispatcher))
) ~> join.in0
split.out(1) ~> join.in1
```
Member:

If you need to drop down into the DSL, use it all the way.

```scala
split.out(0) ~> checkpoint ~> join.in0
split.out(1) ~> join.in1

join.out ~> flatten ~> result
```

```scala
Flow[CommittableRecord]
  .groupBy(MAX_KINESIS_SHARDS, _.processorData.shardId)
  .groupedWithin(settings.maxBatchSize, settings.maxBatchWait)
  .via(GraphDSL.create() { implicit b =>
```
Member:

Put the DSL in a val so you can give it a descriptive name.
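A sketch of that suggestion (shapes and element types are placeholders, not the PR's exact checkpoint graph):

```scala
import akka.NotUsed
import akka.stream.FlowShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Zip}

// Name the graph by assigning the GraphDSL block to a val, then use it
// with .via(...). Seq[Int] stands in for batches of CommittableRecord.
val checkpointAndJoin: Flow[Seq[Int], (Int, Seq[Int]), NotUsed] =
  Flow.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._

    val split = b.add(Broadcast[Seq[Int]](2))
    // Stand-in for the checkpoint flow: commit the greatest record of each
    // batch (cf. the ordering defined on CommittableRecord).
    val checkpoint = b.add(Flow[Seq[Int]].map(_.max))
    val join = b.add(Zip[Int, Seq[Int]]())

    split.out(0) ~> checkpoint ~> join.in0
    split.out(1) ~> join.in1

    FlowShape(split.in, join.out)
  })
```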

@aserrallerios (Contributor, Author)

The checkpoint coordination is tricky. Let me explain.

The issue is that when the Scheduler notifies the ShardProcessor that the shard has ended, you must checkpoint before returning control to the caller. That has to happen synchronously, of course.

```java
/**
 * The checkpointer used to record that the record processor has completed the shard.
 *
 * The record processor <b>must</b> call {@link RecordProcessorCheckpointer#checkpoint()} before returning from
 * {@link ShardRecordProcessor#shardEnded(ShardEndedInput)}. Failing to do so will trigger the Scheduler to retry
 * shutdown until a successful checkpoint occurs.
 */
```

As we checkpoint asynchronously, this needs to be coordinated by other means: we lock the shardEnded method until all "regular" records have been checkpointed through the stream mechanism.

I'll put a comment on the semaphore, but the shardEnded method must block; otherwise we'd be "closing" the shard before the "regular" records are consumed.

An alternative implementation would be to give the final shardEnded checkpoint a grace period, but that could still leave records unconsumed.

I'm open to suggestions here.
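A simplified sketch of this coordination (names illustrative, not the merged code): the permit is taken when the last record of an ending shard is emitted and only released once the stream has checkpointed it, so shardEnded cannot perform the final shard-end checkpoint too early.

```scala
import java.util.concurrent.Semaphore

// One instance per ShardRecordProcessor.
final class ShardEndCoordination {
  private val semaphore = new Semaphore(1)

  // Called by the processor when it emits the last record of an ending shard.
  def lastRecordEmitted(): Unit = semaphore.acquire()

  // Called by the checkpoint flow after committing that last record.
  def lastRecordCheckpointed(): Unit = semaphore.release()

  // Invoked by the KCL Scheduler; must checkpoint before returning.
  def shardEnded(checkpointShard: () => Unit): Unit = {
    semaphore.acquire() // blocks until the last record has been committed
    try checkpointShard()
    finally semaphore.release()
  }
}
```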

@aserrallerios (Contributor, Author)

Made all the checkpoint logic a bit more explicit. Added comments and documentation to methods.

Only the kinesis.md part is missing.

@ennru (Member) left a comment

Thank you for adding more context about the required blocking.

```scala
 */
def apply(
    schedulerBuilder: ShardRecordProcessorFactory => Scheduler,
    settings: KinesisSchedulerSourceSettings = KinesisSchedulerSourceSettings.defaultInstance
```
Member:

Yes, please remove them on these new methods.

```scala
/**
 * Java API
 */
def create(maxBatchSize: Int, maxBatchWait: FiniteDuration): KinesisSchedulerCheckpointSettings =
```
@ennru (Member), Jan 13, 2020:

Same as below.

Comment on lines 35 to 45
```scala
def apply(bufferSize: Int, backpressureTimeout: java.time.Duration): KinesisSchedulerSourceSettings =
  KinesisSchedulerSourceSettings(bufferSize, FiniteDuration.apply(backpressureTimeout.toMillis, MILLISECONDS))

def apply: KinesisSchedulerSourceSettings = KinesisSchedulerSourceSettings(1000, 1.minute)

val defaultInstance: KinesisSchedulerSourceSettings = KinesisSchedulerSourceSettings.apply

/**
 * Java API
 */
def create(bufferSize: Int, backpressureTimeout: FiniteDuration): KinesisSchedulerSourceSettings =
  new KinesisSchedulerSourceSettings(bufferSize, backpressureTimeout)
```
Member:

There's no reason for java.time.Duration in apply, only in the Java API.
You should create the instance in val defaultInstance and return that from apply.
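A sketch of the suggested companion-object shape (simplified class name; a stand-in, not the PR's code): build the default once in a val, return it from the parameterless factories, and accept java.time.Duration only in the Java API.

```scala
import scala.concurrent.duration._

final class SettingsSketch(val bufferSize: Int, val backpressureTimeout: FiniteDuration)

object SettingsSketch {
  // Build the default once and return it from the parameterless factories.
  val defaultInstance: SettingsSketch = new SettingsSketch(1000, 1.minute)

  def apply(): SettingsSketch = defaultInstance
  def apply(bufferSize: Int, backpressureTimeout: FiniteDuration): SettingsSketch =
    new SettingsSketch(bufferSize, backpressureTimeout)

  /** Java API: java.time.Duration belongs here, not in apply. */
  def create(): SettingsSketch = defaultInstance
  def create(bufferSize: Int, backpressureTimeout: java.time.Duration): SettingsSketch =
    new SettingsSketch(bufferSize, backpressureTimeout.toMillis.millis)
}
```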

Comment on lines 34 to 37
```scala
def create(
    schedulerBuilder: SchedulerBuilder
): Source[CommittableRecord, CompletionStage[Scheduler]] =
  create(schedulerBuilder, KinesisSchedulerSourceSettings.defaultInstance)
```
Member:

Yes, this is the same as the default arguments in the Scala DSL. I think it is better to make passing the default instance look as nice as possible (maybe call it just defaults?).

@ennru (Member) commented Jan 20, 2020

@aserrallerios Any chance to push this the last bit?

@aserrallerios (Contributor, Author)

Yes, let me see if I can get through it today.

@aserrallerios (Contributor, Author)

@ennru can you please remind me of the suggestions regarding the kinesis.md file? I'm not able to find them.

```scala
) ++ Seq(
  "software.amazon.awssdk" % "kinesis" % AwsSdk2Version, // ApacheV2
  "software.amazon.awssdk" % "firehose" % AwsSdk2Version, // ApacheV2
  "software.amazon.awssdk" % "dynamodb" % AwsSdk2Version, // ApacheV2
```
Member:

DynamoDB is only used in tests, right?

Contributor Author:

It's used by the KCL library; it's a transitive dependency. I should probably remove it.

Member:

Ah, OK.
Might be better to pull it transitively, so please remove it.

@ennru (Member) commented Jan 20, 2020

I believe the docs are OK.
We switched to ScalaTest 3.1.0, which renames a few things. It would be great if you could rebase and fix those, as sketched below. (But if you are too short on time, I can do that.)
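For reference, the typical ScalaTest 3.1.0 renames: the test styles moved into dedicated packages and gained an "Any" prefix (the spec class here is a hypothetical example).

```scala
// before 3.1.0:
//   org.scalatest.WordSpec  => org.scalatest.wordspec.AnyWordSpec
//   org.scalatest.FlatSpec  => org.scalatest.flatspec.AnyFlatSpec
//   org.scalatest.Matchers  => org.scalatest.matchers.should.Matchers
import org.scalatest.matchers.should.Matchers
import org.scalatest.wordspec.AnyWordSpec

class ExampleSpec extends AnyWordSpec with Matchers {
  "an example" should {
    "compile against ScalaTest 3.1" in {
      1 + 1 shouldBe 2
    }
  }
}
```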

@aserrallerios (Contributor, Author)

Rebased and squashed (I can push the old history if needed).

@ennru added the p:new label on Jan 20, 2020
@ennru (Member) left a comment

LGTM.

@ennru changed the title from "AWS Kinesis KCL streams" to "AWS Kinesis KCL streams support" on Jan 20, 2020
@ennru merged commit 86be456 into akka:master on Jan 20, 2020
@ennru (Member) commented Jan 20, 2020

A long time coming! Thank you for your work on this.

@seglo (Member) commented Mar 24, 2020

Hi @aserrallerios. We're observing a transient failure in a Kinesis test case; you can find runtime details in #2219. The test in question uses a lot of Thread.sleeps and asynchronous code in Futures that are never awaited. When I run it locally it fails ~50% of the time. Can you take a look at this test? I spent some time troubleshooting it this afternoon but could not determine where the additional checkpointing was being called from (it must be from the AWS libs).

@aserrallerios (Contributor, Author)

Will have a look later today. Thanks!

@seglo (Member) commented Mar 25, 2020

Thanks a lot @aserrallerios!

Labels: dependency-change, documentation, p:kinesis, p:new