Support for batching messages into a single kinesis record #24
Comments
Hello, thanks for your idea. Currently we are using the PutRecords API, which lets us put up to 500 records in a single API call without batching records client-side. As you mentioned, we could batch records client-side up to 1 MB, which might allow more than 500 records per call. However, this would introduce complexity on the consumer side: consumers would have to decompose batched records before processing them. So at this point I don't want to do this. What do you think?
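The 500-record PutRecords ceiling mentioned above maps naturally onto simple chunking before each API call, without any client-side batching of payloads. A minimal sketch (helper and variable names are illustrative, not the plugin's actual code):

```ruby
# The Kinesis PutRecords API accepts at most 500 records per call,
# so pending records are split into call-sized slices.
MAX_RECORDS_PER_CALL = 500

def build_batches(records)
  records.each_slice(MAX_RECORDS_PER_CALL).to_a
end

# 1200 pending log records would become 3 PutRecords calls: 500 + 500 + 200.
pending = Array.new(1200) { |i| { data: "log line #{i}", partition_key: i.to_s } }
batches = build_batches(pending)
puts batches.map(&:size).inspect
```

Each slice would then be handed to one PutRecords call; this keeps every record individually addressable for consumers, which is the property client-side batching would give up.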
@imaifactory Thanks for the response, and nice work so far. I was thinking that a format like this would optimise shard throughput, given the Kinesis shard limits: a 1,000 records/sec PutRecord limit per shard, and 1 MB/sec. But on second thought, if a single log message is ~1 KB, then you're going to hit the MB/sec limit regardless of the PutRecord limit. So, thinking about it, batching (reducing PutRecords calls) would only make sense if you were also able to reduce the size of the messages (optimising the MB/sec rate), for example by enabling GZIP compression (UTF-8) on the batched data. Of course you are right that the consumer would need to parse the record per the scheme in which it was put. I believe such a feature would have to be enabled by a configuration option on fluent-plugin-kinesis. If the user enables it, then naturally they would be responsible for updating their consumer code to parse the batched records in the new format; to me this is no problem. Thanks
Hmm, makes sense, thanks. Regarding compression: I think it would indeed reduce payload size and network throughput; however, it's difficult to tune the batched record size against the shard limitation (1 MB/sec). For example, whether a batch of 1,500 records of 1 KB each fits under 1 MB depends on its contents. It would still reduce your network throughput, of course. Do you still want a compression feature for this?
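The content-dependence pointed out above can be sketched with Ruby's stdlib Zlib: the same number of messages compresses to very different sizes depending on how repetitive they are, so a fixed message count cannot guarantee the gzipped batch stays under the 1 MB record limit. A rough illustration (method names are hypothetical, not the plugin's API):

```ruby
require 'zlib'
require 'json'

ONE_MB = 1_048_576  # Kinesis per-record payload limit, in bytes

# Serialize each message as JSON, join as newline-delimited lines,
# and gzip the whole batch into one Kinesis record payload.
def gzip_batch(messages)
  Zlib.gzip(messages.map(&:to_json).join("\n"))
end

# Highly repetitive logs compress far better than varied payloads, so the
# compressed size of 1,500 messages depends entirely on their contents.
repetitive = Array.new(1500) { { level: 'INFO', msg: 'healthcheck ok' } }
blob = gzip_batch(repetitive)
puts "#{blob.bytesize} bytes compressed, fits under 1 MB: #{blob.bytesize < ONE_MB}"
```

A consumer would reverse this with `Zlib.gunzip` and split on newlines before parsing each line, which is exactly the extra decomposition step the maintainer flags as added complexity.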
@imaifactory Yep, this feature would be very good for me. I think it makes sense to separate those into two configuration options: the compression type and the compression encoding. So finally, the relevant config might read something like:
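A hypothetical sketch of such options in fluentd config syntax (all option names here are illustrative only, not actual fluent-plugin-kinesis parameters):

```
<match my_app.**>
  type kinesis
  stream_name my-stream
  # Hypothetical batching/compression options -- names are illustrative
  batch_enabled true        # turn on client-side batching
  batch_max_messages 500    # flush after this many messages
  batch_timeout 5s          # minimum send timeout
  compression_type gzip     # compress the batched payload
  compression_encoding utf-8
</match>
```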
@cj74 Sorry for the long lag. Thank you for your suggestion about the options. OK, it makes sense. I will put this in my queue.
Hey @imaifactory, hope you are well. Any news on how this feature is going? Thanks
Sorry for keeping you waiting. Unfortunately I don't have an update on this. If you are in a hurry, is it possible for you to send a PR? Also, though I should announce this in a separate thread, I want to tell you that primary ownership of this project is being handed over. I will open an issue to announce it when it is settled. Thank you.
Hey @imaifactory Actually, now that Amazon has released Kinesis Firehose, support for that would trump my requirement for the gzip/batching features, because all I am doing is pushing to S3 (maybe other people would still find gzip/batching useful, though). Kinesis Firehose should also be easier to implement in the project.
Based on your use case, I would opt for Kinesis Firehose: with it, you don't have to deploy consumer applications or worry about whether they are operating normally. So, if Firehose's configuration options for compression and chunking satisfy you, why not use Firehose?
@imaifactory now we just need Firehose support in aws-fluent-plugin-kinesis, then!
Use this! The plugin below was developed by @winebarrel, who is also a great contributor to this plugin!
I added gzip support to fluent-plugin-kinesis. Here is a pull request for it: #39
I'm going to close this.
Hi
It would be cool to be able to batch a number of log messages into a single Kinesis put record in some way, since Kinesis supports up to a 1 MB payload and a single log message can easily be under 1 KB. It seems wasteful on high-throughput systems to use one Kinesis record per log message.
If you create such a feature, it could be provided via a config option, limited by number of log messages and/or total size, with a minimum send timeout.
Formatting of the data blob would be an issue here (e.g. multiple JSON records per message). Perhaps there could be a config option for this too: whether the batch goes into a larger JSON object (records unordered), a JSON array, or simply the same per-message format but newline-separated.
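The blob-format alternatives suggested above could look like this; a hypothetical sketch comparing a JSON array against newline-delimited records (sample data is invented for illustration):

```ruby
require 'json'

messages = [
  { 'time' => '2015-01-01T00:00:00Z', 'msg' => 'start' },
  { 'time' => '2015-01-01T00:00:01Z', 'msg' => 'stop' }
]

# Option A: the whole batch as one JSON array per Kinesis record.
json_array = messages.to_json

# Option B: newline-delimited JSON -- each line keeps the original
# per-message format, so consumers can reuse their existing line parser.
ndjson = messages.map(&:to_json).join("\n")

# A consumer must parse according to whichever scheme was configured:
from_array  = JSON.parse(json_array)
from_ndjson = ndjson.split("\n").map { |line| JSON.parse(line) }
# both recover the same list of messages
```

The newline-separated form is the least disruptive for existing consumers, since each line is still an ordinary single-message record.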
Thanks