Cannot stream rows larger than ~1MB #36

Open
bchazalet opened this issue Dec 16, 2015 · 11 comments

@bchazalet
Contributor

I am getting this error from bottledwater:

./kafka/bottledwater: While reading snapshot: PGRES_FATAL_ERROR: ERROR:  bottledwater_export: Avro conversion failed: Cannot write 4145327 bytes in memory buffer

There might very well be large blobs (in a bytea column) in the table it's processing before it fails: is there an implicit limit on the size of a row due to some Avro limitation?

@bchazalet changed the title from "is there a limit on a row/fiels's value size?" to "is there a limit on a row/fields' value size?" on Dec 16, 2015
@bchazalet
Contributor Author

I've found a

#define MAX_BUFFER_LENGTH 1048576

in io_util.c. My problem might be related to that.
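
For illustration, here is a rough sketch of how encoding into a fixed-size avro-c memory buffer fails once a row needs more than MAX_BUFFER_LENGTH bytes. The helper below is a simplified stand-in, not the actual io_util.c code:

    #include <avro.h>
    #include <stddef.h>

    #define MAX_BUFFER_LENGTH 1048576

    /* Simplified stand-in for the extension's row encoding: write an Avro value
     * into a fixed-size memory buffer. If the encoded row needs more than
     * MAX_BUFFER_LENGTH bytes, avro_value_write fails, which surfaces as an
     * error like "Cannot write 4145327 bytes in memory buffer". */
    static int encode_row(avro_value_t *row, char *buf, size_t *out_len) {
        avro_writer_t writer = avro_writer_memory(buf, MAX_BUFFER_LENGTH);
        int err = avro_value_write(writer, row);  /* fails for rows over ~1MB */
        if (err == 0) {
            *out_len = (size_t) avro_writer_tell(writer);
        }
        avro_writer_free(writer);
        return err;
    }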

@ept
Contributor

ept commented Dec 30, 2015

Yeah, seems likely that you're hitting that limit. I put it there to avoid accidentally allocating unreasonably large amounts of memory in the case of a bug. Can you try increasing the limit?

I guess we might be able to bump it up — allocating a few megabytes probably isn't going to hurt anyone these days — but we should have some sort of limit. It's not ideal to load a large blob entirely into memory, but I'm not sure the APIs allow streaming large blobs incrementally.
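
For anyone wanting to try that workaround, the change is just the one constant in io_util.c; the new value below is only an example, and keeping some limit still guards against runaway allocations:

    /* io_util.c: raise the cap, e.g. to 16MB, so larger rows can be encoded. */
    #define MAX_BUFFER_LENGTH (16 * 1024 * 1024)   /* was 1048576 (1MB) */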

@msakrejda

This is also potentially an issue with Kafka itself, right? If the message exceeds Kafka's message.max.bytes, there's not much bottledwater can do here...

/cc @samstokes
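
For reference, even with the extension's buffer limit raised, the client's librdkafka producer rejects messages above its own message.max.bytes (default 1000000 bytes), and the broker enforces its message.max.bytes too. A sketch of raising the producer-side limit; the configuration key is librdkafka's, but the helper function itself is illustrative:

    #include <stdio.h>
    #include <librdkafka/rdkafka.h>

    /* Illustrative helper: build a producer config that allows larger messages.
     * The broker's own message.max.bytes must be raised to match, or it will
     * still reject oversized messages. */
    static rd_kafka_conf_t *make_producer_conf(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        if (rd_kafka_conf_set(conf, "message.max.bytes", "10000000",
                              errstr, sizeof(errstr)) != RD_KAFKA_CONF_OK) {
            fprintf(stderr, "config error: %s\n", errstr);
            rd_kafka_conf_destroy(conf);
            return NULL;
        }
        return conf;
    }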

@samstokes
Contributor

Documenting current behaviour:

  • if the length in bytes of an encoded message (depending on the --output-format) exceeds Kafka's message.max.bytes, the broker will refuse the message. If the Bottled Water client is running with --on-error=log, it will log the error but continue running, dropping the offending row and acknowledging the corresponding WAL as flushed. If the client is running with --on-error=exit, it will stop running without consuming the corresponding WAL.
  • if the length in bytes of an individual value (e.g. a TEXT or BYTEA) exceeds MAX_BUFFER_LENGTH in io_util.c, the extension will refuse to send the offending row, and will terminate the replication stream, regardless of the client's --on-error setting.

I'm pretty tempted to just truncate large values in the extension, since they're almost certainly going to hit problems at every step of whatever data pipeline they're flowing into (broker message.max.bytes, consumer fetch.message.max.bytes, whatever system they're flowing into next), and somebody along the line will have to decide what to do with arbitrarily large messages.

@samstokes
Contributor

I'm pretty tempted to just truncate large values in the extension

Unfortunately I wasn't thinking clearly when I wrote this. Large values could be strings or byte arrays, but they could also be records or other not-cleanly-truncatable things (especially since a large string is probably occurring within a record). They're written as Avro, and truncating them would produce binary garbage, so the Avro API doesn't even allow it.

Our choices are to drop the value or abort.

@mcapitanio
Contributor

I agree @samstokes, we don't have many choices.

One option I could add is to manage a sort of "dead letter topic", in which the extension could serialize (in Avro format) the messages that don't meet the max length requirement (for example, a folder on the local filesystem for each table/topic, or a single folder with a namespace for the serialized messages).

That way the choice to abort would not be so destructive: I could recover the dropped messages from the dead letter topic and, based on each message's key, decide how to handle it.

Some drawbacks to evaluate are the possible impact on extension performance and the risk of saturating local storage.
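
A minimal sketch of what that dead-letter idea could look like, assuming a local directory per table; every name here (dead_letter_write, DEAD_LETTER_DIR) is hypothetical, not existing extension code:

    #include <stdio.h>
    #include <time.h>

    #define DEAD_LETTER_DIR "/var/lib/bottledwater/dead-letter"   /* hypothetical */

    /* Hypothetical: dump an encoded row that exceeded the size limit to a
     * per-table file so it can be inspected or replayed later instead of
     * being silently dropped. Returns 0 on success. */
    static int dead_letter_write(const char *table, const void *avro_bytes, size_t len) {
        char path[1024];
        snprintf(path, sizeof(path), "%s/%s-%ld.avro",
                 DEAD_LETTER_DIR, table, (long) time(NULL));

        FILE *f = fopen(path, "wb");
        if (f == NULL) return -1;

        size_t written = fwrite(avro_bytes, 1, len, f);
        fclose(f);
        return written == len ? 0 : -1;
    }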

@msakrejda

Another possible solution (long-term; I definitely don't think this is worth doing short-term) is to break up the value into separate smaller messages that fit under the limit. This is pretty ugly and may not be worth doing, but I think it's an option, no?

@samstokes
Contributor

@uhoh-itsmaciek we could if we changed the framing protocol to have a "packet switched" semantics. Currently each frame contains an entire row, encoded as an Avro record. (Frames themselves are also Avro records, so the row record is stored as a byte array in a field of the frame record.) We could instead split up the record Avro into multiple frames, with the initial frame having a "length" field saying how many frames to expect, and have the client recombine the parts. (We might also need sequence numbers, and a checksum... :)) Is that the sort of thing you had in mind?

Unfortunately, that doesn't get around the reason the code has a maximum size limit in the first place, which is to avoid allocating large memory buffers in the extension. If you have a table with a single BLOB column which is used to store 200MB Docker images, you're going to need a 200MB buffer to write the Avro value representing each row. I'm not sure if there's a C API for generating Avro in a streaming basis.
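
To make the multi-frame idea above concrete, a fragment schema might look something like the following. This is not the actual Bottled Water frame format, just a hypothetical sketch of the "packet switched" approach:

    #include <avro.h>

    /* Hypothetical fragment record: each Kafka message carries one slice of a
     * large row, plus enough metadata for the client to reassemble it. */
    static avro_schema_t make_fragment_schema(void) {
        avro_schema_t schema = NULL;
        avro_schema_error_t error;
        const char *json =
            "{\"type\": \"record\", \"name\": \"frame_fragment\", \"fields\": ["
            "  {\"name\": \"row_id\",   \"type\": \"long\"},"   /* identifies the original row */
            "  {\"name\": \"seq\",      \"type\": \"int\"},"    /* fragment index */
            "  {\"name\": \"total\",    \"type\": \"int\"},"    /* how many fragments to expect */
            "  {\"name\": \"checksum\", \"type\": \"long\"},"   /* checksum of the reassembled row */
            "  {\"name\": \"payload\",  \"type\": \"bytes\"}"   /* this fragment's slice of the encoded row */
            "]}";
        if (avro_schema_from_json(json, 0, &schema, &error) != 0) {
            return NULL;
        }
        return schema;
    }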

@msakrejda

Yeah, that's exactly what I had in mind (N.B.: I know almost nothing about Avro or bottledwater =D ). The memory usage is certainly a concern, too, but that'd be a technical concern we could theoretically overcome (rather than a fundamental limitation of the design).

I think this is largely moot, though, and there's more value in working with the more common smaller messages.

@samstokes
Contributor

I've submitted PR #115 which offers a mitigation: if running the client with the --on-error=log policy, it will also configure the initial snapshot and replication output plugin to run with a similar error policy, so that if it cannot encode a row (e.g. due to hitting the size limit as discussed here), it will skip the offending row and continue on, instead of terminating the snapshot or replication stream.

N.B. once again this is a tradeoff between availability and consistency. Skipping the value means that the Kafka stream will be inconsistent with Postgres.

@mcapitanio your "dead letter" idea makes sense, although writing the entire value to the dead letter area seems a bit redundant given it's already stored in Postgres. Maybe it would make sense to write the primary key to the dead letter area, so that you could look up the full row in Postgres to diagnose. It's a tricky area though since often the primary key will be personally identifiable information (e.g. if you're using a username or email address as a primary key, or even a sequential id if your site exposes its database ids to end users), so we couldn't implement this simply by writing the primary key to the log.

(Also if the primary key by itself is larger than the limit this wouldn't help. On the other hand if you have primary keys larger than 1MB you probably have bigger problems :))

@samstokes changed the title from "is there a limit on a row/fields' value size?" to "Cannot stream rows larger than ~1MB" on Aug 26, 2016
@maparent

maparent commented Mar 6, 2017

Not sure if this was considered: in some cases, it would make sense to exclude certain columns in the configuration, or to allow replacing them with something else. Many blobs are immutable in practice, and it makes more sense to store a reference to another form of storage in the event stream than the blob itself.
