Remove extraneous whitespace and output writes #2

jarreds · 2015-08-17T20:48:36Z

Previously, wal2json would write to the output plugin multiple times
prior to the transaction committing. When outputting this way the
logical decoding replication consumer would have to maintain state to
build valid JSON messages eg accumulate multiple messages in
memory. This is difficult and error prone. This change allows wal2json
to emit full JSON messages with each output plugin write.

The plugin would also output tab/whitespace. Although pretty printing of
JSON is nice, there are many tools out there to take the compact format
and pretty print it. From the perspective of wal2json output, it's just
wasted space.

Previously, wal2json would write to the output plugin multiple times prior to the transaction committing. When outputting this way the logical decoding replication consumer would have to maintain state to build valid JSON messages eg accumulate multiple messages in memory. This is difficult and error prone. This change allows wal2json to emit full JSON messages with each output plugin write. The plugin would also output tab/whitespace. Although pretty printing of JSON is nice, there are many tools out there to take the compact format and pretty print it. From the perspective of wal2json output, it's just wasted space.

eulerto · 2015-08-19T00:02:12Z

@jarreds if we choose to build JSON message piece by piece, the plugin have to maintain the string tossed by wal2json and we write at least 3 times. The drawback is that we write a lot (small pieces) in long transactions. The other approach is to build JSON message in a once. All of the message pieces are maintained onto the plugin memory until the transaction end and we write once. The drawbacks are that it can bloat server memory if we have a long transaction (say, UPDATE foo SET a = a+1) and it sends a lot of data at the same time. Which is worse? @ethansf Can latter approach break your pg-migrator?

The pretty print part, I agree that it can be an option.

PS> separate commits for different patches please. It is easier to review/apply.

jarreds · 2015-08-19T01:51:26Z

That's a really good point regarding bulk updates/deletes. It will certainly create a massive copybuf message. I need to think about that point a little more.

One the one hand, I really like the entire transaction in a single JSON message. On the the other, there are certainly problems like the one you outlined regarding large updates/deletes/inserts in a transaction. The other JSON decoding plugins I've used create a standalone JSON message for each db operation. That's another option to look at here.

In the meantime, this is how we've using you're plugin and it's been great. Thank you so much for writing it.

eulerto · 2015-08-19T02:10:45Z

@jarreds I wouldn't like to add an option for write-at-once operation because it changes the way receiver works. All of the logical decoding plugins that I inspected until now, use the piece-by-piece-write approach (same as wal2json). Although I haven't written a consumer yet, I think a simple state machine can solve the last-json-message problem. Also, we don't bloat the server memory or have network spikes (at the end of the transaction).

jarreds · 2015-08-19T02:18:06Z

In my use case the piece-by-piece writes are fine, but each write must result in a fully formed JSON message.

Right now we have 1:1 writes from the output plugin to Apache Kafka. This is where the problem lies. Each Kafka message read should result in a valid JSON message from wal2json. For example, if it takes wal2json five writes to create valid JSON then we have to buffer five Kafka messages in memory to create the JSON. This does not work for our use case.

Feel free to close this PR. I just wanted to share that we've been using this. I will probably be modifying it again to write piece-by-piece again, but with full and valid JSON output for each write. I'll open a PR with that at some point here to see if you're interested.

eulerto · 2015-08-19T03:58:13Z

@jarreds I thought more about your idea. I think it would be a good idea to support both approaches (at least for SQL interface POV). That's because you can choose the convenient approach for each fetch depending on your transactions (if you are using SQL interface i.e. pg_logical_slot_*_changes functions). Unfortunately, pg_recvlogical does not have such flexibility (you can only provide parameters at startup).
Idea is a parameter (say, write_once whose default is false for backward compatibility) that postpone writes. Is it work for you?

ethanfrey · 2015-08-22T22:25:37Z

@eulerto The plugin I wrote takes the output of pg_recvlogical as a stream and buffers it until it has a complete json item to parse. So, for me as long as the stream contains chunks of json when concatenated, either apporach works fine. I like your idea of allowing both as an option.

@jarreds The idea to buffer the whole json in one sql row, while a bit of overhead, is quite nice. One query I wish I could make would be to view the transactions in reverse order (from most recent to oldest). I would be happy to ORDER BY lxn DESC, but then it is no longer valid json. I don't know Apache Kafka, but couldn't you buffer it with a middleware that just returns each complete json chunk? I use a nice python library (https://pypi.python.org/pypi/ijson) that can do this in about 3 lines.

eulerto · 2015-08-26T19:44:50Z

@jarreds @ethanfrey I committed two patches that did what we discussed here.

(i) a pretty-print option to remove whitespaces;
(ii) a write-in-chunks option to allow write at the end of transaction.

Please test if it work for you.

jarreds force-pushed the extraneous-output branch from d1d2b75 to 7038030 Compare August 17, 2015 20:52

jarreds force-pushed the extraneous-output branch from 7038030 to 96f6278 Compare August 18, 2015 21:41

jarreds closed this Aug 19, 2015

jarreds deleted the extraneous-output branch August 19, 2015 16:36

jpfuentes2 mentioned this pull request May 22, 2018

Add an optional new output format #65

Closed

c-nichols mentioned this pull request Aug 9, 2018

"no known snapshots" error when db rows are large #74

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove extraneous whitespace and output writes #2

Remove extraneous whitespace and output writes #2

jarreds commented Aug 17, 2015

eulerto commented Aug 19, 2015

jarreds commented Aug 19, 2015

eulerto commented Aug 19, 2015

jarreds commented Aug 19, 2015

eulerto commented Aug 19, 2015

ethanfrey commented Aug 22, 2015

eulerto commented Aug 26, 2015

Remove extraneous whitespace and output writes #2

Remove extraneous whitespace and output writes #2

Conversation

jarreds commented Aug 17, 2015

eulerto commented Aug 19, 2015

jarreds commented Aug 19, 2015

eulerto commented Aug 19, 2015

jarreds commented Aug 19, 2015

eulerto commented Aug 19, 2015

ethanfrey commented Aug 22, 2015

eulerto commented Aug 26, 2015