Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove extraneous whitespace and output writes #2

Closed
wants to merge 1 commit into from

Conversation

jarreds
Copy link

@jarreds jarreds commented Aug 17, 2015

Previously, wal2json would write to the output plugin multiple times
prior to the transaction committing. When outputting this way the
logical decoding replication consumer would have to maintain state to
build valid JSON messages eg accumulate multiple messages in
memory. This is difficult and error prone. This change allows wal2json
to emit full JSON messages with each output plugin write.

The plugin would also output tab/whitespace. Although pretty printing of
JSON is nice, there are many tools out there to take the compact format
and pretty print it. From the perspective of wal2json output, it's just
wasted space.

@jarreds jarreds force-pushed the extraneous-output branch from d1d2b75 to 7038030 Compare August 17, 2015 20:52
Previously, wal2json would write to the output plugin multiple times
prior to the transaction committing. When outputting this way the
logical decoding replication consumer would have to maintain state to
build valid JSON messages eg accumulate multiple messages in
memory. This is difficult and error prone. This change allows wal2json
to emit full JSON messages with each output plugin write.

The plugin would also output tab/whitespace. Although pretty printing of
JSON is nice, there are many tools out there to take the compact format
and pretty print it. From the perspective of wal2json output, it's just
wasted space.
@jarreds jarreds force-pushed the extraneous-output branch from 7038030 to 96f6278 Compare August 18, 2015 21:41
@eulerto
Copy link
Owner

eulerto commented Aug 19, 2015

@jarreds if we choose to build JSON message piece by piece, the plugin have to maintain the string tossed by wal2json and we write at least 3 times. The drawback is that we write a lot (small pieces) in long transactions. The other approach is to build JSON message in a once. All of the message pieces are maintained onto the plugin memory until the transaction end and we write once. The drawbacks are that it can bloat server memory if we have a long transaction (say, UPDATE foo SET a = a+1) and it sends a lot of data at the same time. Which is worse? @ethansf Can latter approach break your pg-migrator?

The pretty print part, I agree that it can be an option.

PS> separate commits for different patches please. It is easier to review/apply.

@jarreds
Copy link
Author

jarreds commented Aug 19, 2015

That's a really good point regarding bulk updates/deletes. It will certainly create a massive copybuf message. I need to think about that point a little more.

One the one hand, I really like the entire transaction in a single JSON message. On the the other, there are certainly problems like the one you outlined regarding large updates/deletes/inserts in a transaction. The other JSON decoding plugins I've used create a standalone JSON message for each db operation. That's another option to look at here.

In the meantime, this is how we've using you're plugin and it's been great. Thank you so much for writing it.

@eulerto
Copy link
Owner

eulerto commented Aug 19, 2015

@jarreds I wouldn't like to add an option for write-at-once operation because it changes the way receiver works. All of the logical decoding plugins that I inspected until now, use the piece-by-piece-write approach (same as wal2json). Although I haven't written a consumer yet, I think a simple state machine can solve the last-json-message problem. Also, we don't bloat the server memory or have network spikes (at the end of the transaction).

@jarreds
Copy link
Author

jarreds commented Aug 19, 2015

In my use case the piece-by-piece writes are fine, but each write must result in a fully formed JSON message.

Right now we have 1:1 writes from the output plugin to Apache Kafka. This is where the problem lies. Each Kafka message read should result in a valid JSON message from wal2json. For example, if it takes wal2json five writes to create valid JSON then we have to buffer five Kafka messages in memory to create the JSON. This does not work for our use case.

Feel free to close this PR. I just wanted to share that we've been using this. I will probably be modifying it again to write piece-by-piece again, but with full and valid JSON output for each write. I'll open a PR with that at some point here to see if you're interested.

@jarreds jarreds closed this Aug 19, 2015
@eulerto
Copy link
Owner

eulerto commented Aug 19, 2015

@jarreds I thought more about your idea. I think it would be a good idea to support both approaches (at least for SQL interface POV). That's because you can choose the convenient approach for each fetch depending on your transactions (if you are using SQL interface i.e. pg_logical_slot_*_changes functions). Unfortunately, pg_recvlogical does not have such flexibility (you can only provide parameters at startup).
Idea is a parameter (say, write_once whose default is false for backward compatibility) that postpone writes. Is it work for you?

@jarreds jarreds deleted the extraneous-output branch August 19, 2015 16:36
@ethanfrey
Copy link

@eulerto The plugin I wrote takes the output of pg_recvlogical as a stream and buffers it until it has a complete json item to parse. So, for me as long as the stream contains chunks of json when concatenated, either apporach works fine. I like your idea of allowing both as an option.

@jarreds The idea to buffer the whole json in one sql row, while a bit of overhead, is quite nice. One query I wish I could make would be to view the transactions in reverse order (from most recent to oldest). I would be happy to ORDER BY lxn DESC, but then it is no longer valid json. I don't know Apache Kafka, but couldn't you buffer it with a middleware that just returns each complete json chunk? I use a nice python library (https://pypi.python.org/pypi/ijson) that can do this in about 3 lines.

@eulerto
Copy link
Owner

eulerto commented Aug 26, 2015

@jarreds @ethanfrey I committed two patches that did what we discussed here.

(i) a pretty-print option to remove whitespaces;
(ii) a write-in-chunks option to allow write at the end of transaction.

Please test if it work for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants