Remove extraneous whitespace and output writes #2
Conversation
Force-pushed from d1d2b75 to 7038030
Previously, wal2json would write to the output plugin multiple times prior to the transaction committing. With that approach, the logical decoding replication consumer has to maintain state to build valid JSON messages, e.g. by accumulating multiple messages in memory. This is difficult and error prone. This change allows wal2json to emit full JSON messages with each output plugin write. The plugin also emitted tabs and other whitespace. Although pretty printing of JSON is nice, there are many tools out there that take the compact format and pretty print it. From the perspective of wal2json output, it's just wasted space.
Force-pushed from 7038030 to 96f6278
@jarreds if we choose to build the JSON message piece by piece, the plugin has to maintain the string produced by wal2json and we write at least 3 times. The drawback is that we do a lot of small writes in long transactions. The other approach is to build the JSON message all at once. All of the message pieces are kept in plugin memory until the transaction ends and we write once. The drawbacks are that it can bloat server memory if we have a long transaction (say, UPDATE foo SET a = a+1) and it sends a lot of data at the same time. Which is worse? @ethansf Can the latter approach break your pg-migrator? On the pretty-print part, I agree that it can be an option. PS> separate commits for different patches, please. It is easier to review/apply.
That's a really good point regarding bulk updates/deletes. It will certainly create a massive copybuf message. I need to think about that a little more. On the one hand, I really like having the entire transaction in a single JSON message. On the other, there are certainly problems like the one you outlined regarding large updates/deletes/inserts in a transaction. The other JSON decoding plugins I've used create a standalone JSON message for each db operation. That's another option to look at here. In the meantime, this is how we've been using your plugin and it's been great. Thank you so much for writing it.
@jarreds I wouldn't like to add an option for the write-at-once behavior because it changes the way the receiver works. All of the logical decoding plugins I have inspected so far use the piece-by-piece-write approach (same as wal2json). Although I haven't written a consumer yet, I think a simple state machine can solve the last-JSON-message problem. Also, we don't bloat the server memory or cause network spikes at the end of the transaction.
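The "simple state machine" idea above can be sketched on the consumer side: track brace depth across the accumulated piece-by-piece writes, ignoring braces inside string literals and escaped quotes, and treat the message as complete when the depth returns to zero. This is a hypothetical helper for illustration, not code from wal2json:

```python
def json_complete(buf: str) -> bool:
    """Return True once buf holds one balanced, non-empty JSON object.

    A minimal state-machine sketch: count object braces, but skip any
    braces that appear inside string literals or after an escape.
    """
    depth = 0
    in_string = False
    escaped = False
    seen_any = False
    for ch in buf:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = in_string  # backslash only escapes inside strings
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
                seen_any = True
            elif ch == "}":
                depth -= 1
    return seen_any and depth == 0 and not in_string


# Accumulate piece-by-piece writes until the message is complete.
buf = ""
for chunk in ['{"change": [', '{"kind": "insert"}', "]}"]:
    buf += chunk
    if json_complete(buf):
        print("complete message:", buf)
        buf = ""
```

A consumer like this never needs the plugin to guarantee one-write-per-message; it only needs the concatenation of writes to eventually form valid JSON.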
In my use case the piece-by-piece writes are fine, but each write must result in a fully formed JSON message. Right now we have 1:1 writes from the output plugin to Apache Kafka. This is where the problem lies. Each Kafka message read should result in a valid JSON message from wal2json. For example, if it takes wal2json five writes to create valid JSON, then we have to buffer five Kafka messages in memory to create the JSON. This does not work for our use case. Feel free to close this PR; I just wanted to share that we've been using this. I will probably modify it again to write piece by piece, but with full and valid JSON output for each write. I'll open a PR with that at some point to see if you're interested.
@jarreds I thought more about your idea. I think it would be good to support both approaches (at least from the SQL interface POV). That's because you can choose the convenient approach for each fetch depending on your transactions (if you are using the SQL interface, i.e. the pg_logical_slot_*_changes functions). Unfortunately, pg_recvlogical does not have such flexibility (you can only provide parameters at startup).
@eulerto The plugin I wrote takes the output of pg_recvlogical as a stream and buffers it until it has a complete JSON item to parse. So, for me, as long as the stream contains chunks of JSON when concatenated, either approach works fine. I like your idea of allowing both as an option. @jarreds The idea of buffering the whole JSON in one SQL row, while it adds a bit of overhead, is quite nice. One query I wish I could make would be to view the transactions in reverse order (from most recent to oldest). I would be happy to ORDER BY lxn DESC, but then it is no longer valid JSON. I don't know Apache Kafka, but couldn't you buffer it with a middleware that just returns each complete JSON chunk? I use a nice python library (https://pypi.python.org/pypi/ijson) that can do this in about 3 lines.
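The middleware suggested above can also be done with just the standard library instead of ijson: `json.JSONDecoder.raw_decode` pulls one complete document off the front of a buffer and reports where it ended, so incomplete trailing data simply waits for the next chunk. A minimal sketch, with simulated pg_recvlogical output arriving in arbitrary pieces:

```python
import json


def iter_json_stream(chunks):
    """Yield complete JSON documents from an iterable of text chunks.

    raw_decode parses one document from the start of the buffer and
    returns (obj, end); a JSONDecodeError means the buffered data is
    not yet a complete document, so we wait for more input.
    """
    decoder = json.JSONDecoder()
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip()  # skip newlines between documents
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # incomplete document; wait for more data
            yield obj
            buf = buf[end:]


# Chunk boundaries fall mid-document, as they can with piece-by-piece writes:
pieces = ['{"xid": 7', '01, "change": []}\n{"xid":', ' 702, "change": []}\n']
print([doc["xid"] for doc in iter_json_stream(pieces)])  # → [701, 702]
```

Each complete transaction surfaces as soon as its closing brace arrives, regardless of how the writes were split, which is exactly the property the Kafka 1:1 use case needs from the stream.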
@jarreds @ethanfrey I committed two patches that do what we discussed here. (i) a pretty-print option to remove whitespaces; Please test if it works for you.