unknown message type: '\u{0}' for postgres #886
During benchmarking I also see this for the debug version of the error:
This occurs for queries returning ~8000 rows consisting of five integer columns. Looking at libpq: if the message id is invalid (such as zero) and the message length is valid, then libpq discards the message, returns an error, and sets the connection to the ready state. Whatever it is, although it's intermittent, at much higher load it's consistently repeatable.
This might be related to futures being dropped somewhere (see #563 for a MySQL-related issue it caused). We're definitely not good future citizens right now (working on fixing that in the next version).
Ok, thanks @mehcode!
This is not due to PgStream being dropped. I added some logging statements and confirmed this under load. I'll keep looking at #563 and try to understand more.
The above PR reduces the number of failed queries, but there are still issues even with the PR. Here's some sample log output for a failed query. This repeats with similar errors until the query completes with 0 rows of output, while the correct output should be thousands of rows. Based on this, I'm gonna take a wild guess that there was a previous decoding error that is causing loss of synchronization. I'm going to try to study the other parts of postgres message decoding.
I'm able to reproduce this. When a message with a null first byte occurs, the message before the error is a DataRow message. The null first byte of the header is occurring because, somehow, the actual first byte of the DataRow message is being discarded. As a result, the first byte of the payload size is being interpreted as the message format byte. In other words, instead of reading bytes 1-5 as the header, bytes 2-6 are being read. I'm not sure if the issue is in decoding the previous data row or somewhere else. I'm looking at DataRow decoding now.
The messages preceding the null message type (zero first byte of message) appear to all have the correct payload size. So at the moment it doesn't look like the issue is an incorrect size of the previous payload.
I was mistaken about the number of bytes being discarded. A whole message header, five bytes, is discarded. As a result, the succeeding payload is treated as a message header. In this instance, it's the five-byte message header of a DataRow message that is discarded, without discarding the payload.
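The desynchronization described above can be sketched against the Postgres wire format (a 1-byte message tag followed by a 4-byte big-endian length that counts itself but not the tag). This is a standalone illustration, not sqlx code; `build_datarow` is a hypothetical helper. Note that a DataRow body for a five-column row starts with the column count 0x00 0x05, so its first byte is exactly the 0x00 seen in the error:

```rust
/// Hypothetical helper: build a DataRow ('D') message as it appears on the
/// wire: 1-byte tag, 4-byte big-endian length (counts itself, not the tag),
/// then the body.
fn build_datarow(body: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(5 + body.len());
    buf.push(b'D');
    buf.extend_from_slice(&((4 + body.len()) as u32).to_be_bytes());
    buf.extend_from_slice(body);
    buf
}

fn main() {
    // Body of a five-column row: column count 0x0005, then dummy column data.
    let msg = build_datarow(&[0x00, 0x05, 0, 0, 0, 1, 0x2a]);

    // Correct framing: the header AND the (length - 4)-byte body are
    // consumed together, leaving the buffer at the next message boundary.
    let len = u32::from_be_bytes(msg[1..5].try_into().unwrap()) as usize;
    let after_msg = &msg[1 + len..];
    assert!(after_msg.is_empty());

    // Buggy framing (what this comment describes): the 5-byte header is
    // discarded but the body stays buffered, so the next "message tag" read
    // sees the first body byte.
    let bogus_tag = msg[5];
    assert_eq!(bogus_tag, 0x00); // "unknown message type: '\u{0}'"
    println!("tag after desync: {:#04x}", bogus_tag);
}
```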
@rich-murphey I really appreciate you looking into this. I need to apologize though, as it seems I wasn't clear when I was talking about the problem being related to dropping. To go into more detail: the problem is that all async functions must be re-entrant. This is a property that Rust does not enforce, but that will subtly break (like this) if it's not followed. By re-entrant I mean that no mutable state may be altered unless the async function runs to completion. Keep in mind that the outer future may be dropped at any `.await` point, so

```rust
async fn foo(&mut self) {
    self.do_thing().await?; // may never return
    self.update += 1;

    self.do_thing().await?; // may never return
    self.update += 1;
}
```

is very buggy: if the function exits early due to the future being dropped, the state on `self` is left half-updated.
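A minimal, executor-free sketch of that failure mode (not sqlx code; the no-op waker and `NeverReady` future are stand-ins for a real runtime and a blocked read): polling an async block once and then dropping it leaves the state it captured half-updated.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// A future that never completes, standing in for a read() that never returns.
struct NeverReady;
impl Future for NeverReady {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Pending
    }
}

/// A waker that does nothing; we only poll by hand.
fn noop_waker() -> Waker {
    unsafe fn rw_clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    unsafe fn rw_op(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(rw_clone, rw_op, rw_op, rw_op);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Poll an async block once, then drop it mid-await: the shared counter is
/// left at 1, not the 2 a full run would produce.
fn demo() -> u32 {
    let state = Arc::new(Mutex::new(0u32));
    let s = Arc::clone(&state);
    let mut fut = Box::pin(async move {
        *s.lock().unwrap() += 1; // first step runs...
        NeverReady.await;        // ...then we park, like a blocked read
        *s.lock().unwrap() += 1; // never reached once the future is dropped
    });

    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut); // the caller gives up: cancellation, not an error return

    let n = *state.lock().unwrap();
    n
}

fn main() {
    println!("state after cancellation: {}", demo()); // 1, not 2
}
```

This is exactly why `recv_packet` misbehaves: the header read completes, state is mutated, and then the payload read is cancelled.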
The next big release of SQLx (likely 0.6) is currently having its core rewritten from scratch, keeping this in mind while doing so. This is honestly a nasty problem and a big footgun for new and experienced async Rust developers alike. I'm not sure what a good solution here is (for long-term Rust), as it doesn't look very trivial to lint against. Sorry for the MySQL example, but that's what I've chosen to do this to first (as it seemed to have the worst bugs being reported in relation to it). Here is `recv_packet`:

```rust
// receive the next packet from the database server
// may block (async) on more data from the server
pub(crate) async fn recv_packet(&mut self) -> Result<Packet<Bytes>, Error> {
    // https://dev.mysql.com/doc/dev/mysql-server/8.0.12/page_protocol_basic_packets.html
    // https://mariadb.com/kb/en/library/0-packet/#standard-packet
    let mut header: Bytes = self.stream.read(4).await?;

    let packet_size = header.get_uint_le(3) as usize;
    let sequence_id = header.get_u8();

    self.sequence_id = sequence_id.wrapping_add(1);

    let payload: Bytes = self.stream.read(packet_size).await?;
    // BUG BUG BUG !!! if the above read never returns, the stream is now in a
    // weird state as the packet header has been thrown out

    // TODO: packet compression
    // TODO: packet joining

    if payload[0] == 0xff {
        self.busy = Busy::NotBusy;

        // instead of letting this packet be looked at everywhere, we check here
        // and emit a proper Error
        return Err(
            MySqlDatabaseError(ErrPacket::decode_with(payload, self.capabilities)?).into(),
        );
    }

    Ok(Packet(payload))
}
```

and here it is in `next`:

```rust
async fn read_packet_async<'de, T>(&'de mut self) -> Result<T>
where
    T: Deserialize<'de, Capabilities>,
{
    // https://dev.mysql.com/doc/internals/en/mysql-packet.html
    self.stream.read_async(4).await?;

    let payload_len: usize = self.stream.get(0, 3).get_uint_le(3) as usize;

    // FIXME: handle split packets
    assert_ne!(payload_len, 0xFF_FF_FF);

    self.sequence_id = self.stream.get(3, 1).get_u8();

    self.stream.read_async(4 + payload_len).await?;

    // wait until AFTER our last await to start telling the stream that we
    // actually are done with these bytes
    self.stream.consume(4);

    let payload = self.stream.take(payload_len);

    if payload[0] == 0xff {
        // if the first byte of the payload is 0xFF, the payload is an ERR packet
        let err = ErrPacket::deserialize_with(payload, self.capabilities)?;
        return Err(Error::Connect(Box::new(MySqlDatabaseError(err))));
    }

    T::deserialize_with(payload, self.capabilities)
}
```
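The cancel-safe shape of `read_packet_async` (peek at buffered bytes without consuming them, and commit the cursor only after the last await) can be reduced to a synchronous sketch. `PacketBuf` here is a hypothetical type for illustration, not sqlx's API:

```rust
/// Hypothetical buffer mirroring the cancel-safe shape of read_packet_async:
/// `get` peeks at buffered bytes without consuming them; `consume` advances
/// the cursor only once the whole packet is known to be present, so a caller
/// that gives up early leaves the buffer exactly as it found it.
struct PacketBuf {
    data: Vec<u8>,
    pos: usize,
}

impl PacketBuf {
    fn get(&self, offset: usize, n: usize) -> &[u8] {
        &self.data[self.pos + offset..self.pos + offset + n]
    }
    fn consume(&mut self, n: usize) {
        self.pos += n;
    }
    /// Returns the payload of the next MySQL-style packet (3-byte
    /// little-endian length, 1-byte sequence id, payload), or None if it has
    /// not fully arrived yet, without consuming anything in that case.
    fn try_packet(&mut self) -> Option<Vec<u8>> {
        if self.data.len() - self.pos < 4 {
            return None; // header not here yet; nothing consumed
        }
        let h = self.get(0, 3);
        let len = u32::from_le_bytes([h[0], h[1], h[2], 0]) as usize;
        if self.data.len() - self.pos < 4 + len {
            return None; // payload incomplete; header still unconsumed
        }
        let payload = self.get(4, len).to_vec();
        self.consume(4 + len); // only now do we commit
        Some(payload)
    }
}

fn main() {
    // A packet with len = 3, seq = 0, payload [1, 2, 3], of which only
    // 6 of 7 bytes have arrived so far.
    let mut buf = PacketBuf { data: vec![3, 0, 0, 0, 1, 2], pos: 0 };
    assert_eq!(buf.try_packet(), None); // incomplete: cursor untouched
    assert_eq!(buf.pos, 0);

    buf.data.push(3); // the rest arrives
    assert_eq!(buf.try_packet(), Some(vec![1, 2, 3]));
    assert_eq!(buf.pos, 7);
    println!("ok");
}
```

Cancelling a caller between the two `try_packet` calls is harmless, because nothing was consumed until the full packet was available.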
Hey, @mehcode! I'm struggling to understand here, but it sounds like recv_packet() needs to be callable concurrently on a given connection without altering the state of the PgConnection until it returns? I'll study this more, but meanwhile, here's some more info. I believe an issue is occurring here, in PgConnection::wait_until_ready():
The driver is waiting at the above line for a ReadyForQuery message when a DataRow message arrives. The code at this location discards the header (only), and leaves the payload of the DataRow message in the read buffer. When the driver reads the next message header, it sees a null first byte that is the beginning of the payload of a DataRow message.

wait_until_ready() is called at the beginning of PgConnection::run(), and in this instance we're in the middle of a stream of RowData, so the executing code should be in the try_stream loop instead. No error is being reported regarding processing the stream of rows. Maybe something is causing PgConnection::run() to silently return prematurely, leaving RowData messages available. I'm going to look further at PgConnection::run(), but regardless, it may be worth considering altering wait_until_ready() to either consume the payload or avoid consuming the header, one of the two, just from a protocol perspective. The risk is that as it stands, it will consume five bytes at a time until a payload size happens to be just the right length to land it at the beginning of a later message. That may discard other valid messages (not just DataRow).
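The "consume the payload together with the header" suggestion can be sketched as follows. This is a hypothetical helper, not sqlx's actual wait_until_ready(); it assumes the Postgres framing of a 1-byte tag plus a 4-byte big-endian length that counts itself:

```rust
/// Hypothetical sketch of skipping unwanted messages while waiting for
/// ReadyForQuery ('Z'): each message is dropped as a unit (tag, length,
/// and body together), so the cursor never lands inside a payload.
/// Returns the byte offset of the ReadyForQuery message, or None if the
/// buffer ends before one is fully framed.
fn skip_until_ready(buf: &[u8]) -> Option<usize> {
    let mut pos = 0;
    while buf.len() - pos >= 5 {
        let tag = buf[pos];
        let len = u32::from_be_bytes(buf[pos + 1..pos + 5].try_into().unwrap()) as usize;
        if buf.len() - pos < 1 + len {
            return None; // message not fully buffered yet
        }
        if tag == b'Z' {
            return Some(pos); // ReadyForQuery found, correctly framed
        }
        pos += 1 + len; // discard header AND payload together
    }
    None
}

fn main() {
    // A DataRow ('D', 6-byte body) followed by ReadyForQuery ('Z', 1-byte body).
    let mut wire = Vec::new();
    wire.push(b'D');
    wire.extend_from_slice(&10u32.to_be_bytes()); // length = 4 + 6
    wire.extend_from_slice(&[0, 1, 0, 0, 0, 0]);
    wire.push(b'Z');
    wire.extend_from_slice(&5u32.to_be_bytes()); // length = 4 + 1
    wire.push(b'I'); // transaction status: idle
    assert_eq!(skip_until_ready(&wire), Some(11));
    println!("ReadyForQuery found at a message boundary");
}
```

Skipping whole messages this way never misreads a payload byte as a tag, whatever mix of messages precedes ReadyForQuery.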
Ok, I see how read_packet_async() avoids consuming the read buffer until it's ready to finish. I think there may also be an issue in the callers to recv(). I'm certain PgStream and BufStream are not being dropped when this issue occurs. Could an outer future be dropped without dropping the PgStream or BufStream?
I think I see your disconnect. PgStream and BufStream are not being dropped, that's correct. The future created from calling a method like `execute()` may itself be dropped mid-execution, without PgStream or BufStream being dropped.
Ah, so if the future created by `execute` is dropped after the header is consumed, and before the payload is consumed, the payload will remain in the read buffer without a header. So the issue isn't in wait_until_ready() or run(), it's in recv() being only partially executed. Whew, thanks so much for helping me understand this better. I definitely need to study futures much more.
I tried to reproduce this in a test, #972. But the test succeeds, so I'm beginning to wonder whether the failure is exposed by something specific to actix.
Here's a test that reproduces the unknown message type error, under certain conditions: https://github.com/rich-murphey/sqlx-actix-test One of the conditions that reproduces the error is PgPoolOptions::test_before_acquire(false). The default is true. I'm hoping to further explore PgStream and BufStream using this test.
#1099 would probably fix this.
An additional option to address this is #1103. This PR seems a bit awkward compared to the way it's done in the 'next' branch, though.
Hey everyone, I think I've found a smaller reproduction case using hyper/tokio. Here's a quick video (demo.mov) showing the error, using autocannon for the benchmark. Also worth noting that my postgres db logs show this:

WARNING: there is no transaction in progress

I'm happy to post an example repo if it helps anyone debug this further. However, it sounds like 0.6 may likely fix this anyway. I guess once 0.6 is in a testable state, I'll give it a try and see if this error disappears.
Hey everyone. @Diggsey mentioned above that #1099 would probably fix this, and indeed it does fix the problem for me. Using the latest master commit, I no longer get the unknown message type: '\u{0}' error.

I also tested before and after that PR merged to be sure. Thanks for fixing, @Diggsey! I understand a 0.6 release is under development. Is there any chance we can get a 0.5.2 version released before then? I recently converted an api service from sync to async rust (huge amount of work 😅) and I'm really excited to get it deployed. Even if there's no 0.5.2 release, I'm going to deploy as is, because the error should be very unlikely at the amount of traffic I get.
@jakedeichert you should be aware that #1099 is not a perfect fix: although it works well in practice, there's still a small chance of the error occurring. Only 0.6 will truly fix the issue. (Of course, that may be fine for your use case.)
Closing as we've had some improvements to the Postgres driver since this issue was last active. If anyone can reproduce the problem consistently with SQLx 0.6.0, please open a new issue.
We recently got this error in production, and it's not the first time I've seen it. We are using sqlx 0.7.3, so I wonder if it's related to the problems you folks discussed here.
@alvico I am having the same issue. Did you figure out what was going on?
I believe I finally figured out what was causing this: #3467. Reopening.
During benchmarking, I got a lot of errors like the following:

ERROR sqlx_core::pool::connection] error occurred while flushing the connection: encountered unexpected or invalid data: unknown message type: '\u{0}'

I don't know if it's related to this issue, but in postgres I also see the following warning:

[550] WARNING: there is no transaction in progress

(In the app I have a transaction, and it seems for some reason it was not started.)