unknown message type: '\u{0}' for postgres #886
During benchmarking I also see this for the debug version of the error:
This occurs for queries returning ~8000 rows consisting of five integer columns. Looking at libpq: if the message id is invalid (such as zero) and the message length is valid, then libpq discards the message, returns an error, and sets the connection to the ready state. Whatever it is, although it's intermittent, at much higher load it's consistently repeatable.
This might be related to futures being dropped somewhere (see #563 for a MySQL-related issue it caused). We're definitely not good future citizens right now (working on fixing that in the next version).
Ok, thanks @mehcode!
This is not due to PgStream being dropped. I added some logging statements and confirmed this under load. I'll keep looking at #563 and try to understand more.
The above PR reduces the number of failed queries, but there are still issues even with the PR. Here's some sample log output for a failed query. This repeats with similar errors until the query completes with 0 rows of output, while the correct output should be thousands of rows. Based on this, I'm gonna take a wild guess that there was a previous decoding error that is causing loss of synchronization. I'm going to try to study the other parts of postgres message decoding.
I'm able to reproduce this. When a message with a null first byte occurs, the message before the error is a DataRow message. The null first byte of the header is occurring because, somehow, the actual first byte of the DataRow message is being discarded. As a result, the first byte of the payload size is being interpreted as the message format byte. In other words, instead of reading bytes 1-5 as the header, bytes 2-6 are being read. I'm not sure if the issue is in decoding the previous data row or somewhere else. I'm looking at DataRow decoding now.
The messages preceding the null message type (zero first byte of message) appear to all have the correct payload size. So at the moment it doesn't look like the issue is an incorrect size of the previous payload.
I was mistaken about the number of bytes being discarded. A whole message header, five bytes, is discarded. As a result, the succeeding payload is treated as a message header. In this instance, it's the five-byte message header of a DataRow message that is discarded, without discarding the payload.
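The desynchronization described above can be sketched against the Postgres wire format (a 1-byte message tag followed by a 4-byte big-endian length that counts itself but not the tag). This is a standalone illustration, not sqlx code; `build_datarow` is a hypothetical helper. Note that a DataRow body for a five-column row starts with the column count 0x00 0x05, so its first byte is exactly the 0x00 seen in the error:

```rust
/// Hypothetical helper: build a DataRow ('D') message as it appears on the
/// wire: 1-byte tag, 4-byte big-endian length (counts itself, not the tag),
/// then the body.
fn build_datarow(body: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(5 + body.len());
    buf.push(b'D');
    buf.extend_from_slice(&((4 + body.len()) as u32).to_be_bytes());
    buf.extend_from_slice(body);
    buf
}

fn main() {
    // Body of a five-column row: column count 0x0005, then dummy column data.
    let msg = build_datarow(&[0x00, 0x05, 0, 0, 0, 1, 0x2a]);

    // Correct framing: the header AND the (length - 4)-byte body are
    // consumed together, leaving the buffer at the next message boundary.
    let len = u32::from_be_bytes(msg[1..5].try_into().unwrap()) as usize;
    let after_msg = &msg[1 + len..];
    assert!(after_msg.is_empty());

    // Buggy framing (what this comment describes): the 5-byte header is
    // discarded but the body stays buffered, so the next "message tag" read
    // sees the first body byte.
    let bogus_tag = msg[5];
    assert_eq!(bogus_tag, 0x00); // "unknown message type: '\u{0}'"
    println!("tag after desync: {:#04x}", bogus_tag);
}
```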
@rich-murphey I really appreciate you looking into this. I need to apologize though, as it seems I wasn't clear when I was talking about the problem being related to dropping. To go into more detail: the problem is that all async functions must be re-entrant. This is a property that Rust does not enforce, but that will subtly break (like this) if it's not followed. By re-entrant I mean that no mutable state may be altered unless the async function runs to completion. Keep in mind that the outer future may be dropped at any `.await` point, so

```rust
async fn foo(&mut self) {
    self.do_thing().await?; // may never return
    self.update += 1;

    self.do_thing().await?; // may never return
    self.update += 1;
}
```

is very buggy: if the function exits early due to the future being dropped, the state on `self` is left half-updated.
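A minimal, executor-free sketch of that failure mode (not sqlx code; the no-op waker and `NeverReady` future are stand-ins for a real runtime and a blocked read): polling an async block once and then dropping it leaves the state it captured half-updated.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// A future that never completes, standing in for a read() that never returns.
struct NeverReady;
impl Future for NeverReady {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Pending
    }
}

/// A waker that does nothing; we only poll by hand.
fn noop_waker() -> Waker {
    unsafe fn rw_clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    unsafe fn rw_op(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(rw_clone, rw_op, rw_op, rw_op);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

/// Poll an async block once, then drop it mid-await: the shared counter is
/// left at 1, not the 2 a full run would produce.
fn demo() -> u32 {
    let state = Arc::new(Mutex::new(0u32));
    let s = Arc::clone(&state);
    let mut fut = Box::pin(async move {
        *s.lock().unwrap() += 1; // first step runs...
        NeverReady.await;        // ...then we park, like a blocked read
        *s.lock().unwrap() += 1; // never reached once the future is dropped
    });

    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut); // the caller gives up: cancellation, not an error return

    let n = *state.lock().unwrap();
    n
}

fn main() {
    println!("state after cancellation: {}", demo()); // 1, not 2
}
```

This is exactly why `recv_packet` misbehaves: the header read completes, state is mutated, and then the payload read is cancelled.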
The next big release of SQLx (likely 0.6) is currently having its core rewritten from scratch, keeping this in mind while doing so. This is honestly a nasty problem and a big footgun for new and experienced async Rust developers alike. I'm not sure what a good solution here is (for long-term Rust), as it doesn't look very trivial to lint against. Sorry for the MySQL example, but that's what I've chosen to do this to first (as it seemed to have the worst bugs being reported in relation to it). Here is `recv_packet`:

```rust
// receive the next packet from the database server
// may block (async) on more data from the server
pub(crate) async fn recv_packet(&mut self) -> Result<Packet<Bytes>, Error> {
    // https://dev.mysql.com/doc/dev/mysql-server/8.0.12/page_protocol_basic_packets.html
    // https://mariadb.com/kb/en/library/0-packet/#standard-packet
    let mut header: Bytes = self.stream.read(4).await?;

    let packet_size = header.get_uint_le(3) as usize;
    let sequence_id = header.get_u8();

    self.sequence_id = sequence_id.wrapping_add(1);

    let payload: Bytes = self.stream.read(packet_size).await?;
    // BUG BUG BUG !!! if the above read never returns, the stream is now in a
    // weird state as the packet header has been thrown out

    // TODO: packet compression
    // TODO: packet joining

    if payload[0] == 0xff {
        self.busy = Busy::NotBusy;

        // instead of letting this packet be looked at everywhere, we check here
        // and emit a proper Error
        return Err(
            MySqlDatabaseError(ErrPacket::decode_with(payload, self.capabilities)?).into(),
        );
    }

    Ok(Packet(payload))
}
```

and here it is in `next`:

```rust
async fn read_packet_async<'de, T>(&'de mut self) -> Result<T>
where
    T: Deserialize<'de, Capabilities>,
{
    // https://dev.mysql.com/doc/internals/en/mysql-packet.html
    self.stream.read_async(4).await?;

    let payload_len: usize = self.stream.get(0, 3).get_uint_le(3) as usize;

    // FIXME: handle split packets
    assert_ne!(payload_len, 0xFF_FF_FF);

    self.sequence_id = self.stream.get(3, 1).get_u8();

    self.stream.read_async(4 + payload_len).await?;

    // wait until AFTER our last await to start telling the stream that we
    // actually are done with these bytes
    self.stream.consume(4);

    let payload = self.stream.take(payload_len);

    if payload[0] == 0xff {
        // if the first byte of the payload is 0xFF, the payload is an ERR packet
        let err = ErrPacket::deserialize_with(payload, self.capabilities)?;
        return Err(Error::Connect(Box::new(MySqlDatabaseError(err))));
    }

    T::deserialize_with(payload, self.capabilities)
}
```
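The cancel-safe shape of `read_packet_async` (peek at buffered bytes without consuming them, and commit the cursor only after the last await) can be reduced to a synchronous sketch. `PacketBuf` here is a hypothetical type for illustration, not sqlx's API:

```rust
/// Hypothetical buffer mirroring the cancel-safe shape of read_packet_async:
/// `get` peeks at buffered bytes without consuming them; `consume` advances
/// the cursor only once the whole packet is known to be present, so a caller
/// that gives up early leaves the buffer exactly as it found it.
struct PacketBuf {
    data: Vec<u8>,
    pos: usize,
}

impl PacketBuf {
    fn get(&self, offset: usize, n: usize) -> &[u8] {
        &self.data[self.pos + offset..self.pos + offset + n]
    }
    fn consume(&mut self, n: usize) {
        self.pos += n;
    }
    /// Returns the payload of the next MySQL-style packet (3-byte
    /// little-endian length, 1-byte sequence id, payload), or None if it has
    /// not fully arrived yet, without consuming anything in that case.
    fn try_packet(&mut self) -> Option<Vec<u8>> {
        if self.data.len() - self.pos < 4 {
            return None; // header not here yet; nothing consumed
        }
        let h = self.get(0, 3);
        let len = u32::from_le_bytes([h[0], h[1], h[2], 0]) as usize;
        if self.data.len() - self.pos < 4 + len {
            return None; // payload incomplete; header still unconsumed
        }
        let payload = self.get(4, len).to_vec();
        self.consume(4 + len); // only now do we commit
        Some(payload)
    }
}

fn main() {
    // A packet with len = 3, seq = 0, payload [1, 2, 3], of which only
    // 6 of 7 bytes have arrived so far.
    let mut buf = PacketBuf { data: vec![3, 0, 0, 0, 1, 2], pos: 0 };
    assert_eq!(buf.try_packet(), None); // incomplete: cursor untouched
    assert_eq!(buf.pos, 0);

    buf.data.push(3); // the rest arrives
    assert_eq!(buf.try_packet(), Some(vec![1, 2, 3]));
    assert_eq!(buf.pos, 7);
    println!("ok");
}
```

Cancelling a caller between the two `try_packet` calls is harmless, because nothing was consumed until the full packet was available.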
Hey, @mehcode! I'm struggling to understand here, but it sounds like recv_packet() needs to be callable concurrently on a given connection without altering the state of the PgConnection until it returns? I'll study this more, but meanwhile, here's some more info. I believe an issue is occurring here, in PgConnection::wait_until_ready():
The driver is waiting at the above line for a ReadyForQuery message when a DataRow message arrives. The code at this location discards the header (only), and leaves the payload of the DataRow message in the read buffer. When the driver reads the next message header, it sees a null first byte that is the beginning of the payload of a DataRow message.

wait_until_ready() is called at the beginning of PgConnection::run(), and in this instance we're in the middle of a stream of RowData, so the executing code should be in the try_stream loop instead. No error is being reported regarding processing the stream of rows. Maybe something is causing PgConnection::run() to silently return prematurely, leaving RowData messages available. I'm going to look further at PgConnection::run(), but regardless, it may be worth considering altering wait_until_ready() to either consume the payload or avoid consuming the header, one of the two, just from a protocol perspective. The risk is that as it stands, it will consume five bytes at a time until a payload size happens to be just the right length to land it at the beginning of a later message. That may discard other valid messages (not just DataRow).
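The "consume the payload together with the header" suggestion can be sketched as follows. This is a hypothetical helper, not sqlx's actual wait_until_ready(); it assumes the Postgres framing of a 1-byte tag plus a 4-byte big-endian length that counts itself:

```rust
/// Hypothetical sketch of skipping unwanted messages while waiting for
/// ReadyForQuery ('Z'): each message is dropped as a unit (tag, length,
/// and body together), so the cursor never lands inside a payload.
/// Returns the byte offset of the ReadyForQuery message, or None if the
/// buffer ends before one is fully framed.
fn skip_until_ready(buf: &[u8]) -> Option<usize> {
    let mut pos = 0;
    while buf.len() - pos >= 5 {
        let tag = buf[pos];
        let len = u32::from_be_bytes(buf[pos + 1..pos + 5].try_into().unwrap()) as usize;
        if buf.len() - pos < 1 + len {
            return None; // message not fully buffered yet
        }
        if tag == b'Z' {
            return Some(pos); // ReadyForQuery found, correctly framed
        }
        pos += 1 + len; // discard header AND payload together
    }
    None
}

fn main() {
    // A DataRow ('D', 6-byte body) followed by ReadyForQuery ('Z', 1-byte body).
    let mut wire = Vec::new();
    wire.push(b'D');
    wire.extend_from_slice(&10u32.to_be_bytes()); // length = 4 + 6
    wire.extend_from_slice(&[0, 1, 0, 0, 0, 0]);
    wire.push(b'Z');
    wire.extend_from_slice(&5u32.to_be_bytes()); // length = 4 + 1
    wire.push(b'I'); // transaction status: idle
    assert_eq!(skip_until_ready(&wire), Some(11));
    println!("ReadyForQuery found at a message boundary");
}
```

Skipping whole messages this way never misreads a payload byte as a tag, whatever mix of messages precedes ReadyForQuery.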
Ok, I see how read_packet_async() avoids consuming the read buffer until it's ready to finish. I think there may also be an issue in the callers to recv(). I'm certain PgStream and BufStream are not being dropped when this issue occurs. Could an outer future be dropped without dropping the PgStream or BufStream?
I think I see your disconnect. PgStream and BufStream are not being dropped, that's correct. The future created from calling a method like `execute()` may itself be dropped mid-execution, without PgStream or BufStream being dropped.
Ah, so if the future created by `execute` is dropped after the header is consumed, and before the payload is consumed, the payload will remain in the read buffer without a header. So the issue isn't in wait_until_ready() or run(), it's in recv() being only partially executed. Whew, thanks so much for helping me understand this better. I definitely need to study futures much more.
I tried to reproduce this in a test, #972. But the test succeeds, so I'm beginning to wonder whether the failure is exposed by something specific to actix.
Here's a test that reproduces the unknown message type error, under certain conditions: https://github.com/rich-murphey/sqlx-actix-test One of the conditions that reproduces the error is PgPoolOptions::test_before_acquire(false). The default is true. I'm hoping to further explore PgStream and BufStream using this test.
#1099 would probably fix this.
An additional option to address this is #1103. This PR seems a bit awkward compared to the way it's done in the 'next' branch, though.
Hey everyone, I think I've found a smaller reproduction case using hyper/tokio. Here's a quick video (demo.mov) showing the error, using autocannon for the benchmark. Also worth noting that my postgres db logs show this:

WARNING: there is no transaction in progress

I'm happy to post an example repo if it helps anyone debug this further. However, it sounds like 0.6 may likely fix this anyway. I guess once 0.6 is in a testable state, I'll give it a try and see if this error disappears.
Hey everyone. @Diggsey mentioned above that #1099 would probably fix this, and indeed it does fix the problem for me. Using the latest master commit, I no longer get the unknown message type: '\u{0}' error.

I also tested before and after that PR merged to be sure. Thanks for fixing, @Diggsey! I understand a 0.6 release is under development. Is there any chance we can get a 0.5.2 version released before then? I recently converted an api service from sync to async rust (huge amount of work 😅) and I'm really excited to get it deployed. Even if there's no 0.5.2 release, I'm going to deploy as is, because the error should be very unlikely at the amount of traffic I get.
@jakedeichert you should be aware that #1099 is not a perfect fix: although it works well in practice, there's still a small chance of the error occurring. Only 0.6 will truly fix the issue. (Of course, that may be fine for your use case.)
Closing as we've had some improvements to the Postgres driver since this issue was last active. If anyone can reproduce the problem consistently with SQLx 0.6.0, please open a new issue.
We recently got this error in production, and it's not the first time I've seen it. We are using sqlx 0.7.3, so I wonder if it's related to the problems you folks discussed here.
@alvico I am having the same issue. Did you figure out what was going on?
I believe I finally figured out what was causing this: #3467. Reopening.
During benchmarking, I got a lot of errors like the following:

ERROR sqlx_core::pool::connection] error occurred while flushing the connection: encountered unexpected or invalid data: unknown message type: '\u{0}'

I don't know if it's related to this issue, but in postgres I also see the following warning:

[550] WARNING: there is no transaction in progress

(In the app I have a transaction, and it seems for some reason it was not started.)