
unknown message type: '\u{0}' for postgres #886

Closed
dmitryb-dev opened this issue Dec 9, 2020 · 25 comments · Fixed by #3467

Comments

@dmitryb-dev

During benchmarking, I got a lot of errors like the following:

ERROR sqlx_core::pool::connection] error occurred while flushing the connection: encountered unexpected or invalid data: unknown message type: '\u{0}'

I don't know whether it's related to this issue, but in Postgres I also get the following warning: [550] WARNING: there is no transaction in progress (the app does open a transaction, so it seems that for some reason it was never started).

@rich-murphey
Contributor

During benchmarking I also see this; here is the Debug representation of the error:

Protocol(\n    \"unknown message type: \\'\\\\u{0}\\'\",\n)

This occurs for queries returning ~8000 rows consisting of five integer columns.
The benchmark issues 1024 queries, 8 at a time.
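
For context, a benchmark with that shape looks roughly like the sketch below (the counts come from this comment; the table, column names, and use of futures' buffer_unordered are assumptions, not the actual benchmark code):

```rust
use futures::stream::{self, StreamExt, TryStreamExt};
use sqlx::PgPool;

// 1024 queries, 8 in flight at a time, each returning rows of five integer columns.
async fn run_benchmark(pool: &PgPool) -> Result<(), sqlx::Error> {
    stream::iter(0..1024)
        .map(|_| async move {
            sqlx::query_as::<_, (i32, i32, i32, i32, i32)>(
                "SELECT a, b, c, d, e FROM bench_rows", // placeholder query
            )
            .fetch_all(pool)
            .await
            .map(|_rows| ()) // we only care that the query completes
        })
        .buffer_unordered(8)
        .try_collect::<()>()
        .await
}
```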

Looking at libpq, if the message id is invalid (such as zero) and the message length is valid, then libpq discards the message, returns an error and sets the connection to the ready state:

https://github.com/postgres/postgres/blob/6db27037b99358018353b8748719cfa9de484988/src/interfaces/libpq/fe-protocol3.c#L407
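
Roughly paraphrasing that libpq behavior in Rust (this is not sqlx code; the function name and error type are made up for illustration):

```rust
// If the type byte is unknown but the 4-byte length is plausible, skip the whole
// message so the stream stays aligned, and surface an error instead of silently
// losing synchronization.
fn bytes_to_skip_for_unknown_message(buf: &[u8]) -> Result<usize, String> {
    if buf.len() < 5 {
        return Err("need more data".to_string());
    }
    let msg_type = buf[0];
    // Postgres frames are: 1 type byte + 4-byte big-endian length (the length
    // counts itself but not the type byte).
    let len = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
    if len < 4 {
        return Err(format!("invalid length {} for message type {:#04x}", len, msg_type));
    }
    Ok(1 + len) // discard the type byte plus the declared length, then carry on
}
```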

Whatever the cause, the error is intermittent, but at much higher load it is consistently reproducible.

@mehcode
Member

mehcode commented Dec 23, 2020

This might be related to futures being dropped somewhere (see #563 for a MySQL-related issue this caused). We're definitely not good future citizens right now (we're working on fixing that in the next version).

@rich-murphey
Contributor

Ok, thanks @mehcode!

@rich-murphey
Contributor

This is not due to PgStream being dropped. I added some logging statements and confirmed this under load.

I'll keep looking at #563 and try to understand more.

@rich-murphey
Contributor

The above PR reduces the number of failed queries, but there are still issues even with it applied.

Here's some sample log output for a failed query.
the error is: Protocol("unknown message type: \'\\u{0}\'")
the rest of the message header is: b"\0\0\0\0"
the remainder of the read buffer is: b"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x05\0\0\0"

This repeats with similar errors until the query completes with 0 rows of output, while the correct output should be thousands of rows.

Based on this, I'm going to take a wild guess that a previous decoding error caused a loss of synchronization.

I'm going to study the other parts of Postgres message decoding.

@rich-murphey
Contributor

I'm able to reproduce this. When a message with a null first byte occurs, the message immediately before the error is a DataRow message.

The null first byte of the header occurs because, somehow, the actual first byte of the DataRow message is being discarded. As a result, the first byte of the payload size is interpreted as the message format byte. In other words, instead of reading bytes 1-5 as the header, bytes 2-6 are being read.

I'm not sure if the issue is in decoding the previous data row or somewhere else. I'm looking at DataRow decoding now.

@rich-murphey
Contributor

The messages preceding the null message type (zero first byte of the message) all appear to have the correct payload size. So at the moment it doesn't look like the issue is an incorrect size for the previous payload.

@rich-murphey
Contributor

I was mistaken about the number of bytes being discarded. A whole message header, five bytes, is discarded. As a result, the succeeding payload is treated as a message header.

In this instance, it's the five-byte message header of a DataRow message that is discarded, without discarding the payload.
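
To illustrate what that looks like on the wire (the byte values below are made up; only the layout matters):

```rust
// A DataRow frame is: b'D', a 4-byte big-endian length, a 2-byte column count, then
// the column data. If the 5 header bytes are consumed but the payload is left in the
// read buffer, the next "header" read starts at the column count, whose high byte is
// almost always 0x00 -- which is exactly the "unknown message type: '\u{0}'" error.
fn main() {
    let frame: &[u8] = &[b'D', 0x00, 0x00, 0x00, 0x20, 0x00, 0x05]; // truncated for brevity
    let leftover = &frame[5..]; // header consumed, payload left behind
    assert_eq!(leftover[0], 0x00); // misread as the next message's type byte
}
```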

@mehcode
Member

mehcode commented Jan 3, 2021

@rich-murphey I really appreciate you looking into this. I need to apologize, though, as it seems I wasn't clear when I was talking about the problem being related to dropping.

To go into more detail,

The problem is that all async functions must be re-entrant. This is a property that Rust does not enforce, but one that will subtly break (like this) if it's not followed. By re-entrant I mean that no mutable state should be altered unless the async function runs to completion.

Keep in mind that the outer future may be dropped at any .await point. This means that a function such as:

async fn foo(&mut self) -> Result<(), Error> {
  self.do_thing().await?; // may never return
  self.update += 1;

  self.do_thing().await?; // may never return
  self.update += 1;

  Ok(())
}

is very buggy: if the function exits early because the future is dropped, the state on self is left inconsistent.
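
One way to make that toy example safe against being dropped (just a sketch mirroring the snippet above, not real sqlx code) is to defer all mutation of self until after the last await:

```rust
async fn foo(&mut self) -> Result<(), Error> {
    // Await everything first; if the future is dropped at either point,
    // `self` has not been touched yet.
    self.do_thing().await?;
    self.do_thing().await?;

    // No await points below this line, so the state change cannot be
    // interrupted by cancellation.
    self.update += 2;
    Ok(())
}
```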

@mehcode
Member

mehcode commented Jan 3, 2021

The next big release of SQLx (likely 0.6) is having its core rewritten from scratch, and this problem is being kept in mind while doing so.

This is honestly a nasty problem and a big footgun for new and experienced async Rust developers alike. I'm not sure what a good solution is here (for Rust long-term), as it doesn't look trivial to lint against.

Sorry for switching to MySQL, but that's the driver I chose to rework first (it seemed to have the worst bugs reported against it).

Here is recv_packet (in MySQL) from master:

    // receive the next packet from the database server
    // may block (async) on more data from the server
    pub(crate) async fn recv_packet(&mut self) -> Result<Packet<Bytes>, Error> {
        // https://dev.mysql.com/doc/dev/mysql-server/8.0.12/page_protocol_basic_packets.html
        // https://mariadb.com/kb/en/library/0-packet/#standard-packet

        let mut header: Bytes = self.stream.read(4).await?;

        let packet_size = header.get_uint_le(3) as usize;
        let sequence_id = header.get_u8();

        self.sequence_id = sequence_id.wrapping_add(1);

        let payload: Bytes = self.stream.read(packet_size).await?;
        // BUG BUG BUG !!! if the above read never returns, the stream is now in a weird state as the packet header has been thrown out

        // TODO: packet compression
        // TODO: packet joining

        if payload[0] == 0xff {
            self.busy = Busy::NotBusy;

            // instead of letting this packet be looked at everywhere, we check here
            // and emit a proper Error
            return Err(
                MySqlDatabaseError(ErrPacket::decode_with(payload, self.capabilities)?).into(),
            );
        }

        Ok(Packet(payload))
    }

And here it is in next (likely 0.6):

    async fn read_packet_async<'de, T>(&'de mut self) -> Result<T>
    where
        T: Deserialize<'de, Capabilities>,
    {
        // https://dev.mysql.com/doc/internals/en/mysql-packet.html
        self.stream.read_async(4).await?;

        let payload_len: usize = self.stream.get(0, 3).get_uint_le(3) as usize;

        // FIXME: handle split packets
        assert_ne!(payload_len, 0xFF_FF_FF);

        self.sequence_id = self.stream.get(3, 1).get_u8();

        self.stream.read_async(4 + payload_len).await?;

        // wait until AFTER our last await to start telling the stream that we actually are done with these bytes

        self.stream.consume(4);
        let payload = self.stream.take(payload_len);

        if payload[0] == 0xff {
            // if the first byte of the payload is 0xFF and the payload is an ERR packet
            let err = ErrPacket::deserialize_with(payload, self.capabilities)?;
            return Err(Error::Connect(Box::new(MySqlDatabaseError(err))));
        }

        T::deserialize_with(payload, self.capabilities)
    }

@rich-murphey
Contributor

Hey, @mehcode!
Thanks very much for the guidance and the pointer to the MySQL code.

I'm struggling to understand this, but it sounds like recv_packet() needs to be callable concurrently on a given connection without altering the state of the PgConnection until it returns? I'll study this more, but meanwhile, here's some more info.

I believe an issue is occurring here, in PgConnection::wait_until_ready():

if let MessageFormat::ReadyForQuery = message.format {

The driver is waiting at the above line for a ReadyForQuery message when a DataRow message arrives. The code at this location discards the header (only), leaving the payload of the DataRow message in the read buffer. When the driver reads the next message header, it sees a null first byte, which is actually the start of the DataRow payload.

wait_until_ready() is called at the beginning of PgConnection::run(), and in this instance we're in the middle of a stream of DataRow messages, so the executing code should be in the try_stream loop instead. No error is being reported while processing the stream of rows. Maybe something is causing PgConnection::run() to silently return prematurely, leaving DataRow messages in the buffer.

I'm going to look further at PgConnection::run(), but regardless, from a protocol perspective it may be worth altering wait_until_ready() to either consume the whole payload or avoid consuming the header at all. The risk is that, as it stands, it consumes five bytes at a time until a payload size happens to be just the right length to land it at the beginning of a later message, and along the way it may discard other valid messages (not just DataRow).
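
As a rough sketch of the first option (not the actual sqlx code; it uses tokio's AsyncReadExt purely for illustration, while sqlx reads through its own buffered stream):

```rust
use tokio::io::{AsyncRead, AsyncReadExt};

// Skip one complete Postgres message: read the 5-byte header, then read and discard
// the payload, so the stream can never end up pointing into the middle of a message.
async fn skip_one_message<S: AsyncRead + Unpin>(stream: &mut S) -> std::io::Result<u8> {
    let mut header = [0u8; 5];
    stream.read_exact(&mut header).await?;

    let format = header[0];
    // The length field counts its own 4 bytes but not the format byte.
    let len = u32::from_be_bytes([header[1], header[2], header[3], header[4]]) as usize;

    let mut payload = vec![0u8; len.saturating_sub(4)];
    stream.read_exact(&mut payload).await?;

    Ok(format) // the caller keeps looping until this is b'Z' (ReadyForQuery)
}
```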

@rich-murphey
Contributor

Ok, I see how read_packet_async() avoids consuming the read buffer until it's ready to finish.

I think there may also be an issue in the callers of recv().

I'm certain PgStream and BufStream are not being dropped when this issue occurs. Could an outer future be dropped without dropping the PgStream or BufStream?

@mehcode
Member

mehcode commented Jan 4, 2021

I think I see your disconnect. PgStream and BufStream are not being dropped, that's correct. The future created from calling a method like execute is being dropped.

@rich-murphey
Contributor

Ah, so if the future created by execute is dropped after the header is consumed, and before the payload is consumed, the payload will remain in the read buffer without a header.

So the issue isn't in wait_until_ready() or run(), it's in recv() being only partially executed.

Whew, thanks so much for helping me understand this better. I definitely need to study futures much more.

@rich-murphey
Contributor

I tried to reproduce this in a test, #972.

But the test succeeds, so I'm beginning to wonder whether the failure is exposed by something specific to actix.

@rich-murphey
Contributor

Here's a test that reproduces the unknown message type error, under certain conditions.

https://github.com/rich-murphey/sqlx-actix-test

One of the conditions that reproduces the error is PgPoolOptions::test_before_acquire(false). The default is true.
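
For reference, a pool configured that way looks roughly like this (the connection string and connection count are placeholders):

```rust
use sqlx::postgres::{PgPool, PgPoolOptions};

async fn make_pool() -> Result<PgPool, sqlx::Error> {
    PgPoolOptions::new()
        .max_connections(8)
        .test_before_acquire(false) // the default is true; `false` reproduces the error
        .connect("postgres://localhost/test") // placeholder URL
        .await
}
```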

I'm hoping to further explore PgStream and BufStream using this test.

@Diggsey
Contributor

Diggsey commented Mar 13, 2021

#1099 would probably fix this.

@rich-murphey
Contributor

An additional option to address this is #1103. This PR seems a bit awkward compared to the way it's done in the 'next' branch, though.

@jacobdeichert

jacobdeichert commented Mar 23, 2021

Hey everyone, I think I've found a smaller reproduction case using hyper/tokio: SELECT 1 inside a transaction, run under a 10-second benchmark with 10 connections.

Here's my http endpoint controller
Here's my select_one query implementation (just for this demo)
Here's a quick video showing the error
Using autocannon for the benchmark
Also worth noting that my postgres db logs show this: WARNING: there is no transaction in progress
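
For anyone reading without the links above, the shape of the endpoint is roughly the following (a sketch only; the names and signatures are assumptions, not the actual linked code):

```rust
use sqlx::PgPool;

// SELECT 1 inside a short-lived transaction, called once per HTTP request by the
// benchmark. The "no transaction in progress" warning suggests the COMMIT reaches
// the server without a matching BEGIN.
async fn select_one(pool: &PgPool) -> Result<i32, sqlx::Error> {
    let mut tx = pool.begin().await?;
    let row: (i32,) = sqlx::query_as("SELECT 1").fetch_one(&mut *tx).await?;
    tx.commit().await?;
    Ok(row.0)
}
```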


I'm happy to post an example repo if it helps anyone debug this further. However, it sounds like 0.6 will likely fix this anyway.

I guess once 0.6 is in a testable state, I'll give it a try and see if this error disappears.

@jacobdeichert

jacobdeichert commented Apr 13, 2021

Hey everyone. @Diggsey mentioned above that #1099 would probably fix this, and indeed it does fix the problem for me.

Using the latest master commit, I no longer get the unknown message type: '\u{0}' error when running my benchmark tests.

sqlx = { git = "https://github.com/launchbadge/sqlx", features = [ "chrono", "postgres", "runtime-tokio-native-tls" ] }  

I also tested before and after that PR was merged, to be sure. Thanks for the fix, @Diggsey!

I understand a 0.6 release is under development. Is there any chance we can get a 0.5.2 version released before then? I recently converted an API service from sync to async Rust (a huge amount of work 😅) and I'm really excited to get it deployed. Even if there's no 0.5.2 release, I'm going to deploy as is, because the error should be very unlikely at the amount of traffic I get.

@Diggsey
Contributor

Diggsey commented Apr 13, 2021

@jakedeichert you should be aware that #1099 is not a perfect fix: although it works well in practice, there's still a small chance of the error occurring. Only 0.6 will truly fix the issue. (Of course, that may be fine for your use case.)

@abonander
Collaborator

Closing, as we've had some improvements to the Postgres driver since this issue was last active. If anyone can reproduce the problem consistently with SQLx 0.6.0, please open a new issue.

@alvico

alvico commented Apr 4, 2024

We recently got this error in production, and it's not the first time I've seen it. We are using sqlx 0.7.3, so I wonder if it's related to the problems you folks discussed here.

error occurred while testing the connection on-release
encountered unexpected or invalid data: expected 0x00 or 0xfe (OK_Packet) but found 0x03 

@GabrielBianconi

@alvico I am having the same issue. Did you figure out what was going on?

@abonander
Collaborator

I believe I finally figured out what was causing this: #3467

Reopening.
