Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EOF error in packets.go readPacket() not propagating #108

Closed
insasho opened this issue Jul 10, 2013 · 38 comments · Fixed by #109
Closed

EOF error in packets.go readPacket() not propagating #108

insasho opened this issue Jul 10, 2013 · 38 comments · Fixed by #109

Comments

@insasho
Copy link

insasho commented Jul 10, 2013

I'm running a query that is returning around 250k rows from a MySQL 5.1 server. Occasionally I will get incomplete result sets without an error being returned to my invocation of the API methods. The following is printed:

[MySQL] 2013/07/10 11:18:38 packets.go:67: EOF

The relevant snippet of code:

errLog.Print(err.Error())
return nil, driver.ErrBadConn

Replacing "Print" with "Panic" shows the following stack trace:

[MySQL] 2013/07/10 11:23:33 packets.go:67: EOF
panic: EOF

goroutine 1 [running]:
log.(_Logger).Panic(0xc2000aa1e0, 0x7fd696a4f460, 0x1, 0x1)
/usr/local/go/src/pkg/log/log.go:193 +0xa7
github.com/go-sql-driver/mysql.(_mysqlConn).readPacket(0xc2000c3770, 0xc200100000, 0x66, 0x1000, 0xc2000ae000, ...)
ROOT/src/github.com/go-sql-driver/mysql/packets.go:67 +0x683
github.com/go-sql-driver/mysql.(_mysqlRows).readBinaryRow(0xc2000fe540, 0xc2000f6680, 0x4, 0x4, 0x0, ...)
ROOT/src/github.com/go-sql-driver/mysql/packets.go:861 +0x4e
github.com/go-sql-driver/mysql.(_mysqlRows).Next(0xc2000fe540, 0xc2000f6680, 0x4, 0x4, 0x0, ...)
ROOT/src/github.com/go-sql-driver/mysql/rows.go:68 +0x120
database/sql.(*Rows).Next(0xc2000c24e0, 0x676020)
/usr/local/go/src/pkg/database/sql/sql.go:1310 +0xc1
main.writeBooks(0xc2000c2120, 0xc2000aa5f0)
ROOT/src/example.com/sitemap/sitemapgenerator.go:152 +0x167
main.main()
ROOT/src/example.com/sitemap/sitemapgenerator.go:280 +0x87e

goroutine 2 [syscall]:
exit status 2

It looks like readBinaryRow is not actually returning the non-nil error, and instead returning early with a nil error:

func (rows *mysqlRows) readBinaryRow(dest []driver.Value) (err error) {
data, err := rows.mc.readPacket()
if err != nil {
return
}

This bug is particularly insidious because it is happening in a call to .Next(), which means that callers will be operating on partial result sets without knowing it.

System:

  • Ubuntu 13.04 64-bit
  • go 1.1
  • MySQL 5.1.63-0ubuntu0.10.04.1-log

This happens with both an SSH port forward and when on a proper network connection (rackspace cloud).

@arnehormann
Copy link
Member

I don't see the problem with the code in readBinaryRow. For comparison, here's a simplified version of the same pattern in the playground which prints an error as expected. Of course we'll still have a look, but the returned value is unlikely to be the cause.
As a reminder to myself for further analysis tomorrow: maybe the io.Reader call in buffer is faulty and wrongly handles EOF. I didn't see any other probable causes at first glance.

@insasho
Copy link
Author

insasho commented Jul 10, 2013

Thanks for looking into this. Yes, you're right, I misread that method and it looks OK to me.

Here are the relevant snippets of code we're using to connect, query, and iterate:

func ConnectAs(credentials Credentials) (*sql.DB, error) {
    connectString := fmt.Sprintf("%s:%s@tcp4(%s:%d)/%s", credentials.Username,
        credentials.Password, *host, *port, *database)
    conn, err := sql.Open("mysql", connectString)
    return conn, err
}
conn, err := ConnectAs(config.Database)
if err != nil {
    log.Panic(err)
}
// id int unsigned not null
// title varchar(250) not null
// ts timestamp not null default 0
// url url varchar(250) not null charset ascii (table is utf8)
SQL := "select id, title, ts, url from sometable limit ?"
results, err := conn.Query(SQL, numrows)
if err != nil {
    log.Panic(err)
}
var i int
for i = 0; results.Next(); i++ {

@arnehormann
Copy link
Member

Please update your driver version to the current master with go get -u github.com/go-sql-driver/mysql and try again. This should be fixed now. I'll close the bug, but if you still get the error, I'll reopen it. If it works now, please drop a comment anyway.
Thanks!

@julienschmidt
Copy link
Member

go get doesn't work yet since it uses the go1 tag

@arnehormann
Copy link
Member

So that's what you meant with release tags in the other PR... 😃
Well, I didn't get to know go get that intimately yet.

@julienschmidt
Copy link
Member

Yes, but unfortunately it doesn't work as well as I hoped (and the documentation of the go get tool promised): https://code.google.com/p/go/issues/detail?id=5779
I hoped that we could use a go1.1 tag for the next release, but this doesn't work.

@insasho
Copy link
Author

insasho commented Jul 11, 2013

I did a simple "rm -fr src/github.com/go-sql-driver", removed pkg/, and then used "git clone" to fetch the latest and I'm still seeing the problem. If it would help, I can possibly provide a packet capture for you guys but I can't share it via github because it will contain proprietary data.

@julienschmidt julienschmidt reopened this Jul 11, 2013
@julienschmidt
Copy link
Member

Let's see if we can get some more debug info. There are several reasons which can cause an EOF error. It would be great if the error was more descriptive in general but I think for now the best approach is to make a temporary debug branch to find the exact code where the error is triggered.
Please also check whether some other component is killing the connection (e.g. server side timeout).

@arnehormann
Copy link
Member

Is it still the same stacktrace? If I'm reading it right, readPacket can only get an io.EOF from buffer.readNext. That can only happen when the io.Reader in buffer reaches an io.EOF but there's still data missing.
So the bug should either be in net.Dial or the int parameter need is negative due to an overflow. Both are not that likely.
Looking at your stacktrace, the io.EOF originates in packets.go:67. I can't make any sense of that. Can you please doublecheck your version and maybe post a new stacktrace?

@julienschmidt
Copy link
Member

Please re-run it with the eof-debug branch.
Any suggestions for more debug output?

@insasho
Copy link
Author

insasho commented Jul 11, 2013

I've rebuilt with eof-debug and re-ran. Here's the stack trace:

fill EOF (need: %d, got: %d) 76 52
panic: EOF

goroutine 1 [running]:
github.com/go-sql-driver/mysql.(*buffer).fill(0xc2000f54c0, 0x4c, 0xc2000b0000, 0xc2000af060)
        /home/username/src/example/src/go/trunk/src/github.com/go-sql-driver/mysql/buffer.go:61 +0x3e6
github.com/go-sql-driver/mysql.(*buffer).readNext(0xc2000f54c0, 0x4c, 0x0, 0x0, 0x0, ...)
        /home/username/src/example/src/go/trunk/src/github.com/go-sql-driver/mysql/buffer.go:85 +0x6e
github.com/go-sql-driver/mysql.(*mysqlConn).readPacket(0xc2000c6770, 0xc2000ff7b1, 0x4, 0x84f, 0x0, ...)
        /home/username/src/example/src/go/trunk/src/github.com/go-sql-driver/mysql/packets.go:54 +0x467
github.com/go-sql-driver/mysql.(*mysqlRows).readBinaryRow(0xc2000fd570, 0xc2000f5740, 0x4, 0x4, 0x0, ...)
        /home/username/src/example/src/go/trunk/src/github.com/go-sql-driver/mysql/packets.go:891 +0x4e
github.com/go-sql-driver/mysql.(*mysqlRows).Next(0xc2000fd570, 0xc2000f5740, 0x4, 0x4, 0x0, ...)
        /home/username/src/example/src/go/trunk/src/github.com/go-sql-driver/mysql/rows.go:64 +0xb6
database/sql.(*Rows).Next(0xc2000c5540, 0x677c60)
        /usr/local/go/src/pkg/database/sql/sql.go:1310 +0xc1
main.writeBooks(0xc2000c5300, 0xc2000ac640)
        /home/username/src/example/src/go/trunk/src/example.com/sitemapgenerator/sitemapgenerator.go:153 +0x167
main.main()
        /home/username/src/example/src/go/trunk/src/example.com/sitemapgenerator/sitemapgenerator.go:296 +0x4da

goroutine 2 [syscall]:

@julienschmidt
Copy link
Member

There must be some other reason. The packet header said the packet body is 76 byte long but only 52 byte can be read. My guess is that something is killing the connection.

@julienschmidt
Copy link
Member

Please measure the time from opening the connection / beginning read to the EOF. If it is a suspicious time like a multiple of 30s it is probably some sort of timeout killing the connection.

Can you also send me the packet capture and your DSN from db.Open (of course without user/password) via mail to go at julienschmidt dot com

@julienschmidt
Copy link
Member

After analyzing the log we know the following:

  • The server kills the TCP connection
  • A timeout is unlikely since the connection is killed after only ~ 4s
  • The connection is killed during reading (row 110 of 250k)

To sum up: I'm baffled.

Some more things to try:

  • Try Exec(..) with the same query which drops the rows immediately
  • Try the same query with db.Query(..) / db.Exec(..) without a parameter (so no prepared statement -> using the text protocol)
  • Re-run the with the updated debug branch to see what is in the buffer when the error occurs.

@arnehormann
Copy link
Member

I think it's 46100 rows, not 110 (counted by number of occurences of the regexp MySQL Protocol([\n][^\n]*){17}Message:). Still, something else does kill the connection.
In Frame 292:, the server sends a FIN before it gets an ACK from the client. No MySQL EOF packet is sent (or I couldn't see one). That's a different pattern than in the rest of the interactions or specified by the protocol.
Please try Juliens suggestions and maybe compare the trace with a trace of the mysql command line client (which is using the text protocol, so try a call to Query without parameters first please).

@julienschmidt
Copy link
Member

Currently, we assume that the problem is not related to the driver.
If you can reproduce this error and think that it is caused by the driver, please comment, so that we reopen this issue.

@insasho
Copy link
Author

insasho commented Jul 24, 2013

We observed in the packet trace that the connection is closed prematurely and the driver doesn't propagate that error up to the client. Even if we attribute the premature FIN to a quirk of my environment, isn't the failure to propagate the EOF up to the caller still a bug?

@arnehormann
Copy link
Member

The EOF is propagated. See your own trace above:

[MySQL] 2013/07/10 11:23:33 packets.go:67: EOF
panic: EOF

goroutine 1 [running]:
...

Given that, I don't really know what else we should do. What sensible reaction should the driver provide?

@pengux
Copy link

pengux commented Aug 22, 2013

I'm also getting this error, but on line 69:

[MySQL] 2013/08/22 15:45:07 packets.go:69: EOF

I have tried to close the statement and the db connection and reopen them again after a certain iteration but the problem still persist.

@arnehormann
Copy link
Member

Yet another EOF... ok.
Can you connect with the C-client without trouble?
What's the output when you replace errLog.Print(err.Error()) with panic(err)?
What happens when you retry the last call on err != nil on the same connection?
All in all, I'm inclined to think this is a MySQL error for a specific server version. Like non logging of errors and dos-potential or multi-result problems - connection drop on next. If it's the second one: @pengux, @insasho - did one of you mess with the driver like in issue 66?

@arnehormann
Copy link
Member

I found another one! This looks great, a StackOverflow answer for a Python connection problem similar to yours.

And I just remembered we had the same problem for some large queries once.
Our ugly workaround: append a ridiculously large limit to your select statement, that made it significantly less flakey for us.

Gotta love them MySQL easter eggs ❤️ 🎊 ❤️!

EDIT one more thing: Please have a look at aborted_clients with SHOW GLOBAL STATUS; before and after your query. Was it incremented?

@insasho
Copy link
Author

insasho commented Aug 22, 2013

@pengux which server version are you running? I'm on 5.1.63.

I'm seeing similar EOF errors: packets.go:31: EOF

Unfortunately I haven't had time to dig into it further.

@arnehormann: Is there a way to force use_result/store_result with the go-mysql-driver? It would be good to test both cases.

@arnehormann
Copy link
Member

I hope I'm right here but can't see any way I'm not (concerning PHP):
The whole client side api (including protocol handling) is implemented in this package, which has no other external dependencies. There is no use_result or store_result in Go, store_result in php is just use_result with results looped in the driver and stored in an array. That's client side only, the server does not distinguish between these modes.
So if you take a lot of time during iteration, you can try storing everything in an array instead.

Based on your tcpdump, that's not the case.

Another thing to try: please change your failing SELECT fields FROM ... query to SELECT SQL_NO_CACHE fields FROM ... and say if this fixes things.

@pengux
Copy link

pengux commented Aug 22, 2013

OK, here are my findings:

... LIMIT 999999

Does not work

SELECT SQL_NO_CACHE ...

Does not work

Store in a temporary slice, WORKS!

@arnehormann You are correct that my iterations takes some times, up to a few seconds each. I select about 30k of rows in each select. So what's the correlation with this error?

Edit: Storing the rows in a temporary array does work better, but not entirely. Before it would stop after about 1000 rows, but now it stops after about 5000 rows.

@arnehormann
Copy link
Member

@pengux please read this and try with a higher net_write_timeout (it's a server side option, write is read for the client). Try appending &net_write_timeout=600 to the DSN. Maybe it's some kind of timeout problem.

@pengux
Copy link

pengux commented Aug 24, 2013

@arnehormann Yes, appending the net_write_timeout=600 seems to fix the problem. Is it possible to make it the default value for a connection?

@arnehormann
Copy link
Member

@pengux Appending it makes it the default for all connections using that DSN.

Do you mean the driver should make it the default for all connections? We won't do that. Appending it to the DSN only means this is called after a connection is established - just the query SET net_write_timeout=600. It's not magic, it's really just another query. And making it the default for everyone leads to a lot of unneeded queries. Maybe someone wants a higher value than 5 minutes, maybe someone wants a lower one to save server resources or detect badly written code - it's not the drivers job to make assumptions in that regard.

@pengux
Copy link

pengux commented Aug 24, 2013

@arnehormann No, what I mean is that there is a default value for net_write_timeout which is obviously lower than 600 which cause this error. It would be great if the driver could set a good default value so that those who use it don't stumble upon this problem. Or maybe there is another implementation that could fix this issue? How does the PHP Mysql driver do it, because I've never encountered anything similar there even with same code?

@arnehormann
Copy link
Member

@pengux sorry, I messed up the markdown for a link above (read this...). I just fixed it, here's a copy for your convenience.

And it's exactly what I meant. We can determine the default which is set. But why should the driver do that, that's just another default query most people don't need (two if you factor in the discovery of the current value, the default appears to be one minute btw.). This is a server side issue, caused by the servers configuration and it can be fixed with a little SQL. If you need it, you add it. If you don't, the driver shouldn't assume you do. I'd rather not have this activated by default than provide an option to disable it. The default value of one minute is perfectly fine for most people.

I may still add this to the README.

About PHP, I have no idea. It's a language I have experience with, but if I don't really have to, I prefer to keep my distance. For me, working with PHP causes an immediate urge to take a shower and cleanse myself.

@julienschmidt
Copy link
Member

Some more patience please, @arnehormann 😉
I'm not entirely done with this issue yet. Let's analyze the the network communication of other drivers (e.g. PHP's mysqli) first.

@julienschmidt julienschmidt reopened this Aug 25, 2013
@julienschmidt
Copy link
Member

@pengux I'd suggest you put this in your mysql config file instead of setting this for every single connection via the driver.

@arnehormann
Copy link
Member

Patience granted. You were pretty silent for some time, welcome back!

@ymmt2005
Copy link
Contributor

ymmt2005 commented Sep 9, 2013

I had the same problem.

At least the error message should be improved to suggest this is a timeout problem.

@julienschmidt
Copy link
Member

Can anyone still reproduce this with the current git master (not the go get version)?

@insasho
Copy link
Author

insasho commented Oct 16, 2013

Thanks for continuing to investigate this. I am still seeing the EOF messages occasionally with the 'go get' version pulled a few days ago, and they aren't timeout related (see timestamps):

2013/10/16 14:03:12 grunt.go:214: Processing: async.Envelope{...
[MySQL] 2013/10/16 14:03:15 packets.go:31: EOF
[MySQL] 2013/10/16 14:03:21 packets.go:31: EOF

This corresponds to packets.go:

func (mc *mysqlConn) readPacket() (data []byte, err error) {
    // Read packet header
    data, err = mc.buf.readNext(4)
    if err != nil {
        errLog.Print(err.Error())
        return nil, driver.ErrBadConn
    }

If you think that might be related to what you suspect is fixed, I'll pull master and push new binaries.

Thanks.

@julienschmidt
Copy link
Member

Yes, the git master has several changes which should improve this issue.

@insasho
Copy link
Author

insasho commented Oct 16, 2013

OK, pushed. I'll keep you posted.

@julienschmidt
Copy link
Member

Closed until we get new reports of this bug

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants