Support for streaming to a MySQL parameter from System.IO.Stream? #943
I am getting it working by simply turning the stream into a byte array and submitting that, but that is not particularly ideal if the data is large.
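(For reference, a minimal sketch of that workaround; `stream` is the incoming `System.IO.Stream` and `command` is a `MySqlCommand`, both assumed to be in scope:)

```csharp
// Buffer the whole stream into memory, then bind the resulting byte array.
// Works, but the full payload (plus a second copy from ToArray) lives in memory.
using var buffer = new MemoryStream();
await stream.CopyToAsync(buffer);
command.Parameters.AddWithValue("@body", buffer.ToArray());
await command.ExecuteNonQueryAsync();
```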
A little bit of both. 😀 MySqlConnector doesn't have support for it, hence the exception. It could certainly be added (as a convenience method for this use case). However, the MySQL text protocol requires the entire statement, with all parameter data inlined, to be buffered and sent as one unit, so the data would have to be read into memory anyway. Still, this is probably worth adding, even if just for API compatibility with other ADO.NET providers, even if it doesn't support actual streaming, at least initially.
Right, it would be nice to support it just to make it more compatible with other ADO.NET connectors like SQL Server, even if under the hood it just does what I did and converts the stream into a byte array. That does mean it would crash out if someone tries to insert a massive blob, but most folks don't have the max packet size for MySQL set to a massive number anyway (in our case it's 16MB for production), so unless the MySQL wire protocol supported proper streaming, it would still be limited to whatever that size is. Clearly SQL Server supports massive blobs, as the Rebus unit test for the code in question inserts a 100MB file, which naturally crapped itself on my test system with a 16MB max packet size :)

At the end of the day folks shouldn't be inserting massive blobs into MySQL anyway, as it's the wrong tool for the job. If you need to store massive blobs, they should be stored in a storage bucket somewhere (Amazon S3, Google etc.) and a link to the resource stored in MySQL, so I don't think it's a big limitation. But it would be nice when porting code like this for it to just 'work', at least until you actually try to insert something massive :)
That's a good point: the maximum blob size one can insert is usually quite limited anyway, due to the max packet size.
Updated the ADO.NET tests; it appears that SqlClient is the only other library that supports using a Stream as a parameter value. Npgsql has decided not to support it (and throws an exception). Most other libraries "succeed" but write incorrect data (e.g., the stream's type name) instead of its contents.
Yeah, it's six of one, half a dozen of the other. When you get a nice exception, at least you know stuff is not working and can fix it. In some ways I suppose having the developer specifically write the code like I did, to turn it into a byte array, means you are then fully aware of the consequences (memory usage). That might be better than having someone think they can stream a 1G file into a database blob only to find it runs out of memory :)
I'm adding support for MemoryStream, since its contents can be copied directly into the payload. A different approach (e.g., one that reads repeatedly into a 64KiB buffer) could be implemented, which would work for all Stream types. (A concern, as discussed above, with supporting any arbitrary Stream is that the data still can't exceed the max packet size, so buffering doesn't remove the size limit.)
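(A rough sketch of that buffered approach, purely illustrative — `source` is the caller's `Stream`, and `payloadWriter` stands in for MySqlConnector's internal packet writer; it is not a real API:)

```csharp
// Read any Stream in 64KiB chunks instead of materializing it all at once;
// each chunk would be appended to the outgoing statement payload.
var buffer = new byte[65536];
int bytesRead;
while ((bytesRead = await source.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
    payloadWriter.Write(buffer.AsSpan(0, bytesRead)); // hypothetical sink
}
```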
Ok, I think supporting any stream should be implemented to make it really useful. A memory stream itself is not all that useful, as all the data will already be in memory in that case, so I'm not sure it buys you much other than avoiding the second copy. A lot of the time it's likely to be a real stream, like a file stream or network stream, so the buffering approach would make the most sense to me? If the buffering approach is implemented, does that mean it would avoid the max packet size issue? Or will that still be a problem? If it can avoid that, then at least it would be possible to stream large blob data in, larger than the max packet size.
MySqlBulkCopy is the closest thing to streaming that MySqlConnector has; it may be the right way to support this scenario.
It’s crazy, but I have never had a need for the data table class myself (we use everything via a micro ORM), so it’s not clear to me how you would stream a file off disk, for instance, into a data table that can then be streamed to MySQL via the bulk copy class? Unfortunately the SQL Server sample code just shows how to toss data from one table to another using it.
Assuming it’s possible to copy the data via the MySQL wire protocol through a 64K buffer and avoid the max packet size issues, I think that’s the best approach. If that is not possible, then maybe I need to learn how to use a data table and bulk copy in that way for Rebus.
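(On the data table question above, a minimal sketch — table and column names are invented, and note the file contents are still buffered into a `byte[]` rather than truly streamed:)

```csharp
// Connection string needs AllowLoadLocalInfile=true for MySqlBulkCopy.
using var connection = new MySqlConnection(connectionString);
await connection.OpenAsync();

// Build a one-row DataTable whose BLOB column holds the file contents.
var table = new DataTable();
table.Columns.Add("name", typeof(string));
table.Columns.Add("data", typeof(byte[]));
table.Rows.Add("report.pdf", File.ReadAllBytes("report.pdf"));

// MySqlBulkCopy sends the rows via LOAD DATA LOCAL INFILE across multiple packets.
var bulkCopy = new MySqlBulkCopy(connection) { DestinationTableName = "files" };
await bulkCopy.WriteToServerAsync(table);
```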
AFAIK that's only possible by first executing a LOAD DATA LOCAL INFILE command, whose payload is sent across multiple packets (this is what MySqlBulkCopy uses under the hood). But an INSERT statement, with all its parameter data inlined, still has to fit within max_allowed_packet.
Looks like there has to be a way to support it, as it's officially supported in the C API? https://dev.mysql.com/doc/c-api/8.0/en/mysql-stmt-send-long-data.html
I don't think so, since from that page:

> The max_allowed_packet system variable controls the maximum sizes of parameter values that can be sent with mysql_stmt_send_long_data().

EDIT: This may just mean the maximum size of each chunk, not the total length?
People say it works for Java and PHP, so it must be possible?
A good point from that SO question: even if you could insert data larger than max_allowed_packet, the server would have no way to return it, since responses are subject to the same packet limit.

It seems pointless to invent a way of inserting arbitrarily large data that can't be retrieved. Additionally, using mysql_stmt_send_long_data requires prepared statements. So perhaps emulating streaming (by buffering in memory) is the pragmatic approach.
You can already receive large data, as it's already possible with the connector to read data into a stream.
No, because there's no way for the server to send a value larger than max_allowed_packet.
I can look into implementing COM_STMT_SEND_LONG_DATA (the protocol command behind mysql_stmt_send_long_data) for prepared commands.
So this example is pretty pointless then? Never tried it, since I can’t get data into the DB, lol: https://dev.mysql.com/doc/connector-net/en/connector-net-programming-blob-reading.html
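(The pattern that page demonstrates is roughly the standard ADO.NET `SequentialAccess` + `GetBytes` chunked read; table and column names here are invented, `output` is a destination `Stream`, and the value still has to fit in max_allowed_packet on the wire:)

```csharp
using var command = new MySqlCommand("SELECT data FROM files WHERE id = @id", connection);
command.Parameters.AddWithValue("@id", id);

// SequentialAccess lets the reader stream the column instead of buffering the row.
using var reader = await command.ExecuteReaderAsync(CommandBehavior.SequentialAccess);
if (await reader.ReadAsync())
{
    var buffer = new byte[65536];
    long offset = 0, bytesRead;
    while ((bytesRead = reader.GetBytes(0, offset, buffer, 0, buffer.Length)) > 0)
    {
        await output.WriteAsync(buffer, 0, (int) bytesRead);
        offset += bytesRead;
    }
}
```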
Also, as much as I think storing large data in MySQL is silly, in the case of Rebus as a message transport it's quite useful to support data larger than the max packet size, because the point of a message queue is that it's all transient data. It goes in and out really fast, and it makes things so much simpler if it's all just handled in the message structures and APIs, rather than resorting to something like a shared storage block.

The next thing I need to figure out is why the message reading via MySQL is so much slower than the SQL Server version. Alas, we still use MySQL 5.7, and it does not support ignoring locked row-level data like SQL Server does, so their simple approach of deleting the row in the transaction and having other readers simply ignore it can't work; it just results in deadlocks.
The documentation for mysql_stmt_send_long_data suggests this could work: MySqlConnector could detect that a prepared command had a Stream parameter value and send that parameter's data in chunks before executing the statement. The implementation problem is that currently, multiple commands are batched together into one packet. (MySQL doesn't allow multiple statements, e.g. an INSERT followed by a SELECT, to be prepared as a single statement, so MySqlConnector prepares them separately but batches their execution.)
Interesting. So the solution would be to not do the statement batching when a stream is involved. Does MySqlConnector only do the splitting if the statements are prepared?

I just found something interesting. As I mentioned above, I was looking into how to speed up the performance of Rebus.MySqlConnector compared to SQL Server. The MySqlConnector version was able to insert data into the message queue way faster than SQL Server when using the lease-based transport, but SQL Server blew the doors off MySQL when it came to pulling the data back out. Part of the problem is the lack of row-level lock skipping for SELECT statements in MySQL 5.7, but that's not all of it. I tried changing things somewhat to speed it up, but did not succeed. Here are some results:

SQL Server:
*** Using NORMAL SQL transport ***
*** Using LEASE-BASED SQL transport ***

MySqlConnector:
*** Using NORMAL SQL transport ***
*** Using LEASE-BASED SQL transport ***

Now what is super interesting is that I just back-ported the same code to run on the Oracle connector, and it was WAY faster than MySqlConnector for the read operations (not as fast as SQL Server, but significantly faster):

Oracle Connector:
*** Using NORMAL SQL transport ***
*** Using LEASE-BASED SQL transport ***

So clearly from the above, the MySqlConnector version is quite a bit faster than the Oracle version for inserts, but for receiving it is not. The big difference is that this library is written to be async all the way through, hence all of the SQL operations in the library are async calls; as we know, the Oracle connector is not really async at all, it just does sync calls with async semantics. I suspect that is why the inserts are so much faster, since it can send a lot more over due to using async; but for the receives, I wonder if either the async stuff is making it slower, or it's something to do with the statement batching you mentioned above?

The reason I ask is that the receive operation is implemented as multiple SQL statements that all get sent together in the same transaction, designed in such a way that we pull out the next message, then update it to mark it as being processed (to avoid the row-level locking stuff I mentioned), and then select the data out. Very similar to how SQL Server is done, except that with SQL Server it does not mark the message as processing; it deletes it, and the transaction is kept open until the message is successfully consumed (which will not work without row-level lock skipping for SELECT, which is not in 5.7).
So either this statement splitting stuff you mentioned is what is slowing it down, or this particular operation just does not do well with async?
No, I don't think it's the async stuff. In that test I can easily disable all the async so only one operation runs at a time, and that dramatically slowed down the MySqlConnector version as well as the Oracle one, but the Oracle one was still much faster: 9.3 msg/s for MySqlConnector and 626.4 msg/s for Oracle. So quite a bit slower in both cases.

If you are interested in profiling this to see where it's so slow, the test to run is this one:

You need a database user called mysql with the password mysql that has full access to the rebus% databases (and create a schema called rebus2_test).
Well, so much for that theory. If you actually try this, you get the exception "Parameter of prepared statement which is set through mysql_send_long_data() is longer than 'max_allowed_packet' bytes".

Sample code to implement this is here: https://github.com/bgrainger/MySqlConnector/tree/send-long-data

The only benefit this provides is being able to insert a row whose individual columns are each less than max_allowed_packet but whose total size exceeds it.
I am. This library should be faster than Oracle's MySQL Connector/NET for all use cases (with perhaps a few minor exceptions), so I'd like to understand this one.

Which version of MySql.Data were you using? 8.0.22/8.0.23 has a severe performance regression when reading rows, so I'd be very surprised if it were the faster library.
8.0.23 was the version I was testing against. I just grabbed the latest one when I back-ported the code.
Whelp, it's async overhead. Damn, that was a lot more than I expected. I had a sneaking suspicion it was, and it's one reason we have never used async for SQL programming: I think the overheads just pile up when the operations you are doing run so fast. It only makes sense if the MySQL operation is going to take a while.

*** Using NORMAL SQL transport ***

Note that it's still way slower than SQL Server, but it is faster than the Oracle connector. Note however that the insert speed did drop off to about half the performance, but that's because inserts generally take longer, so there is a gain to be had there. But I am leaving it disabled, because the other upside of not doing any async for the SQL transport in Rebus is that you can better guarantee ordering of results on both inserts and reads. But it would be interesting to perhaps do a hybrid, and leave async on for inserts but have it off for reads.
What is interesting, though, is that the SQL Server code I cloned for Rebus was also pure async, and the SQL Server version does not have these problems. It might be well worth taking a closer look at the performance bottlenecks within MySqlConnector when using async, as perhaps that will shed some light on why SQL Server is so much faster here, and perhaps it can be optimized so it's just as fast when doing async? Here is my feature branch with async removed, to compare: https://github.com/kendallb/Rebus.MySqlConnector/tree/feature/remove-async
Thanks, I'll take a look.
Cool, very curious to see what shakes out. In the short term I might do a version that removes the async for receive and leaves it in for inserts, but either way the SQL Server version blows the doors off both MySQL versions. That could just be a SQL Server vs MySQL performance issue itself (not a connector issue), but there could be other things in there. It does make me wonder whether the SQL Server connector is tuned in such a way that async is only used where it makes sense and has a performance advantage, and not just all the time?
I'm not aware of head-to-head tests of those databases, but the TechEmpower Framework Benchmarks use PostgreSQL and MySQL (SQL Server isn't supported). The top 43 results are all PostgreSQL; the fastest MySQL client comes in at 41% of the speed of the top pg client. So there may be bottlenecks in the server itself.
I'm not sure, but usually the point of async is to use async TCP/IP socket I/O (instead of blocking) so that the current thread can be freed up while waiting for a network response; it's not primarily to improve performance. Since (almost) all DB operations involve client/server I/O, that really does suggest using async all the time. (There are minor exceptions.)
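(A trivial illustration — the same round trip either way, but the async form hands the thread back to the pool while the server works:)

```csharp
// Blocking: the calling thread is parked for the entire round trip.
using (var command = new MySqlCommand("SELECT SLEEP(1);", connection))
{
    command.ExecuteScalar();
}

// Async: the thread is freed while the query is in flight, at the cost
// of state-machine/callback overhead on completion.
using (var command = new MySqlCommand("SELECT SLEEP(1);", connection))
{
    await command.ExecuteScalarAsync();
}
```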
For comparison purposes, and to determine what the highest potential performance is, I ran the same test with the in-memory and file system transports:

*** Using Filesystem transport ***
*** Using Memory transport ***

So clearly the memory transport is ridiculously fast, which indicates the bottlenecks are all in the transport layers; the file system transport is super slow to insert (makes sense) and about twice as fast to retrieve as MySQL without async. It does make me wonder how SQL Server is so damn fast, and I suspect the overheads here are MySQL itself. I do not have the query cache enabled in MySQL (for our systems it always just made stuff slower in production), so it's possible that is a big difference between MySQL and SQL Server. So I suspect the upper bound for performance with MySQL is the non-async version, and the key part is figuring out why using async for receive causes it to be so much slower. Ideally we would leave the async there as well and just figure out what is making it slow. Not sure how to profile that myself or I would give it a go.
Oh, I guess they killed the query cache for good anyway :)
I added a console performance test harness app (that runs just that one test) that can be executed under dotTrace. Unfortunately, comparing an async run to a sync run is almost impossible, due to the callbacks. I did update MySqlConnector's benchmarks, and while async adds a slight performance overhead, it's nowhere near 2× worse:

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-10875H CPU 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.200-preview.21079.7
  [Host]     : .NET Core 5.0.3 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
  Job-TQEJBX : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT
  Job-ISFNWE : .NET Core 5.0.3 (CoreCLR 5.0.321.7212, CoreFX 5.0.321.7212), X64 RyuJIT
```
Forgot to add the link: https://github.com/bgrainger/Rebus.MySqlConnector/tree/performance-test
Thanks. I was doing some more testing and noticed that the performance difference is way less if you increase the number of messages, for whatever reason. Testing with 10000 messages it's 232.2 msg/s vs 188 msg/s, so significantly less; at that point the async is adding some overhead, but it's nowhere near as much (about 20% or so). I think I will change it to not bother doing async for the actual receive part of the transport, since that is expected to get in and out as fast as possible. I am also going to check the query EXPLAIN to see if perhaps something is not set up correctly with the way I configured the indexes for the queries in question. I wonder if perhaps an index is not being used when I thought it was...
Ok, well, I didn't have the index set up correctly, so I have improved it slightly; at least the SQL EXPLAIN says it is using the index now, but it did not fix much. Clearly the performance difference between async and non-async drops off as the message queue gets bigger and the overhead of the query starts to play a bigger role, but it is still slower than SQL Server. For now I am leaving async off for the reading code and will play with restructuring the queries a bit. I think the fundamental reason why MySQL is slower than SQL Server in this instance is the way indexing works with timestamp filtering and sorting. It is one of those areas where MySQL seems to fall down, but I don't see any other way to structure the query to be any faster. Might be interesting to see if it's better on MySQL 8.
Ahh, I know what the core problem is. It's the lack of row-level lock skipping in MySQL earlier than 8. When we select out the new message row with the locking `SELECT ... FOR UPDATE` query, even though I fixed the indexing on the tables and split the query up, it's only a minor performance improvement. Instead what is happening is that the 20 async tasks that are attempting to read the next message in the queue are all getting transaction-locked on the query; because it's not possible for MySQL to issue a row lock for just the next row, I believe it's locking the entire table. So it's in essence a serialization point, and only one thread at a time can actually snag the next message from the queue. The fix for MySQL 8 is to implement SKIP LOCKED for the reads: https://dev.mysql.com/doc/refman/8.0/en/innodb-locking-reads.html which I think will then allow other threads to get in and grab the next message. If that actually worked, then doing a full async read might actually be a performance benefit, but until then we may as well leave it out. I have to get MySQL 8 installed and test this theory out.
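(Roughly the shape the receive query would take on MySQL 8 — the real Rebus query and columns differ; this just illustrates the `SKIP LOCKED` idea:)

```csharp
// Claim the next visible message, skipping rows other consumers have locked.
// FOR UPDATE SKIP LOCKED requires MySQL 8.0+.
command.CommandText = @"
SELECT id, headers, body
FROM messages
WHERE visible <= UTC_TIMESTAMP(6)
ORDER BY priority DESC, visible ASC, id ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;";
```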
Well, MySQL 8 is faster overall, but SKIP LOCKED didn't change anything. I am not sure if it's possible with MySQL to avoid a full table lock for that query, and locking is clearly a big part of it. With MySQL 8 I got some lock exceptions on deletes, and when I changed the receive to retry the operation on deadlocks it sped things up quite a bit; before, I would simply return an empty message, which means the message queue goes back to sleep. For whatever reason the difference in async vs non-async with MySQL 8 is larger, probably in part because it's faster to process that query, but it's not a whole lot.

MySQL 8 Async:
MySQL 8 Non-Async:
MySQL 5.7 Async:
MySQL 5.7 Non-Async:

Anyway, at this point I have fixed the issues with MySQL 8, so I'm probably just gonna roll with what I have, which is to not use async for retrieving the messages and deal with MySQL being slow. Still way slower than SQL Server. I think that's just how it works.
Support for Stream parameter values has now been implemented and released.
Awesome, thanks!
Is streaming now supported only by MySqlConnector (in memory), or also by the MySQL wire protocol itself?
It is emulated, not in the wire protocol.
This issue turned into a very wide-ranging discussion, but the original point in #943 (comment) stands: due to the MySQL protocol (length-prefixed packets with a maximum size), it's not possible to stream data when executing a command. As a workaround, use code similar to the following:
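(A sketch of the buffering workaround, assuming `stream` is the source data and `command` is the `MySqlCommand`:)

```csharp
// Copy the source stream into memory, then bind the bytes as the parameter value.
using var memoryStream = new MemoryStream();
await stream.CopyToAsync(memoryStream);
command.Parameters.AddWithValue("@data", memoryStream.ToArray());
```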
This makes it clear that all data must be buffered in memory before MySqlConnector can process it (as opposed to setting a Stream directly as the parameter value, which might imply that the data is streamed to the server).
Working on porting Rebus to MySqlConnector. With SQL Server there is support for streaming from a source directly into the database via a command parameter, but this appears not to be supported with MySqlConnector? Is this a limitation of MySqlConnector, or a limitation of the MySQL wire protocol itself? Streaming the result set works; I just cannot find a way to stream data on insert.
I am using it like this (stolen directly from the SQL Server version):
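(Along these lines — the parameter and column names here are illustrative, following the Rebus SQL Server transport's pattern of binding the body stream directly:)

```csharp
// Bind the message body as a Stream, as the SQL Server transport does with
// SqlDbType.VarBinary; this is the line that threw before Stream support was added.
using var bodyStream = new MemoryStream(message.Body);
command.Parameters.Add("body", MySqlDbType.Blob).Value = bodyStream;
await command.ExecuteNonQueryAsync();
```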