-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New System.Data.Common batching API #28633
Comments
Thanks @roji for putting this together. A couple of points:
|
Typo: Will Is it worth implementing
Add:
Or should it call the same method on the batch command it creates? (I don't understand the benefit of preparing the DbCommand objects in the DbCommandSet, which won't actually be executed themselves.) MySQL implementation thoughts: For the "text" protocol (the default), MySQL already supports all commands in a batch being sent in a single network packet. MySqlConnector would probably end up concatenating all the SQL from the For the "binary" protocol (i.e., prepared commands), this would be a helpful API improvement. MySQL disallows more than one statement being prepared at once, so MySqlConnector has to parse SQL and split commands apart. The design goals of this proposal eliminate that requirement. However, for MySQL, the second command in the batch wouldn't actually be sent until |
@roji, thank you for writing this up. I am excited about the possibilities! Currently, the proposal allows command behavior to be specified once for the entire set of commands. However, command behavior can be command-specific. The same behavior may not apply to all commands that could go into a command set. To accommodate this, what would you think of changing the proposal so that behavior may optionally be specified when commands are added (e.g. |
@divega maybe it makes sense to repurpose dotnet/SqlClient#19 to track the implementation of the new batching API in SqlClient specifically? That seems to be its main focus in any case. Regarding the additional overload accepting an |
Thanks, fixed.
Good question. Most of the ADO.NET base classes implement System.ComponentModel.Component, which implements
Richer access to the commands in the set is one thing I wanted to get feedback on. I'm not sure it makes sense for the command set to expose full list semantics - what's your view? Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with
Are you referring only to the shim, or to the DbCommandSet spec in general? For the latter, I'm not sure this makes much sense - what value does it provide over clearly specifying that they are ignored in batched execution? In addition, setting
You're right, this was a mistake. In theory it's possible to prepare the batch command as you suggest, and reuse it again with different parameter values. The issue is how to invalidate if any of the user-provided commands are modified (e.g. change in Thanks for your comments on the MySQL provider. To be sure I'm understanding things correctly, here are a few questions:
|
Most of the value on CommandBehavior seem like they make sense only at the batch level - CloseConnection, SingleResult, SingleRow, SequentialAccess (since a single reader is returned, and would not likely support switching between sequential and non-sequential between commands across providers). A good way to think about CommandBehavior is that actually affects the reader, rather than the command - it's a parameter to |
In favour of implementing the standard Dispose pattern for consistency. I have no experience with
I believe you are mistaken; see docs. Perhaps unfortunately, this does immediately imply access to the commands in the collection, which feels like a distraction from the point of this class. It does already have It doesn't feel like the point of
Yes, I was suggesting that you specify the behaviour of the shim implementation w.r.t. those properties. I think the shim should do something with them when creating its batch command, and this seems like the right thing to do.
No, they are not. MySqlConnector parses the CommandText, finds
Yes, through a convoluted mechanism. Binary protocol is implied by calling
Correct. Each individual statement is given its own (integral) statement ID, and an "execute statement" command specifies a single statement ID. The client prepares all the statements in advance, then sends "execute statement ID 1", then reads the result set, then sends "execute statement ID 2", then reads its result set, etc. Some MySQL Servers will permit pipelining, i.e., the client sending one network packet that contains multiple "execute statement ID n" payloads, then the server reading those one-at-a-time and processing them in order. However, there are a wide variety of proxies and backends (Aurora, Azure, TiDB, Percona, etc.) that speak the MySQL protocol and pipelining tends to break them. (It could possibly be added as an opt-in option for people who needed maximum performance and were confident their DB supported it.)
I have not found this to be true in practice. I thought I had once found documentation confirming that it's not supported but don't remember where that was. But that's a good point that some future MySQL Server might permit it. |
@roji no problem. I added what I described on my emails as option 2, since it is a variation of option 1.
Ok, done. |
Sorry, my bad - you're right of course. That's what comes out of sloppy testing at 3AM.
I agree with most of what you say, although I do think there's real value in providing collection initialization - maybe a good mid-way solution is to implement
OK, I understand, but I don't see why the shim should do anything with these properties - can you explain why you think that's important? The shim would obviously set those properties on the internally-used batch command it creates (since that's the one that actually executes), but why the others? Another possible issue is that it would be somewhat incoherent to have the shim do it, but not to require the same of actual provider implementations, and I don't really see the sense in that either...
I see, so if I understand correctly you support formatting and parsing parameter values both in text (for unprepared) and in binary (for prepared)... Npgsql used to work the same way a long while ago. Note also #25022 which I plan to tackle soon, and which may also obviate the need to parse for parameter substitution/interpolation. The ultimate goal here is to allow you to write your driver with any sort of parsing/rewriting - hopefully that can work out for you. If there are any other reasons you parse/rewrite (aside from batching and parameter substitution), could you post a note about them?
Thanks for this information - this kind of cross-provider/cross-database information is very important. In any case, it's unfortunate that MySQL imposes these restrictions...
OK. IMHO batching is important enough for perf that it may make sense to provide an opt-in for this, although it's true that you also have the text protocol that already supports this. Regarding special backends and their capabilities, in the PostgreSQL case it's sometimes possible for the driver to detect this and automatically switch off/on certain features, maybe it's possible in your case as well (just thinking out loud). Thanks again for all the detail and the valuable input into this - please don't hesitate to share any other thoughts on this proposal etc. |
Looks great :) For the shim: e.g. DB2 requires BEGIN/END around batched statements, so the shim itself needs to consult the ADO.NET provider whether it can simply concat the statements or needs to do additional things. Some databases don't support select statements batched together with update statements for instance, so your shim should anticipate on that, consulting the ADO.NET provider. Some databases don't support batching at all (access/oledb) so for these, the shim should simply execute them one by one. When concatenating statements, you obviously run into the problem of parameters with the same name. You therefore have to make the parameter names unique. This can be problematic as it requires you to parse the statement (as far as I can see). Most ADO.NET providers have no real limit on the # of parameters except... SQL Server! :) So you have to consult the ADO.NET provider there too in the shim how much statements you can batch in 1 go automatically based on the # of parameters. I don't think it's that important to offer a 'batch size' value for the shim. This is for .NET Core x.y only? Or will this be available on .net full as well? |
I wasn't planning on the shim actually implementing all the different behaviors, but it's an interesting thought... Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?
Yes, I noted this problem above in the proposal - my plan is currently to throw in this case. It definitely seems like a non-starter for the shim to parse SQL to uniquify parameter names (especially as different databases have different SQL dialects etc.). This is after all only a backwards-compat shim, and it seems reasonable for it not to work for some edge cases - of course provider implementations of DbCommandSet can do what they want.
That's a good point. I'd expect this kind of thing to be managed as an implementation detail by SqlClient's future implementation of DbCommandSet (as mentioned above, SqlClient actually has an internal SqlCommandSet which actually isn't publicly accessible). I do agree that we shouldn't concern ourselves with this at the general API level. Of course it's also possible for the user to execute two batches instead of one in certain situation (not that passing this onto the user is a good idea).
I don't have a definite answer for this yet, but as a general rule new APIs get added to .NET Core (or rather to .NET Standard) and not necessarily to .NET Framework. |
yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I recon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported. I likely overlooked it, but if you don't have this yet, I think if there are commands added to the DbCommandSet which are assigned to different connections it should throw. |
If I'm understanding this correctly, when a command set returns multiple result sets, the first result set is accessed by calling It's possible (at least in the SQL Server world) for a single command to return zero, one or multiple result sets--and the number returned can dynamically vary (e.g. a stored procedure might be programmed to return one result set under certain conditions and two under others). Making sense out of the returned result sets in a situation like this can get tricky/be impossible without knowing which set came from which command. For example, let's say I execute a command set consisting of two commands, both of which call StoredProcA. This proc returns zero, one or two result sets depending on circumstances. If executing the command set returns four result sets, I know that the first two correspond with the first call to StoredProcA and the last two correspond with the second call. However, if, say, only two total result sets are returned, sorting out which came from where is a bit trickier--did they both come from the first proc invocation, the second invocation or did one come from each? To accommodate situations like this, is something like the following an option?
With an API like this, the developer passes in a set of commands then fetches out a set of command results--the granularity passed in is the same as what is returned. It also eliminates the need to figure out how to return a combined records effected count (because that number is available on a per-command basis) and potentially simplifies the exception situation (an exception associated with a command's results can be thrown while that command's results are being worked with, eliminating the need to figure out how to identify how to tag the exception to indicate which command triggered it). |
This is critical for ORMs. In general, this new API should expose all the raw power that is there. It should not drop information or limit the possible actions. This API is for consumers that maximally care about performance and flexibility. The Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well. Is the timeout considered global or per command? This was not clearly stated as far as I can tell. IMO, it should be a global timeout. Most of the time, you want a timeout for the entire business transactions. If you expect creating a customer order takes 100ms then you want to abort if the entire thing takes e.g. 10s. You don't care about the duration of individual queries at all. Feel free to close my now superseded issue https://github.com/dotnet/corefx/issues/31070. I am very happy to see progress on this. This is going to a major performance improvement. |
The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support. Providers which do simply have their own implementation of DbCommandSet which does not derive from the shim - they can do whatever it is they want. Thus, DB2 would automatically add BEGIN/END, SqlClient would split the batch when there are too many parameters, etc. This raises the question of how desirable it is for the shim to start implementing provider-specific logic, which really belongs in the provider implementations. Some basic functionality may make sense: knowing that some specific providers simply don't support batching, and executing the command sequentially could be useful for the transitional period. But ultimately the expectation is for providers to provide their own implementation, totally unrelated to the shim.
The idea was simply to ignore the Connection property on the commands in the set, as well as the Transaction and Timeout properties - only those on the DbCommandSet are used. |
Maybe the provider should be able to tell if it does support batching and other parameters such as what the maximum efficient number of commands and parameters is (before splitting needs to happen). Callers such as EF could interrogate the provider and adapt their behavior. We'd need to identify all the common but relevant properties in which providers differ. |
We're in agreement, the question is really about how best to expose the non-aggregated rows affected.
That may be true in the general case, but don't forget that this is an API abstraction that's also used to contact local and even in-memory databases such as Sqlite. We've also seen that when fetching cached data from PostgreSQL over a low-latency 10Gbps network link, even minor-seeming allocations start to have an effect in the TechEmpower benchmarks. I agree that these cases may be somewhat artificial and not related to real-world usage, but I'm generally averse to introducing APIs that necessarily cause allocations if there's another way.
I agree - and per-command timeouts are probably not technically possible in at least some (or even most) of the providers. There's a note on all this in the proposal above - CommandTimeout is ignored on the individual commands (like Connection and Transaction), and DbCommandSet has its own
This is of course very provider-specific; the general batching API doesn't define any mandatory behavior but definitely should not get in the way either (there's already a note in the proposal above). I personally think that it makes more sense to abort the batch once an error occurs, but if continuing is the desired behavior for a specific provider (e.g. SQL Server), then it could throw an aggregate exception containing all the individual exceptions at the end of batch execution. At the moment I don't see anything in the proposal that needs to be changed, let me know if you see a problem.
As @divega wrote on that issue, it's still useful to keep it to track the implementation of the SqlClient DbCommandSet - presumably the existing internal SqlCommandSet would be changed to conform to the new API being shaped here. So I think we should keep it open for now.
Great :) Thanks for your own analysis and description in dotnet/SqlClient#19, it helped push the realization that a true batching API really is necessary.
This is similar to the point @FransBouma made above. The way I'm looking at it, the API consumer (e.g. EF Core) should not be concerned with such implementation details - they should simply construct a single big batch and execute it. If a specific provider's batching implementation considers that the batch should be split - either because it must be (e.g. too many parameters) or because it would run more efficiently that way - then it would do so internally in a transparent way. Unless I'm mistaken, for SQL Server there's even a threshold below which batching isn't efficient at all (see dotnet/efcore#9270). The SqlClient implementation would simply not batch at all for these cases. There may be some details which make it impossible or undesirable to totally abstract this away from the user - possibly around exception handling - but at the moment I can't see a reason not to. |
Ok, though if the shim fails, it's of no use to the user, so the user should be aware of the limits up front. E.g. if DB2 doesn't yet have their own implementation (and thus falls back on the shim, like any other provider) it will fail. This might be OK, but the user might be confused: "I use the default one and it breaks". So documentation might be in order that using the shim likely doesn't work in some cases. Be aware too that mixing select commands and update commands might not always work with all db's. So if you concatenate statements in the shim, the user might run into this limitation too. I think documentation is enough here, as it breaks at runtime and it's not determinable up front if it will.
yeah this is a bit of a tough call: concatenating statements and then not uniquing the parameter names already makes it fail on most systems, so if you don't do that, the user has to do that, and they likely already take other precautions (like group inserts together in 1 shim instance, updates in another etc.). Regarding the values returned by ExecuteNonQuery(Async) which uses the concatenated query command: There is a trick to use here which allows us to determine per update statement whether it affected 0 or more rows: append So for each query concatenated you append a unique parameter and a unique snippet. A bit overkill (I decided not to go for this in my framework, as when 1 query fails, e.g. in the middle, the whole batch is rolled back but you likely have to rerun the whole batch anyway as that's way easier than to sift through which command is the wrong one and why, then schedule the batch again without the ones already succeeded). |
That's correct. Note that this is how the ADO.NET API already works when you use concatenation-based batching.
That's a very valid point - the scenario of a single command returning a variable number of resultsets is indeed tricky, even if it seems like quite an edge case. Your proposal for a CommandSetResults (or some version thereof) could help, but IMHO this approach has the following drawbacks:
In addition, even without this something like this, there are some workarounds which can allow users to distinguish which resultset belongs to which command: one could look at the shape of the resultset (which columns it has) to determine which stored procedure generated it, or one could even insert an arbitrary To summarize, I think the concern you raise is valid. However, in my opinion it's too much of a rare case to consider (significantly) complicating the API surface for the general, simple case - especiall when valid workarounds seem to exist. |
Agreed. Another approach would be for the shim to contain a hard-coded list of the specific providers which support cancatenation-based batching, and only do it for them; for other providers it would fall back to non-batched execution. This would ensure that the shim always works, although it doesn't always actually batch. A more extreme approach would be to give up entirely on actually batching in the shim, and always execute in non-batched fashion (this could make sense if we think that the duplicate parameter name problem is going to occur a lot - not sure about this).
Absolutely - this is a database-specific detail that we shouldn't make any attempt to handle. Documentation will be important here.
That's true - I think we should do our reasonable best, but rewriting user-provided SQL of unknown dialects seems like it's a non-starter. Here as well, documentation will hopefully help people out, and providers will hopefully implement their own DbCommandSet to obviate the shim entirely.
Yes, this is similar to what EF Core does for SQL Server in order to implement optimistic concurrency, which requires knowing for each update if it worked or not. This is exactly the kind of details that I'd expect the SqlDbCommandBatch to take care of internally, abstracting away the task from O/RMs and users. By the way, PostgreSQL provides this information as part of the wire protocol, so no additional hacky queries are needed. |
Thanks @roji. It seems I overlooked a few statements in this long thread.
These details do not matter for semantics but for performance. The caller can adjust its behavior in ways that the provider cannot. For example, the provider does not understand the nature of the queries and the network latency environment. EF might have 1 billion inserts queued but it will likely not want to create 1 billion command objects. So some moderation/negotiation on the number of commands per set has to take place in any case. If EF could query the maximum command and parameter count it could directly materialize the perfect number of commands. I believe quite strongly, that the provider should never try to optimize the batch size or suppress batching in some cases. It's the wrong place to make these decisions. The EF issue does not state what mechanism of batching was used. Maybe just concatenated SQL which has different properties than command sets (most significantly, recompilation). If command sets ever are slower I'd like to know why. What specific property (of the protocol?) caused that? It would be surprising to me. A concatenation based shim would not work for statements that must be the single statement in the batch e.g. Maybe the shim should be the safest option possible and all optimizations should be left to the provider. |
I think we're saying the exact same thing here; I may have just worded it confusingly. I am not proposing that All I'm suggesting is adding to the spec a bullet point that explains how the shim will set the properties on the internally-created batch command. Your spec mentions how |
@bgrainger ahh, understood! I indeed thought you wanted the shim to modify the properties on the user-provided commands. Added a bullet point as suggested. |
I'd like to understand this point a bit better - it also seems to contradict somewhat with your comment above that proposes the provider expose the maximum efficient number of commands/parameters. After all, if the provider can expose this knowledge, why shouldn't it just act on it and remove the burden from the user? But more importantly, I'm interested in exactly what kind of user knowledge could be useful for determining batch sizes/thresholds, that the provider does not have. Shouldn't batching always be desirable regardless of the network latency environment? How would the nature of the query impact the decision-making here? Some real-world examples could be useful on this. Note that the user would always have the option of creating multiple DbCommandSets, according to whatever criteria they see fit. The question here is whether the provider implementation should make its own decisions on how to execute the commands given by the user. Don't forget we have some very clear provider-specific cases, like the max parameter limit per batch in the SQL Server case - do you really think this kind of thing should be the user's responsibility rather than the provider's?
That's very possible - I was also very surprised that batching two statements could actually be slower than not batching them, and was explained that it's related to overhead involved in batching. @divega and/or @AndriySvyryd may be able to provide more info.
Regardless of the above, if there really are many commands to execute, EF may decide to execute them in several batches - regardless of anything provider- or environment-specific, i.e. as a means of avoiding instantiation of lots of command instances. I'm not sure this is actually necessary, but the point is that it's a different kind of batch-breaking-down as the one discussed above. |
@GSPP one more comment - note that whether a particular implementation of DbCommandBatch splits the batch or not is entirely internal and provider-specific. We can of course publish guidelines and recommendations, but at the end of the day each provider will do whatever it is they think is necessary - in other words, it's not part of the API itself. |
@GSPP missed this last comment of yours:
Would it work in a non-concatenation-based batch? If not, then that seems OK - it's simply a limitation of SQL Server for both forms of batching. If
I assume that by renaming parameters you mean changing their names so that they can be unique across all commands in the batch. If so, then I agree that this isn't viable, if only for practical reasons - we're not going to be able build an SQL parser into the shim that can detect placeholders which aren't part of SQL literals etc.
That seems to point towards the shim executing the commands without any sort of batching, as proposed above in the discussion with @FransBouma. I agree there's some merit in this... |
Is there going to be an considerations for T-SQLs Also, how will the different providers hand scope? For instance if I do the following:
If this were two commands, what happens to the scope of Could an API be provided to accept a SQL statement and return a command batch? Is there any merit in such an API? |
I concur with the idea of allowing execution to continue on error when the server supports this. However, I think it's also important to have the ability to know about errors as they are encountered vs. after the entire batch has been executed. Currently, with ADO.Net (at least, the SQL Server implementation), error exceptions are raised when the error is read from the incoming data stream. The consumer can then respond/react to the exception, as appropriate. With this API as it currently stands, if a consumer is reading a result set that was not fully provided by the server due to an error, it sounds like the result set will look like a valid, complete result set. The consumer won't be able rely on that actually being the case until the entire set of commands has finishes executing. If the consumer wants to perform an irrevocable action on a per-data set basis, the consumer will need to cache all data sets as they are received while they wait for execution to complete and the possible aggregate exception before taking action. If, instead, errors are raised as they are encountered, the consumer knows that when they read a complete data set, it is actually complete, and so can be actioned on immediately. If a related error is encountered during reading, the consumer is informed and gets to decide whether or not to continue on (e.g. move to the next data reader). Another benefit of immediate exception on error is that, if the consumer decides that an error warrants cancelling the command set's execution, they can close the connection (perhaps automatically as the exception exits a |
I don't know, if the error isn't a transient error, the current transaction is likely to be aborted on the server side, meaning whatever you decide to do, you have to run the previous commands again as they've been rolled back due to the aborted transaction. |
Not pushing. |
@expcat still in the plan for 6.0 - this is very high up on my priority list. |
I've just submitted #54467, apologies to all involved in the original discussions that this took so long. I've more or less implemented this proposal for Npgsql in npgsql/npgsql#3860, giving us further confidence that the API shape is correct (beyond @bgrainger's implementation in MySqlConnector and @Wraith2 experimental implementation in SqlClient). Note that we ideally need to merge this in time for preview6 (in three weeks) to ensure it goes in for .NET 6.0. Following my implementation, I'm proposing the following two tiny changes to the API proposal:
|
One additional, possibly more controversial thing... Our proposal allows specifying different command behaviors for different commands in the batch. Looking at this now, it seems to introduce quite a bit of complexity, as well as API inconsistency with DbCommand, for what would very likely be a very rare edge case. I think we should consider reverting to the DbCommand API of accepting CommandBehavior in ExecuteReader. Some objections:
Here are some notes per value of CommandBehavior: CloseConnectionThis makes no sense in batching, except possibly in the very last DbBatchCommand. We'd have to define the behavior for when it's specify in non-last commands (ignore? apply after the batch?). This value in itself is a an argument for not allowing per-command behavior. SchemaOnlyWanting to mix real command execution (non-SchemaOnly) with SchemaOnly in a batch seems very exotic. You typically either execute or retrieve the schema, but not both in the same batch. KeyInfoThis is used to receive extra metadata information when using SchemaOnly. Wanting to get this extra information for some queries and absolutely not wanting it for others doesn't seem useful. SequentialAccessThis is the one option which seems remotely plausible - you want to execute multiple commands, one of which has a huge column and should be traversed sequentially. Without per-command behavior, you can still process the entire batch sequentially - there isn't a big downside of that (aside from not being able to access result columns in random fashion). SingleResult and SingleRowAs far as I know, these are mainly useful in cases where a stored procedure is being executed, but only the first resultset or row of that sproc is needed, and it's not possible to modify the sproc to accept a parameter managing that. It's possible this could be useful when executing multiple sprocs in a batch, but overall it seems like quite the corner case. |
In MySqlConnector, MySqlDataReader does have a (single) Per-command No other |
FWIW, the only feature request MySqlConnector has had for It does seem like the easier solution would be to add In MySql.Data (and now MySqlConnector), this triggers an optimisation that sends a hint to MySQL that only one row should be returned, which can result in a dramatic speed-up in query execution. I can still see the value of this on a per-command basis. |
Hmm...would it help if the per-batch command CommandBehavior behaved as an override which is applied only if it is set? If it is not set then the CommandBehavior from the parent DbCommand would be used (so the parent DbCommand's CommandBehavior applies to all commands in the batch except for any commands whose CommandBehavior is set directly). |
@bgribaudo this was briefly discussed in #28633. The problem is that at that point we need a distinction between set-to-default and no-set-at-all, e.g. making behavior nullable on the individual batch commands. While this is possible, it introduces even further complexity for what is really edge-casey. @bgrainger thanks for the added info. I can also see scenarios where per-command behavior could be useful, but I'm not sure they're useful enough (and often enough) when weighed against the disadvantages (i.e. consistency with the DbCommand API, having to set on all commands, etc.). It's also important to remember that users don't always have to batch everything in every case - there's always the option of not batching for the specific edge case which requires different behaviors; the downside is only an extra roundtrip, rather than blocking some functionality etc. But I'm definitely OK with going with what people think here. |
This is a good counterpoint. It also seems plausible to me that some database backends might not even support customising the behaviour on a per-command basis (in order to provide high-efficiency batching). So there's a question of whether the API should expose something that may not be feasible for some providers. ( I have no objection to moving the |
I haven't investigated from the SqlClient perspective so I don't know if any can be used. |
Looking at the CommandBehavior question again and after some online consultations, I've removed CommandBehavior from DbBatchCommand and added it as a parameter to DbBatch.ExecuteReader (and async). Summary:
|
This proposal has been approved in principle, subject to confirmation by several database providers that the API shape is good.
Issues tracking provider implementations:
This issue is based on previous discussions in #15375 and #17445.
Background
Batching multiple SQL statements can be critical for database performance, as they can be executed in a single roundtrip rather than waiting for each statement to complete before sending the next one. System.Data.Common doesn't currently provide a first-class API for this, but some providers (e.g. SqlClient) allow batching statements by concatenating them together into DbCommand.CommandText, separated by semicolons:
The problem with this approach, is that most databases require separate protocol messages for each statement (e.g. PostgreSQL, MySQL), forcing the database's ADO.NET provider to parse the SQL client-side and to split on semicolons. This is both unreliable (parsing SQL is difficult) and bad for performance - ideally an ADO.NET provider should simply forward user-provided SQL to the database, without parsing or rewriting it (see #25022 for a related issue forcing providers to parse SQL).
As a specific case, SQL Server does natively support the multi-statement commands, but even there this has drawbacks for performance, detailed by @GSPP in dotnet/SqlClient#19. Also, as @bgribaudo pointed out, the current concatenation-based approach doesn't allow invoking multiple stored procedures in batching fashion. SqlClient does include SqlCommandSet, which performs efficient, non-concatenation-based batching, but that class is internal and currently usable only via DbDataSet, and not for general usage (NHibernate apparently accesses this via reflection). This proposal would allow exposing SqlCommandSet via a public API.
Goals
Proposed API
General usage and examples
Usage is fairly trivial and aligned with the existing DbCommand APIs. Users first create a new DbBatch, either by calling
DbCommandFactory.CreateBatch()
, or by instantiating one directly. Commands are added into the batch, execution properties (e.g.Timeout
) are set on it, and finally on of theExecute*()
methods are called on it.Connection
,Transaction
andTimeout
are specified on the DbBatch, like they are set on DbCommand for un-batched operations.Here is a code sample using DbProviderFactory for database portability:
The verboseness of this API corresponds to how ADO.NET currently looks. General usability improvements are planned for later.
Here is a suggested code sample working against a specific provider and using initializers for better readability and terseness:
Design and naming motivations
The original proposal had the batch holding a list of DbCommand instead of the new DbBatchCommand type. The change, originally proposed by @bgribaudo, was motivated by the following reasons:
The name DbBatchCommand will hopefully convey the similarity between that type and DbCommand (both hold SQL, parameters and a CommandType), while at the same time keeping them distinct (DbBatchCommand isn't executable outside of a DbBatch). The name DbStatement was also considered, although it seems this would introduce more confusion:
Affected rows
DbCommand has APIs for finding out how many rows were affected by the command:
ExecuteNonQuery[Async]()
returns an intDbDataReader.RecordsAffected
propertyHowever, both of these provide an aggregate number; if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately.
Providing non-aggregate affected rows could be done via an overload of
ExecuteNonQuery[Async]()
which returns anint[]
, or which accepts a user-providedint[]
and populates it. The approaches have their complexities (perf, array management...), and as we're introducing a new DbBatchCommand dedicated to batching, we propose to simply add aRecordsAffected
property on it instead. The provider would populate this property for each DbBatchCommand to indicate how many rows were affected by that command.As with DbDataReader.RecordsAffected, this property would contain -1 for SELECT statements, and 0 if no rows were affected or the statement failed.
Command behavior
DbCommand.ExecuteReader()
accepts aCommandBehavior
enum which allows execution behavior to be tweaked. Even if it seems to be a rare case, users may need to specify different behaviors for different batch commands. As a result,DbBatch.ExecuteReader()
does not accept a behavior parameter, and DbBatchCommand has a CommandBehavior property instead.This requires the user to set any non-default behavior on each and every batch command. A batch-wide default could be accepted by
DbBatch.ExecuteReader()
, but that would require a way to distinguish between the batch default and the enum's default value (CommandBehavior.Default
), e.g. by making it nullable on the DbCommandBatch. The introduced complexity doesn't seem to be worth it.Note that not all providers will support all command behaviors on batched commands - it is expected that in some cases the entire batch will have to share the same behavior. Also, CommandBehavior.CloseConnection makes no sense on batched commands except the last, and the provider should probably throw on this.
DbException.Command
Since we now execute batches, we should provide a way of letting users know which command in the batch triggered an exception. We propose to do this by introducing a new virtual property on DbException, called
BatchCommand
, pointing to the relevant DbBatchCommand instance. In non-batched command execution this property would be left unassigned.Notes on batch size management
In some cases, there may be some value in having the provider's batching implementation implicitly manage batch sizes. Examples:
The pro here is to free the user from knowing low-level, database-specific details, transparently increasing perf and reducing friction for everyone (we're assuming the user has no special knowledge which would assist in the decision-making). The con here is more complexity and a general uncertainty that this is needed - any examples from the community would be helpful.
Regardless, this can be viewed as an implementation detail of a provider; although we can provide guidelines, at the end of the day each provider will implement the behavior they desire.
Backwards-compatibility
As a new API, no breakage is being introduced, but it has been proposed to provide a backwards compatibility shim for the new API that either performs concatenation-based batching, or simply executes without batching at all.
As of now, the decision is to not provide any sort of shim: trying to use the new batching API with providers that haven't implemented support for it will throw a NotSupportedException. The reasons for this are:
Feature detection
As the API will throw for providers which don't implement it, a way must be provided for consumers to know whether the API is supported. The proposal is to add a
CanCreateBatch
bool property to both DbProviderFactory and DbConnection, alongside theCreateCommandSet()
methods.Additional Notes
DbBatch.Add(string commandText)
, which would create the CommandText internally and be a lighter API, but some design issues need to resolved around parameter management. We're deferring this for now as it's non-critical sugar.Open questions
Edit history
IEnumerable<DbCommand>
and includes the twoGetEnumerator()
methods, as per @bgrainger's suggestionDbConnection.CreateCommandSet()
and renamedDbProviderFactory.CreateCommandSet()
(it previously had a Db which doesn't belong). AddedCanCreateCommandSet
property to DbProviderFactory, DbConnection.DbConnection.CreateCommandSet()
along protected virtualCreateDbCommandSet()
IEnumerable<DbCommand>
itself.CommandBehavior
to DbBatchCommand and removed thecommandBehavior
parameter toDbBatch.ExecuteReader()
; added a section to explain. AddedRecordsAffected
to DbBatchCommand and updated the affected rows section.IList<DbBatchCommand>
withDbBatchCommandCollection
.The text was updated successfully, but these errors were encountered: