This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

SqlClient managed networking improvements #35363

Merged
merged 10 commits into from
Jun 4, 2019

Conversation

Wraith2
Contributor

@Wraith2 Wraith2 commented Feb 15, 2019

While profiling SqlClient for various other PRs it became clear that the managed implementation was wasting a lot of resources in networking, and that this was severely degrading performance relative to the native P/Invoke version. This PR works towards closing that gap.

The core of the changes starts in SNIPacket. It has had a start field added, which makes it possible to reserve space at the start of the packet for an additional header. The MARS SMUX function can now take advantage of this reservation, removing the need to allocate a new packet and data array before sending. As a result of the header functionality it is no longer possible to cache and re-use a single attention packet.
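The header reservation described above can be sketched roughly as follows. This is a minimal illustration, not the actual SNIPacket code: the type `HeaderReservingPacket` and all of its members are hypothetical names invented for this example. The point is only that, once room for the header exists at the front of the buffer, the multiplexing layer can write its header in place and send header and payload as one contiguous span.

```csharp
using System;

// Hypothetical illustration only; not the actual SNIPacket type.
sealed class HeaderReservingPacket
{
    private readonly byte[] _buffer;
    private readonly int _headerSize; // bytes reserved at the front of the buffer
    private int _dataLength;          // payload bytes written so far

    public HeaderReservingPacket(int headerSize, int dataCapacity)
    {
        _headerSize = headerSize;
        _buffer = new byte[headerSize + dataCapacity];
    }

    public int DataLength => _dataLength;

    // Payload is appended after the reserved region.
    public void AppendData(ReadOnlySpan<byte> data)
    {
        data.CopyTo(_buffer.AsSpan(_headerSize + _dataLength));
        _dataLength += data.Length;
    }

    // The multiplexing layer fills the reserved region in place,
    // so no second packet or data array is needed.
    public void WriteHeader(ReadOnlySpan<byte> header)
    {
        if (header.Length != _headerSize)
        {
            throw new ArgumentException("Header must fill the reserved space exactly.", nameof(header));
        }
        header.CopyTo(_buffer);
    }

    // Header and payload are contiguous, so a single write sends both.
    public ReadOnlySpan<byte> GetWireBytes() =>
        new ReadOnlySpan<byte>(_buffer, 0, _headerSize + _dataLength);
}
```

A caller would append the payload first, then let the multiplexing layer call `WriteHeader` just before sending `GetWireBytes()`.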

The lifetime management of packets has been tightened and calls to Release added to avoid dropping packets and their rented buffers for the GC to deal with. The packet has had its IDisposable and IEquatable implementations removed because they are not needed, and the explicit call to Release over Dispose makes the lifetime management much clearer in the code. As a result of the clearer packet management there is now no need to clone a packet before sending, which avoids further re-allocation. This changes the number of packet copies from 2 to 1 in standard cases and, with the SMUX reservation, from 3 to 1 for MARS connections.
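As a rough sketch of what explicit release of a rented buffer can look like, assuming the buffer comes from `ArrayPool<byte>.Shared` (the PR text only says the buffers are rented): the type `RentedPacket` and its members are hypothetical and are not taken from the SqlClient source.

```csharp
using System;
using System.Buffers;

// Hypothetical illustration only; not the actual SNIPacket type.
sealed class RentedPacket
{
    private byte[] _buffer;

    public RentedPacket(int minimumSize)
    {
        // Rent may hand back an array larger than requested.
        _buffer = ArrayPool<byte>.Shared.Rent(minimumSize);
    }

    public bool IsReleased => _buffer == null;

    // Accessing the data after Release is a bug, so fail loudly.
    public Span<byte> Data =>
        _buffer ?? throw new InvalidOperationException("Packet used after Release.");

    // Explicit, idempotent release: the buffer goes back to the pool
    // instead of being left for the GC to collect.
    public void Release()
    {
        byte[] buffer = _buffer;
        if (buffer != null)
        {
            _buffer = null;
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Every code path that finishes with a packet is then expected to call `Release()` exactly where ownership ends, which is what makes the lifetime visible when reading the code.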

The functional and manual tests run through successfully.

Benchmark results are overall positive:

| Method | Mean | Error | StdDev | Gen 0/1k Op | Gen 1/1k Op | Gen 2/1k Op | Allocated Memory/Op |
|---|---|---|---|---|---|---|---|
| master native | 75.37 ms | 0.5611 ms | 0.5249 ms | - | - | - | 273.97 KB |
| master managed | 77.81 ms | 0.7198 ms | 0.6733 ms | 7000.0000 | - | - | 318.35 KB |
| branch managed | 74.07 ms | 0.2619 ms | 0.2322 ms | - | - | - | 298.19 KB |

Those results make the managed version look very similar to the native one in performance, but this isn't really the case. Profiles show the improvement in a slightly more nuanced way.

Baseline, master native: very little GC activity and the test takes 10s
(profiler screenshot: master native)

Current, master managed: huge amounts of GC activity, a gen0 roughly every 16ms, the test takes 13s
(profiler screenshot: master managed)

Improved, branch managed: much less GC activity and the test takes 12s
(profiler screenshot: smux managed)

So looking at it optimistically the memory improvements pull back roughly a third of the speed difference between managed and native.

/cc @afsanehr, @tarikulsabbir, @David-Engel, @saurabh500

@karelz
Member

karelz commented Mar 4, 2019

@afsanehr @tarikulsabbir @Gary-Zh @David-Engel can you please look at this one too? Also 2 weeks old without code review :(

@AfsanehR-zz
Contributor

@karelz @Wraith2 We are working towards an internal deadline and we will start reviewing these PRs starting next week.

@AfsanehR-zz AfsanehR-zz added this to the 3.0 milestone Mar 5, 2019
@karelz
Member

karelz commented Mar 18, 2019

@afsanehr any update on review ETA?

@Gary-Zh
Contributor

Gary-Zh commented Mar 18, 2019

@karelz We are reviewing this PR.

@@ -84,6 +84,8 @@ internal abstract class SNIHandle
/// </summary>
public abstract Guid ConnectionId { get; }

public virtual bool SMUXEnabled => false;
Contributor

Do we want to make it public?

Contributor Author

It isn't public. It's a public member on an internal class, so it's effectively private to external view. It is, however, part of the public surface inside the assembly.
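The accessibility point can be illustrated with a small sketch. Only the `SMUXEnabled` name comes from the diff above; `HandleBase` and `MarsHandle` are hypothetical stand-ins for the real handle types. A `public` member can never be more visible than its containing type, so on an `internal` class it is reachable only from within the same assembly.

```csharp
using System;

// Hypothetical types for illustration; only SMUXEnabled is taken from the diff.
internal abstract class HandleBase
{
    // Declared public, but effectively internal because the class is internal.
    public virtual bool SMUXEnabled => false;
}

internal sealed class MarsHandle : HandleBase
{
    // An override must keep the same declared accessibility as the base member.
    public override bool SMUXEnabled => true;
}
```

Code in another assembly cannot name `HandleBase` at all, so it cannot reach the property either, while code inside the assembly can use it freely.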

/// </summary>
/// <param name="obj"></param>
/// <returns>true if equal</returns>
public override bool Equals(object obj)
Contributor

What's the purpose of removing these public functions?

Contributor Author

Packets are never compared to each other. If in further work they are compared it will be in a pool and the reference equality will be all that is needed. As such these methods aren't needed or used currently.

Contributor

I suggest keeping what we had before, even if it's not used currently.
There should be a reason why they were added.

Contributor Author

They were originally added with the intention of using them when pooling managed packets. The packet pooling was never fully written and integrated so they weren't used. See the removed WritePacketCache code in TdsParserStateObjectManaged. Are you sure you want dead and trivial code keeping?

@Gary-Zh
Contributor

Gary-Zh commented Mar 27, 2019

@Wraith2 I've run the EFCore tests on Windows and some of the test cases failed due to server timeout.
Here's the stack trace:

System.Data.SqlClient.SqlException : Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.\r\n---- System.ComponentModel.Win32Exception : The wait operation timed out.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteScalar()
   at Microsoft.EntityFrameworkCore.SqlServer.Scaffolding.Internal.SqlServerDatabaseModelFactory.GetDefaultSchema(DbConnection connection) in D:\dotnetcore\EFCoreTest\src\EFCore.SqlServer\Scaffolding\Internal\SqlServerDatabaseModelFactory.cs:line 183
   at Microsoft.EntityFrameworkCore.SqlServer.Scaffolding.Internal.SqlServerDatabaseModelFactory.Create(DbConnection connection, DatabaseModelFactoryOptions options) in D:\dotnetcore\EFCoreTest\src\EFCore.SqlServer\Scaffolding\Internal\SqlServerDatabaseModelFactory.cs:line 121
   at Microsoft.EntityFrameworkCore.TestUtilities.RelationalDatabaseCleaner.Clean(DatabaseFacade facade) in D:\dotnetcore\EFCoreTest\test\EFCore.Relational.Specification.Tests\TestUtilities\RelationalDatabaseCleaner.cs:line 52
   at Microsoft.EntityFrameworkCore.TestUtilities.SqlServerDatabaseFacadeExtensions.<>c.<EnsureClean>b__0_0(DatabaseFacade database) in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\TestUtilities\SqlServerDatabaseFacadeExtensions.cs:line 12
   at Microsoft.EntityFrameworkCore.ExecutionStrategyExtensions.<>c__2`1.<Execute>b__2_0(<>f__AnonymousType0`2 s) in D:\dotnetcore\EFCoreTest\src\EFCore\Extensions\ExecutionStrategyExtensions.cs:line 77
   at Microsoft.EntityFrameworkCore.ExecutionStrategyExtensions.<>c__DisplayClass12_0`2.<Execute>b__0(DbContext c, TState s) in D:\dotnetcore\EFCoreTest\src\EFCore\Extensions\ExecutionStrategyExtensions.cs:line 354
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded) in D:\dotnetcore\EFCoreTest\src\EFCore.SqlServer\Storage\Internal\SqlServerExecutionStrategy.cs:line 47
   at Microsoft.EntityFrameworkCore.ExecutionStrategyExtensions.Execute[TState,TResult](IExecutionStrategy strategy, Func`2 operation, Func`2 verifySucceeded, TState state) in D:\dotnetcore\EFCoreTest\src\EFCore\Extensions\ExecutionStrategyExtensions.cs:line 352
   at Microsoft.EntityFrameworkCore.ExecutionStrategyExtensions.Execute[TState,TResult](IExecutionStrategy strategy, TState state, Func`2 operation) in D:\dotnetcore\EFCoreTest\src\EFCore\Extensions\ExecutionStrategyExtensions.cs:line 302
   at Microsoft.EntityFrameworkCore.ExecutionStrategyExtensions.Execute[TState](IExecutionStrategy strategy, TState state, Action`1 operation) in D:\dotnetcore\EFCoreTest\src\EFCore\Extensions\ExecutionStrategyExtensions.cs:line 70
   at Microsoft.EntityFrameworkCore.TestUtilities.SqlServerDatabaseFacadeExtensions.EnsureClean(DatabaseFacade databaseFacade) in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\TestUtilities\SqlServerDatabaseFacadeExtensions.cs:line 11
   at Microsoft.EntityFrameworkCore.TestUtilities.SqlServerTestStore.Clean(DbContext context) in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\TestUtilities\SqlServerTestStore.cs:line 136
   at Microsoft.EntityFrameworkCore.TestUtilities.SqlServerTestStore.CreateDatabase() in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\TestUtilities\SqlServerTestStore.cs:line 119
   at Microsoft.EntityFrameworkCore.TestUtilities.SqlServerTestStore.Initialize(Func`1 createContext, Action`1 seed) in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\TestUtilities\SqlServerTestStore.cs:line 80
   at Microsoft.EntityFrameworkCore.TestUtilities.TestStore.Initialize(IServiceProvider serviceProvider, Func`1 createContext, Action`1 seed) in D:\dotnetcore\EFCoreTest\test\EFCore.Specification.Tests\TestUtilities\TestStore.cs:line 44
   at Microsoft.EntityFrameworkCore.TestUtilities.RelationalTestStore.Initialize(IServiceProvider serviceProvider, Func`1 createContext, Action`1 seed) in D:\dotnetcore\EFCoreTest\test\EFCore.Relational.Specification.Tests\TestUtilities\RelationalTestStore.cs:line 29
   at Microsoft.EntityFrameworkCore.TestUtilities.SqlServerTestStore.InitializeSqlServer(IServiceProvider serviceProvider, Func`1 createContext, Action`1 seed) in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\TestUtilities\SqlServerTestStore.cs:line 72
   at Microsoft.EntityFrameworkCore.TestUtilities.SqlServerTestStore.CreateInitialized(String name, Boolean useFileName, Nullable`1 multipleActiveResultSets) in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\TestUtilities\SqlServerTestStore.cs:line 42
   at Microsoft.EntityFrameworkCore.Query.QueryBugsTest.CreateTestStore[TContext](Func`1 contextCreator, Action`1 contextInitializer) in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\Query\QueryBugsTest.cs:line 5358
   at Microsoft.EntityFrameworkCore.Query.QueryBugsTest.CreateDatabase9277() in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\Query\QueryBugsTest.cs:line 2737
   at Microsoft.EntityFrameworkCore.Query.QueryBugsTest.From_sql_gets_value_of_out_parameter_in_stored_procedure() in D:\dotnetcore\EFCoreTest\test\EFCore.SqlServer.FunctionalTests\Query\QueryBugsTest.cs:line 2705
----- Inner Stack Trace -----

Failed test cases are:
Microsoft.EntityFrameworkCore.Query.QueryBugsTest.From_sql_gets_value_of_out_parameter_in_stored_procedure
and
Microsoft.EntityFrameworkCore.SequenceEndToEndTest.Can_use_sequence_end_to_end_from_multiple_contexts_concurrently_async

@Wraith2
Contributor Author

Wraith2 commented Mar 27, 2019

Interesting. I saw similar behaviour when I enabled packet pooling but couldn't work out why. Nothing I've changed should affect the flow of packets, only how they're constructed, so why would a timeout occur? Can you see anything I've changed anywhere that could cause this?

@Gary-Zh
Contributor

Gary-Zh commented Mar 28, 2019

@Wraith2 Please look into Microsoft.EntityFrameworkCore.SequenceEndToEndTest.Can_use_sequence_end_to_end_from_multiple_contexts_concurrently_async since this one is always reproducible. I ran the managed SNIPacket on Windows. Do you have a local EFCore test setup? If so, you can debug into the test case.

@Wraith2
Contributor Author

Wraith2 commented Mar 28, 2019

I have the repo to run the tests but I've never used EF and that test case looks complicated.
As I've said a couple of times now I have seen this problem before directly in the manual tests, fully reproducible. I couldn't resolve it and couldn't isolate the problem so I backed away from the cause to the current code changes because they don't exhibit the problem. It seems to be the case that the error is in something fairly fundamental that I've changed and I don't know where that is.

I know I can't fix this alone because I've tried. Trying to trace individual packets through this mountain of a library is virtually impossible. If it were my job then I'd give it a go, but this isn't my job. If there's not going to be any attempt at collaboration or suggestion on avenues of approach from the expert owners then there's no point leaving this open.

@Wraith2 Wraith2 closed this Mar 28, 2019
@saurabh500
Contributor

Hey @Wraith2 don't close this PR. Let's add the do not merge tag to it and revisit it when there is bandwidth. I agree that it is not easy to debug through the mountain of packets.

Please reopen this PR.

@saurabh500
Contributor

Also please note that at this point, we will also have to debug into this problem and figure out where things are going wrong.
The packet handling is quite complex and we have some idea of what is going on, but not down to the last nuance. Your effort to improve it is absolutely appreciated. This PR will take time to push through while we figure out what is going wrong in the improvements that you are suggesting.

@Wraith2 Wraith2 changed the title SqlClient managed networking improvements WIP SqlClient managed networking improvements Mar 29, 2019
@Wraith2 Wraith2 reopened this Mar 29, 2019
@karelz
Member

karelz commented Apr 13, 2019

@Gary-Zh do you have an ETA for the test verification? It's been 9 days since that info ...

@saurabh500
Contributor

There is an issue with EF tests hanging on Linux which is being followed up on dotnet/efcore#15333
We use
Also @Wraith2 found out that there is an additional test from DataAccessBenchmark that helped catch issues which were missed by the tests that we run. There was a bug that went through, which was caught by DataAccessBenchmark when @Wraith2 executed those tests. Once we port those over and get some help from the EF team about the test hang issue, we can move forward.

@karelz
Member

karelz commented Apr 13, 2019

Thanks for the update! Do you have a rough ETA? (just to set overall expectations)

@Wraith2
Contributor Author

Wraith2 commented Apr 22, 2019

I've got a further branch of this which adds packet caching per connection and reliably fails a manual test with a timeout, probably the same one you're seeing in the EF Core tests. I meant to add this functionality originally, but the timeout that appeared every time I added the cache convinced me I was doing something wrong, and I couldn't find it.

Now we know that it isn't my change causing it, you might find it useful to use this branch to try and track it down. As I said, I tried for some weeks and couldn't. https://github.com/Wraith2/corefx/tree/sqlperf-managedsmux2

@Wraith2
Contributor Author

Wraith2 commented Apr 22, 2019

In my local copy of the branch I linked I seem to have fixed the timeout. It wasn't my intention to do so but another bug where a packet was used after it was no longer valid became apparent when packet recycling was introduced.

The error I fixed was in SNITcpHandle.Receive. It returns a packet as an out parameter, but if an error occurs that out parameter is not set to null. Changing that method so that errors always return null removed the error I was looking into (which was in a similar test) and the timeout I'd seen before. Might be worth trying.
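The shape of that fix can be sketched as below. This is not the SNITcpHandle code: `Packet`, `Receiver`, the status constants and the `Stream`-based signature are all assumptions made up for the illustration. The pattern is to assign null to the out parameter up front and only publish the packet on the success path, so a caller can never observe a stale or half-filled packet after an error.

```csharp
using System;
using System.IO;

// Hypothetical sketch; names and status codes are assumptions,
// not the actual SNITcpHandle implementation.
sealed class Packet
{
    public byte[] Data;
    public int Length;
}

static class Receiver
{
    public const uint Success = 0;
    public const uint Error = 1;

    public static uint Receive(Stream stream, int size, out Packet packet)
    {
        // Assign null first so every error path leaves the caller with null.
        packet = null;
        try
        {
            var candidate = new Packet { Data = new byte[size] };
            candidate.Length = stream.Read(candidate.Data, 0, size);
            if (candidate.Length == 0)
            {
                // End of stream is treated as an error; packet stays null.
                return Error;
            }

            // Only a fully read packet is handed to the caller.
            packet = candidate;
            return Success;
        }
        catch (IOException)
        {
            return Error;
        }
        catch (ObjectDisposedException)
        {
            return Error;
        }
    }
}
```

With this shape, a caller that ignores the status code and touches the packet fails immediately with a null reference instead of silently reusing a packet that may already have been recycled.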

@ViktorHofer
Member

@saurabh500 @Wraith2 could you please provide a status on this?

@karelz
Member

karelz commented May 22, 2019

@afsanehr @Gary-Zh what is the status of this PR? Any update? Was there some offline deal with @Wraith2 about this one?

@Gary-Zh
Contributor

Gary-Zh commented May 22, 2019

Hi @karelz, most of the tests have already passed. There is only one pending test: EFCore on Linux, and it keeps hanging on all of our Linux VMs.
As suggested by the EFCore team we've tried increasing the number of logical CPUs allocated to the VM, but it does not change the result. Once this issue is resolved and the test passes, we'll merge the PR immediately.

Here's the link to the hanging issue:
dotnet/efcore#15333

@karelz
Member

karelz commented May 22, 2019

Thanks @Gary-Zh for the update. Who is working on the last failing test? You / SqlClient team / @Wraith2?

@Gary-Zh
Contributor

Gary-Zh commented May 22, 2019

Hi @karelz, no test failures have been spotted yet because we can't even run the tests successfully. The EFCore test itself is hanging and we are looking for help from the EFCore team right now.
It passed on Windows though, with and without native SNI, so I think it should pass on Linux as well.
However, we still want to see it passing on Linux before merging this PR.

@karelz
Member

karelz commented May 23, 2019

Thanks @Gary-Zh! If you have a rough ETA, or at least a time-frame in which you plan to reply back, that would be useful to @Wraith2 and me as well :)

@Wraith2 Wraith2 changed the title WIP SqlClient managed networking improvements SqlClient managed networking improvements Jun 3, 2019
@Gary-Zh Gary-Zh merged commit 2469a91 into dotnet:master Jun 4, 2019
@Wraith2 Wraith2 deleted the sqlperf-managedsmux branch June 11, 2019 18:40
@cheenamalhotra
Member

Note: Porting to Microsoft.Data.SqlClient incomplete without #40732
