DISK IO wait due to millions of update statements #1962
Comments
There were no changes in this area since 1.7.16, so the problem isn't related to the upgrade itself. Theoretically, I also see you've enabled the … option. Can you please tell me more about the millions of updates – how did you diagnose this number, and what's the number of updates per second? Also, how many background job servers do you have, and how many workers do they run?
Thank you for providing the screenshots! First of all, Hangfire is write-heavy, because its primary task is to update background job state data. However, it should cause millions of updates only when you have millions of background jobs. Can you tell me whether it is possible to get a "Query → Executions" breakdown to understand which queries are executed most of the time? Also, I see that the heartbeat of your servers was executed 7 hours ago. Is that due to stale data (e.g. you didn't refresh that page for 7 hours), or does even a refreshed page show 7-hour-old heartbeats?
Thank you a lot for your help and all the screenshots. The new query you are talking about appeared due to the configuration changes you've mentioned in the first comment, and the fetching implementation switch was caused by enabling the following options:

SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
QueuePollInterval = TimeSpan.Zero,

When both of them are enabled, the new fetching implementation is used. However, I haven't seen before that in some cases it can cause billions of queries; this is a bug I'm not aware of, and I can't even tell you yet what happened. I will investigate this issue and will let you know once it's resolved, and I would be happy if you are able to answer some more questions.

As a workaround, you can roll back to the previous polling behaviour:

SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
QueuePollInterval = TimeSpan.FromSeconds(queuePollInterval),

Regarding the investigation, I see three symptoms – waits caused by the new query, high execution counts of the new query, and a much higher transaction rate. Waits and PAGE IO are likely caused by the … I'd also like to see the bigger picture, because some queries can affect other ones. Of course those queries can't increase the number of executions, but anyway. Could you answer the following questions?
Ah, thank you for spotting the "Requeue Failed Hangfire Jobs" query. That one will need to be deprecated in our system – it was an update statement used to re-queue failed jobs. Since then, we have developed a JobFilter to do that, so I will contact the right developers to look at removing it.

SQL: 4619364175 – similar, for deleting duplicate jobs. Also looks like one I will follow up on with developers.

SQL: 4863681399 – hmm, strange. It's a simple SELECT statement from a Province table. Will need to investigate on my side.

Instance type is x1e.2xlarge.

Okay, thanks Sergey. We're going to try removing UsePageLocksOnDequeue and set SlidingInvisibilityTimeout/QueuePollInterval as recommended. We probably won't know for a couple of weeks, as our production releases are scheduled, but we will report back if we see improvements.
Thank you so much for the update, great that the issue is resolved for you! However, it's not fully resolved on my side, because I still don't understand what happened with …
Hi Sergey, unfortunately, after a VM migration we're experiencing the same behavior in our SQL Server. Running version … We're detecting more than 600 processes in the SQL Server like this: … Those processes seem to be executing: … Do you have a workaround or suggestion to avoid this issue? Thanks in advance for your help!
Thanks for reporting this @arielbvargas! Could you run the stdump utility to observe the stack traces of managed threads while your application is struggling with this issue, and post the output here? This utility uses ClrMD to connect to your application (or parse a mini dump file) with a debugging session and dump the stack traces of all the running threads, printing them to the console output. With its help I will be able to understand how many worker threads issue that query and decide what to do based on the output.
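For readers who can't run stdump directly, below is a minimal sketch of the same idea using the Microsoft.Diagnostics.Runtime (ClrMD) package; the command-line argument handling and the 2.x API usage are assumptions for illustration, not part of stdump itself:

using System;
using Microsoft.Diagnostics.Runtime;

class DumpManagedStacks
{
    static void Main(string[] args)
    {
        // Assumed: the target process id is passed as the first command-line argument.
        int pid = int.Parse(args[0]);

        // Attach a non-invasive debugging session and create a runtime for the first CLR found.
        using DataTarget target = DataTarget.AttachToProcess(pid, suspend: false);
        ClrRuntime runtime = target.ClrVersions[0].CreateRuntime();

        // Print the managed stack trace of every live thread, similar to what stdump outputs.
        foreach (ClrThread thread in runtime.Threads)
        {
            if (!thread.IsAlive)
                continue;

            Console.WriteLine($"Thread {thread.OSThreadId:x}:");
            foreach (ClrStackFrame frame in thread.EnumerateStackTrace())
                Console.WriteLine("    " + frame);
        }
    }
}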
Thanks a lot! How many concurrent queries in the sleeping state do you see at the moment? I'd like to compare their number to the number of managed threads executing that query in the process.
Hi again. More than 770 – I'm attaching the list.

I forgot to mention that we have two IIS servers running Hangfire. The first one is the source of the dump previously sent. The second one has several application pools (all pointing to the same code). Do you also need that dump? Thanks!
Sure, it would be very useful to see the whole picture. Appreciate your assistance!
Ok, I'm attaching two more dumps from the second IIS. Thanks!
Previous implementation was too relaxed, and the actual number of concurrent executions was much higher than the expected "1". Relates to #1962
When hosted in IIS, when deploying a new app version, when using autostart providers (probably), and when the AppDomain unload was delayed (probably), an OperationCanceledException thrown in a while (true) loop, or a while loop with some external condition, for some reason doesn't lead to a loop exit and instead leads to a new iteration without any delay. This behavior was observed when the BackgroundExecution class was written and looked really weird; it's better to avoid calling ThrowIfCancellationRequested in such loops to avoid bad things. Relates to #1962
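As an illustration of the pattern that change describes, here is a hypothetical sketch (not Hangfire's actual code) of how a broad catch can turn a cancelled loop into a tight spin, and a safer variant that checks the token in the loop condition:

using System;
using System.Threading;

static class CancellationLoopSketch
{
    // Problematic pattern: the broad catch inside the loop also swallows the
    // OperationCanceledException thrown by ThrowIfCancellationRequested, so the
    // while (true) loop never exits and starts a new iteration with no delay.
    static void ProblematicLoop(CancellationToken token, Action doWork)
    {
        while (true)
        {
            try
            {
                token.ThrowIfCancellationRequested();
                doWork();
            }
            catch (Exception)
            {
                // Swallowed cancellation -> tight spin until the AppDomain finally unloads.
            }
        }
    }

    // Safer pattern: check the flag in the loop condition, so cancellation always
    // terminates the loop regardless of how exceptions are handled inside it.
    static void SaferLoop(CancellationToken token, Action doWork)
    {
        while (!token.IsCancellationRequested)
        {
            try
            {
                doWork();
            }
            catch (Exception)
            {
                // Log and retry on the next iteration.
            }
        }
    }
}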
@arielbvargas thank you so much for the dumps. I've fixed two possible reasons for this strange behaviour and released version 1.7.28. Please upgrade to it and perform an application pool recycle after the upgrade to prevent old instances from continuing to run in an AppDomain. If the issue persists, feel free to re-open this issue.
We are seeing the same issue in 1.7.25. I am going to upgrade to the latest version to see if it resolves the issue. Should I keep my config like this, or change it as per this ticket?
Please try installing the latest version of the Microsoft.Data.SqlClient package (the latest stable version at the moment is 5.0.0) and use the following connection factory in your Hangfire configuration logic in order to use it:

.UseSqlServerStorage(
    () => new Microsoft.Data.SqlClient.SqlConnection(connectionString));

All the recent strange issues with queries and connectivity were resolved by using the new package from Microsoft. Please let me know if it solves your issue!
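For context, a configuration sketch combining the connection factory above with the storage options discussed earlier in this thread; this is an assumption about how the pieces fit together on 1.7.x rather than an official snippet, and connectionString is a placeholder:

using System;
using Hangfire;
using Hangfire.SqlServer;

// ...
GlobalConfiguration.Configuration
    .UseSqlServerStorage(
        // The factory makes Hangfire open connections via Microsoft.Data.SqlClient
        // instead of System.Data.SqlClient.
        () => new Microsoft.Data.SqlClient.SqlConnection(connectionString),
        new SqlServerStorageOptions
        {
            CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
            SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
            QueuePollInterval = TimeSpan.Zero,
            UseRecommendedIsolationLevel = true,
            DisableGlobalLocks = true
            // UsePageLocksOnDequeue intentionally omitted, as suggested earlier in the thread.
        });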
But please note there may be problems with authentication: Microsoft.Data.SqlClient changed defaults in 4.0.0, and sometimes new connection string keywords like … are required.
Thanks - sadly, changing the connection string is pretty tricky for us - thousands of on-premises installations. What are your thoughts on just upgrading Hangfire as a solution until we can change the connection strings? Will that work?
@odinserj - I have put the latest Hangfire onto my machine and monitored SQL. It does not appear to resolve the issue. I will try changing the settings as per above.
@gregpakes it is also possible to use the Microsoft.Data.SqlClient package of version 3.1.1; in this case no connection string changes are required, because the breaking changes were introduced only in version 4.0.0 (confirmed both on Windows and Linux).

.UseSqlServerStorage(
    () => new Microsoft.Data.SqlClient.SqlConnection(connectionString));

Unfortunately it's not possible to provide a general fix for this issue in Hangfire itself. The upcoming version 1.8.0 will contain a breaking change for this and will use Microsoft.Data.SqlClient by default when that package is referenced, to avoid issues with millions of queries or connection pool problems. It is possible that some runtimes will be broken after this, but at least the problem will be documented in the upgrade guide, and in most cases it will surface in testing environments. System.Data.SqlClient can't be used by default anymore, because it's difficult to monitor this issue and difficult to understand what happens. But unfortunately I can't modify the patch versions (1.7.X) to use Microsoft.Data.SqlClient by default, because no one expects breaking changes when upgrading between patch versions. Please try to install and use the Microsoft.Data.SqlClient package of version 3.1.1, because this package resolved all the recent problems with SQL Server-based storage.
Thanks @odinserj, I've resolved the issue with: …
Is this OK?
I'm afraid in this case the probability of a problem is reduced, but the problem itself doesn't go away – I've seen the same symptoms with other queries as well, like with the … Every customer who changed to Microsoft.Data.SqlClient reported that all their problems were gone.
@odinserj - sadly, when I use Microsoft.Data.SqlClient 3.1.1 I get the following issue:
@gregpakes an access violation, on .NET 😩 May I ask you to post the full exception with stack trace here – maybe you saved it somewhere? It would be very useful, because it looks like the referenced SO question relates to System.Data.SqlClient.
Stacktrace attached:
I have gone back to Microsoft.Data.SqlClient 5.0.0 and the issue has returned. I am back to 60,000 requests per hour for this query.
I am definitely using Microsoft.Data.SqlClient.
60,000 requests per hour (not millions) is an expected metric for that configuration. You can tune the …
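As a rough illustration of why tens of thousands of fetch requests per hour is plausible, here is a back-of-envelope sketch; the worker count and attempt rate are assumed example numbers, not values reported in this issue:

using System;

// Estimate: with semi-polling fetching, each idle worker repeatedly retries the
// dequeue UPDATE, so the hourly statement count is roughly
// workers * attemptsPerSecond * 3600.
int workers = 20;              // assumption: number of Hangfire workers
double attemptsPerSecond = 1;  // assumption: fetch attempts per second per idle worker
double perHour = workers * attemptsPerSecond * 3600;
Console.WriteLine(perHour);    // ~72,000 per hour, the same order of magnitude as the observed 60,000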
The plot thickens. We can't use Microsoft.Data.SqlClient 5.0.0 due to this bug: dotnet/SqlClient#1418. They say a fix will be out by the end of the year, but the fix is included in 4.1.1. Correct me if I'm wrong, but 4.X still contains the Hangfire runaway issues? So I would need to go to 3.1?
We updated from 1.7.16 to 1.7.25 and have noticed high waits on IO. Any ideas?
We updated our SQL storage options per the recommendations to these settings:
CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
QueuePollInterval = TimeSpan.Zero,
PrepareSchemaIfNecessary = prepareSchema,
UseRecommendedIsolationLevel = true,
UsePageLocksOnDequeue = true,
DisableGlobalLocks = true
Previously, the settings were:
QueuePollInterval = TimeSpan.FromSeconds(queuePollInterval),
PrepareSchemaIfNecessary = prepareSchema,
UseRecommendedIsolationLevel = true,
UsePageLocksOnDequeue = true,
DisableGlobalLocks = true
Our DPA has captured this as the cause of high waits on IO:
/* (comment inserted by DPA)
Character Range: 308 to 595
Waiting on statement:
UPDATE top (1) JQ
SET FetchedAt = GETUTCDATE() output INSERTED.Id,
INSERTED.JobId,
INSERTED.Queue,
INSERTED.FetchedAt
FROM [HangFire].JobQueue JQ
WITH
(
forceseek,
paglock,
xlock
)
WHERE Queue in (@queues1)
AND (FetchedAt is null
OR FetchedAt < DATEADD(second, @Timeoutss, GETUTCDATE()))
*/
(@queues1 nvarchar(4000),@Timeoutss int,@delayms int,@ENDms int)
set nocount on;
set xact_abort on;
set tran isolation level read committed;
declare
@EnD datetime2 = DATEADD(ms, @ENDms, SYSUTCDATETIME()),
@delay datetime = DATEADD(ms, @delayms, convert(DATETIME, 0));
WHILE (SYSUTCDATETIME() < @EnD)
BEGIN
/* BEGIN ACTIVE SECTION (comment inserted by DPA) */
UPDATE top (1) JQ
SET FetchedAt = GETUTCDATE() output INSERTED.Id,
INSERTED.JobId,
INSERTED.Queue,
INSERTED.FetchedAt
FROM [HangFire].JobQueue JQ
WITH
(
forceseek,
paglock,
xlock
)
WHERE Queue in (@queues1)
AND (FetchedAt is null
OR FetchedAt < DATEADD(second, @Timeoutss, GETUTCDATE()))
/* END ACTIVE SECTION (comment inserted by DPA) */
;
IF @@rowcount > 0
RETURN;
WAITFOR DELAY @delay;
END