
Millions/Billions of HangFire queries being highlighted in SSMS Activity Monitor #2120

Open
djdd87 opened this issue Nov 9, 2022 · 19 comments


@djdd87

djdd87 commented Nov 9, 2022

Hi, I'm trying to get to the bottom of some performance issues on our production environment and SSMS is highlighting the following query in the activity monitor under expensive queries:

<?query --
update top (1) JQ set FetchedAt = GETUTCDATE()
    output INSERTED.Id, INSERTED.JobId, INSERTED.Queue, INSERTED.FetchedAt
    from [HangFire].JobQueue JQ with (forceseek, readpast, updlock, rowlock)
    where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeoutSs, GETUTCDATE()))
--?>

[screenshot: SSMS Activity Monitor, Expensive Queries pane showing the execution count for this query]

We are using the following settings, which are suggested in similar issues:

GlobalConfiguration.Configuration.UseSqlServerStorage(Globals.ConnectionString, new SqlServerStorageOptions
{
   SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
   QueuePollInterval = TimeSpan.Zero
});

The query count seems to fluctuate between about 23,000 and more than 1.2 billion (as per the above screenshot). Why is this, and have I misconfigured something? Here's a screenshot of all our recurring jobs (we only use HangFire for these recurring jobs):

[screenshot: list of our recurring jobs]

@odinserj
Member

Please try installing the latest version of the Microsoft.Data.SqlClient package (the latest stable version at the moment is 5.0.1) and use the following connection factory in your Hangfire configuration logic, so that the SqlConnection class from the newly installed package is used.

.UseSqlServerStorage(
    () => new Microsoft.Data.SqlClient.SqlConnection(connectionString));

Please note that the Microsoft.Data.SqlClient package uses encryption by default since version 4.0.0, so in some cases you might need to add the TrustServerCertificate=True option to your connection string or disable encryption with Encrypt=False. If nothing works, try downgrading the Microsoft.Data.SqlClient package to version 3.1.1, which doesn't contain the changes related to the new defaults.
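
For illustration, a minimal sketch combining the connection factory with the options you already use (the connection string below is just an example and would need to be adapted to your environment):

// Sketch only: the connection string is illustrative. TrustServerCertificate=True
// skips certificate validation, which may be needed with the 4.0+ encryption
// defaults; alternatively, encryption can be disabled with Encrypt=False.
var connectionString =
    "Server=.;Database=MyDb;Integrated Security=True;TrustServerCertificate=True";

GlobalConfiguration.Configuration.UseSqlServerStorage(
    () => new Microsoft.Data.SqlClient.SqlConnection(connectionString),
    new SqlServerStorageOptions
    {
        SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
        QueuePollInterval = TimeSpan.Zero
    });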

@djdd87
Author

djdd87 commented Nov 10, 2022

Thanks for the reply @odinserj. However, before I make this change, could you please explain what it's doing and why it'll fix the issue? Also, why do I need to remove the SlidingInvisibilityTimeout and QueuePollInterval that other posts mention should be added?

@odinserj
Member

Thanks for the reply @odinserj. However, before I make this change, could you please explain what it's doing and why it'll fix the issue?

I have been pursuing this issue for years, as it happens from time to time in specific environments under specific conditions. I've added protection to almost every layer of retries in Hangfire to avoid running them in an uncontrolled fashion, e.g. without any delay, but nothing worked. For such a high number of queries there must be a retry loop with no delay between attempts for some reason.

This summer I noticed that with Microsoft.Data.SqlClient installed this problem goes away, and every customer reported the same. So now I recommend that everyone with this problem use Microsoft.Data.SqlClient instead, and I gather feedback. I saw a retry loop in its source code that could potentially cause this, but I haven't been able to reproduce the issue yet.

Also, why do I need to remove the SlidingInvisibilityTimeout and QueuePollInterval that other posts mention should be added?

No, you don't need to remove them at all.

@djdd87
Author

djdd87 commented Nov 10, 2022

OK, I've left in SlidingInvisibilityTimeout and QueuePollInterval and changed it to Microsoft.Data.SqlClient. It'll take a week and a bit until it's in production, but I'll report back either way.

@djdd87
Author

djdd87 commented Dec 13, 2022

Unfortunately, the same issue is still present. With the suggested changes, it's still hitting billions of calls. We've been monitoring it for a few weeks; occasionally it's around 15,000, but it peaks to greater than one billion again:

GlobalConfiguration.Configuration.UseSqlServerStorage(() => new Microsoft.Data.SqlClient.SqlConnection(RP_Globals.ConnectionString), new SqlServerStorageOptions()
{
   SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
   QueuePollInterval = TimeSpan.Zero
});

@odinserj
Member

Thanks for your feedback @djdd87, working on it. By the way, do you notice high incoming network traffic to your SQL Server when facing this issue?

@odinserj
Member

Also, does the CPU consumption shown by SSMS match the same metric in the Task Manager dialog?

@djdd87
Author

djdd87 commented Dec 15, 2022

Thanks for getting back to me! I'll speak to our production team and see if I can get that information.

What does setting SlidingInvisibilityTimeout and QueuePollInterval actually do? It's not business critical for us that our CRON tasks run exactly at 06:00:00, i.e. they could run at 06:00:15 and it'll still work perfectly.

Am I right in thinking that QueuePollInterval set to 0 just constantly spams SQL Server as fast as it can, and that I could just change it to 15 seconds if I'm not bothered whether it starts at 06:00:00 or 06:00:15?

@odinserj
Member

Thank you, will appreciate any additional information regarding this problem.

The SlidingInvisibilityTimeout option defines how long a background job is considered "active" when no heartbeats are sent for it, so it serves as a fault detector. QueuePollInterval affects how often queues are polled: for sub-second values only a single worker is able to poll the empty queue, while for values above 1 second all the workers do it. TimeSpan.Zero causes the minimum value to be used, which is 50 ms (a maximum of 20 queries per second) in 1.7 and will be increased to 200 ms (a maximum of 5 queries per second) in 1.8.

You can increase the value to 15 seconds; in this case I expect the probability of such an issue to decrease. Nevertheless, something strange sometimes happens with the connection pool itself: on one side a lot of open connections (much more than the configured maximum) are observed, and on the other side timeout exceptions with an inability to get a connection from the pool.
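
For example, a minimal sketch of the relaxed configuration, based on the settings posted above (values are illustrative):

// Sketch only: keep the sliding invisibility timeout, but poll at most once every
// 15 seconds instead of the sub-second minimum, accepting up to ~15 seconds of
// job start latency in exchange for far fewer queries against SQL Server.
GlobalConfiguration.Configuration.UseSqlServerStorage(Globals.ConnectionString, new SqlServerStorageOptions
{
    SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
    QueuePollInterval = TimeSpan.FromSeconds(15)
});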

The current fetching implementation can be simplified and waits on SQL Server's side can be avoided, so I will make these changes in 1.7.33.

@rzurlo

rzurlo commented Apr 21, 2024

I have installed version 1.8.12, the latest one, and the issue persists. By setting QueuePollInterval = TimeSpan.MaxValue, the queries are executed every 7-8 seconds. I don't have any queues installed and only have one recurring job. I would like to completely disable polling because it's unnecessary. Is it possible?

@fiidim

fiidim commented Jun 6, 2024

I'm in a similar boat. Regardless of the QueuePollInterval value, I can see at least 3 polls per second. I've tried many different combinations of intervals and values, and it seems to ignore them and poll every 200 ms. I've tried Microsoft.Data.SqlClient and there's no relief there either. My setup is an ASP.NET (MVC5) application with SimpleInjector IoC containers. I'm not sure what version of Hangfire first introduced the issue, but it still exists in current releases. I've ensured the schemas are on the latest version. No other issues are being experienced, other than being unable to control the frequency of database polling (SqlServer).

@djdd87
Author

djdd87 commented Jun 6, 2024

We've just moved to redis instead.

@fiidim

fiidim commented Jun 6, 2024

Ironic that the interval is exactly 200ms, regardless of what I set it to. Not sure why the value is ignored, even with primarily default settings.

@odinserj
Member

odinserj commented Jun 7, 2024

@fiidim can you post here your settings related to Hangfire and Hangfire.SqlServer?

odinserj added a commit that referenced this issue Jun 7, 2024
Queries used for sliding invisibility timeout-based fetching are unified now, and we can use the same implementation for sub- and over second polling delay configured.

Possibly relates to #2120.
@fiidim

fiidim commented Jun 7, 2024

Sure... Relevant packages I'm using are:

<PackageVersion Include="Hangfire.Console" Version="1.4.3" />
<PackageVersion Include="Hangfire.Core" Version="1.8.12" />
<PackageVersion Include="Hangfire.SqlServer" Version="1.8.12" />
<PackageVersion Include="Microsoft.Data.SqlClient" Version="5.2.1" />
<PackageVersion Include="SimpleInjector" Version="5.4.6" />

BackgroundJobServerOptions are registered first (I've tried leaving them all at default with no change in behaviour):

            container.Register<BackgroundJobServerOptions>(() => new BackgroundJobServerOptions()
            {
                StopTimeout = TimeSpan.FromSeconds(10),
                Queues = queues.Select(j => j.Name).ToArray(),
                HeartbeatInterval = new System.TimeSpan(0, 0, 30),
                ServerCheckInterval = new System.TimeSpan(0, 0, 30),
                SchedulePollingInterval = new System.TimeSpan(0, 0, 30)
            });

JobStorage registered next:

container.Register<JobStorage>(() => new SqlServerStorage(<connectionString>));

GlobalConfiguration set to mostly default (but have tried various combinations with no success):

            GlobalConfiguration.Configuration
                .SetDataCompatibilityLevel(CompatibilityLevel.Version_180) 
                .UseSimpleAssemblyNameTypeSerializer()
                .UseRecommendedSerializerSettings()
                .UseSqlServerStorage(<connectionStringName>, new SqlServerStorageOptions
                {
                    QueuePollInterval = TimeSpan.FromSeconds(5),
                    SlidingInvisibilityTimeout = null
                })
                .UseSimpleInjectorJobActivator(jobContainer);

Then server registered as such:

            container.Register<BackgroundJobServer>(() =>
            {
                var jobServer = new BackgroundJobServer(
                    container.GetInstance<BackgroundJobServerOptions>(),
                    container.GetInstance<JobStorage>());
                return jobServer;
            });

My result, regardless of how I set any of the variables, is a poll every 200 ms, as can be seen in SQL Server Activity Monitor:

[screenshot: SQL Server Activity Monitor showing the fetch query executing every 200 ms]

Let me know if you need any additional information that may help. Thanks.

@fiidim

fiidim commented Jun 9, 2024

I made some headway on this. It seems that the SqlServerStorageOptions defined in GlobalConfiguration get overridden by the ones defined on the JobStorage instance (which weren't defined at all) when the BackgroundJobServer is registered with the JobStorage parameter. It feels like it's my fault, but it is peculiar behaviour to have to declare a JobStorage that points to SQL Server and also configure GlobalConfiguration to use SQL Server, passing in the connection string twice.
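
As a rough sketch of what appears to resolve it for me, the options are now defined on the JobStorage registration itself, so the server picks them up (placeholders as in the earlier snippets):

// Sketch only: define the storage options on the JobStorage instance, since this
// is the instance the BackgroundJobServer is constructed with and actually uses.
container.Register<JobStorage>(() => new SqlServerStorage(
    <connectionString>,
    new SqlServerStorageOptions
    {
        QueuePollInterval = TimeSpan.FromSeconds(5),
        SlidingInvisibilityTimeout = null
    }));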

@fiidim

fiidim commented Jun 10, 2024

Further strange behaviour occurs when using a non-zero QueuePollInterval (for example, 5 seconds). Instead of one worker polling 5 times per second as before, all 20 workers now poll simultaneously at the 5-second interval.

@odinserj
Member

Yes, historically the semaphore that limits polling to a single worker was applied only to sub-second intervals due to their different implementation. Since the implementation is now the same for sub-second and above-second intervals, I have changed this logic, and as of 1.8.14, released today, the semaphore is applied in both cases. Please upgrade to the newest version and let me know if it solves the issue.

@fiidim
Copy link

fiidim commented Jun 11, 2024

Yes! That worked perfectly (1.8.14). I am very pleased with the behaviour! Thank you for taking a look and resolving my issue.
