Triggers are getting blocked permanently #145

Closed
shelmling opened this issue May 22, 2017 · 22 comments

@shelmling
Contributor

Dear Quartz Team,

We are using Quartz 2.2.1 in clustered-mode with JDBC job store to schedule jobs marked as @DisallowConcurrentExecution.

We have observed that occasionally triggers are getting stuck in trigger state BLOCKED without ever recovering automatically. Looking into the job store DB tables, the pattern is always the same:

  • The TRIGGER_STATE on <PREFIX>_TRIGGERS is in state BLOCKED

  • There is no corresponding record in <PREFIX>_FIRED_TRIGGERS

Obviously org.quartz.impl.jdbcjobstore.JobStoreSupport.clusterRecover(Connection, List<SchedulerStateRecord>) will not recover such triggers, so the only way to get out of this inconsistent state is to manually set the TRIGGER_STATE back to WAITING.
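
A minimal sketch of how such rows could be located and reset over plain JDBC. The class and method names are hypothetical (nothing here is part of Quartz), and the column names assume the standard Quartz 2.x schema. Because a trigger can legitimately be BLOCKED while another trigger of the same job is executing, the sketch matches fired-trigger rows by job rather than by trigger, which is an assumption about what "corresponding record" should mean.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Quartz: finds and resets orphaned BLOCKED triggers. */
final class StuckTriggerRepair {

    /** SELECT for BLOCKED triggers whose job has no row in the fired-triggers table. */
    static String findSql(String prefix) {
        return "SELECT T.TRIGGER_NAME, T.TRIGGER_GROUP FROM " + prefix + "TRIGGERS T"
             + " WHERE T.SCHED_NAME = ? AND T.TRIGGER_STATE = 'BLOCKED'"
             + " AND NOT EXISTS (SELECT 1 FROM " + prefix + "FIRED_TRIGGERS F"
             + " WHERE F.SCHED_NAME = T.SCHED_NAME"
             + " AND F.JOB_NAME = T.JOB_NAME AND F.JOB_GROUP = T.JOB_GROUP)";
    }

    /** UPDATE that moves one trigger back to WAITING, but only if it is still BLOCKED. */
    static String resetSql(String prefix) {
        return "UPDATE " + prefix + "TRIGGERS SET TRIGGER_STATE = 'WAITING'"
             + " WHERE SCHED_NAME = ? AND TRIGGER_NAME = ? AND TRIGGER_GROUP = ?"
             + " AND TRIGGER_STATE = 'BLOCKED'";
    }

    /** Resets every orphaned BLOCKED trigger and returns how many rows were updated. */
    static int repair(Connection conn, String prefix, String schedName) throws SQLException {
        List<String[]> stuck = new ArrayList<>();
        try (PreparedStatement find = conn.prepareStatement(findSql(prefix))) {
            find.setString(1, schedName);
            try (ResultSet rs = find.executeQuery()) {
                while (rs.next()) {
                    stuck.add(new String[] { rs.getString(1), rs.getString(2) });
                }
            }
        }
        int reset = 0;
        try (PreparedStatement update = conn.prepareStatement(resetSql(prefix))) {
            for (String[] key : stuck) {
                update.setString(1, schedName);
                update.setString(2, key[0]);
                update.setString(3, key[1]);
                reset += update.executeUpdate();
            }
        }
        return reset;
    }
}
```

Running something like this while the cluster is actively firing triggers could race with the scheduler, so it is probably safest to run it with the schedulers paused.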

It is not yet clear under which circumstances this error occurs. However, our log files indicate that jobs getting stuck coincides with temporary database problems.

Below you can find an example of a NullPointerException in org.quartz.impl.jdbcjobstore.JobStoreSupport.triggersFired(List<OperableTrigger>). The exception itself was caused somewhere in the JDBC driver (Sybase jConnect) when trying to invoke rollback() on a JDBC connection. The log entry’s timestamp correlates exactly with the time the trigger got stuck.

2017 05 01 20:20:02#+00#ERROR#org.quartz.core.QuartzSchedulerThread##anonymous#ItOpScheduler_Clustered_QuartzSchedulerThread#Runtime error occurred in main trigger firing loop.
java.lang.NullPointerException: while trying to invoke the method com.sybase.jdbc4.tds.TdsCursor.setRowNum(int) of a null object loaded from field com.sybase.jdbc4.tds.CurInfo3Token._cursor of an object loaded from local variable 'this'
	at com.sybase.jdbc4.tds.CurInfo3Token.getMetaInformation(CurInfo3Token.java:85)
	at com.sybase.jdbc4.tds.CurInfoToken.<init>(CurInfoToken.java:130)
	at com.sybase.jdbc4.tds.CurInfo3Token.<init>(CurInfo3Token.java:45)
	at com.sybase.jdbc4.tds.Tds.nextResult(Tds.java:3239)
	at com.sybase.jdbc4.tds.Tds.readCommandResults(Tds.java:4459)
	at com.sybase.jdbc4.tds.Tds.doCommand(Tds.java:4444)
	at com.sybase.jdbc4.tds.Tds.endTransaction(Tds.java:2602)
	at com.sybase.jdbc4.jdbc.SybConnection.rollback(SybConnection.java:1953)
	at sun.reflect.GeneratedMethodAccessor492.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.sap.core.persistence.jdbc.trace.TraceableBase$1.invoke(TraceableBase.java:44)
	at com.sun.proxy.$Proxy17.rollback(Unknown Source)
	at com.sap.core.persistence.jdbc.trace.TraceableConnection.rollback(TraceableConnection.java:239)
	at org.apache.commons.dbcp.DelegatingConnection.rollback(DelegatingConnection.java:368)
	at org.apache.commons.dbcp.DelegatingConnection.rollback(DelegatingConnection.java:368)
	at org.apache.commons.dbcp.PoolingDataSource$PoolGuardConnectionWrapper.rollback(PoolingDataSource.java:323)
	at sun.reflect.GeneratedMethodAccessor492.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.quartz.impl.jdbcjobstore.AttributeRestoringConnectionInvocationHandler.invoke(AttributeRestoringConnectionInvocationHandler.java:73)
	at com.sun.proxy.$Proxy143.rollback(Unknown Source)
	at org.quartz.impl.jdbcjobstore.JobStoreSupport.rollbackConnection(JobStoreSupport.java:3658)
	at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInNonManagedTXLock(JobStoreSupport.java:3817)
	at org.quartz.impl.jdbcjobstore.JobStoreSupport.triggersFired(JobStoreSupport.java:2908)
	at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:336)

Please let me know if you need additional details.

Thanks for your support,
Sebastian

@shelmling
Contributor Author

I was able to reproduce the issue in the debugger.

If org.quartz.impl.jdbcjobstore.JobStoreSupport.triggerFired(Connection, OperableTrigger) throws a RuntimeException after the trigger state has been set to BLOCKED, the trigger will get stuck. The reason is that QuartzSchedulerThread.run() calls org.quartz.spi.JobStore.releaseAcquiredTrigger(OperableTrigger) when a RuntimeException occurs, which deletes the record from <PREFIX>_FIRED_TRIGGERS but does not reset the trigger state from BLOCKED to WAITING.

Probably org.quartz.impl.jdbcjobstore.JobStoreSupport.releaseAcquiredTrigger(Connection, OperableTrigger) should reset the trigger state to WAITING from both ACQUIRED (which it already does) and BLOCKED.
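
The difference between the two behaviors can be illustrated with a toy in-memory model. This is only a sketch: the class, enum, and method names are hypothetical stand-ins for the two tables and the conditional state update, not Quartz code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the two tables; all names are hypothetical, none are Quartz classes. */
final class ReleaseModel {
    enum State { WAITING, ACQUIRED, BLOCKED }

    final Map<String, State> triggers = new HashMap<>();  // stands in for <PREFIX>_TRIGGERS
    final Set<String> firedTriggers = new HashSet<>();    // stands in for <PREFIX>_FIRED_TRIGGERS

    /** Changes the state to newState only if the trigger is currently in oldState. */
    void updateFromOtherState(String key, State newState, State oldState) {
        triggers.replace(key, oldState, newState);
    }

    /** Release as described above for 2.2.1: only ACQUIRED is reset. */
    void releaseOld(String key) {
        updateFromOtherState(key, State.WAITING, State.ACQUIRED);
        firedTriggers.remove(key);
    }

    /** Release with the proposed change: BLOCKED is reset as well. */
    void releaseProposed(String key) {
        updateFromOtherState(key, State.WAITING, State.ACQUIRED);
        updateFromOtherState(key, State.WAITING, State.BLOCKED);
        firedTriggers.remove(key);
    }

    /** Models the state left behind when firing fails after BLOCKED was written. */
    static ReleaseModel failedMidFire(String key) {
        ReleaseModel model = new ReleaseModel();
        model.triggers.put(key, State.BLOCKED);
        model.firedTriggers.add(key);
        return model;
    }
}
```

With releaseOld the model ends up exactly in the observed pattern (BLOCKED with no fired-trigger record), while releaseProposed returns the trigger to WAITING.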

Best Regards,
Sebastian

@shaoxt

shaoxt commented Oct 12, 2017

Is there anyone who can verify and release this fix?
I'm also using 2.2.1. The job got stuck for the same reason.

@mstead

mstead commented Nov 6, 2017

I would also like to see this bug addressed. Any update on when the fix will get merged and built?

@AntonSemenovKazan

This bug affects us too.
And there is no visible solution.

We are going to replace Quartz.

@pbuckley

pbuckley commented Dec 1, 2017

👍 looks like there's a PR to fix it

@sschwenker

We're having this same issue. Any idea when this will get approved and get a patch out? It seems pretty significant.

@yagarwals

As we can see, this has been around for a while now, and this bug is fixed in the .NET version of Quartz. It would be helpful for us to know when this will be fixed, as we are affected by it.

(Pull request #146 was raised on 22 May 2017 to address this.)

@MSudheer87

MSudheer87 commented Oct 15, 2018

I am facing the same issue with 2.2.1. Is there a fix provided by Quartz for this?
@shelmling @dx-pbuckley

@AntonSemenovKazan

@MSudheer87 we reduced the frequency of this problem when we moved Quartz to a separate DB.
Before that, we had only one DB for both the app and Quartz.

@MSudheer87

@AntonSemenovKazan Thanks for your reply.
You mean you created a separate schema for Quartz alone, correct?

Let me give you a few more details about my issue:

  • I have integrated Quartz with Spring Batch (the actual batch processing happens in Spring Batch;
    Quartz handles scheduling only).
  • I have a single schema that holds both the Spring Batch tables and the Quartz tables.
  • I see this issue only in a clustered environment (4-node cluster); on a single-node or
    two-node cluster it works very well.

Based on the above, do you suspect anything other than the shared DB schema? Please share your suggestions. Thank you.

@AntonSemenovKazan

AntonSemenovKazan commented Oct 16, 2018

@MSudheer87
Our case:
We have only one DB schema, which holds both our application tables and the Quartz tables.
We noticed that triggers become blocked when an SQLException (such as an SQL timeout exception) occurs or the transaction gets disconnected.
These problems happen when SQL Server is running long, heavy operations on the app data.

So we decided to move the Quartz tables to their own separate schema.
We also considered moving them to a separate node (SQL Server).

You could also take the sources and extend the logging.
Maybe you will notice something unusual.

My whole team finds Quartz a little bit strange, and we run into trouble with it quite often.
But sometimes everything is OK.

We have written a monitoring system to watch Quartz's behavior.

Unfortunately I cannot say anything about Spring Batch.

@sunildabburi

Going by @shelmling's fix, here is a workaround for a Spring-managed DataSource until his PR is merged:

  1. Create a CustomJobStore class that extends LocalDataSourceJobStore and override releaseAcquiredTrigger(Connection conn, OperableTrigger trigger)
  2. Create a CustomSchedulerFactory class that extends StdSchedulerFactory and override initialize(Properties props)
  3. Set schedulerFactoryClass during SchedulerBean creation
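
import java.util.Properties;

import org.quartz.SchedulerException;
import org.quartz.impl.StdSchedulerFactory;

public class CustomSchedulerFactory extends StdSchedulerFactory {

	@Override
	public void initialize(Properties props) throws SchedulerException {
		// Point Quartz at the patched job store before the factory reads the properties.
		props.put(StdSchedulerFactory.PROP_JOB_STORE_CLASS, CustomJobStore.class.getName());
		super.initialize(props);
	}
}

import java.sql.Connection;
import java.sql.SQLException;

import org.quartz.JobPersistenceException;
import org.quartz.spi.OperableTrigger;
import org.springframework.scheduling.quartz.LocalDataSourceJobStore;

public class CustomJobStore extends LocalDataSourceJobStore {

	public CustomJobStore() {
		super();
	}

	@Override
	protected void releaseAcquiredTrigger(Connection conn, OperableTrigger trigger) throws JobPersistenceException {
		try {
			// Reset from ACQUIRED (existing behavior) and also from BLOCKED (the fix).
			getDelegate().updateTriggerStateFromOtherState(conn, trigger.getKey(), STATE_WAITING, STATE_ACQUIRED);
			getDelegate().updateTriggerStateFromOtherState(conn, trigger.getKey(), STATE_WAITING, STATE_BLOCKED);
			getDelegate().deleteFiredTrigger(conn, trigger.getFireInstanceId());
		} catch (SQLException e) {
			throw new JobPersistenceException("Couldn't release acquired trigger: " + e.getMessage(), e);
		}
	}
}
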
SchedulerFactoryBean scheduler = new SchedulerFactoryBean();
...
scheduler.setSchedulerFactoryClass(CustomSchedulerFactory.class);

This should unblock you for now 👍

@zemian
Contributor

zemian commented Feb 12, 2019

Thank you @shelmling for the PR! It's now merged!

@carstenartur

I cannot see this issue 145 to be fixed in any release at https://github.com/quartz-scheduler/quartz/releases .
Does this mean it is not fixed or there has never been a bug?

@AntonSemenovKazan

I should say that eventually we replaced Quartz with Hangfire, and now we live happily.

We found out that Quartz got stuck when the HDD was under high I/O load.
We could easily reproduce that case.

We then tested Hangfire + MS SQL under the same high I/O load, and it worked without problems.
So we replaced Quartz, and we have now been using Hangfire for more than a year.

@jnehlmeier

I cannot see this issue 145 to be fixed in any release at https://github.com/quartz-scheduler/quartz/releases .
Does this mean it is not fixed or there has never been a bug?

@carstenartur It is fixed in 2.3.1+ as part of pull request #146

see commit: 3f65b28

@sunildabburi

For those who are still seeing this issue: if you implemented the JobListener interface, make sure you handle exceptions yourself within the jobWasExecuted method. Quartz does not handle exceptions thrown in that method, which can leave your job in the BLOCKED state with no recovery. We experienced this with Quartz version 2.3.0.
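
A library-free sketch of such a guard, assuming the listener logic can be expressed as a Runnable. ListenerGuard and runSafely are hypothetical names, and the audit and log calls in the usage comment are placeholders, not Quartz APIs.

```java
import java.util.function.Consumer;

/** Hypothetical helper: runs listener logic without letting a runtime exception escape. */
final class ListenerGuard {
    private ListenerGuard() {}

    /** Returns true if body completed normally, false if it threw (the error goes to onError). */
    static boolean runSafely(Runnable body, Consumer<Throwable> onError) {
        try {
            body.run();
            return true;
        } catch (RuntimeException e) {
            // Swallow the failure here so it never propagates back into the scheduler.
            onError.accept(e);
            return false;
        }
    }
}

// Possible use inside a JobListener implementation (audit and log are placeholders):
//
//   public void jobWasExecuted(JobExecutionContext context, JobExecutionException jobException) {
//       ListenerGuard.runSafely(() -> audit(context), e -> log.error("listener failed", e));
//   }
```

A plain try/catch inside jobWasExecuted achieves the same thing; the helper only keeps the pattern in one place when several listeners need it.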

@fernandoRSS

For those who are still seeing this issue: if you implemented the JobListener interface, make sure you handle exceptions yourself within the jobWasExecuted method. Quartz does not handle exceptions thrown in that method, which can leave your job in the BLOCKED state with no recovery. We experienced this with Quartz version 2.3.0.

After 6 hours trying to find a solution I bumped into your answer and it was exactly what was happening in my code.
Thank you

@sww0825521xy

For those who are still seeing this issue: if you implemented the JobListener interface, make sure you handle exceptions yourself within the jobWasExecuted method. Quartz does not handle exceptions thrown in that method, which can leave your job in the BLOCKED state with no recovery. We experienced this with Quartz version 2.3.0.

Since I added the JobListener and upgraded Quartz from 2.3.0 to the latest version, 2.3.2, most of these blocked triggers change from BLOCKED to WAITING automatically.
But one or two BLOCKED triggers still remain in my case.

@kevinamasur

@AntonSemenovKazan and @MSudheer87 our team was noticing issues similar to yours. The quartz queries on SQL Server would slow down significantly to the point of timing out when the database was under heavy load and waits started to increase (either CPU, Memory, or Disk).

One of the things we noticed was the query plans for quartz queries were very inefficient. When we deep dived we found something like the following query was being executed on the DB:
exec sp_executesql N'SELECT TRIGGER_NAME FROM test_localhost_TRIGGERS WHERE SCHED_NAME = ''QuartzScheduler_test_localhost'' AND TRIGGER_NAME = @P0 AND TRIGGER_GROUP = @P1',N'@P0 nvarchar(4000),@P1 nvarchar(4000)',N'Trigger.AnExampleTrigger',N'test_localhost'

The issue with the above is that parameters @P0 and @P1 are being passed to SQL Server as nvarchar, but the columns on the tables are actually varchar. This can cause the database to use very inefficient query plans when running queries.

We found we were using the SQL Server JDBC driver, and it has a setting for setSendStringParametersAsUnicode which defaults to on. This causes all string parameters to be sent as nvarchar, even if the column is varchar.

The Quartz tables don't have any nvarchar columns, and according to Microsoft's own documentation:

For optimal performance with the CHAR, VARCHAR, and LONGVARCHAR JDBC data types, an application should set the sendStringParametersAsUnicode property to "false"

We only recently found this out and deployed it to our production environment. I don't know if it will fully fix the Quartz issues we have seen, but we haven't had any issues since we made this switch last month, so I thought I would post it in case it helps anyone else. The fix was simply setting sendStringParametersAsUnicode to false on the JDBC URL for the Quartz connection pool.
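
A small sketch of what that URL change might look like. The helper class and the example URL are hypothetical; only the sendStringParametersAsUnicode property name comes from the driver documentation quoted above.

```java
import java.util.Locale;

/** Hypothetical helper that appends the driver property to a SQL Server JDBC URL. */
final class SqlServerUrl {
    private static final String PROPERTY = "sendStringParametersAsUnicode";

    private SqlServerUrl() {}

    static String withNonUnicodeStrings(String jdbcUrl) {
        // Leave the URL unchanged if the property is already set explicitly,
        // so an existing setting is never silently overridden.
        if (jdbcUrl.toLowerCase(Locale.ROOT).contains(PROPERTY.toLowerCase(Locale.ROOT) + "=")) {
            return jdbcUrl;
        }
        // SQL Server connection properties are separated by semicolons.
        String separator = jdbcUrl.endsWith(";") ? "" : ";";
        return jdbcUrl + separator + PROPERTY + "=false";
    }
}
```

For example, `jdbc:sqlserver://localhost:1433;databaseName=quartz` would become `jdbc:sqlserver://localhost:1433;databaseName=quartz;sendStringParametersAsUnicode=false`.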

@asookazian

asookazian commented Aug 28, 2023

Our team deployed the latest Quartz JAR, version 2.3.2, to our prod server last Friday. But we still see BLOCKED triggers in the qrtz_triggers table for our email notification immediately after a Catalina restart. Is there any follow-up or advice for this behavior? Would enabling TRACE logging on the "org.quartz" package help? We have the clustered option set in quartz.properties, and the Quartz tables are in the same DB schema as our app tables.

Has anyone enabled JMX remote access to MBeans as a potential workaround to reset the trigger state to WAITING and then immediately fire the trigger? https://dzone.com/articles/how-manage-quartz-remotely

@Herman1998CHAN

I encountered a similar issue in our team's new project. We have two app servers, but we only deployed the war file, which includes the Quartz job, on app A. In the cluster configuration in the Quartz job XML, clustering is set to false.

Initially, the scheduled jobs functioned properly, but at certain times, they would become "blocked" and fail to resume.

We observed that only the jobs related to updating the status in the database would rerun and return to normal functioning. However, we couldn't find any relevant error logs on the app server. Does anyone have any insights into this issue?

On the other hand, we previously implemented the same configuration (two app servers, war file hosted only on app A) in another project without encountering similar issues. It's worth noting that the previous project used MSSQL, while the new project uses MySQL.
