Exponentially backoff when out of background workers #4567
Conversation
Force-pushed from de6cd06 to 6877d50
Codecov Report

```
@@            Coverage Diff             @@
##             main    #4567      +/-   ##
==========================================
- Coverage   90.81%   90.81%   -0.01%
==========================================
  Files         224      224
  Lines       42272    42298      +26
==========================================
+ Hits        38391    38412      +21
- Misses       3881     3886       +5
```

Continue to review the full report at Codecov.
Force-pushed from bae852d to f6a7043
In general, this looks good, but it would be good if you could add an outline to the commit message describing how:
- Jobs that fail while executing are retried
- Jobs that fail to start the background worker are retried (IIRC, this is different from the one above)
- Jobs that succeed are retried
Force-pushed from add0155 to 6983392
IIRC we had exponential backoff initially and then switched to additive backoff. The problem is with jobs that have fairly long schedules. Say a job is scheduled to run once a day: with exponential backoff, you will quickly schedule a failed job too far into the future. A better way might be to add a hard upper bound on the next start time after a failure, i.e. if the job failed:

```
new_start_time = last_failure_time + schedule_interval (with some backoff)
ub_start_time  = last_failure_time + '1 h'
if ub_start_time < new_start_time:
    new_start_time = ub_start_time
```

Also make sure to test for and catch potential overflow with exponential backoffs.
Thank you so much for reviewing! Currently there is an upper bound, but it is probably too high (10 times the schedule interval; for jobs that fail to launch due to a background worker shortage, the upper bound is 1 minute). I will change it to something fixed and smaller, '1 h' as you suggest.
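For illustration, here is a minimal sketch of that clamping in plain C with int64 microsecond timestamps; the helper name `next_start_after_failure` and the fixed `USECS_PER_HOUR` bound are hypothetical stand-ins, not the actual TimescaleDB identifiers:

```c
#include <stdint.h>

#define USECS_PER_HOUR ((int64_t) 3600 * 1000000)

/* Hypothetical helper: clamp the next start of a failed job to a fixed
 * '1 h' upper bound past the last failure, per the review suggestion. */
static int64_t
next_start_after_failure(int64_t last_failure_time, int64_t schedule_interval,
						 int64_t backoff)
{
	int64_t new_start = last_failure_time + schedule_interval + backoff;
	int64_t upper_bound = last_failure_time + USECS_PER_HOUR;

	/* never push a failed job past the fixed upper bound */
	return new_start < upper_bound ? new_start : upper_bound;
}
```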
Force-pushed from b02e65e to faed7cc
Force-pushed from 132023a to ee27e02
Force-pushed from ee27e02 to e2ed742
src/bgw/job_stat.c (Outdated)

```c
else
{
	// retry every 2 seconds
	ival = IntervalPGetDatum(&retry_ival);
}
```
Suggested change:

```c
else
	// retry every 2 seconds
	ival = IntervalPGetDatum(&retry_ival);
```
nit
src/bgw/job_stat.c (Outdated)

```c
else
{
	ival_max = IntervalPGetDatum(&interval_max);
}
```
Suggested change:

```c
else
	ival_max = IntervalPGetDatum(&interval_max);
```
nit
src/bgw/job_stat.c (Outdated)

```c
/* arbitrarily choose 2 for exponential growth */
for (i = 0; i < exponent - 1; i++)
{
	ival = DirectFunctionCall2(interval_mul, ival, Float8GetDatum(2));
```
Exponential backoff works when the scheduled time is picked randomly between `now()` and `now() + ival` (regardless of jitter). If we always pick `now() + ival`, then we risk livelocking the conflicting jobs ad infinitum.

This is orthogonal to jitter, which helps with scheduling at non-integer multiples of time units, so as to not generate high load spikes in the system (e.g. due to everything happening at integer seconds instead of being spread out).
You are right, we are risking that. So for the first case (out of background workers) I will change the backoff calculation to randomly pick a backoff value in [0, 2^f - 1], where f is the number of consecutive failed launches of the job.
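A hedged sketch of that randomized backoff in plain C; `launch_retry_wait_usecs` and its constants are illustrative, not the code that landed in src/bgw/job_stat.c:

```c
#include <stdint.h>
#include <stdlib.h>

#define USECS_PER_SEC ((int64_t) 1000000)
#define MAX_BACKOFF_USECS (60 * USECS_PER_SEC) /* 1 minute cap */

/* Hypothetical helper: a 2 s base retry plus a random backoff drawn
 * uniformly from [0, 2^f) seconds at microsecond granularity, where f
 * is the number of consecutive failed launches. */
static int64_t
launch_retry_wait_usecs(int consecutive_failed_launches)
{
	/* cap the exponent so the shift cannot overflow; 2^6 s already
	 * exceeds the 1 minute cap anyway */
	int f = consecutive_failed_launches < 6 ? consecutive_failed_launches : 6;
	int64_t max_slots = (int64_t) 1 << f;
	int64_t rand_backoff = random() % (max_slots * USECS_PER_SEC);
	int64_t wait = 2 * USECS_PER_SEC + rand_backoff;

	return wait < MAX_BACKOFF_USECS ? wait : MAX_BACKOFF_USECS;
}
```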
Force-pushed from 1c5a38f to 2cf5891
src/bgw/job_stat.c (Outdated)

```c
int64 micros_per_minute = 1000000;
// will get a random int in [0, (2^f - 1) * 1000]
// this represents a random amount of microseconds to backoff
int64 rand_backoff = random() % (max_slots * micros_per_minute);
```
Suggested change:

```c
int64 rand_backoff = random() % (max_slots * USECS_PER_SEC);
```

I don't see anything to do with minutes here.
Indeed
Force-pushed from 669792d to 07b074f
Force-pushed from 07b074f to 8053f1b
src/bgw/job_stat.c
Outdated
Interval interval_max = { .time = 60000000 }; | ||
Interval retry_ival = { .time = 2000000 }; | ||
retry_ival.time += rand_backoff; | ||
if (!launch_failure) |
Suggested change:

```c
if (launch_failure)
```

I think?
No, that should be the logic there. A launch failure is when the job couldn't be started because of a lack of background workers, and in that case we use the random backoff in seconds calculated above.
In the other cases (the job threw a runtime error) we have a retry period of 5 minutes, so the backoff grows like 5, 10, 20 minutes until it reaches the schedule interval, to take the retry period into account. Unless you meant to put the launch_failure case first, which would be more readable, so I'll do that.
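A sketch of the two regimes described above, reusing the hypothetical `launch_retry_wait_usecs` helper and `USECS_PER_SEC` constant from the earlier sketch; the doubling schedule and cap follow this description, not the exact shipped code:

```c
#include <stdbool.h>
#include <stdint.h>

static int64_t
retry_wait_usecs(bool launch_failure, int consecutive_failures,
				 int64_t schedule_interval_usecs)
{
	/* out of background workers: seconds-scale randomized backoff */
	if (launch_failure)
		return launch_retry_wait_usecs(consecutive_failures);

	/* runtime error: 5, 10, 20, ... minute retry period, capped at the
	 * schedule interval */
	int64_t wait = 5 * 60 * USECS_PER_SEC;
	for (int i = 1; i < consecutive_failures && wait < schedule_interval_usecs; i++)
		wait *= 2;

	return wait < schedule_interval_usecs ? wait : schedule_interval_usecs;
}
```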
Force-pushed from 8053f1b to b122b46
Force-pushed from 3ef6794 to 68f49eb
Force-pushed from 68f49eb to 5d22571
src/bgw/scheduler.c (Outdated)

```c
// if sjob->consecutive_failed_launches > 0, give the system some time to breathe,
// do not attempt to immediately re-run the job.
```
This comment is inside ts_bgw_job_stat_next_start(), maybe remove?
Force-pushed from 5d22571 to 56928c5
This release adds major new features since the 2.8.1 release. We deem it moderate priority for upgrading.

This release includes these noteworthy features:
* Hierarchical Continuous Aggregates (aka Continuous Aggregate on top of another Continuous Aggregate)
* Improve `time_bucket_gapfill` function to allow specifying the timezone to bucket
* Introduce fixed schedules for background jobs and the ability to check job errors.
* Use `alter_data_node()` to change the data node configuration. This function introduces the option to configure the availability of the data node.

This release also includes several bug fixes.

**Features**
* #4476 Batch rows on access node for distributed COPY
* #4567 Exponentially backoff when out of background workers
* #4650 Show warnings when not following best practices
* #4664 Introduce fixed schedules for background jobs
* #4668 Hierarchical Continuous Aggregates
* #4670 Add timezone support to time_bucket_gapfill
* #4678 Add interface for troubleshooting job failures
* #4718 Add ability to merge chunks while compressing
* #4786 Extend the now() optimization to also apply to CURRENT_TIMESTAMP
* #4820 Support parameterized data node scans in joins
* #4830 Add function to change configuration of data nodes
* #4966 Handle DML activity when datanode is not available
* #4971 Add function to drop stale chunks on a datanode

**Bugfixes**
* #4663 Don't error when compression metadata is missing
* #4673 Fix now() constification for VIEWs
* #4681 Fix compression_chunk_size primary key
* #4696 Report warning when enabling compression on hypertable
* #4745 Fix FK constraint violation error on insert into a hypertable that references a partitioned table
* #4756 Improve compression job IO performance
* #4770 Continue compressing other chunks after an error
* #4794 Fix degraded performance seen on timescaledb_internal.hypertable_local_size() function
* #4807 Fix segmentation fault during INSERT into compressed hypertable
* #4822 Fix missing segmentby compression option in CAGGs
* #4823 Fix a crash that could occur when using nested user-defined functions with hypertables
* #4840 Fix performance regressions in the copy code
* #4860 Block multi-statement DDL command in one query
* #4898 Fix cagg migration failure when trying to resume
* #4904 Remove BitmapScan support in DecompressChunk
* #4906 Fix a performance regression in the query planner by speeding up frozen chunk state checks
* #4910 Fix a typo in process_compressed_data_out
* #4918 Cagg migration orphans cagg policy
* #4941 Restrict usage of the old format (pre 2.7) of continuous aggregates in PostgreSQL 15
* #4955 Fix cagg migration for hypertables using timestamp without timezone
* #4968 Check for interrupts in gapfill main loop
* #4988 Fix cagg migration crash when refreshing the newly created cagg
* #5054 Fix segfault after second ANALYZE
* #5086 Reset baserel cache on invalid hypertable cache

**Thanks**
* @byazici for reporting a problem with segmentby on compressed caggs
* @jflambert for reporting a crash with nested user-defined functions
* @jvanns for reporting that a hypertable FK reference to a vanilla PostgreSQL partitioned table doesn't seem to work
* @kou for fixing a typo in process_compressed_data_out
* @kyrias for reporting a crash when ANALYZE is executed in extended query protocol mode with the extension loaded
* @tobiasdirksen for requesting Continuous aggregate on top of another continuous aggregate
* @Xima for reporting a bug in Cagg migration
* @xvaara for helping reproduce a bug with bitmap scans in transparent decompression
The scheduler detects the following three types of job failures:
1. Jobs that fail to launch (due to a shortage of background workers)
2. Jobs that throw a runtime error
3. Jobs that crash due to a process crashing

In cases 2 and 3, additive backoff is applied when calculating the next start time of a failed job. In case 1, we previously retried launching all jobs that had failed to launch simultaneously.

This commit introduces exponential backoff in case 1, randomly selecting a wait time in [2, 2 + 2^f] seconds at microsecond granularity, where f is the number of consecutive failed launches. The aim is to reduce the collision probability for jobs that compete for a background worker. The maximum backoff value is 1 minute. The behavior in cases 2 and 3 is unchanged.

Fixes #4562
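As a quick worked example of the [2, 2 + 2^f] range from the commit message, this standalone program prints the wait interval for the first few consecutive failed launches (the 1 minute cap would first matter at f = 6):

```c
#include <stdio.h>

int
main(void)
{
	/* after f consecutive failed launches, the wait falls in
	 * [2, 2 + 2^f] seconds (capped at 60 s, which first applies at f = 6) */
	for (int f = 1; f <= 5; f++)
		printf("f=%d: wait in [2, %d] s\n", f, 2 + (1 << f));
	return 0;
}
```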