Fix backfill.sql compression job rescheduling #26

Open · wants to merge 2 commits into master

Conversation


@coiax coiax commented May 16, 2022


Resolves #25

In `move_compression_job()`, the existing implementation was
`INNER JOIN`ing on `timescaledb_information.jobs` and
`timescaledb_information.job_stats`.

This was both redundant (since we were not using any information from
the `job_stats` view) and prone to failure, since a job that had
never run would have no entry in the `job_stats` view.

This would cause the job not to be rescheduled, so it would run at the
same time as the backfill procedure, leading to data corruption.
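The simplified lookup might look like this (a sketch against TimescaleDB 2.x's `timescaledb_information.jobs` view; the schema and table names are placeholders, not the PR's actual code):

```sql
-- Query only the jobs view: a compression policy that has never run
-- still appears here, even though it has no row in job_stats.
SELECT j.job_id, j.next_start
FROM timescaledb_information.jobs AS j
WHERE j.proc_name = 'policy_compression'
  AND j.hypertable_schema = 'my_schema'  -- placeholder
  AND j.hypertable_name = 'my_table';    -- placeholder
```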

---

Secondly, if the Compression Policy job was already running when the
jobs table was examined, the `next_start` would be Timescale's
`DT_NOBEGIN` (sometimes displayed as `-infinity`), which would cause the
reschedule call at the end of the backfill to fail.

By sleeping in a loop until the job is no longer running, we avoid this
edge case. Most of the time, no sleep is required.
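Condensed, the wait works roughly like this (a sketch with a hypothetical job id; the PR's actual loop, shown in the diff, also moves the job in the same step):

```sql
-- Poll until the job is not mid-run, i.e. until its next_start is a
-- real timestamp rather than -infinity (DT_NOBEGIN).
LOOP
    EXIT WHEN NOT EXISTS (
        SELECT FROM timescaledb_information.jobs
        WHERE job_id = 42  -- hypothetical job id
          AND next_start = '-infinity'::timestamptz
    );
    PERFORM pg_sleep(10);  -- job is running; wait, then re-check
END LOOP;
```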

---

Some `RAISE NOTICE` statements have been included to make future
debugging of this kind easier and to provide more visibility into the
actions the backfill procedure is taking.

Co-authored-by: Simon Schmidt <simon.schmidt@infogrid.io>
Co-authored-by: Nick Pope <nick.pope@infogrid.io>
@CLAassistant

CLAassistant commented May 16, 2022

CLA assistant check
All committers have signed the CLA.

Much as removing whitespace is a good thing, it's making the diff of our
PR noisy, so let's remove it for now.
@ngnpope

ngnpope commented Sep 5, 2022

@fabriziomello (Sorry to ping you directly, but it's hard to tell who looks after this repository...)

Please can we have this fix reviewed and merged if acceptable?

Currently this backfilling solution is referenced in the documentation, but the race condition that this PR fixes is, from our testing, pretty much guaranteed to occur, resulting in data loss.

Comment on lines +175 to +195
-- Push the compression job out for some period of time so we don't end up compressing a decompressed chunk
-- Don't disable completely because at least then if we fail and fail to move it back things won't get completely weird
LOOP
    SELECT
        move_compression_job(
            hypertable_row.id,
            hypertable_row.schema_name,
            hypertable_row.table_name,
            now() + compression_job_push_interval
        )
    INTO old_compression_job_time;
    IF old_compression_job_time = '-infinity'::timestamptz THEN
        ROLLBACK;
        RAISE NOTICE 'Compression job already running, sleeping...';
        PERFORM pg_sleep(10);
    ELSE
        COMMIT;
        RAISE NOTICE 'Compression job not already running, proceeding as normal...';
        EXIT;
    END IF;
END LOOP;
fabriziomello (Contributor) commented:
If there are consecutive failures and you configure the `max_retries` option for the job, the `next_start` will be NULL, so I guess you're not dealing with this case here. I mean, the flow will continue to "proceeding as normal": is this the expected behavior? Wondering if we'll end up in an infinite loop here.


Not sure, to be honest - it's a long time since we looked at this.

All I know is that we've used this successfully and not hit an infinite loop 🤷🏻
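One way to handle the NULL `next_start` case raised above would be an explicit guard before the `-infinity` check (a hypothetical sketch, not part of the PR):

```sql
IF old_compression_job_time IS NULL THEN
    -- Job was disabled after exhausting max_retries; bail out rather
    -- than looping forever or proceeding with an unscheduled job.
    RAISE EXCEPTION 'compression job has no next_start; aborting backfill';
ELSIF old_compression_job_time = '-infinity'::timestamptz THEN
    ROLLBACK;
    RAISE NOTICE 'Compression job already running, sleeping...';
    PERFORM pg_sleep(10);
ELSE
    COMMIT;
    EXIT;
END IF;
```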

Comment on lines +312 to +318
SELECT
    move_compression_job(
        hypertable_row.id,
        hypertable_row.schema_name,
        hypertable_row.table_name,
        old_compression_job_time
    ) INTO old_compression_job_time;
fabriziomello (Contributor) commented:
We prefer to send separate PRs for code formatting because we can add them to `.git-blame-ignore-revs`.


@coiax Could you revert this formatting change as it's unrelated?

@xbarra

xbarra commented May 29, 2023

@ngnpope did you have a chance to change your fix based on Fabrizio's comments?

@ngnpope

ngnpope commented Jun 3, 2023

@xbarra This PR was by one of my colleagues, so I can't edit this. Will see if he is willing to polish this off.

@@ -88,21 +88,23 @@ BEGIN
SELECT split_part(extversion, '.', 1)::INT INTO version FROM pg_catalog.pg_extension WHERE extname='timescaledb' LIMIT 1;

IF version = 1 THEN
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id;
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id;

Please revert this whitespace change to keep the diff minimal:

Suggested change
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id;
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id;

IF version = 1 THEN
PERFORM alter_job_schedule(compression_job_id, next_start=> new_time);
ELSE
ELSE

Ditto, please revert:

Suggested change
ELSE
ELSE

END IF;

IF compression_job_id IS NULL THEN
IF compression_job_id IS NULL THEN

Ditto, please revert:

Suggested change
IF compression_job_id IS NULL THEN
IF compression_job_id IS NULL THEN


Successfully merging this pull request may close these issues:

- Race condition in backfill.sql when worker scheduler has not started (#25)