Fix backfill.sql compression job rescheduling #26

Open · wants to merge 2 commits into master

Conversation


@coiax coiax commented May 16, 2022


Resolves #25

In `move_compression_job()`, the existing implementation was
`INNER JOIN`ing on `timescaledb_information.jobs` and
`timescaledb_information.job_stats`.

This was both redundant (since we were not using any information from
the `job_stats` view) and prone to failure, since a job that had
never run would have no entry in the `job_stats` view.

This would cause the job not to be rescheduled, so it would run at the
same time as the backfill procedure, leading to data corruption.
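The simplified lookup might look like this (a sketch against TimescaleDB 2.x's `timescaledb_information.jobs` view; the schema and table names are placeholders, not the PR's actual code):

```sql
-- Query only the jobs view: a compression policy that has never run
-- still appears here, even though it has no row in job_stats.
SELECT j.job_id, j.next_start
FROM timescaledb_information.jobs AS j
WHERE j.proc_name = 'policy_compression'
  AND j.hypertable_schema = 'my_schema'  -- placeholder
  AND j.hypertable_name = 'my_table';    -- placeholder
```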

---

Secondly, if the Compression Policy job was already running when the
jobs table was examined, the `next_start` would be Timescale's
`DT_NOBEGIN` (sometimes displayed as `-infinity`), which would cause the
reschedule call at the end of the backfill to fail.

By sleeping in a loop until the job is no longer running, we avoid this
edge case. Most of the time, no sleep is required.
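Condensed, the wait works roughly like this (a sketch with a hypothetical job id; the PR's actual loop, shown in the diff, also moves the job in the same step):

```sql
-- Poll until the job is not mid-run, i.e. until its next_start is a
-- real timestamp rather than -infinity (DT_NOBEGIN).
LOOP
    EXIT WHEN NOT EXISTS (
        SELECT FROM timescaledb_information.jobs
        WHERE job_id = 42  -- hypothetical job id
          AND next_start = '-infinity'::timestamptz
    );
    PERFORM pg_sleep(10);  -- job is running; wait, then re-check
END LOOP;
```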

---

Some `RAISE NOTICE` statements have been included to make future
debugging of this kind easier and to provide more visibility into the
actions the backfill procedure is taking.

Co-authored-by: Simon Schmidt <simon.schmidt@infogrid.io>
Co-authored-by: Nick Pope <nick.pope@infogrid.io>
@CLAassistant

CLAassistant commented May 16, 2022

CLA assistant check
All committers have signed the CLA.

Much as removing whitespace is a good thing, it's making the diff of our
PR noisy, so let's remove it for now.
@ngnpope

ngnpope commented Sep 5, 2022

@fabriziomello (Sorry to ping you directly, but it's hard to tell who looks after this repository...)

Please can we have this fix reviewed and merged if acceptable?

Currently this backfilling solution is referenced in the documentation, but the race condition that this PR fixes is, from our testing, pretty much guaranteed to occur, resulting in data loss.

Comment on lines +175 to +195
-- Push the compression job out for some period of time so we don't end up compressing a decompressed chunk
-- Don't disable completely because at least then if we fail and fail to move it back things won't get completely weird
LOOP
    SELECT
        move_compression_job(
            hypertable_row.id,
            hypertable_row.schema_name,
            hypertable_row.table_name,
            now() + compression_job_push_interval
        )
    INTO old_compression_job_time;
    IF old_compression_job_time = '-infinity'::timestamptz THEN
        ROLLBACK;
        RAISE NOTICE 'Compression job already running, sleeping...';
        PERFORM pg_sleep(10);
    ELSE
        COMMIT;
        RAISE NOTICE 'Compression job not already running, proceeding as normal...';
        EXIT;
    END IF;
END LOOP;
fabriziomello (Contributor) commented:
If there are consecutive failures and you configure the `max_retries` option for the job, the `next_start` will be NULL, so I guess you're not dealing with this case here. I mean, the flow will continue to "proceeding as normal": is this the expected behavior? Wondering if we'll end up in an infinite loop here.


Not sure, to be honest - it's a long time since we looked at this.

All I know is that we've used this successfully and not hit an infinite loop 🤷🏻
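One way to handle the NULL `next_start` case raised above would be an explicit guard before the `-infinity` check (a hypothetical sketch, not part of the PR):

```sql
IF old_compression_job_time IS NULL THEN
    -- Job was disabled after exhausting max_retries; bail out rather
    -- than looping forever or proceeding with an unscheduled job.
    RAISE EXCEPTION 'compression job has no next_start; aborting backfill';
ELSIF old_compression_job_time = '-infinity'::timestamptz THEN
    ROLLBACK;
    RAISE NOTICE 'Compression job already running, sleeping...';
    PERFORM pg_sleep(10);
ELSE
    COMMIT;
    EXIT;
END IF;
```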

Comment on lines +312 to +318
SELECT
    move_compression_job(
        hypertable_row.id,
        hypertable_row.schema_name,
        hypertable_row.table_name,
        old_compression_job_time
    ) INTO old_compression_job_time;
fabriziomello (Contributor) commented:
We prefer to send separate PRs for code formatting because we can add them to `.git-blame-ignore-revs`.


@coiax Could you revert this formatting change as it's unrelated?

@xbarra

xbarra commented May 29, 2023

@ngnpope did you have a chance to change your fix based on Fabrizio's comments?

@ngnpope

ngnpope commented Jun 3, 2023

@xbarra This PR was by one of my colleagues, so I can't edit this. Will see if he is willing to polish this off.

@@ -88,21 +88,23 @@ BEGIN
SELECT split_part(extversion, '.', 1)::INT INTO version FROM pg_catalog.pg_extension WHERE extname='timescaledb' LIMIT 1;

IF version = 1 THEN
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id;
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id;

Please revert this whitespace change to keep the diff minimal:

Suggested change
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id;
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id;

IF version = 1 THEN
PERFORM alter_job_schedule(compression_job_id, next_start=> new_time);
ELSE
ELSE

Ditto, please revert:

Suggested change
ELSE
ELSE

END IF;

IF compression_job_id IS NULL THEN
IF compression_job_id IS NULL THEN

Ditto, please revert:

Suggested change
IF compression_job_id IS NULL THEN
IF compression_job_id IS NULL THEN


Successfully merging this pull request may close these issues:

- Race condition in backfill.sql when worker scheduler has not started (#25)