-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix backfill.sql compression job rescheduling #26
base: master
Are you sure you want to change the base?
Conversation
In `move_compression_job()`, the existing implementation was `INNER JOIN`ing on `timescaledb_information.jobs` and `timescaledb_information.job_stats`. This was both redundant, (since we were not using any information from the `job_stats` view, and also prone to failure, since a job that had never run would have no entry in the `job_stats` view. This would cause the job not to be rescheduled, causing it to be run at the same time as the backfill proceedure, leading to data corruption. --- Secondly, if the Compression Policy job was already running when the jobs table was examined, the `next_start` would be Timescale's `DT_NOBEGIN` (sometimes displayed as `-infinity`), which would cause the reschedule call at the end of the backfill to fail. By sleeping in a loop until the job is no longer runnning, we avoid this edge case. The majority of the time, no sleep is required however. --- Some `RAISE NOTICE` statements have been included to make future debugging of this kind easier, and providing more visibility into the actions the backfill proceedure is taking. Co-authored-by: Simon Schmidt <simon.schmidt@infogrid.io> Co-authored-by: Nick Pope <nick.pope@infogrid.io>
Much as removing whitespace is a good thing, it's making the diff of our PR noisy, so let's remove it for now.
@fabriziomello (Sorry to ping you directly, but it's hard to tell who looks after this repository...) Please can we have this fix reviewed and merged if acceptable? Currently this backfilling solution is referenced in the documentation but the race condition that this MR fixes is pretty much, from our testing, guaranteed to occur, resulting in data loss. |
-- Push the compression job out for some period of time so we don't end up compressing a decompressed chunk | ||
-- Don't disable completely because at least then if we fail and fail to move it back things won't get completely weird | ||
LOOP | ||
SELECT | ||
move_compression_job( | ||
hypertable_row.id, | ||
hypertable_row.schema_name, | ||
hypertable_row.table_name, | ||
now() + compression_job_push_interval | ||
) | ||
INTO old_compression_job_time; | ||
IF old_compression_job_time = '-infinity' :: timestamptz THEN | ||
ROLLBACK; | ||
RAISE NOTICE 'Compression job already running, sleeping...'; | ||
PERFORM pg_sleep(10); | ||
ELSE | ||
COMMIT; | ||
RAISE NOTICE 'Compression job not already running, proceeding as normal...'; | ||
EXIT; | ||
END IF; | ||
END LOOP; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are consecutive failures and you configure the max_retries
option for the job the next_start
will return NULL, so I guess you're not dealing with this case here. I mean the flow will continue to proceeding as normal... is this the expected behavior? Wondering if we'll not enter in a infinity loop here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, to be honest - it's a long time since we looked at this.
All I know is that we've used this successfully and not hit an infinite loop 🤷🏻
SELECT | ||
move_compression_job( | ||
hypertable_row.id, | ||
hypertable_row.schema_name, | ||
hypertable_row.table_name, | ||
old_compression_job_time | ||
) INTO old_compression_job_time; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We prefer to send separated PRs for code formatting because we can add it to .git-blame-ignore-revs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@coiax Could you revert this formatting change as it's unrelated?
@ngnpope did you have a chance to change your fix based on Fabrizio's comments? |
@xbarra This PR was by one of my colleagues, so I can't edit this. Will see if he is willing to polish this off. |
@@ -88,21 +88,23 @@ BEGIN | |||
SELECT split_part(extversion, '.', 1)::INT INTO version FROM pg_catalog.pg_extension WHERE extname='timescaledb' LIMIT 1; | |||
|
|||
IF version = 1 THEN | |||
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id; | |||
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this whitespace change to keep the diff minimal:
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id; | |
SELECT job_id INTO compression_job_id FROM _timescaledb_config.bgw_policy_compress_chunks b WHERE b.hypertable_id = move_compression_job.hypertable_id; |
IF version = 1 THEN | ||
PERFORM alter_job_schedule(compression_job_id, next_start=> new_time); | ||
ELSE | ||
ELSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, please revert:
ELSE | |
ELSE |
END IF; | ||
|
||
IF compression_job_id IS NULL THEN | ||
IF compression_job_id IS NULL THEN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, please revert:
IF compression_job_id IS NULL THEN | |
IF compression_job_id IS NULL THEN |
-- Push the compression job out for some period of time so we don't end up compressing a decompressed chunk | ||
-- Don't disable completely because at least then if we fail and fail to move it back things won't get completely weird | ||
LOOP | ||
SELECT | ||
move_compression_job( | ||
hypertable_row.id, | ||
hypertable_row.schema_name, | ||
hypertable_row.table_name, | ||
now() + compression_job_push_interval | ||
) | ||
INTO old_compression_job_time; | ||
IF old_compression_job_time = '-infinity' :: timestamptz THEN | ||
ROLLBACK; | ||
RAISE NOTICE 'Compression job already running, sleeping...'; | ||
PERFORM pg_sleep(10); | ||
ELSE | ||
COMMIT; | ||
RAISE NOTICE 'Compression job not already running, proceeding as normal...'; | ||
EXIT; | ||
END IF; | ||
END LOOP; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, to be honest - it's a long time since we looked at this.
All I know is that we've used this successfully and not hit an infinite loop 🤷🏻
In
move_compression_job()
, the existing implementation wasINNER JOIN
ing ontimescaledb_information.jobs
andtimescaledb_information.job_stats
.This was both redundant, (since we were not using any information from
the
job_stats
view, and also prone to failure, since a job that hadnever run would have no entry in the
job_stats
view.This would cause the job not to be rescheduled, causing it to be run at
the same time as the backfill proceedure, leading to data corruption.
Secondly, if the Compression Policy job was already running when the
jobs table was examined, the
next_start
would be Timescale'sDT_NOBEGIN
(sometimes displayed as-infinity
), which would cause thereschedule call at the end of the backfill to fail.
By sleeping in a loop until the job is no longer runnning, we avoid this
edge case. The majority of the time, no sleep is required however.
Some
RAISE NOTICE
statements have been included to make futuredebugging of this kind easier, and providing more visibility into the
actions the backfill proceedure is taking.
Co-authored-by: Simon Schmidt simon.schmidt@infogrid.io
Co-authored-by: Nick Pope nick.pope@infogrid.io
Resolves #25