-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jobs: migrate jobs to write their Payload and Progress to the jobs_info table #97762
Labels
A-disaster-recovery
A-jobs
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
GA-blocker
T-jobs
Comments
adityamaru
added
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
A-jobs
labels
Feb 27, 2023
adityamaru
added a commit
to adityamaru/cockroach
that referenced
this issue
Mar 12, 2023
This change adds a migration and corresponding cluster version after which every job entry in the system.jobs table will have its Payload and Progress written to two rows in the system.job_info table. Informs: cockroachdb#97762 Release note: None
adityamaru
added
GA-blocker
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
A-disaster-recovery
labels
Mar 13, 2023
cc @cockroachdb/disaster-recovery |
cc @amruss |
adityamaru
added a commit
to adityamaru/cockroach
that referenced
this issue
Mar 14, 2023
This change adds a migration and corresponding cluster version after which every job entry in the system.jobs table will have its Payload and Progress written to two rows in the system.job_info table. Informs: cockroachdb#97762 Release note: None
craig bot
pushed a commit
that referenced
this issue
Mar 14, 2023
97138: ui: add error code to stmt and txn insights details pages r=gtr a=gtr Part of: #87785. Previously, the stmt and txn insights details pages did not show any further information for failed executions. This commit adds an "error code" column to the insights table for a failed execution in the stmt and txn insights details pages. Additionally, a "status" column was added to the stmt and txn workload insights tables which is either "Completed" or "Failed". Future work involves adding the error message string in addition to the error code but it needs to be redacted first. Additionally, the txn status is missing the implementation of a "Cancelled" status. Note to reviewers: only consider the second commit, as the first is required to get the txn status. - Loom [demo](https://www.loom.com/share/e82b97ff9f034d82b98640170eb54408). Release note (ui change): Adds error code column to the insights table for a failed execution in the stmt and txn insights details page. Adds status column to the stmt and txn workload insights tables. 98410: cluster-ui: tenants use sqlstats-supplied regions r=matthewtodd a=matthewtodd Fixes #98056. As of #95449, the SQL Activity pages in the DB Console can draw regions information directly from the sqlstats tables, rather than having to translate node IDs to regions on page load. Here, we make that switch, but for non-system tenants only, because: 1. The ephemeral nature of serverless nodes made this view-time mapping especially problematic in that context. (See further notes in #95449.) 2. The system-tenant views also include KV node IDs in a special Regions/Nodes column, which we are unable to recreate given the backend storage structure. (Future design work might suggest removing these node IDs altogether, for a unified UI.) # Screenshots! ## Statements, with and without regions filter <img width="1372" alt="statements" src="https://user-images.githubusercontent.com/5261/225033247-739df90a-9173-4aab-a666-a61a1ceeb579.png"> <img width="1372" alt="statements - filtered" src="https://user-images.githubusercontent.com/5261/225033271-1c0d0f82-3dd4-48ea-bdef-11f19af97a85.png"> ## Statement details <img width="1372" alt="statement details" src="https://user-images.githubusercontent.com/5261/225033338-6dff4a6e-a4a3-48c6-863a-84f1375b0a61.png"> ## Transactions, with and without regions filter <img width="1372" alt="transactions" src="https://user-images.githubusercontent.com/5261/225033366-65f44e95-3549-47cc-b0f2-67ad48a1a1fa.png"> <img width="1372" alt="transactions - filtered" src="https://user-images.githubusercontent.com/5261/225033391-50b9a2dc-e9a1-457b-84b1-837426eba35e.png"> ## Transaction details <img width="1372" alt="transaction details" src="https://user-images.githubusercontent.com/5261/225033505-3fdeceef-35dc-4e06-af25-ab4d0c53518f.png"> Release note: None 98464: jobs,upgrades: add migration to backfill job_info table r=dt a=adityamaru This change adds a migration and corresponding cluster version after which every job entry in the system.jobs table will have its Payload and Progress written to two rows in the system.job_info table. Informs: #97762 Release note: None 98510: backupccl: update restore/nodeshutdown tests to use new roachtest framework r=adityamaru a=msbutler The restore/nodeshutdown tests have been using a very old workload that will not be restorable when #93804 lands. This patch changes the restore/nodeshutdown workload to a 80GB tpce restore and moves the tests to run on aws instead of gcp. Release note: None Epic: None 98579: upgrade/upgrades: skip TestUpgradeSchemaChangerElements r=smg260 a=smg260 Refs: #98062 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None Epic: None Co-authored-by: gtr <gerardo@cockroachlabs.com> Co-authored-by: Matthew Todd <todd@cockroachlabs.com> Co-authored-by: adityamaru <adityamaru@gmail.com> Co-authored-by: Michael Butler <butler@cockroachlabs.com> Co-authored-by: Miral Gadani <miral@cockroachlabs.com>
adityamaru
added a commit
to adityamaru/cockroach
that referenced
this issue
Mar 16, 2023
This change touches all the parts of the code that were relying on the `system.jobs` table to read the payload and/or progress corresponding to a job. If the cluster version has advanced past V23_1JobInfoTableIsBackfilled, every job in the system.jobs table has a payload and progress entry written to the `job_info` table. This change migrates callers to use this new table when reading the payload and progress of a job. The most important changes are in the logic that drives `crdb_internal.system_jobs`, which in turn drives `crdb_internal.jobs` and `SHOW JOBS`. Additionally, there are changes in how the registry resolves the progress and payload when loading or resuming a job. Several tests that read the payload and progress from `system.jobs` now rely on `crdb_internal.system_jobs` to fetch this information from either the jobs table or job_info table depending on the cluster version. Fixes: cockroachdb#97762 Release note: None
adityamaru
added a commit
to adityamaru/cockroach
that referenced
this issue
Mar 31, 2023
This change introduces a cluster version after which the payload and progress of a job will not be written to the system.jobs table. This will ensure that the system.job_info table is the single, source of truth for these two pieces of information. This cluster version has an associated upgrade that schema changes the `payload` column of the `system.jobs` table to be nullable, thereby allowing us to stop writing to it. This upgrade step is necessary for a future patch where we will drop the payload and progress columns. Without this intermediate upgrade step the `ALTER TABLE ... DROP COLUMN` upgrade job will attempt to write to dropped columns as part of its execution thereby failing to run the upgrade. Informs: cockroachdb#97762 Release note: None
I'm going to mark this as done, all the relevant pieces have landed in 23.1. Any future improvements can be tracked separately. |
annrpom
added a commit
to annrpom/cockroach
that referenced
this issue
May 6, 2024
In cockroachdb#97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: cockroachdb#122675 Release note: None
annrpom
added a commit
to annrpom/cockroach
that referenced
this issue
May 15, 2024
In cockroachdb#97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: cockroachdb#122675 Release note: None
craig bot
pushed a commit
that referenced
this issue
May 16, 2024
123298: cli/doctor, doctor: use right jobs table, skip dropped descs r=fqazi a=annrpom ### cli/doctor: doctor should read from the right jobs table In #97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: #122675 Release note: None --- ### doctor: skip validation for dropped descriptors In some cases, dropped descriptors appear in our `system.descriptors` table with dangling job mutations without an associated job. This patch teaches debug doctor examine to skip validation on such dropped descriptors. Epic: none Fixes: #123477 Fixes: #122956 Release note: none Co-authored-by: Annie Pompa <annie@cockroachlabs.com>
blathers-crl bot
pushed a commit
that referenced
this issue
May 16, 2024
In #97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: #122675 Release note: None
miyamo2
added a commit
to miyamo2/cockroach
that referenced
this issue
May 17, 2024
mod comment lease: remove outdated NightlyStress call This test config is not used anymore. Release note: None roachtest: increase timeout of acceptance/multitenant-multiregion This has flaked due to timeout in CI. Release note: None sqlerrors: move gloval var errNoZoneConfigApplies into sqlerrors package This commit is a mechnical change that moves the gloval variable `errNoZoneConfigApplies` from package `sql` to package `sqlerrors`, as it will be used by the DSC when we support zone config in it. Informs: cockroachdb#117574 Release note: None scbuildstmt: Rerewrite a few function with better APIs This commit cleaned up a few functions using the preferred .FilterXxx() API to retrieve element from an element set. No functional change is introduced. Informs: cockroachdb#117574 Release note: None roachtest: update dsc job compat to mixed version v241-v242 This patch updates the dsc job compat roachtest to `declarative_schema_changer/job-compatibility-mixed-version-V241-V242`. Epic: CRDB-35306 Informs: cockroachdb#116395 Release note: None admission: introduce elastic io token exhausted duration metric This patch adds a new metric `elastic_io_tokens_exhausted_duration.kv`. This is similar to the existing `admission.granter.io_tokens_exhausted_duration.kv`, but for elastic traffic. The patch also does some code refactoring to make it easier to use both regular and elastic equivalent metrics. Informs cockroachdb#121574. Release note: None sql: expose an ability to request redacted stmt bundles This commit extends the stmt bundle collection infrastructure to support requesting redacted stmt bundles. Previously, this was only possible via an explicit `EXPLAIN ANALYZE (DEBUG, REDACT)` stmt, but this commit adds the support via overloads to `crdb_internal.request_statement_bundle` builtin as well as the server API. This paves the way for requesting the redacted bundles via the DB Console, but it'll be done separately. To support this we need to add another boolean column to `system.statement_diagnostics_requests` table to indicate whether a request is for the redacted bundle or not. This requires adding a migration in which we populate the existing rows with all `false` values. In order to reduce the amount of code churn and simplify this change, this commit repurposes some of the existing code that we introduced the last time we added a migration for plan-gist matching in stmt bundles. In other words, this commit simultenously removes some stale code and handles the new column. Release note: None sql: simplify usage of some enum settings This commit adjusts a few enum settings to use `EnumSetting.String` when defining the session variables to clean up the code a bit. Release note: None execinfra: add a sanity check that DistSQL version is not bumped The last bump was in 23.1, and we won't ever bump DistSQL version in backwards-incompatible going forward. Epic: None Release note: None Updated the AUTHORS file to include Naveen Setlur sqlstats: remove unnecessary StatementFingerprintID construction Previously we were constructing the stmt fingerprint id unnecessarily in 2 functions in sqlstats: - The Next function for the StmtStatsIterator - PopAllStats, which returns the sql stats collected thus far and resets the in-memory sql stats containers. The ID is available in the recorded statement in both of these cases- there's no need to reconstruct this value and doing is needlessly expensive. Epic: none Release note: None pgwire: add a metric for number of pipelined requests Release note (ops change): Added the sql.pgwire.pipeline.count metric, which is a gauge that measures how many wire protocol commands have been received by the server, but have not yet begun processing. This metric will only grow if clients are using the "pipeline mode" of the Postgres wire protocol. roachtest: skip multi-store-remove while debugging This roachtest has failed consistently for the past few days, since being enabled in cockroachdb#123694. Skip while we debug, to avoid noise. Touches: cockroachdb#123989. Release note: None. clusterversion: remove deprecated internal versions Remove deprecated 23.2 gates - these features will always be enabled in any cluster that a 24.2 node is part of. `TODODelete_V23_1` has a lot of uses and will be removed separately. Epic: none Release note: None sql: fix the type of copy_num_retries_per_batch We were incorrectly using the boolean getter for this integer variable. The bug was introduced when this variable was added in 1c1d313. Release note: None sql: fix cost_scans_with_default_col_size This commit fixes an omission where `cost_scans_with_default_col_size` session variable didn't actually use the corresponding cluster setting for its default value. In other words, the cluster setting actually had no effect. The bug has been present since this variable was introduced in af32d9d. Release note: None sql: fix up defaults for a few session variables This commit adjusts global defaults for a few session variables to use the postgres style (so that the global default value and session variable value would be stored alike). Release note: None sql: improve env collection in stmt bundles This commit refactors how we collect session variables and cluster settings into `env.sql` file in the stmt bundle. Previously, we used a hard-coded list of session variables that were always included while ignoring all variables not in the list. If a variable had a value different from the "binary default" (meaning the value that you got right after the cluster startup, before any cluster setting modifications are applied), then it would be in a SET statement that would run on the `bundle recreate`; if the value was the same as "binary default", then the SET statement was commented out. Also, all cluster setting values were included into the file as a comment. This commit makes it so that we include all session variables and cluster settings that differ from their defaults. This allows us to recreate the environment while also reducing the overhead of including everything into `env.sql`. For session variables further clarification is warranted: - all read-only variables have been audited, and most are explicitly excluded. Out of 23 currently present session variables, only 9 were deemed to be possibly useful during investigations, so they are not excluded (one of 9 is `crdb_version` which we print out separately, so it's actually excluded too). - for writable variables, we include it if its value differs from the "binary" default (the one that you get on a fresh cluster) or from the "cluster" default (the one that you get on a fresh session on the current cluster). The latter kind is obtained via the provided `settings.Values` container whereas the former is obtained via a global singleton container. Additionally, this commit adjusts the SET and SET CLUSTER SETTING statements to have single quotes around the setting values that need them. For cluster settings all types work with having single quotes, but for session variables some (like integers) don't work with quotes while others (like strings) need them. This commit adds a map for those that need them as well as a simple test to catch some of the missing ones (the list might be incomplete given that the test exercises the default config). All values are adjusted to fit on a single line (we have some cluster settings that might span multiple lines). It also adjusts `EXPLAIN (OPT, ENV)` to include the list of cluster settings with non-default values. Release note: None roachprod: fetch secrets from cloud store Due to the complexity of fetching the secrets from the secrets manager, the secrets are now maintained in cloud storage. Fixes: cockroachdb#117125 Epic: none roachprod: deploy cockroach Previously, to upgrade a cluster it had to be done manually by running several commands to copy over new binaries, drain and stop cockroach and then running the same start command. This change introduces a `deploy` command that works similar as `stage` but also deploys the new version to the cluster by doing all of the above. It does it node for node to ensure a cluster remains in a healthy state especially if a load balancer is used to manage connections. Epic: None Release Note: None roachprod: add deploy applications explanation Epic: None Release Notes: None roachtest: add cluster settings operations This change adds operations to mutate cluster settings. The values are mutated based on a preset frequency and RNG. For example, if the frequency of a cluster setting operation is set to "30 minutes" it will only change to a new value every 30 minutes, even if the operation has been run multiple times within the same 30-minute window. Running in the same window will just set the same value again. Epic: None Release Note: None roachtest: fix `OperationRequiresNodes` dependency check Epic: None Release Note: None roachprod: cluster settings operation owners Epic: None Release Notes: None roachprod: fix UT to ignore the yaml format currently, the test is looking for the exact match of the yaml. this can fail as the yaml is generated from a map and the values may not be in the same sequence. a better approach is to use a yaml parser, which will be done very soon. Fixes: cockroachdb#124276 Epic: none pgwire: deflake TestPipelineMetric The wrong context was used for the server, which meant that it could get cancelled too early. Release note: None roachprod, roachtest: use same cluster name sanitization Previously, roachtest had it's own function to sanitize cluster names, while roachprod had it's own function to verify cluster names. This change removes both and opts instead to use vm.DNSSafeName. This change also introduces MalformedClusterNameError which gives a hint on what is wrong with the name and tells roachtest not to retry cluster creation. intentresolver: mark the test heavy to prevent OOM Epic: none Release note: none ui: move enum type to avoid circular dependencies When importing the ViewMode enum into a unit test, an error is thrown in localStorage.reducer.ts relating to ViewMode being undefined. This is occurring due to a circular dependency. To fix, ViewMode is moved to a separate types.ts file Release note: None ui: prevent /_status/nodes calls for secondary tenants /_status/nodes is not implemented for secondary tenants, resulting in 501 responses in cloud for serverless tenants. To fix, this api is now gated by a tenant check and will no longer be called by serverless / secondary tenants. Fixes: CC-26900 Epic: None Release note: None cli/doctor: doctor should read from the right jobs table In cockroachdb#97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: cockroachdb#122675 Release note: None doctor: skip validation for dropped descriptors In some cases, dropped descriptors appear in our `system.descriptors` table with dangling job mutations without an associated job. This patch teaches debug doctor examine to skip validation on such dropped descriptors. Epic: none Fixes: cockroachdb#123477 Fixes: cockroachdb#122956 Release note: none backupccl: test that we actually downloaded all data during online restore Release note: none Epic: none roachtest: check downloaded spans in online restore roundtrip test Epic: none Release note: none workflows: set timeouts for GitHub Actions Essential CI jobs I've selected the timeouts according to how long each job seems to run in practice, with a big buffer to allow for load on the remote execution cluster, etc. The longest-running jobs I've given 60 minutes, with less time for the shorter jobs. Epic: CRDB-8308 Release note: None master: Update pkg/testutils/release/cockroach_releases.yaml Update pkg/testutils/release/cockroach_releases.yaml with recent values. Epic: None Release note: None Release justification: test-only updates cdctest: simplify orderValidator.NoteRow code This patch eliminates unnecessary nesting in orderValidator's NoteRow method. Release note: None cdctest: rename MakeCountValidator to NewCountValidator This patch renames `MakeCountValidator` to `NewCountValidator` to reflect that it returns a pointer, not a struct. Release note: None roachtest: rename cdc/sink-chaos to cdc/kafka-chaos and add more logging This patch renames the `cdc/sink-chaos` test to `cdc/kafka-chaos` to more accurately reflect that it is a Kafka-only test. It also adds logging for the Kafka chaos loop iteration number so that we can tell when the Kafka cluster is restarting from the logs. Release note: None roachtest: add order validation to CDC Kafka roachtests This patch adds order validation to CDC Kafka roachtests so that we can build more confidence in our ordering guarantees. It can be enabled for a roachtest either by directly setting the `validateOrder` flag on a `kafkaManager` before creating consumers, or indirectly by setting the `validateOrder` flag on `kafkaFeedArgs` for tests that use `cdcTester`. Release note: None sql: alter primary key hangs if index references exist Previously, ALTER PRIMARY KEY could hang if an index reference existed from either a view or function using the index selection syntax. This was becasue neither the declarative schema changer or legacy schema changer support updating these references. To address this, this patch will prevent ALTER PRIMARY KEY if such references exist to any indexes that will be recreated. Fixes: cockroachdb#123017 Release note (bug fix): ALTER PRIMARY KEY could hang if the table had any indexes which were referred to by views or functions using the force index syntax. roachprod: remove logging from updatePrometheusTargets It's hard to tell what the logging is about when it happens in the context of a roachtest. It's also verbose as it prints one log line for each VM whenever a test starts cockroach. Epic: none Release note: None roachtest: explicitly close monitor reader after closing session Previously the reader did not close, even though the session is closed. This change ensures the reader is closed by closing it after the session has been closed. This seems to have only been a problem with local clusters. Fixes: cockroachdb#116314 Epic: None Release Note: None roachtest: lower max-upgrades in `tpcc/mixed-headroom` Avoids timeouts. Fixes: cockroachdb#124264 Release note: None scripts: update `drtprod` create drt-large Bring the creation in line with what is currently running on cockroach-drt. Uses zone expansion to balance the nodes evenly across regions. Epic: None Release Note: None scripts: update `drtprod` drt-large region In an effort to save cost on regional network transfers the Warsaw Europe Zone will now be switched to Columbus. Epic: None Release Note: None backupccl: deflake TestDataDriven_external_connections_privileges This test already tests the permissions end-to-end by asserting that the user can or cannot complete the expected actions. No need to print the system database as well. Fixes cockroachdb#123379 Release note: None externalconn: implement `SHOW EXTERNAL CONNECTIONS`. This new query displays redacted URIs of external connections along with other valuable information. Epic: CRDB-15001 Fixes: cockroachdb#85905 Release note (sql change): New queries `SHOW EXTERNAL CONNECTIONS` and `SHOW EXTERNAL CONNECTION <connection name>` have been added. These queries display redacted connection URIs and other useful information such as the connection type. Access to these queries is restricted to the owner of the connection or users with `USAGE` privilege. Admin or root users will have unrestricted access to all connections. kvstorage: speed up scan for range descriptors On startup we currently scan through all `/Local/Range/` keys in order to find the range descriptors. This involves going through all `/Local/Range/Transaction` records and tombstones so we end up reading through a lot of data. This change switches to using `SeekGE`: - when we encounter a key that has a suffix before `rdsc` like `/Local/Range/foo/prbe`, we seek directly to the corresponding descriptor key `/Local/Range/foo/rdsc`; - after we decode a descriptor, we seek straight to the next possible range descriptor `/Local/Range/EndKey/rdsc` using the `EndKey` in the descriptor. Note that inside Pebble, seeks to keys that are very close to the current position are optimized to try to iterate through the next few keys before doing a real seek. So this change should not be detrimental even when there aren't a lot of keys to skip. I experimented on a store captured after running kv0 with batch size 2 against a single node. Before this change we step through 353 keys but the iterator has to internally step through 160k tombstones: ``` range descriptor iteration done: 339 keys, 65 range descriptors (by suffix: map[qlpt:81 rdsc:65 txn-:193]); scan stats: stepped 353 times (169.743k internal); seeked 2 times (2 internal)... ``` After, we seek around the transaction records and end up doing just a seek per range: ``` range descriptor iteration done: 65 range descriptors, 0 intents, 0 tombstones +‹(interface (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)),› +‹(internal-stats: (block-bytes: (total 563KB, cached 26KB, read-time 755.128µs))... ``` Fixes: cockroachdb#109740 kvserver: read lease under mutex when switching lease type A race could occur when a replica queue and post lease application both attempted to switch the lease type. This race would cause the queue to not process the replica because the lease type had already changed. As a result, lease preference violations might not have been quickly resolved by the lease queue. Read the lease under the same mutex used for requesting the lease, when possibly switching the lease type. Resolves: cockroachdb#123998 Release note: None logictest: skip flaky select_for_update test Informs: cockroachdb#124197. Epic: None Release note: None mod comment roachtest: save cluster earlier if debug-always is set When the --debug-always flag is passed, we mark the cluster as saved to let the cluster destroy know not to destroy it. Previously, we only marked a cluster as saved at the end of a test. However, if the caller terminated the test through ctrl c, which also destroys all clusters, it would lead to a race condition where sometimes the cluster would not be saved in time and get destroyed. This change moves the cluster save to immediantly after the cluster is created. Release note: none Fixes: none Epic: none
annrpom
added a commit
to annrpom/cockroach
that referenced
this issue
May 17, 2024
In cockroachdb#97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: cockroachdb#122675 Release note: None
annrpom
added a commit
to annrpom/cockroach
that referenced
this issue
May 20, 2024
In cockroachdb#97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: cockroachdb#122675 Release note: None
annrpom
added a commit
to annrpom/cockroach
that referenced
this issue
Jun 25, 2024
In cockroachdb#97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: cockroachdb#122675 Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-disaster-recovery
A-jobs
branch-release-23.1
Used to mark GA and release blockers, technical advisories, and bugs for 23.1
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
GA-blocker
T-jobs
In #84866 we introduced a
system.jobs_info
table that is to be used to store job-specific state. This table allows a job to access and update information about its execution without adversely impacting the execution of other jobs. This is explained in detail in #82638.As a first step, we want to teach jobs to write their Payload and Progress in their current form but to the jobs_info table instead of the jobs table. This migration will roughly involve:
Double writes for the Payload and Progress of jobs to both the jobs_info and jobs table.
A migration to backfill job_info with the Payload and Progress of all job records in the system.jobs table.
A change to read methods in the registry to pull the Payload and Progress from the job_info table instead of the jobs table.
In the future, jobs may want to decompose their Payload and Progress protos further and write different fields as their own rows in the jobs_info table. This is outside the scope of this issue.
Jira issue: CRDB-24874
Epic CRDB-8964
The text was updated successfully, but these errors were encountered: