Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs,jobspb: rename ExecutionLog, populate #69370

Merged
merged 4 commits into from
Aug 27, 2021

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Aug 25, 2021

I got cold feet on the ExecutionLog in the Payload being populated at the
beginning of job execution. I become uncomfortable with the idea of writing to
the Payload an extra time for every job execution in the happy case. These
things can get big. Instead we only populate events when a retriable error
occurs. Given that, the events have been renamed to RetriableExecutionError.

This data is exposed in the crdb_internal.jobs table via the execution_errors
string array. The decision to use an array of strings instead of json was
predicated upon the idea that these errors are for human consumption. The
structured data continues to exist in the protobuf.

These events are retained in the Payload in RetriableExecutionErrorLog.
A finite number of events are retained, and that number is controlled by the new
non-public cluster setting jobs.execution_errors.max_entries. The size of the
events is also controlled by a cluster setting: jobs.execution_errors.max_entry_size.

If an event would be larger than this size, the error it would contain is formatted
to a string and truncated to the max size. This loses structure but at least retains
some observability.

Fixes #68800.

Release justification: bug fixes and low-risk updates to new functionality.

Release note (general change): When jobs encounter retriable errors during
execution, they will now record these errors into their state. The errors as
well as metadata about the execution can be inspected via the newly added
execution_errors field of crdb_internal.jobs which is a STRING[] column.

@ajwerner ajwerner requested review from sajjadrizvi and a team August 25, 2021 16:52
@ajwerner ajwerner requested a review from a team as a code owner August 25, 2021 16:52
@ajwerner ajwerner requested review from a team and rhu713 and removed request for a team August 25, 2021 16:52
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, 14 of 14 files at r2, 15 of 15 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rhu713 and @sajjadrizvi)

Copy link

@sajjadrizvi sajjadrizvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: mod a nit

Reviewed 2 of 2 files at r1, 14 of 14 files at r2, 15 of 15 files at r3, 5 of 5 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @rhu713)


pkg/jobs/config.go, line 69 at r4 (raw file):

	// defaultExecutionErrorsMaxEntrySize is the maximum allowed size of an
	// error. If this size is exceeded, the error will formatted as a string

nit: s/ will formatted/will be formatted

Release justification: non-production code changes

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/execution-errors-in-jobs branch 2 times, most recently from 7649f96 to d52689c Compare August 26, 2021 14:46
Also moves the code to a new file.

Release justification: category 2

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/execution-errors-in-jobs branch from d52689c to 0c9833b Compare August 26, 2021 16:23
@ajwerner ajwerner changed the title jobs,jobspb: remove ExecutionLog, populate ResumeErrors jobs,jobspb: rename ExecutionLog, populate Aug 26, 2021
@ajwerner ajwerner force-pushed the ajwerner/execution-errors-in-jobs branch 4 times, most recently from d762bcb to 4825d71 Compare August 27, 2021 01:52
I got cold feet on the `ExecutionLog` in the `Payload` being populated at the
beginning of job execution. I become uncomfortable with the idea of writing to
the `Payload` an extra time for every job execution in the happy case. These
things can get big. Instead we only populate events when a retriable error
occurs. Given that, the events have been renamed to `RetriableExecutionError`.

This data is exposed in the crdb_internal.jobs table via the `execution_errors`
string array. The decision to use an array of strings instead of json was
predicated upon the idea that these errors are for human consumption. The
structured data continues to exist in the protobuf.

Work to truncate these logs will come in the next commit.

Release justification: bug fixes and low-risk updates to new functionality.

Release note (general change): When jobs encounter retriable errors during
execution, they will now record these errors into their state. The errors as
well as metadata about the execution can be inspected via the newly added
`execution_errors` field of `crdb_internal.jobs` which is a `STRING[]` column.
@ajwerner ajwerner force-pushed the ajwerner/execution-errors-in-jobs branch from 4825d71 to 870ff19 Compare August 27, 2021 04:15
This change limits the size of execution error entries based on
cluster settings (which are not public).

Release justification: low risk change to new functionality

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/execution-errors-in-jobs branch from 870ff19 to b7301c6 Compare August 27, 2021 12:47
@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 27, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

jobs: record retriable errors in the job upon failure
4 participants