Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

execgen: large dependency tree causes lots of recompilation under bazel #77234

Closed
irfansharif opened this issue Mar 1, 2022 · 3 comments · Fixed by #77260
Closed

execgen: large dependency tree causes lots of recompilation under bazel #77234

irfansharif opened this issue Mar 1, 2022 · 3 comments · Fixed by #77260
Assignees
Labels
A-build-system C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-dev-inf T-sql-queries SQL Queries Team

Comments

@irfansharif
Copy link
Contributor

irfansharif commented Mar 1, 2022

Describe the problem

Bazel builds can some times be slower than corresponding make builds for the same files changed. I tried the following experiment:

  1. make buildshort and dev build short first, on a clean checkout (to priming both caches).
  2. Applied this random (buggy) diff to one of our base go packages:
$ git diff
diff --git i/pkg/roachpb/api.go w/pkg/roachpb/api.go
index f4d0b4c8aa..5b42b71539 100644
--- i/pkg/roachpb/api.go
+++ w/pkg/roachpb/api.go
@@ -143,7 +143,7 @@ func IsTransactional(args Request) bool {
 // IsLocking returns true if the request acquires locks when used within
 // a transaction.
 func IsLocking(args Request) bool {
-       return (args.flags() & isLocking) != 0
+       return (isLocking) != 0
 }
  1. Re-run make buildshort and dev build short, timing each run:
$ time make buildshort
[...]
________________________________________________________
Executed in   70.72 secs    fish           external
   usr time  174.42 secs   40.00 micros  174.42 secs
   sys time   24.85 secs  537.00 micros   24.85 secs

$ dev build short -- --profile=experiment.tar.gz
[...]
INFO: Elapsed time: 97.947s, Critical Path: 91.80s
INFO: 312 processes: 23 remote cache hit, 1 internal, 288 darwin-sandbox.
INFO: Build completed successfully, 312 total actions

When looking at the bazel profile (chrome://tracing), I observed that we re-compiled execgen, which forced recompilation of all package targets that depended on execgen output. How come make doesn’t go and rebuild execgen + re-run execgen + recompile all dependants? Looking at our Makefile, we only re-compile bin/execgen when proto files change. And we don't (re-)generate .eg.go files unless explicitly running make generate. In short: we've not fully specified execgen dependencies in Make and things are faster because of it. Bazel builds by contrast are fully specified and will re-compile + re-gen execgen output when any of execgen's go dependencies are updated. It's more "correct" but slower because of it.

To Reproduce

See above. To see the dependency chain between execgen and roachpb:

$  bazel query "allpaths(//pkg/sql/colexec/execgen/cmd/execgen, //pkg/roachpb)"

To see the full set of dependencies execgen has (any of which, if changed, would cause slower builds than Make):

$ bazel query "deps(//pkg/sql/colexec/execgen/cmd/execgen)" --output package | grep -v '^@' | grep -v '^build/bazelutil'

Expected behavior

I'm only filing this issue to later point people to for one possible reason for the observed slowdown. We should revisit transitive dependencies execgen has given its position in our build system -- any trim down would drastically help.

Epic CRDB-8036

Jira issue: CRDB-13494

@irfansharif irfansharif added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 1, 2022
@rickystewart
Copy link
Collaborator

Looks like this was not meant to be closed?

@rickystewart rickystewart reopened this Mar 1, 2022
@irfansharif
Copy link
Contributor Author

I did actually, I didn't have a concrete suggestion/next step here. If we want to use this issue to coordinate the trimming down of execgen deps, that works for me. I'm not familiar with the code there to suggest how. @ajwerner or @yuzefovich perhaps?

@rickystewart
Copy link
Collaborator

Yeah, I'm inclined to leave it open to track the problem.

@yuzefovich yuzefovich self-assigned this Mar 3, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Mar 3, 2022
craig bot pushed a commit that referenced this issue Mar 3, 2022
76858: kvserver: allow circuit-breaker to serve reads r=erikgrinaker a=tbg

This commit revamps an earlier implementation (#71806) of per-Replica
circuit breakers (#33007). The earlier implementation relied on context
cancellation and coarsely failed all requests addressing the Replica
when the breaker was tripped.

This had two downsides: First, there was a (small) performance overhead
for implementing the cancellation that was paid even in the common case
of a healthy Replica. Second, and more importantly, the coarseness meant
that we'd potentially fail many requests that would otherwise succeed,
and in particular follower reads.

`@nvanbenschoten` suggested in #74799 that latching could be extended with
the concept of "poisoning" and that this could result in fine-grained
circuit breaker behavior where only requests that are truly affected by
unavailability (at the replication layer) would be rejected.

This commit implements that strategy:

A request's latches are poisoned if its completion is predicated on the
replication layer being healthy.  In other words, when the breaker
trips, all inflight proposals have their latches poisoned and new
proposals are failed fast. However, and this is the big difference,
reads can potentially still be accepted in either of two scenarios:

- a valid follower read remains valid regardless of the circuit breaker
  status, and also regardless of inflight proposals (follower reads
  don't observe latches).
- a read that can be served under the current lease and which does
  not conflict with any of the stuck proposals in the replication
  layer (= poisoned latches) can also be served.

In short, reads only fail fast if they encounter a poisoned latch or
need to request a lease. (If they opted out of fail-fast behavior,
they behave as today).

Latch poisoning is added as a first-class concept in the `concurrency`
package, and a structured error `PoisonError` is introduced. This error
in particular contains the span and timestamp of the poisoned latch that
prompted the fail-fast.

Lease proposals now always use `poison.Policy_Wait`, leaving the
fail-fast behavior to the caller. This simplifies things since multiple
callers with their own `poison.Policy` can multiplex onto a single
inflight lease proposal.

Addresses #74799.

Release note: None
Release justification: 22.1 project work

77221: changefeedccl: Fix flaky test. r=miretskiy a=miretskiy

Fix flaky TestChangefeedHandlesDrainingNodes test.
The source of the flake was that cluster setting updates propagate
asynchronously to the other nodes in the cluster.  Thus, it was possible
for the test to flake because some of the nodes were observing the
old value for the setting.

The flake is fixed by introducing testing utility function that
sets the setting and ensures the setting propagates to all nodes in
the test cluster.

Fixes #76806

Release Notes: none

77317: util/log: remove Safe in favor of redact.Safe r=yuzefovich a=yuzefovich

My desire to make this change is to break the dependency of `treewindow`
on `util/log` (which is a part of the effort to clean up the
dependencies of `execgen`).

Addresses: #77234.

Release note: None

Release justification: low risk change to clean up the dependencies
a bit.

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Mar 3, 2022
…77328 #77335

75751: sql: Add DateStyle/IntervalStyle visitor r=e-mbrown a=e-mbrown

The DateStyle visitor allows for cast expressions with string
to interval and date/interval types to string cast to be rewritten.
These stable cast cause issues with DateStyle/IntervalStyle formatting so they
need to be wrapped in builtins containing their immutable version.

Release note: None
Release justification: Low risk update to new functionality

76705: backupccl: add prototype metadata.sst r=rhu713 a=rhu713

This adds writing of an additional file to the completion of BACKUP. This new file
is an sstable that contains the same metadata currently stored in the BACKUP_MANIFEST
file and statistics files, but organizes that data differently.

The current BACKUP_MANIFEST file contains a single binary-encoded protobuf message
of type BackupManifest, that in turn has several fields some of which are repeated
to contain e.g. the TableDescriptor for every table backed up, or every revision to
every table descriptor backed up. This can result in these manifests being quite large
in some cases, which is potentially concerning because as a single protobuf message,
one has to read and unmarshal the entire struct into memory to read any field(s) of it.

Organizing this metadata into an SSTable where repeated fields are instead stored as
separate messages under separate keys should instead allow reading it incrementally:
one can seek to a particular key or key prefix and then scan, acting on whatever data
is found as it is read, without loading the entire file at once (when opened using the
same seek-ing remote SST reader we use to read backup data ssts).

This initial prototype adds only the writer -- RESTORE does not rely on, or even open,
this new file at this time.

Release note: none.

77018: release: automate orchestration version update r=celiala a=rail

Previously, as a part of the release process we had to bump the
orchestration versions using `sed` with some error-prone regexes.

This patch adds `set-orchestration-version` subcommand to the release
tool. It uses templates in order to generate the orchestration files.

Release note: None

77055: sql: change index backfill merger to use batch api r=rhu713 a=rhu713

Use Batch API instead of txn.Scan() in order to limit the number of bytes per
batch response in the index backfill merger.

Fixes #76685.

Release note: None

77065: bazel: use test sharding more liberally r=rail a=rickystewart

Closes #76376.

Release note: None

77109: ccl/sqlproxyccl: add helpers related to connection migration r=JeffSwenson,andy-kimball a=jaylim-crl

#### ccl/sqlproxyccl: add helpers related to connection migration 

Informs #76000. Extracted from #76805.

This commit adds helpers related to connection migration. This includes support
for retrieving the transfer state through SHOW TRANSFER STATE, as well as
deserializing the session through crdb_internal.deserialize_session.

Release note: None

Release justification: Helpers added in this commit are needed for the
connection migration work. Connection migration is currently not being used
in production, and CockroachCloud is the only user of sqlproxy.
  
#### ccl/sqlproxyccl: fix math for defaultBufferSize in interceptors 

Previously, we incorrectly defined defaultBufferSize as 16K bytes. Note that
2 << 13 is 16K bytes. This commit fixes that behavior to match the original
intention of 8K bytes.

Release note: None

Release justification: This fixes an unintentional buglet within the sqlproxy
code that was introduced with the interceptors back then. Not having this in
means we're using double the memory for each connection within the sqlproxy.



77307: sql: add cluster setting to limit max size of serialized session r=otan,jaylim-crl a=rafiss

fixes #77302

The sql.session_transfer.max_session_size cluster setting can be used to
limit the max size of a session that is serialized using
crdb_internal.serialize_session.

No release note since this is not a public setting.

Release justification: high priority fix for new functionality.

Release note: None

77318: roachpb: extract keysbase to break some dependencies r=yuzefovich a=yuzefovich

This commit extracts a couple of things out of `roachpb` into new
`keysbase` package in order to break the dependency of `util/json` and
`sql/inverted` on `roachpb` (which is a part of the effort to clean up
the dependencies of `execgen`).

Addresses: #77234.

Release note: None

Release justification: low risk change to clean up the dependencies.

77319: sessiondatapb: move one enum definition into lex package r=yuzefovich a=yuzefovich

This commit moves the definition of `BytesEncodeFormat` enum from
`sessiondatapb` to `lex`. This is done in order to make `lex` not depend
on a lot of stuff (eventually on `roachpb`) and is a part of the effort
to clean up the dependencies of `execgen`. Note that the proto package
name is not changed, so this change is backwards-compatible.

Informs: #77234.

Release note: None

Release justification: low risk change to clean up the dependencies.

77328: roachtest: log stdout and stderr in sstable corruption test r=itsbilal a=nicktrav

To aid in debugging #77321, log the contents stdout and stderr if the
manifest dump command fails.

Release justification: Tests only.

Release note: None.

77335: kvserver: fix race that caused truncator to truncate non-alive replica r=tbg,erikgrinaker a=sumeerbhola

This was causing truncated state to be written to such a
replica, which would then get picked up as the
HardState.Commit value when a different replica was later
added back for the same range. See
#77030 (comment)
for the detailed explanation.

Also restore the default value of
kv.raft_log.loosely_coupled_truncation.enabled to true.

Fixes #77030

Release justification: Bug fix.
Release note: None

Co-authored-by: e-mbrown <ebsonari@gmail.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Rui Hu <rui@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Jay <jay@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
@craig craig bot closed this as completed in 3160bcf Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-build-system C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-dev-inf T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants