Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: add pgvector implementation #12

Open
wants to merge 965 commits into
base: master
Choose a base branch
from
Open

wip: add pgvector implementation #12

wants to merge 965 commits into from

Conversation

jordanlewis
Copy link
Owner

@jordanlewis jordanlewis commented Apr 14, 2024

This change is Reviewable

Copy link

coderabbitai bot commented Apr 14, 2024

Important

Auto Review Skipped

More than 25% of the files skipped due to max files limit. The review is being skipped to prevent a low-quality review.

233 files out of 298 files are above the max files limit of 50. Please upgrade to Pro plan to get higher limits.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

xinhaoz and others added 29 commits May 7, 2024 15:49
This commit simplifies the assignment of this field on the conn
executor's `extraTxnState`.

Epic: none

Release note: None
Add the recorded txn stats so we can do some testing validation
on the stats collected.

Epic: none

Release note: None
Previously, the `{pgurl}` expansion would force the `cluster`
connection parameter to `system`. This means that if the caller
previously configured a different default virtual cluster, `pgurl`
would ignore that and continue connecting to `system`, which is
quite surprising.

In this commit, we only set an explicit `cluster` connection parameter
if the user passed one; this should allow `pgurl` users to connect to
the default tenant by default.

Epic: none

Release note: None
Use a different server for each subtest, and close it if there was a
connection timeout. Also, make the connection timeout longer.

Release note: None
Both these options are not supported for roachprod azure yet,
resulting in cluster_creation errors.

Informs: cockroachdb#120621
Fixes: none
Epic: none
Release note: none
In cockroachdb#109655, we stopped using the transaction id of a previously executed
transaction for `BEGIN`s. This change meant that `BEGIN` will be logged
with no transaction id. This led to logs related to cluster setting
`sql.log.user_audit.reduced_config.enabled` spewing error messages about
a nil transaction (since the logic behind said cluster setting requires
us to call on `MemberOfWithAdminOption` -- which uses a transaction for
its lookup). This patch ensures that we create an internal transaction
before we perform the `MemberOfWithAdminOption` lookup to handle the
nil BEGIN statement case.

Epic: none
Fixes: cockroachdb#123592

Release note: None
123769: roachtest: disable azure tests that use zfs or specify ssd counts r=srosenberg,renatolabs a=DarrylWong

Both these options are not supported for roachprod azure yet, resulting in cluster_creation errors.

Informs: cockroachdb#120621
Fixes: none
Epic: none
Release note: none

Co-authored-by: DarrylWong <darryl@cockroachlabs.com>
* [x] Update `build/teamcity/internal/release/build-and-publish-patched-go/impl.sh` with the new version and adjust SHA256 sums as necessary.
* [x] Adjust `GO_VERSION` and `GO_FIPS_COMMIT` for the FIPS Go toolchain ([source](./teamcity/internal/release/build-and-publish-patched-go/impl-fips.sh)).
* [x] Run the `Internal / Cockroach / Build / Toolchains / Publish Patched Go for Mac` build configuration in TeamCity with your latest version of the script above. Note the job depends on another job `Build and Publish Patched Go`. That job prints out the SHA256 of all tarballs, which you will need to copy-paste into `WORKSPACE` (see below). `Publish Patched Go for Mac` is an extra step that publishes the *signed* `go` binaries for macOS. That job also prints out the SHA256 of the Mac tarballs in particular.
* [x] Adjust `--@io_bazel_rules_go//go/toolchain:sdk_version` in [.bazelrc](../.bazelrc).
* [x] Bump the version in `WORKSPACE` under `go_download_sdk`. You may need to bump [rules_go](https://github.com/bazelbuild/rules_go/releases). Also edit the filenames listed in `sdks` and update all the hashes to match what you built in the step above.
* [x] Bump the version in `WORKSPACE` under `go_download_sdk` for the FIPS version of Go (`go_sdk_fips`).
* [x] Run `./dev generate bazel` to refresh `distdir_files.bzl`, then `bazel fetch @distdir//:archives` to ensure you've updated all hashes to the correct value.
* [x] Bump the default installed version of Go in `bootstrap-debian.sh` ([source](./bootstrap/bootstrap-debian.sh)).

Epic: none
Release note: None
Release note: None
Epic: none
123753: logictest: preserve logs for cockroach-go/testserver in remote execution r=rafiss a=rafiss

Epic: None
Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
The pre-data bundle is read from during the restore but we've already
read everything we need from it by the time we write the download job,
and have marked it as dropped. We don't need to download it just to drop
it.

Release note: none.
Epic: none.
123730: jobs: exempt stats jobs from logs r=dt a=dt

Unlike most jobs that represent some meaningful user/system action or event, where the time the event started and ended are rare and often significant, stats are continiously updated by the system as an automatic background process.

Unfortunately that process creates a distinct job each time it chooses to update a table, resulting in this constant background process appearing as constant stream of separate jobs, rather than one continuous one. To reduce the amount this stream of jobs adds noise to the logs about significant jobs, this patch moves log messages about job state transitions behind a verbose check for these jobs.

Release note: none.
Epic: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
123791: Revert "changefeedccl: make mock pulsar sink synchronous" r=rharding6373 a=rharding6373

Reverts cockroachdb#123228

This is causing flaky pulsar tests (e.g., cockroachdb#123720)

Epic: None

Release note: none

Co-authored-by: Rachael Harding <rharding6373@users.noreply.github.com>
This separates the metamorphic functionality out into its own package
from `pkg/util`. This is useful for some future functionality that I
would like to implement:

* Simplify/make explicit metamorphic constant logging
* Remove the `metamorphic_disable` build tag (see cockroachdb#123319)

This is a re-implementation of cockroachdb#102474.

Epic: none
Release note: None
This commit fixes - what I believe is - an oversight where we applied
the huge cost to the plans with full scans of virtual tables when
`disallow_full_table_scans` variable is set. The intent behind this
penalty is to make plans with full scans prohibitively expensive so
that other plans (without full scans) would be preferred, and if
there is no other plan, then the query would error out. Implementation
of this approach was incomplete for virtual tables: namely, we'd apply
the cost penalty, but then we would not actually error out the query.
Additionally, given that we don't have stats on virtual tables, we
would apply the cost penalty to all full scans of virtual tables.

As a result, we would either pick a plan without the full scan (that
might actually be worse) or would allow the plan with the full scan to
get executed still. We would also favor plans with fewer full scans of
virtual tables (which might also be worse than had we not applied the
cost penalty). This commit adjusts the code to not apply the cost
penalty to full scans of virtual tables.

This change is gated based on the new session variable
`optimizer_apply_full_scan_penalty_to_virtual_tables` which defaults to
`false` on master. We can discuss on the backport PRs whether it should
be default to `true` there to require users to opt in for this PR to
have an effect.

Release note: None
123432: util/metamorphic: introduce `metamorphic` package r=srosenberg,RaduBerinde a=rickystewart

This separates the metamorphic functionality out into its own package from `pkg/util`. This is useful for some future functionality that I would like to implement:

* Simplify/make explicit metamorphic constant logging
* Remove the `metamorphic_disable` build tag (see cockroachdb#123319)

This is a re-implementation of cockroachdb#102474.

Epic: none
Release note: None

Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
123765: roachprod: don't overwrite tenant in `pgurl` expansion r=DarrylWong,srosenberg a=renatolabs

Previously, the `{pgurl}` expansion would force the `cluster` connection parameter to `system`. This means that if the caller previously configured a different default virtual cluster, `pgurl` would ignore that and continue connecting to `system`, which is quite surprising.

In this commit, we only set an explicit `cluster` connection parameter if the user passed one; this should allow `pgurl` users to connect to the default tenant by default.

Epic: none

Release note: None

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Release note: none.
Epic: none.
123424: operations: add a few more variants to the ADD COLUMN op r=rafiss a=rafiss

This adds some coverage for computed columns and NOT NULL constraints.

fixes cockroachdb#122251
Release note: None

123674: rttanalysis: deflake Jobs/jobs_page_type_filtered r=rafiss a=rafiss

The test sometimes detects a lower number of round trips, which is actually more desirable anyway.

fixes cockroachdb#123591
fixes cockroachdb#123691
Release note: None

123773: pgwire: deflake TestParseClientProvidedSessionParameters r=rafiss a=rafiss

Use a different server for each subtest, and close it if there was a connection timeout. Also, make the connection timeout longer.

fixes cockroachdb#123484
fixes cockroachdb#123098
Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
This benchmark runs quicker than BenchmarkDropLargeDatabase and can be
used to study how DROP DATABASE scales with larger numbers of objects.

Release note: None
Release note: none.
Epic: none.
122285: sql,multiregionccl: rename multiregion system DB cluster setting r=rafiss a=rafiss

Avoid using the word "preview" in the name. That shouldn't be needed, since the cluster setting is hidden, so users would only know about it if we tell them to enable it. Keeping "preview" in the name isn't a huge problem, but it's just slightly annoying if we decide that we want to keep the setting around even after the feature leaves the "preview" status, since it requires updating docs and any customer code that was using the old name.

Also, this removes the mention of the setting in error messages. An error should not refer to an undocumented setting, since a user would have no way of learning more about it.

informs: cockroachdb#63365
Epic: CRDB-33032
Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
The code was iterating over the list of tests, and setting
spot VM value to true. But, as the values are passed as copies,
the flag was not set.

The change is to set the value to the test list itself.
Also, removed the 0.5 probability and running all benchmark
tests on spot.

Fixes: cockroachdb#116778
Epic: none
123733: changefeedccl: skip flaky pulsar tests r=rharding6373 a=wenyihu6

This patch skips flaky tests for pulsar sinks. We will add them back when pulsar
is fully supported.

Release note: none
Fixes: cockroachdb#123718
Fixes: cockroachdb#123719
Fixes: cockroachdb#123720
Fixes: cockroachdb#123721

Co-authored-by: Wenyi Hu <wenyi@cockroachlabs.com>
If a mixed-version test fails before user-provided functions had a
chance to run, then it means that *something* went wrong while setting
up the cluster. In these cases, it doesn't make sense to create an
issue with the team that owns the test as it adds noise (e.g., cockroachdb#123610).

With this commit, we now redirect these failures to test-eng, who will
be better positioned to diagnose and fix any issues.

Epic: none

Release note: None
Using the default Set method is sufficient.

Release note: None
123810: sql_test: add BenchmarkDropLargeDatabaseWithGenerateTestObjects r=rafiss a=rafiss

This benchmark runs quicker than BenchmarkDropLargeDatabase and can be used to study how DROP DATABASE scales with larger numbers of objects.

fixes cockroachdb#100850
Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
craig[bot] and others added 29 commits May 15, 2024 19:42
123859: sql: expose an ability to request redacted stmt bundles r=yuzefovich a=yuzefovich

This commit extends the stmt bundle collection infrastructure to support requesting redacted stmt bundles. Previously, this was only possible via an explicit `EXPLAIN ANALYZE (DEBUG, REDACT)` stmt, but this commit adds the support via overloads to `crdb_internal.request_statement_bundle` builtin as well as the server API. This paves the way for requesting the redacted bundles via the DB Console, but it'll be done separately.

To support this we need to add another boolean column to `system.statement_diagnostics_requests` table to indicate whether a request is for the redacted bundle or not. This requires adding a migration in which we populate the existing rows with all `false` values.

In order to reduce the amount of code churn and simplify this change, this commit repurposes some of the existing code that we introduced the last time we added a migration for plan-gist matching in stmt bundles. In other words, this commit simultenously removes some stale code and handles the new column.

Epic: CRDB-37839

Release note: None 

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
This roachtest has failed consistently for the past few days, since
being enabled in cockroachdb#123694. Skip while we debug, to avoid noise.

Touches: cockroachdb#123989.

Release note: None.
124189: sql: simplify usage of some enum settings r=yuzefovich a=yuzefovich

This commit adjusts a few enum settings to use `EnumSetting.String` when defining the session variables to clean up the code a bit.

Epic: None

Release note: None

124237: execinfra: add a sanity check that DistSQL version is not bumped r=yuzefovich a=yuzefovich

The last bump was in 23.1, and we won't ever bump DistSQL version in backwards-incompatible going forward.

Epic: None

Release note: None

124241: Updated the AUTHORS file to include Naveen Setlur r={msbutler} a=navsetlur

As stated in the title. Basic PR. 

Epic: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Naveen Setlur <naveen.setlur@cockroachlabs.com>
124226: sqlstats: remove unnecessary StatementFingerprintID construction r=xinhaoz a=xinhaoz

Previously we were constructing the stmt fingerprint id unnecessarily
in 2 functions in sqlstats:
- The Next function for the StmtStatsIterator
- PopAllStats, which returns the sql stats collected thus far and
resets the in-memory sql stats containers.

The ID is available in the recorded statement in both of these cases-
there's no need to reconstruct this value and doing is needlessly
expensive.

Epic: none

Release note: None

Co-authored-by: Xin Hao Zhang <xzhang@cockroachlabs.com>
Release note (ops change): Added the sql.pgwire.pipeline.count metric,
which is a gauge that measures how many wire protocol commands have been
received by the server, but have not yet begun processing. This metric
will only grow if clients are using the "pipeline mode" of the Postgres
wire protocol.
124231: pgwire: add a metric for number of pipelined requests r=rafiss a=rafiss

fixes cockroachdb#123912
Release note (ops change): Added the sql.pgwire.pipeline.count metric, which is a gauge that measures how many wire protocol commands have been received by the server, but have not yet begun processing. This metric will only grow if clients are using the "pipeline mode" of the Postgres wire protocol.

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
124177: roachtest: skip multi-store-remove while debugging r=itsbilal,sumeerbhola a=nicktrav

This roachtest has failed consistently for the past few days, since being enabled in cockroachdb#123694. Skip while we debug, to avoid noise.

Touches: cockroachdb#123989.

Release note: None.

Epic: None.

Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Remove deprecated 23.2 gates - these features will always be enabled
in any cluster that a 24.2 node is part of.

`TODODelete_V23_1` has a lot of uses and will be removed separately.

Epic: none
Release note: None
124013: clusterversion: remove deprecated internal versions r=RaduBerinde a=RaduBerinde

Remove deprecated 23.2 gates - these features will always be enabled
in any cluster that a 24.2 node is part of.

`TODODelete_V23_1` has a lot of uses and will be removed separately.

Epic: none
Release note: None

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
We were incorrectly using the boolean getter for this integer variable.
The bug was introduced when this variable was added in
1c1d313.

Release note: None
This commit fixes an omission where `cost_scans_with_default_col_size`
session variable didn't actually use the corresponding cluster setting
for its default value. In other words, the cluster setting actually had
no effect. The bug has been present since this variable was introduced
in af32d9d.

Release note: None
This commit adjusts global defaults for a few session variables to use
the postgres style (so that the global default value and session
variable value would be stored alike).

Release note: None
This commit refactors how we collect session variables and cluster
settings into `env.sql` file in the stmt bundle. Previously, we used
a hard-coded list of session variables that were always included while
ignoring all variables not in the list. If a variable had a value
different from the "binary default" (meaning the value that you got
right after the cluster startup, before any cluster setting
modifications are applied), then it would be in a SET statement that
would run on the `bundle recreate`; if the value was the same as
"binary default", then the SET statement was commented out. Also, all
cluster setting values were included into the file as a comment.

This commit makes it so that we include all session variables and
cluster settings that differ from their defaults. This allows us to
recreate the environment while also reducing the overhead of including
everything into `env.sql`.

For session variables further clarification is warranted:
- all read-only variables have been audited, and most are explicitly
excluded. Out of 23 currently present session variables, only 9 were
deemed to be possibly useful during investigations, so they are not
excluded (one of 9 is `crdb_version` which we print out separately, so
it's actually excluded too).
- for writable variables, we include it if its value differs from the
"binary" default (the one that you get on a fresh cluster) or from the
"cluster" default (the one that you get on a fresh session on the
current cluster). The latter kind is obtained via the provided
`settings.Values` container whereas the former is obtained via a global
singleton container.

Additionally, this commit adjusts the SET and SET CLUSTER SETTING
statements to have single quotes around the setting values that need them.
For cluster settings all types work with having single quotes, but for
session variables some (like integers) don't work with quotes while
others (like strings) need them. This commit adds a map for those that
need them as well as a simple test to catch some of the missing ones
(the list might be incomplete given that the test exercises the default
config). All values are adjusted to fit on a single line (we have some
cluster settings that might span multiple lines).

It also adjusts `EXPLAIN (OPT, ENV)` to include the list of cluster
settings with non-default values.

Release note: None
123794: sql: improve env collection in stmt bundles r=yuzefovich a=yuzefovich

This PR improves stmt bundle collection to include all session variables and cluster settings that differ from their defaults.

Fixes: cockroachdb#119786
Fixes: cockroachdb#123830.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Due to the complexity of fetching the secrets from the secrets
manager, the secrets are now maintained in cloud storage.

Fixes: cockroachdb#117125
Epic: none
124099: roachprod: fetch secrets from cloud store r=srosenberg a=nameisbhaskar

Due to the complexity of fetching the secrets from the secrets manager, the secrets are now maintained in cloud storage.

Fixes: cockroachdb#117125
Epic: none

Co-authored-by: Bhaskarjyoti Bora <bhaskar.bora@cockroachlabs.com>
This change adds operations to mutate cluster settings. The values are mutated
based on a preset frequency and RNG. For example, if the frequency of a cluster
setting operation is set to "30 minutes" it will only change to a new value
every 30 minutes, even if the operation has been run multiple times within the
same 30-minute window. Running in the same window will just set the same value
again.

Epic: None
Release Note: None
Epic: None
Release Notes: None
Previously, to upgrade a cluster it had to be done manually by running several
commands to copy over new binaries, drain and stop cockroach and then running
the same start command.

This change introduces a `deploy` command that works similar as `stage` but also
deploys the new version to the cluster by doing all of the above. It does it
node for node to ensure a cluster remains in a healthy state especially if a
load balancer is used to manage connections.

Epic: None
Release Note: None
Epic: None
Release Notes: None
123461: roachprod: deploy cockroach r=renatolabs,srosenberg a=herkolategan

Previously, to upgrade a cluster it had to be done manually by running several commands to copy over new binaries, drain and stop cockroach and then running the same start command.

This change introduces a `deploy` command that works similar as `stage` but also deploys the new version to the cluster by doing all of the above. It does it node for node to ensure a cluster remains in a healthy state especially if a load balancer is used to manage connections.

Epic: None
Release Note: None

Co-authored-by: Herko Lategan <herko@cockroachlabs.com>
123806: roachtest: add cluster settings operations r=itsbilal a=herkolategan

This change adds operations to mutate cluster settings. The values are mutated
based on a preset frequency and RNG. For example, if the frequency of a cluster
setting operation is set to "30 minutes" it will only change to a new value
every 30 minutes, even if the operation has been run multiple times within the
same 30-minute window. Running in the same window will just set the same value
again.

Epic: None
Release Note: None

Co-authored-by: Herko Lategan <herko@cockroachlabs.com>
Release note (sql change): implement pgvector encoding, decoding, and
operators, without index acceleration.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.