importccl: Allow users to IMPORT into an existing table #26834

Closed · 15 tasks
nstewart opened this issue Jun 19, 2018 · 0 comments
Assignees: dt
Labels: A-disaster-recovery, C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)
Milestone: 19.2

Comments

@nstewart (Contributor) commented Jun 19, 2018

Currently IMPORT can only create a new table into which it imports data.

However, many users would like to IMPORT into an existing table. In some cases this is because they want to configure a table's replication settings before importing the data. In others, they want to do the import in stages, for example because they simply cannot fit all the data to import in a single source location.

Other users want to IMPORT new data at regular intervals (e.g. nightly dumps from some other system), though such cases raise additional questions around whether and how the import can modify existing data.

The two main differences between importing into a new table and into an existing table are that an existing table can have a) traffic interacting with it and b) existing data.

One of the biggest reasons IMPORT is so much faster than bulk INSERTs is that it skips the majority of the transactional writing infrastructure; however, this makes point a) tricky: we cannot ensure transactional safety of normal SQL traffic -- for which transactional safety is implied -- while bulk ingestion is happening to the same data. The easiest way to reduce importing into an existing table to the current import w.r.t. a) is to simply remove the traffic: schema-change the table to an offline state, then import, then bring it back online.
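
To make the offline-then-import sequence concrete, here is a minimal Go sketch of the lifecycle. The `tableState` type and the `importInto`/`bulkIngest` names are hypothetical illustrations of the sequencing only, not CockroachDB's actual descriptor or ingestion APIs.

```go
package main

import "fmt"

// tableState is a hypothetical stand-in for a table descriptor's state.
type tableState int

const (
	statePublic  tableState = iota // normal SQL traffic allowed
	stateOffline                   // table invisible to SQL while bulk ingestion runs
)

type table struct {
	name  string
	state tableState
}

// importInto sketches the take-offline, ingest, bring-online sequence.
// bulkIngest stands in for the SST-based ingestion path.
func importInto(t *table, bulkIngest func(*table) error) error {
	t.state = stateOffline // schema change: no SQL traffic can touch the table now
	if err := bulkIngest(t); err != nil {
		// On failure, the added data must be rolled back before going public again.
		t.state = statePublic
		return err
	}
	t.state = statePublic // bring the table back online with the imported data
	return nil
}

func main() {
	t := &table{name: "orders", state: statePublic}
	err := importInto(t, func(tb *table) error {
		fmt.Println("ingesting SSTs into", tb.name)
		return nil
	})
	fmt.Println("import finished, err =", err)
}
```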

Another reason bulk ingestion is faster is that it prepares entire SSTs that go directly into RocksDB, replacing any values they happen to contain. In an empty, just-created table, that replacement is not a concern, but for a table that contains data, it is. Collision detection is possible, but not cheap -- it'd require scanning the existing and imported data, either prior to or after ingestion, as well as some sort of resolution. For the primary key, one option is to just let the IMPORT replace the existing value, but detecting and handling conflicts in unique secondary indexes is a problem the implementation will need to address.
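
As a rough model of the shadowing concern, the Go sketch below merges a sorted list of existing keys with a sorted batch of keys to ingest and reports collisions. The real check happens inside the storage layer's AddSSTable path, so this function is only a conceptual illustration, not the actual implementation.

```go
package main

import "fmt"

// shadowedKeys reports which keys in the (sorted) ingest batch would
// overwrite keys already present in the (sorted) existing data. This is
// a conceptual model of the "disallow shadowing" check.
func shadowedKeys(existing, ingest []string) []string {
	var collisions []string
	i, j := 0, 0
	for i < len(existing) && j < len(ingest) {
		switch {
		case existing[i] < ingest[j]:
			i++
		case existing[i] > ingest[j]:
			j++
		default: // same key on both sides: the ingested SST would shadow it
			collisions = append(collisions, ingest[j])
			i++
			j++
		}
	}
	return collisions
}

func main() {
	existing := []string{"/t/1/1", "/t/1/3", "/t/1/7"}
	ingest := []string{"/t/1/2", "/t/1/3", "/t/1/9"}
	fmt.Println("shadowed:", shadowedKeys(existing, ingest)) // [/t/1/3]
}
```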

An initial prototype of this take-table-off-line-then-import approach was added in #37451.

Still TODO for the initial version to be considered complete:

  • Audit / Test that the online/offline transition is respected everywhere
    • SQL leasing (direct, via FKs, etc)
    • SQL direct resolution (e.g. DDLs)
    • Background queues
    • AS OF SYSTEM TIME before, at, and after each of the transition points.
  • Reading/checking/plumbing/using specified target columns à la INSERT (see the sketch after this list)
  • Testing IMPORT INTO itself
  • What do we do about uniqueness and indexes?
  • What do we do about Check/FK constraints?
    • CHECK: there is no partial replacement of rows, so we just need to be sure we evaluate the constraint on the imported row.
    • FK: Obviously we're not checking FKs. Move to a validating state and re-validate after? What do we do on failure?
      When taking the table offline, move the constraint to unvalidated if it was validated, and record the names of the constraints moved in the job.
      After reverting on failure, it can go straight back if it was validated before; otherwise we'll kick off validation.
  • What do we do on failure? (importccl: rollback IMPORT INTO on failure #38518)
  • Audit Txn use (jobs, *: refactor planning vs exec APIs #37691)
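
For the target-column item above, here is a minimal Go sketch of the INSERT-like semantics: each input field fills the corresponding named target column, and every other column of the existing table is left NULL. The `buildRow` helper and the example column names are hypothetical; the real implementation also has to honor column defaults and types.

```go
package main

import "fmt"

// buildRow places input values into the named target columns of an existing
// table, in INSERT fashion, leaving all other columns NULL.
func buildRow(tableCols, targetCols, record []string) (map[string]string, error) {
	if len(targetCols) != len(record) {
		return nil, fmt.Errorf("expected %d fields, got %d", len(targetCols), len(record))
	}
	row := make(map[string]string, len(tableCols))
	for _, c := range tableCols {
		row[c] = "NULL"
	}
	for i, c := range targetCols {
		if _, ok := row[c]; !ok {
			return nil, fmt.Errorf("unknown target column %q", c)
		}
		row[c] = record[i]
	}
	return row, nil
}

func main() {
	tableCols := []string{"id", "name", "city", "created_at"}
	row, err := buildRow(tableCols, []string{"id", "name"}, []string{"42", "carl"})
	fmt.Println(row, err) // map[city:NULL created_at:NULL id:42 name:carl] <nil>
}
```
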
@knz knz added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 21, 2018
@knz knz changed the title IMPORT: Allow users to IMPORT into an existing table importccl: Allow users to IMPORT into an existing table Jul 23, 2018
@dt dt self-assigned this May 21, 2019
@dt dt added this to the 19.2 milestone May 21, 2019
craig bot pushed a commit that referenced this issue Jul 23, 2019
38449: libroach: Adds logic to run AddSSTable with shadowing disabled. r=adityamaru27 a=adityamaru27

This change enhances AddSSTable to be run with the option of
disallowing ingestion of keys which would shadow already existing
data.
Used to safeguard IMPORT INTO which would otherwise
break secondary indexes. (#38044)
(Part of the larger roadmap #26834)

TODO: Add IMPORT INTO test(s) which trigger the safeguard.

Co-authored-by: Aditya Maru <adityamaru@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 29, 2019
39023: pkg: Add support for user specified target cols in IMPORT INTO r=adityamaru27 a=adityamaru27

This change adds the ability for users to specify a subset of the columns of an existing table to import into, from a CSV data source. 
(Part of the larger roadmap #26834)

39043: distsqlrun: lookup join efficiency improvements r=jordanlewis a=jordanlewis

- distsqlrun: don't save lookup rows twice
- distsqlrun: stream lookup join output
- distsql: remove unused field from lookup join spec

Previously, lookup join buffered its rendered output rows before
emitting any of them, because this used to be a requirement when lookup
join was also responsible for doing a second layer of lookups against an
index. This is no longer necessary.

Now, after the result of a lookup batch is accumulated into memory, rows
are rendered and emitted in a row-at-a-time streaming fashion.

Also, remove the unused extra index filter expression field from the lookup join spec.

39118: opt: add IndexOrdinal alias r=RaduBerinde a=RaduBerinde

Adding a `cat.IndexOrdinal` alias to make it more explicit when a
value is an index ordinal. It is only an alias because a separate type
would make things like loops more annoying.

Release note: None

39146: sql/tree: store From as value in SelectClause AST node r=nvanbenschoten a=nvanbenschoten

Drive by cleanup. The value is assumed to be non-nil in a number of
places, so we should avoid the unnecessary indirection.

Release note: None

Co-authored-by: Aditya Maru <adityamaru@cockroachlabs.com>
Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@dt dt closed this as completed Aug 14, 2019