
Proposal: Ingest External SST #187

Open
wangnengjie opened this issue Aug 16, 2022 · 8 comments

@wangnengjie
Contributor

This issue describes a possible way to ingest external SST files into a running agatedb instance, similar to what rocksdb offers.

Background

TiKV uses the ingest-external-SST feature for some use cases. To integrate agatedb into TiKV, we need this feature.

According to Creating and Ingesting SST files · facebook/rocksdb Wiki (github.com), ingesting an SST takes the following steps:

  1. copy or link the file(s) into the DB directory
  2. block (not skip) writes to the DB and assign a correct sequence number to the ingested file
  3. flush the memtable if ranges overlap (the SST file's key range overlaps with the memtable)
  4. assign the file to the LSM-tree
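
For reference, this is roughly how the rocksdb workflow looks from Rust via the rust-rocksdb crate (paths and keys below are made up for illustration; agatedb would need its own equivalent of this API):

```rust
use rocksdb::{DB, Options, SstFileWriter};

fn main() -> Result<(), rocksdb::Error> {
    // Build an external SST; keys must be inserted in sorted order.
    let opts = Options::default();
    let mut writer = SstFileWriter::create(&opts);
    writer.open("/tmp/external.sst")?;
    writer.put(b"key1", b"value1")?;
    writer.put(b"key2", b"value2")?;
    writer.finish()?;

    // Ingest it: rocksdb copies/links the file, briefly blocks writes to
    // assign a sequence number, flushes the memtable on overlap, and then
    // assigns the file to a level (steps 1-4 above).
    let mut db_opts = Options::default();
    db_opts.create_if_missing(true);
    let db = DB::open(&db_opts, "/tmp/ingest-demo")?;
    db.ingest_external_file(vec!["/tmp/external.sst"])?;
    Ok(())
}
```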

Things are different in agatedb.

Analysis

Step 1 and step 4 are the same.

For step 3, agatedb does not need to flush the memtable on overlap. Agatedb reads all levels on get and picks the latest version (less than or equal to read_ts), so it's okay to have new keys below old keys in the LSM-tree. This makes the whole process much more efficient. (Please point out if I'm wrong.)

Step 2 is the most difficult: how do we preserve ACID during and after ingestion?

Agatedb has ACID transactions (SSI). The architecture is totally different from rocksdb: we have a read_ts and a commit_ts, and we do a conflict check at commit time.

Possible Implementation

Main idea: regard ingesting SST files as writing a large range of keys, and make the whole process a transaction.

Ingesting files is expected to be an expensive job.

Time Point: before commit

  1. Add files: fetch new file ids from LevelsController, then move or copy the files into the DB dir under the allocated file ids (names).
  2. Verify the file checksums (optional) and get file information (just open a Table?).
  3. Make a new transaction and mark update as true (add an ingest flag to the transaction?).
  4. (TBD) other checks…

Check files before starting a txn.
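
A rough sketch of what this before-commit phase could look like; every name here (IngestJob, reserve_file_id, table_path, and so on) is hypothetical and only illustrates the shape of the flow, not an existing agatedb API:

```rust
use std::path::PathBuf;

// Hypothetical container for the files being ingested (steps 1-2 above).
struct IngestJob {
    tables: Vec<Table>, // external files opened as agatedb Tables
}

fn prepare_ingest(db: &Agate, external_files: &[PathBuf]) -> Result<IngestJob> {
    let mut tables = Vec::with_capacity(external_files.len());
    for path in external_files {
        // 1. Allocate a new file id and copy/move the file into the DB dir.
        let file_id = db.levels.reserve_file_id(); // hypothetical helper
        let dst = db.opts.table_path(file_id);     // hypothetical helper
        std::fs::copy(path, &dst)?;

        // 2. Opening the Table verifies the (optional) checksum and exposes
        //    the smallest/largest keys needed for later range checks.
        tables.push(Table::open(&dst, db.opts.table_options())?);
    }
    // 3. The caller then starts a transaction with the ingest flag set.
    Ok(IngestJob { tables })
}
```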

Time Point: commit

  1. Same as now, but add an ingest range field to CommittedTxn in CommitInfo containing the smallest and largest keys of each ingested file.
    This allows a fast but inaccurate conflict check; otherwise we would need to calculate the hash of every key in the SST, which really takes time.
  2. Set commit_ts as the Table's global_version.
  3. Send to the write channel as usual (see below).
  4. Process the ingest: find a suitable level for each file. (This takes the write lock of LevelHandler and will block reads from other txns.)
  5. Write the manifest file. Each file's global_version is stored in the manifest.

A question here: I see we keep the order of Requests sent to the write channel the same as the commit_ts order, and I wonder why. Is it to keep the WAL in order?

If the ingest job can be out of order, I think it's possible and better to run the ingest job in the current thread instead of sending it to the write channel.

You can see that the whole commit process has only one small I/O: appending to the manifest. But picking a level takes time if there are many tables in that level, and this happens under a write lock of LevelHandler.
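
Putting the commit-time steps together, the flow might look roughly like this; it is a sketch of the proposal, not working agatedb code, and every method name is hypothetical:

```rust
fn commit_ingest(db: &Agate, job: IngestJob) -> Result<()> {
    // 1. Get a commit_ts; the CommittedTxn recorded here also carries the
    //    [smallest, largest] range of every ingested file for conflict checks.
    let commit_ts = db.orc.new_commit_ts_for_ingest(&job)?;

    // 2. Every key in the ingested tables will be read at version commit_ts.
    for table in &job.tables {
        table.set_global_version(commit_ts);
    }

    // 3/4. Pick a level for each table and register it, which briefly holds
    //      the LevelHandler write lock and blocks readers of that level.
    for table in &job.tables {
        let level = db.levels.pick_level_for_ingest(table);
        db.levels.add_table_to_level(table.clone(), level)?;
    }

    // 5. Persist file id, level and global_version in the manifest; this is
    //    the only small I/O in the commit path.
    db.manifest.add_ingested_tables(&job.tables, commit_ts)
}
```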

Conflict Check

There are two ways to do the conflict check, as described in TP: commit (step 1).

  1. Calculate every key's hash in the ingested files; then the conflict check works the same as it does now. To mitigate the performance problem, we can calculate the key hashes before the transaction starts.
  2. Add an ingest range field to CommittedTxn in CommitInfo (sketched below).
    For any update txn, when adding read keys, track the smallest and largest read key of the txn. When checking conflicts, besides checking hashes, also check whether the read range overlaps with the ingest range.
    This is inaccurate and will cause many false txn conflicts, but I think it makes sense.

The second way requires much more refactoring of the transaction code.
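
A sketch of what the second way could look like; ingest_ranges and read_range are hypothetical additions, and only the shape of the check matters here:

```rust
use std::collections::HashSet;
use bytes::Bytes;

struct CommittedTxn {
    conflict_keys: HashSet<u64>,        // existing: hashes of written keys
    ingest_ranges: Vec<(Bytes, Bytes)>, // new: [smallest, largest] per ingested file
}

struct Transaction {
    read_key_hashes: Vec<u64>,          // hashes of keys read by this txn
    read_range: Option<(Bytes, Bytes)>, // new: [min, max] key read by this txn
}

fn has_conflict(committed: &CommittedTxn, txn: &Transaction) -> bool {
    // Existing check: did this txn read a key the committed txn wrote?
    if txn.read_key_hashes.iter().any(|h| committed.conflict_keys.contains(h)) {
        return true;
    }
    // New check: does the txn's read range overlap any ingested file range?
    // This is coarse, so it can report conflicts that are not real ones.
    if let Some((read_min, read_max)) = &txn.read_range {
        for (smallest, largest) in &committed.ingest_ranges {
            if read_min <= largest && smallest <= read_max {
                return true;
            }
        }
    }
    false
}
```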

(A simpler way would be to just break ACID… haha.)

Global Version

This concept is the same as in rocksdb. For ingested files, all keys have the wrong version, and it's unreasonable to rewrite each one. When a file has a global version, every key in that file takes this version.

In the latest rocksdb, this value is stored in the manifest to avoid random writes to the ingested files.

We need to add a field to meta.proto and update the Table iterator (and the block iterator) to support this feature.

The name global_version may be ambiguous; maybe pick another name during implementation.
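
For illustration, the read-path change might look roughly like this; key_with_ts, user_key, and the iterator internals are assumptions about agatedb's key encoding, not its actual code:

```rust
// Sketch only: helper names and fields are assumed, not agatedb's real API.
impl TableIterator {
    // Return the key seen by upper layers.
    fn key(&self) -> Bytes {
        let stored = self.current_stored_key();
        match self.table.global_version {
            // Ingested table: every key is reported at the file-level version
            // that was read from the manifest / meta.proto.
            Some(version) => key_with_ts(user_key(&stored), version),
            // Normal table: the key already carries its own version suffix.
            None => stored,
        }
    }
}
```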

Discussion

  1. Should we add a new transaction struct just for ingestion, or refactor the current one?
  2. Which way should we check conflicts? (Or is there a better idea?)
  3. Send to the write channel, or do it in the current thread?
  4. Any good ideas on how to implement this feature?
  5. Strategy to pick the level: I think checking for overlap from the bottom level upward is nice.
  6. What if external files (in one job) have overlapping ranges? Rocksdb puts all such files at level 0.
@wangnengjie wangnengjie changed the title Proposal: Ingest External SST [WIP] Proposal: Ingest External SST Aug 16, 2022
@GanZiheng
Contributor

A question here: I see we keep the order of Requests sent to the write channel the same as the commit_ts order, and I wonder why. Is it to keep the WAL in order?

That's because we need each transaction's entries to be stored contiguously in the WAL to achieve atomicity. Badger checks the consistency of the WAL and truncates it if necessary.

I think it may not be necessary to use a transaction or involve the write channel. We could use the oracle to perform the conflict check and get a commit timestamp, and after getting the commit timestamp successfully, we just update the file and place it on level 0?

@wangnengjie
Contributor Author

wangnengjie commented Aug 16, 2022

We could use the oracle to perform the conflict check and get a commit timestamp

The oracle API takes a txn struct, and I just don't want to break the API. Also, the oracle's conflict check needs the hashes of the transaction's write keys, which is the conflict_keys field in Transaction.
So recording the write keys of the ingested files is unavoidable.

```rust
pub(crate) fn new_commit_ts(&self, txn: &mut Transaction) -> (u64, bool) {

fn has_conflict(&self, txn: &Transaction) -> bool {
```

after getting the commit timestamp successfully, we just update the file and place it on level 0

Due to the particularities of level 0, putting SSTs only at level 0 may cause significant read amplification and write stalls. Putting files at a lower level can reduce the read/write amplification caused by compaction, but will cause space amplification.

I had a look at rocksdb's strategy.

We pick the lowest level in the LSM-Tree that satisfies these conditions

  • The file can fit in the level
  • The file's key range doesn't overlap with any keys in upper layers
  • The file doesn't overlap with the outputs of running compactions going to this level

That makes sense, and I forgot the third point.

For the second point, it's because rocksdb must keep newer keys at higher levels, which is unnecessary in agatedb.
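
So an agatedb-flavored version of the pick could drop the second condition and scan from the bottom level up, roughly like this (all names are hypothetical):

```rust
// Sketch only: LevelsController / CompactStatus helpers are assumed names.
fn pick_ingest_level(levels: &LevelsController, table: &Table) -> usize {
    let max_level = levels.num_levels() - 1;
    for level in (1..=max_level).rev() {
        // Condition 1: the file fits within the level's size target.
        let fits = levels.level_size(level) + table.size() <= levels.max_size(level);
        // Condition 3: the file must not overlap the output key range of any
        // compaction currently writing into this level.
        let overlaps_compaction = levels
            .compact_status
            .overlaps_with(level, table.smallest(), table.biggest());
        if fits && !overlaps_compaction {
            return level;
        }
    }
    // Fall back to level 0 if no deeper level works.
    0
}
```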

I'll try ingesting a file in rocksdb and see the result.

@GanZiheng
Contributor

The oracle API takes a txn struct, and I just don't want to break the API. Also, the oracle's conflict check needs the hashes of the transaction's write keys, which is the conflict_keys field in Transaction.
So recording the write keys of the ingested files is unavoidable.

We certainly should record write keys for the conflict check. I mean adding a method to the oracle to get a commit timestamp for the file-ingestion scenario.
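
Something like the following, perhaps: a separate Oracle method that reuses the conflict-check bookkeeping but takes the ingest job instead of a full Transaction (all fields and helpers here are hypothetical, mirroring the shape of new_commit_ts above):

```rust
impl Oracle {
    // Sketch only: returns (commit_ts, has_conflict) like new_commit_ts.
    pub(crate) fn new_commit_ts_for_ingest(&self, job: &IngestJob) -> (u64, bool) {
        let mut commit_info = self.commit_info.lock().unwrap();
        // Coarse check: did any txn committed after the job's read_ts write
        // keys inside the ranges being ingested?
        if commit_info.has_range_conflict(job.read_ts, &job.key_ranges) {
            return (0, true);
        }
        let commit_ts = commit_info.allocate_commit_ts();
        // Remember the ingested ranges so later update txns can be checked
        // against them (the range-based check described in the proposal).
        commit_info.add_ingest_ranges(commit_ts, &job.key_ranges);
        (commit_ts, false)
    }
}
```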

Due to the particularities of level 0, putting SSTs only at level 0 may cause significant read amplification and write stalls. Putting files at a lower level can reduce the read/write amplification caused by compaction, but will cause space amplification.

Yes, it will cause write stalls if many files are ingested at the same time, but in my opinion it will not cause read amplification, since we always read every SSTable in the LSM tree.

Maybe we could refer to how compaction targets the base level, and consider putting the file at the base level first?

@wangnengjie
Contributor Author

We certainly should record write keys for the conflict check. I mean adding a method to the oracle to get a commit timestamp for the file-ingestion scenario.

Okay, I got it.

Yes, it will cause write stalls if many files are ingested at the same time, but in my opinion it will not cause read amplification, since we always read every SSTable in the LSM tree.

Maybe we could refer to how compaction targets the base level, and consider putting the file at the base level first?

Oh yes, I was wrong about level 0's read amplification, but the higher we put ingested files, the more read/write amplification from compaction we get.

What does base level mean? The target level that L0 compacts into, or something else?

@GanZiheng
Contributor

What does base level mean? The target level that L0 compacts into, or something else?

Yes, the target level into which level 0 will merge its data.

@wangnengjie
Contributor Author

About how to check whether files to be ingested overlap with the output of a running compaction:

I saw we have CompactStatus to track running compactions. When picking and checking L_i, I should check that the file does not overlap with levels[i].ranges. But one situation may happen.

Suppose a running compaction picks the L_{i-1} table with range [10..50], and we have the tables below at L_i:

Level i-1: [10..50] <- picked to compact into Level i
Level   i: [1..5] [20..30] [40..60]

The next_range we record for tracking is [20..60], but the actual compaction output is [10..60]. If the file to ingest has range [6..15], it doesn't overlap with [20..60] but does overlap with [10..60].

So I can't use CompactStatus to check overlap based on the current implementation.

Any good ideas? I think we can extend next_range with this_range, but I'm not sure whether it's feasible. @GanZiheng
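
The change sketched below would record the union of this_range and next_range when a compaction is registered, so that the ingest check sees the full output range ([10..60] in the example above); KeyRange::extend and the exact CompactStatus fields are assumptions about agatedb's internals:

```rust
// Sketch only: field and method names are assumed.
fn register_compaction(status: &mut CompactStatus, cd: &CompactDef) {
    // Union of the input range from L(i-1) and the overlapping range at L(i):
    // [10..50] extended with [20..60] gives [10..60] in the example above.
    let mut output_range = cd.next_range.clone();
    output_range.extend(&cd.this_range);
    // Track the extended range on the target level, so an ingest into that
    // level can check overlap against the real compaction output.
    status.levels[cd.next_level_id].ranges.push(output_range);
}
```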

@GanZiheng
Contributor

I think we can extend next_range with this_range

I think that's ok.

By the way, I think this problem only happens when ingesting files, and will not appear in the current implementation.

@wangnengjie
Contributor Author

By the way, I think this problem only happens when ingesting files, and will not appear in the current implementation.

Yes, it only happens when ingesting files. I didn't make myself clear.

@wangnengjie wangnengjie changed the title [WIP] Proposal: Ingest External SST Proposal: Ingest External SST Aug 25, 2022