Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export Import sst files #5135

Closed
wants to merge 3 commits into from
Closed

Export Import sst files #5135

wants to merge 3 commits into from

Conversation

vpallipadi
Copy link

Refresh of the earlier change here - #3822
With the same basic idea. But, import API changed to import during a CF creation instead of arbitrary time (based on earlier feedback). Comments please @siying @ajkr

This is a review request for code change needed for - #3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"

We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.

(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Gets the live set of SST files for a particular Column Family in
// export_dir and metadata about the files in metadata.
// The SST files will be created through hard links when the directory is in
// the same partition as the db, copied otherwise.
// The directory should not already exist and will be created by this API.
// The directory will be an absolute path. Triggers a flush.
virtual Status ExportColumnFamilyFiles(
ColumnFamilyHandle* column_family,
std::vector* metadata,
const std::string& export_dir);

Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.

(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will import external SST files into a newly created CF as specified.
//
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
// an existing DB.
virtual Status CreateColumnFamilyWithImport(
const ColumnFamilyOptions& /options/,
const std::string& /column_family_name/,
const ImportColumnFamilyOptions& /import_options/,
const std::vector& /metadata/,
ColumnFamilyHandle** /handle/);

Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.

If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.

Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.

Venki Pallipadi added 3 commits April 2, 2019 00:08
Introduce a new API ExportColumnFamily to export all the live sst
files of a particular column family and metadata information to be used
to import these sst files.

This is part 1 of the change for
#3469
" Add support for taking snapshot of a CF and creating a CF from a given
  CF snapshot"

Partially Solves #3469
This change adds a new API to create a column family and importing sst
files into the column family. The files imported can come from
ExportColumnFamily API or can be created with SST file writer.

This is part 2 of the change for
#3469
" Add support for taking snapshot of a CF and creating a CF from a given
  CF snapshot"

Partially Solves #3469
Add new unit tests for ExportColumnFamily() and
CreateColumnFamilyWithImport() APIs.

This is part 3 of the change for
#3469
" Add support for taking snapshot of a CF and creating a CF from a given
  CF snapshot"

Partially Solves #3469
Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, such a great job! It's so impressive. I gave a quick look and the basic workflow looks good to me and the quality of the code is great. I'll take a closer a look soon. Thank you for your contribution!

@siying
Copy link
Contributor

siying commented Jun 4, 2019

We only accept PR to master branch. Can you rebase against the master and sent it to there? If you have concern about doing rebasing multiple times, you can do it after I gave it a careful pass.

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks very good to me. I look forward to seeing this one rebased against master and merge it there!

// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
// an existing DB.
// External files are deleted on a successful return.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to mention that some failure in the middle may end up with the column family created but is empty. But it will never end up with partial data loaded.

Also need to mention that the SST files are tried to be hardlinked (right? document whatever it is). Also mention that a failure can end up with part of files being renamed, and what data will be GC by DB reopening and the caller needs to be responsible to what files.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea was to cleanup column family in case of any errors. Earlier change was only doing DropCF on failure. Added DestroyCF as well now and that should ensure that there should not be any CF leftover. Added test to cover this as well.

Documented the external set file handling on success/failure cases.

// - export_dir should be specified with its absolute path.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
std::vector<LiveFileMetaData>* metadata,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe Google C++ Style which we are following suggests the output argument to the end: https://google.github.io/styleguide/cppguide.html#Output_Parameters

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

const auto final_nonslash_idx = export_dir.find_last_not_of('/');
if (export_dir[0] != '/' || final_nonslash_idx == std::string::npos) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember whether we do it in other places or not. Do we always expect the leading char to be "/"? If not, I suggest we not take assumption here either.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

live_file_metadata.smallestkey = std::move(file_metadata.smallestkey);
live_file_metadata.largestkey = std::move(file_metadata.largestkey);
live_file_metadata.level = level_metadata.level;
metadata->emplace_back(live_file_metadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes a difference in performance in this case, but emplace_back() with the element type feels a little bit odd to me. If you don't mind, I suggest we change it to push_back().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

s = db_->GetEnv()->CreateDir(tmp_export_dir);

if (s.ok()) {
s = db_->Flush(rocksdb::FlushOptions(), handle);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set FlushOptions.allow_write_stall = false in this case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allow_write_stall = false is the default.

ASSERT_OK(checkpoint->ExportColumnFamily(db_->DefaultColumnFamily(),
&metadata, export_path));
verify_files_exported(2);
test::DestroyDir(env_, export_path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we don't always follow this but a better place to do the cleaning up is in the destructor of the test class. The code here will not be executed if we hit an assert failure above. If it's not too hard for you to move to the destructor, it's worth doing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we cleaned this up and included #include <cinttypes.h> in master now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

const ColumnFamilyOptions& options, const std::string& column_family_name,
const ImportColumnFamilyOptions& import_options,
const std::vector<LiveFileMetaData>& metadata,
ColumnFamilyHandle** handle) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal but I'm thinking of whether there is a way to further prevent users from shooting their own shoe. The biggest risk of misuse of the API is that they set up a different comparator from the source data. If there a way to pass out the comparator or name of the comparator while exporting the data and pass it in here and do some sanity check?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Added comparator name in export and a check for that on import.

const ColumnFamilyOptions& options, const std::string& column_family_name,
const ImportColumnFamilyOptions& import_options,
const std::vector<LiveFileMetaData>& metadata,
ColumnFamilyHandle** handle) = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have to be included here but is there a way we support import the data to default column family, assuming it is empty? This feels an interesting idea too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. Will update and push against the master in next few days.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This can be useful. Will handle this separately.

facebook-github-bot pushed a commit that referenced this pull request Jul 17, 2019
Summary:
Refresh of the earlier change here - #5135

This is a review request for code change needed for - #3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"

We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.

(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
//   is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                  const std::string& export_dir,
                                  ExportImportFilesMetaData** metadata);

Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.

(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
//     an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
    const ColumnFamilyOptions& options, const std::string& column_family_name,
    const ImportColumnFamilyOptions& import_options,
    const ExportImportFilesMetaData& metadata,
    ColumnFamilyHandle** handle);

Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.

If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.

Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: #5495

Differential Revision: D16018881

fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
@siying
Copy link
Contributor

siying commented Sep 6, 2019

#5495 is already merged.

@siying siying closed this Sep 6, 2019
merryChris pushed a commit to merryChris/rocksdb that referenced this pull request Nov 18, 2019
Summary:
Refresh of the earlier change here - facebook#5135

This is a review request for code change needed for - facebook#3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"

We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.

(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
//   is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                  const std::string& export_dir,
                                  ExportImportFilesMetaData** metadata);

Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.

(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
//     an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
    const ColumnFamilyOptions& options, const std::string& column_family_name,
    const ImportColumnFamilyOptions& import_options,
    const ExportImportFilesMetaData& metadata,
    ColumnFamilyHandle** handle);

Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.

If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.

Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: facebook#5495

Differential Revision: D16018881

fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants