
Support remote storage, step2, only for be: hot data trans to cold data. clean cold data when drop table #7529

Closed
pengxiangyu wants to merge 12 commits

Conversation

@pengxiangyu (Contributor) commented Dec 29, 2021

Proposed changes

  1. When hot data needs to be converted to cold data, MigrationHandler creates a new SchemaChangeV2 job and issues CreateReplicaTask and AlterReplicaTask. The job does the following:
    1.1 The FE creates a new shadow tablet with a remote path and sends the request to the BE. (MigrationHandler.java)
    1.2 The BE creates a local cache dir for the remote path and uploads the data from local disk to S3. (schema_change.cpp, beta_rowset.cpp)
    1.3 The tablet_uid (used for the remote path) is written into the cache path. (schema_change.cpp) tablet_uid is used to build the remote path on S3, and is needed again when the remote path has to be deleted.
    1.4 The meta is updated in the OLAP meta. (schema_change.cpp)
  2. Hot data to cold data, on the FE side:
    2.1 Add StorageColdMedium to DataProperty, and change every DataProperty() constructor.
    2.2 Add the S3 type for cold storage. (PropertyAnalyzer.java)
  3. When the cold data needs to be read:
    3.1 RemoteBlockManager is used; all read operations go through it.
    3.2 When a select arrives, it is sent to RemoteBlockManager, which downloads the files from the remote path on S3 to the cache path, reads them from the cache path, and returns.
  4. When the cold data needs to be dropped:
    4.1 Move the remote data to the trash path on S3. (data_dir.cpp, move_to_trash())
    4.2 Move the local data to the trash path on local disk.
    4.3 Delete files in the trash dirs, both remote and local. (storage_engine.cpp, start_trash_sweep(), _do_sweep())
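The trash layout behind step 4 can be sketched as a pure path computation. The helper below is a hypothetical illustration (the real layout in data_dir.cpp's move_to_trash() may differ): a dropped tablet directory is parked under a timestamped trash directory, on local disk and on S3 alike, so the later trash sweep can delete it once it expires.

```cpp
#include <filesystem>
#include <string>

// Hypothetical sketch of the move-to-trash destination used when a table is
// dropped: the tablet directory is relocated under <storage_root>/trash/<ts>/
// so that the trash sweep can remove it after the expiry window.
// Names and path layout are illustrative, not Doris's actual scheme.
std::string trash_path_for(const std::string& storage_root,
                           const std::string& tablet_path,
                           long long timestamp) {
    std::filesystem::path src(tablet_path);
    // Keep only the last path component (the tablet dir) under the trash root.
    return storage_root + "/trash/" + std::to_string(timestamp) + "/" +
           src.filename().string();
}
```

The same computation works for a remote (S3) prefix, since only string manipulation is involved; the actual move is a rename locally and an object copy-and-delete on S3.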

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)
  • Code refactor (Modify the code structure, format the code, etc...)
  • Optimization. Including functional usability improvements and performance improvements.
  • Dependency. Such as changes related to third-party components.
  • Other.

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have created an issue on (Fix [Feature] Support storage of remote cluster(BOS/S3) for doris data. #7097) and described the bug/feature there in detail
  • Compiling and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • If these changes need document changes, I have updated the document
  • Any dependent changes have been merged

Further comments

After this patch, remote data is fetched when a select is executed.

@morningman morningman self-assigned this Jan 5, 2022
@morningman morningman added area/remote-storage kind/feature Categorizes issue or PR as related to a new feature. labels Jan 5, 2022
@morningman (Contributor):

link to #7575

@yiguolei (Contributor):

I have two questions:

  1. Should we set the partition to a frozen state to avoid inserting data into cold partitions?
  2. How do we deal with schema change for data that is already on S3?

@morningman (Contributor):

Please update the PR description to describe the new implementation.

@pengxiangyu (Contributor, Author):

> 1. Should we set the partition to a frozen state to avoid inserting data into cold partitions?
> 2. How do we deal with schema change for data that is already on S3?

  1. The partition does not need to be frozen. During migration, new data is inserted into both the old tablet and the new shadow tablet, just like in schema change.
  2. Data on S3 is cold data, and this kind of data is not writable, so schema change is not supported for it. If a partition contains cold data, that partition cannot be written to, but partitions with hot data remain writable.
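The writability rule in the answer above reduces to a one-line predicate. A minimal sketch, with a hypothetical StorageMedium enum standing in for Doris's actual storage-medium type:

```cpp
// Minimal sketch of the rule described above: data on S3 is cold and
// read-only, so a partition whose medium is S3 rejects new writes, while
// hot (HDD/SSD) partitions stay writable. The enum and function names are
// hypothetical, not Doris's actual types.
enum class StorageMedium { HDD, SSD, S3 };

bool is_partition_writable(StorageMedium medium) {
    return medium != StorageMedium::S3;
}
```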


OLAPStatus upload_files_to(const FilePathDesc& dir_desc) override;
OLAPStatus upload_files_to(const FilePathDesc& dir_desc,
Contributor:

Why is this API needed?

@@ -110,7 +110,7 @@ OLAPStatus AlphaRowset::link_files_to(const FilePathDesc& dir_desc, RowsetId new
return OLAP_SUCCESS;
return OLAP_SUCCESS;

Contributor:

Alpha rowsets are no longer supported; just return an error and stop the migration job if this rowset is an AlphaRowset.

@@ -28,9 +29,10 @@ extern MetricPrototype METRIC_query_scan_bytes;
extern MetricPrototype METRIC_query_scan_rows;
extern MetricPrototype METRIC_query_scan_count;

BaseTablet::BaseTablet(TabletMetaSharedPtr tablet_meta, DataDir* data_dir)
BaseTablet::BaseTablet(TabletMetaSharedPtr tablet_meta, const StorageParamPB& storage_param, DataDir* data_dir)
Contributor:

Do not add a new parameter; StorageParam should be part of TabletMeta.

@@ -776,18 +783,18 @@ void StorageEngine::_clean_unused_txns() {
}
}

OLAPStatus StorageEngine::_do_sweep(const string& scan_root, const time_t& local_now,
OLAPStatus StorageEngine::_do_sweep(const FilePathDesc& scan_root_desc, const time_t& local_now,
const int32_t expire) {
Contributor:

Cold data will have no trash or garbage; there is no need to deal with it here.

Contributor (Author):

It may be removed later, but I think keeping it is safer for the first online release of the migration operation.

std::shared_ptr<Env> env = Env::get_env(path_desc);
if (env == nullptr) {
LOG(INFO) << "remote storage is not exist, create it. storage_name: " << request.storage_param.storage_name;
RETURN_WITH_WARN_IF_ERROR(Env::get_remote_mgr()->create_remote_storage(
Contributor:

The remote storage object should be shared; do not create a remote storage object for every tablet. For example, we may want to monitor the remote storage's performance, and that becomes too hard if there are too many objects.

Contributor (Author):

Remote storage is shared in RemoteEnvMgr. Env::get_env() gets the shared_ptr from RemoteEnvMgr; path_desc only carries the storage_name and medium type.
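The sharing scheme described in this reply can be sketched as a small registry keyed by storage_name that hands out shared_ptrs, so each remote storage object is created once and reused. This is a hypothetical illustration, not the actual RemoteEnvMgr code:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for a remote storage handle (e.g. an S3 client).
struct RemoteStorage {
    explicit RemoteStorage(std::string name) : storage_name(std::move(name)) {}
    std::string storage_name;
};

// Registry sketch: one RemoteStorage per storage_name, shared by all tablets.
class RemoteStorageMgr {
public:
    std::shared_ptr<RemoteStorage> get_or_create(const std::string& name) {
        std::lock_guard<std::mutex> lock(_mutex);
        auto it = _storages.find(name);
        if (it != _storages.end()) {
            return it->second;  // reuse the shared object
        }
        auto storage = std::make_shared<RemoteStorage>(name);  // create once
        _storages.emplace(name, storage);
        return storage;
    }

private:
    std::mutex _mutex;
    std::map<std::string, std::shared_ptr<RemoteStorage>> _storages;
};

// Process-wide accessor, mirroring the "shared in one manager" idea above.
RemoteStorageMgr& global_remote_storage_mgr() {
    static RemoteStorageMgr mgr;
    return mgr;
}
```

Because tablets only ever see the shared_ptr, monitoring or reconfiguring a remote storage touches one object per storage_name rather than one per tablet.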

@@ -1091,7 +1131,48 @@ void TabletManager::try_delete_unused_tablet_path(DataDir* data_dir, TTabletId t
// TODO(ygl): may do other checks in the future
if (Env::Default()->path_exists(schema_hash_path).ok()) {
LOG(INFO) << "start to move tablet to trash. tablet_path = " << schema_hash_path;
OLAPStatus rm_st = move_to_trash(schema_hash_path, schema_hash_path);
FilePathDesc segment_desc(schema_hash_path);
Contributor:

Tablet GC cannot be performed on a BE, because the BE does not know when to delete the data. It should be performed on the FE; a BE should only care about its local data, not remote data.

Contributor (Author):

Every piece of remote data has a local meta; when the local meta is deleted, the remote data is deleted too. If deletion were done by the FE, and a migration operation failed without notifying the FE, how could the FE know about it?

@yiguolei (Contributor) commented Mar 27, 2022:

I have reviewed the code; it is too complicated and should be simplified.

  1. I see that you reused the schema change logic. It is fine to reuse the logic, but not the code: we should not modify the schema change code, since it is very critical. We should write a new job very much like schema change, because some of the logic differs slightly. For example, if 10 tablets are migrating and one fails, schema change fails the whole job, but storage migration should ignore the failure and may add a new task for the failed tablet.
  2. I think that if the user sets a partition to COLD, it should be frozen and not accept new inserts; that is acceptable. But it is very important to still allow SCHEMA CHANGE, because schema change is table level.
  3. If the tablet is a new tablet that is not writable and does no compaction, there should be no garbage, so we do not need GC or sweep logic for remote storage.
  4. The FE should handle garbage tablets: for example, if one tablet fails migration, the FE should run the GC logic, e.g. by calling a BE API.
  5. Setting the remote storage medium at tablet level is fine, but it should not be a separate parameter; it should be part of the tablet meta on both FE and BE.
  6. We should implement the clone logic on the FE; on the BE it may just be a create-tablet operation in which the BE loads the header file from remote storage.
  7. There are also many other flows, such as backup, restore, and snapshot. Maybe we can disable some of these features in the current step and implement them in the future.
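The per-tablet failure handling in point 1 can be sketched as follows: the migration job walks all tablets and, instead of failing the whole job on the first error (as schema change does), collects failed tablets into a retry queue. Names and structure are hypothetical:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of migration-style failure handling: run migrate_one over every
// tablet, ignore individual failures, and return the tablets that need a
// new retry task. This is illustrative, not Doris's actual job code.
std::vector<int64_t> run_migration(const std::vector<int64_t>& tablet_ids,
                                   const std::function<bool(int64_t)>& migrate_one) {
    std::vector<int64_t> retry_queue;
    for (int64_t id : tablet_ids) {
        if (!migrate_one(id)) {
            retry_queue.push_back(id);  // do not fail the job; retry later
        }
    }
    return retry_queue;
}
```

A schema change job, by contrast, would abort the whole job as soon as one tablet fails; this difference is why the reviewer suggests a separate job type rather than reusing the schema change code.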

@yiguolei (Contributor):

I also think Env is not used properly. There is a lot of code like "if Env is xxx Env, then do something". We could bind each tablet to a specific Env; then there is no need to check the env type, and we can call the Env object's methods directly.

@yiguolei (Contributor) commented Mar 27, 2022:

Some methods in Env are useless. For example, new_sequential_file is only used in:

```cpp
Status get_thread_stats(int64_t tid, ThreadStats* stats) {
    DCHECK(stats != nullptr);
    if (kTicksPerSec <= 0) {
        return Status::NotSupported("ThreadStats not supported");
    }
    faststring buf;
    RETURN_IF_ERROR(env_util::read_file_to_string(
            Env::Default(), strings::Substitute("/proc/self/task/$0/stat", tid), &buf));
    return parse_stat(buf.ToString(), nullptr, stats);
}
```

It does not make sense for this to be part of Env. What about just using the POSIX API for this and simplifying the Env interface?

new_random_access_file is only used to read the cluster id; it is not related to remote storage.
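The simplification proposed here — reading a small local file such as a /proc stat file without going through Env — can be sketched with plain standard I/O. read_file_to_string and read_or_empty below are hypothetical replacements, not existing Doris code:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hedged sketch: read a whole (small) file into a string with plain standard
// I/O, so callers like get_thread_stats() would not need the Env abstraction.
bool read_file_to_string(const std::string& path, std::string* out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return false;  // file missing or unreadable
    }
    std::ostringstream buf;
    buf << in.rdbuf();  // slurp the whole file
    *out = buf.str();
    return true;
}

// Convenience wrapper: returns the contents, or "" on failure.
std::string read_or_empty(const std::string& path) {
    std::string s;
    return read_file_to_string(path, &s) ? s : std::string();
}
```

With a helper like this, Env can shrink to the operations that genuinely vary between local and remote storage.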

@morningman (Contributor):

This PR is closed; please refer to #8663.

@morningman morningman closed this Mar 29, 2022
@pengxiangyu pengxiangyu deleted the remote branch September 7, 2022 09:49
Development

Successfully merging this pull request may close these issues.

[Feature] Support storage of remote cluster(BOS/S3) for doris data.
3 participants