
Support remote storage, step2, only for be: hot data trans to cold data. clean cold data when drop table #7529

Closed
pengxiangyu wants to merge 12 commits

Conversation

@pengxiangyu (Contributor) commented Dec 29, 2021

Proposed changes

  1. When hot data needs to be converted to cold data, MigrationHandler creates a new SchemaChangeV2 job and issues CreateReplicaTask and AlterReplicaTask. The job does the following:
    1.1 The FE creates a new shadow tablet with a remote path and sends the request to the BE. (MigrationHandler.java)
    1.2 The BE creates a local cache dir for the remote path and uploads the data from local disk to S3. (schema_change.cpp, beta_rowset.cpp)
    1.3 The tablet_uid (used for the remote path) is written into the cache path. (schema_change.cpp) tablet_uid is used to build the remote path on S3, and is needed again when the remote path has to be deleted.
    1.4 The meta is updated in the OLAP meta. (schema_change.cpp)
  2. Hot data to cold data, on the FE side:
    2.1 Add StorageColdMedium to DataProperty, and change every DataProperty() constructor.
    2.2 Add the S3 type for cold storage. (PropertyAnalyzer.java)
  3. When the cold data needs to be read:
    3.1 RemoteBlockManager is used; all read operations go through it.
    3.2 When a select arrives, it is sent to RemoteBlockManager, which downloads the files from the remote path on S3 to the cache path, reads them from the cache path, and returns.
  4. When the cold data needs to be dropped:
    4.1 Move the remote data to the trash path on S3. (data_dir.cpp, move_to_trash())
    4.2 Move the local data to the trash path on local disk.
    4.3 Delete files in the trash dirs, both remote and local. (storage_engine.cpp, start_trash_sweep(), _do_sweep())
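The trash layout behind step 4 can be sketched as a pure path computation. The helper below is a hypothetical illustration (the real layout in data_dir.cpp's move_to_trash() may differ): a dropped tablet directory is parked under a timestamped trash directory, on local disk and on S3 alike, so the later trash sweep can delete it once it expires.

```cpp
#include <filesystem>
#include <string>

// Hypothetical sketch of the move-to-trash destination used when a table is
// dropped: the tablet directory is relocated under <storage_root>/trash/<ts>/
// so that the trash sweep can remove it after the expiry window.
// Names and path layout are illustrative, not Doris's actual scheme.
std::string trash_path_for(const std::string& storage_root,
                           const std::string& tablet_path,
                           long long timestamp) {
    std::filesystem::path src(tablet_path);
    // Keep only the last path component (the tablet dir) under the trash root.
    return storage_root + "/trash/" + std::to_string(timestamp) + "/" +
           src.filename().string();
}
```

The same computation works for a remote (S3) prefix, since only string manipulation is involved; the actual move is a rename locally and an object copy-and-delete on S3.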

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)
  • Code refactor (Modify the code structure, format the code, etc...)
  • Optimization. Including functional usability improvements and performance improvements.
  • Dependency. Such as changes related to third-party components.
  • Other.

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have created an issue on (Fix [Feature] Support storage of remote cluster(BOS/S3) for doris data. #7097) and described the bug/feature there in detail
  • Compiling and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • If these changes need document changes, I have updated the document
  • Any dependent changes have been merged

Further comments

After this patch, remote data is fetched when a select is executed.

@morningman morningman self-assigned this Jan 5, 2022
@morningman morningman added area/remote-storage kind/feature Categorizes issue or PR as related to a new feature. labels Jan 5, 2022
@morningman (Contributor):

link to #7575

@yiguolei (Contributor):

I have two questions:

  1. Should we set the partition to a frozen state to avoid inserting data into cold partitions?
  2. How do we deal with schema change for data that is already on S3?

@morningman (Contributor):

Please update the PR description to describe the new implementation.

@pengxiangyu (Contributor, Author):

> 1. Should we set the partition to a frozen state to avoid inserting data into cold partitions?
> 2. How do we deal with schema change for data that is already on S3?

  1. The partition does not need to be frozen. During migration, new data is inserted into both the old tablet and the new shadow tablet, just like in schema change.
  2. Data on S3 is cold data, and this kind of data is not writable, so schema change is not supported for it. If a partition contains cold data, that partition cannot be written to, but partitions with hot data remain writable.
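The writability rule in the answer above reduces to a one-line predicate. A minimal sketch, with a hypothetical StorageMedium enum standing in for Doris's actual storage-medium type:

```cpp
// Minimal sketch of the rule described above: data on S3 is cold and
// read-only, so a partition whose medium is S3 rejects new writes, while
// hot (HDD/SSD) partitions stay writable. The enum and function names are
// hypothetical, not Doris's actual types.
enum class StorageMedium { HDD, SSD, S3 };

bool is_partition_writable(StorageMedium medium) {
    return medium != StorageMedium::S3;
}
```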


OLAPStatus upload_files_to(const FilePathDesc& dir_desc) override;
OLAPStatus upload_files_to(const FilePathDesc& dir_desc,
Contributor:

Why is this API needed?

@@ -110,7 +110,7 @@ OLAPStatus AlphaRowset::link_files_to(const FilePathDesc& dir_desc, RowsetId new
return OLAP_SUCCESS;
return OLAP_SUCCESS;

Contributor:

Alpha rowsets are no longer supported; just return an error and stop the migration job if this rowset is an AlphaRowset.

@@ -28,9 +29,10 @@ extern MetricPrototype METRIC_query_scan_bytes;
extern MetricPrototype METRIC_query_scan_rows;
extern MetricPrototype METRIC_query_scan_count;

BaseTablet::BaseTablet(TabletMetaSharedPtr tablet_meta, DataDir* data_dir)
BaseTablet::BaseTablet(TabletMetaSharedPtr tablet_meta, const StorageParamPB& storage_param, DataDir* data_dir)
Contributor:

Do not add a new parameter; StorageParam should be part of TabletMeta.

@@ -776,18 +783,18 @@ void StorageEngine::_clean_unused_txns() {
}
}

OLAPStatus StorageEngine::_do_sweep(const string& scan_root, const time_t& local_now,
OLAPStatus StorageEngine::_do_sweep(const FilePathDesc& scan_root_desc, const time_t& local_now,
const int32_t expire) {
Contributor:

Cold data will have no trash or garbage; there is no need to deal with it here.

Contributor (Author):

It may be removed later, but I think keeping it is safer for the first online release of the migration operation.

std::shared_ptr<Env> env = Env::get_env(path_desc);
if (env == nullptr) {
LOG(INFO) << "remote storage is not exist, create it. storage_name: " << request.storage_param.storage_name;
RETURN_WITH_WARN_IF_ERROR(Env::get_remote_mgr()->create_remote_storage(
Contributor:

The remote storage object should be shared; do not create a remote storage object for every tablet. For example, we may want to monitor the remote storage's performance, and that becomes too hard if there are too many objects.

Contributor (Author):

Remote storage is shared in RemoteEnvMgr. Env::get_env() gets the shared_ptr from RemoteEnvMgr; path_desc only carries the storage_name and medium type.
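The sharing scheme described in this reply can be sketched as a small registry keyed by storage_name that hands out shared_ptrs, so each remote storage object is created once and reused. This is a hypothetical illustration, not the actual RemoteEnvMgr code:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for a remote storage handle (e.g. an S3 client).
struct RemoteStorage {
    explicit RemoteStorage(std::string name) : storage_name(std::move(name)) {}
    std::string storage_name;
};

// Registry sketch: one RemoteStorage per storage_name, shared by all tablets.
class RemoteStorageMgr {
public:
    std::shared_ptr<RemoteStorage> get_or_create(const std::string& name) {
        std::lock_guard<std::mutex> lock(_mutex);
        auto it = _storages.find(name);
        if (it != _storages.end()) {
            return it->second;  // reuse the shared object
        }
        auto storage = std::make_shared<RemoteStorage>(name);  // create once
        _storages.emplace(name, storage);
        return storage;
    }

private:
    std::mutex _mutex;
    std::map<std::string, std::shared_ptr<RemoteStorage>> _storages;
};

// Process-wide accessor, mirroring the "shared in one manager" idea above.
RemoteStorageMgr& global_remote_storage_mgr() {
    static RemoteStorageMgr mgr;
    return mgr;
}
```

Because tablets only ever see the shared_ptr, monitoring or reconfiguring a remote storage touches one object per storage_name rather than one per tablet.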

@@ -1091,7 +1131,48 @@ void TabletManager::try_delete_unused_tablet_path(DataDir* data_dir, TTabletId t
// TODO(ygl): may do other checks in the future
if (Env::Default()->path_exists(schema_hash_path).ok()) {
LOG(INFO) << "start to move tablet to trash. tablet_path = " << schema_hash_path;
OLAPStatus rm_st = move_to_trash(schema_hash_path, schema_hash_path);
FilePathDesc segment_desc(schema_hash_path);
Contributor:

Tablet GC cannot be performed on a BE, because the BE does not know when to delete the data. It should be performed on the FE; a BE should only care about its local data, not remote data.

Contributor (Author):

Every piece of remote data has a local meta; when the local meta is deleted, the remote data is deleted too. If deletion were done by the FE, and a migration operation failed without notifying the FE, how could the FE know about it?

@yiguolei (Contributor) commented Mar 27, 2022:

I have reviewed the code; it is too complicated and should be simplified.

  1. I see that you reused the schema change logic. It is fine to reuse the logic, but not the code: we should not modify the schema change code, since it is very critical. We should write a new job very much like schema change, because some of the logic differs slightly. For example, if 10 tablets are migrating and one fails, schema change fails the whole job, but storage migration should ignore the failure and may add a new task for the failed tablet.
  2. I think that if the user sets a partition to COLD, it should be frozen and not accept new inserts; that is acceptable. But it is very important to still allow SCHEMA CHANGE, because schema change is table level.
  3. If the tablet is a new tablet that is not writable and does no compaction, there should be no garbage, so we do not need GC or sweep logic for remote storage.
  4. The FE should handle garbage tablets: for example, if one tablet fails migration, the FE should run the GC logic, e.g. by calling a BE API.
  5. Setting the remote storage medium at tablet level is fine, but it should not be a separate parameter; it should be part of the tablet meta on both FE and BE.
  6. We should implement the clone logic on the FE; on the BE it may just be a create-tablet operation in which the BE loads the header file from remote storage.
  7. There are also many other flows, such as backup, restore, and snapshot. Maybe we can disable some of these features in the current step and implement them in the future.
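The per-tablet failure handling in point 1 can be sketched as follows: the migration job walks all tablets and, instead of failing the whole job on the first error (as schema change does), collects failed tablets into a retry queue. Names and structure are hypothetical:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of migration-style failure handling: run migrate_one over every
// tablet, ignore individual failures, and return the tablets that need a
// new retry task. This is illustrative, not Doris's actual job code.
std::vector<int64_t> run_migration(const std::vector<int64_t>& tablet_ids,
                                   const std::function<bool(int64_t)>& migrate_one) {
    std::vector<int64_t> retry_queue;
    for (int64_t id : tablet_ids) {
        if (!migrate_one(id)) {
            retry_queue.push_back(id);  // do not fail the job; retry later
        }
    }
    return retry_queue;
}
```

A schema change job, by contrast, would abort the whole job as soon as one tablet fails; this difference is why the reviewer suggests a separate job type rather than reusing the schema change code.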

@yiguolei (Contributor):

I also think Env is not used properly. There is a lot of code like "if Env is xxx Env, then do something". We could bind each tablet to a specific Env; then there is no need to check the env type, and we can call the Env object's methods directly.

@yiguolei (Contributor) commented Mar 27, 2022:

Some methods in Env are useless. For example, new_sequential_file is only used in:

```cpp
Status get_thread_stats(int64_t tid, ThreadStats* stats) {
    DCHECK(stats != nullptr);
    if (kTicksPerSec <= 0) {
        return Status::NotSupported("ThreadStats not supported");
    }
    faststring buf;
    RETURN_IF_ERROR(env_util::read_file_to_string(
            Env::Default(), strings::Substitute("/proc/self/task/$0/stat", tid), &buf));
    return parse_stat(buf.ToString(), nullptr, stats);
}
```

It does not make sense for this to be part of Env. What about just using the POSIX API for this and simplifying the Env interface?

new_random_access_file is only used to read the cluster id; it is not related to remote storage.
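The simplification proposed here — reading a small local file such as a /proc stat file without going through Env — can be sketched with plain standard I/O. read_file_to_string and read_or_empty below are hypothetical replacements, not existing Doris code:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hedged sketch: read a whole (small) file into a string with plain standard
// I/O, so callers like get_thread_stats() would not need the Env abstraction.
bool read_file_to_string(const std::string& path, std::string* out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return false;  // file missing or unreadable
    }
    std::ostringstream buf;
    buf << in.rdbuf();  // slurp the whole file
    *out = buf.str();
    return true;
}

// Convenience wrapper: returns the contents, or "" on failure.
std::string read_or_empty(const std::string& path) {
    std::string s;
    return read_file_to_string(path, &s) ? s : std::string();
}
```

With a helper like this, Env can shrink to the operations that genuinely vary between local and remote storage.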

@morningman (Contributor):

This PR is closed; please refer to #8663.

@morningman morningman closed this Mar 29, 2022
@pengxiangyu pengxiangyu deleted the remote branch September 7, 2022 09:49
Development

Successfully merging this pull request may close these issues.

[Feature] Support storage of remote cluster(BOS/S3) for doris data.
3 participants