[Feature] Support storage of remote cluster(BOS/S3) for doris data. #7097

Closed
3 tasks done
pengxiangyu opened this issue Nov 11, 2021 · 0 comments · Fixed by #7098
Labels
kind/feature Categorizes issue or PR as related to a new feature.

pengxiangyu commented Nov 11, 2021

Search before asking

  • I had searched in the issues and found no similar issues.

Description

Doris currently stores data only on local disk, which makes reads and writes fast. However, not all data in a database is accessed frequently; most data is used mainly while it is new. Once data is no longer hot, it still occupies disk space. You can delete it, but some of it may become useful again later.

So cold data should be saved on cheaper storage, such as BOS, S3, or HDFS.

Cold data can then still be read when necessary, directly from remote storage.

Overall

  1. Support remote storage: data will be moved to remote storage (BOS/S3) when it becomes cold.
  2. Dynamic partitions are created continuously, so they need to be marked cold continuously, based on their creation time.
  3. Meta must stay local so it can be read quickly. Data is then located and read using the meta.
  4. When cold data needs to be read, fetch it from remote storage.
  5. Remote storage should behave like local storage: cold data can be read, moved to trash, and deleted, but can't be appended to.

Detail design

BE will resolve the mapping between local disk and remote storage.
Local disk holds the meta, which is used to find which data is needed.
Remote storage holds the cold data, which is read by BE.

                                     FE
                                      |
                                     BE
                          |                        |
                         META                     DATA
                      LOCAL DISK              REMOTE STORAGE
  1. Support remote storage
    Remote storage configuration is set in the properties of CREATE/ALTER TABLE.
    a. storage_medium is the storage for hot data.
    b. storage_cold_medium is the destination storage to which cold data will be moved.
    c. storage_cooldown_time is the time at which the data becomes cold.
CREATE TABLE TblPxy
(
    aa BIGINT
)
ENGINE=olap
DISTRIBUTED BY HASH (aa) BUCKETS 32
PROPERTIES(
    "storage_medium" = "SSD",
    "storage_cold_medium" = "S3",
    "storage_cooldown_time" = "2021-11-08 11:52:00"
);
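How the three properties could interact is sketched below in Python. This is an illustration only, not Doris code: the function name `effective_medium` is hypothetical, and treating the cooldown time as an inclusive boundary is an assumption that the proposal does not specify.

```python
from datetime import datetime


def effective_medium(props: dict, now: datetime) -> str:
    """Hypothetical helper: pick the medium data should live on at time `now`."""
    hot = props["storage_medium"]
    cold = props.get("storage_cold_medium")
    cooldown = props.get("storage_cooldown_time")
    # Without a cold medium or a cooldown time, data simply stays hot.
    if cold is None or cooldown is None:
        return hot
    cooldown_at = datetime.strptime(cooldown, "%Y-%m-%d %H:%M:%S")
    # Assumption: data counts as cold from the cooldown time onwards (inclusive).
    return cold if now >= cooldown_at else hot


# Same property values as the CREATE TABLE statement above.
props = {
    "storage_medium": "SSD",
    "storage_cold_medium": "S3",
    "storage_cooldown_time": "2021-11-08 11:52:00",
}
before = effective_medium(props, datetime(2021, 11, 8, 11, 51, 59))
after = effective_medium(props, datetime(2021, 11, 8, 11, 52, 0))
```

With these values, `before` is the hot medium and `after` is the cold medium.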
  2. Dynamic partition cold data
    Dynamic partitions are created continuously, so the cooldown time must be derived from the partition time.
    a. dynamic_partition.hot_partition_num is how many hot partitions will remain; older partitions will be set to cold.
    b. dynamic_partition.storage_medium is the storage holding hot data.
    c. dynamic_partition.storage_cold_medium is the destination storage for cold data.
CREATE TABLE TblPxy (
    k1 DATE,
    aa BIGINT
) ENGINE=olap PARTITION BY RANGE (k1) ()
DISTRIBUTED BY HASH (aa) BUCKETS 1
PROPERTIES(
    "dynamic_partition.hot_partition_num" = "3",
    "dynamic_partition.storage_medium" = "HDD",
    "dynamic_partition.storage_cold_medium" = "S3",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-3",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "32"
);
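The effect of hot_partition_num on a table like this can be sketched in Python. The helper `split_hot_cold` is hypothetical, and the exact boundary is an assumption: here the current partition counts as hot, together with the hot_partition_num - 1 partitions before it and all future partitions. The proposal itself only says that older partitions become cold.

```python
from datetime import date, timedelta


def split_hot_cold(partition_days, today, hot_partition_num):
    """Hypothetical helper: split DAY partitions into hot and cold lists."""
    # Assumption: the oldest hot partition is hot_partition_num - 1 days back.
    oldest_hot = today - timedelta(days=hot_partition_num - 1)
    hot = [d for d in partition_days if d >= oldest_hot]
    cold = [d for d in partition_days if d < oldest_hot]
    return hot, cold


today = date(2021, 11, 11)
# start = -3 and end = 3 cover the days from today - 3 to today + 3.
days = [today + timedelta(days=offset) for offset in range(-3, 4)]
hot, cold = split_hot_cold(days, today, hot_partition_num=3)
```

Under this assumption only the oldest of the seven partitions would be moved to the cold medium; the rest would stay on the hot medium.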
  3. Read cold data, meta will be local
    When a SELECT touches cold data, BE first reads the meta from local disk and chooses which data is needed.
    Then the matching remote data is read and returned to BE.
SELECT * FROM TblPxy;
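The read path can be sketched as follows. Everything here is hypothetical and only illustrates the idea of pruning with local meta before touching remote storage: `SegmentMeta`, `FakeRemoteStore`, and `read_cold` are made-up names, the min/max key range is an assumed form of meta, and the query is assumed to carry a range predicate (unlike the plain SELECT above).

```python
from dataclasses import dataclass


@dataclass
class SegmentMeta:
    path: str     # object key of the segment on remote storage
    min_key: int  # smallest key stored in the segment
    max_key: int  # largest key stored in the segment


class FakeRemoteStore:
    """In-memory stand-in for BOS/S3 that counts how often it is read."""

    def __init__(self, objects):
        self.objects = objects
        self.reads = 0

    def get(self, path):
        self.reads += 1
        return self.objects[path]


def read_cold(local_meta, remote, lo, hi):
    """Return all keys in [lo, hi], fetching only segments that may match."""
    rows = []
    for seg in local_meta:
        if seg.max_key < lo or seg.min_key > hi:
            continue  # pruned using local meta only, no remote access
        rows.extend(v for v in remote.get(seg.path) if lo <= v <= hi)
    return rows


remote = FakeRemoteStore({"seg_0": [1, 5, 9], "seg_1": [20, 25], "seg_2": [40, 41]})
meta = [
    SegmentMeta("seg_0", 1, 9),
    SegmentMeta("seg_1", 20, 25),
    SegmentMeta("seg_2", 40, 41),
]
result = read_cold(meta, remote, lo=5, hi=22)
```

In this sketch the third segment is never fetched, which is the benefit of keeping the meta on local disk.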
  4. Cold data trash
    When cold data needs to be dropped, it is moved to a trash path on remote storage, and that remote trash path is recorded in the local trash path.
    The cleaner checks the local trash path; when it is time to delete, the remote data is deleted first, and then the local data.
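The deletion order described above can be sketched like this. The names `TrashEntry` and `clean_trash`, the expiry parameter, and the dict-based stores are all assumptions made for illustration; they are not taken from the Doris code base.

```python
from dataclasses import dataclass


@dataclass
class TrashEntry:
    remote_path: str   # where the dropped data sits in the remote trash
    dropped_at: float  # drop time, in seconds since the epoch


def clean_trash(local_trash, remote_objects, now, expire_seconds, log):
    """Delete expired entries: remote data first, then the local record."""
    for name in list(local_trash):
        entry = local_trash[name]
        if now - entry.dropped_at < expire_seconds:
            continue  # not yet time to delete
        # Remote first; pop with a default keeps a retry harmless.
        remote_objects.pop(entry.remote_path, None)
        log.append(("remote", entry.remote_path))
        # Only then remove the local record that points at the remote path.
        del local_trash[name]
        log.append(("local", name))


local_trash = {
    "tablet_1": TrashEntry("trash/tablet_1", dropped_at=0.0),
    "tablet_2": TrashEntry("trash/tablet_2", dropped_at=900.0),
}
remote_objects = {"trash/tablet_1": b"data", "trash/tablet_2": b"data"}
log = []
clean_trash(local_trash, remote_objects, now=1000.0, expire_seconds=500.0, log=log)
```

One likely reason for this order: if the cleaner stops between the two steps, the local record still points at the remote path, so a later run can retry instead of leaving orphaned remote data.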

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
