Repartition #132

Merged: 19 commits into main, Jun 13, 2023

valiantljk
Collaborator

  • Added a new compute step, 'repartition', to deltacat to support different repartition strategies (e.g., hash, range, column).
  • Only range repartition is implemented so far.
  • Added a new standalone repartition session so that data layout optimization via repartition can be decoupled from compaction.
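
As a rough illustration of how range and hash strategies differ, here is a minimal pure-Python sketch. It is not deltacat's implementation: `range_assigner`, `hash_assigner`, and `repartition` are hypothetical helper names, and rows are modeled as plain dictionaries rather than deltas.

```python
import zlib
from bisect import bisect_right
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]


def range_assigner(boundaries: Sequence[Any]) -> Callable[[Any], int]:
    # Hypothetical helper. Sorted boundaries [b0, b1, ...] define
    # len(boundaries) + 1 ranges: (-inf, b0), [b0, b1), ..., [b_last, +inf).
    ordered = sorted(boundaries)
    return lambda value: bisect_right(ordered, value)


def hash_assigner(num_buckets: int) -> Callable[[Any], int]:
    # Hypothetical helper. crc32 is stable across processes, unlike
    # Python's built-in hash() for strings.
    return lambda value: zlib.crc32(repr(value).encode("utf-8")) % num_buckets


def repartition(
    rows: Sequence[Row],
    column: str,
    assign: Callable[[Any], int],
    num_buckets: int,
) -> List[List[Row]]:
    # Route each row to the bucket chosen by the strategy for its column value.
    buckets: List[List[Row]] = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[assign(row[column])].append(row)
    return buckets


rows = [{"id": i, "score": s} for i, s in enumerate([5, 42, 17, 99])]
# Two boundaries give three ranges: score < 10, 10 <= score < 50, score >= 50.
by_range = repartition(rows, "score", range_assigner([10, 50]), num_buckets=3)
by_hash = repartition(rows, "id", hash_assigner(4), num_buckets=4)
```

Range assignment keeps neighboring values together, which is what a date-based layout needs; hash assignment spreads values evenly but discards ordering.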

Use case: Hot/cold partitioning

  • To split a delta by date into hot and cold sub-files, use the range repartition API to reshuffle the data.
  • A round completion file is generated so that a subsequent compaction can pick up and use the latest reorganized delta.
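
The hot/cold split can be pictured as a range repartition with a single date boundary. The sketch below is an assumption-laden illustration in plain Python, not the API added by this PR: `split_hot_cold`, the `last_updated` column, and the cutoff date are all made up for the example.

```python
from datetime import date
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def split_hot_cold(
    rows: Sequence[Row], date_column: str, cutoff: date
) -> Tuple[List[Row], List[Row]]:
    # Hypothetical helper: one boundary yields two ranges.
    # Rows dated on or after the cutoff are "hot"; older rows are "cold".
    hot: List[Row] = []
    cold: List[Row] = []
    for row in rows:
        (hot if row[date_column] >= cutoff else cold).append(row)
    return hot, cold


# Illustrative data only.
delta = [
    {"order_id": 1, "last_updated": date(2023, 6, 1)},
    {"order_id": 2, "last_updated": date(2022, 11, 15)},
    {"order_id": 3, "last_updated": date(2023, 5, 20)},
]
hot, cold = split_hot_cold(delta, "last_updated", cutoff=date(2023, 1, 1))
```

In this toy setup each output list stands in for one sub-file; the round completion file mentioned above is not modeled here.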

Testing

  • Tested on one production table with 100 r5.8xlarge workers; the repartition completed in 160 seconds.

@Zyiqin-Miranda Zyiqin-Miranda self-assigned this Jun 7, 2023
@Zyiqin-Miranda Zyiqin-Miranda self-requested a review June 7, 2023 22:52
Collaborator

@raghumdani raghumdani left a comment

At a high level, we have been writing unit tests for all the modules we are developing. Since this is completely new functionality, can we add unit tests as part of the definition of done?

Resolved review threads (outdated):
  • deltacat/compute/compactor/model/repartition_result.py
  • deltacat/compute/compactor/utils/repartition_session.py (2 threads)
  • deltacat/compute/compactor/steps/repartition.py (4 threads)
Member

@Zyiqin-Miranda Zyiqin-Miranda left a comment

LGTM!

Collaborator

@raghumdani raghumdani left a comment

Mostly looks good. Please check the high-level comments.

Resolved review threads (outdated):
  • deltacat/compute/compactor/steps/repartition.py (6 threads)
  • deltacat/compute/compactor/utils/repartition_session.py
@valiantljk
Collaborator Author

At a high level, we have been writing unit tests for all the modules we are developing. Since this is completely new functionality, can we add unit tests as part of the definition of done?

#134

Collaborator

@raghumdani raghumdani left a comment

Overall looks good. Please address the minor comments.

Resolved review threads:
  • deltacat/compute/compactor/repartition_session.py (outdated)
  • deltacat/compute/compactor/steps/repartition.py (outdated, 2 threads)
  • deltacat/storage/model/types.py
Collaborator

@raghumdani raghumdani left a comment

LGTM!

@valiantljk valiantljk merged commit ec1e7f8 into main Jun 13, 2023
Zyiqin-Miranda pushed a commit to Zyiqin-Miranda/deltacat that referenced this pull request Jun 13, 2023
Support Repartition to split and organize the data into multiple groups
Zyiqin-Miranda pushed a commit to Zyiqin-Miranda/deltacat that referenced this pull request Jun 23, 2023
Support Repartition to split and organize the data into multiple groups
Zyiqin-Miranda pushed a commit to Zyiqin-Miranda/deltacat that referenced this pull request Jun 26, 2023
Support Repartition to split and organize the data into multiple groups
3 participants