Shard based recovery #799
Labels
feature
New feature or request
Comments
2 tasks
Rachelint added and removed the help wanted label (Extra attention is needed) on Mar 31, 2023
@Rachelint #800 mentions a separate runtime for recovery, and I guess it should be considered together with this.
Rachelint
added a commit
that referenced
this issue
May 9, 2023
## Which issue does this PR close?

Part of #799

## Rationale for this change

Currently, we update `TableData` and store its WAL separately. The order of these two operations is maintained by the developer, which can lead to a serious bug if the related code is modified carelessly.

## What changes are included in this PR?

+ Place table data into the manifest.
+ Update it both in memory and in storage in the same call.

## Are there any user-facing changes?

None.

## How does this change test

Test by unit tests.
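The idea of updating memory and storage in the same call can be illustrated with a minimal sketch. Everything here is hypothetical: the `Manifest`, `ManifestStorage`, and `TableData` types are simplified stand-ins rather than the project's real definitions, a `Vec<String>` plays the role of persistent storage, and the "persist first, then update memory" ordering is an assumption, since the PR description does not state the order.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the table metadata held by the manifest.
#[derive(Debug, Clone, PartialEq)]
pub struct TableData {
    pub table_id: u64,
    pub name: String,
}

/// Hypothetical persistent log of manifest edits (a Vec stands in for real storage).
#[derive(Default)]
pub struct ManifestStorage {
    edits: Vec<String>,
}

impl ManifestStorage {
    fn append(&mut self, edit: String) -> Result<(), String> {
        self.edits.push(edit);
        Ok(())
    }
}

/// Owns both the persisted edits and the in-memory table data, so a caller
/// cannot update one without the other.
#[derive(Default)]
pub struct Manifest {
    storage: ManifestStorage,
    tables: HashMap<u64, TableData>,
}

impl Manifest {
    /// Persists the edit, then applies it in memory. The order is fixed here
    /// (an assumption of this sketch) instead of being left to each caller.
    pub fn add_table(&mut self, data: TableData) -> Result<(), String> {
        self.storage.append(format!("add_table:{}", data.table_id))?;
        self.tables.insert(data.table_id, data);
        Ok(())
    }

    /// Persists the removal, then drops the in-memory entry.
    pub fn remove_table(&mut self, table_id: u64) -> Result<Option<TableData>, String> {
        self.storage.append(format!("remove_table:{}", table_id))?;
        Ok(self.tables.remove(&table_id))
    }

    pub fn table(&self, table_id: u64) -> Option<&TableData> {
        self.tables.get(&table_id)
    }

    pub fn persisted_edits(&self) -> &[String] {
        &self.storage.edits
    }
}
```

Because the storage write returns early on error, the in-memory map is only touched after the edit has been recorded, so the two views cannot drift apart through a forgotten call.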
chunshao90
pushed a commit
to chunshao90/ceresdb
that referenced
this issue
May 15, 2023
Rachelint
added a commit
that referenced
this issue
May 22, 2023
## Which issue does this PR close?

Part of #799

## Rationale for this change

+ Add `open_shard` and `close_shard` methods to `TableEngine`.
+ Implement the methods above on demand.

## What changes are included in this PR?

See above.

## Are there any user-facing changes?

None.

## How does this change test

Test by unit tests.
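A rough sketch of what shard-level methods on a table engine could look like follows. Only the names `TableEngine`, `open_shard`, and `close_shard` come from the PR description; the request types, return types, and the in-memory `MemoryEngine` are assumptions made for illustration and do not reflect the project's real signatures.

```rust
use std::collections::HashMap;

pub type ShardId = u32;
pub type TableId = u64;

/// Hypothetical request types; the real fields are not given in the source.
pub struct OpenShardRequest {
    pub shard_id: ShardId,
    pub table_ids: Vec<TableId>,
}

pub struct CloseShardRequest {
    pub shard_id: ShardId,
}

/// Sketch of an engine that opens and closes tables one whole shard at a time.
pub trait TableEngine {
    /// Opens every table of the shard and returns one result per table,
    /// so a single failing table does not hide the others.
    fn open_shard(&mut self, req: OpenShardRequest) -> Vec<(TableId, Result<(), String>)>;

    /// Closes every table of the shard and returns the ids that were open.
    fn close_shard(&mut self, req: CloseShardRequest) -> Vec<TableId>;
}

/// Toy implementation that only tracks which tables are open per shard.
#[derive(Default)]
pub struct MemoryEngine {
    opened: HashMap<ShardId, Vec<TableId>>,
}

impl TableEngine for MemoryEngine {
    fn open_shard(&mut self, req: OpenShardRequest) -> Vec<(TableId, Result<(), String>)> {
        let mut results = Vec::with_capacity(req.table_ids.len());
        let opened = self.opened.entry(req.shard_id).or_default();
        for id in req.table_ids {
            if opened.contains(&id) {
                results.push((id, Err(format!("table {} already opened", id))));
            } else {
                opened.push(id);
                results.push((id, Ok(())));
            }
        }
        results
    }

    fn close_shard(&mut self, req: CloseShardRequest) -> Vec<TableId> {
        self.opened.remove(&req.shard_id).unwrap_or_default()
    }
}
```

Returning a per-table result is one possible design choice: it lets the caller register the tables that opened successfully while reporting the ones that failed.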
jiacai2050
pushed a commit
that referenced
this issue
Jun 15, 2023
## Rationale

Part of #799

## Detailed Changes

- Define `WalReplayer` to carry out the replay work.
- Support both `TableBased` (original) and `RegionBased` replay modes in `WalReplayer`.
- Expose related configs.

## Test Plan

- Modify existing unit tests to cover the `RegionBased` WAL replay.
- Refactor the integration test to cover the recovery logic (TODO).
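The difference between the two replay modes can be sketched as follows. This is a simplified model, not the project's implementation: only the names `WalReplayer`, `TableBased`, and `RegionBased` come from the PR description, while `LogEntry`, `ReplayMode`, the `scans` counter, and the assumption that table-based replay rescans the shared log once per table are all illustrative.

```rust
use std::collections::HashMap;

pub type TableId = u64;

/// Hypothetical WAL entry: a payload tagged with the table it belongs to.
#[derive(Debug, Clone, PartialEq)]
pub struct LogEntry {
    pub table_id: TableId,
    pub payload: String,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ReplayMode {
    TableBased,
    RegionBased,
}

pub struct WalReplayer {
    mode: ReplayMode,
    /// Number of full passes over the shared log, kept only for illustration.
    pub scans: usize,
}

impl WalReplayer {
    pub fn new(mode: ReplayMode) -> Self {
        Self { mode, scans: 0 }
    }

    /// Replays the shared log and groups payloads by table.
    pub fn replay(&mut self, log: &[LogEntry], tables: &[TableId]) -> HashMap<TableId, Vec<String>> {
        let mode = self.mode;
        match mode {
            ReplayMode::TableBased => self.replay_table_based(log, tables),
            ReplayMode::RegionBased => self.replay_region_based(log, tables),
        }
    }

    /// One pass over the log per table (assumed behavior of the original mode).
    fn replay_table_based(&mut self, log: &[LogEntry], tables: &[TableId]) -> HashMap<TableId, Vec<String>> {
        let mut out = HashMap::new();
        for &table in tables {
            self.scans += 1;
            let rows: Vec<String> = log
                .iter()
                .filter(|e| e.table_id == table)
                .map(|e| e.payload.clone())
                .collect();
            out.insert(table, rows);
        }
        out
    }

    /// A single pass over the log, dispatching each entry to its table.
    fn replay_region_based(&mut self, log: &[LogEntry], tables: &[TableId]) -> HashMap<TableId, Vec<String>> {
        let mut out: HashMap<TableId, Vec<String>> =
            tables.iter().map(|&t| (t, Vec::new())).collect();
        self.scans += 1;
        for entry in log {
            if let Some(rows) = out.get_mut(&entry.table_id) {
                rows.push(entry.payload.clone());
            }
        }
        out
    }
}
```

Both modes produce the same per-table result in this model; the region-based mode simply reads the shared log once instead of once per table, which is the kind of saving the issue is after when the WAL is stored at shard level.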
Rachelint
added a commit
that referenced
this issue
Jun 16, 2023
## Rationale

Part of #799

Currently we run the recovery tests manually, which is tedious. This PR adds them to the integration tests, which run automatically in CI.

## Detailed Changes

+ Add an integration test for recovery.
+ Add the above test to CI.

## Test Plan

None.
Rachelint
added a commit
that referenced
this issue
Jun 20, 2023
## Rationale

Part of #799

We use `rskafka` as our Kafka client. However, I found that it retries without limit even when the Kafka service is unavailable (see [https://github.com/influxdata/rskafka/issues/65](https://github.com/influxdata/rskafka/issues/65)). Worse, `rskafka` appears to be almost unmaintained. As a quick fix, I forked it to support limited retry.

## Detailed Changes

+ Use the forked `rskafka` instead (supporting limited retry).
+ Add more logs in the recovery path for better debugging.

## Test Plan

Test manually.
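For readers unfamiliar with the idea, limited retry can be sketched generically as below. This is not `rskafka`'s API or the fork's actual change: `RetryConfig` and `retry_limited` are hypothetical names, and the synchronous loop with exponential backoff is just one common way to bound retries.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical retry settings; the names are not taken from `rskafka`.
#[derive(Debug, Clone, Copy)]
pub struct RetryConfig {
    /// Upper bound on attempts; the operation always runs at least once.
    pub max_attempts: u32,
    pub initial_backoff: Duration,
    pub max_backoff: Duration,
}

/// Runs `op` until it succeeds or `max_attempts` is reached, sleeping with
/// exponential backoff between attempts. `op` receives the 1-based attempt number.
pub fn retry_limited<T, E, F>(cfg: RetryConfig, mut op: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let mut backoff = cfg.initial_backoff;
    let mut attempt: u32 = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Give up and surface the last error once the budget is spent.
            Err(err) if attempt >= cfg.max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(backoff);
                backoff = (backoff * 2).min(cfg.max_backoff);
                attempt += 1;
            }
        }
    }
}
```

With a bound like this, a recovery path that talks to an unavailable service fails with the last error after a fixed number of attempts instead of blocking forever.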
Rachelint
added a commit
that referenced
this issue
Jun 20, 2023
## Rationale

Part of #799

## Detailed Changes

See title.

## Test Plan

None.
dust1
pushed a commit
to dust1/ceresdb
that referenced
this issue
Aug 9, 2023
dust1
pushed a commit
to dust1/ceresdb
that referenced
this issue
Aug 9, 2023
dust1
pushed a commit
to dust1/ceresdb
that referenced
this issue
Aug 9, 2023
dust1
pushed a commit
to dust1/ceresdb
that referenced
this issue
Aug 9, 2023
Most tasks are finished, so closing.
10 tasks
Describe This Problem
Currently, table recovery happens at the table level, but the WAL is stored at the shard level.
Recovery performance may be unsatisfactory, especially with the Kafka-based WAL.
Proposal
1. Split actual table recovery from schema, and refactor table engine
We should begin by modifying the high-level interfaces (`Schema` and `TableEngine`) to adapt them to the new recovery process.
Currently, the path through `Schema` and `TableEngine` when opening tables on a shard is like:
[...]
To modify the interfaces above so that all tables on a shard are opened together rather than one by one, the most troublesome place is:
+ [...] `open_tables_on_shard` to `Schema`.
My solution for this is:
+ [...] `TableEngine` directly, and just register the opened tables to `Schema`
+ feat: introduce `TableOperator` to encapsulate operation of tables #808
In this stage, we still keep the original interface of `TableEngine`; the path may be like:
[...]
+ [...] `TableEngine` interface to support shard level opening
tbc...
+ [...] `Schema` and `TableEngine` completely?
+ [...] `register_table` and `unregister_table` in `Schema`
+ [...] `TableEngine`
2. Implement shard-based table meta recovery
+ [...] `TableData` to `TableContext`, and extract the members which will be recovered from the manifest as the new `TableData`.
+ [...] `TableData` into the manifest on `create table` and `open table`, and unregister it on `drop table` and `close table`.
+ [...] `Manifest`, we just make use of the held `TableData` rather than scanning the persisted WALs.
+ [...] `TableData`s into `Manifest`, and we update the memory and storage in just one place.
3. Implement shard-based table data recovery
4. Other
+ [...] `rskafka` to support limited retry #1005
Additional Context
No response