Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature chunk trash bin #215

Draft
wants to merge 9 commits into
base: dev
Choose a base branch
from
Draft

Feature chunk trash bin #215

wants to merge 9 commits into from

Conversation

ictus4u
Copy link
Collaborator

@ictus4u ictus4u commented Oct 10, 2024

Add a safety layer in case of chunk removal by SaunaFS. Instead, the chunks files will go to a temporary trash bin folder.

Copy link
Contributor

@lgsilva3087 lgsilva3087 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change, please check my suggestions.
Also, I see very nice unit tests but no integration tests. In this case, please consider to add it to see how it works in a real scenario.
The test could:

  • Create some files.
  • Set small trash time.
  • Wait for them to be deleted.
  • Check that the files were moved to the expected directory.

src/chunkserver/chunkserver-common/chunk_trash_manager.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/chunk_trash_manager.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/chunk_trash_manager.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/chunk_trash_manager.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/chunk_trash_manager.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/cmr_disk.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/cmr_disk.h Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/cmr_disk_unittest.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/disk_with_fd.h Outdated Show resolved Hide resolved
tests/ci_build/run-build.sh Outdated Show resolved Hide resolved
@ictus4u ictus4u force-pushed the feature-chunk-trash-bin branch from f06ea7c to bdeab93 Compare October 14, 2024 10:26
@ictus4u
Copy link
Collaborator Author

ictus4u commented Oct 14, 2024

...but no integration tests. In this case, please consider to add it...

@lgsilva3087 By your suggestion, the `test_unlink' was updated to check for the trashed files.

@ictus4u ictus4u requested a review from lgsilva3087 October 14, 2024 11:55
Copy link
Collaborator

@rolysr rolysr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please consider my minor suggestion for the integration test.


# Ensure thr "unlinked" files are trashed
trashed_chunk_files=$(find_all_trashed_chunks | wc -l)
if [ "${trashed_chunk_files}" -eq 0 ]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, consider in this test to use probably assert_equals or some similar approach for checking if the number of chunks in trash was the expected one, instead of just checking if there is some trash or not.

Copy link
Collaborator Author

@ictus4u ictus4u Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test the removal is mandated before the replication of the files finishes. I added a check on the initial state, a sample on the middle and another fuzzy check at the end. Would this suffice the integration test?

Copy link
Collaborator

@rolysr rolysr Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I consider it good now. Just take into account for comparing expected values with some variable value or function output to follow the conventions on other tests by performing this comparisons by using assert_equals or assert_not_equal . This is a minor change as it is most for having same coding style across all integration tests, but is beneficial for readability.

Copy link
Collaborator

@dmga44 dmga44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice change, though I think it can be improved.

From my understanding, the chunks are never really unlinked, just moved to the trash folder. We could set a trashtime for those chunks, to make sure they are not filling the drives forever.

Maybe this could be an idea for a next iteration of the chunk trash bin feature.

ralcolea
ralcolea previously approved these changes Oct 15, 2024
Copy link
Contributor

@ralcolea ralcolea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍 🔥 🚀
I’ve added a few minor suggestions for your consideration, but they’re not blockers for merging.

src/chunkserver/chunkserver-common/chunk_trash_manager.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/cmr_disk.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/disk_with_fd.h Outdated Show resolved Hide resolved
tests/test_suites/ShortSystemTests/test_unlink.sh Outdated Show resolved Hide resolved
@ictus4u
Copy link
Collaborator Author

ictus4u commented Oct 15, 2024

We could set a trashtime for those chunks

@dmga44 Yes, it is only changing the removal behavior. The trash time management is a responsibility for the Garbage collector.

@dmga44
Copy link
Collaborator

dmga44 commented Oct 15, 2024

The trash time management is a responsibility for the Garbage collector.

What Garbage collector?

@ictus4u
Copy link
Collaborator Author

ictus4u commented Oct 16, 2024

The trash time management is a responsibility for the Garbage collector.

What Garbage collector?

Currently WIP

ralcolea
ralcolea previously approved these changes Oct 16, 2024
Copy link
Contributor

@ralcolea ralcolea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @ictus4u 👍 🔥 🚀

Copy link
Collaborator

@uristdwarf uristdwarf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently WIP

In this case this PR should be marked as draft or the feature should be behind a feature flag (see my other comments in the code). I would prefer feature flag because this is already a very useful in high risk operations (upgrading, migrating from LizardFS etc.) and we should avoid long feature branches that take months.

I already love this feature (you could already achieve the deletion with a cron job), but putting a unfinished feature, that changes how chunkserver behaves significantly (it doesn't delete chunks, so chunkserver can get full over time without emptying the trash) and that is on by default on into dev is not something that should NOT be accepted at ALL in my opinion.

It's fine to have unfinished features in dev and even releases, but they should be disabled by default (for example prometheus is woefully unfinished, but needs to enabled specifically) so they don't affect anyone.

src/chunkserver/chunkserver-common/cmr_disk.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/chunk_trash_manager.cc Outdated Show resolved Hide resolved
lgsilva3087
lgsilva3087 previously approved these changes Oct 17, 2024
Copy link
Contributor

@lgsilva3087 lgsilva3087 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please see my comments.
Also: Consider Urmas' suggestion to add it as an option.

src/chunkserver/chunkserver-common/cmr_disk.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/cmr_disk.cc Outdated Show resolved Hide resolved
src/chunkserver/chunkserver-common/cmr_disk_unittest.cc Outdated Show resolved Hide resolved
fi
# Below value is non-deterministic, but it should be enough to test the
# feature. Avoiding to use a fixed value to prevent false positives.
random_files_count_before_removal=$(find_all_chunks | wc -l)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
random_files_count_before_removal=$(find_all_chunks | wc -l)
expected_chunks_count_before_removal=$(find_all_chunks | wc -l)

fi
# Below value is non-deterministic, but it should be enough to test the
# feature. Avoiding to use a fixed value to prevent false positives.
random_files_count_before_removal=$(find_all_chunks | wc -l)
rm -f "$file" "$xorfile"

# Wait for removing all the chunks
timeout="3 minutes"
if ! wait_for '[[ $(find_all_chunks | wc -l) == 0 ]]' "$timeout"; then
test_add_failure $'The following chunks were not removed:\n'"$(find_all_chunks)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test_add_failure $'The following chunks were not removed:\n'"$(find_all_chunks)"
test_add_failure $'The following chunks are still pending for removal:\n'"$(find_all_chunks)"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider a more descriptive message as removal could take more than 3 minutes in some cases. For instance consider setting this master param to low value to trigger much faster chunk operations: OPERATIONS_DELAY_INIT = 2

antuan96314
antuan96314 previously approved these changes Oct 21, 2024
Copy link
Contributor

@antuan96314 antuan96314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ictus4u ictus4u force-pushed the feature-chunk-trash-bin branch from c4da37e to 2d0e144 Compare October 28, 2024 12:26
@ictus4u ictus4u dismissed stale reviews from antuan96314 and lgsilva3087 via f2d465b October 28, 2024 12:47
@ictus4u ictus4u force-pushed the feature-chunk-trash-bin branch from 72b2204 to a46c8d8 Compare November 4, 2024 07:18
This commit provides additional safety measure for the system,
temporarilly allowing easier recovery for deleted chunks.
ictus4u and others added 8 commits November 4, 2024 09:19
For splitting responsibility and better tests.
Co-authored-by: Antuan <antuan@leil.io>
Co-authored-by: Guillex <guillex@leil.io>
Co-authored-by: Antuan <antuan@leil.io>
Implement Chunk Trash garbage collection with configuration-based
trash management, including methods for moving files to trash,
deleting expired files, and freeing space based on disk thresholds.

Co-authored-by: Antuan <antuan@leil.io>
@ictus4u ictus4u force-pushed the feature-chunk-trash-bin branch from a46c8d8 to 16c6a2d Compare November 4, 2024 08:19
Comment on lines +57 to +59
if [ "${trashed_chunks_count}" -eq 0 ]; then
test_add_failure $'The removed chunks were not moved to the trash folder'
fi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this kind of comparisons to be treated like this on the tests

# Ensure the "unlinked" files are trashed
assert_equals "0" "${trashed_chunks_count}"

In order to be conventionally align to the other tests.

@ictus4u ictus4u marked this pull request as draft November 29, 2024 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants