Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring Large File Upload Continification Logic #286

Merged
merged 11 commits into from
Jul 3, 2023

Conversation

przadka
Copy link
Collaborator

@przadka przadka commented Jun 17, 2023

This PR addresses the issue Backblaze#381, "Unify large file continuation preventors". The common logic from _find_unfinished_file_by_plan_id and _match_unfinished_file_if_possible is extracted into a new function named _find_matching_unfinished_file, which now handles their shared behavior. This new function includes three parameters, log_rejections, check_file_info_without_large_file_sha1, and eager_mode, to allow for flexible control over its behavior. Additionally, docstrings for all functions have been enhanced to better express their purpose and functionality.

Key changes and observations:

  1. Most of the code between the two original functions was repetitive, except _find_unfinished_file_by_plan_id had significantly less logging.
  2. The original handling of None encryption appeared inconsistent. Specifically, the way None encryption was handled seemed to vary, which was identified as a potential bug. In this refactor, we've standardized the handling of None encryption. The new condition for this is: if encryption is not None and encryption != file_.encryption:. This means that if encryption is not None and the encryption type does not match the file encryption, the condition will be true, leading to a more consistent and correct treatment of this edge case.
  3. The original behavior for sha1_sum check was inconsistent. The new approach is less penalizing, re-uploading only the part with a mismatching sha1, similar to _find_unfinished_file_by_plan_id.
  4. Previously, _match_unfinished_file_if_possible was checking file_info_without_large_file_sha1, while _find_unfinished_file_by_plan_id was not. In the refactored function _find_matching_unfinished_file, this check is now controlled by a parameter (check_file_info_without_large_file_sha1), giving us flexibility to handle this check based on the specific needs of the caller function.
  5. The new function also checks for part size and part sha, previously only checked by _match_unfinished_file_if_possible. The updated approach reuploads parts with non-matching sha or size, rather than failing completely.
  6. The original handling of plan_id logic was not explicit, and while it hasn't been changed in this PR, it is noted for potential future improvement.

Potential next steps (not addressed in this PR to maintain simplicity):

  1. Make plan_id logic more explicit in the code.
  2. Unify the approach to logging. Currently, the refactored functions have different approaches, which seems unnecessary.
  3. Consider removing _match_unfinished_file_if_possible and _find_unfinished_file_by_plan_id entirely. They are only called in a single place in the code each, so perhaps we could replace these calls with calls to _find_matching_unfinished_file.

@@ -17,6 +17,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Infrastructure
* Replaced `pyflakes` with `ruff` for linting
* Automatically set copyright date when generating the docs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you accidentally started from #282 branch
btw, was that ever submitted upstream?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are right. i did by mistake. it is actually my branch and should be already merged, it was reviewed. I think we can merge this commit with this PR. ok?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we can - basically workflow is, you create PR in reef, get it reviewed, then recreate it for upstream (backblaze repo).
Also we don't want stuff to get stuck on some other changes.

I did that a lot recently and it was no fun;p

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i can rebase (i guess this is the correct term) and resubmit my branch. that would be the reasonable approach?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

little to late for a rebase if it was already submitted for review (the golden rule of rebasing ;) )
thankfully it is not that big change that got bundled in here - so we can just ignore it I guess for now

whatever we do - please create PR for date change to upstream

b2sdk/transfer/emerge/executor.py Outdated Show resolved Hide resolved
b2sdk/transfer/emerge/executor.py Show resolved Hide resolved
b2sdk/transfer/emerge/executor.py Outdated Show resolved Hide resolved
b2sdk/transfer/emerge/executor.py Outdated Show resolved Hide resolved
This commit refactors the logic used to search for unfinished file
uploads in the bucket, primarily focusing on the
`_find_unfinished_file_by_plan_id`, `_match_unfinished_file_if_possible`,
and `_find_matching_unfinished_file` functions.

Key changes include:

- Extracting shared logic into `_find_matching_unfinished_file`. This
  increased code reuse, and improved code maintenance and readability.

- Refactoring `_find_unfinished_file_by_plan_id` and
  `_match_unfinished_file_if_possible` to use the shared logic, enhancing
  their readability and reliability.

- Enhancing function documentation. Docstrings were updated to be more
  informative and adhere to a standard template.

- Adding a note indicating 'listFiles' access is required for the
  operations.
@mjurbanski-reef mjurbanski-reef merged commit 0ad8ae8 into master Jul 3, 2023
21 checks passed
@mjurbanski-reef mjurbanski-reef deleted the large-file-cont branch July 3, 2023 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants