Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor replicate execution method to not iterate the origin bucket #2906

Merged
merged 3 commits into from
Oct 30, 2020

Conversation

yeya24
Copy link
Contributor

@yeya24 yeya24 commented Jul 17, 2020

Signed-off-by: Ben Ye yb532204897@gmail.com

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

In the previous implementation, we are iterating the origin bucket, and each time, we use metaFetcher to fetch all meta files. But this seems not necessary, we can just fetch meta files once at the beginning.

Verification

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments but overall LGTM!

pkg/replicate/scheme.go Show resolved Hide resolved
metas, partials, err := rs.fetcher.Fetch(ctx)
if err != nil && metas == nil {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should log the error if metas != nil && err != nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is good to log the error here, but kind of duplicate.

I just check the code, only here returns metas and error at the same time. In this case, partials != nil, so we will log the partial meta errors later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm the if err != nil && metas == nil { feels like code smell to me though. It just looks wrong from outside which can lead to bugs in future:

  • Inside implementation can change, and metas can be produced without partial for some reason - then we will miss an error.
  • Even if we say it makes sense, maybe another person will come here and refactor it anyway as it looks like error can be missed.

On other hand, shouldn't replicate FAIL on any error spotted? Even if we failed to read one block? Sometimes we heavily rely on replication status so this has to be heavily tested and controlled, ideally with failed metric being somewhere (:

@yeya24
Copy link
Contributor Author

yeya24 commented Jul 19, 2020

CI failure is unrelated. Please take a look at this pr again

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT the CI failure seems related to the changes

@yeya24
Copy link
Contributor Author

yeya24 commented Jul 29, 2020

AFAICT the CI failure seems related to the changes

Thanks, the previous problem was caused by the make examples-in-container target, and that should be fixed it in the latest commit 😄 .

pkg/replicate/scheme.go Outdated Show resolved Hide resolved
pkg/replicate/scheme.go Show resolved Hide resolved
Copy link
Contributor

@OGKevin OGKevin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yeya24
Copy link
Contributor Author

yeya24 commented Sep 2, 2020

Can I get a review for this? I think it is close.

Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, some comments though, before merging (:

pkg/replicate/scheme.go Show resolved Hide resolved
metas, partials, err := rs.fetcher.Fetch(ctx)
if err != nil && metas == nil {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm the if err != nil && metas == nil { feels like code smell to me though. It just looks wrong from outside which can lead to bugs in future:

  • Inside implementation can change, and metas can be produced without partial for some reason - then we will miss an error.
  • Even if we say it makes sense, maybe another person will come here and refactor it anyway as it looks like error can be missed.

On other hand, shouldn't replicate FAIL on any error spotted? Even if we failed to read one block? Sometimes we heavily rely on replication status so this has to be heavily tested and controlled, ideally with failed metric being somewhere (:

return nil
}
for id, partialError := range partials {
level.Info(rs.logger).Log("msg", "failed to fetch block meta. Skipping.", "block_uuid", id.String(), "err", partialError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this failure? This is actually valid state of blocks being potentially uploaded, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like level.Info(rs.logger).Log("msg", "block meta not uploaded yet. Skipping.", "block_uuid", id.String()) r

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PartialError has 2 cases, one is meta.json not found, another one is meta.json corrupted. What about the corrupted case, is it an error or not?

For different obj store implementation, is it possible that one uploaded file is corrupted? If it is this case, how can we differentiate the case that one file is totally corrupted or still being uploaded?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrupted case, is it an error or not?

For current flow it's not error. We consider corrupted meta same as no meta.json as the reason is usually same - partial upload (:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Updated.

@kakkoyun
Copy link
Member

@yeya24 This looks fine to me. You need to rebase I believe.
Ping @bwplotka are you satisfied with the state of the PR now?

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded to comment, that still needs to be addressed, otherwise LGTM!

Thanks!

return nil
}
for id, partialError := range partials {
level.Info(rs.logger).Log("msg", "failed to fetch block meta. Skipping.", "block_uuid", id.String(), "err", partialError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrupted case, is it an error or not?

For current flow it's not error. We consider corrupted meta same as no meta.json as the reason is usually same - partial upload (:

Signed-off-by: Ben Ye <yb532204897@gmail.com>
Signed-off-by: Ben Ye <yb532204897@gmail.com>
Signed-off-by: Ben Ye <yb532204897@gmail.com>
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Changelog entry could be better, but we can fix it later.

@@ -28,6 +28,12 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#3331](https://github.com/thanos-io/thanos/pull/3331) Disable Azure blob exception logging
- [#3341](https://github.com/thanos-io/thanos/pull/3341) Disable Azure blob syslog exception logging

### Changed

- [#2906](https://github.com/thanos-io/thanos/pull/2906) Tools: Refactor Bucket replicate execution. Removed all `thanos_replicate_origin_.*` metrics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically not need to mention refactor, just metric changes (:

@bwplotka bwplotka merged commit 5abda52 into thanos-io:master Oct 30, 2020
@OGKevin
Copy link
Contributor

OGKevin commented Oct 30, 2020

Whoop 🥳 🎊 I'll work on #2979.

@OGKevin
Copy link
Contributor

OGKevin commented Oct 30, 2020

Nice work @yeya24

@bwplotka
Copy link
Member

@bwplotka
Copy link
Member

💪

@yeya24 yeya24 deleted the rework-replicate-exec branch October 30, 2020 22:20
Oghenebrume50 pushed a commit to Oghenebrume50/thanos that referenced this pull request Dec 7, 2020
…hanos-io#2906)

* Refactor replicate execution

Signed-off-by: Ben Ye <yb532204897@gmail.com>

* return error when fetcher gets an error

Signed-off-by: Ben Ye <yb532204897@gmail.com>

* address feedback

Signed-off-by: Ben Ye <yb532204897@gmail.com>
Signed-off-by: Oghenebrume50 <raphlbrume@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants