Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DSET-4080 fix recovery from error state #36

Merged
merged 5 commits into from
Jun 21, 2023
Merged

DSET-4080 fix recovery from error state #36

merged 5 commits into from
Jun 21, 2023

Conversation

zdaratom-s1
Copy link
Collaborator

@zdaratom-s1 zdaratom-s1 commented Jun 20, 2023

Problem

  • when DataSet returns Retryable error we try to retry with timeout. Once this time is over we stay in error mode since last error is never overriden, and new event handling is rejected.

Solution

  • keep timestamp of last error and consider it while evaluate last error timestamp in order to recover from error mode.

@codecov-commenter
Copy link

codecov-commenter commented Jun 20, 2023

Codecov Report

Merging #36 (d6a0ea3) into main (1295b6f) will increase coverage by 0.98%.
The diff coverage is 85.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #36      +/-   ##
==========================================
+ Coverage   73.86%   74.84%   +0.98%     
==========================================
  Files          10       10              
  Lines        1377     1387      +10     
==========================================
+ Hits         1017     1038      +21     
+ Misses        298      290       -8     
+ Partials       62       59       -3     
Impacted Files Coverage Δ
pkg/client/client.go 88.53% <85.00%> (+2.85%) ⬆️

... and 1 file with indirect coverage changes

Problem
- when DataSet returns Retryable error we try to retry with timeout. Once this time is over we stay in error mode since last error is never overriden, and new event handling is rejected.

Solution
- keep timestamp of last error and consider it while evaluate last error timestamp in order to recover from error mode.
@zdaratom-s1 zdaratom-s1 marked this pull request as ready for review June 21, 2023 09:26
Copy link
Collaborator

@tomaz-s1 tomaz-s1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a change like that, I would feel more comfortable if we also had some end to end tests - I think there should be the mock Python server in place which we should be able to utilize for that.

@zdaratom-s1
Copy link
Collaborator Author

@tomaz-s1 I have more e2e test like unit (blackbox) test using waiting etc. but that test fails on pre commit hook despite the fact that in idea its passes. So after all I decided to mock the state (whitebox test), its also much faster.

@tomaz-s1
Copy link
Collaborator

@zdaratom Yeah, it actually looks like we don't have a setup for such tests in this repo, but we have it in the otel one (https://github.com/scalyr/dataset-exporter-opentelemetry/tree/main/dummy-server), so we should add an end to end test like this in that repo once the PR is merged.

@zdaratom-s1 zdaratom-s1 requested a review from tomaz-s1 June 21, 2023 13:12
@zdaratom-s1
Copy link
Collaborator Author

sounds reasonable, I will add TODO into the Jira

@zdaratom-s1 zdaratom-s1 added this pull request to the merge queue Jun 21, 2023
Merged via the queue into scalyr:main with commit dab2a50 Jun 21, 2023
@zdaratom-s1 zdaratom-s1 deleted the DSET-4080 branch June 21, 2023 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants