[Ingest Manager] Retryable downloads of beats #19102

Merged

michalpristas

What does this PR do?

Background:
When the agent downloads an artifact and the checksum does not match, the operation fails. When the download is attempted again later (for example because of a new config), it may be skipped, either because the earlier download itself succeeded or because the packed artifacts are invalid.
The agent cleans up a downloaded artifact only when the download returns an error. If the download succeeds but the artifact is corrupted, we can end up in a loop: the agent tries to verify the artifact, finds that it is incorrect, fails, and then repeats the same steps.

This PR changes this behavior a bit.

If Verify fails, it cleans up the downloaded files (artifact + hash).

It also introduces a retryable block within the operation flow.
We know that download + verify can be error prone, so we can retry them if a failure happens (only if retry.enabled == true).

What this means for the agent is that when it tries to install from a corrupted artifact, it removes the artifact during Verify and downloads it again.

Why is it important?

Makes the download scenario more robust and the repair loop faster.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test

  • Build a snapshot package
  • Modify one of the sha files
  • Enable retry by setting retry.enabled: true
  • Run the agent

See that it fails with the packed artifact, waits 30s, and then downloads the artifact from the web.

@elasticmachine

Pinging @elastic/ingest-management (Team:Ingest Management)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jun 10, 2020
@elasticmachine

elasticmachine commented Jun 10, 2020

💚 Build Succeeded

Build stats

  • Build Cause: [Pull request #19102 updated]

  • Start Time: 2020-06-12T07:14:55.789+0000

  • Duration: 35 min 52 sec

Test stats 🧪

Test Results
Failed 0
Passed 537
Skipped 127
Total 664

Steps errors

  • Name: Report to Codecov
    • Description: curl -sSLo codecov https://codecov.io/bash; for i in auditbeat filebeat heartbeat libbeat metricbeat packetbeat winlogbeat journalbeat; do FILE="${i}/build/coverage/full.cov"; if [ -f "${FILE}" ]; then bash codecov -f "${FILE}"; fi; done

    • Duration: 2 min 22 sec

    • Start Time: 2020-06-12T07:40:19.878+0000

@blakerouse blakerouse left a comment

I really like this, great to see a clean up and retry.

I think we should really cover this path. Could you add a unit test?

@blakerouse blakerouse left a comment

Thanks for adding the test, looks great!

I left a comment on what needs to be updated for it to land. It is just a small change required by the PR I merged with the GRPC flip.

// examples:
// - Start does not need to run if process is running
// - Fetch does not need to run if package is already present
func (o *retryableOperations) Check() (bool, error) {

Check() has become Check(app Application), so this needs to be updated.

@michalpristas michalpristas added the needs_backport PR is waiting to be backported to other branches. label Jun 12, 2020
@michalpristas michalpristas merged commit eaf5e2f into elastic:master Jun 12, 2020
michalpristas added a commit to michalpristas/beats that referenced this pull request Jun 12, 2020
[Ingest Manager] Retryable downloads of beats (elastic#19102)
v1v added a commit to v1v/beats that referenced this pull request Jun 12, 2020
…ngs-archive

* upstream/master: (119 commits)
  Update filebeat input docs (elastic#19110)
  Add ECS fields from log pipeline of PostgreSQL (elastic#19127)
  Init package libbeat/statestore (elastic#19117)
  [Ingest Manager] Retryable downloads of beats (elastic#19102)
  [DOCS] Add output.console to Functionbeat doc and Functionbeat reference file (elastic#18965)
  Add compatibility info (elastic#18929)
  Set ecszap version to v0.2.0 (elastic#19106)
  [filebeat][httpjson] Fix unit test function call (elastic#19124)
  [Filebeat][httpjson] Adds oauth2 support for httpjson input (elastic#18892)
  Allow host.* fields to be disabled in Suricata module (elastic#19107)
  Make selector string casing configurable (elastic#18854)
  Switch the GRPC communication where Agent is running the server and the beats are connecting back to Agent (elastic#18973)
  Disable host.* fields by default for netflow module (elastic#19087)
  Automatically fill zube teams on backports if available (elastic#18924)
  Fix crash on vsphere module (elastic#19078)
  [Ingest Manager] Download snapshot artifacts from snapshots repo (elastic#18685)
  [Ingest Manager] Basic Elastic Agent documentation (elastic#19030)
  Make user.id a string in system/users, in line with ECS (elastic#19019)
  [docs] Add 7.8 release highlights placeholder file (elastic#18493)
  Fix translate_sid's empty target field handling (elastic#18991)
  ...
michalpristas added a commit that referenced this pull request Jun 12, 2020
[Ingest Manager] Retryable downloads of beats (#19102)
melchiormoulin pushed a commit to melchiormoulin/beats that referenced this pull request Oct 14, 2020
Labels
bug Ingest Management:beta1 Group issues for ingest management beta1 needs_backport PR is waiting to be backported to other branches. review
3 participants