Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Timeout functionality #51

Open
jbrockopp opened this issue Mar 14, 2019 · 6 comments
Open

Add Timeout functionality #51

jbrockopp opened this issue Mar 14, 2019 · 6 comments

Comments

@jbrockopp
Copy link
Contributor

jbrockopp commented Mar 14, 2019

Currently, if we experience issues with our S3 cache service provider, it can cause the plugin to "hang". This causes the build to be stuck in a "running" state until the build reaches the timeout (usually 1 hour).

A single occurrence of the issue is not a problem, but when it happens across the entire enterprise, it causes the number of pending builds in the queue to grow massively because our agent capacity is consumed by all the "stuck" cache builds.

The logs produced in the step are:

time="2019-03-14T13:29:38Z" level=info msg="No path specified. Creating default"
time="2019-03-14T13:29:38Z" level=info msg="No fallback_path specified. Creating default"
time="2019-03-14T13:29:38Z" level=info msg="No flush_path specified. Creating default"
time="2019-03-14T13:29:38Z" level=info msg="No filename specified. Creating default"
time="2019-03-14T13:29:38Z" level=info msg="Installing new ca certificate at /etc/ssl/certs/ca-certificates.crt"
time="2019-03-14T13:29:38Z" level=info msg="Restoring cache at /prod-drone-cache/drone/toss-test/master/archive.tar"
time="2019-03-14T13:29:38Z" level=info msg="Retrieving file in prod-drone-cache at drone/toss-test/master/archive.tar"
time="2019-03-14T13:29:38Z" level=info msg="Copying object from the server"

This leads me to believe it's getting hung on line #93 because we never see the log message from line #99.

log.Infof("Copying object from the server")
numBytes, err := io.Copy(dst, object)
if err != nil {
return err
}
log.Infof("Downloaded %s from server", humanize.Bytes(uint64(numBytes)))
return nil

To remediate this issue, we decided to create our own timeout feature that we embedded into the plugin and drone/drone-cache-lib.

Would this be a feature the Drone admins were willing to accept into the plugin? I've attached the changesets below so you can see how we went about resolving this issue.

Repo Changes:

@tboerger
Copy link
Contributor

IMHO it's a really interesting feature, other opinions?

@bradrydzewski
Copy link
Member

bradrydzewski commented Mar 14, 2019

we might be able to simplify by using context and WithTimeout like this:
http://ixday.github.io/post/golang-cancel-copy/

-numBytes, err := io.Copy(dst, object)
+err := Copy(ctx, dst, object)

I am not sure if the code in the referenced blog post is any good. But I like the approach of abstracting this to a separate function that resembles io.Copy but accepts and uses context.

@jbrockopp
Copy link
Contributor Author

@bradrydzewski I agree that would be a much simpler implementation and I'm happy to try and submit something to that nature.

My only concern is, what if the plugin "hangs" on a different part of the function? We'd have to be very clear on the implementation that the timeout flag for the plugin was only a timeout for how long to wait for the copy function.

To be specific, what if the minio GetObject function hangs inside the Get function?

func (s *s3Storage) Get(p string, dst io.Writer) error {
bucket, key := splitBucket(p)
if len(bucket) == 0 || len(key) == 0 {
return fmt.Errorf("Invalid path %s", p)
}
log.Infof("Retrieving file in %s at %s", bucket, key)
exists, err := s.client.BucketExists(bucket)
if !exists {
return err
}
object, err := s.client.GetObject(bucket, key, minio.GetObjectOptions{})
if err != nil {
return err
}
log.Infof("Copying object from the server")
numBytes, err := io.Copy(dst, object)
if err != nil {
return err
}
log.Infof("Downloaded %s from server", humanize.Bytes(uint64(numBytes)))
return nil
}

Sure, we could swap to using the GetObjectWithContext function but then we deal with having the context be applied to both the Copy function (your pseudo code provided above) and the GetObjectWithContext function.

This could have unintended consequences if both the GetObjectWithContext and Copy function don't exceed the timeout, but combined together they could exceed the timeout.

Do you have thoughts on that?

@Baccata
Copy link

Baccata commented Jun 25, 2019

I seem to be having a similar issue, when the AWS IAM role available to drone does not let him create a bucket : The logs indicate that

Logs (seems to be hanging there)

1 | time="2019-06-25T10:34:54Z" level=info msg="No path specified. Creating default"
-- | --
2 | time="2019-06-25T10:34:54Z" level=info msg="No fallback_path specified. Creating default"
3 | time="2019-06-25T10:34:54Z" level=info msg="No flush_path specified. Creating default"
4 | time="2019-06-25T10:34:54Z" level=info msg="No filename specified. Creating default"
5 | time="2019-06-25T10:34:54Z" level=info msg="Rebuilding cache at /omelois/drone-test/master/archive.tar"
6 | time="2019-06-25T10:34:54Z" level=info msg="Rebuilding cache at [pipo] to /omelois/drone-test/master/archive.tar"
7 | time="2019-06-25T10:34:54Z" level=info msg="Uploading to bucket omelois at drone-test/master/archive.tar"

Pipeline

kind: pipeline
name: default

steps:
- name: restore
  image: plugins/s3-cache
  settings:
    pull: true
    restore: true

- name: build
  image: bash
  commands:
    - mkdir pipo
    - echo "hello" > pipo/hello.txt

- name: rebuild
  image: plugins/s3-cache
  settings:
    pull: true
    rebuild: true
    mount:
      - pipo
    when:
      event: push

@wadeholler
Copy link

Our IAM role is enabled - other s3 actions are being performaned from the same kubernetes cluster that our drone install runs in - and I think we are experiancing the same behavior as @Baccata describes.


time="2020-05-21T18:46:44Z" level=info msg="No path specified. Creating default"
--
2 | time="2020-05-21T18:46:44Z" level=info msg="No fallback_path specified. Creating default"
3 | time="2020-05-21T18:46:44Z" level=info msg="No flush_path specified. Creating default"
4 | time="2020-05-21T18:46:44Z" level=info msg="No filename specified. Creating default"
5 | time="2020-05-21T18:46:44Z" level=info msg="Rebuilding cache at /obfuscated-drone-s3-mvn-cache/~obfuscated/obfuscated-java/master/archive.tar"
6 | time="2020-05-21T18:46:44Z" level=info msg="Rebuilding cache at [.m2/repository] to /obfuscated-drone-s3-mvn-cache/~obfuscated/obfuscated-java/master/archive.tar"
7 | time="2020-05-21T18:46:44Z" level=info msg="Uploading to bucket obfuscated-drone-s3-mvn-cache at ~obfuscated/obfuscated-java/master/archive.tar"


@wadeholler
Copy link

update - this was solved by entering the endpoint for our region of aws.
still struggling with how to make it "backup" /root/.m2/repository and restore it before the mvn build though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants