
"Cache already exists" due to concurrency #537

Closed
Vampire opened this issue Aug 2, 2020 · 8 comments · Fixed by #558
Closed

"Cache already exists" due to concurrency #537

Vampire opened this issue Aug 2, 2020 · 8 comments · Fixed by #558
Labels
bug Something isn't working

Comments

Vampire commented Aug 2, 2020

If you have a look at this workflow run, you can see the problem: https://github.com/Vampire/setup-wsl/runs/938421512?check_suite_focus=true
Here is the timeline of the problem:

22:30:04 Job `Test 'Debian' distribution` / Step `Execute action`: ##[debug]Resolved Keys:
22:30:04 Job `Test 'Debian' distribution` / Step `Execute action`: Cache not found => download and later add to cache
22:30:08 Job `Test additional packages for 'Debian' distribution` / Step `Execute action`: ##[debug]Resolved Keys:
22:30:08 Job `Test additional packages for 'Debian' distribution` / Step `Execute action`: Cache not found => download and later add to cache
22:30:09 Job `Test 'Debian' distribution` / Step `Execute action`: ##[debug]Reserving Cache
22:30:11 Job `Test additional packages for 'Debian' distribution` / Step `Execute action`: ##[debug]Reserving Cache
22:30:11 Job `Test additional packages for 'Debian' distribution` / Step `Execute action`: ##[error]reserveCache failed
22:30:14 Job `Test 'Debian' distribution` / Step `Execute action`: Cache saved successfully

The problem is clear: it's the typical problem you get with lazy initialization when double-checked locking is not done properly.
Both jobs see that the cache is not present, prepare the thing to be cached, and then try to cache it, and the second one fails at saving.
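
To illustrate, here is a minimal sketch of the racy pattern both jobs run (paths, key, and downloadDistribution are hypothetical placeholders, not taken from setup-wsl):

import * as cache from '@actions/cache';

// Hypothetical stand-ins for what the action actually caches and downloads.
const paths = ['~/.wsl-distributions'];
const key = 'distribution-debian';
async function downloadDistribution(): Promise<void> { /* fetch the distro */ }

const hit = await cache.restoreCache(paths, key);
if (hit === undefined) {
  // Both jobs take this branch before either has saved ...
  await downloadDistribution();
  // ... so the slower job's saveCache fails with "Cache already exists".
  await cache.saveCache(paths, key);
}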

Is there a clean way to solve this, or could something be added, for example a method cache.exists() that only checks whether the cache exists? Doing a full restore before the save seems overkill just to check whether another job has filled the cache in the meantime. And just catching the exception feels like "exceptions for flow control" when a simple test method could check up front.

Vampire added the bug label on Aug 2, 2020
JojOatXGME commented Aug 11, 2020

I created my own action, and I seem to get a warning caused by this error quite reliably on every build (if the cache was not already present before the build started). It is annoying.

##[warning]Error: reserveCache failed: Cache already exists. Scope: refs/heads/***, Key: ***, Version: ***

If we look at the actions/cache action, it already implements a special case to prevent such warnings. I did the same in my action.

// https://github.com/actions/cache/blob/d29c1df198dd38ac88e0ae23a2881b99c2d20e68/src/save.ts#L43
} catch (error) {
    if (error.name === cache.ValidationError.name) {
        throw error;
    } else if (error.name === cache.ReserveCacheError.name) {
        core.info(error.message);
    } else {
        utils.logWarning(error.message);
    }
}
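
For reference, a self-contained sketch of the same pattern (trySave, paths, and key are placeholder names; only the error handling mirrors the excerpt above):

import * as core from '@actions/core';
import * as cache from '@actions/cache';

async function trySave(paths: string[], key: string): Promise<void> {
  try {
    await cache.saveCache(paths, key);
  } catch (error) {
    if (error.name === cache.ValidationError.name) {
      throw error;                 // invalid input: fail the step
    } else if (error.name === cache.ReserveCacheError.name) {
      core.info(error.message);    // another job won the race: just log it
    } else {
      core.warning(error.message); // anything else: warn, but don't fail
    }
  }
}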

This ReserveCacheError is supposed to be thrown here:

// https://github.com/actions/toolkit/blob/9ad01e4fd30025e8858650d38e95cfe9193a3222/packages/cache/src/cache.ts#L146
const cacheId = await cacheHttpClient.reserveCache(key, paths, {
  compressionMethod
})
if (cacheId === -1) {
  throw new ReserveCacheError(
    `Unable to reserve cache with key ${key}, another job may be creating this cache.`
  )
}

Unfortunately, another generic error seems to be thrown before we reach that point. I think it is initially thrown by the @actions/http-client package and converted in requestUtils.retry(). However, I have only looked at it for two or three hours on GitHub and haven't debugged it. Maybe it works if the cache is not yet created but only reserved. In any case, I think this library should not handle these cases so differently and should allow us to omit the warning.

PS: Something like cache.exists() would still have a race condition, because another job could create the cache after you call this method but before you upload your own data.

Vampire (Author) commented Aug 11, 2020

Then maybe a cache.reserve() method that reserves the cache if it is not reserved or present yet, and returns a boolean instead of throwing an exception.
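
A sketch of how such a hypothetical API could be used (cache.reserve() does not exist in @actions/cache; this is purely the proposal):

// Hypothetical: reserve() returns true only if this job obtained the
// reservation; false means another job already reserved or saved it.
if (await cache.reserve(key, paths)) {
  await cache.saveCache(paths, key);
} else {
  core.info(`Cache ${key} is already reserved or present, skipping save`);
}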

Btw. I copied the exception handling from actions/cache too in the meantime :-D

JojOatXGME commented Aug 11, 2020

Yes, returning a boolean instead of throwing an exception might be a better API. Anyway, I would already be happy if @actions/cache reliably threw ReserveCacheError instead of (at least sometimes) throwing a generic Error.

dhadka (Member) commented Aug 16, 2020

Thanks for reporting this. There's definitely an issue with handling non-successful status codes, and it looks like a series of changes led up to it. Caching used to call the get method in http-client, which returned an HTTP response object with the status code. That was changed to getJson, which instead throws an error for 3xx, 4xx, and 5xx response codes, but the caching module is still just checking the status code in the response.

If the cache already exists, the server returns a 409 Conflict, so http-client throws an error, whereas the expected behavior is for reserveCache to return -1. I'll take a look at this next week.
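
In other words, the 409 needs to be translated back into the -1 sentinel. A sketch of that mapping, assuming http-client exposes the status code on the thrown error as actions/http-client#32 proposes (reserveCache and compressionMethod are the toolkit internals quoted earlier):

import { HttpClientError } from '@actions/http-client';

let cacheId: number;
try {
  cacheId = await cacheHttpClient.reserveCache(key, paths, {compressionMethod});
} catch (error) {
  if (error instanceof HttpClientError && error.statusCode === 409) {
    cacheId = -1; // cache already exists or is reserved by another job
  } else {
    throw error;
  }
}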

dhadka (Member) commented Aug 18, 2020

Before fixing the cache module, we will need to fix http-client so it provides the status code in a type-safe way (since actions/toolkit uses strict type checking): actions/http-client#32

weibeld commented Oct 9, 2020

Is there any timeline for when this will be fixed?

Currently, I'm using the following workaround:

import * as core from '@actions/core';
import * as cache from '@actions/cache';

try {
  await cache.saveCache(paths, key);
} catch (err) {
  // The thrown error is a generic Error, so match on its message.
  if (err.message.includes("Cache already exists")) {
    core.info(`Cache entry ${key} has already been created by another workflow`);
  } else {
    throw err;
  }
}

This seems to work to catch and neutralise the error. The error is detected through its message because its type seems to be just Error rather than the more useful ReserveCacheError intended for this situation, as described in @JojOatXGME's comment. However, since it's a race condition, it's hard to test and debug properly.

weibeld added a commit to weibeld/install-ajv-cli that referenced this issue Oct 10, 2020
In the previous code there was a potential race condition when two or
more workflows checked the cache at the same time:

1. Workflow 1: cache miss; Workflow 2: cache miss
2. Workflow 1: install; Workflow 2: install
3. Workflow 1: save cache; Workflow 2: save cache <<< RACE CONDITION

This resulted in the step failing with the following error:

Error: reserveCache failed: Cache already exists. Scope: refs/heads/master,
Key: <key>, Version: <...>

It's a bug in the @actions/cache package and it's discussed in the
following issue: actions/toolkit#537

The correct behaviour would be for Workflow 2 to simply drop its attempt
to save the cache entry when this situation occurs.

For now, this code uses a workaround that fixes the issue by detecting
and catching the above error based on its error message.

dhadka (Member) commented Oct 12, 2020

@weibeld Thanks for the ping. The changes to http-client have been merged and the new version (1.0.9) published to NPM. I'll try to get the rest of the fix described above implemented this week.

dhadka (Member) commented Oct 12, 2020

PR for the cache module: #558

Once this is merged and published, we'll also need to bump the version used by the actions/cache action.
