Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expiry javadoc is a bit misleading #217

Closed
facboy opened this issue Jan 31, 2018 · 19 comments
Closed

Expiry javadoc is a bit misleading #217

facboy opened this issue Jan 31, 2018 · 19 comments

Comments

@facboy
Copy link
Contributor

facboy commented Jan 31, 2018

The javadoc for com.github.benmanes.caffeine.cache.Expiry states:

To indicate no expiration an entry may be given an excessively long period, such as Long#MAX_VALUE.

This isn't strictly accurate as if you use Long.MAX_VALUE the expiry (now + duration) overflows and you end up with a negative eviction time, ie the entry gets evicted almost immediately.

It looks like this scenario is guarded against in the Async code by restricting the effective duration to (Long.MAX_VALUE >> 1) but this isn't done in com.github.benmanes.caffeine.cache.BoundedLocalCache.

@ben-manes
Copy link
Owner

Sorry about that. We should add some unit tests against this then.

The time calculation should be tolerant to overflows by following the advice in System.nanoTime. It should be using t1 - t0 < 0 instead of t1 < t0 to avoid sign assumptions. Its easy mistake to do, so if I did that again then we should fix it. Otherwise, an overflow should be okay.

If I recall correctly, the Async situation is slightly different. It is used as a signal that the future is in-flight and should not be evaluated for expiration. Since the future can complete anytime, it is helpful to have that flag separated for querying the future's state. By capping the expiry that a user can set, it lets us differentiate between the states. But I don't recall enough if that safe-guard was necessary, as there is a lot of mental overhead to keep track of these details.

Can you write a unit test and we can fix your issue?

@ben-manes
Copy link
Owner

Hi @facboy,

I did a quick sanity check and was unable to reproduce this issue. We'll keep digging and make sure there is test coverage. Did you observe this bug or was it a mistake when reading the JavaDoc?

@facboy
Copy link
Contributor Author

facboy commented Feb 4, 2018

sorry i've not had a chance to write a unit test yet. i observed the bug.

@ben-manes
Copy link
Owner

In case this helps, here is my passing test. Hopefully you can tweak it to fail and we can understand the problem.

@Test
public void issue217() {
  FakeTicker ticker = new FakeTicker();
  ticker.advance(ThreadLocalRandom.current().nextLong());
  Cache<Integer, Integer> cache = Caffeine.newBuilder()
      .ticker(ticker::read)
      .executor(Runnable::run)
      .expireAfter(new Expiry<Integer, Integer>() {
        @Override public long expireAfterCreate(Integer key, Integer value, long currentTime) {
          return Long.MAX_VALUE;
        }
        @Override public long expireAfterUpdate(Integer key, Integer value, long currentTime,
            long currentDuration) {
          return Long.MAX_VALUE;
        }
        @Override public long expireAfterRead(Integer key, Integer value, long currentTime,
            long currentDuration) {
          return Long.MAX_VALUE;
        }})
      .removalListener((k, v, c) -> {
        Assert.fail(k + " " + v + " " + c);
      })
      .build();
  cache.put(1, 2);
  assertThat(cache.getIfPresent(1), is(2));
  assertThat(cache.estimatedSize(), is(1L));

  ticker.advance(1, TimeUnit.MINUTES);
  assertThat(cache.getIfPresent(1), is(2));
  assertThat(cache.estimatedSize(), is(1L));
}

@facboy
Copy link
Contributor Author

facboy commented Feb 5, 2018

well this is pretty annoying, i cannot reproduce it in a unit test so started wondering if i'd imagined the whole thing. back to our production code and when i use Long.MAX_VALUE, it definitely breaks, so i'll have to whittle back from there.

@facboy
Copy link
Contributor Author

facboy commented Feb 5, 2018

ah, the simple approach: https://github.com/facboy/caffeine-test

@ben-manes
Copy link
Owner

Thanks! That is very odd, but something we can debug with. Interestingly it takes 1.5M operations to fail, though this is using real time instead of a fake source so that may be understandable. Since my test suite primarily uses TestNG, here is a slight adaption to run there.

@rash67 wants to get his hands dirty, so hopefully he can take a look. If not, I'll try to resolve it over the weekend.

@Test
@SuppressWarnings({"unchecked", "RedundantStringConstructorCall", "InfiniteLoopStatement"})
void maxExpiryShouldNotExpireImmediately() {
  LoadingCache<Integer, String> cache = Caffeine.newBuilder()
      .expireAfter(new Expiry<Integer, String>() {
        @Override
        public long expireAfterCreate(Integer key, String value, long currentTime) {
          return Long.MAX_VALUE;
        }
        @Override
        public long expireAfterUpdate(Integer key, String value, long currentTime,
            long currentDuration) {
          return Long.MAX_VALUE;
        }
        @Override
        public long expireAfterRead(Integer key, String value, long currentTime,
            long currentDuration) {
          return Long.MAX_VALUE;
        }
      })
      .removalListener((k, v, c) -> {
        //Assert.fail(k + " " + v + " " + c);
      }).build(key -> new String("10"));

  int counter = 0;
  String expected = cache.get(10);
  Stopwatch stopwatch = Stopwatch.createStarted();
  while (stopwatch.elapsed(TimeUnit.MINUTES) < 1) {
    String val = cache.get(10);
    assertThat("Failed after " + counter + " iterations", val, sameInstance(expected));
    counter++;
  }
}

@facboy
Copy link
Contributor Author

facboy commented Feb 9, 2018

I don't think it's related but is there also a bug with the Expiry.expireAfterUpdate behaviour? perhaps it's not a supported use-case but I expect an entry to expire (ie be evicted and invoke removalListener) after the delay returned from Expiry.expireAfterUpdate() has passed. It seems that this doesn't happen if I do not further access the cache. I didn't follow all the ins-and-outs of it but it seems that Caffeine.expiresAfterWrite() doesn't return true if Caffeine.expireAfter(Expiry) is used, so no eviction is scheduled on a write.

Possibly the error is here in the generated LocalCacheFactory:

    if (builder.expiresAfterAccess() || builder.expiresVariable()) {
      sb.append('A');
    }
    if (builder.expiresAfterWrite()) {
      sb.append('W');
    }

Should it append 'W' if builder.expiresVariable() is true as well?

@ben-manes
Copy link
Owner

To expire without any interaction with the cache would require a thread to schedule on. In Guava's Cache timeframe, there were some environments like JavaEE and AppEngine that disallowed creating threads. It is a fairly high cost when typically the promptness is not necessary.

Java 9 added a JVM-wide scheduler (via CompletableFuture), which we might leverage as an option. I am way behind on a JDK9 version, unfortunately.

@facboy
Copy link
Contributor Author

facboy commented Feb 9, 2018

oh i see, i thought that was what the timerWheel did but I didn't look at it in great detail. i guess the wheel only advances on interaction with the cache?

EDIT: doesn't it schedule evictions etc using the ForkJoinPool?

@ben-manes
Copy link
Owner

ben-manes commented Feb 9, 2018

Yes, the timerwheel is a passive data structure. There is a nice article regarding Kafka's implementation that schedules on a thread. They combine the timerwheel with a ScheduledExecutorService (heap-based) and schedule the bucket instead of the entry. This way the heap is bounded to O(lg # of wheels) instead of O(lg # of timers), which in our case would be a maximum of O(lg 165) ~= 7.

@facboy
Copy link
Contributor Author

facboy commented Feb 13, 2018

would be interesting if this was added as an option :). for my current use-case i don't need them to expire in a particularly timely manner, i've just added an external background thread that invokes cleanUp() periodically.

@ben-manes
Copy link
Owner

#195 is basically asking for this. It has definitely come up many times for Guava, etc. The cost of maintaining dedicated threads is high for a library, and while we can use FJP in JDK8 it doesn't support any scheduling semantics.

The hope is that we can use optionally enable using CompletableFuture.html#delayedExecutor to trigger a cleanUp() on the next expiration event. This uses a global JDK-wide scheduling thread, so we do not impose our own. We might want to offer a simple Scheduler abstraction for unit tests, etc. (like we did with Ticker for time). I don't know what we'd call the enablement of this feature, as it would be off by default.

That's the idea of how to do it properly, at least. I am pretty behind (on this bug fix and v3.0 release in general).

@ben-manes
Copy link
Owner

@facboy Can you try testing with .executor(Runnable::run) added to the builder? It doesn't fail for me then.

I think the issue might be that threads read different timestamps when calling System.nanoTime(). When enough reads occur, they are replayed on the policy asynchronously. Then it somehow overflows and goes negative. I haven't debugged more than that, but it seems like a clock synchronization issue if my theory holds.

@facboy
Copy link
Contributor Author

facboy commented Feb 20, 2018

hmm. so it fixes the test-case i gave you, but it doesn't fix the 'actual' problem i was having which is still in private code. if it helps this is the partial stack trace of the removalListener when it is expired (the reason cause[0] is EXPIRED):

	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$notifyRemoval$1(BoundedLocalCache.java:283)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.notifyRemoval(BoundedLocalCache.java:289)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2080)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2010)
	at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:113)
	at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:65)

there might be an additional issue. the code where I first observed this (ie not the test-case) is not doing a lot of reads...probably in the low 10s at most as it's a small integration test.

@ben-manes
Copy link
Owner

It’s likely that the async thread has an older time stamp and the overflow occurs since it’s a tight boundary. I just need to better accustom myself to this. I think that the solution you gave will be the right fix, though.

ben-manes added a commit that referenced this issue Feb 21, 2018
…217)

When the duration is set to the maximum length, Long.MAX_VALUE nanoseconds,
the calcuation of expirationTime - currentTime > 0 may overflow and be
negative. This will not occur if the same thread calculates both timestamps.
It may occur across threads when the expirationTime is concurrently updated
using a later base time than t1's reading of the currentTime. This can
occur whenever the maintenance work is triggered to sweep expired entries
and a user thread accesses the entry. The later timestamp plus the maximum
duration results in an overflow, causing the remaining time to be negative,
and therefore causes the cache to expire the entry.

The internal maximum is now capped at Long.MAX_VALUE / 2 or ~150 years. This
should give a broad safety net to avoid these concurrency-inducing overflows
in normal code.
@ben-manes
Copy link
Owner

I generalized the test to a pass/fail scenario, showing that the concurrency due to threading was the cause. When the read buffer is full (~16 reads per stripe), the async maintenance is triggered. The entry continues to be read, bumping its expiration time beyond the overflow point. This means that the maintenance's currentTime read is less than the entry's currentTime read, so t1 - (t0 + duration) will overflow with Long.MAX_VALUE. If all operations are using a single thread, this cannot occur due to order of operations causing linear timestamps. Since the unit tests default to same-thread cleanup for simplicity, it was an oversight on my part.

I updated the internals to always enforce a Long.MAX_VALUE / 2 as you suggested to provide enough leeway for the overflow to not occur in practical code.

I'll make a release tomorrow and will let the CI do its thing tonight. Thanks for reporting the issue.

ben-manes added a commit that referenced this issue Feb 21, 2018
…217)

When the duration is set to the maximum length, Long.MAX_VALUE nanoseconds,
the calcuation of expirationTime - currentTime > 0 may overflow and be
negative. This will not occur if the same thread calculates both timestamps.
It may occur across threads when the expirationTime is concurrently updated
using a later base time than t1's reading of the currentTime. This can
occur whenever the maintenance work is triggered to sweep expired entries
and a user thread accesses the entry. The later timestamp plus the maximum
duration results in an overflow, causing the remaining time to be negative,
and therefore causes the cache to expire the entry.

The internal maximum is now capped at Long.MAX_VALUE / 2 or ~150 years. This
should give a broad safety net to avoid these concurrency-inducing overflows
in normal code.
ben-manes added a commit that referenced this issue Feb 21, 2018
…217)

When the duration is set to the maximum length, Long.MAX_VALUE nanoseconds,
the calcuation of expirationTime - currentTime > 0 may overflow and be
negative. This will not occur if the same thread calculates both timestamps.
It may occur across threads when the expirationTime is concurrently updated
using a later base time than t1's reading of the currentTime. This can
occur whenever the maintenance work is triggered to sweep expired entries
and a user thread accesses the entry. The later timestamp plus the maximum
duration results in an overflow, causing the remaining time to be negative,
and therefore causes the cache to expire the entry.

The internal maximum is now capped at Long.MAX_VALUE / 2 or ~150 years. This
should give a broad safety net to avoid these concurrency-inducing overflows
in normal code.
@facboy
Copy link
Contributor Author

facboy commented Feb 21, 2018

ah great, thanks for that!

ben-manes added a commit that referenced this issue Feb 21, 2018
…217)

When the duration is set to the maximum length, Long.MAX_VALUE nanoseconds,
the calcuation of expirationTime - currentTime > 0 may overflow and be
negative. This will not occur if the same thread calculates both timestamps.
It may occur across threads when the expirationTime is concurrently updated
using a later base time than t1's reading of the currentTime. This can
occur whenever the maintenance work is triggered to sweep expired entries
and a user thread accesses the entry. The later timestamp plus the maximum
duration results in an overflow, causing the remaining time to be negative,
and therefore causes the cache to expire the entry.

The internal maximum is now capped at Long.MAX_VALUE / 2 or ~150 years. This
should give a broad safety net to avoid these concurrency-inducing overflows
in normal code.
ben-manes added a commit that referenced this issue Feb 21, 2018
…217)

When the duration is set to the maximum length, Long.MAX_VALUE nanoseconds,
the calcuation of expirationTime - currentTime > 0 may overflow and be
negative. This will not occur if the same thread calculates both timestamps.
It may occur across threads when the expirationTime is concurrently updated
using a later base time than t1's reading of the currentTime. This can
occur whenever the maintenance work is triggered to sweep expired entries
and a user thread accesses the entry. The later timestamp plus the maximum
duration results in an overflow, causing the remaining time to be negative,
and therefore causes the cache to expire the entry.

The internal maximum is now capped at Long.MAX_VALUE / 2 or ~150 years. This
should give a broad safety net to avoid these concurrency-inducing overflows
in normal code.
@ben-manes
Copy link
Owner

Released. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants