
@CacheResult with method returning Uni makes cache exceed its maximum size #40852

Closed
jolsum opened this issue May 27, 2024 · 33 comments · Fixed by #41064
Labels
area/cache kind/bug Something isn't working

Comments

@jolsum

jolsum commented May 27, 2024

Describe the bug

When a method returning a Uni is annotated with @CacheResult, the result of the Uni is cached as expected, but the cache grows infinitely even when a maximum size is set.

Expected behavior

When the cache reaches the specified maximum size, the number of elements in the cache should stop growing, or at least not grow infinitely.

Actual behavior

No elements are evicted and the cache grows infinitely

How to Reproduce?

In GreetingResource there are two endpoints ("/hello/fast" and "/hello/delayed"). The "delayed" endpoint gets its data from a method returning a Uni<String> whose item is emitted after a short delay, and the method is annotated with @CacheResult. The "fast" endpoint gets its data from a method returning a Uni whose item is emitted without any delay.

import java.time.Duration;

import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

import io.quarkus.cache.CacheResult;
import io.smallrye.mutiny.Uni;

@Path("/hello")
public class GreetingResource {

  @GET
  @Produces(MediaType.TEXT_PLAIN)
  @Path("/delayed")
  public Uni<String> delayed() {
    return getHelloDelayed(System.currentTimeMillis());
  }

  @GET
  @Produces(MediaType.TEXT_PLAIN)
  @Path("/fast")
  public Uni<String> fast() {
    return getHelloFast(System.currentTimeMillis());
  }

  @CacheResult(cacheName = "delayed")
  Uni<String> getHelloDelayed(long key) {
    return Uni.createFrom().item("Hello with a delay").onItem().delayIt().by(Duration.ofMillis(10));
  }

  @CacheResult(cacheName = "fast")
  Uni<String> getHelloFast(long key) {
    return Uni.createFrom().item("Hello");
  }
}

The cache maximum size is defined as quarkus.cache.caffeine.maximum-size = 10
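
For reference, this setting would typically live in application.properties. A minimal sketch of the two forms the Quarkus Caffeine cache configuration supports (the per-cache variant below is shown only as an illustration; the reproducer uses the global form):

# applies to every Caffeine cache in the application
quarkus.cache.caffeine.maximum-size=10

# per-cache alternative (illustrative only, not part of the reproducer)
quarkus.cache.caffeine."delayed".maximum-size=10
quarkus.cache.caffeine."fast".maximum-size=10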

In GreetingResourceTest each endpoint is invoked 20 times, and the test then waits for the size of the corresponding cache to become 10.

import static io.restassured.RestAssured.given;
import static org.awaitility.Awaitility.await;
import static org.hamcrest.CoreMatchers.is;

import java.util.concurrent.TimeUnit;

import jakarta.inject.Inject;

import org.junit.jupiter.api.Test;

import io.quarkus.cache.Cache;
import io.quarkus.cache.CacheName;
import io.quarkus.cache.CaffeineCache;
import io.quarkus.test.junit.QuarkusTest;

@QuarkusTest
class GreetingResourceTest {

  @Inject
  @CacheName("fast")
  Cache cacheFast;

  @Inject
  @CacheName("delayed")
  Cache cacheDelayed;

  @Test
  void testFastEndpoint() {
    for (int i = 0; i < 20; i++) {
      given().when().get("/hello/fast").then().statusCode(200).body(is("Hello"));
    }
    await()
        .atMost(5, TimeUnit.SECONDS)
        .until(() -> ((CaffeineCache) cacheFast).keySet().size() == 10);
  }

  @Test
  void testDelayedEndpoint() {
    for (int i = 0; i < 20; i++) {
      given().when().get("/hello/delayed").then().statusCode(200).body(is("Hello with a delay"));
    }
    await()
        .atMost(5, TimeUnit.SECONDS)
        .until(() -> ((CaffeineCache) cacheDelayed).keySet().size() == 10);
  }
}

However, the size remains at 20 even after waiting for 5 seconds.

Output of uname -a or ver

Microsoft Windows [Version 10.0.22631.3593]

Output of java -version

openjdk version "21.0.1" 2023-10-17

Quarkus version or git rev

3.10.2

Build tool (ie. output of mvnw --version or gradlew --version)

Apache Maven 3.9.6

Additional information

I have done some debugging and managed to narrow this down to an issue with CaffeineCache and how it interacts with the CompletableFuture returned from Uni.subscribeAsCompletionStage().

If I create a Uni with a delayed item like this

    var future =
        Uni.createFrom()
            .item("item")
            .onItem()
            .delayIt()
            .by(Duration.ofMillis(100))
            .subscribeAsCompletionStage();

    future.whenComplete(
        (result, error) -> {
          System.out.println(future.isDone());
        });

In whenComplete(), future.isDone() returns false. So in Caffeine's BoundedLocalCache.replace() the item gets weight 0, the cache treats the operation as a read rather than a write even though the item is stored, and eviction is never run.
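
The effect can be illustrated directly against Caffeine. This is a minimal standalone sketch (not the Quarkus integration code), assuming Caffeine's documented async-cache behavior where a pending future carries weight 0 and only counts toward the maximum size once it completes:

import java.util.concurrent.CompletableFuture;

import com.github.benmanes.caffeine.cache.AsyncCache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class PendingFutureDemo {

  public static void main(String[] args) {
    AsyncCache<Long, String> cache = Caffeine.newBuilder()
        .maximumSize(10)
        .buildAsync();

    // Futures that never report completion are weighed as 0, so they
    // never push the cache over its maximum size and no eviction runs.
    for (long i = 0; i < 20; i++) {
      cache.put(i, new CompletableFuture<>());
    }
    cache.synchronous().cleanUp();

    // Prints 20 (estimatedSize is approximate, but nothing has been evicted),
    // mirroring what the reproducer observes when isDone() lies at weigh time.
    System.out.println(cache.synchronous().estimatedSize());
  }
}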

@jolsum jolsum added the kind/bug Something isn't working label May 27, 2024
@quarkus-bot quarkus-bot bot added area/cache env/windows Impacts Windows machines labels May 27, 2024

quarkus-bot bot commented May 27, 2024

/cc @gwenneg (cache)

@geoand geoand removed the env/windows Impacts Windows machines label May 28, 2024
@geoand
Contributor

geoand commented Jun 6, 2024

Thanks a lot for digging into this.

This part seems very interesting to me:

    var future =
        Uni.createFrom()
            .item("item")
            .onItem()
            .delayIt()
            .by(Duration.ofMillis(100))
            .subscribeAsCompletionStage();

    future.whenComplete(
        (result, error) -> {
          System.out.println(future.isDone());
        });

I indeed ran it and it prints false. @jponge is that expected? If it is, I'd like to know why :)

@jponge
Member

jponge commented Jun 6, 2024

Funny, I ran:

@Test
void yolo() throws Throwable {

    var future =
            Uni.createFrom()
                    .item("item")
                    .onItem()
                    .delayIt()
                    .by(Duration.ofMillis(100))
                    .subscribeAsCompletionStage();

    future.whenComplete(
            (result, error) -> {
                System.out.println(future.isDone());
            });

    future.get(1, TimeUnit.SECONDS);
}

and got true.

@jponge
Member

jponge commented Jun 6, 2024

Also:

    @Test
    void yolo() throws Throwable {
        var future =
                Uni.createFrom()
                        .item("item")
                        .onItem()
                        .delayIt()
                        .by(Duration.ofMillis(100))
                        .subscribeAsCompletionStage();

        AtomicBoolean done = new AtomicBoolean();
        future.whenComplete(
                (result, error) -> {
                    done.set(future.isDone());
                });

        assertThat(future.get(1, TimeUnit.SECONDS)).isEqualTo("item");
        assertThat(done).isTrue();
    }

@jponge
Member

jponge commented Jun 6, 2024

I'm not familiar with that code base, but there might be a timing issue causing the future's completion state to be looked up from code that doesn't assume async semantics, hence it reads false.

@geoand
Contributor

geoand commented Jun 7, 2024

I tried it various ways and I am always getting false...

@ben-manes

Could it have something to do with your CompletableFutureWrapper? I see that it is set via Infrastructure.setCompletableFutureWrapper only in some environments, which may explain the inconsistent reproduction. I'm unsure how to write this up to check.

@geoand
Contributor

geoand commented Jun 7, 2024

That's a very good candidate, and would explain the discrepancy we are seeing:

I am running the tests in a Quarkus project (which includes context-propagation, which calls the method you mention), while I assume @jponge is running his test in upstream Mutiny.

@ben-manes

Confirmed

  public static void main(String[] args) {
    var future = Uni
        .createFrom()
        .item("item")
        .onItem()
        .delayIt()
        .by(Duration.ofMillis(100))
        .subscribeAsCompletionStage();

    future.whenComplete((result, error) -> {
      System.out.println("Direct: " + future.isDone());
    }).join();

    Infrastructure.setCompletableFutureWrapper(new UnaryOperator<CompletableFuture<?>>() {
      @Override public CompletableFuture<?> apply(CompletableFuture<?> t) {
        var threadContext = SmallRyeThreadContext.getCurrentThreadContextOrDefaultContexts();
        return threadContext.withContextCapture(t);
      }
    });
    var futureW = Uni
        .createFrom()
        .item("item")
        .onItem()
        .delayIt()
        .by(Duration.ofMillis(100))
        .subscribeAsCompletionStage();

    futureW.whenComplete((result, error) -> {
      System.out.println("Wrapped: " + futureW.isDone());
    }).join();
  }

Output:

Direct: true
Wrapped: false

@geoand
Contributor

geoand commented Jun 7, 2024

Great catch!

Right, I ran the same thing and saw the same results.
Now we need to find why context propagation is causing this issue...

@geoand
Contributor

geoand commented Jun 7, 2024

@FroMage will be interested in this

@geoand
Contributor

geoand commented Jun 7, 2024

Not that this adds anything, but I'll just mention that there is certainly a timing issue at play: although the delegate CompletableFuture has been completed, the CompletableFutureWrapper doesn't get completed until later, and therefore the call to isDone() returns false.

@ben-manes

I believe it is because dependent actions are kept on a stack, so they run in LIFO order, though most people assume FIFO. Therefore the constructor's whenComplete that sets the wrapper's result runs last. Since client actions are attached directly to the delegate, the wrapper is still incomplete when they run. A possible solution is to use the future returned by f.whenComplete as your delegate.
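
A minimal sketch of that ordering problem (a hypothetical NaiveWrapper, not the actual SmallRye CompletableFutureWrapper; it also leans on the JDK implementation detail that pending dependents of a CompletableFuture run in reverse registration order, which the specification does not guarantee):

import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;

// Hypothetical wrapper: the constructor registers a whenComplete on the
// delegate to copy the result into the wrapper, and client callbacks are
// forwarded straight to the delegate.
class NaiveWrapper<T> extends CompletableFuture<T> {

  private final CompletableFuture<T> delegate;

  NaiveWrapper(CompletableFuture<T> delegate) {
    this.delegate = delegate;
    // Registered first, but runs last (LIFO), so the wrapper is the
    // very last thing to be completed.
    delegate.whenComplete((r, e) -> {
      if (e != null) { completeExceptionally(e); } else { complete(r); }
    });
  }

  @Override
  public CompletableFuture<T> whenComplete(BiConsumer<? super T, ? super Throwable> action) {
    // Client actions land directly on the delegate.
    return delegate.whenComplete(action);
  }

  public static void main(String[] args) {
    CompletableFuture<String> delegate = new CompletableFuture<>();
    NaiveWrapper<String> wrapper = new NaiveWrapper<>(delegate);

    // Registered second, so it runs before the constructor's copy action:
    // the wrapper still reports isDone() == false here.
    wrapper.whenComplete((r, e) ->
        System.out.println("wrapper.isDone() inside callback: " + wrapper.isDone()));

    delegate.complete("item");
  }
}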

@geoand
Contributor

geoand commented Jun 7, 2024

I don't know enough about how Context Propagation and Mutiny interact, but given that the completion takes time to propagate, the obvious solution would be for isDone to always call the delegate. I don't know, however, what side effects that could cause.

@geoand
Contributor

geoand commented Jun 7, 2024

Given all the info, I think this one is for @jponge and @FroMage to figure out 😉

@ben-manes

I think my solution is safer to maintain the correct ordering. Can you try testing with this modification?

this.f = f.whenComplete(…);
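
In terms of the hypothetical NaiveWrapper sketched above (again, not the actual SmallRye code), the suggestion amounts to making the future returned by whenComplete the delegate, so client actions can only observe a wrapper that has already been completed:

import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;

class FixedWrapper<T> extends CompletableFuture<T> {

  private final CompletableFuture<T> delegate;

  FixedWrapper(CompletableFuture<T> f) {
    // The delegate is the stage *returned by* whenComplete, which only
    // completes after the copy action has completed the wrapper itself.
    this.delegate = f.whenComplete((r, e) -> {
      if (e != null) { completeExceptionally(e); } else { complete(r); }
    });
  }

  @Override
  public CompletableFuture<T> whenComplete(BiConsumer<? super T, ? super Throwable> action) {
    return delegate.whenComplete(action);
  }

  public static void main(String[] args) {
    CompletableFuture<String> f = new CompletableFuture<>();
    FixedWrapper<String> wrapper = new FixedWrapper<>(f);

    // Now prints true: this callback can only run after the wrapper is done.
    wrapper.whenComplete((r, e) ->
        System.out.println("wrapper.isDone() inside callback: " + wrapper.isDone()));

    f.complete("item");
  }
}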

@geoand
Contributor

geoand commented Jun 7, 2024

I think my solution is safer to maintain the correct ordering

Absolutely!

@geoand
Contributor

geoand commented Jun 7, 2024

Can you try testing with this modification?

Let me try and build all the necessary pieces :)

@geoand
Contributor

geoand commented Jun 7, 2024

The tests of Context Propagation fail with that change, but that might not be the fault of the change itself. Anyway, I'll build with those tests skipped.

@ben-manes

ben-manes commented Jun 7, 2024

FWIW, the equals contract on this wrapper is concerning since it may not be symmetric. A future uses identity equality, so I think it's broken both now and with my suggestion. Deleting the equals/hashCode might also be advisable. Just eyeballing, so I might be wrong.

@geoand
Contributor

geoand commented Jun 7, 2024

I tried your suggestion and it does work on the basic test we have been discussing.

@geoand
Contributor

geoand commented Jun 7, 2024

I think @FroMage has a lot of information here he can use to fix this in Context Propagation :)

@jponge
Member

jponge commented Jun 7, 2024

I'm indeed running upstream Mutiny, thanks for finding the issue in context propagation.

@geoand
Contributor

geoand commented Jun 7, 2024

The tests of Context Propagation fail with that change, but that might not be the fault of the change itself. Anyway, I'll build with those tests skipped.

Looking at it a little more, those tests should be easily fixable by doing some additional exception unwrapping.

@FroMage
Member

FroMage commented Jun 7, 2024

I dunno, it sounds like you are on top of it, do you really want me to take over? :)

@geoand
Contributor

geoand commented Jun 7, 2024

I think that would be best since we could have missed unintended side effects in CP

@FroMage
Member

FroMage commented Jun 7, 2024

OK

@FroMage
Member

FroMage commented Jun 7, 2024

OK, please review smallrye/smallrye-context-propagation#445

Let's be honest, I don't remember the difference between wrappers and dependent stages, so I've no idea why fixing this breaks the TCK. This fix feels logical, though, in that it's conditional based on wrapper/dependent. It appears to fix the issue at hand (if my test is correct, and it was indeed failing before I added the fix) and the TCK keeps running.

Please review and let me know WYT and then I can make a release.

@FroMage
Member

FroMage commented Jun 7, 2024

So I made a 2.1.1 release, and I did find a test not involving mutiny that exhibited this behaviour.

But I can't reproduce this test (#40852 (comment)) in integration-tests/smallrye-context-propagation, so I'm a bit annoyed: it works with both 2.1.0 and 2.1.1.

I'll make a PR anyway.

FroMage added a commit to FroMage/quarkus that referenced this issue Jun 7, 2024
@jolsum
Author

jolsum commented Jun 9, 2024

I can confirm that the fix in Context Propagation version 2.1.1 resolves this issue. Thanks, @FroMage, @geoand, @jponge and @ben-manes!

FroMage added a commit to FroMage/quarkus that referenced this issue Jun 11, 2024
@quarkus-bot quarkus-bot bot added this to the 3.12 - main milestone Jun 11, 2024
@gsmet gsmet modified the milestones: 3.12 - main, 3.11.2 Jun 11, 2024
gsmet pushed a commit to gsmet/quarkus that referenced this issue Jun 11, 2024
Fixes quarkusio#40852

(cherry picked from commit 0dc312b)
holly-cummins pushed a commit to holly-cummins/quarkus that referenced this issue Jul 31, 2024
@gsmet gsmet modified the milestones: 3.11.2, 3.8.6 Aug 14, 2024
gsmet pushed a commit to gsmet/quarkus that referenced this issue Aug 14, 2024
Fixes quarkusio#40852

(cherry picked from commit 0dc312b)
@israelstmz

Hey, we are running into what seems like a very similar issue in our production, using 3.14.4. We are using @CacheResult with a method returning a Uni and we see that the cache greatly exceeds the limit: our limit is 10K and it reaches way above 100K. The thing is that out of ~500 application instances it happens only on 2-3... Also, we can't reproduce it using the code above...
@geoand, any idea how we can approach this? It causes a memory leak.

@geoand
Contributor

geoand commented Sep 18, 2024

@israelstmz I would suggest opening a new issue with as much detail as possible.
Without being able to reproduce the problem it will be very hard to find a solution, but it's definitely worth a try regardless.

@israelstmz

opened: https://github.com/quarkusio/quarkus/issues/43382

danielsoro pushed a commit to danielsoro/quarkus that referenced this issue Sep 20, 2024