
ConcurrentCheckpointBLinkTest - Flaky NullPointerException during build #617

Open
aluccaroni opened this issue Jun 17, 2020 · 2 comments

@aluccaroni (Contributor)

BUG REPORT

  1. Please describe the issue you observed:
  • What did you do?
    git clone (of my private fork https://github.com/aluccaroni/herddb) + mvn clean install

  • What did you expect to see?
    Build without errors, and/or errors related to library changes (I'm trying to update Guava)

  • What did you see instead?
    The build is flaky, mainly on HugeTableRestoreTest.run (this happens a lot). This issue is about a different error than the usual one: just once I got a NullPointerException in ConcurrentCheckpointBLinkTest:

Running herddb.index.blink.ConcurrentCheckpointBLinkTest
[TID  18] Deleted          6820 in    1000 ms (1000 ms)
[TID  22] Read           167124 in    1000 ms (1000 ms)
[TID  16] Inserted        13854 in    1000 ms (1000 ms)
[TID  17] Deleted          5546 in    1006 ms (1006 ms)
[TID  15] Checkpointed     2502 in    1011 ms (1011 ms)
[TID  19] Read           150973 in    1006 ms (1006 ms)
[TID  21] Read           159255 in    1005 ms (1005 ms)
[TID  20] Read           175858 in    1005 ms (1005 ms)
java.lang.NullPointerException
        at herddb.index.blink.BLink.attemptUnload(BLink.java:487)
        at herddb.index.blink.BLink.access$900(BLink.java:83)
        at herddb.index.blink.BLink$Node.loadAndLock(BLink.java:2650)
        at herddb.index.blink.BLink$Node.check_key(BLink.java:2209)
        at herddb.index.blink.BLink.search(BLink.java:598)
        at herddb.index.blink.ConcurrentCheckpointBLinkTest$Reader.call(ConcurrentCheckpointBLinkTest.java:409)
        at herddb.index.blink.ConcurrentCheckpointBLinkTest$Reader.call(ConcurrentCheckpointBLinkTest.java:347)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

Please note that the server on which I'm building is not under load/stress.
It's RHEL6 (Linux 2.6.32-754.27.1.el6.centos.plus.x86_64 #1 SMP Thu Jan 30 13:54:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux)


diegosalvi commented Jun 17, 2020

On first check it seems to be an attempt to unload a page for a node unknown to BLink :O


diegosalvi commented Jun 17, 2020

Checking the code I see a possible pattern that could produce such an NPE:

  1. a load operation for node X triggers the unload of node N due to the page replacement policy
  2. a concurrent close unloads N and then attempts to remove it from the page replacement policy

I can't see any other possibility given how the page replacement policy works: nodes are normally unloaded only AFTER the page replacement policy removes the node from its memory (during an add operation because the policy is too full, or by an explicit remove request). The only piece of code that first unloads and then removes from the page replacement policy is the close procedure (which should be fixed).
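The interleaving above can be sketched with a deterministic, single-threaded simulation. All class and method names here (Node, Policy, unload, pageSize) are simplified illustrations, not HerdDB's actual API; the point is only the ordering bug: close() unloads the node BEFORE removing it from the replacement policy, so a concurrent load can still receive the node as an eviction victim.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class UnloadRaceSketch {
    // Hypothetical stand-in for a BLink node backed by a loadable page.
    static class Node {
        Object page = new Object();          // becomes null once unloaded
        void unload() { page = null; }
        int touchPage() { return page.hashCode(); } // NPE if already unloaded
    }

    // Hypothetical page replacement policy: tracks loaded nodes, evicts the oldest.
    static class Policy {
        final Deque<Node> loaded = new ArrayDeque<>();
        Node add(Node n) {                   // returns a victim to unload, or null
            loaded.addLast(n);
            return loaded.size() > 1 ? loaded.removeFirst() : null;
        }
        void remove(Node n) { loaded.remove(n); }
    }

    public static void main(String[] args) {
        Policy policy = new Policy();
        Node n = new Node();
        policy.add(n);                       // N is loaded and tracked

        // Buggy close(): unloads first, removes from the policy afterwards.
        n.unload();                          // step 2: close unloads N...
        // ...and before close reaches policy.remove(n), another thread loads X:
        Node victim = policy.add(new Node()); // step 1: policy hands back N as the victim
        try {
            victim.touchPage();              // unloading the victim hits a null page
        } catch (NullPointerException e) {
            System.out.println("NPE: node already unloaded by close()");
        }
        policy.remove(n);                    // close finishes, but too late
    }
}
```

Reversing the order in close() (remove from the policy first, then unload) would prevent the policy from ever handing out an already-unloaded node as a victim.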

Was the index closing when the exception arose?

The other possibility is that the page replacement policy stores a node reference twice, or doesn't really drop the reference on unload and returns it multiple times.
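This second hypothesis can also be shown with a minimal sketch (again with hypothetical, simplified names): if the policy ever holds the same node reference twice, the node is handed out as an eviction victim twice, and the second unload finds its page already gone.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DuplicateReferenceSketch {
    // Hypothetical node whose page can be unloaded exactly once.
    static class Node {
        Object page = new Object();
        void unloadPage() {
            int size = page.hashCode();      // NPE on a second unload
            page = null;
        }
    }

    public static void main(String[] args) {
        // The policy's internal queue of loaded nodes, simplified to a deque.
        Deque<Node> policy = new ArrayDeque<>();
        Node n = new Node();
        policy.addLast(n);
        policy.addLast(n);                   // bug: same reference stored twice

        policy.removeFirst().unloadPage();   // first eviction: fine
        try {
            policy.removeFirst().unloadPage(); // second eviction: page is null
        } catch (NullPointerException e) {
            System.out.println("NPE: same node evicted twice");
        }
    }
}
```

A set-based structure (or a membership check on add) would rule this failure mode out.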

@aluccaroni aluccaroni added the bug label Jun 22, 2020