-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Another attempt at making test_vm_bits less flaky #7989
Conversation
3198 tests run: 3059 passed, 0 failed, 139 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
eacb09a at 2024-06-10T12:13:07.261Z :recycle: |
- Split the first and second parts of the test to two separate tests - In the first test, disable the aggressive GC, compaction, and autovacuum. They are only needed by the second test. I'd like to get the first test to a point that the VM page is never all-zeros. Disabling autovacuum in the first test is hopefully enough to accomplish that. - Compare the full page images, don't skip page header. After fixing the previous point, there should be no discrepancy. LSN still won't match, though, because of commit 387a368. Fixes issue #7984
4bdca93
to
cb2a7ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your arguments about the source of the flukiness of this test.
And hope that your refactoring of the test will fix it.
But I do no agree with this statement:
LSN still won't match, though, because of commit 387a368
LSN did not much not because of [387a368]
You have excluded the LSN from comparison in
[428d9fe] which was merged long before my PR.
Also I am not sure that reverting my commit [b06eec4] (ignoring page header in comparison) will not resurrect the problem which this commit is going to address: when zero page is compared with empty page.
We have suspicious code in pagestore_smgr which treats empty page as zero page:
else if (PageIsEmptyHeapPage((Page) buffer))
{
ereport(SmgrTrace,
(errmsg(NEON_TAG "Page %u of relation %u/%u/%u.%u is an empty heap page with no LSN",
blocknum,
RelFileInfoFmt(InfoFromSMgrRel(reln)),
forknum)));
}
I did not remember why it was necessary. May be we should comment it and check what is broken? Yes, in case of VM this branch should not normally be reachable because we force was-logging VM pages. But only when shutdown is not requested. If VM pages is flushed during shutdown, can it also happen (replacing empty page with zero page) for VM page?
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Yes, you're right. I thought
With
Yeah, we should also WAL-log VM pages when shutdown has been requested. And probably FSM pages too, although that's not required for correctness. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flakiness is still there, no?
diff --git a/test_runner/regress/test_vm_bits.py b/test_runner/regress/test_vm_bits.py
index 9783cc442..e8c05e064 100644
--- a/test_runner/regress/test_vm_bits.py
+++ b/test_runner/regress/test_vm_bits.py
@@ -1,4 +1,5 @@
import time
+import pytest
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnv, NeonEnvBuilder, fork_at_current_lsn
@@ -8,6 +9,8 @@ from fixtures.neon_fixtures import NeonEnv, NeonEnvBuilder, fork_at_current_lsn
# Test that the VM bit is cleared correctly at a HEAP_DELETE and
# HEAP_UPDATE record.
#
+
+@pytest.mark.skip(reason="not flaky")
def test_vm_bit_clear(neon_simple_env: NeonEnv):
env = neon_simple_env
@@ -119,6 +122,7 @@ def test_vm_bit_clear(neon_simple_env: NeonEnv):
#
# This is a repro for the bug fixed in commit 66fa176cc8.
#
+@pytest.mark.repeat(1000)
def test_vm_bit_clear_on_heap_lock_whitebox(neon_env_builder: NeonEnvBuilder):
env = neon_env_builder.init_start()
@@ -172,6 +176,7 @@ def test_vm_bit_clear_on_heap_lock_whitebox(neon_env_builder: NeonEnvBuilder):
assert vm_page_at_pageserver == vm_page_in_cache
+@pytest.mark.repeat(1000)
def test_vm_bit_clear_on_heap_lock_blackbox(neon_env_builder: NeonEnvBuilder):
"""
The previous test is enough to verify the bug that was fixed in
Run 4 tests concurrently with
BUILD_TYPE=DEBUG DEFAULT_PG_VERSION=16 ./scripts/pytest -n4 ./test_runner/regress/test_vm_bits.py
Failed after ca 300 runs with
> assert vm_page_at_pageserver == vm_page_in_cache
E AssertionError: assert '000000000000...0000000000000' == '000000001800...0000000000000'
E - 0000000018000020002004200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
E ? -----------------------
E + 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
E ? +++++++++++++++++++++++
test_runner/regress/test_vm_bits.py:176: AssertionError
Hmm, ok, thanks for the tip on how to repeat it. So who's holding back the xmin horizon? There shouldn't be any other concurrent transactions running. Turns out there is: the activity monitor that's part of compute_ctl periodically runs queries to get the status of replication and running queries. If that runs concurrently with the
To work around that, I made the test to run in a separate database. That way, the background activity in the 'postgres' database doesn't hold back the xmin horizon in the test database. With that, I've run about 600 iterations on my laptop now, with no failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can confirm that with your changes the flakiness is gone 🥳
- Split the first and second parts of the test to two separate tests - In the first test, disable the aggressive GC, compaction, and autovacuum. They are only needed by the second test. I'd like to get the first test to a point that the VM page is never all-zeros. Disabling autovacuum in the first test is hopefully enough to accomplish that. - Compare the full page images, don't skip page header. After fixing the previous point, there should be no discrepancy. LSN still won't match, though, because of commit 387a368. Fixes issue #7984
Split the first and second parts of the test to two separate tests
In the first test, disable the aggressive GC, compaction, and autovacuum. They are only needed by the second test. I'd like to get the first test to a point that the VM page is never all-zeros. Disabling autovacuum in the first test is hopefully enough to accomplish that.
Compare the full page images, don't skip page header. After fixing the previous point, there should be no discrepancy. LSN still won't match, though, because of commit 387a368.
Fixes issue #7984