-
Notifications
You must be signed in to change notification settings - Fork 439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test_import_from_pageserver_multisegment consistently failed #2255
Comments
My theory: when the pageserver receives SIGTERM, it starts the shutdown sequence, but it doesn't immediately kill GC and/or compaction. They continue to run, and after a large import like in this test, they can take a long time to finish. The timeout on shutdown is 60 s in our tests; if the pageserver doesn't exit in 60s when it receives SIGTERM, the test fails. The error message with |
@bojanserafimov can you take a look at this, after the daemonize issue, please? PR #2261 will probably at least change the error message from this test, as it changes the way we wait for the pageserver shutdown. |
This it not a theory and might be triggered very simply, as I've mentioned in the RFC. I'm not sure if it's really the case here though, but libpq neon/pageserver/src/layered_repository.rs Lines 83 to 93 in e593cba
that's used as a semaphore to wait for regular, spawned gc and compaction tasks to stop: neon/pageserver/src/tenant_mgr.rs Lines 363 to 365 in 84d1bc0
|
* github/workflows: Fix git dubious ownership (#2223) * Move relation size cache from WalIngest to DatadirTimeline (#2094) * Move relation sie cache to layered timeline * Fix obtaining current LSN for relation size cache * Resolve merge conflicts * Resolve merge conflicts * Reestore 'lsn' field in DatadirModification * adjust DatadirModification lsn in ingest_record * Fix formatting * Pass lsn to get_relsize * Fix merge conflict * Update pageserver/src/pgdatadir_mapping.rs Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> * Update pageserver/src/pgdatadir_mapping.rs Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> * refactor: replace lazy-static with once-cell (#2195) - Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy` - fixes #1147 Signed-off-by: Ankur Srivastava <best.ankur@gmail.com> * Add more buckets to pageserver latency metrics (#2225) * ignore record property warning to fix benchmarks * increase statement timeout * use event so it fires only if workload thread successfully finished * remove debug log * increase timeout to pass test with real s3 * avoid duplicate parameter, increase timeout * Major migration script (#2073) This script can be used to migrate a tenant across breaking storage versions, or (in the future) upgrading postgres versions. See the comment at the top for an overview. Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech> * Fix etcd typos * Fix links to safekeeper protocol docs. (#2188) safekeeper/README_PROTO.md was moved to docs/safekeeper-protocol.md in commit 0b14fdb, as part of reorganizing the docs into 'mdbook' format. Fixes issue #1475. Thanks to @banks for spotting the outdated references. In addition to fixing the above issue, this patch also fixes other broken links as a result of 0b14fdb. See #2188 (review). Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: Thang Pham <thang@neon.tech> * Update CONTRIBUTING.md * Update CONTRIBUTING.md * support node id and remote storage params in docker_entrypoint.sh * Safe truncate (#2218) * Move relation sie cache to layered timeline * Fix obtaining current LSN for relation size cache * Resolve merge conflicts * Resolve merge conflicts * Reestore 'lsn' field in DatadirModification * adjust DatadirModification lsn in ingest_record * Fix formatting * Pass lsn to get_relsize * Fix merge conflict * Update pageserver/src/pgdatadir_mapping.rs Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> * Update pageserver/src/pgdatadir_mapping.rs Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> * Check if relation exists before trying to truncat it refer #1932 * Add test reporducing FSM truncate problem Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> * Fix exponential backoff values * Update back `vendor/postgres` back; it was changed accidentally. (#2251) Commit 4227cfc accidentally reverted vendor/postgres to an older version. Update it back. * Add pageserver checkpoint_timeout option. To flush inmemory layer eventually when no new data arrives, which helps safekeepers to suspend activity (stop pushing to the broker). Default 10m should be ok. * Share exponential backoff code and fix logic for delete task failure (#2252) * Fix bug when import large (>1GB) relations (#2172) Resolves #2097 - use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get` - update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`. + `small` is the old test + `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build * Fix timeline physical size flaky tests (#2244) Resolves #2212. - use `wait_for_last_flush_lsn` in `test_timeline_physical_size_*` tests ## Context Need to wait for the pageserver to catch up with the compute's last flush LSN because during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers into disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk. ### Note The `LayerFlushThread` threads are processed **concurrently**, so it's possible that the above error still persists even with this patch. However, making the tests wait to finish processing all the WALs (not flushing) before calculating the physical size should help reduce the "flakiness" significantly * postgres_ffi/waldecoder: validate more header fields * postgres_ffi/waldecoder: remove unused startlsn * postgres_ffi/waldecoder: introduce explicit `enum State` Previously it was emulated with a combination of nullable fields. This change should make the logic more readable. * disable `test_import_from_pageserver_multisegment` (#2258) This test failed consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause for the test failure. See: #2255, #2256 * get_binaries uses DOCKER_TAG taken from docker image build step (#2260) * [proxy] Rework wire format of the password hack and some errors (#2236) The new format has a few benefits: it's shorter, simpler and human-readable as well. We don't use base64 anymore, since url encoding got us covered. We also show a better error in case we couldn't parse the payload; the users should know it's all about passing the correct project name. * test_runner/pg_clients: collect docker logs (#2259) * get_binaries script fix (#2263) * get_binaries uses DOCKER_TAG taken from docker image build step * remove docker tag discovery at all and fix get_binaries for version variable * Better storage sync logs (#2268) * Find end of WAL on safekeepers using WalStreamDecoder. We could make it inside wal_storage.rs, but taking into account that - wal_storage.rs reading is async - we don't need s3 here - error handling is different; error during decoding is normal I decided to put it separately. Test cargo test test_find_end_of_wal_last_crossing_segment prepared earlier by @yeputons passes now. Fixes #544 neondatabase/cloud#2004 Supersedes #2066 * Improve walreceiver logic (#2253) This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers. - There was a bug which looks like `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filtered all safekeepers in some strange cases. I removed this filter, it should probably help with #2237 - Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down. - Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second. - `LaggingWal` check now uses `commit_lsn` directly from safekeeper. This fixes the issue with often reconnects, when compute generates WAL really fast. - `NoWalTimeout` is rewritten to trigger only when we know about the new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout` because it will trigger only when we observe that the connected safekeeper has stuck. * increase timeout in wait_for_upload to avoid spurious failures when testing with real s3 * Bump vendor/postgres to include XLP_FIRST_IS_CONTRECORD fix. (#2274) * Set up a workflow to run pgbench against captest (#2077) Signed-off-by: Ankur Srivastava <best.ankur@gmail.com> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru> Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> Co-authored-by: Ankur Srivastava <ansrivas@users.noreply.github.com> Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com> Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech> Co-authored-by: Kirill Bulatov <kirill@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: Thang Pham <thang@neon.tech> Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com> Co-authored-by: Arseny Sher <sher-ars@yandex.ru> Co-authored-by: Egor Suvorov <egor@neon.tech> Co-authored-by: Andrey Taranik <andrey@cicd.team> Co-authored-by: Dmitry Ivanov <ivadmi5@gmail.com>
test_import_from_pageserver_multisegment
was added in #2172. The test passed in the PR, but after merging intomain
, it failed consistently with the error:Example failed run: https://github.com/neondatabase/neon/runs/7800417580?check_suite_focus=true
Update:
test_import_from_pageserver_multisegment
is disabled in #2258. One of the requirements for this PR is to investigate the failure cause and enable the test back.The text was updated successfully, but these errors were encountered: