test_import_from_pageserver_multisegment consistently failed #2255

aome510 · 2022-08-12T05:53:17Z

test_import_from_pageserver_multisegment was added in #2172. The test passed in the PR, but after merging into main, it failed consistently with the error:

2022-08-12T04:16:02.4390590Z E   Exception:             Run ['/tmp/neon/bin/neon_local', 'pageserver', 'stop'] failed:
2022-08-12T04:16:02.4391127Z E                 stdout: Stopping pageserver gracefully..............................................................
2022-08-12T04:16:02.4391556Z E                 stderr: 
2022-08-12T04:16:02.4392003Z E   Pageserver connection failed with error: Cannot assign requested address (os error 99)
2022-08-12T04:16:02.4392544Z E   pageserver stop failed: Failed to stop pageserver with pid 6030

Example failed run: https://github.com/neondatabase/neon/runs/7800417580?check_suite_focus=true

Update: test_import_from_pageserver_multisegment was disabled in #2258. Resolving this issue requires investigating the cause of the failure and re-enabling the test.


This test now fails consistently on `main`. It's better to disable it temporarily to avoid blocking others' PRs while the root cause of the failure is investigated. See: #2255, #2256

hlinnaka · 2022-08-12T18:50:24Z

My theory: when the pageserver receives SIGTERM, it starts the shutdown sequence, but it doesn't immediately kill GC and/or compaction. They continue to run, and after a large import like in this test, they can take a long time to finish. The timeout on shutdown is 60 s in our tests; if the pageserver doesn't exit in 60s when it receives SIGTERM, the test fails.

The `Cannot assign requested address (os error 99)` error message is weird, though. I'm not sure why that happens.
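The shutdown theory above can be sketched with plain threads. This is only an illustration, not the pageserver's code: `background_pass` and `stops_within` are hypothetical names, the atomic flag stands in for SIGTERM, and the short durations stand in for the 60 s test timeout. A pass that never checks the flag keeps running past the deadline, which is the failure mode the theory describes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a GC/compaction pass: it works through `steps`
/// units of `step` each and checks the shutdown flag only if `cancellable`.
fn background_pass(shutdown: &AtomicBool, steps: u32, step: Duration, cancellable: bool) {
    for _ in 0..steps {
        if cancellable && shutdown.load(Ordering::SeqCst) {
            return; // stop early once shutdown has been requested
        }
        thread::sleep(step); // pretend to do one unit of work
    }
}

/// Requests shutdown right away (the stand-in for SIGTERM), then waits up to
/// `timeout` for the background pass to finish.
/// Returns true if the pass finished before the timeout expired.
fn stops_within(timeout: Duration, steps: u32, step: Duration, cancellable: bool) -> bool {
    let shutdown = Arc::new(AtomicBool::new(false));
    let (done_tx, done_rx) = mpsc::channel();

    let flag = Arc::clone(&shutdown);
    thread::spawn(move || {
        background_pass(&flag, steps, step, cancellable);
        // Signal completion; the receiver may already be gone after a timeout.
        let _ = done_tx.send(());
    });

    shutdown.store(true, Ordering::SeqCst);
    done_rx.recv_timeout(timeout).is_ok()
}
```

With a cancellable pass, `stops_within` returns true almost immediately. With a non-cancellable pass whose total work exceeds `timeout`, it returns false, just as the test harness gives up when the pageserver has not exited in time.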

hlinnaka · 2022-08-12T18:52:04Z

@bojanserafimov can you take a look at this, after the daemonize issue, please? PR #2261 will probably at least change the error message from this test, as it changes the way we wait for the pageserver shutdown.

SomeoneToIgnore · 2022-08-12T19:59:17Z

> My theory: when the pageserver receives SIGTERM, it starts the shutdown sequence, but it doesn't immediately kill GC and/or compaction.

This is not a theory and can be triggered very simply, as I've mentioned in the RFC.

I'm not sure that's really what happens here, though, but the libpq `do_gc` and `checkpoint` calls (the latter forces compaction) don't hold `file_lock`

neon/pageserver/src/layered_repository.rs, lines 83 to 93 in e593cba:

    // Allows us to gracefully cancel operations that edit the directory
    // that backs this layered repository. Usage:
    //
    // Use `let _guard = file_lock.try_read()` while writing any files.
    // Use `let _guard = file_lock.write().unwrap()` to wait for all writes to finish.
    //
    // TODO try_read this lock during checkpoint as well to prevent race
    //      between checkpoint and detach/delete.
    // TODO try_read this lock for all gc/compaction operations, not just
    //      ones scheduled by the tenant task manager.
    pub file_lock: RwLock<()>,

that's used as a semaphore to wait for regular, spawned gc and compaction tasks to stop:

neon/pageserver/src/tenant_mgr.rs, lines 363 to 365 in 84d1bc0:

    // Wait until all gc/compaction tasks finish
    let repo = get_repository_for_tenant(tenant_id)?;
    let _guard = repo.file_lock.write().unwrap();
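To make the gap concrete, here is a minimal sketch of the lock-as-semaphore pattern quoted above, using only `std::sync::RwLock`. The `Repo` struct, `start_job`, and `wait_for_jobs` are hypothetical names for illustration, not the pageserver's API. A job that holds a read guard is waited for by the writer; a job that skips the guard (as `do_gc` and `checkpoint` are said to do) is not.

```rust
use std::sync::mpsc;
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the repository; only the lock matters here.
struct Repo {
    file_lock: RwLock<()>,
}

/// Starts a background job that runs for `work` and returns once the job is
/// running. If `hold_guard` is true, the job holds a read guard for its whole
/// duration, as the comment on `file_lock` asks writers to do.
fn start_job(repo: &Arc<Repo>, work: Duration, hold_guard: bool) {
    let repo = Arc::clone(repo);
    let (started_tx, started_rx) = mpsc::channel();
    thread::spawn(move || {
        // The guard (if any) lives until the end of this closure.
        let _guard = if hold_guard {
            repo.file_lock.try_read().ok()
        } else {
            None
        };
        started_tx.send(()).unwrap();
        thread::sleep(work); // pretend to write files
    });
    // Wait until the job has had its chance to take the guard.
    started_rx.recv().unwrap();
}

/// Waits for all guard-holding jobs by taking the write lock, and reports
/// how long the wait took.
fn wait_for_jobs(repo: &Repo) -> Duration {
    let begin = Instant::now();
    let _guard = repo.file_lock.write().unwrap();
    begin.elapsed()
}
```

Calling `wait_for_jobs` while a guard-holding job is running blocks until that job ends. When the job does not take the guard, `wait_for_jobs` returns at once even though the job is still working, so a shutdown path built on this lock would not notice it.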


aome510 added the a/test Area: related to testing label Aug 12, 2022

This was referenced Aug 12, 2022:

Investigate #2255 #2256 (closed)

Disable test_import_from_pageserver_multisegment #2258 (merged)

hlinnaka assigned bojanserafimov Aug 12, 2022

bojanserafimov mentioned this issue Aug 17, 2022:

Enable import test #2293 (closed)
