Remove `init` from pageserver CLI #2272

SomeoneToIgnore · 2022-08-15T10:32:30Z

Removes pageserver init CLI command.

pageserver.toml config file is now only created or updated when pageserver start command gets --update-config flag.
id is still a protected field that is not allowed to be updated in the existing config file
initial tenant and timeline creation is now only done in neon_local init, for that entire pageserver is started to serve two HTTP creation commands and then stopped.

Cloud does not use initial timelines at all, Python tests continue using the same "initial timeline" implicitly brought by the neon_local init invocation

hlinnaka

Ah, that's a patch to my liking 😄 :

 9 files changed, 218 insertions(+), 340 deletions(-)

Does starting the pageserver binary now implicitly create the target directory if it doesn't already exist? If so, that seems a bit fragile, I think I'd prefer to get an error, if for example the repository directory doesn't exist because you forgot to mount the filesystem it's in.

control_plane/src/local_env.rs

control_plane/src/storage.rs

hlinnaka · 2022-08-15T13:14:59Z

control_plane/src/storage.rs

+        datadir: &Path,
+        update_config: bool,
+    ) -> anyhow::Result<()> {
+        println!(


Note that the old code used print!() rather than println!(), so this will print an extra empty line

Yes, yet I've found that more natural when dealing with failing pageserver:

with empty line:

Work/neon/neon kb/pageserver-no-init:origin/kb/pageserver-no-init* ⇡❯ rm -rf .neon/ && ./target/debug/neon_local init Starting pageserver at '127.0.0.1:64000' in '.neon' thread 'main' panicked at '((((((((((((((((', pageserver/src/bin/pageserver.rs:41:9 stack backtrace: 0: rust_begin_unwind at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5 1: core::panicking::panic_fmt at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14 2: pageserver::main at ./pageserver/src/bin/pageserver.rs:41:9 3: core::ops::function::FnOnce::call_once at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5 note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace. pageserver init failed: Pageserver failed to start. See console output and '.neon/pageserver.log' for details.

without:

Work/neon/neon kb/pageserver-no-init:origin/kb/pageserver-no-init* ⇡❯ rm -rf .neon/ && ./target/debug/neon_local init Starting pageserver at '127.0.0.1:64000' in '.neon'thread 'main' panicked at '((((((((((((((((', pageserver/src/bin/pageserver.rs:41:9 stack backtrace: 0: rust_begin_unwind at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5 1: core::panicking::panic_fmt at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14 2: pageserver::main at ./pageserver/src/bin/pageserver.rs:41:9 3: core::ops::function::FnOnce::call_once at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5 note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace. pageserver init failed: Pageserver failed to start. See console output and '.neon/pageserver.log' for details.

control_plane/src/storage.rs

hlinnaka · 2022-08-15T13:17:35Z

control_plane/src/storage.rs

-            .http_request(
-                Method::POST,
-                format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
+    ) -> anyhow::Result<TimelineInfo> {


Why is this change needed?

Option<TimelineInfo> is None according to our logic only if we fail to create a timeline:

neon/pageserver/src/http/routes.rs

Line 244 in e593cba

None => json_response(StatusCode::CONFLICT, ())?,

So instead of doing various .unwrap() or other error handling above the client, I've simplified it to validating the presence here.

SomeoneToIgnore · 2022-08-15T14:36:05Z

Does starting the pageserver binary now implicitly create the target directory if it doesn't already exist?

The tenants/ one? Yes, now we do crate it implicitly. Do you think that base path should be mounted instead anyway? We'll fail if that's not done, so either I don't understand it or it should be ok.

We could somehow alter --update-config to something more generic and allow dir creation along with that operation too, yet it slightly becomes to resemble --init flag now 🙂

SomeoneToIgnore · 2022-08-15T20:28:56Z

A PR to fix e2e tests: https://github.com/neondatabase/cloud/pull/2058
Other tests seem to flack for known, daemonize-related issues.

hlinnaka · 2022-08-16T08:42:23Z

Does starting the pageserver binary now implicitly create the target directory if it doesn't already exist?

The tenants/ one? Yes, now we do crate it implicitly. Do you think that base path should be mounted instead anyway? We'll fail if that's not done, so either I don't understand it or it should be ok.

I meant the directory that contains tenants/. The data directory looks like this:

<path>/tenants/
<path>/pageserver.log
<path>/pageserver.pid
<path>/pageserver.toml

The log file is created on startup. The pid-file is also created. The config file isn't mandatory I guess. But the question is, if <path> itself doesn't exist, do we create it automatically, or error out? And if <path>/tenants doesn't already exist, do we create it or error out?

Imagine that pageserver runs on a host with two filesystems. The root filesystem, and 'data' that holds all the pageserver data. At server startup, we do this:

mount /data
pageserver --workdir /data/pageserver-data

What happens if the mount /data command fails? Does pageserver still start up and automatically create a new empty data directory as /data/pageserver-data/, now on the root filesystem? Or does it refuse to start? I think refusing to start is better.

We could somehow alter --update-config to something more generic and allow dir creation along with that operation too, yet it slightly becomes to resemble --init flag now slightly_smiling_face

Yeah, I think an explicit init option still makes sense. But without creating the initial tenant and timeline.

- Merge with PR #2279 - add some quick workarounds in some tests that will go away with PR #2272 There's a lot of FIXMEs and TODOs and cleanup needed. But we can discuss the design.

SomeoneToIgnore · 2022-08-16T16:46:04Z

I meant the directory that contains tenants/. The data directory looks like this:

Now I understand.
This is a workdir, the one that's being checked for existence implicitly, with .canonicalize:

neon/pageserver/src/bin/pageserver.rs

Lines 107 to 111 in d29c545

    
           let workdir = Path::new(arg_matches.value_of("workdir").unwrap_or(".neon")); 
        
           let workdir = workdir 
        
               .canonicalize() 
        
               .with_context(|| format!("Error opening workdir '{}'", workdir.display()))?; 
        
           let cfg_file_path = workdir.join("pageserver.toml");

If we try to canonicalize or do env::set_current_dir (we do that later below), we fail for non-existent path.
So we don't create the workdir, rather implicitly failing.

This means I can avoid changing the code to be --init, after all.

LizardWizzard

Looks good to me

I’m for avoiding init and bailing when passed directory doesn’t exist. Though I’m not sure that this is followed by safekeepers, need to check

UPD: safekeeper will start even without passed datadir argument. It will fallback to current working directory. It is the simplest behavior, but if we want to guard against such config errors we better use the same behavior everywhere. Thoughts?

control_plane/src/storage.rs

* Check for entire range during sasl validation (#2281) * Gen2 GH runner (#2128) * Re-add rustup override * Try s3 bucket * Set git version * Use v4 cache key to prevent problems * Switch to v5 for key * Add second rustup fix * Rebase * Add kaniko steps * Fix typo and set compress level * Disable global run default * Specify shell for step * Change approach with kaniko * Try less verbose shell spec * Add submodule pull * Add promote step * Adjust dependency chain * Try default swap again * Use env * Don't override aws key * Make kaniko build conditional * Specify runs on * Try without dependency link * Try soft fail * Use image with git * Try passing to next step * Fix duplicate * Try other approach * Try other approach * Fix typo * Try other syntax * Set env * Adjust setup * Try step 1 * Add link * Try global env * Fix mistake * Debug * Try other syntax * Try other approach * Change order * Move output one step down * Put output up one level * Try other syntax * Skip build * Try output * Re-enable build * Try other syntax * Skip middle step * Update check * Try first step of dockerhub push * Update needs dependency * Try explicit dir * Add missing package * Try other approach * Try other approach * Specify region * Use with * Try other approach * Add debug * Try other approach * Set region * Follow AWS example * Try github approach * Skip Qemu * Try stdin * Missing steps * Add missing close * Add echo debug * Try v2 endpoint * Use v1 endpoint * Try without quotes * Revert * Try crane * Add debug * Split steps * Fix duplicate * Add shell step * Conform to options * Add verbose flag * Try single step * Try workaround * First request fails hunch * Try bullseye image * Try other approach * Adjust verbose level * Try previous step * Add more debug * Remove debug step * Remove rogue indent * Try with larger image * Add build tag step * Update workflow for testing * Add tag step for test * Remove unused * Update dependency chain * Add ownership fix * Use matrix for promote * Force update * Force build * Remove unused * Add new image * Add missing argument * Update dockerfile copy * Update Dockerfile * Update clone * Update dockerfile * Go to correct folder * Use correct format * Update dockerfile * Remove cd * Debug find where we are * Add debug on first step * Changedir to postgres * Set workdir * Use v1 approach * Use other dependency * Try other approach * Try other approach * Update dockerfile * Update approach * Update dockerfile * Update approach * Update dockerfile * Update dockerfile * Add workspace hack * Update Dockerfile * Update Dockerfile * Update Dockerfile * Change last step * Cleanup pull in prep for review * Force build images * Add condition for latest tagging * Use pinned version * Try without name value * Remove more names * Shorten names * Add kaniko comments * Pin kaniko * Pin crane and ecr helper * Up one level * Switch to pinned tag for rust image * Force update for test Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box> * Add missing step output, revert one deploy step (#2285) * Add missing step output, revert one deploy step * Conform to syntax * Update approach * Add missing value * Add missing needs Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box> * Error for fatal not git repo (#2286) Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box> * Use main, not branch for ref check (#2288) * Use main, not branch for ref check * Add more debug * Count main, not head * Try new approach * Conform to syntax * Update approach * Get full history * Skip checkout * Cleanup debug * Remove more debug Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box> * Fix docker zombie process issue (#2289) * Fix docker zombie process issue * Init everywhere Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box> * Fix 1.63 clippy lints (#2282) * split out timeline metrics, track layer map loading and size calculation * reset rust cache for clippy run to avoid an ICE additionally remove trailing whitespaces * Rename pg_control_ffi.h to bindgen_deps.h, for clarity. The pg_control_ffi.h name implies that it only includes stuff related to pg_control.h. That's mostly true currently, but really the point of the file is to include everything that we need to generate Rust definitions from. * Make local mypy behave like CI mypy (#2291) * Fix flaky pageserver restarts in tests (#2261) * Remove extra type aliases (#2280) * Update cachepot endpoint (#2290) * Update cachepot endpoint * Update dockerfile & remove env * Update image building process * Cannot use metadata endpoint for this * Update workflow * Conform to kaniko syntax * Update syntax * Update approach * Update dockerfiles * Force update * Update dockerfiles * Update dockerfile * Cleanup dockerfiles * Update s3 test location * Revert s3 experiment * Add more debug * Specify aws region * Remove debug, add prefix * Remove one more debug Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box> * workflows/benchmarking: increase timeout (#2294) * Rework `init` in pageserver CLI (#2272) * Do not create initial tenant and timeline (adjust Python tests for that) * Rework config handling during init, add --update-config to manage local config updates * Fix: Always build images (#2296) * Always build images * Remove unused Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box> * Move auto-generated 'bindings' to a separate inner module. Re-export only things that are used by other modules. In the future, I'm imagining that we run bindgen twice, for Postgres v14 and v15. The two sets of bindings would go into separate 'bindings_v14' and 'bindings_v15' modules. Rearrange postgres_ffi modules. Move function, to avoid Postgres version dependency in timelines.rs Move function to generate a logical-message WAL record to postgres_ffi. * fix cargo test * Fix walreceiver and safekeeper bugs (#2295) - There was an issue with zero commit_lsn `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver. - IDENTIFY_SYSTEM previously returned `flush_lsn` as a physical end of WAL. Now it returns `flush_lsn` (as it was) to walproposer and `commit_lsn` to everyone else including pageserver. - There was an issue with backoff where connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was in sleeping before establishing the connection. This is fixed by reworking retry logic. - There was an issue with getting `NoKeepAlives` reason in a loop. The issue is probably the same as the previous. - There was an issue with filtering safekeepers based on retry attempts, which could filter some safekeepers indefinetely. This is fixed by using retry cooldown duration instead of retry attempts. - Some `send_wal.rs` connections failed with errors without context. This is fixed by adding a timeline to safekeepers errors. New retry logic works like this: - Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment - When walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect. - When `last_record_lsn` was advanced using the WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly. * on safekeeper registration pass availability zone param (#2292) Co-authored-by: Kirill Bulatov <kirill@neon.tech> Co-authored-by: Rory de Zoete <33318916+zoete@users.noreply.github.com> Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box> Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box> Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech> Co-authored-by: Anton Galitsyn <agalitsyn@users.noreply.github.com>

SomeoneToIgnore force-pushed the kb/pageserver-no-init branch 2 times, most recently from 546c374 to 3c04ef5 Compare August 15, 2022 11:05

SomeoneToIgnore changed the title ~~WIP: remove init from pageserver~~ Remove init from pageserver CLI Aug 15, 2022

SomeoneToIgnore requested review from LizardWizzard and hlinnaka August 15, 2022 11:24

hlinnaka reviewed Aug 15, 2022

View reviewed changes

SomeoneToIgnore marked this pull request as ready for review August 15, 2022 14:33

SomeoneToIgnore requested a review from hlinnaka August 15, 2022 14:33

SomeoneToIgnore force-pushed the kb/pageserver-no-init branch 2 times, most recently from 25af7ab to beb5e40 Compare August 15, 2022 18:36

SomeoneToIgnore force-pushed the kb/pageserver-no-init branch from beb5e40 to 8b3d8b7 Compare August 15, 2022 20:29

hlinnaka added a commit that referenced this pull request Aug 16, 2022

WIP: on-demand download

775901b

- Merge with PR #2279 - add some quick workarounds in some tests that will go away with PR #2272 There's a lot of FIXMEs and TODOs and cleanup needed. But we can discuss the design.

LizardWizzard approved these changes Aug 17, 2022

View reviewed changes

ololobus reviewed Aug 17, 2022

View reviewed changes

control_plane/src/storage.rs Show resolved Hide resolved

Remove pageserver init and adjust its config updates

6895a6b

SomeoneToIgnore force-pushed the kb/pageserver-no-init branch from 6978b7e to 6895a6b Compare August 17, 2022 18:12

Add pageserver config init argument to be compatible with deployment

c5dc00f

SomeoneToIgnore force-pushed the kb/pageserver-no-init branch from 199d0d2 to c5dc00f Compare August 17, 2022 19:48

SomeoneToIgnore merged commit 67e091c into main Aug 17, 2022

SomeoneToIgnore deleted the kb/pageserver-no-init branch August 17, 2022 20:24

hlinnaka added a commit that referenced this pull request Aug 17, 2022

Silence another test failure that will be fixed properly by PR #2272

f1dc0d4

SomeoneToIgnore mentioned this pull request Aug 18, 2022

Use better defaults for pageserver Docker image #2302

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove `init` from pageserver CLI #2272

Remove `init` from pageserver CLI #2272

SomeoneToIgnore commented Aug 15, 2022 •

edited

Loading

hlinnaka left a comment

hlinnaka Aug 15, 2022

SomeoneToIgnore Aug 15, 2022

hlinnaka Aug 15, 2022

SomeoneToIgnore Aug 15, 2022

SomeoneToIgnore commented Aug 15, 2022

SomeoneToIgnore commented Aug 15, 2022

hlinnaka commented Aug 16, 2022

SomeoneToIgnore commented Aug 16, 2022

LizardWizzard left a comment •

edited

Loading

Remove init from pageserver CLI #2272

Remove init from pageserver CLI #2272

Conversation

SomeoneToIgnore commented Aug 15, 2022 • edited Loading

hlinnaka left a comment

Choose a reason for hiding this comment

hlinnaka Aug 15, 2022

Choose a reason for hiding this comment

SomeoneToIgnore Aug 15, 2022

Choose a reason for hiding this comment

hlinnaka Aug 15, 2022

Choose a reason for hiding this comment

SomeoneToIgnore Aug 15, 2022

Choose a reason for hiding this comment

SomeoneToIgnore commented Aug 15, 2022

SomeoneToIgnore commented Aug 15, 2022

hlinnaka commented Aug 16, 2022

SomeoneToIgnore commented Aug 16, 2022

LizardWizzard left a comment • edited Loading

Choose a reason for hiding this comment

Remove `init` from pageserver CLI #2272

Remove `init` from pageserver CLI #2272

SomeoneToIgnore commented Aug 15, 2022 •

edited

Loading

LizardWizzard left a comment •

edited

Loading