Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove init from pageserver CLI #2272

Merged
merged 2 commits into from
Aug 17, 2022
Merged

Remove init from pageserver CLI #2272

merged 2 commits into from
Aug 17, 2022

Conversation

SomeoneToIgnore
Copy link
Contributor

@SomeoneToIgnore SomeoneToIgnore commented Aug 15, 2022

Removes pageserver init CLI command.

  • pageserver.toml config file is now only created or updated when pageserver start command gets --update-config flag.
    id is still a protected field that is not allowed to be updated in the existing config file

  • initial tenant and timeline creation is now only done in neon_local init, for that entire pageserver is started to serve two HTTP creation commands and then stopped.

Cloud does not use initial timelines at all, Python tests continue using the same "initial timeline" implicitly brought by the neon_local init invocation

@SomeoneToIgnore SomeoneToIgnore force-pushed the kb/pageserver-no-init branch 2 times, most recently from 546c374 to 3c04ef5 Compare August 15, 2022 11:05
@SomeoneToIgnore SomeoneToIgnore changed the title WIP: remove init from pageserver Remove init from pageserver CLI Aug 15, 2022
Copy link
Contributor

@hlinnaka hlinnaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a patch to my liking 😄 :

 9 files changed, 218 insertions(+), 340 deletions(-)

Does starting the pageserver binary now implicitly create the target directory if it doesn't already exist? If so, that seems a bit fragile, I think I'd prefer to get an error, if for example the repository directory doesn't exist because you forgot to mount the filesystem it's in.

control_plane/src/local_env.rs Show resolved Hide resolved
control_plane/src/storage.rs Outdated Show resolved Hide resolved
datadir: &Path,
update_config: bool,
) -> anyhow::Result<()> {
println!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the old code used print!() rather than println!(), so this will print an extra empty line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, yet I've found that more natural when dealing with failing pageserver:

with empty line:

Work/neon/neon kb/pageserver-no-init:origin/kb/pageserver-no-init*​ ⇡❯ rm -rf .neon/ && ./target/debug/neon_local init
Starting pageserver at '127.0.0.1:64000' in '.neon'
thread 'main' panicked at '((((((((((((((((', pageserver/src/bin/pageserver.rs:41:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
   2: pageserver::main
             at ./pageserver/src/bin/pageserver.rs:41:9
   3: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
pageserver init failed: Pageserver failed to start. See console output and '.neon/pageserver.log' for details.

without:

Work/neon/neon kb/pageserver-no-init:origin/kb/pageserver-no-init*​ ⇡❯ rm -rf .neon/ && ./target/debug/neon_local init
Starting pageserver at '127.0.0.1:64000' in '.neon'thread 'main' panicked at '((((((((((((((((', pageserver/src/bin/pageserver.rs:41:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
   2: pageserver::main
             at ./pageserver/src/bin/pageserver.rs:41:9
   3: core::ops::function::FnOnce::call_once
             at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
pageserver init failed: Pageserver failed to start. See console output and '.neon/pageserver.log' for details.

control_plane/src/storage.rs Outdated Show resolved Hide resolved
.http_request(
Method::POST,
format!("{}/tenant/{}/timeline", self.http_base_url, tenant_id),
) -> anyhow::Result<TimelineInfo> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option<TimelineInfo> is None according to our logic only if we fail to create a timeline:

None => json_response(StatusCode::CONFLICT, ())?,

So instead of doing various .unwrap() or other error handling above the client, I've simplified it to validating the presence here.

@SomeoneToIgnore SomeoneToIgnore marked this pull request as ready for review August 15, 2022 14:33
@SomeoneToIgnore
Copy link
Contributor Author

Does starting the pageserver binary now implicitly create the target directory if it doesn't already exist?

The tenants/ one? Yes, now we do crate it implicitly. Do you think that base path should be mounted instead anyway? We'll fail if that's not done, so either I don't understand it or it should be ok.

We could somehow alter --update-config to something more generic and allow dir creation along with that operation too, yet it slightly becomes to resemble --init flag now 🙂

@SomeoneToIgnore SomeoneToIgnore force-pushed the kb/pageserver-no-init branch 2 times, most recently from 25af7ab to beb5e40 Compare August 15, 2022 18:36
@SomeoneToIgnore
Copy link
Contributor Author

A PR to fix e2e tests: https://github.com/neondatabase/cloud/pull/2058
Other tests seem to flack for known, daemonize-related issues.

@hlinnaka
Copy link
Contributor

Does starting the pageserver binary now implicitly create the target directory if it doesn't already exist?

The tenants/ one? Yes, now we do crate it implicitly. Do you think that base path should be mounted instead anyway? We'll fail if that's not done, so either I don't understand it or it should be ok.

I meant the directory that contains tenants/. The data directory looks like this:

<path>/tenants/
<path>/pageserver.log
<path>/pageserver.pid
<path>/pageserver.toml

The log file is created on startup. The pid-file is also created. The config file isn't mandatory I guess. But the question is, if <path> itself doesn't exist, do we create it automatically, or error out? And if <path>/tenants doesn't already exist, do we create it or error out?

Imagine that pageserver runs on a host with two filesystems. The root filesystem, and 'data' that holds all the pageserver data. At server startup, we do this:

  1. mount /data
  2. pageserver --workdir /data/pageserver-data

What happens if the mount /data command fails? Does pageserver still start up and automatically create a new empty data directory as /data/pageserver-data/, now on the root filesystem? Or does it refuse to start? I think refusing to start is better.

We could somehow alter --update-config to something more generic and allow dir creation along with that operation too, yet it slightly becomes to resemble --init flag now slightly_smiling_face

Yeah, I think an explicit init option still makes sense. But without creating the initial tenant and timeline.

hlinnaka added a commit that referenced this pull request Aug 16, 2022
- Merge with PR #2279
- add some quick workarounds in some tests that will go away with PR #2272

There's a lot of FIXMEs and TODOs and cleanup needed. But we can discuss
the design.
@SomeoneToIgnore
Copy link
Contributor Author

I meant the directory that contains tenants/. The data directory looks like this:

Now I understand.
This is a workdir, the one that's being checked for existence implicitly, with .canonicalize:

let workdir = Path::new(arg_matches.value_of("workdir").unwrap_or(".neon"));
let workdir = workdir
.canonicalize()
.with_context(|| format!("Error opening workdir '{}'", workdir.display()))?;
let cfg_file_path = workdir.join("pageserver.toml");

If we try to canonicalize or do env::set_current_dir (we do that later below), we fail for non-existent path.
So we don't create the workdir, rather implicitly failing.

This means I can avoid changing the code to be --init, after all.

Copy link
Contributor

@LizardWizzard LizardWizzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

I’m for avoiding init and bailing when passed directory doesn’t exist. Though I’m not sure that this is followed by safekeepers, need to check

UPD: safekeeper will start even without passed datadir argument. It will fallback to current working directory. It is the simplest behavior, but if we want to guard against such config errors we better use the same behavior everywhere. Thoughts?

@SomeoneToIgnore SomeoneToIgnore merged commit 67e091c into main Aug 17, 2022
@SomeoneToIgnore SomeoneToIgnore deleted the kb/pageserver-no-init branch August 17, 2022 20:24
petuhovskiy added a commit that referenced this pull request Aug 18, 2022
* Check for entire range during sasl validation (#2281)

* Gen2 GH runner (#2128)

* Re-add rustup override

* Try s3 bucket

* Set git version

* Use v4 cache key to prevent problems

* Switch to v5 for key

* Add second rustup fix

* Rebase

* Add kaniko steps

* Fix typo and set compress level

* Disable global run default

* Specify shell for step

* Change approach with kaniko

* Try less verbose shell spec

* Add submodule pull

* Add promote step

* Adjust dependency chain

* Try default swap again

* Use env

* Don't override aws key

* Make kaniko build conditional

* Specify runs on

* Try without dependency link

* Try soft fail

* Use image with git

* Try passing to next step

* Fix duplicate

* Try other approach

* Try other approach

* Fix typo

* Try other syntax

* Set env

* Adjust setup

* Try step 1

* Add link

* Try global env

* Fix mistake

* Debug

* Try other syntax

* Try other approach

* Change order

* Move output one step down

* Put output up one level

* Try other syntax

* Skip build

* Try output

* Re-enable build

* Try other syntax

* Skip middle step

* Update check

* Try first step of dockerhub push

* Update needs dependency

* Try explicit dir

* Add missing package

* Try other approach

* Try other approach

* Specify region

* Use with

* Try other approach

* Add debug

* Try other approach

* Set region

* Follow AWS example

* Try github approach

* Skip Qemu

* Try stdin

* Missing steps

* Add missing close

* Add echo debug

* Try v2 endpoint

* Use v1 endpoint

* Try without quotes

* Revert

* Try crane

* Add debug

* Split steps

* Fix duplicate

* Add shell step

* Conform to options

* Add verbose flag

* Try single step

* Try workaround

* First request fails hunch

* Try bullseye image

* Try other approach

* Adjust verbose level

* Try previous step

* Add more debug

* Remove debug step

* Remove rogue indent

* Try with larger image

* Add build tag step

* Update workflow for testing

* Add tag step for test

* Remove unused

* Update dependency chain

* Add ownership fix

* Use matrix for promote

* Force update

* Force build

* Remove unused

* Add new image

* Add missing argument

* Update dockerfile copy

* Update Dockerfile

* Update clone

* Update dockerfile

* Go to correct folder

* Use correct format

* Update dockerfile

* Remove cd

* Debug find where we are

* Add debug on first step

* Changedir to postgres

* Set workdir

* Use v1 approach

* Use other dependency

* Try other approach

* Try other approach

* Update dockerfile

* Update approach

* Update dockerfile

* Update approach

* Update dockerfile

* Update dockerfile

* Add workspace hack

* Update Dockerfile

* Update Dockerfile

* Update Dockerfile

* Change last step

* Cleanup pull in prep for review

* Force build images

* Add condition for latest tagging

* Use pinned version

* Try without name value

* Remove more names

* Shorten names

* Add kaniko comments

* Pin kaniko

* Pin crane and ecr helper

* Up one level

* Switch to pinned tag for rust image

* Force update for test

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box>

* Add missing step output, revert one deploy step (#2285)

* Add missing step output, revert one deploy step

* Conform to syntax

* Update approach

* Add missing value

* Add missing needs

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Error for fatal not git repo (#2286)

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Use main, not branch for ref check (#2288)

* Use main, not branch for ref check

* Add more debug

* Count main, not head

* Try new approach

* Conform to syntax

* Update approach

* Get full history

* Skip checkout

* Cleanup debug

* Remove more debug

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Fix docker zombie process issue (#2289)

* Fix docker zombie process issue

* Init everywhere

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Fix 1.63 clippy lints (#2282)

* split out timeline metrics, track layer map loading and size calculation

* reset rust cache for clippy run to avoid an ICE

additionally remove trailing whitespaces

* Rename pg_control_ffi.h to bindgen_deps.h, for clarity.

The pg_control_ffi.h name implies that it only includes stuff related to
pg_control.h. That's mostly true currently, but really the point of the
file is to include everything that we need to generate Rust definitions
from.

* Make local mypy behave like CI mypy (#2291)

* Fix flaky pageserver restarts in tests (#2261)

* Remove extra type aliases (#2280)

* Update cachepot endpoint (#2290)

* Update cachepot endpoint

* Update dockerfile & remove env

* Update image building process

* Cannot use metadata endpoint for this

* Update workflow

* Conform to kaniko syntax

* Update syntax

* Update approach

* Update dockerfiles

* Force update

* Update dockerfiles

* Update dockerfile

* Cleanup dockerfiles

* Update s3 test location

* Revert s3 experiment

* Add more debug

* Specify aws region

* Remove debug, add prefix

* Remove one more debug

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* workflows/benchmarking: increase timeout (#2294)

* Rework `init` in pageserver CLI  (#2272)

* Do not create initial tenant and timeline (adjust Python tests for that)
* Rework config handling during init, add --update-config to manage local config updates

* Fix: Always build images (#2296)

* Always build images

* Remove unused

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Move auto-generated 'bindings' to a separate inner module.

Re-export only things that are used by other modules.

In the future, I'm imagining that we run bindgen twice, for Postgres
v14 and v15. The two sets of bindings would go into separate
'bindings_v14' and 'bindings_v15' modules.

Rearrange postgres_ffi modules.

Move function, to avoid Postgres version dependency in timelines.rs
Move function to generate a logical-message WAL record to postgres_ffi.

* fix cargo test

* Fix walreceiver and safekeeper bugs (#2295)

- There was an issue with zero commit_lsn `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver.
- IDENTIFY_SYSTEM previously returned `flush_lsn` as a physical end of WAL. Now it returns `flush_lsn` (as it was) to walproposer and `commit_lsn` to everyone else including pageserver.
- There was an issue with backoff where connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was in sleeping before establishing the connection. This is fixed by reworking retry logic.
- There was an issue with getting `NoKeepAlives` reason in a loop. The issue is probably the same as the previous.
- There was an issue with filtering safekeepers based on retry attempts, which could filter some safekeepers indefinetely. This is fixed by using retry cooldown duration instead of retry attempts.
- Some `send_wal.rs` connections failed with errors without context. This is fixed by adding a timeline to safekeepers errors.

New retry logic works like this:
- Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment
- When walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect.
- When `last_record_lsn` was advanced using the WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly.

* on safekeeper registration pass availability zone param (#2292)

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Rory de Zoete <33318916+zoete@users.noreply.github.com>
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Anton Galitsyn <agalitsyn@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants