This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

Dmitri/3846 install unattended #408

Merged
merged 48 commits into master from dmitri/3846-install-unattended on Jun 3, 2019

Conversation


@a-palchikov a-palchikov commented May 17, 2019

This PR implements the first set of requirements toward a friendlier installation UX as described in this issue.
The PR is quite large. The main focus areas with the most significant changes are:

  • lib/install (and e/lib/install) has been refactored to extract the code implementing the installation workflow into lib/install/engine/{cli,interactive,ops}. This makes the workflow straightforward both to follow and to change, and does away with the installer mode.
  • the old lib/{install,expand} has become the core of the install/expand service implementation. The service hosts a gRPC server listening on a local unix domain socket (located in the temporary operation state directory); the client is responsible for setting the service up and afterwards merely receives progress updates from it. The service is configured to always restart and is explicitly shut down when the operation is aborted or completed (see the sketch after this list).
  • manual operation execution has also been moved to the service. This has multiple advantages, the main ones being consistency (it should not be possible to execute steps while the service is also executing the operation - i.e. only the service is responsible for the operation) and a simpler implementation (execute/rollback and plan completion become the same commands for install/join).
  • I opted for a simpler solution and placed the temporary operation state directory in the current working directory. This makes state handling a lot easier, and the fact that a node only runs a single operation at any given time speaks in its favor. I'm planning to have the install/join/update state directories stored in this one predictable location. The installer/agent will validate that the state directory is not on volatile storage to support resuming the operation after a node reboot.
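For illustration, a minimal sketch of the service side of this split: a gRPC server bound to a unix domain socket inside an operation state directory under the current working directory. The directory name, socket name and the commented-out registration call are assumptions made for the example, not the identifiers used in this PR.

```go
package main

import (
	"log"
	"net"
	"os"
	"path/filepath"

	"google.golang.org/grpc"
)

func main() {
	// The temporary operation state directory lives in the current working
	// directory (hypothetical name for the sketch).
	stateDir := filepath.Join(".", ".gravity-operation")
	if err := os.MkdirAll(stateDir, 0700); err != nil {
		log.Fatal(err)
	}
	// The service listens on a unix domain socket inside that directory;
	// the client connects to the same socket and streams progress updates.
	listener, err := net.Listen("unix", filepath.Join(stateDir, "installer.sock"))
	if err != nil {
		log.Fatal(err)
	}
	server := grpc.NewServer()
	// installpb.RegisterInstallerServer(server, newInstallerService()) // hypothetical registration
	log.Fatal(server.Serve(listener))
}
```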

⚠️ Note: I still have to run tests - I ran none during the past couple of days while I was finalizing the implementation, and the updates have changed things quite a bit. Also, Ops Center installation has not been tested.

Requires https://github.com/gravitational/gravity.e/pull/4082.
Fixes https://github.com/gravitational/gravity.e/issues/3846.

Gopkg.toml Outdated
@@ -151,6 +151,8 @@ ignored = [
version = "1.12.17"

[[constraint]]
# FIXME: this is using a version of boltdb that was patched directly in vendor
# to avoid hassle with dep. Make sure to transfer changes when bumping the version
Contributor Author

I ran into issues with teleport when I tried using it from a fork, so I gave up to free up time for the main work. Will need to revisit eventually, though.

Contributor

What is the issue with bolt / teleport? We should not patch things directly in vendor, that's for sure. Is it hard to fix?

@@ -98,7 +99,22 @@ func TeleportAuth(ctx context.Context, operator ops.Operator, proxyHost, cluster
if err != nil {
return nil, trace.Wrap(err)
}
return authClient, nil
return &AuthClient{
Contributor Author

I added this to reliably close the auth client since it was leaking connections, which prevented clean shutdown.
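A hedged sketch of the wrapper (field layout is illustrative): bundling the client with an io.Closer for the underlying connection lets callers release it deterministically instead of leaking connections.

```go
package example

import "io"

// AuthClient bundles the auth API client with a closer for its underlying
// connection; embedding io.Closer makes the wrapper itself closeable.
type AuthClient struct {
	io.Closer
	// Client is the actual auth API client (type elided in this sketch).
}

// Usage (sketch):
//
//	clt, err := TeleportAuth(ctx, operator, proxyHost, clusterName)
//	if err != nil {
//		return trace.Wrap(err)
//	}
//	defer clt.Close()
```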

// Implements signals.Stopper
func (p *Peer) Stop(ctx context.Context) error {
p.Info("Stop.")
// p.cancel()
Contributor Author

Forgot to remove this.

return nil, nil, trace.NotFound("no other agents joined yet")
}
}
// FIXME: this is not friends with the option when the node running the installer
Contributor Author

I failed to find a satisfactory solution for this: as it stands, it prevents the CLI installer from excluding itself from the cluster (i.e. the case when the installer uses the local host as an implicit future cluster node). @r0mant if you have a better idea - let's discuss.
On the flip side, I don't believe this is much of a problem if we ignore it. If an agent on another node joins at the same time as the single-node installer, that is the user's problem (since if the tokens are the same, it is joining legitimately).

Contributor

Maybe this was indeed solving an imaginary problem. Does everything work as expected if a "join" agent joins faster than "install" agent? If so, I believe we can remove this.

if err != nil {
listener.Close()
return nil, trace.Wrap(err)
}

if !config.SkipConnectValidation {
Contributor Author

Added this kludge because the installer runs an agent which might start before the agent server is up.

Contributor

Why can't it retry until the agent server is up?

Contributor Author

It does in general, but not this particular call, which fails on the first failure. I will disable it since I don't see value in this check.

@@ -76,7 +76,7 @@ echo "$(date) [INFO] Install agent will be using ${TMPDIR:-/tmp} for temporary f
%v
{{.service_user_env}}={{.service_uid}} \
{{.service_group_env}}={{.service_gid}} \
{{.gravity_bin_path}} {{if .devmode}}--insecure{{end}} --debug join {{.ops_url}} \
{{.gravity_bin_path}} {{if .devmode}}--insecure{{end}} --debug --httpprofile=localhost:0 join {{.ops_url}} \
Contributor Author

I had planned to keep profiling enabled permanently but need to figure out whether it really affects performance. It is a major help in troubleshooting otherwise.

Contributor

It's something to be careful of too, since customers tend to scan the software, and don't tend to like debug endpoints, even if they're bound to localhost.

@@ -238,15 +256,17 @@ func (r serverPeer) Reconnect(ctx context.Context) (Client, error) {

_, err = clt.PeerJoin(ctx, &pb.PeerJoinRequest{Addr: r.addr, Config: &r.config, SystemInfo: payload})
if err != nil {
return nil, &backoff.PermanentError{trace.Wrap(err)}
Contributor Author

I found it to be the wrong design that the peer's Reconnect was deciding whether to abort the reconnects - instead I moved this decision to the default ReconnectStrategy implementation.
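A sketch of that direction, assuming the cenkalti/backoff package already used in the snippet above; the connect and isTerminalError callbacks are illustrative stand-ins for the peer's dial logic and the strategy's terminal-condition check.

```go
package example

import "github.com/cenkalti/backoff"

// reconnectWithStrategy keeps retrying connect with exponential backoff and
// stops early only when the strategy classifies the error as terminal, by
// converting it into a PermanentError.
func reconnectWithStrategy(connect func() error, isTerminalError func(error) bool) error {
	op := func() error {
		err := connect()
		if err != nil && isTerminalError(err) {
			return &backoff.PermanentError{Err: err}
		}
		return err
	}
	return backoff.Retry(op, backoff.NewExponentialBackOff())
}
```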

// GravityInstallDir returns the location of the temporary state directory for
// the install/join operation.
// elems are appended to resulting path
func GravityInstallDir(elems ...string) (path string) {
Contributor Author
@a-palchikov a-palchikov May 17, 2019

I think using the working directory is superior to any fixed location and allows me to avoid ambiguity between services on the same node (i.e. the wizard + agent scenario). Let me know if you see any cases where this does not work.
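A minimal sketch of the idea, with a hypothetical directory name: the operation state resolves against the current working directory, so each unpacked installer keeps its own state next to it.

```go
package example

import "path/filepath"

// gravityInstallDir returns the temporary operation state directory rooted
// in the current working directory; elems are appended to the resulting path.
func gravityInstallDir(elems ...string) string {
	return filepath.Join(append([]string{".gravity"}, elems...)...)
}
```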

Contributor

The inconvenience with this I think is that you have to always run gravity from the same working dir. I.e. you may have installer in ~/tmp/xxx/yyy/zzz/installer directory and always have to cd into it to execute any command. This is not a huge deal but I think it may be a bit confusing/frustrating from UX point of view. As a user, I would expect that once you've started install/join, you should be able to invoke it from anywhere on the node.

Is there a problem with picking a stable sub-directory inside our "ephemeral" dir, for example?

Another idea is to write the "install/join working dir" location to our state file (/etc/.gravity I think it's called? the same place where the state dir is stored) - gravity CLI commands would read it first and determine the work dir this way.

@@ -146,11 +146,21 @@ func (s *systemdManager) installService(service serviceTemplate, noBlock bool) e
return trace.Wrap(err, "error rendering template")
}

if err := s.EnableService(service.Name); err != nil {
if req.Unmask {
Contributor Author

Like I said above, I will probably revert adding (un)masking.

@@ -76,6 +76,8 @@ type Application struct {
PlanExecuteCmd PlanExecuteCmd
// PlanRollbackCmd rolls back a phase of an active operation
PlanRollbackCmd PlanRollbackCmd
// ResumeCmd resumes active operation
ResumeCmd ResumeCmd
Contributor Author

I also tentatively added gravity resume since it was discussed and is also related to this PR.

Contributor
@r0mant r0mant left a comment

Feels like I'm about 30% through. Will continue tomorrow.


// AuthClient represents the client to the auth server
type AuthClient struct {
io.Closer
Contributor

I'm curious, what's the point of embedding io.Closer into a struct?

@@ -66,18 +66,18 @@ func Mount(disk string, out io.Writer, entry log.FieldLogger) (err error) {
}

// Unmount removes docker devicemapper environment
func Unmount(out io.Writer, entry *log.Entry) (err error) {
disk, err := queryPhysicalVolume(entry)
func Unmount(out io.Writer, logger log.FieldLogger) (err error) {
Contributor

Do we still need all this code? We removed devicemapper support AFAIR.

Contributor Author

Good point, but I feel this PR has already crossed the line on the amount of changes one can digest in one go. I'll file an issue to remove this.

errC <- p.server.Serve(p, listener)
}()
defer func() {
ctx, cancel := context.WithTimeout(context.Background(), defaults.ShutdownTimeout)
Contributor

Shouldn't this use peer's internal context? Also, would it make sense to pass shutdown timeout via peer config?

Contributor Author

It cannot use the peer's internal context, as that one is used for cancelling the peer and this code needs to run regardless. I'll move the timeout to the config, though.
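A small sketch of the pattern under discussion, assuming the shutdown timeout comes from the peer config as agreed: the cleanup gets a fresh context because the peer's own context has already been cancelled by the time this runs.

```go
package example

import (
	"context"
	"time"
)

// stopWithTimeout runs the RPC shutdown with its own bounded context rather
// than the peer's internal context, which is cancelled to stop the peer and
// would abort the cleanup prematurely.
func stopWithTimeout(stopRPC func(context.Context), shutdownTimeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
	defer cancel()
	stopRPC(ctx)
}
```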

defer func() {
ctx, cancel := context.WithTimeout(context.Background(), defaults.ShutdownTimeout)
p.stopRPC(ctx)
cancel()
Contributor

I'd do defer cancel() one line above.

Contributor Author

It does not really make that much sense here.

select {
case err := <-errC:
if exitErr := p.server.ExitError(); exitErr != nil {
err = exitErr
Contributor

Is the error that came thru the channel not important? Maybe use trace.NewAggregate.
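A sketch of the suggestion using gravitational/trace: NewAggregate drops nil entries, so it keeps whichever of the two errors is set (or both) instead of silently discarding the one from the serve loop.

```go
package example

import "github.com/gravitational/trace"

// combineServeErrors keeps both the error received from the serve loop and
// the server's recorded exit error; nil values are filtered out by
// trace.NewAggregate.
func combineServeErrors(serveErr, exitErr error) error {
	return trace.NewAggregate(serveErr, exitErr)
}
```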

log.FieldLogger
utils.Printer
*signals.InterruptHandler
ConnectStrategy
Contributor

Missing Godoc here and above.

if s, ok := status.FromError(err); ok && s.Code() == codes.Canceled {
return nil
}
if trace.Unwrap(err) == io.EOF {
Contributor

trace.IsEOF(err)
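A sketch of that classification, using trace.IsEOF as suggested: a cancelled RPC or an exhausted stream ends the progress loop without being reported as a failure.

```go
package example

import (
	"github.com/gravitational/trace"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isExpectedStreamEnd reports whether the error merely signals the end of
// the progress stream (client cancelled the RPC or the stream was closed).
func isExpectedStreamEnd(err error) bool {
	if s, ok := status.FromError(err); ok && s.Code() == codes.Canceled {
		return true
	}
	return trace.IsEOF(err)
}
```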

// ApplicationDir specifies the directory with installer files
ApplicationDir string
// Validate specifies the environment validation function.
// The service will only be installed when Validate returns true
Contributor

Nit: "when Validate returns nil"?

r.SocketPath = installpb.SocketPath()
}
if r.ConnectTimeout == 0 {
r.ConnectTimeout = 10 * time.Minute
Contributor

I feel like this (10 * time.Minute) should move to defaults - it was used somewhere above too.

Config
// FSMFactory specifies the state machine factory to use
FSMFactory engine.FSMFactory
// CLusterFactory specifies the cluster request factory to use
Contributor

Nit: CLuster -> Cluster

Contributor
@r0mant r0mant left a comment

Next chunk. One more to follow :)

return trace.BadParameter("invalid vxlan port: must be in range 1-65535")
}
if err := c.validateCloudConfig(); err != nil {
// Run runs the server operation using the specified engine
Contributor

Nit: I don't see any "specified engines" in this function.

// Implements Interface
func (i *Installer) NotifyOperationAvailable(op ops.SiteOperation) error {
if i.agent != nil {
i.startAgent(op)
Contributor

This method can return an error AFAICS.

if err := i.emitAuditEvents(i.ctx, operation); err != nil {
errors = append(errors, err)
}
// Explicitly stop agents only iff the operation has been completed successfully
Contributor

Nit: either "only if" or just "iff" (no "only") if you meant "if and only if".

}

// CompleteFinalInstallStep marks the final install step as completed unless
// the application has a custom install step - in which case it does nothing
Contributor

I don't see this check for the presence of custom install step in this method - is it checked elsewhere? If so, the comment is a bit misleading as it implies that this method does that.

// Implements Interface
func (i *Installer) CompleteFinalInstallStep(key ops.SiteOperationKey, delay time.Duration) error {
req := ops.CompleteFinalInstallStepRequest{
AccountID: defaults.SystemAccountID,
Contributor

key.AccountID?

func (p *Process) startWatchingReloadEvents(ctx context.Context, client *kubernetes.Clientset) error {
go func() {
// runReloadEventsWatch watches reload events and restarts the process.
func (p *Process) runReloadEventsWatch(client *kubernetes.Clientset) clusterService {
Contributor

Same as the above two comments about "cluster services". IIRC I introduced this "cluster service" concept specifically for things that should run only in the active gravity site. This should run in all of them.

Contributor Author

What's the problem with having more cluster services? I understand there are services that are only supposed to run on the leader, which they continue to do - I just generalized the rest of the similar "services" to have the same signature.

}

func DecodeError(err *Error) error {
return errors.New(err.Message)
}

func FormatPeerJoinRequest(req *PeerJoinRequest) string {
Contributor

Nit: Can't all these structs implement Stringer instead of having Format* methods?

Contributor Author

You're correct - I did not think about replacing the auto-generated String implementation with custom ones.
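A sketch of what that would look like (the struct is a stand-in for the generated message type and the fields are illustrative): implementing fmt.Stringer lets log statements format the value directly instead of calling a Format* helper.

```go
package example

import "fmt"

// PeerJoinRequest stands in for the generated message type.
type PeerJoinRequest struct {
	Addr string
	// remaining fields elided
}

// String implements fmt.Stringer, replacing the FormatPeerJoinRequest helper.
func (r *PeerJoinRequest) String() string {
	return fmt.Sprintf("PeerJoinRequest(addr=%v)", r.Addr)
}
```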


return trace.BadParameter("installer is running from a temporary file system.\n" +
"It is required to run the installer from a non-volatile location.")
}
var volatileDirectories = []string{"/tmp", "/var/tmp"}
Contributor

I'm pretty sure this will break installs for half of our customers :) I.e. I know for sure that two of our major customers run installers from somewhere in /tmp. Where did this requirement come from?

Contributor Author

This requirement comes from the fact that, in order to support resumption after the node is rebooted, the state needs to reside somewhere non-volatile. I was already thinking of making this a warning instead of a hard requirement, as installs can and do complete without the node being rebooted.

Contributor Author
@a-palchikov a-palchikov May 27, 2019

Regarding the choice of the operation state directory - this takes a cue from git's approach of storing state in a local .git directory, which I think is very flexible.
The fact that it forces the user to execute the binary from the same directory is not a disadvantage - contrast this with the requirement that update places on its operation. Why should this be different for an install or join operation?
Additionally, using a different location per operation is not ideal and leads to a proliferation of state, which we have to deal with when we try to compute the momentary operational state of the cluster - why do we need to do this if there's just one operation in flight at any given time? Why not keep cluster state in --state-dir and any active operation state in some-other-state-dir (which would be the same for any operation) so we don't need to consult 3-4 places to build the state view?

Having the temporary state directory written to a central directory (i.e. /etc/.gravity) as opposed to a local directory relative to where the application state is (i.e. unpacked data for install) is fragile with respect to updates if we need to differentiate the location between install/join/update/update_config/update_environ/etc. operation state directories.

Additionally, it is possible to run both the installer and the agent on a single node - this necessitates two services and two state directories. Should it also necessitate two resume commands, since it needs to know which process to resume?
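A sketch of the validation being discussed, assuming a prefix check against the volatile locations listed in the diff (the actual change also checks for tmpfs, per the commit notes); whether this is a hard error or a warning is the open question above.

```go
package example

import (
	"path/filepath"
	"strings"

	"github.com/gravitational/trace"
)

var volatileDirectories = []string{"/tmp", "/var/tmp"}

// validateNonVolatile rejects a state directory rooted in a location that
// does not survive a reboot, since the operation could not be resumed from it.
func validateNonVolatile(stateDir string) error {
	absDir, err := filepath.Abs(stateDir)
	if err != nil {
		return trace.Wrap(err)
	}
	for _, dir := range volatileDirectories {
		if absDir == dir || strings.HasPrefix(absDir, dir+string(filepath.Separator)) {
			return trace.BadParameter("state directory %v resides on volatile storage (%v) "+
				"and cannot be resumed after a reboot", stateDir, dir)
		}
	}
	return nil
}
```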

}

func validateDirectoryEmpty(stateDir string) error {
empty, err := utils.IsDirectoryEmpty(stateDir)
Contributor

This check for an empty /var/lib/gravity dir will very often fire when it's mounted on a separate disk, which has the default "lost+found" directory there. This will e.g. break installs with the AWS provisioner. I know because I tried something similar in the past :) I think the check for package state that you have below is enough.

Contributor Author
@a-palchikov a-palchikov May 27, 2019

It is not the state directory for the cluster but the temporary operation state directory. This check ensures that the operation cannot be reinitialized once started, but I realize this is poor UX - it would be much better to determine whether the operation is still ongoing and, if so, do nothing - this would, for instance, involve checking whether the parameters used to create the operation are the same and whether the operation is still in flight. I will try to accommodate that.
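For reference, a sketch of the emptiness helper referenced above; as the reviewer notes, a real check may have to tolerate artifacts such as lost+found on freshly formatted volumes, or be replaced by the is-the-operation-still-ongoing check described in the reply.

```go
package example

import (
	"io"
	"os"

	"github.com/gravitational/trace"
)

// isDirectoryEmpty reports whether the directory contains no entries.
func isDirectoryEmpty(dir string) (bool, error) {
	f, err := os.Open(dir)
	if err != nil {
		return false, trace.ConvertSystemError(err)
	}
	defer f.Close()
	// Reading a single name is enough to decide.
	if _, err := f.Readdirnames(1); err == io.EOF {
		return true, nil
	} else if err != nil {
		return false, trace.Wrap(err)
	}
	return false, nil
}
```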

@@ -101,6 +101,9 @@ type Config struct {
// ServiceUser specifies the service user to use to
// create a cluster with for wizard-based installation
ServiceUser systeminfo.User
// InstallToken specifies the token to install cluster with.
// The token is used to authenticate agents during the install operation
InstallToken string
Contributor

Hmm, why does webapi handler need some install token?

Contributor Author

The webapi handler also handles installation and needs to pass the original token to the cluster so it can be used when the install operation is created. The token itself is used to fix the issue with the wizard not properly authenticating agents.
I think I can do away with passing it into webapi by providing a wizard-specific Operator.

LocalApps: env.Apps,
LocalBackend: env.Backend,
LocalClusterClient: env.SiteOperator,

Contributor

Nit: Is this empty line intentional?

@@ -98,6 +93,8 @@ func RegisterCommands(app *kingpin.Application) *Application {
g.InstallCmd.GCENodeTags = g.InstallCmd.Flag("gce-node-tag", "Override node tag on the instance in GCE required for load balanacing. Defaults to cluster name.").Strings()
g.InstallCmd.DNSHosts = g.InstallCmd.Flag("dns-host", "Specify an IP address that will be returned for the given domain within the cluster. Accepts <domain>/<ip> format. Can be specified multiple times.").Hidden().Strings()
g.InstallCmd.DNSZones = g.InstallCmd.Flag("dns-zone", "Specify an upstream server for the given zone within the cluster. Accepts <zone>/<nameserver> format where <nameserver> can be either <ip> or <ip>:<port>. Can be specified multiple times.").Strings()
g.InstallCmd.ExcludeHostFromCluster = g.InstallCmd.Flag("exclude-from-cluster", "Do not use this node in the cluster").Bool()
Contributor

I'd consult with Ev on what to call this flag. (If you haven't already.)

Contributor Author

I haven't. Here's the comment in the corresponding issue: https://github.com/gravitational/gravity.e/issues/3846#issuecomment-496212814

@@ -98,6 +93,8 @@ func RegisterCommands(app *kingpin.Application) *Application {
g.InstallCmd.GCENodeTags = g.InstallCmd.Flag("gce-node-tag", "Override node tag on the instance in GCE required for load balanacing. Defaults to cluster name.").Strings()
g.InstallCmd.DNSHosts = g.InstallCmd.Flag("dns-host", "Specify an IP address that will be returned for the given domain within the cluster. Accepts <domain>/<ip> format. Can be specified multiple times.").Hidden().Strings()
g.InstallCmd.DNSZones = g.InstallCmd.Flag("dns-zone", "Specify an upstream server for the given zone within the cluster. Accepts <zone>/<nameserver> format where <nameserver> can be either <ip> or <ip>:<port>. Can be specified multiple times.").Strings()
g.InstallCmd.ExcludeHostFromCluster = g.InstallCmd.Flag("exclude-from-cluster", "Do not use this node in the cluster").Bool()
g.InstallCmd.FromService = g.InstallCmd.Flag("from-service", "Run in service mode").Hidden().Bool()
Contributor

Do we need this flag? Why is this not the only option? (I assume this is default?)

Contributor Author

This flag differentiates between the client and the service. It is not used directly by the user: the client process the user interacts with controls the service process, so there are two processes involved which appear to the user as one. But they have different modes of operation, hence the flag (see the sketch below).
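A sketch of the resulting two-process model (function names are illustrative): the same binary acts as a thin client by default and only runs the long-lived service loop when systemd starts it with the hidden --from-service flag.

```go
package example

// runInstall dispatches between the user-facing client and the systemd-run
// service based on the hidden flag.
func runInstall(fromService bool) error {
	if fromService {
		// Service mode: host the gRPC server and drive the operation.
		return runInstallerService()
	}
	// Client mode: install/start the systemd unit, then connect to its
	// socket and stream progress to the terminal.
	if err := installAndStartService(); err != nil {
		return err
	}
	return connectAndFollowProgress()
}

// Placeholders standing in for the real implementations.
func runInstallerService() error      { return nil }
func installAndStartService() error   { return nil }
func connectAndFollowProgress() error { return nil }
```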

g.JoinCmd.OperationID = g.JoinCmd.Flag("operation-id", "ID of the operation that was created via UI").Hidden().String()
g.JoinCmd.FromService = g.JoinCmd.Flag("from-service", "Run in service mode").Hidden().Bool()
Contributor

Same q as above about the necessity of this flag.

@a-palchikov a-palchikov force-pushed the dmitri/3846-install-unattended branch from 870168f to 7dc4753 Compare May 23, 2019 20:35
@@ -213,6 +214,9 @@ const (
// PeerConnectTimeout is the timeout of an RPC agent connecting to its peer
PeerConnectTimeout = 10 * time.Second

// ServiceConnectTimeout specifies the timeout for connecting to the installer service
ServiceConnectTimeout = 5 * time.Minute
Contributor

Just curious - why 5 minutes?

Also, if this is specific to the installer (as per the comment), maybe InstallServiceConnectTimeout?

Contributor Author

No real reason, this is more of a left-over from debugging - thanks for pointing it out. I'll shorten it.

// RPCAgentSecretsPackage specifies the name of the RPC credentials package
RPCAgentSecretsPackage = "rpcagent-secrets"

// ShutdownTimeout defines the maximum amount of time to wait for an operation to complete
Contributor

This isn't really just the amount of time to wait for an operation to complete is it? I assume based on the variable name, this is the amount of time to wait when a shutdown has been requested?

if err := config.CheckAndSetDefaults(); err != nil {
return nil, trace.Wrap(err)
}
server := server.New(ctx)
Contributor

I might just be missing a bit of context, why does each peer get its own instance of the gRPC installer server?

Contributor Author
@a-palchikov a-palchikov May 27, 2019

Yes, it seems confusing. Basically, a peer in the context of an expand operation is different from a peer in the context of an install operation (at least atm), so lib/expand ends up running two errands: for install, the server is not that relevant and I'd keep it for symmetry; for expand, it is required to run the expand operation.

// Execute executes the peer operation (join or just serving an agent).
// Implements server.Executor
func (p *Peer) Execute(phase *installpb.ExecuteRequest_Phase) (err error) {
p.Info("Execute.")
Contributor

Please include additional context in the log. What is being executed, where the request is coming from, stuff like that.

for _, addr := range p.Peers {
p.Debugf("Trying peer %v.", addr)
Contributor

Dialing Peer instead of trying Peer?

@@ -150,7 +150,7 @@ func (p *pullExecutor) pullUserApplication() error {
DstApp: p.LocalApps,
Package: *p.Phase.Data.Package,
})
if err != nil {
if err != nil && !trace.IsAlreadyExists(err) {
Contributor

Sometimes you point out really great stuff by commenting on your own PR, however, I feel this is a comment that is better suited in the source. It answers the why of a particular line of code and isn't a simple restatement of the code.

Why is AlreadyExists an acceptable error here?
Answer: Needs to be re-entrant.
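A sketch of how that in-source comment could read, using the trace helpers from the snippet: the pull phase has to be re-entrant, so a package that already exists locally is not treated as a failure.

```go
package example

import "github.com/gravitational/trace"

// pullPackage tolerates AlreadyExists so the phase stays re-entrant: a
// previous (possibly interrupted) run may have pulled the package already.
func pullPackage(pull func() error) error {
	if err := pull(); err != nil && !trace.IsAlreadyExists(err) {
		return trace.Wrap(err)
	}
	return nil
}
```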


@a-palchikov a-palchikov force-pushed the dmitri/3846-install-unattended branch 2 times, most recently from f356f29 to d481ca1 Compare May 31, 2019 18:35
 - Fix an issue with signal loop not cancelling the context
 - Unpack process assets into fixed location in ephemeral state
directory
 - Fix installer client connect
…rce flag which has a slightly different meaning (will force even a completed phase).

With the resume flag, only in-progress or failed phases are re-run.
 * Fix a couple of issues with signal handling by adding an explicit done channel. The channel is useful to ensure that all stoppers have finished.
Also fix the context having been already cancelled case by using an unrelated context for the cleanup.
 * Implement graceful stop for wizard and shut down installer gRPC server when requested.
 * Rewrite plan/operation handling code to be able to deal with an active/inactive wizard process. In the inactive case, the plan data is retrieved directly from
the state directory.
 * Simplify interrupt handling
…rk for automatic restarts.

  * Factored common gRPC code into lib/install/server.Server.
  * Moved system uninstall code to lib/system/cleanup to be able to reuse it with installer service
  * Termination handler logic separate for client/server
…s simplifies service handling w.r.t. the location of the socket file and the name of the service
 * Implement Abort in gRPC agents
 * Store service socket in ephemeral directory root
 * Mask service to prevent auto-restart upon completion
…per resume command (installer node vs join nodes)

  * Another pass at handling of completion and aborts. Uninstall services upon successful operation completion
  * Show proper error message when the operation is aborted and not a random error from the middle of the operation
  * Move agent from engine to installer and run it at start if necessary. This necessitated changes to reconnect logic
  * Make token specified for install be the provisioning token. This necessitated changes to provisioning tokens - install will only create a single long-lived expand token
  * Update peer reconnect strategy to check explicitly for all known terminal conditions
  * Fix tests
…t properly in 'plan resume'

 * Handle operation completion properly for interactive wizard (disconnect client, handle interrupt as shutdown instead of abort)
 * Move wizard state to wizard directory in ephemeral location
 * Revisit context cancellation vs waiting for completion. Do away with
using top-level context cancelled by interrupt handler - instead rely on
the handler cancellation behavior and use unbounded context where
necessary - cancellation/abort is controlled by the interrupt handler.
This fixes multiple issues with remote calls being cancelled half-way
when they need to complete.
 * Move all install-specific code to base repository and call into the
service for plan subcommands.
binary's directory. Using binary directory will only work for relative
invocations - absolute invocations will almost always use the wrong
directory for state.
 * Validate that the temporary state directory is not rooted in one of
the temporary locations (in addition to tmpfs).
…ttributes. Clean up operation state after completion
 * Revisit abort logic and move abort/completion handlers to client
where they're more meaningful.
 * Remove OpsCenter-based installer flow.
 * Use Aborters/Completers on the client side - this prompted extending
the server Event struct with a Status instead of a Completion flag to
support the interactive install flow when the user needs to terminate the
installer explicitly.
 * Updated bolt dependency to refer to a fork with changes to support no
timeout for access.
service.
 * Improve service init by ensuring no left-overs from previous services
configuration.
 * Revert changes to InterruptHandler to let it properly cancel the
external context.
 * Send completion event for expand.
service-driven execution does not yet work with agents (for install).
This will require restructuring lib/expand.
 * Load RPC credentials directly from the RPC package for install.
 * Improve handling of premature interrupts (at init stage)
 * Fix the problem with abort handler that did not properly uninstall
services.
@a-palchikov a-palchikov force-pushed the dmitri/3846-install-unattended branch from d481ca1 to 0e31d87 Compare June 3, 2019 14:38
@a-palchikov a-palchikov merged commit 7c0d01b into master Jun 3, 2019
@a-palchikov a-palchikov deleted the dmitri/3846-install-unattended branch June 3, 2019 17:22