This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

Dmitri/3846 install unattended #408

Merged
merged 48 commits into master from dmitri/3846-install-unattended on Jun 3, 2019

Conversation


@a-palchikov a-palchikov commented May 17, 2019

This PR implements the first set of requirements toward a friendlier installation UX as described in this issue.
The PR is quite large. The main focus areas with the most significant changes are:

  • lib/install (and e/lib/install) has been refactored to extract the code implementing the installation workflow into lib/install/engine/{cli,interactive,ops}. This makes the workflow straightforward both to follow and to change, and does away with the installer mode.
  • the old lib/{install,expand} has become the core of the install/expand service implementation. The service hosts a gRPC server listening on a local unix domain socket (located in the temporary operation state directory); the client is responsible for setting the service up and afterwards merely receives progress updates from it. The service is configured to always restart and is explicitly shut down when the operation is aborted or completed (see the sketch after this list).
  • manual operation execution has also been moved to the service. This has multiple advantages, the main ones being consistency (it should not be possible to execute steps while the service is also executing the operation - i.e. only the service is responsible for the operation) and a simpler implementation (execute/rollback and plan completion become the same commands for install/join).
  • I opted for a simpler solution and placed the temporary operation state directory in the current working directory. This makes state handling a lot easier, and the fact that a node only runs a single operation at any given time speaks in its favor. I'm planning to have the install/join/update state directories stored in this one predictable location. The installer/agent will validate that the state directory is not on volatile storage to support resuming the operation after a node reboot.
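For illustration, a minimal sketch of the service side of this split: a gRPC server bound to a unix domain socket inside an operation state directory under the current working directory. The directory name, socket name and the commented-out registration call are assumptions made for the example, not the identifiers used in this PR.

```go
package main

import (
	"log"
	"net"
	"os"
	"path/filepath"

	"google.golang.org/grpc"
)

func main() {
	// The temporary operation state directory lives in the current working
	// directory (hypothetical name for the sketch).
	stateDir := filepath.Join(".", ".gravity-operation")
	if err := os.MkdirAll(stateDir, 0700); err != nil {
		log.Fatal(err)
	}
	// The service listens on a unix domain socket inside that directory;
	// the client connects to the same socket and streams progress updates.
	listener, err := net.Listen("unix", filepath.Join(stateDir, "installer.sock"))
	if err != nil {
		log.Fatal(err)
	}
	server := grpc.NewServer()
	// installpb.RegisterInstallerServer(server, newInstallerService()) // hypothetical registration
	log.Fatal(server.Serve(listener))
}
```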

⚠️ Note: I still have to run tests - I ran none during the past couple of days while I was finalizing the implementation, and the updates have changed things quite a bit. Also, Ops Center installation has not been tested.

Requires https://github.com/gravitational/gravity.e/pull/4082.
Fixes https://github.com/gravitational/gravity.e/issues/3846.

Gopkg.toml Outdated
@@ -151,6 +151,8 @@ ignored = [
version = "1.12.17"

[[constraint]]
# FIXME: this is using a version of boltdb that was patched directly in vendor
# to avoid hassle with dep. Make sure to transfer changes when bumping the version
Contributor Author

I ran into issues with teleport when I tried using it from a fork, so I gave up to free up time for the main work. Will need to revisit eventually, though.

Contributor

What is the issue with bolt / teleport? We should not patch things directly in vendor, that's for sure. Is it hard to fix?

@@ -98,7 +99,22 @@ func TeleportAuth(ctx context.Context, operator ops.Operator, proxyHost, cluster
if err != nil {
return nil, trace.Wrap(err)
}
return authClient, nil
return &AuthClient{
Contributor Author

I added this to reliably close the auth client since it was leaking connections, which prevented clean shutdown.
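A hedged sketch of the wrapper (field layout is illustrative): bundling the client with an io.Closer for the underlying connection lets callers release it deterministically instead of leaking connections.

```go
package example

import "io"

// AuthClient bundles the auth API client with a closer for its underlying
// connection; embedding io.Closer makes the wrapper itself closeable.
type AuthClient struct {
	io.Closer
	// Client is the actual auth API client (type elided in this sketch).
}

// Usage (sketch):
//
//	clt, err := TeleportAuth(ctx, operator, proxyHost, clusterName)
//	if err != nil {
//		return trace.Wrap(err)
//	}
//	defer clt.Close()
```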

// Implements signals.Stopper
func (p *Peer) Stop(ctx context.Context) error {
p.Info("Stop.")
// p.cancel()
Contributor Author

Forgot to remove this.

return nil, nil, trace.NotFound("no other agents joined yet")
}
}
// FIXME: this is not friends with the option when the node running the installer
Contributor Author

I failed to find a satisfactory solution for this: as it stands, it prevents the CLI installer from excluding itself from the cluster (i.e. the case when the installer uses the local host as an implicit future cluster node). @r0mant if you have a better idea - let's discuss.
On the flip side, I don't believe this is much of a problem if we ignore it. If an agent on another node joins at the same time as the single-node installer, that is the user's problem (since if the tokens are the same, it is joining legitimately).

Contributor

Maybe this was indeed solving an imaginary problem. Does everything work as expected if a "join" agent joins faster than "install" agent? If so, I believe we can remove this.

if err != nil {
listener.Close()
return nil, trace.Wrap(err)
}

if !config.SkipConnectValidation {
Contributor Author

Added this kludge because the installer runs an agent which might start before the agent server is up.

Contributor

Why can't it retry until the agent server is up?

Contributor Author

It does in general, but not this particular call, which fails on the first failure. I will disable it since I don't see value in this check.

@@ -76,7 +76,7 @@ echo "$(date) [INFO] Install agent will be using ${TMPDIR:-/tmp} for temporary f
%v
{{.service_user_env}}={{.service_uid}} \
{{.service_group_env}}={{.service_gid}} \
{{.gravity_bin_path}} {{if .devmode}}--insecure{{end}} --debug join {{.ops_url}} \
{{.gravity_bin_path}} {{if .devmode}}--insecure{{end}} --debug --httpprofile=localhost:0 join {{.ops_url}} \
Contributor Author

I had planned to keep profiling enabled permanently but need to figure out whether it really affects performance. It is a major help in troubleshooting otherwise.

Contributor

It's something to be careful of too, since customers tend to scan the software, and don't tend to like debug endpoints, even if they're bound to localhost.

@@ -238,15 +256,17 @@ func (r serverPeer) Reconnect(ctx context.Context) (Client, error) {

_, err = clt.PeerJoin(ctx, &pb.PeerJoinRequest{Addr: r.addr, Config: &r.config, SystemInfo: payload})
if err != nil {
return nil, &backoff.PermanentError{trace.Wrap(err)}
Contributor Author

I found it to be the wrong design that the peer's Reconnect was deciding whether to abort the reconnects - instead I moved this decision to the default ReconnectStrategy implementation.
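A sketch of that direction, assuming the cenkalti/backoff package already used in the snippet above; the connect and isTerminalError callbacks are illustrative stand-ins for the peer's dial logic and the strategy's terminal-condition check.

```go
package example

import "github.com/cenkalti/backoff"

// reconnectWithStrategy keeps retrying connect with exponential backoff and
// stops early only when the strategy classifies the error as terminal, by
// converting it into a PermanentError.
func reconnectWithStrategy(connect func() error, isTerminalError func(error) bool) error {
	op := func() error {
		err := connect()
		if err != nil && isTerminalError(err) {
			return &backoff.PermanentError{Err: err}
		}
		return err
	}
	return backoff.Retry(op, backoff.NewExponentialBackOff())
}
```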

// GravityInstallDir returns the location of the temporary state directory for
// the install/join operation.
// elems are appended to resulting path
func GravityInstallDir(elems ...string) (path string) {
Contributor Author
@a-palchikov a-palchikov May 17, 2019

I think using the working directory is superior to any fixed location and allows me to avoid ambiguity between services on the same node (i.e. the wizard + agent scenario). Let me know if you see any cases where this does not work.
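A minimal sketch of the idea, with a hypothetical directory name: the operation state resolves against the current working directory, so each unpacked installer keeps its own state next to it.

```go
package example

import "path/filepath"

// gravityInstallDir returns the temporary operation state directory rooted
// in the current working directory; elems are appended to the resulting path.
func gravityInstallDir(elems ...string) string {
	return filepath.Join(append([]string{".gravity"}, elems...)...)
}
```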

Contributor

The inconvenience with this I think is that you have to always run gravity from the same working dir. I.e. you may have installer in ~/tmp/xxx/yyy/zzz/installer directory and always have to cd into it to execute any command. This is not a huge deal but I think it may be a bit confusing/frustrating from UX point of view. As a user, I would expect that once you've started install/join, you should be able to invoke it from anywhere on the node.

Is there a problem with picking a stable sub-directory inside our "ephemeral" dir, for example?

Another idea is to write the "install/join working dir" location to our state file (/etc/.gravity I think it's called? the same place where the state dir is stored) - gravity CLI commands would read it first and determine the work dir this way.

@@ -146,11 +146,21 @@ func (s *systemdManager) installService(service serviceTemplate, noBlock bool) e
return trace.Wrap(err, "error rendering template")
}

if err := s.EnableService(service.Name); err != nil {
if req.Unmask {
Contributor Author

Like I said above, I will probably revert adding (un)masking.

@@ -76,6 +76,8 @@ type Application struct {
PlanExecuteCmd PlanExecuteCmd
// PlanRollbackCmd rolls back a phase of an active operation
PlanRollbackCmd PlanRollbackCmd
// ResumeCmd resumes active operation
ResumeCmd ResumeCmd
Contributor Author

I also tentatively added gravity resume since it was discussed and is also related to this PR.

Contributor
@r0mant r0mant left a comment

Feels like I'm about 30% through. Will continue tomorrow.


// AuthClient represents the client to the auth server
type AuthClient struct {
io.Closer
Contributor

I'm curious, what's the point of embedding io.Closer into a struct?

@@ -66,18 +66,18 @@ func Mount(disk string, out io.Writer, entry log.FieldLogger) (err error) {
}

// Unmount removes docker devicemapper environment
func Unmount(out io.Writer, entry *log.Entry) (err error) {
disk, err := queryPhysicalVolume(entry)
func Unmount(out io.Writer, logger log.FieldLogger) (err error) {
Contributor

Do we still need all this code? We removed devicemapper support AFAIR.

Contributor Author

Good point, but I feel this PR has already crossed the line on the amount of changes one can digest in one go. I'll file an issue to remove this.

errC <- p.server.Serve(p, listener)
}()
defer func() {
ctx, cancel := context.WithTimeout(context.Background(), defaults.ShutdownTimeout)
Contributor

Shouldn't this use peer's internal context? Also, would it make sense to pass shutdown timeout via peer config?

Contributor Author

It cannot use the peer's internal context, as that one is used for cancelling the peer and this code needs to run regardless. I'll move the timeout to the config, though.
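A small sketch of the pattern under discussion, assuming the shutdown timeout comes from the peer config as agreed: the cleanup gets a fresh context because the peer's own context has already been cancelled by the time this runs.

```go
package example

import (
	"context"
	"time"
)

// stopWithTimeout runs the RPC shutdown with its own bounded context rather
// than the peer's internal context, which is cancelled to stop the peer and
// would abort the cleanup prematurely.
func stopWithTimeout(stopRPC func(context.Context), shutdownTimeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
	defer cancel()
	stopRPC(ctx)
}
```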

defer func() {
ctx, cancel := context.WithTimeout(context.Background(), defaults.ShutdownTimeout)
p.stopRPC(ctx)
cancel()
Contributor

I'd do defer cancel() one line above.

Contributor Author

It does not really make that much sense here.

select {
case err := <-errC:
if exitErr := p.server.ExitError(); exitErr != nil {
err = exitErr
Contributor

Is the error that came thru the channel not important? Maybe use trace.NewAggregate.
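A sketch of the suggestion using gravitational/trace: NewAggregate drops nil entries, so it keeps whichever of the two errors is set (or both) instead of silently discarding the one from the serve loop.

```go
package example

import "github.com/gravitational/trace"

// combineServeErrors keeps both the error received from the serve loop and
// the server's recorded exit error; nil values are filtered out by
// trace.NewAggregate.
func combineServeErrors(serveErr, exitErr error) error {
	return trace.NewAggregate(serveErr, exitErr)
}
```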

log.FieldLogger
utils.Printer
*signals.InterruptHandler
ConnectStrategy
Contributor

Missing Godoc here and above.

if s, ok := status.FromError(err); ok && s.Code() == codes.Canceled {
return nil
}
if trace.Unwrap(err) == io.EOF {
Contributor

trace.IsEOF(err)
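A sketch of that classification, using trace.IsEOF as suggested: a cancelled RPC or an exhausted stream ends the progress loop without being reported as a failure.

```go
package example

import (
	"github.com/gravitational/trace"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isExpectedStreamEnd reports whether the error merely signals the end of
// the progress stream (client cancelled the RPC or the stream was closed).
func isExpectedStreamEnd(err error) bool {
	if s, ok := status.FromError(err); ok && s.Code() == codes.Canceled {
		return true
	}
	return trace.IsEOF(err)
}
```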

// ApplicationDir specifies the directory with installer files
ApplicationDir string
// Validate specifies the environment validation function.
// The service will only be installed when Validate returns true
Contributor

Nit: "when Validate returns nil"?

r.SocketPath = installpb.SocketPath()
}
if r.ConnectTimeout == 0 {
r.ConnectTimeout = 10 * time.Minute
Contributor

I feel like this (10 * time.Minute) should move to defaults - it was used somewhere above too.

Config
// FSMFactory specifies the state machine factory to use
FSMFactory engine.FSMFactory
// CLusterFactory specifies the cluster request factory to use
Contributor

Nit: CLuster -> Cluster

Contributor
@r0mant r0mant left a comment

Next chunk. One more to follow :)

return trace.BadParameter("invalid vxlan port: must be in range 1-65535")
}
if err := c.validateCloudConfig(); err != nil {
// Run runs the server operation using the specified engine
Contributor

Nit: I don't see any "specified engines" in this function.

// Implements Interface
func (i *Installer) NotifyOperationAvailable(op ops.SiteOperation) error {
if i.agent != nil {
i.startAgent(op)
Contributor

This method can return an error AFAICS.

if err := i.emitAuditEvents(i.ctx, operation); err != nil {
errors = append(errors, err)
}
// Explicitly stop agents only iff the operation has been completed successfully
Contributor

Nit: either "only if" or just "iff" (no "only") if you meant "if and only if".

}

// CompleteFinalInstallStep marks the final install step as completed unless
// the application has a custom install step - in which case it does nothing
Contributor

I don't see this check for the presence of custom install step in this method - is it checked elsewhere? If so, the comment is a bit misleading as it implies that this method does that.

// Implements Interface
func (i *Installer) CompleteFinalInstallStep(key ops.SiteOperationKey, delay time.Duration) error {
req := ops.CompleteFinalInstallStepRequest{
AccountID: defaults.SystemAccountID,
Contributor

key.AccountID?

func (p *Process) startWatchingReloadEvents(ctx context.Context, client *kubernetes.Clientset) error {
go func() {
// runReloadEventsWatch watches reload events and restarts the process.
func (p *Process) runReloadEventsWatch(client *kubernetes.Clientset) clusterService {
Contributor

Same as the above two comments about "cluster services". IIRC I introduced this "cluster service" concept specifically for things that should run only in the active gravity site. This should run in all of them.

Contributor Author

What's the problem with having more cluster services? I understand there are services that are only supposed to run on the leader, which they continue to do - I just generalized the rest of the similar "services" to have the same signature.

}

func DecodeError(err *Error) error {
return errors.New(err.Message)
}

func FormatPeerJoinRequest(req *PeerJoinRequest) string {
Contributor

Nit: Can't all these structs implement Stringer instead of having Format* methods?

Contributor Author

You're correct - I did not think about replacing the auto-generated String implementation with custom ones.
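A sketch of what that would look like (the struct is a stand-in for the generated message type and the fields are illustrative): implementing fmt.Stringer lets log statements format the value directly instead of calling a Format* helper.

```go
package example

import "fmt"

// PeerJoinRequest stands in for the generated message type.
type PeerJoinRequest struct {
	Addr string
	// remaining fields elided
}

// String implements fmt.Stringer, replacing the FormatPeerJoinRequest helper.
func (r *PeerJoinRequest) String() string {
	return fmt.Sprintf("PeerJoinRequest(addr=%v)", r.Addr)
}
```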


return trace.BadParameter("installer is running from a temporary file system.\n" +
"It is required to run the installer from a non-volatile location.")
}
var volatileDirectories = []string{"/tmp", "/var/tmp"}
Contributor

I'm pretty sure this will break installs for half of our customers :) I.e. I know for sure that two of our major customers run installers from somewhere in /tmp. Where did this requirement come from?

Contributor Author

This requirement comes from the fact that, in order to support resumption after the node is rebooted, the state needs to reside somewhere non-volatile. I was already thinking of making this a warning instead of a hard requirement, as installs can and do complete without the node being rebooted.

Contributor Author
@a-palchikov a-palchikov May 27, 2019

Regarding the choice of the operation state directory - this takes a cue from git's approach of storing state in a local .git directory, which I think is very flexible.
The fact that it forces the user to execute the binary from the same directory is not a disadvantage - contrast this with the requirement that update places on its operation. Why should this be different for an install or join operation?
Additionally, using a different location per operation is not ideal and leads to a proliferation of state, which we have to deal with when we try to compute the momentary operational state of the cluster - why do we need to do this if there's just one operation in flight at any given time? Why not keep cluster state in --state-dir and any active operation state in some-other-state-dir (which would be the same for any operation) so we don't need to consult 3-4 places to build the state view?

Having the temporary state directory written to a central directory (i.e. /etc/.gravity) as opposed to a local directory relative to where the application state is (i.e. unpacked data for install) is fragile with respect to updates if we need to differentiate the location between install/join/update/update_config/update_environ/etc. operation state directories.

Additionally, it is possible to run both the installer and the agent on a single node - this necessitates two services and two state directories. Should it also necessitate two resume commands, since it needs to know which process to resume?
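A sketch of the validation being discussed, assuming a prefix check against the volatile locations listed in the diff (the actual change also checks for tmpfs, per the commit notes); whether this is a hard error or a warning is the open question above.

```go
package example

import (
	"path/filepath"
	"strings"

	"github.com/gravitational/trace"
)

var volatileDirectories = []string{"/tmp", "/var/tmp"}

// validateNonVolatile rejects a state directory rooted in a location that
// does not survive a reboot, since the operation could not be resumed from it.
func validateNonVolatile(stateDir string) error {
	absDir, err := filepath.Abs(stateDir)
	if err != nil {
		return trace.Wrap(err)
	}
	for _, dir := range volatileDirectories {
		if absDir == dir || strings.HasPrefix(absDir, dir+string(filepath.Separator)) {
			return trace.BadParameter("state directory %v resides on volatile storage (%v) "+
				"and cannot be resumed after a reboot", stateDir, dir)
		}
	}
	return nil
}
```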

}

func validateDirectoryEmpty(stateDir string) error {
empty, err := utils.IsDirectoryEmpty(stateDir)
Contributor

This check for an empty /var/lib/gravity dir will very often fire when it's mounted on a separate disk, which has the default "lost+found" directory there. This will e.g. break installs with the AWS provisioner. I know because I tried something similar in the past :) I think the check for package state that you have below is enough.

Contributor Author
@a-palchikov a-palchikov May 27, 2019

It is not the state directory for the cluster but the temporary operation state directory. This check ensures that the operation cannot be reinitialized once started, but I realize this is poor UX - it would be much better to determine whether the operation is still ongoing and, if so, do nothing - this would, for instance, involve checking whether the parameters used to create the operation are the same and whether the operation is still in flight. I will try to accommodate that.
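For reference, a sketch of the emptiness helper referenced above; as the reviewer notes, a real check may have to tolerate artifacts such as lost+found on freshly formatted volumes, or be replaced by the is-the-operation-still-ongoing check described in the reply.

```go
package example

import (
	"io"
	"os"

	"github.com/gravitational/trace"
)

// isDirectoryEmpty reports whether the directory contains no entries.
func isDirectoryEmpty(dir string) (bool, error) {
	f, err := os.Open(dir)
	if err != nil {
		return false, trace.ConvertSystemError(err)
	}
	defer f.Close()
	// Reading a single name is enough to decide.
	if _, err := f.Readdirnames(1); err == io.EOF {
		return true, nil
	} else if err != nil {
		return false, trace.Wrap(err)
	}
	return false, nil
}
```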

@@ -101,6 +101,9 @@ type Config struct {
// ServiceUser specifies the service user to use to
// create a cluster with for wizard-based installation
ServiceUser systeminfo.User
// InstallToken specifies the token to install cluster with.
// The token is used to authenticate agents during the install operation
InstallToken string
Contributor

Hmm, why does webapi handler need some install token?

Contributor Author

The webapi handler also handles installation and needs to pass the original token to the cluster so it can be used when the install operation is created. The token itself is used to fix the issue with the wizard not properly authenticating agents.
I think I can do away with passing it into webapi by providing a wizard-specific Operator.

LocalApps: env.Apps,
LocalBackend: env.Backend,
LocalClusterClient: env.SiteOperator,

Contributor

Nit: Is this empty line intentional?

@@ -98,6 +93,8 @@ func RegisterCommands(app *kingpin.Application) *Application {
g.InstallCmd.GCENodeTags = g.InstallCmd.Flag("gce-node-tag", "Override node tag on the instance in GCE required for load balanacing. Defaults to cluster name.").Strings()
g.InstallCmd.DNSHosts = g.InstallCmd.Flag("dns-host", "Specify an IP address that will be returned for the given domain within the cluster. Accepts <domain>/<ip> format. Can be specified multiple times.").Hidden().Strings()
g.InstallCmd.DNSZones = g.InstallCmd.Flag("dns-zone", "Specify an upstream server for the given zone within the cluster. Accepts <zone>/<nameserver> format where <nameserver> can be either <ip> or <ip>:<port>. Can be specified multiple times.").Strings()
g.InstallCmd.ExcludeHostFromCluster = g.InstallCmd.Flag("exclude-from-cluster", "Do not use this node in the cluster").Bool()
Contributor

I'd consult with Ev on what to call this flag. (If you haven't already.)

Contributor Author

I haven't. Here's the comment in the corresponding issue: https://github.com/gravitational/gravity.e/issues/3846#issuecomment-496212814

@@ -98,6 +93,8 @@ func RegisterCommands(app *kingpin.Application) *Application {
g.InstallCmd.GCENodeTags = g.InstallCmd.Flag("gce-node-tag", "Override node tag on the instance in GCE required for load balanacing. Defaults to cluster name.").Strings()
g.InstallCmd.DNSHosts = g.InstallCmd.Flag("dns-host", "Specify an IP address that will be returned for the given domain within the cluster. Accepts <domain>/<ip> format. Can be specified multiple times.").Hidden().Strings()
g.InstallCmd.DNSZones = g.InstallCmd.Flag("dns-zone", "Specify an upstream server for the given zone within the cluster. Accepts <zone>/<nameserver> format where <nameserver> can be either <ip> or <ip>:<port>. Can be specified multiple times.").Strings()
g.InstallCmd.ExcludeHostFromCluster = g.InstallCmd.Flag("exclude-from-cluster", "Do not use this node in the cluster").Bool()
g.InstallCmd.FromService = g.InstallCmd.Flag("from-service", "Run in service mode").Hidden().Bool()
Contributor

Do we need this flag? Why is this not the only option? (I assume this is default?)

Contributor Author

This flag differentiates between the client and the service. It is not used directly by the user: the client process the user interacts with controls the service process, so there are two processes involved which appear to the user as one. But they have different modes of operation, hence the flag (see the sketch below).
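A sketch of the resulting two-process model (function names are illustrative): the same binary acts as a thin client by default and only runs the long-lived service loop when systemd starts it with the hidden --from-service flag.

```go
package example

// runInstall dispatches between the user-facing client and the systemd-run
// service based on the hidden flag.
func runInstall(fromService bool) error {
	if fromService {
		// Service mode: host the gRPC server and drive the operation.
		return runInstallerService()
	}
	// Client mode: install/start the systemd unit, then connect to its
	// socket and stream progress to the terminal.
	if err := installAndStartService(); err != nil {
		return err
	}
	return connectAndFollowProgress()
}

// Placeholders standing in for the real implementations.
func runInstallerService() error      { return nil }
func installAndStartService() error   { return nil }
func connectAndFollowProgress() error { return nil }
```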

g.JoinCmd.OperationID = g.JoinCmd.Flag("operation-id", "ID of the operation that was created via UI").Hidden().String()
g.JoinCmd.FromService = g.JoinCmd.Flag("from-service", "Run in service mode").Hidden().Bool()
Contributor

Same q as above about the necessity of this flag.

@a-palchikov a-palchikov force-pushed the dmitri/3846-install-unattended branch from 870168f to 7dc4753 Compare May 23, 2019 20:35
@@ -213,6 +214,9 @@ const (
// PeerConnectTimeout is the timeout of an RPC agent connecting to its peer
PeerConnectTimeout = 10 * time.Second

// ServiceConnectTimeout specifies the timeout for connecting to the installer service
ServiceConnectTimeout = 5 * time.Minute
Contributor

Just curious - why 5 minutes?

Also, if this is specific to the installer (as per the comment), maybe InstallServiceConnectTimeout?

Contributor Author

No real reason, this is more of a left-over from debugging - thanks for pointing it out. I'll shorten it.

// RPCAgentSecretsPackage specifies the name of the RPC credentials package
RPCAgentSecretsPackage = "rpcagent-secrets"

// ShutdownTimeout defines the maximum amount of time to wait for an operation to complete
Contributor

This isn't really just the amount of time to wait for an operation to complete is it? I assume based on the variable name, this is the amount of time to wait when a shutdown has been requested?

if err := config.CheckAndSetDefaults(); err != nil {
return nil, trace.Wrap(err)
}
server := server.New(ctx)
Contributor

I might just be missing a bit of context, why does each peer get its own instance of the gRPC installer server?

Contributor Author
@a-palchikov a-palchikov May 27, 2019

Yes, it seems confusing. Basically, a peer in the context of an expand operation is different from a peer in the context of an install operation (at least atm), so lib/expand ends up running two errands: for install, the server is not that relevant and I'd keep it for symmetry; for expand, it is required to run the expand operation.

// Execute executes the peer operation (join or just serving an agent).
// Implements server.Executor
func (p *Peer) Execute(phase *installpb.ExecuteRequest_Phase) (err error) {
p.Info("Execute.")
Contributor

Please include additional context in the log. What is being executed, where the request is coming from, stuff like that.

for _, addr := range p.Peers {
p.Debugf("Trying peer %v.", addr)
Contributor

Dialing Peer instead of trying Peer?

@@ -150,7 +150,7 @@ func (p *pullExecutor) pullUserApplication() error {
DstApp: p.LocalApps,
Package: *p.Phase.Data.Package,
})
if err != nil {
if err != nil && !trace.IsAlreadyExists(err) {
Contributor

Sometimes you point out really great stuff by commenting on your own PR, however, I feel this is a comment that is better suited in the source. It answers the why of a particular line of code and isn't a simple restatement of the code.

Why is AlreadyExists an acceptable error here?
Answer: Needs to be re-entrant.
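A sketch of how that in-source comment could read, using the trace helpers from the snippet: the pull phase has to be re-entrant, so a package that already exists locally is not treated as a failure.

```go
package example

import "github.com/gravitational/trace"

// pullPackage tolerates AlreadyExists so the phase stays re-entrant: a
// previous (possibly interrupted) run may have pulled the package already.
func pullPackage(pull func() error) error {
	if err := pull(); err != nil && !trace.IsAlreadyExists(err) {
		return trace.Wrap(err)
	}
	return nil
}
```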


@a-palchikov a-palchikov force-pushed the dmitri/3846-install-unattended branch 2 times, most recently from f356f29 to d481ca1 Compare May 31, 2019 18:35
 - Fix an issue with signal loop not cancelling the context
 - Unpack process assets into fixed location in ephemeral state
directory
 - Fix installer client connect
…rce flag which has a slightly different meaning (will force even a completed phase).

With the resume flag, only in-progress or failed phases are re-run.
 * Fix a couple of issues with signal handling by adding an explicit done channel. The channel is useful to ensure that all stoppers have finished.
Also fix the context having been already cancelled case by using an unrelated context for the cleanup.
 * Implement graceful stop for wizard and shut down installer gRPC server when requested.
 * Rewrite plan/operation handling code to be able to deal with an active/inactive wizard process. In the inactive case, the plan data is retrieved directly from
the state directory.
 * Simplify interrupt handling
…rk for automatic restarts.

  * Factored common gRPC code into lib/install/server.Server.
  * Moved system uninstall code to lib/system/cleanup to be able to reuse it with installer service
  * Termination handler logic separate for client/server
…s simplifies service handling w.r.t. the location of the socket file and the name of the service
 * Implement Abort in gRPC agents
 * Store service socket in ephemeral directory root
 * Mask service to prevent auto-restart upon completion
…per resume command (installer node vs join nodes)

  * Another pass at handling of completion and aborts. Uninstall services upon successful operation completion
  * Show proper error message when the operation is aborted and not a random error from the middle of the operation
  * Move agent from engine to installer and run it at start if necessary. This necessitated changes to reconnect logic
  * Make token specified for install be the provisioning token. This necessitated changes to provisioning tokens - install will only create a single long-lived expand token
  * Update peer reconnect strategy to check explicitly for all known terminal conditions
  * Fix tests
…t properly in 'plan resume'

 * Handle operation completion properly for interactive wizard (disconnect client, handle interrupt as shutdown instead of abort)
 * Move wizard state to wizard directory in ephemeral location
 * Revisit context cancellation vs waiting for completion. Do away with
using top-level context cancelled by interrupt handler - instead rely on
the handler cancellation behavior and use unbounded context where
necessary - cancellation/abort is controlled by the interrupt handler.
This fixes multiple issues with remote calls being cancelled half-way
when they need to complete.
 * Move all install-specific code to base repository and call into the
service for plan subcommands.
binary's directory. Using binary directory will only work for relative
invocations - absolute invocations will almost always use the wrong
directory for state.
 * Validate that the temporary state directory is not rooted in one of
the temporary locations (in addition to tmpfs).
…ttributes. Clean up operation state after completion
 * Revisit abort logic and move abort/completion handlers to client
where they're more meaningful.
 * Remove OpsCenter-based installer flow.
 * Use Aborters/Completers on the client side - this prompted extending
the server Event struct with a Status instead of a Completion flag to
support the interactive install flow when the user needs to terminate the
installer explicitly.
 * Updated bolt dependency to refer to a fork with changes to support no
timeout for access.
service.
 * Improve service init by ensuring no left-overs from previous services
configuration.
 * Revert changes to InterruptHandler to let it properly cancel the
external context.
 * Send completion event for expand.
service-driven execution does not yet work with agents (for install).
This will require restructuring lib/expand.
 * Load RPC credentials directly from the RPC package for install.
 * Improve handling of premature interrupts (at init stage)
 * Fix the problem with abort handler that did not properly uninstall
services.
@a-palchikov a-palchikov force-pushed the dmitri/3846-install-unattended branch from d481ca1 to 0e31d87 Compare June 3, 2019 14:38
@a-palchikov a-palchikov merged commit 7c0d01b into master Jun 3, 2019
@a-palchikov a-palchikov deleted the dmitri/3846-install-unattended branch June 3, 2019 17:22