Skip to content
This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Retry on openApplication error (darwin); More error logging #4

Merged
merged 4 commits into from
Apr 12, 2016

Conversation

gabriel
Copy link
Contributor

@gabriel gabriel commented Apr 7, 2016

It will retry open if it fails (after 3 seconds) for OS X... Adds some more logging for errors...

@gabriel
Copy link
Contributor Author

gabriel commented Apr 7, 2016

@cjb @chrisnojima

@cjb
Copy link
Contributor

cjb commented Apr 8, 2016

We could merge this. But we still don't have any theories for:

  • why open failed
  • why it just started failing yesterday after working forever

Any ideas on those?

tryOpen := func() error {
out, err := exec.Command("/usr/bin/open", applicationPath).CombinedOutput()
if err != nil {
return fmt.Errorf("%s; %s", err, string(out))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the error message also say that open caused this problem too?

@gabriel
Copy link
Contributor Author

gabriel commented Apr 10, 2016

I digged a little bit trying to find the source code to /usr/bin/open to see if I could see under what scenarios a exit status of 1 would result... I pulled all the source for https://opensource.apple.com/ and grepped for strings present in the man page, but was unable to find it.

Just trying stuff out, I was able to get /usr/bin/open to return exit code 1 if:

  • the file didn't exist
  • if you tried to open from trash

I'm assuming there are other scenarios where open might exit with this code... maybe it always returns this on error... or maybe uses this code if there is a file related issue...

If the application is already open then it will noop return exit code 0, so it seems like that isn't the issue. We do use SIGKILL on the app process.

This feels like a race condition or possibly flakiness arising from the OS (maybe Gatekeeper?) after a SIGKILL. I have seen Gatekeeper act weird in the past. Perhaps we shouldn't be using SIGKILL and it is too abrupt especially for something like Electron where it has other helper processes, etc?

I don't know why this would appear now, other than there have been lots of Electron patch releases lately so maybe some timing changed.

I looked at Sparkle and it uses NSWorkspace openFile. Squirrel uses NSWorkspace launchApplicationAtURL.

We could make our own open tool (or possibly cgo) which uses these library calls, but I'm not sure if it would give us anything.

@gabriel
Copy link
Contributor Author

gabriel commented Apr 11, 2016

We are also assuming (golang) os.Rename is atomic but maybe we shouldn't be... This would cause the /usr/bin/open to fail with exit code 1 (with file not found)...

"There are systems where it cannot be done atomically, therefore it is not atomic."
https://groups.google.com/forum/#!topic/golang-nuts/ZjRWB8bMhv4

"os: make Rename atomic on Windows",
"In Posix-compliant OSes, the os.Rename() function works properly (and atomically), including when replacing an existing file"
golang/go#8914

@maxtaco
Copy link
Contributor

maxtaco commented Apr 11, 2016

Is a directory or a file being swapped? If a file, we can use exchangedata(2)

@gabriel
Copy link
Contributor Author

gabriel commented Apr 11, 2016

It's an (app bundle) directory.

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

We are also assuming (golang) os.Rename is atomic but maybe we shouldn't be...

I think it's saying that if it's backed by rename(2), and it's running on POSIX, then it is atomic, as POSIX requires. But not every platform supported by golang is POSIX, so they can't make a general statement.

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

I think the next step should be to run dtruss while reproing. It'll test our current hypothesis as well as letting us the see the error returned from open.

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

I got dtruss going. It isn't capturing everything going on, but this looks like what we were looking for:

  579/0x1900:  write_nocancel(0x2, "LSOpenURLsWithRole() failed with error -10810 for the file /Applications/Keybase.app.\0", 0x55)              = 85 0

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

Good news: I'm now able to repro with a service that I built myself that includes this PR.

Bad news: The three second delay doesn't help.

2016-04-11T11:40:34.562989 ▶ [DEBU keybase updater.go:712] 1ea App reported its PID as 501
2016-04-11T11:40:34.563005 ▶ [DEBU keybase updater.go:717] 1eb Killing app
2016-04-11T11:40:34.563021 ▶ [DEBU keybase updater.go:724] 1ec Opening app at /Applications/Keybase.app
2016-04-11T11:40:34.570120 ▶ [DEBU keybase log.go:131] 1ed {unix} EOF
2016-04-11T11:40:34.570321 ▶ [DEBU keybase main.go:115] 1ee Handle() complete for connection 1
2016-04-11T11:40:34.579557 ▶ [DEBU keybase log.go:131] 1ef {unix} EOF
2016-04-11T11:40:34.579813 ▶ [DEBU keybase main.go:115] 1f0 Handle() complete for connection 3
2016-04-11T11:40:34.579932 ▶ [DEBU keybase log.go:131] 1f1 {unix} EOF
2016-04-11T11:40:34.580061 ▶ [DEBU keybase main.go:115] 1f2 Handle() complete for connection 2
2016-04-11T11:40:34.606877 ▶ [ERRO keybase standard.go:197] 1f3 Open error (trying again in a few seconds): Open error: exit status 1; LSOpenURLsWithRole() failed with error -10810 for the file /Applications/Keybase.app.
2016-04-11T11:40:37.649005 ▶ [DEBU keybase updater.go:735] 1f4 Cleaning up after update
2016-04-11T11:40:37.649066 ▶ [DEBU keybase updater.go:737] 1f5 Removing /var/folders/hg/f5t3mfpj3djbjr61h5x2pbs00000gp/T/KeybaseUpdates/Keybase-1.0.15-20160406130015+05efae8.zip.unzipped
2016-04-11T11:40:37.649173 ▶ [DEBU keybase updater.go:737] 1f6 Removing /var/folders/hg/f5t3mfpj3djbjr61h5x2pbs00000gp/T/KeybaseUpdates/Keybase-1.0.15-20160406130015+05efae8.zip
2016-04-11T11:40:37.651802 ▶ [DEBU keybase update.go:52] 1f7 - UpdateEngine Run
2016-04-11T11:40:37.651835 ▶ [DEBU keybase engine.go:66] 1f8 - RunEngine(Updater) -> ERROR: Error opening app: Open error: exit status 1; LSOpenURLsWithRole() failed with error -10810 for the file /Applications/Keybase.app.
2016-04-11T11:40:37.653652 ▶ [DEBU keybase log.go:131] 1f9 {unix} EOF
2016-04-11T11:40:37.653772 ▶ [DEBU keybase main.go:115] 1fa Handle() complete for connection 7

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

I tried sending p.Signal(syscall.SIGTERM) instead of p.Kill (SIGKILL). It changed the output but didn't help:

2016-04-11T11:58:00.970415 ▶ [ERRO keybase standard.go:197] 0c5 Open error (trying again in a few seconds): Open error: exit status 1; LSOpenURLsWithRole() failed with error -600 for the file /Applications/Keybase.app.
2016-04-11T11:58:04.010208 ▶ [DEBU keybase updater.go:736] 0c6 Cleaning up after update
2016-04-11T11:58:04.010275 ▶ [DEBU keybase updater.go:738] 0c7 Removing /var/folders/hg/f5t3mfpj3djbjr61h5x2pbs00000gp/T/KeybaseUpdates/Keybase-1.0.15-20160406130015+05efae8.zip.unzipped
2016-04-11T11:58:04.010399 ▶ [DEBU keybase updater.go:738] 0c8 Removing /var/folders/hg/f5t3mfpj3djbjr61h5x2pbs00000gp/T/KeybaseUpdates/Keybase-1.0.15-20160406130015+05efae8.zip
2016-04-11T11:58:04.013207 ▶ [DEBU keybase update.go:52] 0c9 - UpdateEngine Run
2016-04-11T11:58:04.013250 ▶ [DEBU keybase engine.go:66] 0ca - RunEngine(Updater) -> ERROR: Error opening app: Open error: exit status 1; LSOpenURLsWithRole() failed with error -10810 for the file /Applications/Keybase.app.

I'd guess that error -600 is because something's still held by the exiting old process.

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

I tried waiting 9 seconds instead of 3 seconds, same result.

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

We tried open()ing directly from the downloaded tempdir rather than moving the download into Applications and that worked.

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

I tried killing the kbfs process during the 9 second delay between open() attempts, in case its open process from the "old" app bundle was preventing our relaunch, but it didn't help.

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

Trying open with 10 one second delayed retries worked after 6 retries/opens. But I think launchd is the parent process, not the service? That's strange.

@chrisnojima suggests trying with gatekeeper turned off.

@gabriel
Copy link
Contributor Author

gabriel commented Apr 11, 2016

It might be a bug in Electron... we could try downgrading that to an earlier version?

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

This is the diff that works for me:

diff --git a/go/vendor/github.com/keybase/go-updater/updater_osx.go b/go/vendor/github.com/keybase/go-updater/updater_osx.go
index 22312ae..271f806 100644
--- a/go/vendor/github.com/keybase/go-updater/updater_osx.go
+++ b/go/vendor/github.com/keybase/go-updater/updater_osx.go
@@ -6,6 +6,7 @@
 package updater

 import (
+       "fmt"
        "os"
        "os/exec"
        "os/user"
@@ -68,9 +69,23 @@ func (u *Updater) checkPlatformSpecificUpdate(sourcePath string, destinationPath
        return nil
 }

-func openApplication(applicationPath string) error {
-       _, err := exec.Command("/usr/bin/open", applicationPath).Output()
-       return err
+func (u *Updater) openApplication(applicationPath string) error {
+       tryOpen := func() error {
+               u.log.Errorf("Trying open")
+               out, err := exec.Command("/usr/bin/open", applicationPath).CombinedOutput()
+               if err != nil {
+                       u.log.Errorf("Open error: %s; %s", err, string(out))
+                       return fmt.Errorf("Open error: %s; %s", err, string(out))
+               }
+               return nil
+       }
+       for i := 0; i < 10; i++ {
+               if err := tryOpen(); err != nil {
+                       u.log.Errorf("Open error (trying again in a few seconds): %s", err)
+                       time.Sleep(1 * time.Second)
+               }
+       }
+       return nil
 }

 func (u *Updater) applyUpdate(localPath string) (err error) {

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

I'm running my service outside of launchd (as far as launchd is concerned keybase.service is not running), so that might affect launchd's decisions.

@gabriel
Copy link
Contributor Author

gabriel commented Apr 11, 2016

oh i think i know the bug... it looks like the executable name changed from

/Applications/Keybase.app/Contents/MacOS/Electron to /Applications/Keybase.app/Contents/MacOS/Keybase

This causes Gatekeeper to get confused.

@gabriel
Copy link
Contributor Author

gabriel commented Apr 11, 2016

I ran into this a few months ago... I'm checking now to see if a electron-packager change did this

@cjb
Copy link
Contributor

cjb commented Apr 11, 2016

Agree, the filename changed between good build and bad build. Electron changed from v0.37.2 to v0.37.3.

gabriel added a commit to keybase/client that referenced this pull request Apr 11, 2016
gabriel added a commit to keybase/client that referenced this pull request Apr 11, 2016
@gabriel gabriel merged commit 6bcce92 into master Apr 12, 2016
@gabriel gabriel deleted the openretry branch April 12, 2016 17:39
gabriel added a commit to keybase/client that referenced this pull request Apr 12, 2016
gabriel added a commit to keybase/client that referenced this pull request Apr 12, 2016
…2583)

* Symlink Electron to resolve Gatekeeper issue with changing executable name in app bundle

See keybase/go-updater#4

* Vendoring updater changes from keybase/go-updater#4
zanderz pushed a commit that referenced this pull request Jun 20, 2018
Forgot to update the name of the launch command in slackbot
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants