Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix webui log hanging bug #1914

Merged
merged 2 commits into from
Nov 8, 2015
Merged

fix webui log hanging bug #1914

merged 2 commits into from
Nov 8, 2015

Conversation

whyrusleeping
Copy link
Member

the webui can currently cause the entire daemon to deadlock because the writer set up when it queries the log endpoints is never closed (and will hang on writes). This causes all the calls to any log function to hang and subsequently deadlocks the daemon.

I beleive this is the root cause of 1896, and after this PR, I am unable to repro by messing around via the webui.

Interestingly, i was never able to get a repro via curl... only in-browser.

@jbenet jbenet added the status/in-progress In progress label Oct 30, 2015
@whyrusleeping whyrusleeping added kind/bug A bug in existing code (including security flaws) RFCR and removed status/in-progress In progress labels Oct 30, 2015
@whyrusleeping
Copy link
Member Author

In debugging this issue i ended up cleaning up a bunch more logging things, everything is just a little bit nicer now (flatfs was importing its own separate instance of go-log, that was making weirdness)

@daviddias
Copy link
Member

When doing go install ./cmd/ipfs, I'm getting this:

│vendor/QmZUcdoz4Afhc5vr3LEoPHxr1wwZfjBxxjWx2oMmfChG9W/go-log/metadata.go:8:2: cannot find package "github.com/satori/go.uuid" in any of:
                                                                                                                                                               │        /usr/local/go/src/github.com/satori/go.uuid (from $GOROOT)
                                                                                                                                                               │        /Users/david/Documents/Code/go-projects/src/github.com/satori/go.uuid (from $GOPATH)
                                                                                                                                                               │vendor/QmZUcdoz4Afhc5vr3LEoPHxr1wwZfjBxxjWx2oMmfChG9W/go-log/oldlog.go:8:2: cannot find package "github.com/whyrusleeping/go-logging" in any of:
                                                                                                                                                               │        /usr/local/go/src/github.com/whyrusleeping/go-logging (from $GOROOT)
                                                                                                                                                               │        /Users/david/Documents/Code/go-projects/src/github.com/whyrusleeping/go-logging (from $GOPATH)

go get doesn't solve it, thoughts?

@whyrusleeping
Copy link
Member Author

@diasdavid my bad, tired brain forgot to vendor correctly.

@daviddias
Copy link
Member

thank you @whyrusleeping :)

so, 1st thing I noticed were these logs when I opened the webui:

08:08:02.161 ERROR     flatfs: too many open files, retrying in %dms0 flatfs.go:119
08:08:02.161 ERROR     flatfs: too many open files, retrying in %dms100 flatfs.go:119
08:08:02.263 ERROR     flatfs: too many open files, retrying in %dms200 flatfs.go:119
08:08:03.562 ERROR core/serve: Path Resolve error: no link named "fontawesome-webfont.woff2" under QmWjympWW8hpP5Jgfu66ZrqTMykUBjGbFQF2XgfChau5NZ gateway_handler.go:479
08:08:03.782 ERROR     flatfs: too many open files, retrying in %dms0 flatfs.go:119
08:08:03.782 ERROR     flatfs: too many open files, retrying in %dms100 flatfs.go:119

@daviddias
Copy link
Member

Been testing with js-ipfs-api latest and very happy with the results, because it works again :) (0.3.8 made the response parser go nuts)

@whyrusleeping
Copy link
Member Author

@diasdavid yeah, i notice that when on osx as well. its that darn 256 fd limit. Nothing bad should actually be happening though, you'll see an uglier error if it does.

@jbenet
Copy link
Member

jbenet commented Oct 31, 2015

@whyrusleeping the new webui with logging disabled is:

/ipfs/QmR9MzChjp1MdFWik7NjEjqKQMzVmBkdK3dz14A6B5Cupm

case <-cn.CloseNotify():
case <-ctx.Done():
}
cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it okay to cancel a context twice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfectly fine :)

@whyrusleeping whyrusleeping force-pushed the fix/log-hang branch 4 times, most recently from 823df45 to c5b133a Compare November 5, 2015 03:44
@whyrusleeping
Copy link
Member Author

looks like circleCI was trying to run fuse tests? thats weird

License: MIT
Signed-off-by: Jeromy <jeromyj@gmail.com>
License: MIT
Signed-off-by: Jeromy <jeromyj@gmail.com>
@jbenet
Copy link
Member

jbenet commented Nov 8, 2015

@whyrusleeping this LGTM.

that one test failed many times. hangs, or this:

not ok 19 - 'ipfs daemon' should be able to run with a pipe attached to stdin (issue #861)
#   
#     yes | ipfs daemon --init >stdin_daemon_out 2>stdin_daemon_err &
#     pollEndpoint -ep=/version -v -tout=1s -tries=10 >stdin_poll_apiout 2>stdin_poll_apierr &&
#     test_kill_repeat_10_sec $! ||
#     test_fsh cat stdin_daemon_out || test_fsh cat stdin_daemon_err || test_fsh cat stdin_poll_apiout || test_fsh cat stdin_poll_apierr
#   

https://travis-ci.org/ipfs/go-ipfs/jobs/89554589

@jbenet
Copy link
Member

jbenet commented Nov 8, 2015

@whyrusleeping worried these tests failures (hangs!) are related to this PR. cant get things to pass, and even on the off chance it does now, i dont have confidence on it.

i'll give it a few hours for travis to shake up, then rerun everything a few times

@whyrusleeping
Copy link
Member Author

i'm seeing similar tests failing (the stdin pipe one) on other PRs. While thats not a good thing, i'm slightly more confident that its not this PRs fault

@whyrusleeping
Copy link
Member Author

@jbenet clean tests (minus the attached stdin one that i'm seeing elsewhere as well)

@whyrusleeping
Copy link
Member Author

the stdin faiure appears to just be caused by the daemon taking a little longer to startup now (not sure why that is), locally, it takes four or five seconds to startup the daemon. The test is set to fail it after ten seconds, a particularly slow and stressed travis CI machine will likely hit this with some frequency

@whyrusleeping
Copy link
Member Author

Actually, thinking about it a bit more. That test bootstraps to our bootstrap nodes, they have been really slow lately (@lgierth mentioned they havent been rebooted in a while). These failures are very likely a result of bootstrapping taking way longer than usual.

@jbenet
Copy link
Member

jbenet commented Nov 8, 2015

until we are satisfied with ipfs's reliability at long periods we should continue periodically rebooting the nodes. it's a service people depend on.

@jbenet
Copy link
Member

jbenet commented Nov 8, 2015

we should get to the bottom of these tests failing before merging anything else. either it's a problem that snuck in, or its a service problem that needs to be fixed.

@jbenet
Copy link
Member

jbenet commented Nov 8, 2015

ok looks good now

jbenet added a commit that referenced this pull request Nov 8, 2015
@jbenet jbenet merged commit 0b3e037 into master Nov 8, 2015
@jbenet jbenet deleted the fix/log-hang branch November 8, 2015 23:50
@whyrusleeping
Copy link
Member Author

@jbenet in the future, could you just say "RFM" or something and let me rebase and merge it in so we can get a straight line of history?

@jbenet
Copy link
Member

jbenet commented Nov 9, 2015

Sure. Feel free to fix it and force push to master


Sent from Mailbox

On Sun, Nov 8, 2015 at 11:55 PM, Jeromy Johnson notifications@github.com
wrote:

@jbenet in the future, could you just say "RFM" or something and let me rebase and merge it in so we can get a straight line of history?

Reply to this email directly or view it on GitHub:
#1914 (comment)

@jbenet
Copy link
Member

jbenet commented Nov 9, 2015

Just drop the heads here for reference


Sent from Mailbox

On Mon, Nov 9, 2015 at 12:00 AM, Juan Batiz-Benet juan@benet.ai wrote:

Sure. Feel free to fix it and force push to master

Sent from Mailbox
On Sun, Nov 8, 2015 at 11:55 PM, Jeromy Johnson notifications@github.com
wrote:

@jbenet in the future, could you just say "RFM" or something and let me rebase and merge it in so we can get a straight line of history?

Reply to this email directly or view it on GitHub:
#1914 (comment)

@whyrusleeping
Copy link
Member Author

Can just do it for future PR's. force pushing to master is very scary for many reasons other than just "i might screw something up"

@rht rht mentioned this pull request Dec 17, 2015
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants