Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decider and Activity pollers stop polling after 3 tries #113

Closed
ryanwalls opened this issue Jan 22, 2016 · 7 comments
Closed

Decider and Activity pollers stop polling after 3 tries #113

ryanwalls opened this issue Jan 22, 2016 · 7 comments

Comments

@ryanwalls
Copy link
Contributor

When the FSM is started, it polls 3 times and then stops/hangs. Log output:

2016/01/22 08:35:13 component=DecisionTaskPoller at=decision-task-page poll-id="7b61572d-4bb1-4b8e-ab85-2672c0c0144a" task-list="defaultTaskList" workflow="no-workflow-execution" page=1 PreviousStartedEventId=0 StartedEventId=0 NumEvents=0 FirstEventId=nil LastEventId=nil
2016/01/22 08:35:13 component=DecisionTaskPoller at=decision-task-empty-response poll-id="7b61572d-4bb1-4b8e-ab85-2672c0c0144a" task-list="defaultTaskList"
2016/01/22 08:35:13 component=DecisionTaskPoller fn=PollUntilShutdownBy at=poll-no-task poller=example-fsm-poller task-list="defaultTaskList"
2016/01/22 08:36:13 component=DecisionTaskPoller at=decision-task-page poll-id="8c6e79a8-2cf6-48cc-9db2-e342abbcb72d" task-list="defaultTaskList" workflow="no-workflow-execution" page=1 PreviousStartedEventId=0 StartedEventId=0 NumEvents=0 FirstEventId=nil LastEventId=nil
2016/01/22 08:36:13 component=DecisionTaskPoller at=decision-task-empty-response poll-id="8c6e79a8-2cf6-48cc-9db2-e342abbcb72d" task-list="defaultTaskList"
2016/01/22 08:36:13 component=DecisionTaskPoller fn=PollUntilShutdownBy at=poll-no-task poller=example-fsm-poller task-list="defaultTaskList"

Allow 5+ minutes to pass and no more logging is produced.

ActivityTaskPoller exhibits the same behavior. Log output:

2016/01/22 08:45:40 component=ActivityTaskPoller at=activity-task-empty-response
2016/01/22 08:45:40 component=ActivityTaskPoller fn=PollUntilShutdownBy at=poll-no-task poller=preproc-0.0.1-poller task-list="preproc-0.0.1"
2016/01/22 08:46:41 component=ActivityTaskPoller at=activity-task-empty-response
2016/01/22 08:46:41 component=ActivityTaskPoller fn=PollUntilShutdownBy at=poll-no-task poller=preproc-0.0.1-poller task-list="preproc-0.0.1"
@ryanwalls ryanwalls changed the title Decider poller stops polling after 3 tries Decider and Activity pollers stop polling after 3 tries Jan 22, 2016
@sclasen
Copy link
Owner

sclasen commented Jan 22, 2016

interesting, we definitely have not seen this behavior, could you share a code snippet that produces this @ryanwalls ? Thanks!

@ryanwalls
Copy link
Contributor Author

I've created a public copy of my project here: https://github.com/ryanwalls/workflow-demo

Something weird is definitely happening... it fails at varying points now that I've add debug logging to the AWS config.

The key file to look at is simulationworkflow/fsm.go

@sclasen
Copy link
Owner

sclasen commented Jan 22, 2016

@ryanwalls thanks! havent had a chance to build / run just yet but one thing that may (or not) cause an issue is calling FSM.Init(), since the Start() method would call it.

https://github.com/ryanwalls/workflow-demo/blob/master/simulationworkflow/fsm.go#L75

One of the few (sorry! 😄 ) meaningful doc comments is on the Init method, which I think means you shouldnt call it (failing to load the context in my brain as to why, so just trusting my past self)

https://github.com/sclasen/swfsm/blob/master/fsm/fsm.go#L193-L194

Could you try removing that line and let me know if it works or not?

If that doesnt help I'll dig in further when I have some more time.

@ryanwalls
Copy link
Contributor Author

@sclasen Removed the line, and still hangs. It hangs at various points in the polling. I captured the goroutine traces when I killed the app using kill -ABRT <pid> and here's what I got:

SIGABRT: abort
PC=0x5c2bb m=0

goroutine 0 [idle]:
runtime.mach_semaphore_wait(0x1303, 0xeebc50, 0x10a7c50, 0x5a5af01, 0x6410033, 0xeeb580, 0x4f039, 0xffffffffffffffff, 0xc820067980, 0x7fff5fbff66c, ...)
    /usr/local/go/src/runtime/sys_darwin_amd64.s:407 +0xb
runtime.semasleep1(0xffffffffffffffff, 0xc820067980)
    /usr/local/go/src/runtime/os1_darwin.go:385 +0xe5
runtime.semasleep.func1()
    /usr/local/go/src/runtime/os1_darwin.go:401 +0x29
runtime.systemstack(0x7fff5fbff670)
    /usr/local/go/src/runtime/asm_amd64.s:278 +0xab
runtime.semasleep(0xffffffffffffffff, 0x0)
    /usr/local/go/src/runtime/os1_darwin.go:402 +0x36
runtime.notesleep(0xeebc30)
    /usr/local/go/src/runtime/lock_sema.go:169 +0x100
runtime.stopm()
    /usr/local/go/src/runtime/proc1.go:1128 +0x112
runtime.findrunnable(0xc820024000, 0x0)
    /usr/local/go/src/runtime/proc1.go:1530 +0x69e
runtime.schedule()
    /usr/local/go/src/runtime/proc1.go:1639 +0x267
runtime.park_m(0xc820067980)
    /usr/local/go/src/runtime/proc1.go:1698 +0x18b
runtime.mcall(0x7fff5fbff7f0)
    /usr/local/go/src/runtime/asm_amd64.s:204 +0x5b

goroutine 1 [running]:
    goroutine running on other thread; stack unavailable

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

goroutine 29 [IO wait]:
net.runtime_pollWait(0x15a7258, 0x72, 0xc8200760a0)
    /usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(*pollDesc).Wait(0xc820151f00, 0x72, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(*pollDesc).WaitRead(0xc820151f00, 0x0, 0x0)
    /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(*netFD).Read(0xc820151ea0, 0xc820431c00, 0x400, 0x400, 0x0, 0x15a0028, 0xc8200760a0)
    /usr/local/go/src/net/fd_unix.go:232 +0x23a
net.(*conn).Read(0xc820086230, 0xc820431c00, 0x400, 0x400, 0x0, 0x0, 0x0)
    /usr/local/go/src/net/net.go:172 +0xe4
crypto/tls.(*block).readFromUntil(0xc82049d6e0, 0x15a7340, 0xc820086230, 0x5, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:455 +0xcc
crypto/tls.(*Conn).readRecord(0xc82017a840, 0xbcc017, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:540 +0x2d1
crypto/tls.(*Conn).Read(0xc82017a840, 0xc8201d8000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/local/go/src/crypto/tls/conn.go:901 +0x167
net/http.noteEOFReader.Read(0x15b3fc8, 0xc82017a840, 0xc82013c3c8, 0xc8201d8000, 0x1000, 0x1000, 0xc820032900, 0x0, 0x0)
    /usr/local/go/src/net/http/transport.go:1370 +0x67
net/http.(*noteEOFReader).Read(0xc8204b8120, 0xc8201d8000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    <autogenerated>:126 +0xd0
bufio.(*Reader).fill(0xc82042fc20)
    /usr/local/go/src/bufio/bufio.go:97 +0x1e9
bufio.(*Reader).Peek(0xc82042fc20, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/bufio/bufio.go:132 +0xcc
net/http.(*persistConn).readLoop(0xc82013c370)
    /usr/local/go/src/net/http/transport.go:876 +0xf7
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:685 +0xc78

goroutine 30 [select]:
net/http.(*persistConn).writeLoop(0xc82013c370)
    /usr/local/go/src/net/http/transport.go:1009 +0x40c
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:686 +0xc9d

goroutine 7 [runnable]:
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:59 +0xf9
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
    /usr/local/go/src/runtime/time.go:53 +0x37
created by github.com/3dsim/workflow/vendor/github.com/sclasen/swfsm/fsm.(*FSM).Start
    /Users/ryan/Development/gocode/src/github.com/3dsim/workflow/vendor/github.com/sclasen/swfsm/fsm/fsm.go:250 +0x395

rax    0xe
rbx    0x1303
rcx    0x7fff5fbff5f8
rdx    0x7fff5fbff670
rdi    0x1303
rsi    0xeeb580
rbp    0xeebb20
rsp    0x7fff5fbff5f8
r8     0xeebb20
r9     0x16c
r10    0x73203e5d
r11    0x286
r12    0x31b68ce1b2941
r13    0x32eff28f83fa0
r14    0x142bdb52eef5bc00
r15    0x8
rip    0x5c2bb
rflags 0x286
cs     0x7
fs     0x0
gs     0x0

I've changed the source code some trying to debug this, but line 250 (which appears to be kicking off a bunch of sleep) is this line:
go poller.PollUntilShutdownBy(f.ShutdownManager, fmt.Sprintf("%s-poller", f.Name), f.dispatchTask, f.taskReady)

@sclasen
Copy link
Owner

sclasen commented Jan 22, 2016

Thanks for trying. One more thing to give a try, maybe set gomaxprocs to 1
in your main.go ?

Otherwise I will try and get this running to see whats up.

On Fri, Jan 22, 2016 at 1:03 PM, ryanwalls notifications@github.com wrote:

@sclasen https://github.com/sclasen Removed the line, and still hangs.
It hangs at various points in the polling. I captured the goroutine traces
when I killed the app using kill -ABRT and here's what I got:

SIGABRT: abort
PC=0x5c2bb m=0

goroutine 0 [idle]:
runtime.mach_semaphore_wait(0x1303, 0xeebc50, 0x10a7c50, 0x5a5af01, 0x6410033, 0xeeb580, 0x4f039, 0xffffffffffffffff, 0xc820067980, 0x7fff5fbff66c, ...)
/usr/local/go/src/runtime/sys_darwin_amd64.s:407 +0xb
runtime.semasleep1(0xffffffffffffffff, 0xc820067980)
/usr/local/go/src/runtime/os1_darwin.go:385 +0xe5
runtime.semasleep.func1()
/usr/local/go/src/runtime/os1_darwin.go:401 +0x29
runtime.systemstack(0x7fff5fbff670)
/usr/local/go/src/runtime/asm_amd64.s:278 +0xab
runtime.semasleep(0xffffffffffffffff, 0x0)
/usr/local/go/src/runtime/os1_darwin.go:402 +0x36
runtime.notesleep(0xeebc30)
/usr/local/go/src/runtime/lock_sema.go:169 +0x100
runtime.stopm()
/usr/local/go/src/runtime/proc1.go:1128 +0x112
runtime.findrunnable(0xc820024000, 0x0)
/usr/local/go/src/runtime/proc1.go:1530 +0x69e
runtime.schedule()
/usr/local/go/src/runtime/proc1.go:1639 +0x267
runtime.park_m(0xc820067980)
/usr/local/go/src/runtime/proc1.go:1698 +0x18b
runtime.mcall(0x7fff5fbff7f0)
/usr/local/go/src/runtime/asm_amd64.s:204 +0x5b

goroutine 1 [running]:
goroutine running on other thread; stack unavailable

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1696 +0x1

goroutine 29 [IO wait]:
net.runtime_pollWait(0x15a7258, 0x72, 0xc8200760a0)
/usr/local/go/src/runtime/netpoll.go:157 +0x60
net.(_pollDesc).Wait(0xc820151f00, 0x72, 0x0, 0x0)
/usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
net.(_pollDesc).WaitRead(0xc820151f00, 0x0, 0x0)
/usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
net.(_netFD).Read(0xc820151ea0, 0xc820431c00, 0x400, 0x400, 0x0, 0x15a0028, 0xc8200760a0)
/usr/local/go/src/net/fd_unix.go:232 +0x23a
net.(_conn).Read(0xc820086230, 0xc820431c00, 0x400, 0x400, 0x0, 0x0, 0x0)
/usr/local/go/src/net/net.go:172 +0xe4
crypto/tls.(_block).readFromUntil(0xc82049d6e0, 0x15a7340, 0xc820086230, 0x5, 0x0, 0x0)
/usr/local/go/src/crypto/tls/conn.go:455 +0xcc
crypto/tls.(_Conn).readRecord(0xc82017a840, 0xbcc017, 0x0, 0x0)
/usr/local/go/src/crypto/tls/conn.go:540 +0x2d1
crypto/tls.(_Conn).Read(0xc82017a840, 0xc8201d8000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
/usr/local/go/src/crypto/tls/conn.go:901 +0x167
net/http.noteEOFReader.Read(0x15b3fc8, 0xc82017a840, 0xc82013c3c8, 0xc8201d8000, 0x1000, 0x1000, 0xc820032900, 0x0, 0x0)
/usr/local/go/src/net/http/transport.go:1370 +0x67
net/http.(_noteEOFReader).Read(0xc8204b8120, 0xc8201d8000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
:126 +0xd0
bufio.(_Reader).fill(0xc82042fc20)
/usr/local/go/src/bufio/bufio.go:97 +0x1e9
bufio.(_Reader).Peek(0xc82042fc20, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/bufio/bufio.go:132 +0xcc
net/http.(_persistConn).readLoop(0xc82013c370)
/usr/local/go/src/net/http/transport.go:876 +0xf7
created by net/http.(_Transport).dialConn
/usr/local/go/src/net/http/transport.go:685 +0xc78

goroutine 30 [select]:
net/http.(_persistConn).writeLoop(0xc82013c370)
/usr/local/go/src/net/http/transport.go:1009 +0x40c
created by net/http.(_Transport).dialConn
/usr/local/go/src/net/http/transport.go:686 +0xc9d

goroutine 7 [runnable]:
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:59 +0xf9
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
time.Sleep(0x186a0)
/usr/local/go/src/runtime/time.go:53 +0x37
created by github.com/3dsim/workflow/vendor/github.com/sclasen/swfsm/fsm.(*FSM).Start
/Users/ryan/Development/gocode/src/github.com/3dsim/workflow/vendor/github.com/sclasen/swfsm/fsm/fsm.go:250 +0x395

rax 0xe
rbx 0x1303
rcx 0x7fff5fbff5f8
rdx 0x7fff5fbff670
rdi 0x1303
rsi 0xeeb580
rbp 0xeebb20
rsp 0x7fff5fbff5f8
r8 0xeebb20
r9 0x16c
r10 0x73203e5d
r11 0x286
r12 0x31b68ce1b2941
r13 0x32eff28f83fa0
r14 0x142bdb52eef5bc00
r15 0x8
rip 0x5c2bb
rflags 0x286
cs 0x7
fs 0x0
gs 0x0


Reply to this email directly or view it on GitHub
#113 (comment).

@ryanwalls
Copy link
Contributor Author

Sorry for the hassle. Always the little things...

I figured it out.... see https://code.google.com/p/go/issues/detail?id=7190

I was using for {} to keep the program from exiting because FSM.Start() is asynchronous and returns immediately... I should have used select {}.

@sclasen
Copy link
Owner

sclasen commented Jan 22, 2016

aha, excellent. @ryanwalls Thanks for taking the time to post the repo, etc....please let me know if you run into anything else!

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants