Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rally froze at the end of a race, did not produce results #298

Closed
azarum opened this issue Jul 6, 2017 · 5 comments
Closed

Rally froze at the end of a race, did not produce results #298

azarum opened this issue Jul 6, 2017 · 5 comments
Labels
bug Something's wrong :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc.
Milestone

Comments

@azarum
Copy link

azarum commented Jul 6, 2017

Rally version (get with esrally --version):
esrally 0.5.4.dev0 (git revision: aaca9c0)

Invoked command:
esrally --report-format=csv --report-file=myreport.csv --track=geonames --pipeline=benchmark-only --target-hosts=:9200 --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'admin',basic_auth_password:'xxxx'"

Configuration file (located in ~/.rally/rally.ini)):
[meta]
config.version = 8

[system]
env.name = local

[node]
root.dir = /Users/user/.rally/benchmarks

[source]
local.src.dir = /Users/user/.rally/elasticsearch
remote.repo.url = https://github.com/elastic/elasticsearch.git

[build]
gradle.bin = /usr/local/bin/gradle

[runtime]
java8.home = /Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home

[benchmarks]
local.dataset.cache = ${node:root.dir}/data

[reporting]
datastore.type = in-memory
datastore.host =
datastore.port =
datastore.secure =
datastore.user =
datastore.password =

[tracks]
default.url = https://github.com/elastic/rally-tracks

[defaults]
preserve_benchmark_candidate = False

JVM version:
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version:
mac os sierra 10.12.5

Description of the problem including expected versus actual behavior:
Trying to run the test-mode against an external ES cluster was successful.
Tried to run geonames track, It seems to run all the steps but froze / got stuck at the end - It did not produce results to screen or file.
I ran it a few times and got the same behavior.

Steps to reproduce:

  1. Run track
  2. Wait for completion (gets stuck, then I broke with Ctrl+C)

Provide logs (if relevant):
Last printed on screen:

Running large_filtered_terms [100% done]
Running large_prohibited_terms [100% done]

^C[ERROR] Cannot race. Driver has returned no metrics but instead [None]. Terminating race without result.

Getting further help...

Last in logs: (Broke around 14:45)

2017-07-06 14:45:24,326 PID:55761 rally.driver INFO Storing throughput...
2017-07-06 14:45:24,340 PID:55761 rally.driver INFO Sending benchmark results...
2017-07-06 14:45:24,341 PID:55761 rally.metrics INFO Writing [27960] metrics records temporarily to [/var/folders/b6/btz7_9r50_vbdrzmlmbw56sj081fgs/T/rallyc9vrx733/metrics.json].
2017-07-06 14:45:25,732 PID:55761 rally.driver INFO Closing metrics store...
2017-07-06 14:45:25,733 PID:55761 rally.metrics INFO Closing metrics store.
2017-07-06 14:45:25,749 PID:55761 rally.driver INFO Terminating main driver actor.
2017-07-06 14:45:25,750 PID:55761 rally.driver INFO Main driver received ActorExitRequest and will terminate all load generators.
2017-07-06 14:45:25,751 PID:55761 rally.driver INFO Main driver has notified all load generators of termination.
2017-07-06 14:55:18,254 PID:55732 rally.racecontrol INFO User has cancelled the benchmark.
2017-07-06 14:55:18,262 PID:55732 rally.racecontrol INFO Asking mechanic to stop the engine.
2017-07-06 14:55:18,337 PID:55752 rally.mechanic INFO Stopping engine
2017-07-06 14:55:18,346 PID:55752 rally.mechanic INFO Stopping engine.
2017-07-06 14:55:18,350 PID:55752 rally.metrics INFO Compression changed size of metric store from [64] bytes to [47] bytes
2017-07-06 14:55:18,354 PID:55732 rally.racecontrol INFO Mechanic has stopped engine successfully.
2017-07-06 14:55:18,354 PID:55732 rally.racecontrol INFO Bulk adding system metrics to metrics store.
2017-07-06 14:55:18,354 PID:55732 rally.metrics INFO Restoring in-memory representation of metrics store.
2017-07-06 14:55:18,355 PID:55732 rally.racecontrol INFO Suppressing output of summary report. Cancelled = [False], Error = [True].
2017-07-06 14:55:18,355 PID:55732 rally.metrics INFO Closing metrics store.
2017-07-06 14:55:18,356 PID:55732 rally.main INFO Attempting to shutdown internal actor system.
2017-07-06 14:55:18,393 PID:55732 rally.main INFO Actor system is still running. Waiting...
2017-07-06 14:55:18,394 PID:55743 root INFO ---- Actor System shutdown
2017-07-06 14:55:19,397 PID:55732 rally.main INFO Shutdown completed.
2017-07-06 14:55:19,397 PID:55732 root ERROR Cannot run subcommand [race].
Traceback (most recent call last):
File "/Users/user/elastic/rally/esrally/rally.py", line 435, in dispatch_sub_command
race(cfg)
File "/Users/user/elastic/rally/esrally/rally.py", line 369, in race
with_actor_system(lambda c: racecontrol.run(c), cfg)
File "/Users/user/elastic/rally/esrally/rally.py", line 389, in with_actor_system
runnable(cfg)
File "/Users/user/elastic/rally/esrally/rally.py", line 369, in
with_actor_system(lambda c: racecontrol.run(c), cfg)
File "/Users/user/elastic/rally/esrally/racecontrol.py", line 330, in run
raise e
File "/Users/user/elastic/rally/esrally/racecontrol.py", line 327, in run
pipeline(cfg)
File "/Users/user/elastic/rally/esrally/racecontrol.py", line 42, in call
self.target(cfg)
File "/Users/user/elastic/rally/esrally/racecontrol.py", line 275, in benchmark_only
return race(Benchmark(cfg, external=True))
File "/Users/user/elastic/rally/esrally/racecontrol.py", line 231, in race
may_continue = benchmark.run(lap)
File "/Users/user/elastic/rally/esrally/racecontrol.py", line 159, in run
raise exceptions.RallyError("Driver has returned no metrics but instead [%s]. Terminating race without result." % str(result))
esrally.exceptions.RallyError: Driver has returned no metrics but instead [None]. Terminating race without result.

@danielmitterdorfer
Copy link
Member

Thanks for the detailed report.

I think there are two problems here:

First: Rally is hanging. From the log output it seems that a message between the internal load driver and the coordinator process (called "racecontrol") was lost. We can see from the following lines in the log that the load driver has sent metrics:

2017-07-06 14:45:24,340 PID:55761 rally.driver INFO Sending benchmark results...
2017-07-06 14:45:24,341 PID:55761 rally.metrics INFO Writing [27960] metrics records temporarily to [/var/folders/b6/btz7_9r50_vbdrzmlmbw56sj081fgs/T/rallyc9vrx733/metrics.json].
2017-07-06 14:45:25,732 PID:55761 rally.driver INFO Closing metrics store...

However, the message seems to be lost and thus Rally appears to be hanging. Can you please send rerun the benchmark and send me again the Rally logs but include this time also an internal log of the actor system that Rally is using? To write this file to well-known location you can issue the following command before you start the benchmark:

export THESPLOG_FILE=/Users/user/Desktop/thespian.log

After the benchmark is done you can issue unset THESPLOG_FILE to revert this config change.

Maybe this can shed some light what's going on.

The second problem is much less severe: Although you've cancelled the benchmark Rally still waits for the hanging subprocess and reports an error instead. I'll soon push a commit that corrects this behavior.

danielmitterdorfer added a commit that referenced this issue Jul 7, 2017
With this commit we immediately cancel the benchmark if cancellation is
detected in racecontrol (coordinator) instead of first forwarding the
request to the driver and relying that it will respond correctly. With
this change, Rally behaves more straightforward during cancellation even
in situations where there is a problem with the driver.

Relates #298
@danielmitterdorfer danielmitterdorfer added the bug Something's wrong label Jul 7, 2017
@azarum
Copy link
Author

azarum commented Jul 7, 2017

Thanks for quick response!
rally log:

2017-07-07 11:39:04,880 PID:20396 rally.driver INFO LoadGenerator[0] is executing [Task for [expression]] (90.25% complete).
2017-07-07 11:39:09,881 PID:20396 rally.driver INFO LoadGenerator[0] is executing [Task for [expression]] (92.50% complete).
2017-07-07 11:39:14,882 PID:20396 rally.driver INFO LoadGenerator[0] is executing [Task for [expression]] (94.75% complete).
2017-07-07 11:39:19,884 PID:20396 rally.driver INFO LoadGenerator[0] is executing [Task for [expression]] (97.25% complete).
2017-07-07 11:39:24,886 PID:20396 rally.driver INFO LoadGenerator[0] is executing [Task for [expression]] (99.75% complete).
2017-07-07 11:39:29,887 PID:20396 rally.driver INFO LoadGenerator[0] is ready for the next task.
2017-07-07 11:39:29,887 PID:20396 rally.driver INFO LoadGenerator[0] reached join point [JoinPoint(10)].
2017-07-07 11:39:29,889 PID:20395 rally.driver INFO [8/8] drivers reached join point [10/10].
2017-07-07 11:39:29,889 PID:20395 rally.driver INFO All drivers completed their operations until join point [10/10].
2017-07-07 11:39:29,889 PID:20395 rally.driver INFO All steps completed. Shutting down.
2017-07-07 11:39:29,891 PID:20396 rally.driver INFO LoadGenerator[0] is exiting due to ActorExitRequest.
2017-07-07 11:39:29,891 PID:20397 rally.driver INFO LoadGenerator[1] is exiting due to ActorExitRequest.
2017-07-07 11:39:29,891 PID:20398 rally.driver INFO LoadGenerator[2] is exiting due to ActorExitRequest.
2017-07-07 11:39:29,892 PID:20399 rally.driver INFO LoadGenerator[3] is exiting due to ActorExitRequest.
2017-07-07 11:39:29,892 PID:20400 rally.driver INFO LoadGenerator[4] is exiting due to ActorExitRequest.
2017-07-07 11:39:29,893 PID:20401 rally.driver INFO LoadGenerator[5] is exiting due to ActorExitRequest.
2017-07-07 11:39:29,893 PID:20395 rally.driver INFO Postprocessing samples...
2017-07-07 11:39:29,893 PID:20402 rally.driver INFO LoadGenerator[6] is exiting due to ActorExitRequest.
2017-07-07 11:39:29,894 PID:20395 rally.driver INFO Storing latency and service time... 
2017-07-07 11:39:29,894 PID:20403 rally.driver INFO LoadGenerator[7] is exiting due to ActorExitRequest.
2017-07-07 11:39:30,134 PID:20395 rally.driver INFO Calculating throughput... 
2017-07-07 11:39:30,232 PID:20395 rally.driver INFO Storing throughput... 
2017-07-07 11:39:30,244 PID:20395 rally.driver INFO Sending benchmark results...
2017-07-07 11:39:30,246 PID:20395 rally.metrics INFO Writing [27951] metrics records temporarily to [/var/folders/b6/btz7_9r50_vbdrzmlmbw56sj081fgs/T/rallyzdnvu67b/metrics.json].
2017-07-07 11:39:31,692 PID:20395 rally.driver INFO Closing metrics store...
2017-07-07 11:39:31,693 PID:20395 rally.metrics INFO Closing metrics store.
2017-07-07 11:39:31,707 PID:20395 rally.driver INFO Terminating main driver actor.
2017-07-07 11:39:31,709 PID:20395 rally.driver INFO Main driver received ActorExitRequest and will terminate all load generators.
2017-07-07 11:39:31,709 PID:20395 rally.driver INFO Main driver has notified all load generators of termination.
2017-07-07 12:14:21,192 PID:20372 rally.racecontrol INFO User has cancelled the benchmark.
2017-07-07 12:14:21,192 PID:20372 rally.racecontrol INFO Asking mechanic to stop the engine.
2017-07-07 12:14:21,195 PID:20390 rally.mechanic INFO Stopping engine
2017-07-07 12:14:21,196 PID:20390 rally.mechanic INFO Stopping engine.
2017-07-07 12:14:21,196 PID:20390 rally.metrics INFO Compression changed size of metric store from [64] bytes to [47] bytes
2017-07-07 12:14:21,198 PID:20372 rally.racecontrol INFO Mechanic has stopped engine successfully.
2017-07-07 12:14:21,198 PID:20372 rally.racecontrol INFO Bulk adding system metrics to metrics store.
2017-07-07 12:14:21,199 PID:20372 rally.metrics INFO Restoring in-memory representation of metrics store.
2017-07-07 12:14:21,199 PID:20372 rally.racecontrol INFO Suppressing output of summary report. Cancelled = [False], Error = [True].
2017-07-07 12:14:21,199 PID:20372 rally.metrics INFO Closing metrics store.
2017-07-07 12:14:21,199 PID:20372 rally.main INFO Attempting to shutdown internal actor system.
2017-07-07 12:14:28,204 PID:20372 rally.main WARNING User interrupted shutdown of internal actor system.
2017-07-07 12:14:28,204 PID:20372 rally.main INFO Attempting to shutdown internal actor system.
2017-07-07 12:14:38,213 PID:20372 rally.main INFO Actor system is still running. Waiting...
2017-07-07 12:14:39,214 PID:20372 rally.main INFO Actor system is still running. Waiting...
2017-07-07 12:14:40,219 PID:20372 rally.main INFO Actor system is still running. Waiting...
2017-07-07 12:14:41,220 PID:20372 rally.main INFO Actor system is still running. Waiting...
2017-07-07 12:14:42,222 PID:20372 rally.main INFO Actor system is still running. Waiting...

thespian.log:

2017-07-07 11:49:58.464614 p19941 I    Pending Actor request received for esrally.mechanic.mechanic.MechanicActor reqs {'coordinator': True} from ActorAddr-(T|:65156)
2017-07-07 11:49:58.480827 p19951 I    Starting Actor esrally.mechanic.mechanic.MechanicActor at ActorAddr-(T|:65158) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:49:58.492303 p19941 I    Pending Actor request received for esrally.mechanic.mechanic.LocalNodeMechanicActor reqs {'coordinator': True} from ActorAddr-(T|:65158)
2017-07-07 11:49:58.503556 p19952 I    Starting Actor esrally.mechanic.mechanic.LocalNodeMechanicActor at ActorAddr-(T|:65161) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:50:00.924775 p19941 I    Pending Actor request received for esrally.driver.driver.Driver reqs {'coordinator': True} from ActorAddr-(T|:65174)
2017-07-07 11:50:00.931317 p19959 I    Starting Actor esrally.driver.driver.Driver at ActorAddr-(T|:65176) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:50:03.442144 p19941 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:65176)
2017-07-07 11:50:03.447088 p19941 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:65176)
2017-07-07 11:50:03.449387 p19964 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:65184) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:50:03.452949 p19941 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:65176)
2017-07-07 11:50:03.455369 p19965 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:65186) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:50:03.459293 p19941 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:65176)
2017-07-07 11:50:03.460954 p19966 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:65188) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:50:03.466862 p19941 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:65176)
2017-07-07 11:50:03.468527 p19967 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:65191) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:50:03.474033 p19941 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:65176)
2017-07-07 11:50:03.475560 p19968 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:65194) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:50:03.482736 p19969 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:65197) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:50:03.569448 p19941 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:65176)
2017-07-07 11:50:03.574834 p19941 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:65176)
2017-07-07 11:50:03.578182 p19970 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:65203) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:50:03.586625 p19971 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:65205) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:56:50.610673 p19932 I    ActorSystem shutdown requested.
2017-07-07 11:56:50.756031 p19941 I    ConnRefused to ActorAddr-(T|:65176); declaring as DeadTarget.
2017-07-07 11:56:51.051307 p19941 I    ConnRefused to ActorAddr-(T|:65176); declaring as DeadTarget.
2017-07-07 11:56:51.117640 p19941 I    ConnRefused to ActorAddr-(T|:65176); declaring as DeadTarget.
2017-07-07 11:56:51.191503 p19941 I    ConnRefused to ActorAddr-(T|:65176); declaring as DeadTarget.
2017-07-07 11:56:51.552117 p19941 I    ConnRefused to ActorAddr-(T|:65176); declaring as DeadTarget.
2017-07-07 11:56:51.806566 p19941 I    ConnRefused to ActorAddr-(T|:65176); declaring as DeadTarget.
2017-07-07 11:56:52.019058 p19941 I    ConnRefused to ActorAddr-(T|:65176); declaring as DeadTarget.
2017-07-07 11:56:52.262107 p19941 I    ConnRefused to ActorAddr-(T|:65176); declaring as DeadTarget.
2017-07-07 11:56:52.283748 p19941 I    ---- shutdown completed
2017-07-07 11:56:52.284814 p19932 I    ActorSystem shutdown complete.
2017-07-07 11:58:48.605716 p20381 I    ++++ Admin started @ ActorAddr-(T|:1900) / gen (3, 7)
2017-07-07 11:58:48.767869 p20381 I    Pending Actor request received for esrally.mechanic.mechanic.MechanicActor reqs {'coordinator': True} from ActorAddr-(T|:49425)
2017-07-07 11:58:48.776747 p20389 I    Starting Actor esrally.mechanic.mechanic.MechanicActor at ActorAddr-(T|:49427) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:48.788626 p20381 I    Pending Actor request received for esrally.mechanic.mechanic.LocalNodeMechanicActor reqs {'coordinator': True} from ActorAddr-(T|:49427)
2017-07-07 11:58:48.796154 p20390 I    Starting Actor esrally.mechanic.mechanic.LocalNodeMechanicActor at ActorAddr-(T|:49430) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:50.866138 p20381 I    Pending Actor request received for esrally.driver.driver.Driver reqs {'coordinator': True} from ActorAddr-(T|:49436)
2017-07-07 11:58:50.873119 p20395 I    Starting Actor esrally.driver.driver.Driver at ActorAddr-(T|:49438) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:53.394906 p20381 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:49438)
2017-07-07 11:58:53.399788 p20381 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:49438)
2017-07-07 11:58:53.402175 p20396 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:49445) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:53.405218 p20381 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:49438)
2017-07-07 11:58:53.407687 p20397 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:49447) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:53.412270 p20381 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:49438)
2017-07-07 11:58:53.413599 p20398 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:49450) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:53.419195 p20381 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:49438)
2017-07-07 11:58:53.422412 p20399 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:49453) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:53.429020 p20381 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:49438)
2017-07-07 11:58:53.430916 p20400 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:49456) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:53.437950 p20401 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:49459) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:53.526450 p20381 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:49438)
2017-07-07 11:58:53.533294 p20381 I    Pending Actor request received for esrally.driver.driver.LoadGenerator reqs {'coordinator': True} from ActorAddr-(T|:49438)
2017-07-07 11:58:53.538151 p20402 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:49465) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 11:58:53.544312 p20403 I    Starting Actor esrally.driver.driver.LoadGenerator at ActorAddr-(T|:49467) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900))
2017-07-07 12:39:31.721620 p20381 Warn Remote ActorAddr-(T|:49438) closed connection before ack received at ActorAddr-(T|:1900) for TransportIntent(ActorAddr-(T|:49438)-pending-ExpiresIn_0:04:58.174942-<class 'thespian.actors.ChildActorExited'>-ChildActorExited:ActorAddr-(T|:49445)-quit_0:04:58.174931)
2017-07-07 12:39:31.724391 p20381 I    ConnRefused to ActorAddr-(T|:49438); declaring as DeadTarget.
2017-07-07 12:39:31.725388 p20381 I    ConnRefused to ActorAddr-(T|:49438); declaring as DeadTarget.
2017-07-07 12:39:31.726187 p20381 I    ConnRefused to ActorAddr-(T|:49438); declaring as DeadTarget.
2017-07-07 12:39:31.726856 p20381 I    ConnRefused to ActorAddr-(T|:49438); declaring as DeadTarget.
2017-07-07 12:39:31.727494 p20381 I    ConnRefused to ActorAddr-(T|:49438); declaring as DeadTarget.
2017-07-07 12:39:31.728178 p20381 I    ConnRefused to ActorAddr-(T|:49438); declaring as DeadTarget.
2017-07-07 12:39:31.728810 p20381 I    ConnRefused to ActorAddr-(T|:49438); declaring as DeadTarget.
2017-07-07 13:14:21.201909 p20372 I    ActorSystem shutdown requested.
2017-07-07 13:14:28.205072 p20372 I    ActorSystem shutdown requested.
2017-07-07 13:14:38.210677 p20372 ERR  No response to Admin shutdown request; Actor system not completely shutdown
2017-07-07 13:14:38.211162 p20372 I    completion error: ************* TransportIntent(ActorAddr-(T|:1900)-=-SENDSTS_FAILED-<class 'thespian.system.messages.admin.SystemShutdown'>-<thespian.system.messages.admin.SystemShutdown object at 0x109463f98>-quit_0:04:42.990834)
2017-07-07 13:14:38.212199 p20372 I    completion error: ************* TransportIntent(ActorAddr-(T|:1900)-=-SENDSTS_FAILED-WAITSLOT-<class 'thespian.system.messages.admin.SystemShutdown'>-<thespian.system.messages.admin.SystemShutdown object at 0x109463160>-quit_0:04:49.993040)
2017-07-07 13:14:38.212760 p20372 I    ActorSystem shutdown complete.```

danielmitterdorfer added a commit that referenced this issue Jul 10, 2017
With this commit we have race control terminate the main load driver.
Previously it self-terminated but we suspect that this can lead to a
race condition which will in turn lead to message loss.

Relates #298
@danielmitterdorfer
Copy link
Member

danielmitterdorfer commented Jul 10, 2017

I think I know now what's going on although I wonder a bit about the circumstances. Namely, that it shows up all the time on your machine in "normal" mode but not in test mode. Anyway, the reason seems to be a timing issue. The load test driver first sends the results and then immediately self-terminates. Based on the following line:

2017-07-07 12:39:31.721620 p20381 Warn Remote ActorAddr-(T|:49438) closed connection before ack received at ActorAddr-(T|:1900) for TransportIntent(ActorAddr-(T|:49438)-pending-ExpiresIn_0:04:58.174942-<class 'thespian.actors.ChildActorExited'>-ChildActorExited:ActorAddr-(T|:49445)-quit_0:04:58.174931)

I have the impression that this is the reason that the results never arrive at the coordinator. I've changed the message flow now so that this does not happen anymore.

Based on what I see in your original report you seem to run of the master version of Rally. Can you please update to the latest master (just run git pull --rebase origin in your clone of the Rally source repo and run python3 setup.py develop. If you use the rally wrapper script (see docs) then there is no need for all that and you can just run ./rally --version and it will take care of everything if your working copy is clean) and retry whether my changes fix the error?

Before you start Rally, please check that you have the correct version. It should say:

$ esrally --version
esrally 0.6.2.dev0 (git revision: c8265ad)

Just in case, please also enable the actor system's log again with export THESPLOG_FILE=/Users/user/Desktop/thespian.log.

@danielmitterdorfer danielmitterdorfer added the :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc. label Jul 10, 2017
@azarum
Copy link
Author

azarum commented Jul 11, 2017

Thanks a lot @danielmitterdorfer !
This fixed the issue for me

@danielmitterdorfer
Copy link
Member

Glad to hear that @azarum. Then I had indeed the right intuition here. Thank you for your help in debugging it.

@danielmitterdorfer danielmitterdorfer added this to the 0.6.2 milestone Jul 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something's wrong :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc.
Projects
None yet
Development

No branches or pull requests

2 participants