Rally froze at the end of a race, did not produce results #298
Comments
Thanks for the detailed report. I think there are two problems here: First: Rally is hanging. From the log output it seems that a message between the internal load driver and the coordinator process (called "racecontrol") was lost. We can see from the following lines in the log that the load driver has sent metrics:
However, the message seems to have been lost, and thus Rally appears to hang. Can you please rerun the benchmark and send me the Rally logs again, this time also including the internal log of the actor system that Rally uses? To write this file to a well-known location, you can issue the following command before you start the benchmark:
After the benchmark is done you can issue
Maybe this can shed some light on what's going on.
The second problem is much less severe: although you've cancelled the benchmark, Rally still waits for the hanging subprocess and reports an error instead. I'll soon push a commit that corrects this behavior.
With this commit we immediately cancel the benchmark if cancellation is detected in racecontrol (the coordinator) instead of first forwarding the request to the driver and relying on it to respond correctly. With this change, Rally behaves more predictably during cancellation, even in situations where there is a problem with the driver. Relates #298
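The idea behind this commit can be illustrated with a minimal sketch. It is not Rally's code and does not use Rally's actor system: `wait_for_results`, the queue and the event are hypothetical stand-ins built on the Python standard library. It only shows the principle that the coordinator checks for cancellation itself instead of waiting for the driver to acknowledge it.

```python
import queue
import threading


def wait_for_results(results, cancelled, poll_interval=0.05):
    """Wait for driver results, but give up as soon as cancellation is seen.

    Hypothetical helper: returns ("done", payload) or ("cancelled", None).
    """
    while True:
        # The coordinator decides on its own; it does not need the driver
        # to confirm the cancellation.
        if cancelled.is_set():
            return ("cancelled", None)
        try:
            return ("done", results.get(timeout=poll_interval))
        except queue.Empty:
            continue


results = queue.Queue()        # the driver would put its results here
cancelled = threading.Event()  # set when the user cancels the benchmark

# Simulate a user cancelling while the driver never answers.
timer = threading.Timer(0.1, cancelled.set)
timer.start()

status, payload = wait_for_results(results, cancelled)
timer.join()
print(status)
```

With a driver that never responds, the call still returns once the event is set, which is the behavior the commit message describes.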
Thanks for quick response!
thespian.log:
With this commit we have race control terminate the main load driver. Previously it self-terminated but we suspect that this can lead to a race condition which will in turn lead to message loss. Relates #298
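The changed message flow can be sketched roughly as follows. This is an illustration with plain threads and queues, not Rally's actor-based implementation; all names (`driver`, `coordinator`, the message strings) are hypothetical, and the record count is only an example value taken from the log. Standard-library queues do not lose messages, so the sketch cannot reproduce the suspected race; it only shows the ordering in which the driver stays alive until the coordinator asks it to exit.

```python
import queue
import threading

events = []  # records the order of the steps for inspection


def driver(to_coordinator, inbox):
    events.append("driver sends results")
    to_coordinator.put(("results", {"records": 27960}))
    # New flow: do not self-terminate; wait until the coordinator
    # explicitly asks for termination.
    message = inbox.get()
    if message == "exit":
        events.append("driver exits")


def coordinator():
    to_coordinator = queue.Queue()
    driver_inbox = queue.Queue()
    worker = threading.Thread(target=driver, args=(to_coordinator, driver_inbox))
    worker.start()

    _kind, received = to_coordinator.get()
    events.append("coordinator received results")

    # Only after the results have arrived is the driver told to terminate.
    events.append("coordinator requests exit")
    driver_inbox.put("exit")
    worker.join()
    return received


payload = coordinator()
print(events)
```

In the previous flow the driver would have returned right after sending its results, so its termination could overlap with delivery of the message.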
I think I now know what's going on, although I wonder a bit about the circumstances, namely that it shows up every time on your machine in "normal" mode but not in test mode. Anyway, the reason seems to be a timing issue. The load test driver first sends the results and then immediately self-terminates. Based on the following line:
I have the impression that this is the reason the results never arrive at the coordinator. I've now changed the message flow so that this does not happen anymore. Based on what I see in your original report, you seem to run off the master version of Rally. Can you please update to the latest master (just run
Before you start Rally, please check that you have the correct version. It should say:
Just in case, please also enable the actor system's log again with
Thanks a lot @danielmitterdorfer !
Glad to hear that, @azarum. So my intuition was indeed right. Thank you for your help in debugging it.
Rally version (get with esrally --version): esrally 0.5.4.dev0 (git revision: aaca9c0)
Invoked command:
esrally --report-format=csv --report-file=myreport.csv --track=geonames --pipeline=benchmark-only --target-hosts=:9200 --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'admin',basic_auth_password:'xxxx'"
Configuration file (located in ~/.rally/rally.ini):
[meta]
config.version = 8
[system]
env.name = local
[node]
root.dir = /Users/user/.rally/benchmarks
[source]
local.src.dir = /Users/user/.rally/elasticsearch
remote.repo.url = https://github.com/elastic/elasticsearch.git
[build]
gradle.bin = /usr/local/bin/gradle
[runtime]
java8.home = /Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
[benchmarks]
local.dataset.cache = ${node:root.dir}/data
[reporting]
datastore.type = in-memory
datastore.host =
datastore.port =
datastore.secure =
datastore.user =
datastore.password =
[tracks]
default.url = https://github.com/elastic/rally-tracks
[defaults]
preserve_benchmark_candidate = False
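The `${node:root.dir}` reference in the `[benchmarks]` section has the same form as the cross-section references supported by Python's `configparser` with `ExtendedInterpolation`. The sketch below assumes the file is read that way (the report does not say how Rally itself loads it) and uses a shortened copy of the configuration above to show what the value resolves to.

```python
import configparser

# Shortened copy of the rally.ini shown above.
RALLY_INI = """
[node]
root.dir = /Users/user/.rally/benchmarks

[benchmarks]
local.dataset.cache = ${node:root.dir}/data

[reporting]
datastore.type = in-memory
datastore.host =
"""

config = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
config.read_string(RALLY_INI)

# ${section:option} is resolved against the [node] section.
dataset_cache = config["benchmarks"]["local.dataset.cache"]
# Options with nothing after "=" are read as empty strings.
datastore_host = config["reporting"]["datastore.host"]

print(dataset_cache)
```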
JVM version:
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
OS version:
mac os sierra 10.12.5
Description of the problem including expected versus actual behavior:
Running in test mode against an external ES cluster was successful.
I then tried to run the geonames track. It seems to run all the steps but froze at the end; it did not produce results on screen or in the report file.
I ran it a few times and got the same behavior each time.
Steps to reproduce:
Provide logs (if relevant):
Last printed on screen:
Running large_filtered_terms [100% done]
Running large_prohibited_terms [100% done]
^C[ERROR] Cannot race. Driver has returned no metrics but instead [None]. Terminating race without result.
Getting further help...
Last in logs: (Broke around 14:45)
2017-07-06 14:45:24,326 PID:55761 rally.driver INFO Storing throughput...
2017-07-06 14:45:24,340 PID:55761 rally.driver INFO Sending benchmark results...
2017-07-06 14:45:24,341 PID:55761 rally.metrics INFO Writing [27960] metrics records temporarily to [/var/folders/b6/btz7_9r50_vbdrzmlmbw56sj081fgs/T/rallyc9vrx733/metrics.json].
2017-07-06 14:45:25,732 PID:55761 rally.driver INFO Closing metrics store...
2017-07-06 14:45:25,733 PID:55761 rally.metrics INFO Closing metrics store.
2017-07-06 14:45:25,749 PID:55761 rally.driver INFO Terminating main driver actor.
2017-07-06 14:45:25,750 PID:55761 rally.driver INFO Main driver received ActorExitRequest and will terminate all load generators.
2017-07-06 14:45:25,751 PID:55761 rally.driver INFO Main driver has notified all load generators of termination.
2017-07-06 14:55:18,254 PID:55732 rally.racecontrol INFO User has cancelled the benchmark.
2017-07-06 14:55:18,262 PID:55732 rally.racecontrol INFO Asking mechanic to stop the engine.
2017-07-06 14:55:18,337 PID:55752 rally.mechanic INFO Stopping engine
2017-07-06 14:55:18,346 PID:55752 rally.mechanic INFO Stopping engine.
2017-07-06 14:55:18,350 PID:55752 rally.metrics INFO Compression changed size of metric store from [64] bytes to [47] bytes
2017-07-06 14:55:18,354 PID:55732 rally.racecontrol INFO Mechanic has stopped engine successfully.
2017-07-06 14:55:18,354 PID:55732 rally.racecontrol INFO Bulk adding system metrics to metrics store.
2017-07-06 14:55:18,354 PID:55732 rally.metrics INFO Restoring in-memory representation of metrics store.
2017-07-06 14:55:18,355 PID:55732 rally.racecontrol INFO Suppressing output of summary report. Cancelled = [False], Error = [True].
2017-07-06 14:55:18,355 PID:55732 rally.metrics INFO Closing metrics store.
2017-07-06 14:55:18,356 PID:55732 rally.main INFO Attempting to shutdown internal actor system.
2017-07-06 14:55:18,393 PID:55732 rally.main INFO Actor system is still running. Waiting...
2017-07-06 14:55:18,394 PID:55743 root INFO ---- Actor System shutdown
2017-07-06 14:55:19,397 PID:55732 rally.main INFO Shutdown completed.
2017-07-06 14:55:19,397 PID:55732 root ERROR Cannot run subcommand [race].
Traceback (most recent call last):
  File "/Users/user/elastic/rally/esrally/rally.py", line 435, in dispatch_sub_command
    race(cfg)
  File "/Users/user/elastic/rally/esrally/rally.py", line 369, in race
    with_actor_system(lambda c: racecontrol.run(c), cfg)
  File "/Users/user/elastic/rally/esrally/rally.py", line 389, in with_actor_system
    runnable(cfg)
  File "/Users/user/elastic/rally/esrally/rally.py", line 369, in <lambda>
    with_actor_system(lambda c: racecontrol.run(c), cfg)
  File "/Users/user/elastic/rally/esrally/racecontrol.py", line 330, in run
    raise e
  File "/Users/user/elastic/rally/esrally/racecontrol.py", line 327, in run
    pipeline(cfg)
  File "/Users/user/elastic/rally/esrally/racecontrol.py", line 42, in __call__
    self.target(cfg)
  File "/Users/user/elastic/rally/esrally/racecontrol.py", line 275, in benchmark_only
    return race(Benchmark(cfg, external=True))
  File "/Users/user/elastic/rally/esrally/racecontrol.py", line 231, in race
    may_continue = benchmark.run(lap)
  File "/Users/user/elastic/rally/esrally/racecontrol.py", line 159, in run
    raise exceptions.RallyError("Driver has returned no metrics but instead [%s]. Terminating race without result." % str(result))
esrally.exceptions.RallyError: Driver has returned no metrics but instead [None]. Terminating race without result.
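The stall is visible in the log as the jump from 14:45 to 14:55 between the driver's last message and the cancellation. A small sketch like the following can locate such a gap in a log excerpt; the helper names are made up for illustration, and it assumes every line starts with a timestamp in the format shown above.

```python
from datetime import datetime

# Four consecutive lines copied from the log excerpt above.
LOG_LINES = [
    "2017-07-06 14:45:25,750 PID:55761 rally.driver INFO Main driver received ActorExitRequest and will terminate all load generators.",
    "2017-07-06 14:45:25,751 PID:55761 rally.driver INFO Main driver has notified all load generators of termination.",
    "2017-07-06 14:55:18,254 PID:55732 rally.racecontrol INFO User has cancelled the benchmark.",
    "2017-07-06 14:55:18,262 PID:55732 rally.racecontrol INFO Asking mechanic to stop the engine.",
]


def parse_timestamp(line):
    # The first 23 characters hold "YYYY-MM-DD HH:MM:SS,mmm".
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest silence."""
    stamps = [parse_timestamp(line) for line in lines]
    gaps = [
        (later - earlier, lines[i], lines[i + 1])
        for i, (earlier, later) in enumerate(zip(stamps, stamps[1:]))
    ]
    return max(gaps, key=lambda item: item[0])


gap, before, after = largest_gap(LOG_LINES)
print(gap, "before:", after[24:])
```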