Update DiskIo telemetry device to persist the counters #731
Ensure that the DiskIo telemetry device does not rely on Rally being a parent process of Elasticsearch: it persists the disk counters at the beginning of a benchmark and can read them again afterwards. Relates to elastic#697
Re-opening #721 here as the repository state got messed up.
Looks mostly fine. I left a couple more comments.
tests/mechanic/telemetry_test.py
t.detach_from_node(node, running=True)
t.detach_from_node(node, running=False)
node2 = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t)
What I meant in #721 (comment) is that we recreate the telemetry device, not the node. This ensures that we don't rely on any state that is tied to that instance.
tests/mechanic/telemetry_test.py
# expected result is 1 byte because there are two nodes on the machine. Result is calculated with total_bytes / node_count
Can you please wrap that line at 120 characters?
tests/mechanic/telemetry_test.py
t.detach_from_node(node, running=True)
t.detach_from_node(node, running=False)
node2 = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t)
Similarly to the test above, we should create a new instance of DiskIo instead of the node.
Thanks for iterating. I left a couple of suggestions and thoughts.
@@ -774,14 +775,16 @@ class DiskIo(InternalTelemetryDevice):
     """
     Gathers disk I/O stats.
     """
-    def __init__(self, metrics_store, node_count_on_host):
+    def __init__(self, metrics_store, node_count_on_host, log_root, node_name):
The DockerLauncher calls this constructor as well. Can you please update the constructor invocation accordingly and double-check you have adapted all call-sites?
As per offline discussion, I fixed the constructor call. But I uncovered that attach_to_node isn't called for telemetry devices in DockerLauncher, so DiskIo isn't working properly there: I opened #733 to fix it separately.
Thank you.
esrally/mechanic/telemetry.py
super().__init__()
self.metrics_store = metrics_store
self.node_count_on_host = node_count_on_host
self.node = None
self.process = None
self.disk_start = None
self.process_start = None
self.node_name = node_name
Nit: Can we match the order in the constructor's argument list here?
esrally/mechanic/telemetry.py
read_bytes = 0
write_bytes = 0
io.ensure_dir(self.log_root)
tmp_io_file = os.path.join(self.log_root, "%s.io" % self.node_name)
Nit: As per convention we prefer the format method to % formatting in code that is not performance-critical, i.e. you'd need to change this to "{}.io".format(self.node_name) (similarly in on_benchmark_stop).
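For illustration, a minimal sketch of the two spellings side by side. The values of log_root and node_name are made up for the example; only the "%s.io" pattern and the format call come from the diff and the comment above.

```python
import os

# Hypothetical values, chosen only for this example.
log_root = os.path.join("logs", "telemetry")
node_name = "rally0"

# Spelling currently used in the diff: %-formatting.
old_style = os.path.join(log_root, "%s.io" % node_name)

# Spelling requested in the review: str.format.
new_style = os.path.join(log_root, "{}.io".format(node_name))

# Both produce the same path, so the change is purely stylistic.
print(old_style == new_style)
```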
esrally/mechanic/telemetry.py
io.ensure_dir(self.log_root)
tmp_io_file = os.path.join(self.log_root, "%s.io" % self.node_name)
with open(tmp_io_file, "rt", encoding="utf-8") as f:
io_bytes = json.load(f)
As you now also store the PID, the name io_bytes does not seem appropriate anymore? How about io_stats?
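A rough sketch of what persisting and re-reading such a record could look like, assuming the file holds the PID next to the byte counters. The helper names write_io_stats and read_io_stats and the dictionary keys are hypothetical and not taken from the PR, and os.makedirs stands in for Rally's io.ensure_dir.

```python
import json
import os
import tempfile


def write_io_stats(log_root, node_name, io_stats):
    # Hypothetical helper: store the stats as JSON in <log_root>/<node_name>.io
    os.makedirs(log_root, exist_ok=True)
    path = os.path.join(log_root, "{}.io".format(node_name))
    with open(path, "wt", encoding="utf-8") as f:
        json.dump(io_stats, f)
    return path


def read_io_stats(log_root, node_name):
    # Hypothetical helper: load the stats written by write_io_stats
    path = os.path.join(log_root, "{}.io".format(node_name))
    with open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


log_root = tempfile.mkdtemp()
# Key names are assumptions made for this example.
write_io_stats(log_root, "rally0", {"pid": 1234, "read_bytes": 100, "write_bytes": 50})
io_stats = read_io_stats(log_root, "rally0")
print(io_stats["pid"], io_stats["read_bytes"], io_stats["write_bytes"])
```

Since the loaded dictionary carries more than byte counts, a name like io_stats describes it more accurately than io_bytes.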
tests/mechanic/telemetry_test.py
t2 = telemetry.Telemetry(enabled_devices=[], devices=[device2])
t2.on_benchmark_stop()
t.detach_from_node(node, running=True)
t.detach_from_node(node, running=False)
Both detach_from_node calls should be on t2, not on t?
Since we call t.on_benchmark_start() only on t, does it make sense to call detach only on t and not on t2?
No, because you need to consider that we test here the behavior with #722:
- In the first invocation (the new start subcommand), the telemetry device will be instantiated and we attach the telemetry device. After that the process terminates and t will be gone.
- In the second invocation (the new stop subcommand), a new instance of the telemetry device will be created and we detach it from the node.
Hence, we need to call the start-related methods on one instance and the stop-related ones on the other to simulate this behavior and to test that we do not rely on any state of the previous instance (t in that case).
Makes sense?
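The two-invocation behavior described above can be sketched with a toy device. ToyDiskIo, its constructor arguments, and the counter values are hypothetical stand-ins for this example and not Rally's DiskIo; the point is only that the second instance sees nothing from the first except the file on disk.

```python
import json
import os
import tempfile


class ToyDiskIo:
    """Hypothetical stand-in for a telemetry device; not Rally's DiskIo."""

    def __init__(self, log_root, node_name, read_counter):
        self.log_root = log_root
        self.node_name = node_name
        # Callable that returns the current byte counter (an assumption).
        self.read_counter = read_counter

    def _path(self):
        return os.path.join(self.log_root, "{}.io".format(self.node_name))

    def on_benchmark_start(self):
        # Persist the starting counter instead of keeping it on the instance.
        os.makedirs(self.log_root, exist_ok=True)
        with open(self._path(), "wt", encoding="utf-8") as f:
            json.dump({"bytes": self.read_counter()}, f)

    def on_benchmark_stop(self):
        # Read the starting counter back from disk and compute the difference.
        with open(self._path(), "rt", encoding="utf-8") as f:
            start = json.load(f)
        return self.read_counter() - start["bytes"]


log_root = tempfile.mkdtemp()
counters = iter([1000, 1500])  # made-up counter readings


def next_counter():
    return next(counters)


# "start" invocation: the first instance records the counter and then goes away.
first = ToyDiskIo(log_root, "rally0", next_counter)
first.on_benchmark_start()
del first

# "stop" invocation: a fresh instance relies only on the persisted file.
second = ToyDiskIo(log_root, "rally0", next_counter)
delta = second.on_benchmark_stop()
print(delta)
```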
Yes, thank you for explaining. I'll add the fix.
Thanks for iterating! LGTM