Update DiskIo telemetry device to persist the counters #731

ebadyano · 2019-07-17T15:05:04Z

Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to #697

ebadyano · 2019-07-17T15:05:16Z

Re-opening #721 here as the repository state got messed up.

danielmitterdorfer

Looks mostly fine. I left a couple more comments.

danielmitterdorfer · 2019-07-17T15:12:49Z

tests/mechanic/telemetry_test.py

        t.detach_from_node(node, running=True)
        t.detach_from_node(node, running=False)
+        node2 = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t)


What I meant in #721 (comment) is that we recreate the telemetry device, not the node. This ensures that we don't rely on any state that is tied to that instance.

danielmitterdorfer · 2019-07-17T15:14:07Z

tests/mechanic/telemetry_test.py


+        # expected result is 1 byte because there are two nodes on the machine. Result is calculated with total_bytes / node_count


Can you please wrap that line at 120 characters?
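One possible wrapping, with the arithmetic from the comment spelled out. This is only a sketch: the total of 2 bytes is an assumed value chosen so that the result matches the comment, and the variable names are illustrative rather than taken from the test.

```python
# expected result is 1 byte because there are two nodes on the machine.
# Result is calculated with total_bytes / node_count
total_bytes = 2  # assumed total for illustration, not a value from the test
node_count = 2   # two nodes on the machine, as stated in the comment
bytes_per_node = total_bytes / node_count
```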

danielmitterdorfer · 2019-07-17T15:14:39Z

tests/mechanic/telemetry_test.py

        t.detach_from_node(node, running=True)
        t.detach_from_node(node, running=False)
+        node2 = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t)


Similarly to the test above we should create a new instance of DiskIo instead of the node.

danielmitterdorfer

Thanks for iterating. I left a couple of suggestions and thoughts.

danielmitterdorfer · 2019-07-18T11:40:48Z

esrally/mechanic/telemetry.py

@@ -774,14 +775,16 @@ class DiskIo(InternalTelemetryDevice):
    """
    Gathers disk I/O stats.
    """
-    def __init__(self, metrics_store, node_count_on_host):
+    def __init__(self, metrics_store, node_count_on_host, log_root, node_name):


The DockerLauncher calls this constructor as well. Can you please update the constructor invocation accordingly and double-check you have adapted all call-sites?

As per offline discussion, I fixed the constructor call. While doing so I found that attach_to_node isn't called for telemetry devices in DockerLauncher, so DiskIo isn't working properly there. I opened #733 to fix that separately.

danielmitterdorfer · 2019-07-18T11:41:09Z

esrally/mechanic/telemetry.py

        super().__init__()
        self.metrics_store = metrics_store
        self.node_count_on_host = node_count_on_host
        self.node = None
        self.process = None
        self.disk_start = None
        self.process_start = None
+        self.node_name = node_name


Nit: Can we match the order in the constructor's argument list here?
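One reading of this nit, sketched below: assign the attributes that come from constructor arguments first, in signature order, and the internal state afterwards. `DiskIoSketch` is a hypothetical stand-in that only mirrors the signature shown in the diff; it is not the actual class from esrally/mechanic/telemetry.py.

```python
class DiskIoSketch:
    """Hypothetical stand-in, not Rally's actual DiskIo class."""

    def __init__(self, metrics_store, node_count_on_host, log_root, node_name):
        # Constructor arguments first, in the order of the signature.
        self.metrics_store = metrics_store
        self.node_count_on_host = node_count_on_host
        self.log_root = log_root
        self.node_name = node_name
        # Internal state that is not passed in follows afterwards.
        self.node = None
        self.process = None
        self.disk_start = None
        self.process_start = None


# Placeholder argument values, for illustration only.
device = DiskIoSketch(metrics_store=None, node_count_on_host=2, log_root="logs", node_name="rally0")
```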

danielmitterdorfer · 2019-07-18T11:42:41Z

esrally/mechanic/telemetry.py

+            read_bytes = 0
+            write_bytes = 0
+            io.ensure_dir(self.log_root)
+            tmp_io_file = os.path.join(self.log_root, "%s.io" % self.node_name)


Nit: As per convention we prefer the format method to % formatting in code that is not performance-critical, i.e. you'd need to change this to "{}.io".format(self.node_name) (similarly in on_benchmark_stop).
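For illustration, both spellings produce the same file name, so the change is purely stylistic. In this minimal sketch the node name and log directory are placeholder values, not values from the PR.

```python
import os

node_name = "rally0"  # placeholder node name
log_root = os.path.join("logs", "telemetry")  # hypothetical directory

# %-formatting, as in the diff under review
old_style = os.path.join(log_root, "%s.io" % node_name)
# str.format, as preferred by the convention mentioned in the review
new_style = os.path.join(log_root, "{}.io".format(node_name))
```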

danielmitterdorfer · 2019-07-18T11:47:42Z

esrally/mechanic/telemetry.py

+            io.ensure_dir(self.log_root)
+            tmp_io_file = os.path.join(self.log_root, "%s.io" % self.node_name)
+            with open(tmp_io_file, "rt", encoding="utf-8") as f:
+                io_bytes = json.load(f)


As you now also store the PID, the name io_bytes does not seem appropriate anymore? How about io_stats?
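A rough sketch of what persisting and reloading such an `io_stats` record could look like. The helper names `store_io_stats` and `load_io_stats` and the dictionary keys are assumptions for illustration, and `os.makedirs` stands in for Rally's `io.ensure_dir`; the actual implementation in the PR may differ.

```python
import json
import os
import tempfile


def store_io_stats(log_root, node_name, pid, read_bytes, write_bytes):
    """Write the start counters and the PID to <log_root>/<node_name>.io."""
    os.makedirs(log_root, exist_ok=True)  # stand-in for io.ensure_dir
    path = os.path.join(log_root, "{}.io".format(node_name))
    # Key names are assumed for this sketch.
    io_stats = {"pid": pid, "read_bytes": read_bytes, "write_bytes": write_bytes}
    with open(path, "wt", encoding="utf-8") as f:
        json.dump(io_stats, f)
    return path


def load_io_stats(log_root, node_name):
    """Read the record back, e.g. from a later process."""
    path = os.path.join(log_root, "{}.io".format(node_name))
    with open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as log_root:
    store_io_stats(log_root, "rally0", pid=1234, read_bytes=10, write_bytes=20)
    io_stats = load_io_stats(log_root, "rally0")
```

Because the record holds more than byte counters once the PID is included, a neutral name such as `io_stats` describes it more accurately than `io_bytes`.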

danielmitterdorfer · 2019-07-22T11:39:55Z

tests/mechanic/telemetry_test.py

+        t2 = telemetry.Telemetry(enabled_devices=[], devices=[device2])
+        t2.on_benchmark_stop()
+        t.detach_from_node(node, running=True)
+        t.detach_from_node(node, running=False)


Both detach_from_node calls should be on t2, not on t?

Since we call t.on_benchmark_start() only on t, does it make sense to call detach only on t and not on t2?

No, because we are testing the behavior that comes with #722 here:

In the first invocation (the new start subcommand), the telemetry device will be instantiated and we attach the telemetry device. After that the process terminates and t will be gone.

In the second invocation (the new stop subcommand), a new instance of the telemetry device will be created and we detach it from the node.

Hence, we need to call the start-related methods on one instance and the stop-related ones on the other to simulate this behavior and to test that we do not rely on any state of the previous instance (t in that case).

Makes sense?

Yes, thank you for explaining. I'll add the fix.
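The start/stop split described in this thread can be sketched with a simplified device. `FakeDiskCounterDevice` and its `read_counter` callable are hypothetical and only mimic the idea of persisting counters to a file; they are not Rally's DiskIo or its test code.

```python
import json
import os
import tempfile


class FakeDiskCounterDevice:
    """Hypothetical stand-in for a telemetry device; not Rally's DiskIo."""

    def __init__(self, log_root, node_name, read_counter):
        self.log_root = log_root
        self.node_name = node_name
        self.read_counter = read_counter  # callable returning bytes written so far

    def _path(self):
        return os.path.join(self.log_root, "{}.io".format(self.node_name))

    def on_benchmark_start(self):
        # Persist the start value so that no in-memory state is needed later.
        os.makedirs(self.log_root, exist_ok=True)
        with open(self._path(), "wt", encoding="utf-8") as f:
            json.dump({"write_bytes": self.read_counter()}, f)

    def on_benchmark_stop(self):
        # Read the start value back from disk and report the difference.
        with open(self._path(), "rt", encoding="utf-8") as f:
            start = json.load(f)
        return self.read_counter() - start["write_bytes"]


with tempfile.TemporaryDirectory() as log_root:
    # First invocation ("start"): one instance records the counters.
    device = FakeDiskCounterDevice(log_root, "rally0", read_counter=lambda: 100)
    device.on_benchmark_start()
    del device  # simulate the first process terminating

    # Second invocation ("stop"): a fresh instance works from the file alone.
    device2 = FakeDiskCounterDevice(log_root, "rally0", read_counter=lambda: 250)
    written = device2.on_benchmark_stop()
```

Calling the stop-related methods on the first instance instead would still pass even if the device kept its start counters in memory, which is exactly the dependency the test is meant to rule out.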

danielmitterdorfer · 2019-07-22T11:40:02Z

tests/mechanic/telemetry_test.py

+        t2 = telemetry.Telemetry(enabled_devices=[], devices=[device2])
+        t2.on_benchmark_stop()
+        t.detach_from_node(node, running=True)
+        t.detach_from_node(node, running=False)


Both detach_from_node calls should be on t2, not on t?

danielmitterdorfer

Thanks for iterating! LGTM

ebadyano added 6 commits July 17, 2019 10:01

Update DiskIo telemetry device to persist the counters

b85fae3

Ensure that DiskIo telemetry does not rely on Rally being a parent process of Elasticsearch and persists the disk counters at the beginning of a benchmark and can read it again afterwards. Relates to elastic#697

Fix typo

6a2e261

Use Json for storing/loading start io data

e3e5b45

Address Daniel's comments. (Still need to write a unit test)

0678257

Adding unit test for diskio

461bc67

Address Daniel's comments

a8d60bf

ebadyano requested a review from danielmitterdorfer July 17, 2019 15:05

ebadyano requested a review from dliappis July 17, 2019 15:05

danielmitterdorfer reviewed Jul 17, 2019

Fix telemetry test

4c7c976

ebadyano requested a review from danielmitterdorfer July 18, 2019 11:14

danielmitterdorfer requested changes Jul 18, 2019

Fix DiskIo call in DockerLauncher

891bf38

ebadyano mentioned this pull request Jul 18, 2019

Missing attach_to_node call for telemetry devices in DockerLauncher #733

Closed

danielmitterdorfer reviewed Jul 22, 2019

Fix telemetry test

15a6527

ebadyano requested a review from danielmitterdorfer July 24, 2019 12:39

ebadyano added the enhancement, :Telemetry, and :misc labels Jul 24, 2019

ebadyano added this to the 1.3.0 milestone Jul 24, 2019

danielmitterdorfer approved these changes Jul 25, 2019

ebadyano merged commit fe1ff28 into elastic:master Jul 25, 2019

danielmitterdorfer mentioned this pull request Oct 14, 2019

Allow to manage Elasticsearch nodes separately from benchmarking #697

Closed

7 tasks

ebadyano deleted the diskio2 branch December 16, 2022 15:16
