Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update DiskIo telemetry device to persist the counters #731

Merged
merged 9 commits into from
Jul 25, 2019

Conversation

ebadyano
Copy link
Contributor

Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to #697

ebadyano added 6 commits July 17, 2019 10:01
Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to elastic#697
@ebadyano
Copy link
Contributor Author

Re-opening #721 here as the repository state got messed up.

@ebadyano ebadyano requested a review from dliappis July 17, 2019 15:05
Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly fine. I left a couple more comments.

t.detach_from_node(node, running=True)
t.detach_from_node(node, running=False)
node2 = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant in #721 (comment) is that we recreate the telemetry device, not the node. This ensures that we don't rely on any state that is tied to that instance.


# expected result is 1 byte because there are two nodes on the machine. Result is calculated with total_bytes / node_count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please wrap that line at 120 characters?

t.detach_from_node(node, running=True)
t.detach_from_node(node, running=False)
node2 = cluster.Node(pid=None, host_name="localhost", node_name="rally0", telemetry=t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the test above we should create a new instance of DiskIo instead of the node.

Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating. I left a couple of suggestions and thoughts.

@@ -774,14 +775,16 @@ class DiskIo(InternalTelemetryDevice):
"""
Gathers disk I/O stats.
"""
def __init__(self, metrics_store, node_count_on_host):
def __init__(self, metrics_store, node_count_on_host, log_root, node_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DockerLauncher calls this constructor as well. Can you please update the constructor invocation accordingly and double-check you have adapted all call-sites?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per offline discussion, I fixed the constructor call. But uncovered that attach_to_node isn't called for telemetry devices in DockerLauncher and so DIskIo isn't working properly: opened #733 to fix it separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

super().__init__()
self.metrics_store = metrics_store
self.node_count_on_host = node_count_on_host
self.node = None
self.process = None
self.disk_start = None
self.process_start = None
self.node_name = node_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Can we we match the order in the constructor's argument list here?

read_bytes = 0
write_bytes = 0
io.ensure_dir(self.log_root)
tmp_io_file = os.path.join(self.log_root, "%s.io" % self.node_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: As per convention we prefer the format method to % formatting in code that is not performance-critical, i.e. you'd need to change this to "{}.io".format(self.node_name)) (similarly in on_benchmark_stop)

io.ensure_dir(self.log_root)
tmp_io_file = os.path.join(self.log_root, "%s.io" % self.node_name)
with open(tmp_io_file, "rt", encoding="utf-8") as f:
io_bytes = json.load(f)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you now also store the PID, the name io_bytes does not seem appropriate anymore? How about io_stats?

t2 = telemetry.Telemetry(enabled_devices=[], devices=[device2])
t2.on_benchmark_stop()
t.detach_from_node(node, running=True)
t.detach_from_node(node, running=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both detach_from_node calls should be on t2, not on t?

Copy link
Contributor Author

@ebadyano ebadyano Jul 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we call t.on_benchmark_start() only on t does it make sense to call detach only on t and not on t2?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because you need to consider that we test here the behavior with #722:

  • In the first invocation (the new start subcommand), the telemetry device will be instantiated and we attach the telemetry device. After that the process terminates and t will be gone.
  • In the second invocation (the new stop subcommand), a new instance of the telemetry device will be created and we detach it from the node.

Hence, we need to call the start-related methods on one instance and the stop-related ones on the other to simulate this behavior and to test that we do not rely on any state of the previous instance (t in that case).

Makes sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you for explaining. I'll add the fix.

t2 = telemetry.Telemetry(enabled_devices=[], devices=[device2])
t2.on_benchmark_stop()
t.detach_from_node(node, running=True)
t.detach_from_node(node, running=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both detach_from_node calls should be on t2, not on t?

@ebadyano ebadyano added enhancement Improves the status quo :Telemetry Telemetry Devices that gather additional metrics :misc Changes that don't affect users directly: linter fixes, test improvements, etc. labels Jul 24, 2019
@ebadyano ebadyano added this to the 1.3.0 milestone Jul 24, 2019
Copy link
Member

@danielmitterdorfer danielmitterdorfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating! LGTM

@ebadyano ebadyano merged commit fe1ff28 into elastic:master Jul 25, 2019
novosibman pushed a commit to novosibman/rally that referenced this pull request Aug 12, 2019
Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to elastic#697
novosibman pushed a commit to novosibman/rally that referenced this pull request Aug 12, 2019
Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to elastic#697
novosibman pushed a commit to novosibman/rally that referenced this pull request Aug 12, 2019
Ensure that DiskIo telemetry does not rely on Rally being a parent
process of Elasticsearch and persists the disk counters at the beginning
of a benchmark and can read it again afterwards.

Relates to elastic#697
@ebadyano ebadyano deleted the diskio2 branch December 16, 2022 15:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improves the status quo :misc Changes that don't affect users directly: linter fixes, test improvements, etc. :Telemetry Telemetry Devices that gather additional metrics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants