-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow to customize track preparation #1135
Allow to customize track preparation #1135
Conversation
With this commit we introduce track processors and define two phases in the track lifecycle where they can hook into: * `on_after_load_track`: This callback is executed after the track has been successfully loaded and converted into Rally's domain representation. Rally defines some out-of-the-box track processors for task filters and test mode support which will be called in any case. * `on_prepare_track`: This callback is executed before a load generator will start executing the benchmark. It can be used to download or generate any datasets that are needed for the benchmark. Rally defines a default behavior that downloads and extracts all relevant corpora. That behavior can be overridden via a custom track processor (i.e. Rally's default behavior is then *disabled*.). Like all custom track components, track processors are registered via the `register` function and we implement a new registration method on the registry called `register_track_processor` for that, which expects an instance that implements the two methods above. Note that track processors cannot share state in between these phases. Some phases might only be invoked on the coordinating node of a benchmark (`on_after_load_track`) whereas others will need to be called on each load generator machine (`on_prepare_track`). We introduce a new `parameters` dictionary on track as well as on challenge level, where the challenge properties override the defaults defined track level. These parameters are only exposed on challenge level and track processors are expected to operate only on that level. Track processors can determine the current challenge with the new property `selected_challenge_or_default`. Mid-term we intend to restrict the current track parameter feature in a way that all values provided via `--track-params` on the command line must be declared within the `parameters` block, either on track or on challenge level. That's the reason why we have called this block already `parameters` although it will only be used by track processors for now. We intentionally do not provide user documentation for this feature yet as we want to gain more experience first before exposing it to a wider user base. Closes elastic#1066
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left a few comments and questions, but the implementation looks solid.
esrally/track/loader.py
Outdated
|
||
:param track: The current track. | ||
""" | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason for preferring ellipsis to the usual pass
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change to pass
esrally/track/loader.py
Outdated
net.download(data_url, target_path, size_in_bytes, progress_indicator=progress) | ||
progress.finish() | ||
self.logger.info("Downloaded data from [%s] to [%s].", data_url, target_path) | ||
except urllib.error.HTTPError as e: | ||
if e.code == 404 and self.test_mode: | ||
raise exceptions.DataError("Track [%s] does not support test mode. Please ask the track author to add it or " | ||
"disable test mode and retry." % self.track_name) | ||
raise exceptions.DataError("This track does not support test mode. Please ask the track author " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's standardize on no please, see https://developers.google.com/style/tone#politeness-and-use-of-please
esrally/track/loader.py
Outdated
raise exceptions.SystemSetupError( | ||
"Cannot download from %s to %s. Please verify that data are available at %s and " | ||
"check your Internet connection." % (data_url, target_path, data_url)) | ||
raise exceptions.SystemSetupError(f"Could not download [{data_url}] to [{target_path}]. Please verify data " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment like earlier, we can start removing please as per the usual style guides e.g. https://developers.google.com/style/tone#politeness-and-use-of-please
if self._filter_out_match(leaf_task): | ||
leafs_to_remove.append(leaf_task) | ||
for leaf_task in leafs_to_remove: | ||
self.logger.info("Removing sub-task [%s] from challenge [%s] due to task filter.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sense such info message will be VERY helpful for debugging! 💯
esrally/track/loader.py
Outdated
path, ext = io.splitext(document_set.document_archive) | ||
path_2, ext_2 = io.splitext(path) | ||
|
||
document_set.document_archive = "%s-1k%s%s" % (path_2, ext_2, ext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason against using an f-string here and lines below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this was mainly because I was only moving lines around but did not change the implementation. I'll adapt accordingly.
from typing import Any, Generator, Mapping | ||
|
||
|
||
def merge_dicts(d1: Mapping[str, Any], d2: Mapping[str, Any]) -> Generator[Any, None, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative non-recursive implementation that seems to be quite speedy can be found here: https://stackoverflow.com/a/61708681/12047396, however, it won't merge lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd go for the more generic implementation as I don't think performance matters too much here.
@elasticmachine test this please |
With this commit we drop the `name` argument from the registration method for track processors. Contrary to runners, schedulers and parameter sources there is no need to refer to track processors by name and this parameter has been introduced by mistake. Relates elastic#1135
With this commit we drop the `name` argument from the registration method for track processors. Contrary to runners, schedulers and parameter sources there is no need to refer to track processors by name and this parameter has been introduced by mistake. Relates #1135
With this commit we introduce track processors and define two phases in the
track lifecycle where they can hook into:
on_after_load_track
: This callback is executed after the track has beensuccessfully loaded and converted into Rally's domain representation. Rally
defines some out-of-the-box track processors for task filters and test mode
support which will be called in any case.
on_prepare_track
: This callback is executed before a load generator willstart executing the benchmark. It can be used to download or generate any
datasets that are needed for the benchmark. Rally defines a default behavior
that downloads and extracts all relevant corpora. That behavior can be
overridden via a custom track processor (i.e. Rally's default behavior is then
disabled.).
Like all custom track components, track processors are registered via the
register
function and we implement a new registration method on the registrycalled
register_track_processor
for that, which expects an instance thatimplements the two methods above.
Note that track processors cannot share state in between these phases. Some
phases might only be invoked on the coordinating node of a benchmark
(
on_after_load_track
) whereas others will need to be called on each loadgenerator machine (
on_prepare_track
).We introduce a new
parameters
dictionary on track as well as on challengelevel, where the challenge properties override the defaults defined track level.
These parameters are only exposed on challenge level and track processors are
expected to operate only on that level. Track processors can determine the
current challenge with the new property
selected_challenge_or_default
.Mid-term we intend to restrict the current track parameter feature in a way that
all values provided via
--track-params
on the command line must be declaredwithin the
parameters
block, either on track or on challenge level. That's thereason why we have called this block already
parameters
although it will onlybe used by track processors for now.
We intentionally do not provide user documentation for this feature yet as we
want to gain more experience first before exposing it to a wider user base.
Closes #1066