Allow to customize track preparation #1135

Merged

Conversation

danielmitterdorfer
Member

With this commit we introduce track processors and define two phases in the
track lifecycle into which they can hook:

  • on_after_load_track: This callback is executed after the track has been
    successfully loaded and converted into Rally's domain representation. Rally
    defines some out-of-the-box track processors for task filters and test mode
    support, which are always called.
  • on_prepare_track: This callback is executed before a load generator
    starts executing the benchmark. It can be used to download or generate any
    datasets that are needed for the benchmark. Rally defines a default behavior
    that downloads and extracts all relevant corpora. That behavior can be
    overridden via a custom track processor (i.e. Rally's default behavior is then
    disabled).

Like all custom track components, track processors are registered via the
register function. For that, we implement a new registration method on the
registry called register_track_processor, which expects an instance that
implements the two methods above.
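
As a rough illustration of the registration described above, here is a minimal sketch. Only the names register, register_track_processor, on_after_load_track and on_prepare_track come from this description. The callback arguments (in particular data_root_dir), the single-argument form of register_track_processor, and the classes RecordingRegistry and GeneratedDataProcessor are assumptions made up for this example; they are not Rally's actual API.

```python
class RecordingRegistry:
    """Hypothetical stand-in for Rally's registry, used only in this sketch."""

    def __init__(self):
        self.track_processors = []

    def register_track_processor(self, processor):
        # Assumed signature: takes the processor instance only.
        self.track_processors.append(processor)


class GeneratedDataProcessor:
    """Hypothetical track processor that implements both callbacks.

    It deliberately keeps no instance state, because the two phases may run
    on different machines.
    """

    def on_after_load_track(self, track):
        # Runs once the track has been loaded into the domain representation.
        # A real processor could inspect or modify the track here.
        return None

    def on_prepare_track(self, track, data_root_dir):
        # Runs before a load generator starts the benchmark. A real processor
        # could download or generate data sets below data_root_dir (assumed
        # parameter) instead of relying on the default corpus download.
        return None


def register(registry):
    # Entry point that a track's plugin module is expected to provide.
    registry.register_track_processor(GeneratedDataProcessor())
```

In this sketch, calling register(RecordingRegistry()) leaves exactly one processor instance in the registry's track_processors list.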

Note that track processors cannot share state between these phases. Some
phases might only be invoked on the coordinating node of a benchmark
(on_after_load_track) whereas others need to be called on each load
generator machine (on_prepare_track).

We introduce a new parameters dictionary on track as well as on challenge
level, where the challenge properties override the defaults defined at track
level. These parameters are only exposed on challenge level and track processors
are expected to operate only on that level. Track processors can determine the
current challenge with the new property selected_challenge_or_default.
Mid-term we intend to restrict the current track parameter feature so that
all values provided via --track-params on the command line must be declared
within the parameters block, either on track or on challenge level. That is
why we already call this block parameters, although for now it is only used
by track processors.
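
To illustrate the override semantics described above, here is a minimal sketch. The function name effective_parameters and the sample keys are hypothetical, and a shallow override is assumed for simplicity; the actual change may combine nested values differently.

```python
def effective_parameters(track_parameters, challenge_parameters):
    """Return the parameters visible on challenge level.

    Track-level values act as defaults; challenge-level values win on conflict.
    Assumption: a shallow, key-by-key override.
    """
    merged = dict(track_parameters)      # copy so the track defaults stay untouched
    merged.update(challenge_parameters)  # challenge values override the defaults
    return merged


track_defaults = {"corpus_size": "full", "generate_data": False}
challenge_overrides = {"generate_data": True}
resolved = effective_parameters(track_defaults, challenge_overrides)
```

Here resolved keeps corpus_size from the track defaults and takes generate_data from the challenge.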

We intentionally do not provide user documentation for this feature yet as we
want to gain more experience first before exposing it to a wider user base.

Closes #1066

@danielmitterdorfer danielmitterdorfer added enhancement Improves the status quo :Track Management New operations, changes in the track format, track download changes and the like labels Dec 16, 2020
@danielmitterdorfer danielmitterdorfer added this to the 2.0.3 milestone Dec 16, 2020
@danielmitterdorfer danielmitterdorfer self-assigned this Dec 16, 2020
Contributor

@dliappis dliappis left a comment

LGTM! Left a few comments and questions, but the implementation looks solid.


:param track: The current track.
"""
...
Contributor

Is there a reason for preferring ellipsis to the usual pass?

Member Author

I'll change to pass

net.download(data_url, target_path, size_in_bytes, progress_indicator=progress)
progress.finish()
self.logger.info("Downloaded data from [%s] to [%s].", data_url, target_path)
except urllib.error.HTTPError as e:
if e.code == 404 and self.test_mode:
raise exceptions.DataError("Track [%s] does not support test mode. Please ask the track author to add it or "
"disable test mode and retry." % self.track_name)
raise exceptions.DataError("This track does not support test mode. Please ask the track author "
Contributor

raise exceptions.SystemSetupError(
"Cannot download from %s to %s. Please verify that data are available at %s and "
"check your Internet connection." % (data_url, target_path, data_url))
raise exceptions.SystemSetupError(f"Could not download [{data_url}] to [{target_path}]. Please verify data "
Contributor

Same comment as earlier: we can start removing "please", as per the usual style guides, e.g. https://developers.google.com/style/tone#politeness-and-use-of-please

if self._filter_out_match(leaf_task):
leafs_to_remove.append(leaf_task)
for leaf_task in leafs_to_remove:
self.logger.info("Removing sub-task [%s] from challenge [%s] due to task filter.",
Contributor

I sense that such an info message will be VERY helpful for debugging! 💯

path, ext = io.splitext(document_set.document_archive)
path_2, ext_2 = io.splitext(path)

document_set.document_archive = "%s-1k%s%s" % (path_2, ext_2, ext)
Contributor

Any reason against using an f-string here and in the lines below?

Member Author

No, this was mainly because I was only moving lines around but did not change the implementation. I'll adapt accordingly.
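
For reference, a minimal sketch of what the f-string variant of the line under review could look like. It uses os.path.splitext from the standard library as a stand-in for Rally's own io.splitext (an assumption about equivalent behavior), and the helper name one_k_archive_name is made up for this example.

```python
import os.path


def one_k_archive_name(document_archive: str) -> str:
    """Insert "-1k" before the last two extensions of an archive name."""
    # Split off the compression suffix first, e.g. ".bz2" ...
    path, ext = os.path.splitext(document_archive)
    # ... then the data format suffix, e.g. ".json".
    path_2, ext_2 = os.path.splitext(path)
    # Same result as "%s-1k%s%s" % (path_2, ext_2, ext), written as an f-string.
    return f"{path_2}-1k{ext_2}{ext}"


name = one_k_archive_name("documents.json.bz2")
```

With the input above, name is "documents-1k.json.bz2".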

from typing import Any, Generator, Mapping


def merge_dicts(d1: Mapping[str, Any], d2: Mapping[str, Any]) -> Generator[Any, None, Any]:
Contributor

An alternative non-recursive implementation that seems to be quite speedy can be found here: https://stackoverflow.com/a/61708681/12047396. However, it won't merge lists.

Member Author

I'd go for the more generic implementation as I don't think performance matters too much here.
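
To make the trade-off in this thread concrete, here is a minimal sketch of a recursive, generator-based merge that also merges lists. It is not the implementation from this PR: the conflict rules (nested mappings are merged, lists are concatenated without duplicates, and for any other conflict the value from d2 wins) are assumptions chosen for illustration.

```python
from typing import Any, Generator, Mapping, Tuple


def merge_dicts(d1: Mapping[str, Any], d2: Mapping[str, Any]) -> Generator[Tuple[str, Any], None, None]:
    """Yield (key, value) pairs of the recursive merge of d1 and d2."""
    # Visit keys of d1 first, then keys that only exist in d2.
    for key in list(d1) + [k for k in d2 if k not in d1]:
        if key in d1 and key in d2:
            v1, v2 = d1[key], d2[key]
            if isinstance(v1, Mapping) and isinstance(v2, Mapping):
                # Recurse into nested mappings.
                yield key, dict(merge_dicts(v1, v2))
            elif isinstance(v1, list) and isinstance(v2, list):
                # Merge lists, keeping order and dropping duplicates.
                yield key, v1 + [item for item in v2 if item not in v1]
            else:
                # Assumption: on any other conflict the second value wins.
                yield key, v2
        elif key in d1:
            yield key, d1[key]
        else:
            yield key, d2[key]


merged = dict(merge_dicts(
    {"settings": {"shards": 1}, "tags": ["a", "b"], "mode": "old"},
    {"settings": {"replicas": 0}, "tags": ["b", "c"], "mode": "new", "extra": 0},
))
```

Because the function is a generator of key/value pairs, callers wrap it in dict(...) to obtain the merged dictionary.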

@danielmitterdorfer
Member Author

Thanks for your review @dliappis. I've addressed all your review comments in 9d752bb. I'll let CI run once more and then merge it.

@danielmitterdorfer
Member Author

@elasticmachine test this please

@danielmitterdorfer danielmitterdorfer merged commit bf95cb8 into elastic:master Dec 17, 2020
@danielmitterdorfer danielmitterdorfer deleted the track-processors branch December 17, 2020 07:00
danielmitterdorfer added a commit to danielmitterdorfer/rally that referenced this pull request Jan 12, 2021
With this commit we drop the `name` argument from the registration
method for track processors. Contrary to runners, schedulers and
parameter sources, there is no need to refer to track processors by name;
this parameter was introduced by mistake.

Relates elastic#1135
danielmitterdorfer added a commit that referenced this pull request Jan 13, 2021
With this commit we drop the `name` argument from the registration
method for track processors. Contrary to runners, schedulers and
parameter sources, there is no need to refer to track processors by name;
this parameter was introduced by mistake.

Relates #1135