Allow to customize track preparation #1135

danielmitterdorfer · 2020-12-16T09:10:13Z

With this commit we introduce track processors and define two phases in the
track lifecycle where they can hook into:

on_after_load_track: This callback is executed after the track has been
successfully loaded and converted into Rally's domain representation. Rally
defines some out-of-the-box track processors for task filters and test mode
support which will be called in any case.
on_prepare_track: This callback is executed before a load generator will
start executing the benchmark. It can be used to download or generate any
datasets that are needed for the benchmark. Rally defines a default behavior
that downloads and extracts all relevant corpora. That behavior can be
overridden via a custom track processor (i.e. Rally's default behavior is then
disabled.).

Like all custom track components, track processors are registered via the
register function and we implement a new registration method on the registry
called register_track_processor for that, which expects an instance that
implements the two methods above.

Note that track processors cannot share state in between these phases. Some
phases might only be invoked on the coordinating node of a benchmark
(on_after_load_track) whereas others will need to be called on each load
generator machine (on_prepare_track).

We introduce a new parameters dictionary on track as well as on challenge
level, where the challenge properties override the defaults defined track level.
These parameters are only exposed on challenge level and track processors are
expected to operate only on that level. Track processors can determine the
current challenge with the new property selected_challenge_or_default.
Mid-term we intend to restrict the current track parameter feature in a way that
all values provided via --track-params on the command line must be declared
within the parameters block, either on track or on challenge level. That's the
reason why we have called this block already parameters although it will only
be used by track processors for now.

We intentionally do not provide user documentation for this feature yet as we
want to gain more experience first before exposing it to a wider user base.

Closes #1066

With this commit we introduce track processors and define two phases in the track lifecycle where they can hook into: * `on_after_load_track`: This callback is executed after the track has been successfully loaded and converted into Rally's domain representation. Rally defines some out-of-the-box track processors for task filters and test mode support which will be called in any case. * `on_prepare_track`: This callback is executed before a load generator will start executing the benchmark. It can be used to download or generate any datasets that are needed for the benchmark. Rally defines a default behavior that downloads and extracts all relevant corpora. That behavior can be overridden via a custom track processor (i.e. Rally's default behavior is then *disabled*.). Like all custom track components, track processors are registered via the `register` function and we implement a new registration method on the registry called `register_track_processor` for that, which expects an instance that implements the two methods above. Note that track processors cannot share state in between these phases. Some phases might only be invoked on the coordinating node of a benchmark (`on_after_load_track`) whereas others will need to be called on each load generator machine (`on_prepare_track`). We introduce a new `parameters` dictionary on track as well as on challenge level, where the challenge properties override the defaults defined track level. These parameters are only exposed on challenge level and track processors are expected to operate only on that level. Track processors can determine the current challenge with the new property `selected_challenge_or_default`. Mid-term we intend to restrict the current track parameter feature in a way that all values provided via `--track-params` on the command line must be declared within the `parameters` block, either on track or on challenge level. That's the reason why we have called this block already `parameters` although it will only be used by track processors for now. We intentionally do not provide user documentation for this feature yet as we want to gain more experience first before exposing it to a wider user base. Closes elastic#1066

dliappis

LGTM! Left a few comments and questions, but the implementation looks solid.

dliappis · 2020-12-16T15:09:18Z

esrally/track/loader.py

+
+        :param track: The current track.
+        """
+        ...


Is there a reason for preferring ellipsis to the usual pass?

I'll change to pass

dliappis · 2020-12-16T15:28:28Z

esrally/track/loader.py

            net.download(data_url, target_path, size_in_bytes, progress_indicator=progress)
            progress.finish()
            self.logger.info("Downloaded data from [%s] to [%s].", data_url, target_path)
        except urllib.error.HTTPError as e:
            if e.code == 404 and self.test_mode:
-                raise exceptions.DataError("Track [%s] does not support test mode. Please ask the track author to add it or "
-                                           "disable test mode and retry." % self.track_name)
+                raise exceptions.DataError("This track does not support test mode. Please ask the track author "


Let's standardize on no please, see https://developers.google.com/style/tone#politeness-and-use-of-please

dliappis · 2020-12-16T15:29:05Z

esrally/track/loader.py

-            raise exceptions.SystemSetupError(
-                "Cannot download from %s to %s. Please verify that data are available at %s and "
-                "check your Internet connection." % (data_url, target_path, data_url))
+            raise exceptions.SystemSetupError(f"Could not download [{data_url}] to [{target_path}]. Please verify data "


Same comment like earlier, we can start removing please as per the usual style guides e.g. https://developers.google.com/style/tone#politeness-and-use-of-please

dliappis · 2020-12-16T15:31:46Z

esrally/track/loader.py

+                        if self._filter_out_match(leaf_task):
+                            leafs_to_remove.append(leaf_task)
+                    for leaf_task in leafs_to_remove:
+                        self.logger.info("Removing sub-task [%s] from challenge [%s] due to task filter.",


I sense such info message will be VERY helpful for debugging! 💯

dliappis · 2020-12-16T15:38:57Z

esrally/track/loader.py

+                        path, ext = io.splitext(document_set.document_archive)
+                        path_2, ext_2 = io.splitext(path)
+
+                        document_set.document_archive = "%s-1k%s%s" % (path_2, ext_2, ext)


Any reason against using an f-string here and lines below?

No, this was mainly because I was only moving lines around but did not change the implementation. I'll adapt accordingly.

dliappis · 2020-12-16T16:00:09Z

esrally/utils/collections.py

+from typing import Any, Generator, Mapping
+
+
+def merge_dicts(d1: Mapping[str, Any], d2: Mapping[str, Any]) -> Generator[Any, None, Any]:


An alternative non-recursive implementation that seems to be quite speedy can be found here: https://stackoverflow.com/a/61708681/12047396, however, it won't merge lists.

I'd go for the more generic implementation as I don't think performance matters too much here.

danielmitterdorfer · 2020-12-16T16:28:32Z

Thanks for your review @dliappis. I've addressed all your review comments in 9d752bb. I let CI run once more and will merge it then.

danielmitterdorfer · 2020-12-17T06:11:45Z

@elasticmachine test this please

With this commit we drop the `name` argument from the registration method for track processors. Contrary to runners, schedulers and parameter sources there is no need to refer to track processors by name and this parameter has been introduced by mistake. Relates elastic#1135

With this commit we drop the `name` argument from the registration method for track processors. Contrary to runners, schedulers and parameter sources there is no need to refer to track processors by name and this parameter has been introduced by mistake. Relates #1135

danielmitterdorfer added enhancement Improves the status quo :Track Management New operations, changes in the track format, track download changes and the like labels Dec 16, 2020

danielmitterdorfer added this to the 2.0.3 milestone Dec 16, 2020

danielmitterdorfer requested a review from dliappis December 16, 2020 09:10

danielmitterdorfer self-assigned this Dec 16, 2020

dliappis approved these changes Dec 16, 2020

View reviewed changes

Address review comments

9d752bb

danielmitterdorfer merged commit bf95cb8 into elastic:master Dec 17, 2020

danielmitterdorfer deleted the track-processors branch December 17, 2020 07:00

danielmitterdorfer mentioned this pull request Dec 21, 2020

Support ability to generate track config identifier #1123

Closed

danielmitterdorfer mentioned this pull request Jan 12, 2021

Don't require a name to register a track processor #1147

Merged

danielmitterdorfer mentioned this pull request Feb 3, 2021

Natively support parallelized track preparation #1166

Closed

dliappis mentioned this pull request Feb 19, 2021

Add grok and dissect to runtime fields in http_logs elastic/rally-tracks#156

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow to customize track preparation #1135

Allow to customize track preparation #1135

danielmitterdorfer commented Dec 16, 2020

dliappis left a comment

dliappis Dec 16, 2020

danielmitterdorfer Dec 16, 2020

dliappis Dec 16, 2020

dliappis Dec 16, 2020

dliappis Dec 16, 2020

dliappis Dec 16, 2020

danielmitterdorfer Dec 16, 2020

dliappis Dec 16, 2020

danielmitterdorfer Dec 16, 2020

danielmitterdorfer commented Dec 16, 2020

danielmitterdorfer commented Dec 17, 2020

		from typing import Any, Generator, Mapping


		def merge_dicts(d1: Mapping[str, Any], d2: Mapping[str, Any]) -> Generator[Any, None, Any]:

Allow to customize track preparation #1135

Allow to customize track preparation #1135

Conversation

danielmitterdorfer commented Dec 16, 2020

dliappis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

danielmitterdorfer commented Dec 16, 2020

danielmitterdorfer commented Dec 17, 2020