feat: backport dataset_tools to coffea 0.7.x #1036

Merged: 5 commits merged into CoffeaTeam:backports-v0.7.x from backport-dataset-tools on Feb 20, 2024

Conversation

ikrommyd (Collaborator)

This PR aims to backport @valsdav's dataset tools to coffea 0.7.x.
The preprocessing part has to be removed since that is tied to coffea 202x.

ikrommyd marked this pull request as draft on February 13, 2024 at 00:42
ikrommyd (Collaborator, Author) commented Feb 13, 2024

I played around with querying from this branch and it seems to be working fine.
@valsdav can you please review and also check whether I've removed the preprocessing part of the dataset tools properly? Below is the diff of dataset_query.py between master and this branch.

diff --git a/src/coffea/dataset_tools/dataset_query.py b/coffea/dataset_tools/dataset_query.py
index 905daa12..0f9f90a8 100644
--- a/src/coffea/dataset_tools/dataset_query.py
+++ b/coffea/dataset_tools/dataset_query.py
@@ -7,7 +7,6 @@ from collections import defaultdict
 from typing import List
 
 import yaml
-from dask.distributed import Client
 from rich import print
 from rich.console import Console
 from rich.prompt import Confirm, IntPrompt, Prompt
@@ -15,7 +14,6 @@ from rich.table import Table
 from rich.tree import Tree
 
 from . import rucio_utils
-from .preprocess import preprocess
 
 
 def print_dataset_query(query, dataset_list, console, selected=[]):
@@ -101,8 +99,6 @@ class DataDiscoveryCLI:
         self.sites_regex = None
         self.last_replicas_results = None
         self.final_output = None
-        self.preprocessed_total = None
-        self.preprocessed_available = None
 
         self.replica_results = defaultdict(list)
         self.replica_results_metadata = {}
@@ -118,7 +114,6 @@ class DataDiscoveryCLI:
             "replicas",
             "list-replicas",
             "save",
-            "preprocess",
             "allow-sites",
             "block-sites",
             "regex-sites",
@@ -145,7 +140,6 @@ Some basic commands:
   - [bold cyan]block-sites[/]: Exclude grid sites from the available sites for replicas query
   - [bold cyan]regex-sites[/]: Select sites with a regex for replica queries: e.g.  "T[123]_(FR|IT|BE|CH|DE)_\w+"
   - [bold cyan]save[/]: Save the replicas query results to file (json or yaml) for further processing
-  - [bold cyan]preprocess[/]: Preprocess the replicas with dask and save the fileset for further processing with uproot/coffea
   - [bold cyan]help[/]: Print this help message
             """
                 )
@@ -168,8 +162,6 @@ Some basic commands:
                 self.do_list_replicas()
             elif command == "save":
                 self.do_save()
-            elif command == "preprocess":
-                self.do_preprocess()
             elif command == "allow-sites":
                 self.do_allowlist_sites()
             elif command == "block-sites":
@@ -520,46 +512,6 @@ Some basic commands:
                 json.dump(self.final_output, file, indent=2)
         print(f"[green]File {filename} saved!")
 
-    def do_preprocess(
-        self,
-        output_file=None,
-        step_size=None,
-        align_to_clusters=None,
-        scheduler_url=None,
-    ):
-        """Perform preprocessing for concrete fileset extraction.
-        Args:  output_file [step_size] [align to file cluster boundaries] [dask scheduler url]
-        """
-        if not output_file:
-            output_file = Prompt.ask(
-                "[yellow bold]Output name", default="output_preprocessing"
-            )
-        if step_size is None:
-            step_size = IntPrompt.ask("[yellow bold]Step size", default=None)
-        if align_to_clusters is None:
-            align_to_clusters = Confirm.ask(
-                "[yellow bold]Align to clusters", default=True
-            )
-
-        # init a local Dask cluster
-        with self.console.status(
-            "[red] Preprocessing files to extract available chunks with dask[/]"
-        ):
-            with Client(scheduler_url) as _:
-                self.preprocessed_available, self.preprocessed_total = preprocess(
-                    self.final_output,
-                    step_size=step_size,
-                    align_clusters=align_to_clusters,
-                    skip_bad_files=True,
-                )
-        with gzip.open(f"{output_file}_available.json.gz", "wt") as file:
-            print(f"Saved available fileset chunks to {output_file}_available.json.gz")
-            json.dump(self.preprocessed_total, file, indent=2)
-        with gzip.open(f"{output_file}_all.json.gz", "wt") as file:
-            print(f"Saved all fileset chunks to {output_file}_all.json.gz")
-            json.dump(self.preprocessed_available, file, indent=2)
-        return self.preprocessed_total, self.preprocessed_available
-
     def load_dataset_definition(
         self,
         dataset_definition,
@@ -631,27 +583,6 @@ if __name__ == "__main__":
         required=False,
         default="output_dataset.json",
     )
-    parser.add_argument(
-        "-fo",
-        "--fileset-output",
-        help="Output name for fileset",
-        type=str,
-        required=False,
-        default="output_fileset",
-    )
-    parser.add_argument(
-        "-p",
-        "--preprocess",
-        help="Preprocess with dask",
-        action="store_true",
-        default=False,
-    )
-    parser.add_argument(
-        "--step-size", help="Step size for preprocessing", type=int, default=500000
-    )
-    parser.add_argument(
-        "--scheduler-url", help="Dask scheduler url", type=str, default=None
-    )
     parser.add_argument(
         "-as",
         "--allow-sites",
@@ -706,13 +637,6 @@ if __name__ == "__main__":
         # Save
         if args.output:
             cli.do_save(filename=args.output)
-        if args.preprocess:
-            cli.do_preprocess(
-                output_file=args.fileset_output,
-                step_size=args.step_size,
-                scheduler_url=args.scheduler_url,
-                align_to_clusters=False,
-            )
 
     if args.cli:
         cli.start_cli()

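For context, here is a minimal sketch (added for illustration, not part of the PR) of driving the backported query tools non-interactively. It assumes the public methods visible in the diff above (do_query, do_replicas, do_save, start_cli) behave as they do on master; exact signatures on the 0.7.x branch may differ.

from coffea.dataset_tools.dataset_query import DataDiscoveryCLI

ddc = DataDiscoveryCLI()

# Hypothetical DAS-style dataset pattern; any query string rucio_utils understands works here.
ddc.do_query("/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/*NanoAODv9*/NANOAODSIM")

# Resolve grid replicas for the queried datasets; with no arguments this falls back
# to the same interactive prompts as the shell commands listed in the diff.
ddc.do_replicas()

# Write the resulting fileset to disk (the same call the __main__ block above uses).
ddc.do_save(filename="output_dataset.json")

# Or drop into the interactive prompt instead:
# ddc.start_cli()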
ikrommyd marked this pull request as ready for review on February 13, 2024 at 04:00
ikrommyd (Collaborator, Author)

I also set the version to 0.7.23 (let me know if you don't like that) and added rucio optional dependencies to setup.py. pyyaml has to be added there on this branch since dask isn't mandatory in 0.7.x; on master, the dask installation takes care of that.
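As a rough illustration only (the authoritative package list is whatever this PR actually adds to setup.py, not this snippet), the change described above could look roughly like:

from setuptools import setup

setup(
    name="coffea",
    version="0.7.23",
    # ... the rest of the existing coffea 0.7.x metadata and install_requires,
    # with pyyaml listed explicitly since dask (which pulled it in on master) is optional here ...
    extras_require={
        "rucio": ["rucio-clients"],  # assumed package name for the optional rucio extra; see the PR's setup.py for the real list
    },
)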

lgray (Collaborator) commented Feb 13, 2024

I'll take a look at it later this week. Thanks!

ikrommyd (Collaborator, Author)

The dataset tools now output filesets where "files" are of the form {file: "Events"}. Is that correct for 0.7.x, or do we need a list of file paths? I wasn't around before coffea 202x, so I'm not sure.

valsdav (Contributor) commented Feb 19, 2024

Thanks @iasonkrom, I think your changes are fine. I think we can remove {'file': "Events"}; in 0.7 the fileset is usually defined like this: https://github.com/PocketCoffea/AnalysisConfigs/blob/main/configs/ttHbb/semileptonic/semileptonic_triggerSF/datasets/backgrounds_MC_ttbar.json

ikrommyd (Collaborator, Author)

Support for reading {'file': "Events"} was added in uproot4, I believe (@jpivarski), so it should be fine to read, unless we usually specify the tree name in another way in 0.7. It can be removed by changing just one line, so let me know.

cc @lgray

lgray (Collaborator) commented Feb 19, 2024

The only issue is that the "Events" tree specification will be overridden by the treename in NanoEvents, which is confusing. We should reduce this to just a list of files.
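To make the two shapes concrete, here is an illustrative sketch (dataset names and paths are placeholders, not taken from the PR) of the fileset as first emitted by the backport versus the plain list-of-files form agreed on here for 0.7.x, where the tree name is supplied separately at run time:

# Fileset as first emitted by the backport: the tree name is attached to each file,
# mirroring the master/202x layout discussed above.
fileset_with_treenames = {
    "TTToSemiLeptonic": {
        "files": {
            "root://xrootd.example.org//store/placeholder/nano_1.root": "Events",
            "root://xrootd.example.org//store/placeholder/nano_2.root": "Events",
        },
    },
}

# Reduced to a plain list of paths, as agreed for 0.7.x; the tree name is then
# passed where the files are actually opened, e.g. (assumed 0.7-style usage):
#   run = processor.Runner(executor=..., schema=NanoAODSchema)
#   out = run(fileset, treename="Events", processor_instance=MyProcessor())
fileset = {
    "TTToSemiLeptonic": [
        "root://xrootd.example.org//store/placeholder/nano_1.root",
        "root://xrootd.example.org//store/placeholder/nano_2.root",
    ],
}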

ikrommyd (Collaborator, Author)

Done. Should we also add the binder tutorial to 0.7 and just remove the preprocessing shell? Other than that, this should be ready.

lgray (Collaborator) commented Feb 19, 2024

Yeah, add another notebook for the docs if you have time; that would be great!

ikrommyd (Collaborator, Author)

Done. Should be ready to review!

lgray merged commit a3726cc into CoffeaTeam:backports-v0.7.x on Feb 20, 2024 (1 check passed)
ikrommyd deleted the backport-dataset-tools branch on February 20, 2024 at 06:20