Questions about migrating analysis code to coffea 2023 #972

kmohrman · 2023-12-18T21:04:23Z

kmohrman
Dec 18, 2023

I'm attempting to migrate this analysis code to coffea 2023. Just for future reference, here is a link to the repo at the specific current commit (so that we have a "pre coffea 2023" version to compare against). Thank you to @btovar for pointing me to @cmoore24-24's coffea 2023 processor here. This, along with the references from @lgray (here and here), is what I'm trying to base this migration on.

Ok, so in the current version of my code, this is where the processor was run. Here are the relevant lines for the iterative executor:

    exec_instance = processor.IterativeExecutor()
    runner = processor.Runner(exec_instance, schema=NanoAODSchema, chunksize=chunksize, maxchunks=nchunks)
    output = runner(flist, treename, processor_instance)

When I naively just attempt to run with coffea 2023, the first thing that breaks is that block (with the error AttributeError: module 'coffea.processor' has no attribute 'IterativeExecutor'). So it seems like this block is a good thing to focus on first. Attempting to rewrite this block for coffea 2023, I think I need to do something like:

    events = something
    histos_to_compute = processor_instance.process(events)
    output = dask.compute(histos_to_compute)

(Though of course eventually will have to change things in the process function too, but I wanted to just focus on trying to call it the right way first.) If that's the right direction, then I think what I am stuck on next is events. It is not clear to me what this is supposed to correspond to. In the pre-coffea2023 version, we never directly call the processor's process function ourselves, but I think its events argument just corresponds to the nanoevents object for just the events in the particular chunk.

In @lgray's announcement of coffea 2023 (on mattermost here) it was mentioned that the arrays correspond to an entire dataset, not a chunk of a dataset. So I think my questions for now would be:

- Does this mean the events argument should correspond to all of the events in just one dataset? Or would it correspond to all of the events that we are processing in total?
- If it corresponds to all events, then I am not sure I understand how the metadata for each dataset is handled (unless we can pass e.g. a dictionary instead of just events?)
- If it corresponds to just the events in a dataset, then I am not sure I understand the best way to process all of the datasets in parallel (maybe just looping over the datasets? Assuming that multiple dask.compute commands can be running simultaneously?)

Sorry for the long message, and thank you in advance for any help or tips!

lgray · 2023-12-18T21:14:40Z

lgray
Dec 18, 2023
Maintainer

To answer the bulleted questions:

- events will be whatever size you configure it to be, but it can be multiple files of multiple chunks of data.
- Metadata is more or less handled entirely client side now; the metadata is just a dict of objects that can, more or less, be pickled.
- For each dataset you pass through processor_instance.process you'll get some set of dask collections back that represent the computing work to be done once you ask for .compute() or dask.compute(...), the latter of which can accept multiple arguments. So you could do something like dask.compute({"dataset1": stuff_to_compute, ...}) and then you'll get a result for each dataset.

On the last one, once you get the histograms back you can manipulate and sum them.
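
As a rough sketch of that last step, assuming the per-dataset results have already been computed: dask.compute returns a tuple with one entry per positional argument, so passing a single dict gives back a one-element tuple holding a dict of results. The snippet below uses plain collections.Counter objects as stand-ins for filled histograms; the dataset names, bin labels, counts, and the `computed` tuple itself are all made up for illustration.

```python
from collections import Counter

# Stand-in for the return value of dask.compute({"dataset1": ..., ...}):
# a one-element tuple holding a dict with one result per dataset.
# Counters play the role of histograms (bin label -> count); values are made up.
computed = (
    {
        "dataset1": Counter({"0-50": 3, "50-100": 1}),
        "dataset2": Counter({"0-50": 2, "100-150": 4}),
    },
)

# Unpack the one-element tuple to get the per-dataset dict.
(per_dataset,) = computed

# Sum the per-dataset histograms into one combined histogram.
total = Counter()
for name, histo in per_dataset.items():
    total += histo

print(dict(total))
```

The same pattern applies if you only want to sum a subset of datasets (e.g. all datasets belonging to one process): filter `per_dataset` by name before the loop.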

You can get an idea of the multi-dataset way to do things here, sorry it's just the tests for now. I'll make something more expository after the break:
https://github.com/CoffeaTeam/coffea/blob/master/tests/test_dataset_tools.py#L171

Which uses some of the functions we've put together for handling multiple datasets, test runs, recovering the set of failed files and re-running on that.

11 replies

lgray Dec 19, 2023
Maintainer

OK - a bit of profiling makes things clear. The majority of time is spent opening only the first file in each dataset and fetching the TTree metadata. This was hidden in the past since opening the files was done once per chunk on the remote worker, but since the dask task-graph has to be generated on the client side we must open the file and generate the metadata.

Still I'm surprised that's so slow. We can figure out a way to make it faster for sure, either by caching the read schemas or doing the parsing in parallel.
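
Both ideas can be sketched in plain Python, independent of uproot. In the snippet below `read_tree_metadata` is a hypothetical stand-in for the slow "open the first file and parse the TTree metadata" step (it just sleeps); it is not an uproot or coffea function, and the dataset names and file paths are invented. A thread pool overlaps the waits across datasets, and functools.lru_cache avoids re-reading a file that has already been parsed. Note that threads only help with time spent waiting on I/O (e.g. remote reads); the Python-level parsing that shows up in the profile is CPU-bound and would likely need separate processes instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=None)
def read_tree_metadata(path):
    """Hypothetical stand-in for opening a file and parsing its TTree metadata."""
    time.sleep(0.05)  # pretend this is slow, I/O-bound work
    return ("Muon_pt", "Electron_pt")  # made-up branch names


# Made-up mapping of dataset name -> first file of that dataset.
first_files = {
    f"dataset{i}": f"root://example.invalid//store/file{i}.root" for i in range(8)
}

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # Submit one metadata read per dataset; the sleeps overlap across threads.
    futures = {
        name: pool.submit(read_tree_metadata, path)
        for name, path in first_files.items()
    }
    metadata = {name: fut.result() for name, fut in futures.items()}
parallel_time = time.perf_counter() - start

# A second request for the same file is served from the cache.
start = time.perf_counter()
read_tree_metadata(first_files["dataset0"])
cached_time = time.perf_counter() - start

print(f"parallel: {parallel_time:.2f}s, cached lookup: {cached_time:.4f}s")
```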

How many datasets have you defined? 6 minutes seems like quite a lot: going by the profile below, that would correspond to ~300 datasets. Are you reading over xrootd?

Here's the profile for posterity (this is opening the same file 20 times):

15.527 <module>  jrueb_repro.py:1
├─ 14.825 events  coffea/nanoevents/factory.py:667
│  └─ 14.825 dask  uproot/_dask.py:28
│     └─ 14.824 _get_dak_array_delay_open  uproot/_dask.py:1470
│        ├─ 10.025 regularize_object_path  uproot/_util.py:949
│        │  └─ 10.007 __getitem__  uproot/reading.py:2040
│        │     └─ 10.007 get  uproot/reading.py:2427
│        │        └─ 9.972 read  uproot/model.py:1273
│        │           └─ 9.972 read  uproot/model.py:752
│        │              └─ 9.911 read_members  uproot/models/TTree.py:687
│        │                 └─ 9.911 read  uproot/model.py:752
│        │                    └─ 9.910 read_members  uproot/models/TObjArray.py:30
│        │                       └─ 9.878 read_object_any  uproot/deserialization.py:189
│        │                          └─ 9.621 read  uproot/model.py:1273
│        │                             └─ 9.466 read  uproot/model.py:752
│        │                                ├─ 8.852 read_members  uproot/models/TBranch.py:436
│        │                                │  ├─ 7.506 read  uproot/model.py:752
│        │                                │  │  ├─ 4.452 read_members  uproot/models/TObjArray.py:30
│        │                                │  │  │  ├─ 3.053 read_object_any  uproot/deserialization.py:189
│        │                                │  │  │  │  └─ 2.883 read  uproot/model.py:1273
│        │                                │  │  │  │     └─ 2.760 read  uproot/model.py:752
│        │                                │  │  │  │        ├─ 1.668 read_members  uproot/models/TLeaf.py:94
│        │                                │  │  │  │        │  └─ 1.541 read  uproot/model.py:752
│        │                                │  │  │  │        │     └─ 1.068 read_members  uproot/models/TLeaf.py:27
│        │                                │  │  │  │        │        ├─ 0.801 read  uproot/model.py:752
│        │                                │  │  │  │        │        │  └─ 0.614 read_members  uproot/models/TNamed.py:18
│        │                                │  │  │  │        │        │     └─ 0.429 read  uproot/model.py:752
│        │                                │  │  │  │        │        │        └─ 0.166 [self]  
│        │                                │  │  │  │        │        └─ 0.167 class_named  uproot/reading.py:1083
│        │                                │  │  │  │        └─ 0.363 read_members  uproot/models/TLeaf.py:436
│        │                                │  │  │  │           └─ 0.309 read  uproot/model.py:752
│        │                                │  │  │  │              └─ 0.239 read_members  uproot/models/TLeaf.py:27
│        │                                │  │  │  │                 └─ 0.179 read  uproot/model.py:752
│        │                                │  │  │  ├─ 0.951 read  uproot/model.py:752
│        │                                │  │  │  │  ├─ 0.431 [self]  
│        │                                │  │  │  │  ├─ 0.276 read_members  uproot/models/TObject.py:26
│        │                                │  │  │  │  └─ 0.163 check_numbytes  uproot/model.py:913
│        │                                │  │  │  └─ 0.240 string  uproot/source/cursor.py:394
│        │                                │  │  │     └─ 0.226 bytestring  uproot/source/cursor.py:361
│        │                                │  │  ├─ 0.878 read_members  uproot/models/TNamed.py:18
│        │                                │  │  │  ├─ 0.593 read  uproot/model.py:752
│        │                                │  │  │  │  └─ 0.208 copy  uproot/source/cursor.py:109
│        │                                │  │  │  └─ 0.233 string  uproot/source/cursor.py:394
│        │                                │  │  │     └─ 0.215 bytestring  uproot/source/cursor.py:361
│        │                                │  │  ├─ 0.599 check_numbytes  uproot/model.py:913
│        │                                │  │  │  └─ 0.466 classname  uproot/model.py:403
│        │                                │  │  │     └─ 0.415 classname_decode  uproot/model.py:171
│        │                                │  │  ├─ 0.565 read_numbytes_version  uproot/model.py:872
│        │                                │  │  │  └─ 0.498 numbytes_version  uproot/deserialization.py:103
│        │                                │  │  │     ├─ 0.275 [self]  
│        │                                │  │  │     └─ 0.212 fields  uproot/source/cursor.py:175
│        │                                │  │  ├─ 0.539 [self]  
│        │                                │  │  └─ 0.202 copy  uproot/source/cursor.py:109
│        │                                │  │     └─ 0.164 __init__  uproot/source/cursor.py:48
│        │                                │  │        └─ 0.161 [self]  
│        │                                │  ├─ 0.651 class_named  uproot/reading.py:1083
│        │                                │  │  └─ 0.532 classname_regularize  uproot/model.py:147
│        │                                │  │     └─ 0.481 sub  re.py:203
│        │                                │  ├─ 0.196 [self]  
│        │                                │  └─ 0.185 array  uproot/source/cursor.py:330
│        │                                └─ 0.201 [self]  
│        ├─ 1.876 keys  uproot/behaviors/TBranch.py:1138
│        │  └─ 1.873 iterkeys  uproot/behaviors/TBranch.py:1291
│        │     └─ 1.866 iteritems  uproot/behaviors/TBranch.py:1365
│        │        └─ 1.752 _remove_not_interpretable  coffea/nanoevents/factory.py:29
│        │           ├─ 0.756 interpretation  uproot/behaviors/TBranch.py:1910
│        │           │  └─ 0.740 interpretation_of  uproot/interpretation/identify.py:298
│        │           │     ├─ 0.222 classname  uproot/model.py:403
│        │           │     │  └─ 0.193 classname_decode  uproot/model.py:171
│        │           │     ├─ 0.178 _from_leaves  uproot/interpretation/identify.py:131
│        │           │     └─ 0.161 _leaf_to_dtype  uproot/interpretation/identify.py:64
│        │           ├─ 0.570 awkward_form  uproot/interpretation/numerical.py:260
│        │           │  ├─ 0.283 awkward  uproot/extras.py:19
│        │           │  │  └─ 0.263 parse_version  uproot/_util.py:110
│        │           │  │     └─ 0.255 parse  packaging/version.py:45
│        │           │  │        └─ 0.246 __init__  packaging/version.py:186
│        │           │  └─ 0.262 awkward_form  uproot/_util.py:510
│        │           │     └─ 0.243 awkward  uproot/extras.py:19
│        │           │        └─ 0.228 parse_version  uproot/_util.py:110
│        │           │           └─ 0.225 parse  packaging/version.py:45
│        │           │              └─ 0.217 __init__  packaging/version.py:186
│        │           └─ 0.397 awkward_form  uproot/interpretation/jagged.py:105
│        │              └─ 0.287 awkward_form  uproot/_util.py:510
│        │                 └─ 0.181 awkward_form  uproot/interpretation/numerical.py:260
│        ├─ 1.174 _get_ttree_form  uproot/_dask.py:1263
│        │  ├─ 0.564 awkward_form  uproot/interpretation/numerical.py:260
│        │  │  ├─ 0.274 awkward_form  uproot/_util.py:510
│        │  │  │  └─ 0.257 awkward  uproot/extras.py:19
│        │  │  │     └─ 0.236 parse_version  uproot/_util.py:110
│        │  │  │        └─ 0.223 parse  packaging/version.py:45
│        │  │  │           └─ 0.217 __init__  packaging/version.py:186
│        │  │  └─ 0.272 awkward  uproot/extras.py:19
│        │  │     └─ 0.260 parse_version  uproot/_util.py:110
│        │  │        └─ 0.252 parse  packaging/version.py:45
│        │  │           └─ 0.245 __init__  packaging/version.py:186
│        │  └─ 0.396 awkward_form  uproot/interpretation/jagged.py:105
│        │     └─ 0.292 awkward_form  uproot/_util.py:510
│        │        └─ 0.198 awkward_form  uproot/interpretation/numerical.py:260
│        ├─ 1.134 from_map  dask_awkward/lib/io/io.py:504
│        │  └─ 1.016 mock  uproot/_dask.py:955
│        │     └─ 1.016 typetracer_from_form  awkward/typetracer.py:212
│        │        ├─ 0.692 to_typetracer  awkward/contents/content.py:238
│        │        │  └─ 0.692 _to_typetracer  awkward/contents/recordarray.py:359
│        │        │     └─ 0.689 <listcomp>  awkward/contents/recordarray.py:361
│        │        │        ├─ 0.532 _to_typetracer  awkward/contents/recordarray.py:359
│        │        │        │  └─ 0.522 <listcomp>  awkward/contents/recordarray.py:361
│        │        │        │     └─ 0.516 _to_typetracer  awkward/contents/numpyarray.py:221
│        │        │        │        └─ 0.308 _raw  awkward/contents/numpyarray.py:196
│        │        │        │           └─ 0.303 to_nplike  awkward/_nplikes/__init__.py:17
│        │        │        │              └─ 0.275 asarray  awkward/_nplikes/typetracer.py:651
│        │        │        │                 └─ 0.244 _new  awkward/_nplikes/typetracer.py:176
│        │        │        │                    └─ 0.192 [self]  
│        │        │        └─ 0.155 _to_typetracer  awkward/contents/listoffsetarray.py:221
│        │        └─ 0.324 length_zero_array  awkward/forms/form.py:494
│        │           └─ 0.324 _impl  awkward/operations/ak_from_buffers.py:117
│        │              └─ 0.324 _reconstitute  awkward/operations/ak_from_buffers.py:187
│        │                 └─ 0.323 <listcomp>  awkward/operations/ak_from_buffers.py:403
│        │                    └─ 0.322 _reconstitute  awkward/operations/ak_from_buffers.py:187
│        │                       └─ 0.205 <listcomp>  awkward/operations/ak_from_buffers.py:403
│        │                          └─ 0.197 _reconstitute  awkward/operations/ak_from_buffers.py:187
│        └─ 0.613 __call__  coffea/nanoevents/factory.py:128
│           ├─ 0.270 __init__  coffea/nanoevents/schemas/nanoaod.py:165
│           │  └─ 0.270 _build_collections  coffea/nanoevents/schemas/nanoaod.py:201
│           └─ 0.176 _lazify_form  coffea/nanoevents/mapping/uproot.py:30
└─ 0.702 from_root  coffea/nanoevents/factory.py:232
   └─ 0.701 behavior  coffea/nanoevents/schemas/nanoaod.py:324
      └─ 0.700 <module>  coffea/nanoevents/methods/nanoaod.py:1
         └─ 0.656 <module>  coffea/nanoevents/methods/candidate.py:1
            └─ 0.656 <module>  coffea/nanoevents/methods/vector.py:1
               └─ 0.655 wrap  numba/np/ufunc/decorators.py:128
                  └─ 0.613 add  numba/np/ufunc/dufunc.py:170
                     └─ 0.613 _compile_for_argtys  numba/np/ufunc/dufunc.py:223
                        └─ 0.529 _compile_element_wise_function  numba/np/ufunc/ufuncbuilder.py:173
                           └─ 0.529 compile  numba/np/ufunc/ufuncbuilder.py:107
                              └─ 0.529 _compile_core  numba/np/ufunc/ufuncbuilder.py:126
                                 └─ 0.529 compile_extra  numba/core/compiler.py:744
                                    ├─ 0.372 __init__  numba/core/compiler.py:413
                                    │  └─ 0.372 refresh  numba/core/typing/context.py:153
                                    │     └─ 0.360 _load_builtins  numba/core/typing/context.py:415
                                    │        └─ 0.357 install_registry  numba/core/typing/context.py:428
                                    │           └─ 0.357 __init__  numba/core/typing/templates.py:1047
                                    │              └─ 0.357 _init_once  numba/core/typing/templates.py:1093
                                    │                 └─ 0.357 _get_target_registry  numba/core/typing/templates.py:900
                                    │                    └─ 0.357 refresh  numba/core/base.py:261
                                    │                       └─ 0.262 load_additional_registries  numba/core/cpu.py:60
                                    └─ 0.157 compile_extra  numba/core/compiler.py:455
                                       └─ 0.157 _compile_bytecode  numba/core/compiler.py:524
                                          └─ 0.157 _compile_core  numba/core/compiler.py:478

lgray Dec 19, 2023
Maintainer

Here's the profile just using uproot.dask directly. I'll talk to Jim and company to see what we can do here.

13.119 <module>  uproot_dask_open.py:1
└─ 13.116 dask  uproot/_dask.py:28
   └─ 13.115 _get_dak_array_delay_open  uproot/_dask.py:1470
      ├─ 9.107 regularize_object_path  uproot/_util.py:949
      │  └─ 9.096 __getitem__  uproot/reading.py:2040
      │     └─ 9.096 get  uproot/reading.py:2427
      │        └─ 9.064 read  uproot/model.py:1273
      │           └─ 9.064 read  uproot/model.py:752
      │              └─ 9.005 read_members  uproot/models/TTree.py:687
      │                 └─ 9.005 read  uproot/model.py:752
      │                    └─ 9.005 read_members  uproot/models/TObjArray.py:30
      │                       └─ 8.987 read_object_any  uproot/deserialization.py:189
      │                          ├─ 8.720 read  uproot/model.py:1273
      │                          │  └─ 8.582 read  uproot/model.py:752
      │                          │     └─ 8.079 read_members  uproot/models/TBranch.py:436
      │                          │        ├─ 6.840 read  uproot/model.py:752
      │                          │        │  ├─ 3.969 read_members  uproot/models/TObjArray.py:30
      │                          │        │  │  ├─ 2.699 read_object_any  uproot/deserialization.py:189
      │                          │        │  │  │  └─ 2.531 read  uproot/model.py:1273
      │                          │        │  │  │     └─ 2.400 read  uproot/model.py:752
      │                          │        │  │  │        ├─ 1.314 read_members  uproot/models/TLeaf.py:94
      │                          │        │  │  │        │  ├─ 1.110 read  uproot/model.py:752
      │                          │        │  │  │        │  │  └─ 0.874 read_members  uproot/models/TLeaf.py:27
      │                          │        │  │  │        │  │     └─ 0.635 read  uproot/model.py:752
      │                          │        │  │  │        │  │        └─ 0.453 read_members  uproot/models/TNamed.py:18
      │                          │        │  │  │        │  │           ├─ 0.274 read  uproot/model.py:752
      │                          │        │  │  │        │  │           └─ 0.152 string  uproot/source/cursor.py:394
      │                          │        │  │  │        │  │              └─ 0.136 bytestring  uproot/source/cursor.py:361
      │                          │        │  │  │        │  └─ 0.139 class_named  uproot/reading.py:1083
      │                          │        │  │  │        ├─ 0.396 read_members  uproot/models/TLeaf.py:436
      │                          │        │  │  │        │  └─ 0.348 read  uproot/model.py:752
      │                          │        │  │  │        │     └─ 0.266 read_members  uproot/models/TLeaf.py:27
      │                          │        │  │  │        │        └─ 0.220 read  uproot/model.py:752
      │                          │        │  │  │        ├─ 0.164 read_members  uproot/models/TLeaf.py:271
      │                          │        │  │  │        │  └─ 0.146 read  uproot/model.py:752
      │                          │        │  │  │        └─ 0.158 [self]  
      │                          │        │  │  ├─ 0.804 read  uproot/model.py:752
      │                          │        │  │  │  ├─ 0.286 read_members  uproot/models/TObject.py:26
      │                          │        │  │  │  ├─ 0.151 copy  uproot/source/cursor.py:109
      │                          │        │  │  │  ├─ 0.145 check_numbytes  uproot/model.py:913
      │                          │        │  │  │  └─ 0.135 [self]  
      │                          │        │  │  └─ 0.252 string  uproot/source/cursor.py:394
      │                          │        │  │     └─ 0.233 bytestring  uproot/source/cursor.py:361
      │                          │        │  ├─ 0.796 read_members  uproot/models/TNamed.py:18
      │                          │        │  │  ├─ 0.499 read  uproot/model.py:752
      │                          │        │  │  │  ├─ 0.150 read_members  uproot/models/TObject.py:26
      │                          │        │  │  │  └─ 0.136 copy  uproot/source/cursor.py:109
      │                          │        │  │  └─ 0.246 string  uproot/source/cursor.py:394
      │                          │        │  │     └─ 0.228 bytestring  uproot/source/cursor.py:361
      │                          │        │  ├─ 0.524 read_numbytes_version  uproot/model.py:872
      │                          │        │  │  └─ 0.457 numbytes_version  uproot/deserialization.py:103
      │                          │        │  │     ├─ 0.240 [self]  
      │                          │        │  │     └─ 0.193 fields  uproot/source/cursor.py:175
      │                          │        │  │        └─ 0.135 get  uproot/source/chunk.py:402
      │                          │        │  ├─ 0.515 [self]  
      │                          │        │  ├─ 0.509 check_numbytes  uproot/model.py:913
      │                          │        │  │  └─ 0.351 classname  uproot/model.py:403
      │                          │        │  │     └─ 0.315 classname_decode  uproot/model.py:171
      │                          │        │  └─ 0.211 copy  uproot/source/cursor.py:109
      │                          │        │     └─ 0.167 __init__  uproot/source/cursor.py:48
      │                          │        ├─ 0.596 class_named  uproot/reading.py:1083
      │                          │        │  └─ 0.483 classname_regularize  uproot/model.py:147
      │                          │        │     └─ 0.421 sub  re.py:203
      │                          │        │        └─ 0.177 _compile  re.py:289
      │                          │        │           └─ 0.167 [self]  
      │                          │        ├─ 0.204 [self]  
      │                          │        └─ 0.150 array  uproot/source/cursor.py:330
      │                          └─ 0.138 field  uproot/source/cursor.py:203
      ├─ 1.793 _get_ttree_form  uproot/_dask.py:1263
      │  ├─ 0.726 interpretation  uproot/behaviors/TBranch.py:1910
      │  │  └─ 0.713 interpretation_of  uproot/interpretation/identify.py:298
      │  │     ├─ 0.232 classname  uproot/model.py:403
      │  │     │  └─ 0.193 classname_decode  uproot/model.py:171
      │  │     ├─ 0.166 _from_leaves  uproot/interpretation/identify.py:131
      │  │     └─ 0.152 _leaf_to_dtype  uproot/interpretation/identify.py:64
      │  │        └─ 0.137 classname  uproot/model.py:403
      │  ├─ 0.614 awkward_form  uproot/interpretation/numerical.py:260
      │  │  ├─ 0.339 awkward  uproot/extras.py:19
      │  │  │  └─ 0.313 parse_version  uproot/_util.py:110
      │  │  │     └─ 0.305 parse  packaging/version.py:45
      │  │  │        └─ 0.301 __init__  packaging/version.py:186
      │  │  └─ 0.259 awkward_form  uproot/_util.py:510
      │  │     └─ 0.237 awkward  uproot/extras.py:19
      │  │        └─ 0.229 parse_version  uproot/_util.py:110
      │  │           └─ 0.225 parse  packaging/version.py:45
      │  │              └─ 0.218 __init__  packaging/version.py:186
      │  └─ 0.366 awkward_form  uproot/interpretation/jagged.py:105
      │     └─ 0.257 awkward_form  uproot/_util.py:510
      │        └─ 0.195 awkward_form  uproot/interpretation/numerical.py:260
      ├─ 1.325 from_map  dask_awkward/lib/io/io.py:504
      │  └─ 1.250 mock  uproot/_dask.py:955
      │     └─ 1.250 typetracer_from_form  awkward/typetracer.py:212
      │        ├─ 0.827 to_typetracer  awkward/contents/content.py:238
      │        │  └─ 0.827 _to_typetracer  awkward/contents/recordarray.py:359
      │        │     └─ 0.813 <listcomp>  awkward/contents/recordarray.py:361
      │        │        ├─ 0.466 _to_typetracer  awkward/contents/listoffsetarray.py:221
      │        │        │  ├─ 0.167 to_nplike  awkward/index.py:245
      │        │        │  └─ 0.141 _to_typetracer  awkward/contents/numpyarray.py:221
      │        │        └─ 0.342 _to_typetracer  awkward/contents/numpyarray.py:221
      │        │           └─ 0.133 _raw  awkward/contents/numpyarray.py:196
      │        └─ 0.422 length_zero_array  awkward/forms/form.py:494
      │           └─ 0.422 _impl  awkward/operations/ak_from_buffers.py:117
      │              └─ 0.422 _reconstitute  awkward/operations/ak_from_buffers.py:187
      │                 └─ 0.399 <listcomp>  awkward/operations/ak_from_buffers.py:403
      │                    └─ 0.385 _reconstitute  awkward/operations/ak_from_buffers.py:187
      ├─ 0.397 form_with_unique_keys  dask_awkward/lib/utils.py:139
      │  ├─ 0.220 impl  dask_awkward/lib/utils.py:140
      │  │  └─ 0.191 content  awkward/_meta/recordmeta.py:132
      │  │     └─ 0.171 field_to_index  awkward/_meta/recordmeta.py:101
      │  │        └─ 0.166 list.index  <built-in>:0
      │  └─ 0.146 from_dict  awkward/forms/form.py:51
      │     └─ 0.145 <listcomp>  awkward/forms/form.py:108
      │        └─ 0.143 from_dict  awkward/forms/form.py:51
      ├─ 0.222 __init__  uproot/_dask.py:784
      │  └─ 0.222 build_form_key_to_key  uproot/_dask.py:795
      │     └─ 0.222 impl  uproot/_dask.py:799
      │        └─ 0.198 content  awkward/_meta/recordmeta.py:132
      │           └─ 0.180 field_to_index  awkward/_meta/recordmeta.py:101
      │              └─ 0.179 list.index  <built-in>:0
      └─ 0.170 dask_awkward  uproot/extras.py:300
         └─ 0.170 <module>  dask_awkward/__init__.py:1

kmohrman Dec 19, 2023
Author

Ok, thanks Lindsey. I have 182 datasets, and yes I am accessing via xrd.

Sorry if I've just missed it, but I'm still not sure why we can't just put the NanoEventsFactory.from_root into the process function (so just pass the process function a dataset name file list, instead of passing it the events object we obtain ahead of time). Would that work around this issue of having to do NanoEventsFactory.from_root for all datasets ahead of time?

lgray Dec 19, 2023
Maintainer

So the reason for that comes down to how dask task-graphs work, which is broadly split into two phases: building and execution.

For building you need to have the information about the structure of the input file (so you have to process the metadata of at least one file in that specific dataset), then you start performing operations on that representation to build up the task graph of stuff you want to do (histograms, skims). In coffea 0.7 the arrays you wanted were provided as soon as you asked for them. Now arrays are never read from a file until you get to the execution stage, which means you have to open at least one file in each dataset for which you are building a task graph, in order to generate the initial file structure (you can't just do it later as with coffea 0.7). Another way to put that is that process functions only ever operate on these starting bits of metadata from dask, and process is run on the client side since it is what builds up the task graph to execute. Before, the task graph to execute was just calling process over and over; now the task graph includes every single operation you do to the data.

When executed, that task graph is optimized and the tasks defined within it are mapped over the partitions of the data to be read in, using the already determined metadata to figure out exactly what needs to be read from the input file.

So, as opposed to previously, where the file would be opened and its metadata determined later, we have to do it eagerly on the client side. I think fixing it is just a matter of organizing things correctly or otherwise making some portions of that initial determination more lazily evaluated.
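
To make the two phases concrete, here is a deliberately tiny toy model in plain Python. It is not how dask or dask-awkward are implemented; `LazyEvents`, its methods, and the data are all hypothetical. The point is only that `process` runs once, on the client, against an object that knows the file structure but holds no data, and that the recorded operations are replayed on each partition later.

```python
class LazyEvents:
    """Toy lazy array: records operations instead of running them."""

    def __init__(self, known_fields, ops=()):
        self.known_fields = known_fields  # "metadata" learned from one file
        self.ops = ops                    # recorded operations (the "task graph")

    def select(self, field):
        # Building needs the file structure: an unknown field fails right away,
        # before any data has been read.
        if field not in self.known_fields:
            raise KeyError(field)
        op = lambda rows: [row[field] for row in rows]
        return LazyEvents(self.known_fields, self.ops + (op,))

    def map(self, func):
        op = lambda values: [func(v) for v in values]
        return LazyEvents(self.known_fields, self.ops + (op,))

    def compute(self, partitions):
        # Execution: replay the recorded operations on every partition.
        results = []
        for data in partitions:
            for op in self.ops:
                data = op(data)
            results.extend(data)
        return results


calls = []


def process(events):
    # Runs once on the "client"; it only sees metadata and records work to do.
    calls.append("process")
    return events.select("pt").map(lambda pt: pt * 2)


events = LazyEvents(known_fields={"pt", "eta"})
graph = process(events)  # build phase: no data touched yet

partitions = [  # made-up data, split into two partitions
    [{"pt": 10.0, "eta": 0.1}, {"pt": 20.0, "eta": -1.2}],
    [{"pt": 30.0, "eta": 2.0}],
]
result = graph.compute(partitions)  # execute phase
print(result, calls)
```

In this toy, constructing `LazyEvents` requires `known_fields` up front, which mirrors why at least one file per dataset has to be opened before `process` can be called.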

I also just checked and opening to get the metadata is slower in uproot4. So overall it's a pretty significant improvement, it's just organized such that it appears slower to the user right now.

lgray Dec 19, 2023
Maintainer

tl;dr - it's slower because we now have to do on the client something slow that we were previously hiding in the jobs, but it's something we can fix.

kmohrman · 2023-12-18T23:36:02Z

kmohrman
Dec 18, 2023
Author

Ok, thanks @lgray! I've created an events object and am now trying to run for just a single root file. Without changing any of my processor code, the first crash happens in the lepton object selection here; the error is a NotImplementedError:

Traceback (most recent call last):
  File "/home/k.mohrman/coffea_dir/migrate_to_coffea2023_repo/ewkcoffea/analysis/wwz/run_wwz4l.py", line 323, in <module>
    histos_to_compute = processor_instance.process(events)
  File "/home/k.mohrman/coffea_dir/migrate_to_coffea2023_repo/ewkcoffea/analysis/wwz/wwz4l.py", line 256, in process
    ele["topmva"] = os_ec.get_topmva_score_ele(events, year)
  File "/home/k.mohrman/coffea_dir/migrate_to_coffea2023_repo/ewkcoffea/ewkcoffea/modules/objects_wwz.py", line 66, in get_topmva_score_ele
    in_vals = np.array([
  File "/home/k.mohrman/miniconda3/envs/coffea23-env00/lib/python3.9/site-packages/dask_awkward/lib/core.py", line 1614, in __array__
    raise NotImplementedError
NotImplementedError

This seems like it might be something obvious, but I'm not sure I understand it, sorry. Wondering if you'd have any tips? Is it related to the array being a np array not a dask array?

14 replies

lgray Jan 4, 2024
Maintainer

Also in your particular implementation of xgb_test, whatever you call it, you have to define the data manipulation it does to events (or events.Muon, etc.) such that it maps your data into the input structure the BDT expects.

You can see the reduction from 20 columns to 16 in the one that Yi-mu made. https://github.com/CoffeaTeam/coffea/blob/master/tests/test_ml_tools.py#L169-L174

kmohrman Jan 4, 2024
Author

Ah, okay, thank you! So I think I'd been misunderstanding. The elements in that ak_events array do not necessarily need to correspond to "one element per event" but more like one element per object. Then at the end I guess we'd still need to unflatten as before.
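
The flatten, evaluate, unflatten pattern described here can be sketched with plain Python lists. This is only an illustration of the bookkeeping, not the coffea or awkward API: `flatten`, `unflatten`, and `toy_model` are hypothetical helpers, and the electron features are invented.

```python
def flatten(jagged):
    """Return (flat list of objects, number of objects per event)."""
    counts = [len(event) for event in jagged]
    flat = [obj for event in jagged for obj in event]
    return flat, counts


def unflatten(flat, counts):
    """Regroup a flat list into per-event lists using the counts."""
    out, start = [], 0
    for n in counts:
        out.append(flat[start:start + n])
        start += n
    return out


def toy_model(rows):
    """Hypothetical stand-in for a BDT: one score per input row."""
    return [sum(row) / len(row) for row in rows]


# Made-up electrons: each event holds a variable number of objects,
# and each object is a row of two input features.
electrons = [[[1.0, 3.0], [2.0, 4.0]], [], [[5.0, 7.0]]]

rows, counts = flatten(electrons)            # one row per object
scores = unflatten(toy_model(rows), counts)  # back to one list per event
print(scores)
```

The model only ever sees one row per object; the per-event structure is carried separately in `counts` and restored at the end.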

Ok, so I think I've gotten the xgboost stuff in my code more or less working here. But I still have a few questions:

1. I am still having a hard time wrapping my head around the xgboost_test and prepare_awkward part (which I just copy-pasted from the coffea test example). I am wondering if you could say a bit more about what this is conceptually doing?
2. When I run, I get this warning. I'm wondering what this means and if this is something I should be concerned about?

  warnings.warn(smsg, UserWarning)
/home/k.mohrman/miniconda3/envs/coffea23-env00/lib/python3.9/site-packages/dask_awkward/lib/structure.py:895: UserWarning: Please ensure that dask.awkward<num, npartitions=1>
        is partitionwise-compatible with dask.awkward<numpy-call-xgboost-test, npartitions=1>
        (e.g. counts comes from a dak.num(array, axis=1)),
        otherwise this unflatten operation will fail when computed!
  warnings.warn(

3. The numbers I'm getting with the new implementation are very close to the old. I tried to print them out and do a diff, but when I try to print out all the values like this:
```
for i in score.compute():
    for j in i:
        print(j)
```
I get e.g. 0.9850477 for the first one, while I get 0.9850476980209351 in the coffea 0.7 print statement. I'm not sure what's controlling the rounding here or why it's printing with a different number of digits than it had been in coffea 0.7. Wondering if you'd have any tips for how to get the coffea 2023 print statement to give me more digits? Otherwise of course I can compare in a fancier way (but diff is just so easy that it'd be nice to be able to use it if possible...).

lgray Jan 4, 2024
Maintainer

1. This function takes whatever raw awkward arrays you have as input and turns them into the format your ML model accepts (so for most BDTs you'll flatten stuff in this function and unflatten the returned result).
2. It's a reminder so you don't shoot yourself in the foot when unflattening :-)
3. It's probably just differences in rounding when printing in awkward2 vs. awkward1. If you turn them both into numpy arrays and compare them, it'll be on more equal footing. (Also, FWIW, if they're the same to 7 decimal points already you're close enough ;-) )
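
A minimal sketch of the NumPy comparison suggested above for the rounding question. The arrays are made-up stand-ins (in the real case they would be the flattened coffea 0.7 and coffea 2023 scores), and the float64/float32 split is only an assumption about why the printed digits differ; nothing in this thread confirms it.

```python
import numpy as np

# Stand-in values. Assumption: old scores are float64, new scores are float32.
old_scores = np.array([0.9850476980209351, 0.25, 0.5], dtype=np.float64)
new_scores = old_scores.astype(np.float32)

# Compare numerically instead of diffing printed text.
# rtol=1e-6 is an arbitrary tolerance, loose enough for float32 rounding.
same = bool(np.allclose(new_scores, old_scores, rtol=1e-6, atol=0.0))
max_abs_diff = float(np.max(np.abs(new_scores.astype(np.float64) - old_scores)))

print("all close:", same)
print("largest absolute difference:", max_abs_diff)
```

Comparing with a tolerance sidesteps whatever the two versions do when formatting floats for printing.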

kmohrman Jan 4, 2024
Author

Ok, thank you! One more quick question on 2, I am still not sure if I understand what situation it is saying unflatten would fail in... what does "partitionwise-compatible" mean in the Please ensure that dask.awkward<num, npartitions=1> is partitionwise-compatible with dask.awkward<numpy-call-xgboost-test, npartitions=1>? Is it just saying make sure npartitions is the same in both of those?

lgray Jan 5, 2024
Maintainer

Yes it more or less means don't mix counts and data that come from differently partitioned sources, which is a thing you can do if you're not careful.
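
To make the flatten/unflatten pattern and the warning a bit more concrete, here is a plain-Python analogy. It is deliberately not the dask-awkward API: `flatten`, `unflatten` and `fake_model` are hypothetical helpers. Counts taken from the same array the flat data came from restore the structure; counts from some other source do not.

```python
def flatten(jagged):
    """Return (flat values, per-event counts) for a list of lists."""
    counts = [len(event) for event in jagged]
    flat = [value for event in jagged for value in event]
    return flat, counts


def unflatten(flat, counts):
    """Rebuild a list of lists from flat values and per-event counts."""
    if sum(counts) != len(flat):
        raise ValueError("counts do not match the flat data")
    out, start = [], 0
    for n in counts:
        out.append(flat[start:start + n])
        start += n
    return out


def fake_model(flat):
    """Hypothetical stand-in for a BDT: one score per input object."""
    return [round(x / 100.0, 2) for x in flat]


# Three events with 2, 0 and 1 objects (e.g. lepton pt values).
pts = [[50.0, 30.0], [], [20.0]]
flat, counts = flatten(pts)
scores = unflatten(fake_model(flat), counts)

# Counts that come from a different source are not compatible.
other_counts = [1, 1]
try:
    unflatten(fake_model(flat), other_counts)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

As I understand the warning, dask-awkward needs the same agreement partition by partition, and it can only be checked once the graph is actually computed.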

kmohrman · 2023-12-19T18:22:54Z

kmohrman
Dec 19, 2023
Author

I also have another question about another error I've run into further into the processor (I've just commented out the xgboost stuff for now; I will try to figure it out later). It's on this line:

    weights_obj_base = coffea.analysis_tools.Weights(len(events),storeIndividual=True)

The error is this:

Traceback (most recent call last):
  File "/home/k.mohrman/coffea_dir/migrate_to_coffea2023_repo/ewkcoffea/analysis/wwz/run_wwz4l.py", line 344, in <module>
    histos_to_compute[json_name] = processor_instance.process(json_name,flist[json_name])
  File "/home/k.mohrman/coffea_dir/migrate_to_coffea2023_repo/ewkcoffea/analysis/wwz/wwz4l.py", line 321, in process
    weights_obj_base = coffea.analysis_tools.Weights(len(events),storeIndividual=True)
  File "/home/k.mohrman/miniconda3/envs/coffea23-env00/lib/python3.9/site-packages/dask_awkward/lib/core.py", line 978, in __len__
    raise NotImplementedError(
NotImplementedError: Cannot determine length of collection with unknown partition sizes without executing the graph.
Use `dask_awkward.num(..., axis=0)` if you want a lazy Scalar of the length.
If you want to eagerly compute the partition sizes to have the ability to call `len` on the collection, use `.eager_compute_divisions()` on the collection.

So I guess I cannot use len(events). I've tried to replace it with dak.num(events, axis=0), but this gives me an error too:

Traceback (most recent call last):
  File "/home/k.mohrman/coffea_dir/migrate_to_coffea2023_repo/ewkcoffea/analysis/wwz/run_wwz4l.py", line 344, in <module>
    histos_to_compute[json_name] = processor_instance.process(json_name,flist[json_name])
  File "/home/k.mohrman/coffea_dir/migrate_to_coffea2023_repo/ewkcoffea/analysis/wwz/wwz4l.py", line 324, in process
    weights_obj_base = coffea.analysis_tools.Weights(dak.num(events, axis=0),storeIndividual=True)
  File "/home/k.mohrman/miniconda3/envs/coffea23-env00/lib/python3.9/site-packages/coffea/analysis_tools.py", line 69, in __init__
    self._weight = None if size is None else numpy.ones(size)
  File "/home/k.mohrman/miniconda3/envs/coffea23-env00/lib/python3.9/site-packages/numpy/core/numeric.py", line 204, in ones
    a = empty(shape, dtype, order)
TypeError: expected a sequence of integers or a single integer, got 'dask.awkward<numaxis0, type=Scalar, dtype=int64>'

Wondering if @lgray or @nsmith- would have any tips on how to work around this? Is there maybe some different version of coffea.analysis_tools.Weights() that I'm supposed to use with coffea2023?

19 replies

kmohrman Dec 19, 2023
Author

Ahh, I am sorry. Please ignore my previous message (for some reason it was not showing me your "so all you have do do" message before I sent my message).

lgray Dec 19, 2023
Maintainer

Yeah they both use the same axes, which are outside the distinction of delayed or eager execution, really.

kmohrman Dec 19, 2023
Author

Wow, it actually ran and made a histogram! Was just for a single root file (and the xgboost stuff is still disabled), but anyway it ran :)

I'll keep working on the xgboost stuff, and once that's figured out, I think I will be pretty much done.

Thank you so much @lgray for all of the help. I hope you have a very nice Holiday break.

lgray Dec 19, 2023
Maintainer

Sure thing! Have a good break as well! If you can summarize your process with a few examples it'll help a lot for making documentation!

kmohrman Dec 19, 2023
Author

Thanks! And sounds good, once I get the xgboost stuff finished and everything else cleaned up (and once I've validated that I'm getting the same results with the old and new versions) I'll try to summarize things and post it here.

kmohrman · 2023-12-19T23:43:30Z

kmohrman
Dec 19, 2023
Author

I have a quick question: with coffea2023, is anything like this needed for processors (using the processor.ProcessorABC class)?

    @property
    def columns(self):
        return self._columns

I'm actually not really sure what (if anything) it does (even in coffea 0.7)... but I guess I must have copy pasted it from some example at some point.

3 replies

lgray Dec 19, 2023
Maintainer

No that's not needed at all!

kmohrman Dec 19, 2023
Author

Ok, thanks!

lgray Dec 19, 2023
Maintainer

Though that reminds me you should check dak.necessary_columns(analysis output) and make sure it's close to what you expect for the columns your analysis uses.

kmohrman · 2024-01-03T17:21:10Z

kmohrman
Jan 3, 2024
Author

Happy New Year!

I have a question about the structure of the output histogram object for the coffea 2023 version of my processor.

The way I'd previously been running with coffea 0.7, the output histogram object had a StrCategory axis for samples (e.g. ttH, ttW, etc). This structure was convenient for manipulating and plotting (to do e.g. summing and grouping etc).

However, in my initial attempt to migrate to coffea 2023, I'm now passing each dataset one by one to my process function, so I eventually pass an object like {"sample1": stuff_to_compute, ...} to dask.compute. This means I end up with an output object that is a dictionary of histogram objects where the keys are sample names and the values are the histogram objects.
I'm wondering if it's easy/advisable to set this up so that the samples are a category axis in my histogram object (as I'd been doing before)?

The way I can think of to do this with coffea2023 would be to pass something like {"sample1":events1,...} to my process function, and then put in an explicit for loop over keys of that input dictionary. However, having the loop over samples explicitly in the processor feels like it would inhibit some parallelization, but I also feel like I might be thinking about that the wrong way (since I guess now the processor does not actually get run till we dask.compute?). Also I might be missing some other obvious solution.

Anyway, sorry for the long question (and sorry that it's not very well formulated), but I'm wondering if you'd have any thoughts or advice on this? Thanks!

9 replies

lgray Jan 3, 2024
Maintainer

a fileset looks like:

{
    "ZJets": {
        "files": {
            "tests/samples/nano_dy.root": {
                "object_path": "Events",
                "steps": [
                    [0, 5], [5, 10], [10, 15], [15, 20], [20, 25], [25, 30], [30, 35], [35, 40],
                ],
            }
        },
        "metadata": {...},
        ...
    },
    "Data": {
        "files": {
            "tests/samples/nano_dimuon.root": "Events",
            "tests/samples/nano_dimuon_not_there.root": "Events",
        },
        "metadata": {...},
        ...
    },
}

i.e. it's just a list of datasets where each dataset has files, possibly with chunks defined, and metadata.
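
As a small illustration of that structure, the sketch below builds a fileset as a plain Python dict and walks over it. The file paths are taken from the example above; the `metadata` contents and the `summarize` helper are made up for illustration.

```python
fileset = {
    "ZJets": {
        "files": {
            # Dict form: tree name plus explicit entry ranges (chunks).
            "tests/samples/nano_dy.root": {
                "object_path": "Events",
                "steps": [[0, 5], [5, 10], [10, 15], [15, 20]],
            }
        },
        "metadata": {"is_data": False},  # hypothetical metadata
    },
    "Data": {
        "files": {
            # Short form: file path mapped directly to the tree name.
            "tests/samples/nano_dimuon.root": "Events",
        },
        "metadata": {"is_data": True},  # hypothetical metadata
    },
}


def summarize(fileset):
    """Return {dataset: (number of files, number of predefined steps)}."""
    summary = {}
    for dataset, info in fileset.items():
        n_steps = 0
        for spec in info["files"].values():
            # Only the dict form carries explicit steps.
            if isinstance(spec, dict):
                n_steps += len(spec.get("steps", []))
        summary[dataset] = (len(info["files"]), n_steps)
    return summary


summary = summarize(fileset)
print(summary)
```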

It shouldn't be related to post-process at all. I'd have to see your exact code to understand why you are getting a tuple; given what you said and the code you linked, that should not be the case.

I'll patch out the need for postprocess. Nominally you don't really need to inherit from ProcessorABC in the first place. You could just turn it into a class with __call__ defined instead of process.

lgray Jan 3, 2024
Maintainer

Oh, maybe you're talking about the output of dask.compute, which will always be a tuple of stuff? In the case of just passing it one thing just do x = dask.compute(the_thing)[0] and x will be the computed version of the_thing.

kmohrman Jan 3, 2024
Author

Ah, okay, I did not realize the output of dask.compute would always be a tuple. Thanks!

nsmith- Jan 4, 2024
Maintainer

Alternatively, x, = dask.compute(the_thing)
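
Both spellings rely on ordinary tuple handling. A sketch with a stand-in function: `fake_compute` is hypothetical and only mimics the "always returns a tuple" behaviour described above, so the snippet runs without dask.

```python
def fake_compute(*collections):
    """Stand-in for dask.compute: returns a tuple, one entry per argument."""
    return tuple(collections)


the_thing = {"njets": [3, 1, 4]}

result = fake_compute(the_thing)   # a 1-tuple, not the dict itself
x = fake_compute(the_thing)[0]     # index into the tuple
y, = fake_compute(the_thing)       # single-element tuple unpacking
(z,) = fake_compute(the_thing)     # same unpacking, spelled more visibly
```

The unpacking form has the side benefit of raising a ValueError if the tuple ever holds more than one element, whereas `[0]` would silently drop the rest.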

lgray Jan 4, 2024
Maintainer

and I am still learning things about python syntax...

kmohrman · 2024-01-04T14:20:18Z

kmohrman
Jan 4, 2024
Author

I have a quick question (probably something obvious that I'm just missing). When I try to print e.g. events.Electron.pt I just get dask.awkward<pt, npartitions=1>. I'm wondering how to see the values in the array?

2 replies

lgray Jan 4, 2024
Maintainer

print(events.Electron.pt.compute()) - but obviously you don't want to do this in the middle of your processor in a production setting, just for testing on small inputs. Every time you do compute it has to read the data (unless you use (dask).persist). You could also pass the array out of your processor (e.g. return hist_dict, events.Electron.pt) and then compute it later, etc.

kmohrman Jan 4, 2024
Author

Thanks!

kmohrman · 2024-01-04T22:36:31Z

kmohrman
Jan 4, 2024
Author

Thank you again @lgray for all of the help with this. I think my analysis code is finally pretty much fully migrated. I should clean some parts up, but as of now everything seems to be working fine for my tests with a single file anyway (yields for the single file agree, and my CI is passing).

Next up I will try to run at scale. Just for reference, with coffea 0.7 scaled out with Work Queue, this analysis (which is still fairly preliminary so does not have any systematics yet) was able to turn around in ~20m using ~500 cores. I will talk to @btovar and @cmoore24-24 about how to run with TaskVine, and once we have it running I can post any interesting performance numbers here.

But I have one quick question before trying to scale up. @lgray I am wondering if you could explain a bit more about how/where the dask.compute(stuff_to_compute) runs when no scheduler is specified? Is it just running locally? Also, I'm wondering how its performance is expected to compare to the iterative executor with coffea 0.7? For context, for my single testing file, the runtime is about 15s with the coffea 0.7 iterative executor, but seems to be almost a minute for coffea 2023.

14 replies

lgray Jan 5, 2024
Maintainer

Hmm, they appear to be exactly the same (testing set equality in the python terminal). There's definitely some stuff your code isn't asking for yet (like the jet indices) that are getting touched despite not being needed.

One thing you can do is go through the code and do dak.necessary_columns(some_pertinent_variable) and see at what point in the code the number of columns you expect to read increases by a dramatic amount.

This may expose yet another overtouching bug we need to fix.

lgray Jan 5, 2024
Maintainer

@Jailbone please make a new discussion! Your comment comes a bit in the weeds of a very long discussion and I'd want to make sure it's easy for people to find.

kmohrman Jan 5, 2024
Author

Printing out dak.necessary_columns, it looks like here is where we go from columns that are expected (i.e. just 'Electron_tightCharge', 'Electron_convVeto', 'Electron_dxy', 'Electron_pt', 'Electron_miniPFRelIso_all', 'Electron_lostHits', 'Electron_eta', 'Electron_sip3d', 'nElectron', 'Electron_dz' at this point) to a bunch of columns that I'm not using (e.g. Electron_mvaTTH).

So I guess it must be something inside of get_topmva_score_ele() that's the culprit (at least for the electron related variables). I'll keep digging...

Edit: Sorry, the link was to the master branch. This is the function on the coffea2023 branch here.

kmohrman Jan 5, 2024
Author

Actually, it seems that as soon as I do ele = events.Electron at the very start of the processor, dak.necessary_columns(ele) shows many columns:
{'from-uproot-a9fb28be8b7991a0240b9233b6fda061': frozenset({'Electron_mvaFall17V2noIso_WPL', 'Electron_mvaFall17V2noIso', 'Electron_mvaFall17V2Iso', 'Electron_pt', 'Electron_dr03TkSumPt', 'Electron_hoe', 'Electron_ip3d', 'Electron_cleanmask', 'Electron_dzErr', 'Electron_tightCharge', 'Electron_r9', 'Electron_jetPtRelv2', 'Electron_cutBased', 'Electron_dr03EcalRecHitSumEt', 'Electron_mvaFall17V2noIso_WP80', 'Electron_eta', 'Electron_mvaFall17V2noIso_WP90', 'Electron_dEsigmaDown', 'Electron_eInvMinusPInv', 'Electron_mvaTTH', 'Electron_mvaFall17V2Iso_WPL', 'Electron_scEtOverPt', 'Electron_pdgId', 'Electron_vidNestedWPBitmap', 'Electron_convVeto', 'Electron_dEscaleUp', 'Electron_dxyErr', 'Electron_dz', 'Electron_dr03HcalDepth1TowerSumEt', 'Electron_sieie', 'Electron_seedGain', 'Electron_dEscaleDown', 'nJet', 'Electron_mvaFall17V2Iso_WP90', 'Electron_pfRelIso03_chg', 'Electron_jetNDauCharged', 'Electron_charge', 'Electron_jetRelIso', 'nGenPart', 'Electron_isPFcand', 'Electron_energyErr', 'Electron_genPartIdx', 'Electron_lostHits', 'Electron_jetIdx', 'Electron_dxy', 'Electron_phi', 'Electron_dEsigmaUp', 'Electron_photonIdx', 'Electron_genPartFlav', 'Electron_pfRelIso03_all', 'Electron_sip3d', 'Electron_dr03TkSumPtHEEP', 'Electron_mass', 'Electron_mvaFall17V2Iso_WP80', 'Electron_cutBased_HEEP', 'Electron_vidNestedWPBitmapHEEP', 'Electron_miniPFRelIso_chg', 'nElectron', 'Electron_eCorr', 'nPhoton', 'Electron_deltaEtaSC', 'Electron_miniPFRelIso_all'})}

Wondering if this is expected, or if at this point we would not expect to see any columns since we have not used any particular columns (e.g. ele.pt or anything like that) yet?

lgray Jan 5, 2024
Maintainer

Yes if you ask for the columns necessary for the whole electron object it'll need to grab all the data.
Usually this overtouching is detectable when you ask for a specific value like .pt, or the column of BDT output values you calculate, and necessary_columns says it needs the whole input object.
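
One way to act on this is to diff the reported columns against the set the analysis is expected to read. A sketch in plain Python: `reported` mimics the shape of the dak.necessary_columns output shown above (a dict mapping an input-layer name to a frozenset of branch names), the layer name is a placeholder, and `unexpected_columns` is a hypothetical helper.

```python
expected = {
    "Electron_pt", "Electron_eta", "Electron_dxy", "Electron_dz",
    "Electron_sip3d", "Electron_convVeto", "Electron_lostHits",
    "Electron_tightCharge", "Electron_miniPFRelIso_all", "nElectron",
}

# Placeholder for the real output: the expected columns plus two extras.
reported = {
    "from-uproot-placeholder": frozenset(
        expected | {"Electron_mvaTTH", "Electron_jetIdx"}
    ),
}


def unexpected_columns(reported, expected):
    """Return, per input layer, the sorted columns that were not expected."""
    return {
        layer: sorted(columns - expected)
        for layer, columns in reported.items()
    }


extra = unexpected_columns(reported, expected)
print(extra)
```

Running the same diff at a few points in the processor should show the step at which the extra columns first appear.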

kmohrman · 2024-01-09T16:12:05Z

kmohrman
Jan 9, 2024
Author

Hi @lgray I'm wondering if there is any type of progress bar that would be available with coffea 2023 (e.g. something similar to the extremely useful progress bars available in coffea 0.7)?

3 replies

lgray Jan 9, 2024
Maintainer

https://docs.dask.org/en/stable/diagnostics-local.html#progress-bar

lgray Jan 9, 2024
Maintainer

Though, really, the dashboard (localhost:8787) that the distributed client creates is faaaar more useful than a basic progress bar.

kmohrman Jan 9, 2024
Author

Ok, thank you

Replies: 8 comments · 75 replies
