Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broadcasting of unions overtouches data #2911

Closed
lgray opened this issue Dec 21, 2023 · 3 comments · Fixed by #2913
Closed

Broadcasting of unions overtouches data #2911

lgray opened this issue Dec 21, 2023 · 3 comments · Fixed by #2913
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@lgray
Copy link
Contributor

lgray commented Dec 21, 2023

Version of Awkward Array

master

Description and code to reproduce

from coffea.nanoevents import NanoEventsFactory, NanoAODSchema
import awkward as ak
import dask_awkward as dak

if __name__ == "__main__":
    NanoAODSchema.error_missing_event_ids = False

    events = NanoEventsFactory.from_root(
        {"tests/samples/nano_dy.root": {"object_path": "Events", "steps": [[0,40]]}},
        metadata={"dataset": "nano_dy"},
        schemaclass=NanoAODSchema,
    ).events()

    events["leptons"] = ak.concatenate(
        [events.Electron, events.Muon],
        axis=1,
    )

    sum_p4 = events.leptons + events.leptons

    print(dak.necessary_columns(sum_p4.pt))

    print(sum_p4.pt.compute())

While this does function now and does not yield incorrect data it reads a very large number of fields when it should need 10 (the counts of electrons, muons, and the four-momentum components of each). The current performance is rather suboptimal.

The code presently outputs:

(coffea-dev) lgray@Lindseys-MacBook-Pro coffea % python -i concat_issue_nano.py
/Users/lgray/coffea-dev/coffea/src/coffea/nanoevents/schemas/nanoaod.py:243: RuntimeWarning: Missing cross-reference index for FatJet_genJetAK8Idx => GenJetAK8
  warnings.warn(
{'from-uproot-ad7806f24d5481e9a913897472bb24fa': frozenset({'Electron_cutBased', 'Muon_miniPFRelIso_chg', 'Electron_charge', 'Muon_highPtId', 'Electron_cutBased_HEEP', 'Muon_isGlobal', 'Muon_nStations', 'Muon_softMvaId', 'Electron_mvaFall17V1noIso_WP90', 'nGenPart', 'Electron_vidNestedWPBitmap', 'Muon_tkRelIso', 'Electron_pdgId', 'Electron_energyErr', 'Muon_pfIsoId', 'Electron_mvaFall17V1Iso_WPL', 'Muon_dxyErr', 'Electron_mvaFall17V1Iso_WP90', 'Muon_miniIsoId', 'Electron_r9', 'Electron_dr03TkSumPt', 'Electron_dr03HcalDepth1TowerSumEt', 'Electron_mvaFall17V2noIso', 'Electron_mvaFall17V2Iso_WP80', 'Electron_isPFcand', 'Electron_cutBased_Fall17_V1', 'Muon_dz', 'Muon_isTracker', 'Electron_hoe', 'Electron_mvaFall17V1noIso_WP80', 'nFsrPhoton', 'Electron_mvaTTH', 'Electron_photonIdx', 'Electron_dxy', 'Muon_nTrackerLayers', 'Muon_segmentComp', 'Muon_genPartFlav', 'Muon_tightCharge', 'Electron_dz', 'Muon_dzErr', 'Muon_tunepRelPt', 'Muon_dxy', 'Electron_genPartFlav', 'Muon_mediumId', 'Muon_looseId', 'Electron_mvaFall17V1Iso_WP80', 'Electron_jetRelIso', 'Electron_lostHits', 'Electron_dr03EcalRecHitSumEt', 'Electron_vidNestedWPBitmapHEEP', 'Muon_mvaLowPt', 'Muon_mediumPromptId', 'Muon_softMva', 'nPhoton', 'Electron_miniPFRelIso_chg', 'Muon_fsrPhotonIdx', 'Muon_mvaTTH', 'Muon_pfRelIso04_all', 'Muon_mvaId', 'Electron_mvaFall17V2Iso', 'Muon_ip3d', 'Electron_phi', 'Muon_inTimeMuon', 'Electron_eInvMinusPInv', 'Muon_mass', 'Muon_jetPtRelv2', 'Muon_sip3d', 'Muon_tightId', 'Electron_eCorr', 'Electron_dxyErr', 'Electron_dr03TkSumPtHEEP', 'Muon_jetIdx', 'Electron_pfRelIso03_chg', 'Electron_mvaFall17V1noIso_WPL', 'Electron_mvaFall17V2Iso_WP90', 'Muon_pfRelIso03_chg', 'Electron_pfRelIso03_all', 'Muon_pfRelIso03_all', 'Electron_eta', 'Electron_convVeto', 'Electron_mvaFall17V2Iso_WPL', 'Muon_pt', 'Electron_mvaFall17V1noIso', 'nElectron', 'Electron_sip3d', 'Electron_mvaFall17V1Iso', 'Muon_triggerIdLoose', 'Electron_pt', 'Electron_mvaFall17V2noIso_WPL', 'Muon_multiIsoId', 'Electron_tightCharge', 'Muon_phi', 'Muon_ptErr', 'Muon_eta', 'Electron_miniPFRelIso_all', 'nMuon', 'Electron_deltaEtaSC', 'Muon_jetRelIso', 'Muon_isPFcand', 'Electron_mvaFall17V2noIso_WP90', 'nJet', 'Electron_jetPtRelv2', 'Electron_sieie', 'Electron_cleanmask', 'Muon_tkIsoId', 'Muon_cleanmask', 'Electron_mass', 'Electron_ip3d', 'Muon_genPartIdx', 'Electron_genPartIdx', 'Electron_seedGain', 'Muon_miniPFRelIso_all', 'Muon_charge', 'Electron_dzErr', 'Electron_mvaFall17V2noIso_WP80', 'Muon_pdgId', 'Muon_softId', 'Electron_jetIdx'})}
[[], [59.1], [120, 103], [21.3, 17.2], [154, ...], ..., [], [15.4], [], []]

Apologies for the coffea reproducer.

@lgray lgray added the bug (unverified) The problem described would be a bug, but needs to be triaged label Dec 21, 2023
@lgray
Copy link
Contributor Author

lgray commented Dec 21, 2023

This was working as expected in a prior version of awkward/dask-awkward (query 8 here run for CHEP):
https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb

I know things have changed greatly, but I'm sure we can get back to that state. This issue may likely hang between awkward and dask-awkward. @douglasdavis FYI

@agoose77
Copy link
Collaborator

What's happening here is that we broadcast through the union, which necessarily involves projecting out each content before recursing. This projection is eager, so we end up touching everything below-and-including each content of the union. Once we recurse through each content, we find the same ufunc overload for np.add (because leptons are leptons), and the resulting union over the result therefore simplifies away.

@jpivarski this concerns the choice for allow_lazy=False vs allow_lazy=True in UnionArray.project. This code predates my time on Awkward, and I know you usually have well-formed ideas about whether it's reasonable to create an index or not. There are two solutions: either we change UnionArray.project to use allow_lazy=True internally, or we add a new UnionArray.project_as_indexed that explicitly sets allow_lazy=True. I think the former is fine; it makes a lot of sense to me that we should avoid the carry; it's trivial to project an IndexedArray.

N.B.: presently, there is an implicit dependency upon the union formed by ak.concatenate.

@jpivarski
Copy link
Member

We can change UnionArray.project to use allow_lazy=True internally.

This mechanism was introduced in #261 as an optimization for RecordArrays (#204). We weren't thinking about UnionArrays, but if unions can be optimized (for Dask) by the same mechanism, then great!

You can't directly nest an IndexedArray inside of a UnionArray, but since a UnionArray has an index anyway, those indexes can be combined without affecting the union's contents.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants