arrays dropping (or overwriting?) branches with similar names #1057
-
I have a TTree with branches that are all the same C++ class and I would like to read all of them into memory. I was able to generate a small enough file, which I've just uploaded here if you wish to play with it: test.root.tar.gz This is where I'm at version-wise.
For those following along at home, I've loaded the TTree from the example file with
t = uproot.open({'test.root': 'performance/by_event'})
The shape of my TTree is pretty simple. As mentioned, each branch (except a completion flag) is the same C++ class, so there is a lot of repetition.
In [3]: t.show()
name | typename | interpretation
---------------------+--------------------------+-------------------------------
completed | bool | AsDtype('bool')
__ALL__ | framework::performanc... | AsGroup(<TBranchElement '__...
__ALL__/start_time_ | int64_t | AsDtype('>i8')
__ALL__/duration_ | double | AsDtype('>f8')
MultiTry | framework::performanc... | AsGroup(<TBranchElement 'Mu...
MultiTry/start_time_ | int64_t | AsDtype('>i8')
MultiTry/duration_ | double | AsDtype('>f8')
Recon | framework::performanc... | AsGroup(<TBranchElement 'Re...
Recon/start_time_ | int64_t | AsDtype('>i8')
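Recon/duration_ | double | AsDtype('>f8')
Both the `library='ak'` and `library='np'` outputs contain only a single start_time_ and a single duration_:
In [4]: t.arrays(library='ak').type.show()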
4 * {
completed: bool,
start_time_: int64,
duration_: float64
}
In [5]: t.arrays(library='np')
Out[5]:
{'completed': array([ True, False, True, True]),
'start_time_': array([1701965077841635851, 1701965077856929667, 1701965077857039110,
1701965077857187765]),
'duration_': array([7.22315e-04, 5.76170e-05, 8.84370e-05, 4.10540e-05])}
The `library='pd'` call raises a KeyError instead:
In [6]: t.arrays(library='pd')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[6], line 1
----> 1 t.arrays(library='pd')
File ~/.local/lib/python3.10/site-packages/uproot/behaviors/TBranch.py:904, in HasBranches.arrays(self, expressions, cut, filter_name, filter_typename, filter_branch, aliases, language, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library, ak_add_doc, how)
897 del arrays
899 expression_context = [
900 (e, c) for e, c in expression_context if c["is_primary"] and not c["is_cut"]
901 ]
903 return _ak_add_doc(
--> 904 library.group(output, expression_context, how), self, ak_add_doc
905 )
File ~/.local/lib/python3.10/site-packages/uproot/interpretation/library.py:872, in Pandas.group(self, arrays, expression_context, how)
870 elif uproot._util.isstr(how) or how is None:
871 arrays, names = _pandas_only_series(pandas, arrays, expression_context)
--> 872 return _pandas_memory_efficient(pandas, arrays, names)
874 else:
875 raise TypeError(
876 f"for library {self.name}, how must be tuple, list, dict, str (for "
877 "pandas.merge's 'how' parameter, or None (for one or more"
878 "DataFrames without merging)"
879 )
File ~/.local/lib/python3.10/site-packages/uproot/interpretation/library.py:798, in _pandas_memory_efficient(pandas, series, names)
796 out = series[name].to_frame(name=name)
797 else:
--> 798 out[name] = series[name]
799 del series[name]
800 if out is None:
KeyError: 'start_time_'
To me, this looks like an issue with how the branch-name shortening handles branches of the same or similar types. It doesn't actually struggle with loading the data, since I can ask for the arrays as a tuple:
In [7]: t.arrays(library='np', how=tuple)
Out[7]:
(array([ True, False, True, True]),
array([1701965077841635851, 1701965077856929667, 1701965077857039110,
1701965077857187765]),
array([7.22315e-04, 5.76170e-05, 8.84370e-05, 4.10540e-05]),
array([1701965077841635851, 1701965077856929667, 1701965077857039110,
1701965077857187765]),
array([7.22315e-04, 5.76170e-05, 8.84370e-05, 4.10540e-05]),
array([1701965077841635851, 1701965077856929667, 1701965077857039110,
1701965077857187765]),
array([7.22315e-04, 5.76170e-05, 8.84370e-05, 4.10540e-05]))
My questions are
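The name-shortening hypothesis from the post above can be illustrated with a small sketch. This is not uproot's implementation: `short_name` is a hypothetical helper, and the full branch paths stand in for the arrays themselves. It only shows why a container keyed by the short name ends up with three entries while a tuple keeps all seven.

```python
# Hypothetical illustration only: this is not how uproot is implemented.
full_paths = [
    "completed",
    "__ALL__/start_time_", "__ALL__/duration_",
    "MultiTry/start_time_", "MultiTry/duration_",
    "Recon/start_time_", "Recon/duration_",
]


def short_name(path):
    """Keep only the last path component, e.g. 'Recon/duration_' -> 'duration_'."""
    return path.rsplit("/", 1)[-1]


# The full path plays the role of the array that would be read from the file.
as_tuple = tuple(full_paths)  # positional container: nothing is lost
as_dict = {short_name(p): p for p in full_paths}  # same key written three times

print(len(as_tuple))  # 7
print(list(as_dict))  # ['completed', 'start_time_', 'duration_']
```

In a plain Python dict the later entries silently replace the earlier ones, which would match the three-field results seen with `library='ak'` and `library='np'`.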
-
Without updating the
In [14]: t.arrays(expressions=['__ALL__/duration_','MultiTry/duration_'], library='np')
Out[14]:
{'__ALL__/duration_': array([7.22315e-04, 5.76170e-05, 8.84370e-05, 4.10540e-05]),
'MultiTry/duration_': array([5.4085e-04, 5.5649e-05, 7.8560e-05, 3.5065e-05])}
We can get exactly what I'm interested in if we ask for the base branches only:
In [18]: t.arrays(expressions=[k for k in t.keys() if '/' not in k]).type.show()
4 * {
completed: bool,
__ALL__: {
start_time_: int64,
duration_: float64
},
MultiTry: {
start_time_: int64,
duration_: float64
},
Recon: {
start_time_: int64,
duration_: float64
}
}
In [19]: t.arrays(expressions=[k for k in t.keys() if '/' not in k], library='np')
Out[19]:
{'completed': array([ True, False, True, True]),
'__ALL__': {'start_time_': array([1701965077841635851, 1701965077856929667, 1701965077857039110,
1701965077857187765]),
'duration_': array([7.22315e-04, 5.76170e-05, 8.84370e-05, 4.10540e-05])},
'MultiTry': {'start_time_': array([1701965077841635851, 1701965077856929667, 1701965077857039110,
1701965077857187765]),
'duration_': array([7.22315e-04, 5.76170e-05, 8.84370e-05, 4.10540e-05])},
'Recon': {'start_time_': array([1701965077841635851, 1701965077856929667, 1701965077857039110,
1701965077857187765]),
'duration_': array([7.22315e-04, 5.76170e-05, 8.84370e-05, 4.10540e-05])}}
`library='pd'` still errors out, though:
In [20]: t.arrays(expressions=[k for k in t.keys() if '/' not in k], library='pd')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-0272b5a4f214> in ?()
----> 1 t.arrays(expressions=[k for k in t.keys() if '/' not in k], library='pd')
~/.local/lib/python3.10/site-packages/uproot/behaviors/TBranch.py in ?(self, expressions, cut, filter_name, filter_typename, filter_branch, aliases, language, entry_start, entry_stop, decompression_executor, interpretation_executor, array_cache, library, ak_add_doc, how)
900 (e, c) for e, c in expression_context if c["is_primary"] and not c["is_cut"]
901 ]
902
903 return _ak_add_doc(
--> 904 library.group(output, expression_context, how), self, ak_add_doc
905 )
~/.local/lib/python3.10/site-packages/uproot/interpretation/library.py in ?(self, arrays, expression_context, how)
868 return {_rename(name, c): arrays[name] for name, c in expression_context}
869
870 elif uproot._util.isstr(how) or how is None:
871 arrays, names = _pandas_only_series(pandas, arrays, expression_context)
--> 872 return _pandas_memory_efficient(pandas, arrays, names)
873
874 else:
875 raise TypeError(
~/.local/lib/python3.10/site-packages/uproot/interpretation/library.py in ?(pandas, series, names)
794 out = pandas.Series(data=series[name]).to_frame(name=name)
795 else:
796 out = series[name].to_frame(name=name)
797 else:
--> 798 out[name] = series[name]
799 del series[name]
800 if out is None:
801 return pandas.DataFrame(data=series, columns=names)
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in ?(self, key, value)
3964 self._setitem_frame(key, value)
3965 elif isinstance(key, (Series, np.ndarray, list, Index)):
3966 self._setitem_array(key, value)
3967 elif isinstance(value, DataFrame):
-> 3968 self._set_item_frame_value(key, value)
3969 elif (
3970 is_list_like(value)
3971 and not self.columns.is_unique
/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py in ?(self, key, value)
4119 self._set_item_mgr(key, arraylike)
4120 return
4121
4122 if len(value.columns) != 1:
-> 4123 raise ValueError(
4124 "Cannot set a DataFrame with multiple columns to the single "
4125 f"column {key}"
4126 )
ValueError: Cannot set a DataFrame with multiple columns to the single column __ALL__
This still leaves the question of whether this is a bug or expected/understood behavior. I was expecting the results shown above to be the default behavior, but perhaps the implementation is too convoluted to do that?
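One possible workaround for the pandas case, sketched under assumptions: take the nested dictionary that `library='np'` returns for the base branches and flatten it into unique column names before building a DataFrame. `flatten_groups` is a hypothetical helper, not part of uproot, and plain lists with made-up values stand in for the NumPy arrays.

```python
def flatten_groups(nested, sep="."):
    """Flatten {'Recon': {'duration_': x}} into {'Recon.duration_': x}.

    Hypothetical helper for illustration; not part of uproot.
    """
    flat = {}
    for name, value in nested.items():
        if isinstance(value, dict):
            # Recurse so that deeper nesting is also given a prefixed name.
            for sub_name, sub_value in flatten_groups(value, sep).items():
                flat[f"{name}{sep}{sub_name}"] = sub_value
        else:
            flat[name] = value
    return flat


# Placeholder values with the same layout as the nested np output above.
nested = {
    "completed": [True, False, True, True],
    "__ALL__": {"start_time_": [1, 2, 3, 4], "duration_": [0.4, 0.3, 0.2, 0.1]},
    "MultiTry": {"start_time_": [5, 6, 7, 8], "duration_": [0.8, 0.7, 0.6, 0.5]},
}

flat = flatten_groups(nested)
print(list(flat))
```

Because every key in `flat` is unique and every value has the same length, passing it to `pandas.DataFrame(flat)` should give one column per sub-branch.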
-
ROOT never does it by default and it is not mandatory. If you want it, then when you create a new branch for an object whose data members will be split into subbranches, you have to explicitly give it a name ending with a "." (dot) yourself.
That is easy. The name of the "
{
TObject o;
TTree *t = new TTree("t", "t");
t->Branch("dumb", &o);   // no trailing dot: subbranches keep the bare member names
t->Branch("nice.", &o);  // trailing dot: subbranch names are prefixed with "nice."
t->Print();              // compare the subbranch names of the two branches
}
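The effect of the trailing dot can be modelled in a few lines of Python. This is a toy model of the naming convention described above, not ROOT code: `subbranch_name` is a hypothetical function, and the member names `fUniqueID` and `fBits` are assumed to be the split members of `TObject`.

```python
def subbranch_name(branch, member):
    """Toy model: prefix the member name only when the branch name ends with '.'."""
    return branch + member if branch.endswith(".") else member


members = ["fUniqueID", "fBits"]
print([subbranch_name("dumb", m) for m in members])   # ['fUniqueID', 'fBits']
print([subbranch_name("nice.", m) for m in members])  # ['nice.fUniqueID', 'nice.fBits']

# Two branches of the same class: without the dot the names collide.
fields = ("start_time_", "duration_")
without_dot = {subbranch_name(b, f) for b in ("MultiTry", "Recon") for f in fields}
with_dot = {subbranch_name(b, f) for b in ("MultiTry.", "Recon.") for f in fields}
print(len(without_dot), len(with_dot))  # 2 4
```

Under this model, the file in the original question behaves like the "dumb" case, which is why the sub-branch names repeat across `__ALL__`, `MultiTry` and `Recon`.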
-
Yes, as shown in my micro example (in my previous post).
A more complete answer which is shorter and a little more refined is
for `library='ak'` (the default) or `library='np'`, and
for `library='pd'`.