-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: pandas and double nested vectors issue 885 #912
Conversation
171b32d
to
fb736ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is generally the right solution: there are array
types that uproot.interpretation.library.Pandas.finalize
should handle, but doesn't.
However, passing any array
through ak.to_numpy(array)
is too all-inclusive. Also, using any exception in that (and also the pandas.Series
constructor) to trigger awkward-pandas is also too broad. There are details below: the safest way to fix this particular bug is to catch the previously-unhandled type of array
(is it ObjectArray
?) and prepare that the same way Awkward-handling code does, then pass it to awkward-pandas.
(I have a suggested change below, but the more I think about it, that suggested change is wrong: it assumes that the previously-unhandled type of array
is Awkward, but I think it might be ObjectArray
.)
src/uproot/interpretation/library.py
Outdated
try: | ||
return pandas.Series(uproot.extras.awkward().to_numpy(array), index=index) | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If array
is a NumPy array, we don't want to convert it into an Awkward Array just to convert it back.
try: | |
return pandas.Series(uproot.extras.awkward().to_numpy(array), index=index) | |
except: | |
if ( | |
isinstance(array, numpy.ndarray) | |
and array.dtype.names is None | |
and len(array.shape) == 1 | |
): | |
return pandas.Series(array, index=index) | |
elif type(array).__module__.startswith("awkward"): | |
try: | |
tmp = array.to_numpy() | |
except ValueError: | |
pass | |
return pandas.Series(tmp, index=index) | |
if True: |
The next section is not appropriately indented, so I included a temporary if True:
that will allow it to be tested, but it should be fixed.
Generally, we want try
-except
logic to be as narrow as possible because over-catching exceptions hides bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the originally failing case, what is the actual type of array
? The safest way to fix it would be to only add a check for that additional type. It might mean we'd have to come back to it if there's another case we didn't handle, but fixing them one at a time minimizes the chance of breaking currently-working code.
If array
is an ObjectArray
because it went through the no-Forth fallback, then there are other steps in preparing the Python objects it contains as an Awkward Array (to wrap in a Pandas column). Specifically,
uproot5/src/uproot/interpretation/library.py
Lines 570 to 582 in fcb7b14
elif isinstance(interpretation, uproot.interpretation.objects.AsObjects): | |
form = json.loads( | |
interpretation.awkward_form(interpretation.branch.file).to_json() | |
) | |
unlabeled = awkward.from_iter( | |
(_object_to_awkward_json(form, x) for x in array), highlevel=False | |
) | |
return _awkward_add_doc( | |
awkward, | |
awkward.Array(_awkward_json_to_array(awkward, form, unlabeled)), | |
branch, | |
ak_add_doc, | |
) |
The intermediate steps are needed to set numeric types appropriately (Python floating point numbers are float64
, but the Form might call for float32
) and it adds doc strings to the Awkward __doc__
parameter (so that it shows up in Jupyter and IDEs).
(The best fix is likely different from my suggestion above.)
75e2b09
to
68fbaf9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect! Thank you!
I think it's done and can be merged, but I'll let you do that in case you have anything to add at the last minute.
Double nested ROOT arrays when opened with pandas their type remained to be an
STLVector
. This happened becauseAwkwardForth
was not being invoked forpandas
and the processing result was a numpy array which was then added to aseries
without any other transformations.