Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: pandas and double nested vectors issue 885 #912

Merged
merged 11 commits into from
Jul 7, 2023

Conversation

ioanaif
Copy link
Collaborator

@ioanaif ioanaif commented Jul 3, 2023

Double nested ROOT arrays when opened with pandas their type remained to be an STLVector. This happened because AwkwardForth was not being invoked for pandas and the processing result was a numpy array which was then added to a series without any other transformations.

@ioanaif ioanaif force-pushed the ioanaif/fix-885-pandas-double-nested-vectors branch from 171b32d to fb736ec Compare July 3, 2023 10:08
@ioanaif ioanaif linked an issue Jul 3, 2023 that may be closed by this pull request
@ioanaif ioanaif requested a review from jpivarski July 5, 2023 12:02
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is generally the right solution: there are array types that uproot.interpretation.library.Pandas.finalize should handle, but doesn't.

However, passing any array through ak.to_numpy(array) is too all-inclusive. Also, using any exception in that (and also the pandas.Series constructor) to trigger awkward-pandas is also too broad. There are details below: the safest way to fix this particular bug is to catch the previously-unhandled type of array (is it ObjectArray?) and prepare that the same way Awkward-handling code does, then pass it to awkward-pandas.

(I have a suggested change below, but the more I think about it, that suggested change is wrong: it assumes that the previously-unhandled type of array is Awkward, but I think it might be ObjectArray.)

src/uproot/interpretation/objects.py Outdated Show resolved Hide resolved
src/uproot/interpretation/objects.py Outdated Show resolved Hide resolved
Comment on lines 836 to 838
try:
return pandas.Series(uproot.extras.awkward().to_numpy(array), index=index)
except:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If array is a NumPy array, we don't want to convert it into an Awkward Array just to convert it back.

Suggested change
try:
return pandas.Series(uproot.extras.awkward().to_numpy(array), index=index)
except:
if (
isinstance(array, numpy.ndarray)
and array.dtype.names is None
and len(array.shape) == 1
):
return pandas.Series(array, index=index)
elif type(array).__module__.startswith("awkward"):
try:
tmp = array.to_numpy()
except ValueError:
pass
return pandas.Series(tmp, index=index)
if True:

The next section is not appropriately indented, so I included a temporary if True: that will allow it to be tested, but it should be fixed.

Generally, we want try-except logic to be as narrow as possible because over-catching exceptions hides bugs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the originally failing case, what is the actual type of array? The safest way to fix it would be to only add a check for that additional type. It might mean we'd have to come back to it if there's another case we didn't handle, but fixing them one at a time minimizes the chance of breaking currently-working code.

If array is an ObjectArray because it went through the no-Forth fallback, then there are other steps in preparing the Python objects it contains as an Awkward Array (to wrap in a Pandas column). Specifically,

elif isinstance(interpretation, uproot.interpretation.objects.AsObjects):
form = json.loads(
interpretation.awkward_form(interpretation.branch.file).to_json()
)
unlabeled = awkward.from_iter(
(_object_to_awkward_json(form, x) for x in array), highlevel=False
)
return _awkward_add_doc(
awkward,
awkward.Array(_awkward_json_to_array(awkward, form, unlabeled)),
branch,
ak_add_doc,
)

The intermediate steps are needed to set numeric types appropriately (Python floating point numbers are float64, but the Form might call for float32) and it adds doc strings to the Awkward __doc__ parameter (so that it shows up in Jupyter and IDEs).

(The best fix is likely different from my suggestion above.)

@ioanaif ioanaif force-pushed the ioanaif/fix-885-pandas-double-nested-vectors branch from 75e2b09 to 68fbaf9 Compare July 6, 2023 13:25
@ioanaif ioanaif requested a review from jpivarski July 6, 2023 16:50
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! Thank you!

I think it's done and can be merged, but I'll let you do that in case you have anything to add at the last minute.

@ioanaif ioanaif merged commit a8644df into main Jul 7, 2023
@ioanaif ioanaif deleted the ioanaif/fix-885-pandas-double-nested-vectors branch July 7, 2023 07:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Double nested Awkward arrays are STL:Vector using library="pd"
2 participants