Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in converting a DH table with all columns of the same type #4566

Merged
merged 1 commit into from
Sep 28, 2023

Conversation

alexpeters1208
Copy link
Contributor

There is a bug in the table.to_pandas() function that causes tables that have columns of all the same data type to not be properly converted. Here is an example of the bug:

from deephaven import empty_table
from deephaven import pandas as dhpd

test_table1 = empty_table(10).update(["X = i", "Y = Math.sin(X)"])
test_df1 = dhpd.to_pandas(test_table1.view(["X", "Y"]))

test_table2 = empty_table(10).update(["X = (float)i", "Y = (float)Math.sin(X)"])
test_df2 = dhpd.to_pandas(test_table2.view(["X", "Y"]))

Here are the resulting tables and metadata:
Screenshot 2023-09-28 at 11 45 06 AM
Screenshot 2023-09-28 at 11 45 54 AM

This is due to a small block of code that creates a set containing all of the column datatypes, checks to see if this set is singular, and makes a call to np.stack() to collect the data into a single nd-array before calling pd.DataFrame. The problem is that np.stack() is not made aware of the desired datatypes of its elements, and they are not type-aware automatically, so the resulting nd-array has type object. This entire process can be bypassed, as pandas can look at the data and figure out the types correctly, regardless of whether or not all columns are the same.

I suspect stacking things into a single nd-array if appropriate before calling to_pandas() was done for efficiency. I am curious what the set creation, checking length of the set, creating the list to call np.stack() on, and calling np.stack() costs in terms of efficiency. This whole process adds at least two for-loops through the columns. It is not clear to me that this beats just calling pd.DataFrame() on the dict of data without worrying about what the types are. Someone can probably tell me more about this.

Here are the results of the above code with my fix:
Screenshot 2023-09-28 at 11 54 15 AM
Screenshot 2023-09-28 at 11 54 37 AM

Copy link
Contributor

@jmao-denver jmao-denver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the changes make sense:

  1. np.stack actually make copies of the input arrays
  2. pd.DataFrame will not make a copy of the data if the values of the input dict are all pd.Series and the copy=Flase.

@alexpeters1208 alexpeters1208 merged commit cca5058 into main Sep 28, 2023
20 checks passed
@alexpeters1208 alexpeters1208 deleted the fix-to-pandas-bug branch September 28, 2023 18:26
@github-actions github-actions bot locked and limited conversation to collaborators Sep 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants