Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: Add TableDefinition wrapper for python #5892

Merged
merged 22 commits into from
Aug 15, 2024

Conversation

devinrsmith
Copy link
Member

@devinrsmith devinrsmith commented Aug 3, 2024

  • deephaven.table.TableDefinition: new python wrapper for io.deephaven.engine.table.TableDefinition
  • deephaven.column.ColumnDefinition: new python wrapper for io.deephaven.engine.table.ColumnDefinition
  • deephaven.table.TableDefinitionLike: new type alias to allow for consistent public APIs and conversions into deephaven.table.TableDefinition
  • deephaven.column.Column: deprecated for removal
  • deephaven.jcompat.j_table_definition: deprecated for removal
  • deephaven.stream.kafka.consumer.json_spec: cols_defs specified as List[Tuple[str, DType]] deprecated for removal

Fixes #4822

BREAKING CHANGE: deephaven.column.InputColumn no longer inherits from deephaven.column.Column; as such, it no longer exposes Column's attributes. This is unlikely to affect users as InputColumn is really a structure meant to support new_table.

@devinrsmith devinrsmith self-assigned this Aug 3, 2024
@devinrsmith
Copy link
Member Author

Sketch of impl. Manual testing works. Need to add real tests.

@jmao-denver
Copy link
Contributor

Any use cases to show that the new wrapper is better than a simple new property column_map?

@devinrsmith
Copy link
Member Author

Any use cases to show that the new wrapper is better than a simple new property column_map?

Good Q. The advantage to this approach:

  • Can call java methods using TableDefinition.j_table_definition without needing to re-construct a TableDefinition
  • Can co-locate common methods one might make w/ TableDefinition, ie TableDefinition.table
  • Could introduce pythonic methods that match existing java helpers like io.deephaven.engine.table.TableDefinition#checkMutualCompatibility(io.deephaven.engine.table.TableDefinition) or io.deephaven.engine.table.TableDefinition#checkHasColumn(java.lang.String, java.lang.Class<?>).

I think we could ask the same question about the java side; why don't we use Map<String, ColumnDefinition<?>> everywhere instead of TableDefinition? Because there is good advantage to modelling it as a more specific object to provide common utility around it.

Copy link
Member

@chipkent chipkent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor docs comment, but I'm otherwise not opposed.

py/server/deephaven/table.py Outdated Show resolved Hide resolved
@devinrsmith devinrsmith marked this pull request as ready for review August 13, 2024 14:01
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the TableDefinition wrapper needs work, mostly re: ordering.


j_object_type = _JColumnDefinition

def __init__(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a constructor that directly wraps an existing _JColumnDefinition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This includes that case:

Column(j_column_definition=...)

And we use this construction as the basis for TableDefinition as a Mapping[str, Column].

It looks funky because we need to maintain compatibility with the existing constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just agreed on a call to introduce a superclass called ColumnDefinition, maybe, in order to preserve the existing behavior and also provide the kind of wrapper we want users to use.


j_object_type = _JColumn

def __init__(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, this is supported via j_column_definition

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We said this may not actually be a wrapper. I'm not clear that it's a wrapper around a JColumnDefinition, regardless.

py/server/deephaven/table.py Outdated Show resolved Hide resolved
Comment on lines 506 to 524
@cached_property
def is_refreshing(self) -> bool:
"""Whether this table is refreshing."""
if self._is_refreshing is None:
self._is_refreshing = self.j_table.isRefreshing()
return self._is_refreshing
return self.j_table.isRefreshing()

@property
@cached_property
def is_blink(self) -> bool:
"""Whether this table is a blink table."""
return _JBlinkTableTools.isBlink(self.j_table)

@property
@cached_property
def update_graph(self) -> UpdateGraph:
"""The update graph of the table."""
if self._update_graph is None:
self._update_graph = UpdateGraph(self.j_table.getUpdateGraph())
return self._update_graph
return UpdateGraph(self.j_table.getUpdateGraph())

@property
@cached_property
def is_flat(self) -> bool:
"""Whether this table is guaranteed to be flat, i.e. its row set will be from 0 to number of rows - 1."""
if self._is_flat is None:
self._is_flat = self.j_table.isFlat()
return self._is_flat
return self.j_table.isFlat()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isFlat is technically mutable, as is isRefreshing. That said, a Python wrapper should never see a table that hasn't had these set in their final form, so I don't know that it actually matters.

isBlink is effectively immutable, on the other hand; that is, once you look at it, it can't be changed (like anything else stored in attributes). updateGraph is final, also OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a possibility for mutation, I would remove the caching because we'll never remember to make a change here if we do get to a case where mutation happens. I also doubt that these get called enough to make a significant performance difference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm convinced the caching is fine. But I also agree that we are likely not calling these anywhere performance critical.

py/server/deephaven/table.py Show resolved Hide resolved
py/server/deephaven/table.py Show resolved Hide resolved
Copy link
Contributor

@jmao-denver jmao-denver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. some concerns about the use of mutable @cached_property but can accept the trade-off
  2. there seems to be a small breaking change

Comment on lines 87 to 89
@property
def j_column_header(self):
return _JColumnHeader.of(self.name, self.data_type.qst_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't need to worry about the mutability of @cached_property, should we use that here too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like more of an internal property and only used when creating an InputColumn; I don't think we expect users to call j_column_header? I would be okay removing it or forcing it to start w/ _, and I think tracing through the callstack, it doesn't make sense to cache it (I don't think we'll ever re-use it).

Where we might conceivably re-use it in the future is if we had a helper from Column -> InputColumn. IE,

my_col : InputColumn = Column("Foo", dtypes.int32).with_data([1, 2, 3])

But at least with the current interfaces, users are always going through InputColumn constructor or helper methods (bool_col, byte_col, etc).

def j_object(self) -> jpy.JType:
return self.j_table_definition

@property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cached_property?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me if we want to cache Table; this would mean that the table would exist until TableDefinition is GCd.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur with Devin.

py/server/tests/test_column.py Outdated Show resolved Hide resolved
 !2, 3.9      /home/runner/work/deephaven-core/deephaven-core/py/server/deephaven/table.py
  builtin generic type annotation (collections.abc.Iterable[..]) requires !2, 3.9
  builtin generic type annotation (collections.abc.Mapping[..]) requires !2, 3.9

Minimum required versions: 3.9
Incompatible versions:     2
Target versions not met:   3.8
py/server/deephaven/numpy.py Outdated Show resolved Hide resolved
py/server/deephaven/pandas.py Outdated Show resolved Hide resolved
@@ -75,7 +73,7 @@ def publish_failure(self, failure: Exception) -> None:

def table_publisher(
name: str,
col_defs: Union[Dict[str, DType], List[Column]],
table_definition: TableDefinitionAlias,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change since the arg is renamed. Should we temporarily support the old parameter with a deprecation warning, or should we directly break code? This only ends up being breaking if the user code is using named parameters.

@jmao-denver

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've changed all the relevant param names back.

with self.subTest("from table definition"):
append_only_input_table = input_table(col_defs=col_defs)
append_only_input_table = input_table(col_defs=t.definition)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. col_defs is now a misnomer. This is a table definition.
  2. We are inconsistent on if we name the variable as col_defs or table_def across the API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This item needs some attention.

py/server/deephaven/column.py Outdated Show resolved Hide resolved
py/server/deephaven/column.py Show resolved Hide resolved
py/server/deephaven/column.py Show resolved Hide resolved
py/server/deephaven/table.py Outdated Show resolved Hide resolved
py/server/deephaven/table.py Outdated Show resolved Hide resolved
py/server/deephaven/table.py Show resolved Hide resolved
py/server/deephaven/table.py Outdated Show resolved Hide resolved
py/server/deephaven/table.py Outdated Show resolved Hide resolved
Comment on lines 506 to 524
@cached_property
def is_refreshing(self) -> bool:
"""Whether this table is refreshing."""
if self._is_refreshing is None:
self._is_refreshing = self.j_table.isRefreshing()
return self._is_refreshing
return self.j_table.isRefreshing()

@property
@cached_property
def is_blink(self) -> bool:
"""Whether this table is a blink table."""
return _JBlinkTableTools.isBlink(self.j_table)

@property
@cached_property
def update_graph(self) -> UpdateGraph:
"""The update graph of the table."""
if self._update_graph is None:
self._update_graph = UpdateGraph(self.j_table.getUpdateGraph())
return self._update_graph
return UpdateGraph(self.j_table.getUpdateGraph())

@property
@cached_property
def is_flat(self) -> bool:
"""Whether this table is guaranteed to be flat, i.e. its row set will be from 0 to number of rows - 1."""
if self._is_flat is None:
self._is_flat = self.j_table.isFlat()
return self._is_flat
return self.j_table.isFlat()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a possibility for mutation, I would remove the caching because we'll never remember to make a change here if we do get to a case where mutation happens. I also doubt that these get called enough to make a significant performance difference.


self._schema = _td_to_columns(self._definition)
return self._schema
@property
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary reason to cache is to prevent jpy java calls. In all the cases where cache_property isn't used, it's either all pure python, or the delegated layer is handling the caching.

"""The column names of the table."""
return list(self.definition.keys())

@property
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

py/server/tests/test_table_definition.py Outdated Show resolved Hide resolved
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, assuming you guys get on the same page regarding deprecation/breaking changes, etc.

py/server/deephaven/column.py Show resolved Hide resolved
@@ -327,35 +326,6 @@ def _j_array_to_series(dtype: DType, j_array: jpy.JType, conv_null: bool) -> pd.

return s

def j_table_definition(table_definition: Union[Dict[str, DType], List[Column], None]) -> Optional[jpy.JType]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this internal before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have a _, so it wasn't fully private / internal. It is possible someone used it, so it would be a breaking change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added it back with deprecation for removal warning.

@@ -71,7 +71,7 @@ def _validate(inputs: Input, outputs: Output, table: Table):
input_columns_list = [input_.input.getColNames()[i] for input_ in inputs for i in
range(len(input_.input.getColNames()))]
input_columns = set(input_columns_list)
table_columns = {col.name for col in table.columns}
table_columns = set(table.definition.keys())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not order-preserving. Why is Python like this? dict is ordered, but set isn't?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous set construction was not order preserving either. Looking at the code, the order doesn't matter, it's just after fast check (and I think intersection / difference).

@@ -95,7 +95,7 @@ def to_numpy(table: Table, cols: List[str] = None) -> np.ndarray:
if table.is_refreshing:
table = table.snapshot()

col_def_dict = {col.name: col for col in table.columns}
col_def_dict = table.definition
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline and dump the var?

Copy link
Member Author

@devinrsmith devinrsmith Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The renamed the variable, but it's a simpler change to not inline... Looking at the method, I think there is room for further improvements, but I don't want to muck with any of this logic atm.

Comment on lines 506 to 524
@cached_property
def is_refreshing(self) -> bool:
"""Whether this table is refreshing."""
if self._is_refreshing is None:
self._is_refreshing = self.j_table.isRefreshing()
return self._is_refreshing
return self.j_table.isRefreshing()

@property
@cached_property
def is_blink(self) -> bool:
"""Whether this table is a blink table."""
return _JBlinkTableTools.isBlink(self.j_table)

@property
@cached_property
def update_graph(self) -> UpdateGraph:
"""The update graph of the table."""
if self._update_graph is None:
self._update_graph = UpdateGraph(self.j_table.getUpdateGraph())
return self._update_graph
return UpdateGraph(self.j_table.getUpdateGraph())

@property
@cached_property
def is_flat(self) -> bool:
"""Whether this table is guaranteed to be flat, i.e. its row set will be from 0 to number of rows - 1."""
if self._is_flat is None:
self._is_flat = self.j_table.isFlat()
return self._is_flat
return self.j_table.isFlat()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm convinced the caching is fine. But I also agree that we are likely not calling these anywhere performance critical.

input_data: Any = field(default=None)

def __post_init__(self):
def j_object(self) -> jpy.JType:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, but do a quick check.

return ColumnType(self.j_column_definition.getColumnType())


class Column(ColumnDefinition):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with deprecation with a plan to remove in the next release. Create a ticket for removal and assign it to one of us.

py/server/deephaven/column.py Outdated Show resolved Hide resolved
py/server/deephaven/column.py Show resolved Hide resolved
py/server/deephaven/column.py Show resolved Hide resolved
py/server/deephaven/column.py Show resolved Hide resolved
py/server/deephaven/column.py Outdated Show resolved Hide resolved
@@ -327,35 +326,6 @@ def _j_array_to_series(dtype: DType, j_array: jpy.JType, conv_null: bool) -> pd.

return s

def j_table_definition(table_definition: Union[Dict[str, DType], List[Column], None]) -> Optional[jpy.JType]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have a _, so it wasn't fully private / internal. It is possible someone used it, so it would be a breaking change.

with self.subTest("from table definition"):
append_only_input_table = input_table(col_defs=col_defs)
append_only_input_table = input_table(col_defs=t.definition)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This item needs some attention.

@devinrsmith devinrsmith changed the title feat: Add TableDefinition wrapper for python feat!: Add TableDefinition wrapper for python Aug 15, 2024
chipkent
chipkent previously approved these changes Aug 15, 2024
rcaudy
rcaudy previously approved these changes Aug 15, 2024
@devinrsmith devinrsmith enabled auto-merge (squash) August 15, 2024 22:44
@devinrsmith devinrsmith merged commit 07786ef into deephaven:main Aug 15, 2024
16 checks passed
@devinrsmith devinrsmith deleted the python-table-definition branch August 15, 2024 23:04
@github-actions github-actions bot locked and limited conversation to collaborators Aug 15, 2024
@deephaven-internal
Copy link
Contributor

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#295

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TableDefinition python wrapper
5 participants