[HOPSWORKS-2134] Extend Query constructor with filter capability (#174)
moritzmeister authored Dec 22, 2020
1 parent cafe316 commit 0130182
Showing 20 changed files with 870 additions and 323 deletions.
6 changes: 6 additions & 0 deletions auto_doc.py
@@ -72,6 +72,12 @@
"hsfs.storage_connector.StorageConnector"
),
},
"query_vs_dataframe.md": {
"query_methods": keras_autodoc.get_methods("hsfs.constructor.query.Query"),
"query_properties": keras_autodoc.get_properties(
"hsfs.constructor.query.Query"
),
},
"api/connection_api.md": {
"connection": ["hsfs.connection.Connection"],
"connection_properties": keras_autodoc.get_properties(
4 changes: 2 additions & 2 deletions docs/overview.md
@@ -50,7 +50,7 @@ Entities within the Feature Store are organized hierarchically. On the most gran

[**Feature Groups**](generated/feature_group.md) are entities that contain both metadata about the grouped features, as well as information of the jobs used to ingest the data contained in a feature group and also the actual location of the data (HopsFS or externally, such as S3). Typically, feature groups represent a logical set of features coming from the same data source sharing a common primary key. Feature groups also contain the schema and type information of the features, for the user to know how to interpret the data.

-Feature groups can also be used to compute [Statistics](generated/statistics.md) over features, or to define [Data Validation Rules](generated/data_validation.md) using the statistics and schema information.
+Feature groups can also be used to compute Statistics over features, or to define Data Validation Rules using the statistics and schema information.

In order to enable [online serving](overview.md#online-vs-offline-feature-store) for features of a feature group, the feature group needs to be made available as an online feature group.

@@ -60,7 +60,7 @@ In order to be able to train machine learning models efficiently, the feature da

Training datasets can be created with features from any number of feature groups, as long as the feature groups can be joined in a meaningful way.

-Users are able to compute [Statistics](generated/statistics.md) also for training datasets, which will make it easy to understand a dataset's characteristics also in the future.
+Users are able to compute Statistics also for training datasets, which will make it easy to understand a dataset's characteristics also in the future.

The Hopsworks Feature Store has support for writing training datasets either to the distributed file system of Hopsworks - HopsFS - or to external storage such as S3.

2 changes: 1 addition & 1 deletion docs/quickstart.md
@@ -23,7 +23,7 @@ The Hopsworks feature store library is called `hsfs` (**H**opswork**s**
The library is Apache V2 licensed and available [here](https://github.com/logicalclocks/feature-store-api). The library is currently available for Python and JVM languages such as Scala and Java.
If you want to connect to the Feature Store from outside Hopsworks, see our [integration guides](setup.md).

-The library is built around metadata-objects, representing entities within the Feature Store. You can modify metadata by changing it in the metadata-objects and subsequently persisting it to the Feature Store. In fact, the Feature Store itself is also represented by an object. Furthermore, these objects have methods to save data along with the entities in the feature store. This data can be materialized from [Spark or Pandas DataFrames, or the `HSFS`-**Query** abstraction](generated/programming_interface.md).
+The library is built around metadata-objects, representing entities within the Feature Store. You can modify metadata by changing it in the metadata-objects and subsequently persisting it to the Feature Store. In fact, the Feature Store itself is also represented by an object. Furthermore, these objects have methods to save data along with the entities in the feature store. This data can be materialized from [Spark or Pandas DataFrames, or the `HSFS`-**Query** abstraction](generated/query_vs_dataframe.md).

### Guide Notebooks

183 changes: 183 additions & 0 deletions docs/templates/query_vs_dataframe.md
@@ -0,0 +1,183 @@
# Query vs DataFrame

HSFS provides a DataFrame API to ingest data into the Hopsworks Feature Store. You can also retrieve feature data as a DataFrame, which can either be used directly to train models or be [materialized to file(s)](training_dataset.md) for later use in model training.

The idea of the Feature Store is to have pre-computed features available for both training and serving models. The key operations required to generate training datasets from reusable features are feature selection, joins, filters and point-in-time queries. To enable this functionality, we are introducing a new expressive Query abstraction with `HSFS` that provides these operations and guarantees reproducible creation of training datasets from features in the Feature Store.

The new joining functionality is heavily inspired by the APIs used by Pandas to merge DataFrames. The APIs allow you to specify which features to select from which feature group, how to join them and which features to use in join conditions.

=== "Python"
```python
# create a query
feature_join = rain_fg.select_all() \
    .join(temperature_fg.select_all(), on=["date", "location_id"]) \
    .join(location_fg.select_all())

td = fs.create_training_dataset("rain_dataset",
                                version=1,
                                label="weekly_rain",
                                data_format="tfrecords")

# materialize query in the specified file format
td.save(feature_join)

# use materialized training dataset for training, possibly in a different environment
td = fs.get_training_dataset("rain_dataset", version=1)

# get TFRecordDataset to use in a TensorFlow model
dataset = td.tf_data().tf_record_dataset(batch_size=32, num_epochs=100)

# reproduce query for online feature store and drop label for inference
jdbc_querystring = td.get_query(online=True, with_label=False)
```

=== "Scala"
```scala
// create a query
val featureJoin = (rainFg.selectAll()
    .join(temperatureFg.selectAll(), Seq("date", "location_id"))
    .join(locationFg.selectAll()))

val td = (fs.createTrainingDataset()
    .name("rain_dataset")
    .version(1)
    .label("weekly_rain")
    .dataFormat("tfrecords")
    .build())

// materialize query in the specified file format
td.save(featureJoin)

// use materialized training dataset for training, possibly in a different environment
val td = fs.getTrainingDataset("rain_dataset", 1)

// reproduce query for online feature store and drop label for inference
val jdbcQuerystring = td.getQuery(true, false)
```

If a data scientist wants to use a new feature that is not yet available in the Feature Store, she can write code to compute the new feature (using existing features or external data) and ingest the new feature values into the Feature Store. If the new feature is based solely on existing feature values in the Feature Store, we call it a derived feature, as sketched below. The same HSFS APIs can be used to compute derived features as well as features based on external data sources.
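A minimal sketch of such a derived-feature pipeline, reusing the `fs` and `rain_fg` objects from the examples above; the feature group name `rain_derived_fg` and the z-score feature are illustrative assumptions, not part of this commit:

```python
# read existing feature values; assumes the client supports a Pandas
# DataFrame via dataframe_type="pandas" in this environment
df = rain_fg.select(["location_id", "weekly_rainfall"]).read(dataframe_type="pandas")

# derived feature computed solely from existing feature values
df["rainfall_zscore"] = (
    df["weekly_rainfall"] - df["weekly_rainfall"].mean()
) / df["weekly_rainfall"].std()

# ingest the derived feature as its own feature group; name and
# description are hypothetical
derived_fg = fs.create_feature_group(
    name="rain_derived_fg",
    version=1,
    primary_key=["location_id"],
    description="Derived rainfall features",
)
derived_fg.save(df[["location_id", "rainfall_zscore"]])
```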

## The Query Abstraction

Most operations performed on `FeatureGroup` metadata objects will return a `Query` with the applied operation.

### Examples

Selecting features from a feature group is a lazy operation, returning a query with the selected
features only:

=== "Python"
```python
rain_fg = fs.get_feature_group("rain_fg")

# Returns Query
feature_join = rain_fg.select(["location_id", "weekly_rainfall"])
```

=== "Scala"
```scala
val rainFg = fs.getFeatureGroup("rain_fg")

// Returns Query
val featureJoin = rainFg.select(Seq("location_id", "weekly_rainfall"))
```
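Because the query is lazy, no data is read until you explicitly execute it. A short sketch, assuming the generated `Query` API under [Methods](#methods) below includes `show` and `read`:

```python
# preview the first five rows of the query result
feature_join.show(5)

# materialize the full query result into a DataFrame
df = feature_join.read()
```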

#### Join

Similarly, joins return queries. The simplest join is between two feature groups, without specifying a join key or type.
If not specified otherwise, Hopsworks uses the maximal matching subset of the primary keys of the two feature groups as the join key.

=== "Python"
```python
# Returns Query
feature_join = rain_fg.join(temperature_fg)
```

=== "Scala"
```scala
// Returns Query
val featureJoin = rainFg.join(temperatureFg)
```
More complex joins are possible by selecting subsets of features from the joined feature groups and by specifying a join key and join type.
Possible join types are "inner", "left" or "right". Furthermore, it is possible to specify different features for the join key of the left and right feature group.
The join key lists should contain the names of the features to join on.

=== "Python"
```python
feature_join = rain_fg.select_all() \
    .join(temperature_fg.select_all(), on=["date", "location_id"]) \
    .join(location_fg.select_all(), left_on=["location_id"], right_on=["id"], how="left")
```

=== "Scala"
```scala
val featureJoin = (rainFg.selectAll()
    .join(temperatureFg.selectAll(), Seq("date", "location_id"))
    .join(locationFg.selectAll(), Seq("location_id"), Seq("id"), "left"))
```

!!! error "Nested Joins"
    The API currently does not support nested joins, that is, joins of joins.
    You can fall back to Spark DataFrames to cover these cases, as sketched below. However, if you have to use
    joins of joins, most likely there is potential to optimise your feature group structure.
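One possible fallback, sketched here in Python: materialize the two sides of the nested join with `read()` and join the resulting Spark DataFrames directly (the join column is an illustrative assumption):

```python
# materialize each side of the nested join as a Spark DataFrame
rain_df = rain_fg.select_all().read()
weather_df = temperature_fg.select_all() \
    .join(location_fg.select_all()) \
    .read()

# perform the second-level join in Spark instead of the Query API
nested_df = rain_df.join(weather_df, on=["location_id"], how="inner")
```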

#### Filter

In the same way as joins, applying filters to feature groups creates a query with the applied filter.

Filters are constructed with the Python operators `==`, `>=`, `<=`, `!=`, `>` and `<`, and can be combined into conjunctions and disjunctions with the bitwise operators `&` and `|`.
For the Scala part of the API, equivalent methods are available in the `Feature` and `Filter` classes.

=== "Python"
```python
filtered_rain = rain_fg.filter(rain_fg.location_id == 10)
```

=== "Scala"
```scala
val filteredRain = rainFg.filter(rainFg.getFeature("location_id").eq(10))
```

Filters are fully compatible with joins:

=== "Python"
```python
feature_join = rain_fg.select_all() \
    .join(temperature_fg.select_all(), on=["date", "location_id"]) \
    .join(location_fg.select_all(), left_on=["location_id"], right_on=["id"], how="left") \
    .filter((rain_fg.location_id == 10) | (rain_fg.location_id == 20))
```

=== "Scala"
```scala
val featureJoin = (rainFg.selectAll()
    .join(temperatureFg.selectAll(), Seq("date", "location_id"))
    .join(locationFg.selectAll(), Seq("location_id"), Seq("id"), "left")
    .filter(rainFg.getFeature("location_id").eq(10).or(rainFg.getFeature("location_id").eq(20))))
```

Filters can be applied at any point in the query:

=== "Python"
```python
feature_join = rain_fg.select_all() \
    .join(temperature_fg.select_all().filter(temperature_fg.avg_temp >= 22), on=["date", "location_id"]) \
    .join(location_fg.select_all(), left_on=["location_id"], right_on=["id"], how="left") \
    .filter(rain_fg.location_id == 10)
```

=== "Scala"
```scala
val featureJoin = (rainFg.selectAll()
    .join(temperatureFg.selectAll().filter(temperatureFg.getFeature("avg_temp").ge(22)), Seq("date", "location_id"))
    .join(locationFg.selectAll(), Seq("location_id"), Seq("id"), "left")
    .filter(rainFg.getFeature("location_id").eq(10)))
```

## Methods

{{query_methods}}

## Properties

{{query_properties}}
6 changes: 3 additions & 3 deletions mkdocs.yml
@@ -50,9 +50,9 @@ nav:
      - Storage Connector: generated/storage_connector.md
      - Feature: generated/feature.md
      - Training Dataset: generated/training_dataset.md
-     - Dataframe vs. Query: guides/programming_interface.md
-     - Statistics: guides/statistics.md
-     - Data Validation: guides/data_validation.md
+     - Query vs. Dataframe: generated/query_vs_dataframe.md
+     # - Statistics: guides/statistics.md
+     # - Data Validation: guides/data_validation.md
  - API Reference:
      - Connection: generated/api/connection_api.md
      - FeatureStore: generated/api/feature_store_api.md
15 changes: 15 additions & 0 deletions python/hsfs/constructor/__init__.py
@@ -0,0 +1,15 @@
#
# Copyright 2020 Logical Clocks AB
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
140 changes: 140 additions & 0 deletions python/hsfs/constructor/filter.py
@@ -0,0 +1,140 @@
#
# Copyright 2020 Logical Clocks AB
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import json

from hsfs import util


class Filter:
    GE = "GREATER_THAN_OR_EQUAL"
    GT = "GREATER_THAN"
    NE = "NOT_EQUALS"
    EQ = "EQUALS"
    LE = "LESS_THAN_OR_EQUAL"
    LT = "LESS_THAN"

    def __init__(self, feature, condition, value):
        self._feature = feature
        self._condition = condition
        self._value = value

    def json(self):
        return json.dumps(self, cls=util.FeatureStoreEncoder)

    def to_dict(self):
        return {
            "feature": self._feature,
            "condition": self._condition,
            "value": str(self._value),
        }

    def __and__(self, other):
        if isinstance(other, Filter):
            return Logic.And(left_f=self, right_f=other)
        elif isinstance(other, Logic):
            return Logic.And(left_f=self, right_l=other)
        else:
            raise TypeError(
                "Operator `&` expected type `Filter` or `Logic`, got `{}`".format(
                    type(other)
                )
            )

    def __or__(self, other):
        if isinstance(other, Filter):
            return Logic.Or(left_f=self, right_f=other)
        elif isinstance(other, Logic):
            return Logic.Or(left_f=self, right_l=other)
        else:
            raise TypeError(
                "Operator `|` expected type `Filter` or `Logic`, got `{}`".format(
                    type(other)
                )
            )

    def __repr__(self):
        return f"Filter({self._feature!r}, {self._condition!r}, {self._value!r})"

    def __str__(self):
        return self.json()


class Logic:
    AND = "AND"
    OR = "OR"
    SINGLE = "SINGLE"

    def __init__(self, type, left_f=None, right_f=None, left_l=None, right_l=None):
        self._type = type
        self._left_f = left_f
        self._right_f = right_f
        self._left_l = left_l
        self._right_l = right_l

    def json(self):
        return json.dumps(self, cls=util.FeatureStoreEncoder)

    def to_dict(self):
        return {
            "type": self._type,
            "leftFilter": self._left_f,
            "rightFilter": self._right_f,
            "leftLogic": self._left_l,
            "rightLogic": self._right_l,
        }

    @classmethod
    def And(cls, left_f=None, right_f=None, left_l=None, right_l=None):
        return cls(cls.AND, left_f, right_f, left_l, right_l)

    @classmethod
    def Or(cls, left_f=None, right_f=None, left_l=None, right_l=None):
        return cls(cls.OR, left_f, right_f, left_l, right_l)

    @classmethod
    def Single(cls, left_f):
        return cls(cls.SINGLE, left_f)

    def __and__(self, other):
        if isinstance(other, Filter):
            return Logic.And(left_l=self, right_f=other)
        elif isinstance(other, Logic):
            return Logic.And(left_l=self, right_l=other)
        else:
            raise TypeError(
                "Operator `&` expected type `Filter` or `Logic`, got `{}`".format(
                    type(other)
                )
            )

    def __or__(self, other):
        if isinstance(other, Filter):
            return Logic.Or(left_l=self, right_f=other)
        elif isinstance(other, Logic):
            return Logic.Or(left_l=self, right_l=other)
        else:
            raise TypeError(
                "Operator `|` expected type `Filter` or `Logic`, got `{}`".format(
                    type(other)
                )
            )

    def __repr__(self):
        return f"Logic({self._type!r}, {self._left_f!r}, {self._right_f!r}, {self._left_l!r}, {self._right_l!r})"

    def __str__(self):
        return self.json()
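A brief standalone sketch of how `Filter` and `Logic` compose through the overloaded operators (assuming the module is importable as `hsfs.constructor.filter`; the feature name and values are illustrative):

```python
from hsfs.constructor.filter import Filter, Logic

# leaf filters, as produced by the Python comparison operators on features
f1 = Filter("location_id", Filter.EQ, 10)
f2 = Filter("location_id", Filter.EQ, 20)

disjunction = f1 | f2            # Filter | Filter -> Logic.Or
conjunction = disjunction & f1   # Logic & Filter  -> Logic.And

assert isinstance(conjunction, Logic)
print(disjunction.to_dict()["type"])  # "OR"
print(conjunction.to_dict()["type"])  # "AND"
```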
