Support TFRecord as one of the output formats for historical feature retrieval #1222

khorshuheng · 2020-12-09T06:58:37Z

What this PR does / why we need it:
This PR allows users to specify TFRecord as the output format for historical feature retrieval. This is useful when users want to generate statistics from the retrieved dataset with TFDV, or when their machine learning model is built on TensorFlow.

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

Users can now specify `tfrecord` as the output format for historical feature retrieval.

feast-ci-bot · 2020-12-09T06:58:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khorshuheng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [khorshuheng]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

woop · 2020-12-09T07:05:53Z

tests/integration/test_launchers.py

+
+
+def test_dataproc_job_tfrecord_output(
+    dataproc_launcher: DataprocClusterLauncher,  # noqa: F811


What is with all these noqa: F811? Can it be fixed please?

khorshuheng · 2020-12-09

That's because we import the fixtures from another module, and flake8 doesn't understand pytest fixtures: it sees the imported name being redefined by the test function's parameter and reports F811.

One way to fix it would be to stop importing the fixture and define it directly in test_launchers, though that would make the fixture harder to reuse across test files.

I will look into whether this can be avoided without explicitly adding `# noqa: F811`, but I am not certain it is possible.
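
One commonly used way to avoid these F811 warnings is to move shared fixtures into a `conftest.py`, which pytest discovers automatically, so test modules never import the fixture name. The sketch below is illustrative only and is not what this PR does: `FakeLauncher` and its `output_format` attribute are hypothetical stand-ins for the real `DataprocClusterLauncher`, the file paths are assumed, and the two files are shown in a single block for brevity.

```python
# --- tests/integration/conftest.py (hypothetical location) ---
import pytest


class FakeLauncher:
    """Hypothetical stand-in for a real launcher such as DataprocClusterLauncher."""

    def __init__(self, output_format: str = "tfrecord") -> None:
        self.output_format = output_format


@pytest.fixture
def dataproc_launcher() -> FakeLauncher:
    # pytest picks up fixtures defined in conftest.py without any import,
    # so flake8 never sees an imported name being redefined (F811).
    return FakeLauncher()


# --- tests/integration/test_launchers.py ---
# No "from ... import dataproc_launcher" here; pytest injects it by name.
def test_dataproc_job_tfrecord_output(dataproc_launcher: FakeLauncher) -> None:
    assert dataproc_launcher.output_format == "tfrecord"
```

Fixtures in a `conftest.py` are visible to every test module in that directory and below, so the fixture stays reusable across test files. In a real two-file layout the test module would still import the launcher type for the annotation, but not the fixture itself.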

oavdeev-tt · 2020-12-28T18:08:11Z

sdk/python/feast/pyspark/launchers/aws/emr_utils.py

+            "Args": ["spark-submit", pyspark_script_path]
+            + args
+            + ["--packages", ",".join(packages)]
+            if packages
+            else [],


Suggested change

-            "Args": ["spark-submit", pyspark_script_path]
-            + args
-            + ["--packages", ",".join(packages)]
-            if packages
-            else [],
+            "Args": ["spark-submit", pyspark_script_path]
+            + args
+            + (["--packages", ",".join(packages)]
+            if packages
+            else []),
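
The suggestion appears to target Python's operator precedence: a conditional expression binds more loosely than `+`, so without parentheses the `if packages else []` applies to the whole concatenation rather than only to the `--packages` part. The following is a minimal sketch of that behaviour; the function names and sample values are hypothetical and are not taken from `emr_utils.py`.

```python
from typing import List


def build_args_unparenthesized(script: str, args: List[str], packages: List[str]) -> List[str]:
    # Parsed as: (["spark-submit", script] + args + [...]) if packages else []
    # With no packages, the entire argument list collapses to [].
    return ["spark-submit", script] + args + ["--packages", ",".join(packages)] if packages else []


def build_args_parenthesized(script: str, args: List[str], packages: List[str]) -> List[str]:
    # The parentheses limit the conditional to the optional "--packages" part.
    return ["spark-submit", script] + args + (["--packages", ",".join(packages)] if packages else [])


if __name__ == "__main__":
    print(build_args_unparenthesized("job.py", ["--input", "x"], []))
    print(build_args_parenthesized("job.py", ["--input", "x"], []))
```

When `packages` is non-empty both versions return the same list; they differ only when it is empty. The sketch illustrates the precedence issue only and says nothing about the final argument order used in the launcher.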

oavdeev · 2021-01-05T01:22:59Z

/test test-end-to-end-sparkop

oavdeev · 2021-01-05T01:43:11Z

/test test-end-to-end-sparkop

oavdeev · 2021-01-05T02:23:29Z

/test test-end-to-end-sparkop

oavdeev · 2021-01-05T03:52:36Z

/test test-end-to-end-sparkop

jklegar · 2021-01-05T19:35:06Z

/test test-end-to-end-azure

feast-ci-bot · 2021-01-07T06:29:16Z

@khorshuheng: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| test-end-to-end-azure | `6d89310` | link | `/test test-end-to-end-azure` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

khorshuheng · 2021-01-07T07:10:35Z

/test test-end-to-end

khorshuheng requested review from davidheryanto, pyalex, woop and zhilingc as code owners December 9, 2020 06:58

feast-ci-bot added release-note do-not-merge/work-in-progress labels Dec 9, 2020

feast-ci-bot added approved needs-kind size/L labels Dec 9, 2020

khorshuheng force-pushed the tfrecord-output branch from 1ca8d28 to 7d62dec Compare December 9, 2020 06:59

woop reviewed Dec 9, 2020

khorshuheng force-pushed the tfrecord-output branch from 7bb80e3 to 2d019f1 Compare December 23, 2020 14:59

oavdeev-tt reviewed Dec 28, 2020

khorshuheng force-pushed the tfrecord-output branch from 1238e6c to 0937a1b Compare January 4, 2021 04:41

khorshuheng added 10 commits January 7, 2021 13:15

Support TFRecord as one of the output formats for historical feature retrieval

a5b4fca

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Fix style

ec5dde6

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Python style fix

1c99c99

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Update tfrecord jar to be Spark 3.0 compatible

25daac8

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Add ability to download extra packages for EMR historical retrieval job

62174b2

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Add ability to download extra packages for k8s historical retrieval job

f3e19b9

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

e2e tests for tfrecord output

2cd688e

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Patch copy module so that regex can be deep copied

5534852

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Use separate fixture for tfrecord feast client

8afbffd

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Style fix

2c7c9db

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

khorshuheng added 3 commits January 7, 2021 13:15

Fix aws launcher spark submit arguments

73b1d22

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Fix spark submit argument sequence

23c5005

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

Update docker image for k8s launcher

5d80975

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

khorshuheng force-pushed the tfrecord-output branch from 39840ec to 5d80975 Compare January 7, 2021 05:15

khorshuheng changed the title ~~(WIP) Support TFRecord as one of the output formats for historical feature retrieval~~ Support TFRecord as one of the output formats for historical feature retrieval Jan 7, 2021

feast-ci-bot removed the do-not-merge/work-in-progress label Jan 7, 2021

Revert image change

6d89310

Signed-off-by: Khor Shu Heng <khor.heng@gojek.com>

pyalex merged commit c4df8c9 into feast-dev:master Jan 8, 2021
