Create ensembler web service #165

deadlycoconuts · 2022-02-11T08:50:47Z

Context

In PR #164, the abstract class PyFunc was modified in preparation for the introduction of a real-time pyfunc ensembler engine.

This PR introduces that real-time ensembler engine: in effect, a web service packaged in a container that serves real-time ensembling requests using an ensembler defined in the mlflow pyfunc flavour. The ensembler is created with the Turing SDK by implementing the same PyFunc class currently used for batch ensemblers. With these changes, a user can therefore implement a single pyfunc ensembler that works for both batch and real-time ensembling, provided the batch column names and the live payload keys follow the same naming convention.

Features

A new real-time ensembler engine that runs the ensemble method of child classes of the PyFunc class
A web service, built on the tornado framework, that uses the real-time ensembler engine to serve ensembling requests sent from a Turing router
A base Docker image that packages the web service, plus a separate Docker image that builds on the base image to load the user-defined ensembler artefacts and run the web service in a container
A CI/CD workflow on GitHub for the new engine

Main Additions

engines/real-time-ensembler/pyfunc_ensembler_runner/ensembler_runner.py - Class that holds the real-time ensembler engine
engines/real-time-ensembler/pyfunc_ensembler_runner/handler.py - Class containing the HTTP handler (following the tornado framework) that handles POST requests from a Turing router
engines/real-time-ensembler/pyfunc_ensembler_runner/server.py - Class containing the web service
engines/real-time-ensembler/pyfunc_ensembler_runner/__main__.py - Main entry point that runs the web service
engines/real-time-ensembler/Dockerfile - Base image for the web service; does not contain artefacts of the pyfunc ensembler
engines/real-time-ensembler/app.Dockerfile - Docker image that builds upon the base image by loading artefacts of a given pyfunc ensembler
sdk/turing/ensembler.py - Changes to the predict method of the PyFunc class to process the incoming arguments differently according to their input type (pandas.DataFrame for batch ensembling vs dict for real-time ensembling)
.github/workflows/real-time-ensembler.yaml - GitHub workflow that tests and publishes the base Docker image containing the real-time ensembler engine
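
As a rough illustration of what the HTTP handler has to do for each POST request, here is a minimal sketch that is deliberately framework-independent so it runs on the standard library alone. It is not the code in handler.py: the function name handle_post, the error messages and the status codes are assumptions, and in the PR this logic would sit inside a tornado request handler's post method.

```python
import json
from typing import Any, Callable, Dict, Tuple


def handle_post(body: bytes, predict: Callable[[Dict[str, Any]], Any]) -> Tuple[int, str]:
    """Turn a raw request body into an (HTTP status, JSON response body) pair.

    `predict` stands in for the ensembler's predict method (hypothetical wiring).
    """
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400, json.dumps({"error": "request body is not valid JSON"})
    if not isinstance(payload, dict):
        return 400, json.dumps({"error": "request body must be a JSON object"})
    # The ensembler receives the parsed payload as a plain dict.
    return 200, json.dumps(predict(payload))
```

Keeping the parsing and validation in a plain function like this also makes it easy to unit-test without starting a server.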

romanwozniak · 2022-02-14T09:55:44Z

engines/real-time-ensembler/app.Dockerfile

+COPY --from=builder /venv /venv
+
+RUN /bin/bash -c ". /venv/bin/activate && \
+    python -m pyfunc_ensembler_runner --mlflow_ensembler_dir /ensembler --dry_run"


What exactly does this --dry_run do?

deadlycoconuts · 2022-02-15

Oh, it's just an option to start the web service (loading the ensembler from the mlflow registry, and the runner) without actually serving it. It's... actually kinda pointless, but I observed @gojek/merlin having it for their pyfunc-servers, so I was wondering if that option would serve some other separate/greater purpose, which was why I included it to be safe.

But I can also remove it, since I can't find any reason to run the service once without serving, just to check that it works, before running it again to serve.

romanwozniak · 2022-02-15

> I was wondering if that option would serve some other separate/greater purpose, which was why I included it to be safe.

I think the reason is that if you run this command at the image-building step, a failure of the dry run prevents the image from even being published, so we don't spend more time trying to deploy an image that is misconfigured.
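
To make the build-time check concrete, here is a minimal sketch of how such a flag could be wired into an entry point. Only the two flag names come from the Dockerfile snippet above; the run function and its load_ensembler and serve callables are hypothetical stand-ins, not the PR's actual __main__.py.

```python
import argparse
from typing import Any, Callable, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pyfunc_ensembler_runner")
    parser.add_argument("--mlflow_ensembler_dir", required=True)
    parser.add_argument("--dry_run", action="store_true")
    return parser


def run(
    argv: List[str],
    load_ensembler: Callable[[str], Any],  # hypothetical loader
    serve: Callable[[Any], None],          # hypothetical blocking server start
) -> int:
    args = build_parser().parse_args(argv)
    # Loading happens in both modes, so broken artefacts raise here
    # and fail the `docker build` step that invokes the dry run.
    ensembler = load_ensembler(args.mlflow_ensembler_dir)
    if args.dry_run:
        return 0  # everything loaded; exit without serving
    serve(ensembler)
    return 0
```

Because a RUN instruction fails the image build when its command exits with a non-zero status, an exception raised while loading stops the image from being produced at all.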

romanwozniak · 2022-02-14T10:12:28Z

engines/real-time-ensembler/pyfunc_ensembler_runner/ensembler_runner.py

+    def _flatten_json(y: Dict[str, Any]) -> Dict[str, Any]:
+        """
+        Helper function to normalise a nested dictionary into a dictionary of depth 1 with names following the
+        convention: key_1.key_2.key_3..., with a period acting as a delimiter between nested keys
+
+        Items in lists have their names rendered using their index numbers within the lists they are found in.
+        """


I'm not very comfortable with this. A dict seems like a better container for this data than pandas.DataFrame/pandas.Series, for two reasons:

Latency: for real-time ensemblers, the latency requirement is much more critical than for batch ensembling jobs. All this JSON manipulation and transformation into pandas containers is something we'd better avoid.

This "flat JSON" is not something end users would expect to work with. Also, JSON keys can contain . in their names, which is not handled here.

With that said, reworking the interface of the PyFunc ensembler makes more sense; the batch ensembler can then be updated too, to transform pandas data into dict.
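
The key-name problem can be shown with a small re-implementation of the convention described in the docstring. This is a sketch written for illustration, not the _flatten_json from the PR, and the sample payloads are made up.

```python
from typing import Any, Dict


def flatten_json(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into one level, joining keys with '.'."""
    flat: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten_json(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten_json(child, f"{prefix}{index}."))
    else:
        flat[prefix[:-1]] = value  # drop the trailing delimiter
    return flat


nested = flatten_json({"a": {"b": 1}})
dotted = flatten_json({"a.b": 1})
# Both payloads flatten to the same key, so they can no longer be told apart.
collision = flatten_json({"a": {"b": 1}, "a.b": 2})
```

Here nested and dotted are both {"a.b": 1}, and collision keeps only one of the two original values, so the flattening is lossy whenever a key contains the delimiter.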

deadlycoconuts · 2022-02-15

Okay, regarding the latency concerns for real-time ensemblers: I've reworked (overloaded) the predict method of the PyFunc base class so that it behaves differently depending on whether its argument is a dict (when called by the real-time ensembler engine) or a pandas.DataFrame (when called by the batch ensembler engine):
https://github.com/gojek/turing/blob/1bce96a181324636367236df3e9ad6d98c285c68/sdk/turing/ensembler.py#L73

With this change, I've completely removed the pandas.DataFrame transformations, such as flatten_json and the other table preprocessing operations, that might introduce latency in real-time ensemblers.

While it'd be nice to unify the batch and real-time ensembling use cases so that both take and return dicts, the predict method is inherited from the mlflow.pyfunc.PythonModel abstract class and is used by downstream functions such as mlflow.pyfunc.spark_udf, which force it to follow a contract of taking pandas.DataFrames as input and output.

In particular, the current batch ensembling engine uses mlflow.pyfunc.spark_udf, which requires at least some implementation of predict that takes in and returns pandas.DataFrames. For now I've kept our original implementation intact so that the batch ensembler doesn't break, but I'll see if I can find a more elegant solution.
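
The dispatch idea can be sketched roughly as follows. SketchEnsembler, its toy ensemble rule and the "prediction" key are hypothetical and much simpler than the Turing SDK's PyFunc class; the point is only that the dict path never touches pandas.

```python
from typing import Any, Dict


class SketchEnsembler:
    """Hypothetical stand-in for a pyfunc ensembler; not the Turing SDK class."""

    def ensemble(self, row: Dict[str, Any]) -> Any:
        # Toy rule: return the value stored under the (assumed) "prediction" key.
        return row["prediction"]

    def predict(self, model_input: Any) -> Any:
        # Real-time path: the engine passes the request payload as a plain dict.
        if isinstance(model_input, dict):
            return self._predict_realtime(model_input)
        # Batch path: anything else is assumed to be a pandas.DataFrame.
        return self._predict_batch(model_input)

    def _predict_realtime(self, payload: Dict[str, Any]) -> Any:
        return self.ensemble(payload)

    def _predict_batch(self, frame: Any) -> Any:
        # One ensemble call per row; with a DataFrame this yields a pandas.Series.
        return frame.apply(lambda row: self.ensemble(row.to_dict()), axis=1)
```

A single user-defined ensemble method then serves both engines, while only the batch path pays the cost of DataFrame handling.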

romanwozniak

This looks great. Thanks, @deadlycoconuts!

deadlycoconuts · 2022-02-16T04:04:57Z

Alright thanks @romanwozniak for the useful comments; I'm gonna merge this now 🚀

deadlycoconuts requested a review from a team February 11, 2022 16:44

deadlycoconuts marked this pull request as ready for review February 11, 2022 16:44

romanwozniak reviewed Feb 14, 2022

deadlycoconuts added 27 commits February 15, 2022 11:47

Add skeleton class for pyfunc ensembler

4c50590

Refactor ensembler class in pyfunc to accept both batch and live requests

9cd9384

Add supporting classes for live pyfunc ensembler

97d97bd

Add preprocessing methods for live ensembler

c39a7ac

Update PyFunc ensembler in SDK to utilise returned treatment_config

77331ff

Modify predict method in SDK PyFunc to allow backward compatibility with current batch ensemblers

3cab352

Set output from prediction to be a list-like object

c863c08

Remove redundant header names for features in PyFunc

dbaf6e1

Rename PyFuncEnsembler to PyFuncEnsemblerRunner to remove overloaded classname

062d55e

Rename references to renamed PyFuncEnsemblerRunner

0e296b1

Add docstrings to various methods

b0fca59

Add README template

9cec5da

Add base files for containerisation

2cf245e

Make container use a multi-stage build that use a venv derived from a conda env

4281458

Rename preprocess method to make it appear private

5162f5f

Add gitignore file

096d277

Add test for preprocessing method for pyfunc_ensembler_runner

686a3cc

Cleanup some testing configurations

5f96038

Rename test sample data to improve consistency in naming

0690e40

Remove test request

1aa4803

Add additional tests for web service

472572a

Add files for containerisation

2f0438d

Rename live-ensembler to real-time-ensembler

ddb43f4

Add github workflow for real-time-ensembler

ea81728

Edit typo in workflow

854553e

Edit typo in readme file

78f58c5

Add changes missed out by rebasing

06b769a

deadlycoconuts added 5 commits February 15, 2022 11:47

Edit typo in exception message

8c2ec33

Separate dockerfiles into a base and app file

440ba05

Edit typo in dockerfile

b7381e7

Rename real-time ensembler module and mentions to pyfunc-ensembler-service

0d7fe79

Rename batch-ensembler module and mentions with pyfunc-ensembler-job

d5b9a40

deadlycoconuts force-pushed the create_ensembler_web_service branch from 8fb53c7 to d5b9a40 Compare February 15, 2022 04:07

deadlycoconuts added 10 commits February 15, 2022 12:11

Rename remnants of ensemblers with old naming convention

31c6f36

Add new pyfunc-ensembler-service engine to Turing CI

ac5fcec

Replace vanilla debian image with its slim version

9a36f04

Clean up dockerfiles to utilise env variables

0324af1

Replace redundant run.sh script by running webservice from dockerfile

48b4e30

Remove redundant entries in .gitignore

48074e0

Rename batch ensembler to pyfunc-ensembler-job

c8f9b8b

Revamp pyfunc implementation to avoid dataframe manipulations for real-time ensembling

1bce96a

Remove redundant imports

03ac53a

Replace incorrect env variables in dockerfiles

4e9841a

romanwozniak reviewed Feb 16, 2022

romanwozniak approved these changes Feb 16, 2022

deadlycoconuts added 2 commits February 16, 2022 11:27

Refactor pyfunc predict method to use helper methods dependent on input type

9360625

Rewrite help tags for arg parser

57d956f

deadlycoconuts merged commit e108820 into caraml-dev:main Feb 16, 2022

deadlycoconuts deleted the create_ensembler_web_service branch February 16, 2022 04:29

This was referenced Feb 25, 2022

Automate ensembler image building with Turing API #170

Merged

Add UI/SDK support to pyfunc ensemblers #171

Merged

deadlycoconuts mentioned this pull request Mar 7, 2022

Add user docs for Turing/Turing SDK #174

Merged

deadlycoconuts self-assigned this Mar 15, 2022

deadlycoconuts mentioned this pull request Mar 15, 2022

Add env vars support to Pyfunc ensembler #181

Merged
