Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create ensembler web service #165

Merged

Conversation

deadlycoconuts
Copy link
Contributor

@deadlycoconuts deadlycoconuts commented Feb 11, 2022

Context

In PR #164, the abstract class PyFunc was modified in preparation for the introduction of a real-time pyfunc ensembler engine.

This PR introduces the aforementioned real-time ensembler engine which is in effect a web service wrapped around by a container that serves real-time ensembling requests using an ensembler defined in the mlflow pyfunc flavour. This ensembler is to be created using the Turing SDK by implementing the same PyFunc class currently used for implementing batch ensemblers. Hence with these changes, a user would be able to implement an pyfunc ensembler that works in both batch and real-time ensembling (if the same batch columns/live payload naming convention is used).

Features

  • New real-time ensembler engine that runs the ensemble method of child classes of the PyFunc class
  • A web service using the tornado framework to serve ensembling requests sent from a Turing router, by using the real-time ensembler engine
  • A Docker image to package the web service as a base image, with a separate Docker image (to build on to the former) to load the user-defined ensembler artefacts and run the web service in a container
  • A CI/CD workflow on GitHub for the new engine

Main Additions

  • engines/real-time-ensembler/pyfunc_ensembler_runner/ensembler_runner.py - Class that holds the real-time ensembler engine
  • engines/real-time-ensembler/pyfunc_ensembler_runner/handler.py - Class containing the HTTP handler (according to the tornado framework) that handles post requests from a Turing router
  • engines/real-time-ensembler/pyfunc_ensembler_runner/server.py - Class containing the web service
  • engines/real-time-ensembler/pyfunc_ensembler_runner/__main__.py - Main entry point that runs the web service
  • engines/real-time-ensembler/Dockerfile - Base image for the web service; does not contain artefacts of the pyfunc ensembler
  • engines/real-time-ensembler/app.Dockerfile - Docker image that builds upon the base image by loading artefacts of a given pyfunc ensembler
  • sdk/turing/ensembler.py - Changes to the predict method of the PyFunc class to process the incoming arguments differently according to their input type (pandas.DataFrame for batch ensembling vs dict for real-time ensembling)
  • .github/workflows/real-time-ensembler.yaml - GitHub workflow that tests and publishes the base Docker image containing the real-time ensembler engine

@deadlycoconuts deadlycoconuts requested a review from a team February 11, 2022 16:44
@deadlycoconuts deadlycoconuts marked this pull request as ready for review February 11, 2022 16:44
engines/real-time-ensembler/.gitignore Outdated Show resolved Hide resolved
engines/real-time-ensembler/Dockerfile Outdated Show resolved Hide resolved
engines/real-time-ensembler/app.Dockerfile Outdated Show resolved Hide resolved
COPY --from=builder /venv /venv

RUN /bin/bash -c ". /venv/bin/activate && \
python -m pyfunc_ensembler_runner --mlflow_ensembler_dir /ensembler --dry_run"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly this --dry_run does?

Copy link
Contributor Author

@deadlycoconuts deadlycoconuts Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh it's just an option to run the web service (to load the ensembler from the mlflow registry and the runner) without actually serving it. It's... actually kinda pointless but I observed @gojek/merlin having it for their pyfunc-servers so I was wondering if that option would serve some other separate/greater purpose, which was why I included it to be safe.

But I can remove it also since I can't seem to find any reason why we should run it once without serving it in order to ensure it's working, before actually running the service again for serving.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if that option would serve some other separate/greater purpose, which was why I included it to be safe.

I think the reason is that if you run this command at the image building step, then failure of this dry-run step will prevent image from being even published and hence we will not spend more time trying to deploy the image if it's misconfigured

engines/real-time-ensembler/app.Dockerfile Outdated Show resolved Hide resolved
Comment on lines 56 to 62
def _flatten_json(y: Dict[str, Any]) -> Dict[str, Any]:
"""
Helper function to normalise a nested dictionary into a dictionary of depth 1 with names following the
convention: key_1.key_2.key_3..., with a period acting as a delimiter between nested keys

Items in lists have their names rendered using their index numbers within the lists they are found in.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very comfortable with this. It seems like dict could be a better container for this data, compared to pandas.DataFrame/pandas.Series. The reason for this is:

  1. Latency – for real-time ensemblers, the latency requirement is much more critical, compared to batch ensembling jobs. Having all this json manipulation and transformation into a pandas containers is likely something we'd better avoid
  2. This "flat-json" is not something the end-users would expect to work with. Also, JSON keys could have . in their names, which is not handled here

With that said, re-working the interface of the PyFunc ensembler makes more sense, and then batch-ensembler can be updated too, to transform pandas data into dict

Copy link
Contributor Author

@deadlycoconuts deadlycoconuts Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay with regards to the latency concerns for real-time ensemblers, I've reworked (overloaded) the predict method of the PyFunc base class a little to make it work differently depending on whether it gets passed a dict (when called by a real-time ensembler engine) or a pandas.DataFrame (when called by a batch-ensembler engine) as its argument:
https://github.com/gojek/turing/blob/1bce96a181324636367236df3e9ad6d98c285c68/sdk/turing/ensembler.py#L73

With this change, I've completely removed any of those pandas.DataFrame transformations like flatten_json or other preprocessing table operations that might introduce latency in the real-time ensemblers.


While it'd be nice to unify both the batch and real-time ensembling use cases to act on and return dict inputs, the predict method, as an inherited method of the mlflow.pyfunc.PythonModel abstract class, gets utilised by other downstream methods such as mlflow.pyfunc.spark_udf, that forces the predict method to follow a certain contract by taking pandas.DataFrames as input and output.

In particular, the current batch ensembling engine uses mlflow.pyfunc.spark_udf, which necessitates at least some implementation of predict that takes in and outputs pandas.DataFrames. For now I've kept our original implementation intact to prevent the batch ensembler from breaking, but I'll see if I can find a solution that's more elegant.

engines/real-time-ensembler/requirements.txt Outdated Show resolved Hide resolved
.github/workflows/real-time-ensembler.yaml Outdated Show resolved Hide resolved
@deadlycoconuts deadlycoconuts force-pushed the create_ensembler_web_service branch from 8fb53c7 to d5b9a40 Compare February 15, 2022 04:07
Copy link
Contributor

@romanwozniak romanwozniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thanks, @deadlycoconuts!

@deadlycoconuts
Copy link
Contributor Author

Alright thanks @romanwozniak for the useful comments; I'm gonna merge this now 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants