Generate Statistics component

This component is used to calculate statistics of a given dataset (CSV). It uses the generate_statistics_from_csv function in the TensorFlow Data Validation (TFDV) package.
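
For reference, the underlying TFDV call looks roughly like this when invoked directly (the GCS paths below are purely illustrative):

import tensorflow_data_validation as tfdv

# Compute statistics over one or more CSV files matching a pattern
stats = tfdv.generate_statistics_from_csv(
    data_location="gs://my-bucket/data/file-*.csv",  # illustrative path
)

# Write the resulting statistics proto so that downstream steps can read it
tfdv.write_stats_text(stats, "gs://my-bucket/statistics/stats.pbtxt")  # illustrative path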

As well as being run in an ordinary pipeline container, it can optionally be run on DataFlow, allowing it to scale to massive datasets.

Usage

Normal usage - DirectRunner

In this mode, all the computation happens within the Vertex Pipelines step itself. This is the approach used in the example ML pipelines, but of course you can easily change it to make use of DataFlow (see below).

Code example

The example below shows the use of a file_pattern for selecting a dataset spread across multiple CSV files. The dataset parameter uses the output from a previous pipeline step (a dataset in CSV format) - alternatively you can use the GCS path of a CSV file.

gen_statistics = generate_statistics(
        dataset=train_dataset.outputs["dataset"],
        file_pattern="file-*.csv",
    ).set_display_name("Generate data statistics")

The final call, set_display_name(...), is optional - it gives the pipeline step a neater display name in the Vertex Pipelines UI.

Running on DataFlow without a prebuilt container image

In this approach, the DataFlow workers install Apache Beam and TFDV when the job runs. The main benefit of this approach is simplicity - you don't need to create a custom container image for the DataFlow workers to use.

The drawbacks of this approach include:

  • Slower job startup, as the DataFlow workers need to install the dependencies at the start of each job
  • Reliance on PyPI being available at runtime

Code example

In this example we have additionally set the following parameters:

  • use_dataflow=True
  • project_id - the GCP project ID where we want to run the DataFlow job
  • region - the GCP region where we want to run the DataFlow job
  • gcs_staging_location - A GCS path that can be used as a DataFlow staging location
  • gcs_temp_location - A GCS path that can be used as a DataFlow temp / scratch directory

gen_statistics = generate_statistics(
        dataset=train_dataset.outputs["dataset"],
        file_pattern="file-*.csv",
        use_dataflow=True,
        project_id="my-gcp-project",
        region="europe-west1",
        gcs_staging_location="gs://my-gcs-bucket/dataflow-staging",
        gcs_temp_location="gs://my-gcs-bucket/dataflow-temp",
    ).set_display_name("Generate data statistics")

Running on DataFlow using a prebuilt container image

In this approach, a custom container image is used by the DataFlow workers, meaning that the DataFlow workers don't need to install anything when the job runs.

The benefits to this approach include:

  • Quicker job startup (no need to install packages each time)
  • More deterministic behaviour (even if package versions are pinned, their dependencies might not be!)
  • No reliance on PyPI repositories at runtime - useful if you are operating in a disconnected environment, or if you are concerned about the public PyPI repositories experiencing an outage

The main drawback to this approach is that you need to provide a custom container image with Apache Beam and TFDV pre-installed. The next section contains more details on how you can create the custom container image.

Code example

In this example we have additionally set the following parameters:

  • tfdv_container_image - the container image to use for the DataFlow runners - must have Apache Beam and TFDV preinstalled (the same versions as you use to make the function call in the pipeline step!)
  • subnetwork - the subnetwork to attach the DataFlow workers to. It should be in the form regions/REGION_NAME/subnetworks/SUBNETWORK_NAME (see the DataFlow networking documentation for further details). Also note that the DataFlow region must match the region of the subnetwork!
  • use_public_ips=False - specifies that the DataFlow workers should not have public IP addresses. Without additional networking configuration (e.g. a Cloud NAT gateway), this generally means that they are unable to access the internet

Of course, you can also use tfdv_container_image without the subnetwork and use_public_ips parameters, in which case your DataFlow workers will still have public IP addresses and will use the default compute network.

gen_statistics = generate_statistics(
        dataset=train_dataset.outputs["dataset"],
        file_pattern="file-*.csv",
        use_dataflow=True,
        project_id="my-gcp-project",
        region="europe-west1",
        gcs_staging_location="gs://my-gcs-bucket/dataflow-staging",
        gcs_temp_location="gs://my-gcs-bucket/dataflow-temp",
        tfdv_container_image="eu.gcr.io/my-gcp-project/my-custom-tfdv-image:latest",
        subnetwork="regions/europe-west1/subnetworks/my-subnet",
        use_public_ips=False,
    ).set_display_name("Generate data statistics")

Creating a custom container image for use with DataFlow

Creating the custom container image is very simple. Here is an example of a Dockerfile that you can use:

# Use correct image for Apache Beam version
FROM apache/beam_python3.7_sdk:2.35.0

# Install TFDV on top 
RUN pip install tensorflow-data-validation==1.6.0

# Check version compatibilities here
# https://www.tensorflow.org/tfx/data_validation/install#compatible_versions

Additional component parameters

In the docstring for the generate_statistics component, you will notice a few other parameters that we haven't mentioned in the examples above:

  • statistics - this is the output path that the statistics file is written to. Since its type is Output[Artifact], Vertex Pipelines automatically provides the path for us without us having to specify it
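
For example, a downstream pipeline step could consume the generated statistics artifact like this (the validate_schema component shown here is purely hypothetical):

# Hypothetical downstream step consuming the statistics artifact
validation = validate_schema(
        statistics=gen_statistics.outputs["statistics"],
    ).set_display_name("Validate schema")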

The following options are all related to the Apache Beam PipelineOptions. Each is a dictionary that you can pass to the component to construct the PipelineOptions. Any options passed in using these dictionaries will override those set by the component (as these dictionaries are applied last), so use with care! You can find more details about how to set the PipelineOptions in the Apache Beam and DataFlow documentation. An illustrative example follows the list below.

  • extra_standard_options
  • extra_setup_options
  • extra_worker_options
  • extra_google_cloud_options
  • extra_debug_options
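
As an illustration, each dictionary maps option names to values, where the accepted keys correspond to the attributes of the matching Apache Beam options class (the overrides below are purely illustrative):

gen_statistics = generate_statistics(
        dataset=train_dataset.outputs["dataset"],
        file_pattern="file-*.csv",
        use_dataflow=True,
        project_id="my-gcp-project",
        region="europe-west1",
        gcs_staging_location="gs://my-gcs-bucket/dataflow-staging",
        gcs_temp_location="gs://my-gcs-bucket/dataflow-temp",
        # Illustrative overrides - keys follow the Apache Beam WorkerOptions attributes
        extra_worker_options={"machine_type": "n1-standard-4", "max_num_workers": 8},
    ).set_display_name("Generate data statistics")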

In addition to these options, you can also pass options required for generating statistics with tfdv_stats_options (an illustrative sketch follows the list below). For example, these stats options can include (but are not limited to):

  • schema: Pre-defined schema as a tensorflow_metadata Schema proto
  • infer_type_from_schema: Boolean to indicate whether the feature types should be inferred from a schema
  • feature_allowlist: List of feature names to calculate statistics for
  • sample_rate: Optional sampling rate - if specified, statistics are computed over a sample of the data
  • desired_batch_size: Optional maximum number of examples to include in each batch
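
For illustration, a StatsOptions object might be constructed like this and passed via tfdv_stats_options (the exact form the component expects - a StatsOptions instance or a dictionary of these fields - depends on its implementation, so treat this as a sketch):

import tensorflow_data_validation as tfdv

# Illustrative stats options: restrict statistics to two (hypothetical) features
# and compute them over a 10% sample of the data
stats_options = tfdv.StatsOptions(
    feature_allowlist=["age", "income"],
    sample_rate=0.1,
    desired_batch_size=1024,
)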

For more details, please refer to the TFDV StatsOptions documentation.