
breatheco-de/dataflow-project-events


For developers and data scientists

  1. It's better to work in Gitpod, since it's easier to set up.
  2. Run:

pipenv install

This installs the dependencies from Pipfile.lock that the project needs to work.

How to use this project

  1. Clone the repository to your computer (or Gitpod).
  2. Add your transformations into the ./transformations/<pipeline>/ folder.
  3. Configure project.yml to specify the pipeline and the transformations in the order you want them executed. Each pipeline must have at least one source and exactly one destination; multiple sources are allowed if needed (see the sketch after this list).
  4. Add new transformation files as you need them; make sure each one includes expected_inputs and expected_output as examples. The expected inputs can be an array of dataframes when there are multiple sources.
  5. Update your project.yml file as needed to change the order of the transformations.
  6. Validate your transformations by running $ pipenv run validate.
  7. Run your pipeline with $ pipenv run pipeline --name=<pipeline_slug>.
  8. If you need to clean your outputs, run $ pipenv run clear.
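
As a reference for step 3, here is a hypothetical sketch of a pipeline entry in project.yml. The key names and paths are assumptions for illustration only; check the project.yml that ships with this repository for the actual schema.

pipelines:
  - name: clean_messages                      # the <pipeline_slug> passed to --name
    sources:                                  # at least one source; several are allowed
      - sources/messages.csv
    destination: output/messages_clean.csv    # exactly one destination
    transformations:                          # executed top to bottom, in this order
      - clean_messages/drop_duplicates
      - clean_messages/normalize_text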

Transformations

Every transformation is a Python file that exposes a run function, which receives a dataframe and returns the transformed dataframe:

import pandas as pd
import numpy as np

def run(df):
    # apply the transformation to the incoming dataframe
    # ...
    return df
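
Since step 4 above asks every transformation file to ship with expected_inputs and expected_output, here is a minimal sketch of what that might look like. The column name and sample values are made up for illustration, and expected_inputs is a list because a pipeline can have multiple sources:

import pandas as pd

# sample data used for validation; expected_inputs is a list of dataframes,
# one per source (column name and values are illustrative)
expected_inputs = [pd.DataFrame({"message": ["  Hello ", "Bye"]})]
expected_output = pd.DataFrame({"message": ["hello", "bye"]})

def run(df):
    # trim surrounding whitespace and lowercase the text column
    df["message"] = df["message"].str.strip().str.lower()
    return df

With those in place, $ pipenv run validate has sample data to check each transformation against.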

Streaming data

Pipelines can also process streamed chunks of data. For example:

pipenv run pipeline --name=clean_publicsupport_fs_messages --stream=stream_sample.csv

Note: --stream is the path to a CSV file containing all the streams you want to test. If the CSV contains multiple rows, each row is treated as a separate stream and the pipeline runs once per stream.
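
For instance, a hypothetical stream_sample.csv with two rows (the header and the messages below are made up for illustration) would make the pipeline run twice, once per row:

message
"I can't log in to my account"
"How do I reset my password?"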

Defining the stream parameter in the transformation

Make sure to declare the optional stream parameter in the transformation function:

import pandas as pd
import numpy as np

def run(df, stream=None):
    # `stream` receives the incoming chunk when the pipeline runs in streaming mode
    # ...
    return df
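
As a sketch of how the parameter might be used (the dropna call and the exact shape of stream are assumptions for illustration; the pipeline runner decides what actually gets passed in):

def run(df, stream=None):
    if stream is None:
        # batch mode: transform the full dataframe
        df = df.dropna()
    else:
        # streaming mode: handle only the incoming chunk
        # (assumption: `stream` carries the payload of one CSV row)
        ...
    return df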
