Extend Execution Stages with a new Level - Execution Batches #72

Closed · mrpaulandrew opened this issue Nov 10, 2020 · 2 comments · Fixed by #82

Comments

mrpaulandrew (Owner) commented Nov 10, 2020

Building on community feedback in the following issues:

#61
#62

Allow the framework to support multiple executions of the parent pipeline using a new, higher-level concept called Batches:

  • A batch sits above execution stages.
  • A batch can be linked to one or many execution stages.
  • Batches can be enabled/disabled via properties.
  • Batches can run concurrently, e.g. Batch 1 = Running, Batch 2 = Running.
  • A single batch ID cannot run concurrently with itself.
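
To make the proposal concrete, here is a minimal T-SQL sketch of what the metadata might look like. The table and column names (procfwk.Batches, procfwk.BatchStageLink) are illustrative assumptions, not the framework's actual schema:

```sql
-- Illustrative sketch only; table/column names are assumptions.
CREATE TABLE procfwk.Batches
(
    BatchId UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    BatchName VARCHAR(255) NOT NULL, -- e.g. 'Hourly', 'Daily'
    Enabled BIT NOT NULL DEFAULT 1,  -- enable/disable via properties
    CONSTRAINT PK_Batches PRIMARY KEY (BatchId)
);

-- A batch sits above stages and can link to one or many of them.
CREATE TABLE procfwk.BatchStageLink
(
    BatchId UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT FK_BatchStageLink_Batches
        REFERENCES procfwk.Batches (BatchId),
    StageId INT NOT NULL,
    CONSTRAINT PK_BatchStageLink PRIMARY KEY (BatchId, StageId)
);
```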

Examples of Batch names:

  1. Hourly
  2. Daily
  3. Weekly
  4. Monthly

Finally, it is expected that any Trigger calling the parent pipeline will include the batch name as a parameter.
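
As a sketch of that hand-off, the parent pipeline could pass the batch name from the trigger into a validation proc that also enforces the "no concurrent runs of the same batch" rule. All names here (CheckForValidBatch, CurrentExecution.BatchId) are assumptions for illustration, not the framework's actual code:

```sql
-- Illustrative sketch; proc and table names are assumptions.
CREATE PROCEDURE procfwk.CheckForValidBatch
    @BatchName VARCHAR(255)
AS
BEGIN
    -- Reject unknown or disabled batch names.
    IF NOT EXISTS (
        SELECT 1 FROM procfwk.Batches
        WHERE BatchName = @BatchName AND Enabled = 1
    )
        THROW 50001, 'Batch name is unknown or disabled.', 1;

    -- A single batch ID cannot run concurrently with itself.
    IF EXISTS (
        SELECT 1
        FROM procfwk.CurrentExecution AS ce
        INNER JOIN procfwk.Batches AS b ON b.BatchId = ce.BatchId
        WHERE b.BatchName = @BatchName
    )
        THROW 50002, 'This batch is already running.', 1;
END;
```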

@mrpaulandrew added the labels enhancement (New feature or request), Solution Area: Orchestrator (Azure Data Factory or Azure Synapse Analytics - Processing pipelines), Solution Area: SQL Database (Metadata database housed within a versioned SQL Server database) and Solution Area: Functions App (Azure Functions App - Middleware callers) on Nov 10, 2020

@mrpaulandrew changed the title from "Add a Property to Enable/Disable Multiple Concurrent Parent Triggers" to "Extend Execution Stages with a new Level - Execution Batches" on Nov 11, 2020
Mallik-G commented

Hi @mrpaulandrew,

I had a similar requirement to support multiple parallel executions. I approached it in much the same way, except that I called the higher-level concept a "Job".

  • A Job is comprised of one or more stages.
  • Multiple jobs can run concurrently.
  • The JobId is passed as a parameter from the parent pipeline's trigger and is made available to all lower levels.
  • The pipelines linked to all stages in a job are inserted into LocalExecution.
  • Once the job succeeds, its log records are moved to ExecutionLog.
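
Roughly, the lifecycle described above could look like this in T-SQL (an illustrative sketch; the column lists and names are assumptions, not the fork's actual code):

```sql
-- Illustrative sketch of the job lifecycle; names/columns are assumptions.
DECLARE @JobId INT = 1;

-- Queue every pipeline linked to the job's stages.
INSERT INTO procfwk.LocalExecution (JobId, StageId, PipelineId, PipelineStatus)
SELECT @JobId, s.StageId, p.PipelineId, 'Not Started'
FROM procfwk.Stages AS s
INNER JOIN procfwk.Pipelines AS p ON p.StageId = s.StageId
WHERE s.JobId = @JobId;

-- Once the job succeeds, archive its rows to the long-term log.
INSERT INTO procfwk.ExecutionLog
SELECT * FROM procfwk.LocalExecution WHERE JobId = @JobId;

DELETE FROM procfwk.LocalExecution WHERE JobId = @JobId;
```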

Metadata database changes include:
- changes to the Stages table to support job and stage linking
- JobId added as a parameter to stored procs where needed
- a new JobProperties table for properties that are job-specific, such as OverrideRestart and FailureHandling
- framework-level properties that are common across jobs kept in the Properties table
- related procs and functions developed to support the above
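
For illustration, the shape of those changes might be roughly as follows (a hypothetical sketch; the actual DDL in the fork may differ):

```sql
-- Illustrative sketch; names are assumptions.

-- Link each stage to its owning job.
ALTER TABLE procfwk.Stages ADD JobId INT NULL;

-- Job-specific properties (e.g. OverrideRestart, FailureHandling)
-- move out of the shared Properties table.
CREATE TABLE procfwk.JobProperties
(
    JobId INT NOT NULL,
    PropertyName VARCHAR(128) NOT NULL,
    PropertyValue NVARCHAR(MAX) NULL,
    CONSTRAINT PK_JobProperties PRIMARY KEY (JobId, PropertyName)
);

-- Stored procs that need job scoping gain a @JobId parameter
-- (proc name below is illustrative only):
-- EXEC procfwk.GetStages @JobId = 1;
```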

Examples of Jobs:

  • Daily Sales Ingestion Job (JobId 1)

    • Ingestion Stage (StageId 1)
      • CopyToDataLake (Pipeline 1)
    • Transformation Stage (StageId 2)
      • LaunchDatabricksJob (Pipeline 2)
    • Load Stage (StageId 3)
      • CopyToDW (Pipeline 3)
  • Daily Marketing Egression Job (JobId 2)

    • Prepare Stage (StageId 4)
      • LaunchDatabricksJob (Pipeline 4)
    • Extract Stage (StageId 5)
      • CopyToSFTP (Pipeline 5)
  • Weekly Campaign Ingestion Job (JobId 3)

    • Ingestion Stage (StageId 6)
      • CopyToDataLake (Pipeline 6)
    • Transformation Stage (StageId 7)
      • LaunchDatabricksJob (Pipeline 7)
    • Load Stage (StageId 8)
      • CopyToDW (Pipeline 8)

I didn't have much use for the Grandparent level in my work setup, so my idea was to give the ParentPipeline different triggers (schedule or tumbling window) and pass the JobId to the pipeline as a parameter.

Worker pipelines often get reused within a stage in the same job or across jobs. For example, LaunchDatabricksJob is a pipeline that gets reused a lot at my work, where I need to pass different invocation params, class name and jar name, to submit different apps/jars to the Databricks cluster.
To do that, I just add a new entry into the Pipelines table mapping the stage and pipeline (like a new pipeline instance), which causes a new PipelineId to be created, making that particular pipeline instance unique. This PipelineId is then used in the PipelineParameters table to associate the parameters needed for that particular instance, as sketched below.
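
A sketch of that pattern, with illustrative values and assumed column names:

```sql
-- Illustrative sketch; table/column names are assumptions.
DECLARE @NewPipelineId INT;

-- Map the reused ADF pipeline to another stage; the identity column
-- yields a new PipelineId, i.e. a new unique pipeline instance.
INSERT INTO procfwk.Pipelines (StageId, PipelineName)
VALUES (7, 'LaunchDatabricksJob');

SET @NewPipelineId = SCOPE_IDENTITY();

-- Attach instance-specific invocation params to that new PipelineId.
INSERT INTO procfwk.PipelineParameters (PipelineId, ParameterName, ParameterValue)
VALUES
    (@NewPipelineId, 'ClassName', 'com.example.CampaignTransform'),
    (@NewPipelineId, 'JarName',   'campaign-transform.jar');
```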

Examples of triggers:

DailyTrigger-SalesIngestion => Scheduled At 3 AM Daily in AUS Time Zone => Parameter: JobId = 1

DailyTrigger-MarketingEgression => TumblingWindow trigger Scheduled at 4 AM Daily UTC => Parameter: JobId = 2

WeeklyTrigger-CampaignIngestion => Scheduled At 3 AM On Mondays in AUS Time Zone => Parameter: JobId = 3

I had this as a work-in-progress project locally for some time, but after looking at the recent conversations around this issue, I have pulled the recent changes from the master branch, applied my changes (metadata database changes only) and pushed them to my fork. I'm planning to work on the ADF pipeline changes tonight and test end to end.

Please have a look at the changes and let me know if they help:
https://github.com/Mallik-G/ADF.procfwk/tree/feature/mallik

Thanks,
Mallik

NJLangley (Contributor) commented

@mrpaulandrew I have pushed my implementation of this onto my fork here:
https://github.com/NJLangley/ADF.procfwk/tree/feature/batches

There are a few outstanding bits to fix:

  • Removing the stage from the pipelines table; it's now on the BatchPipelineLink table.
  • Sorting out the check that ensures two batches with the same pipeline/params cannot run at the same time. Maybe we could do something clever here to see if another batch is running the pipeline with exactly the same params and wait for that to finish, instead of firing it off again? One possible shape for that check is sketched after this list.
  • Updating the sample data procs; I'll try to get that sorted in the next few days.
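
As one hypothetical shape for that check: since the parameters hang off the PipelineId, two runs of the same PipelineId imply the same params, so the guard can reduce to looking for a running row with that PipelineId (table and column names are assumptions, not the fork's actual code):

```sql
-- Illustrative sketch; names are assumptions.
IF EXISTS (
    SELECT 1
    FROM procfwk.CurrentExecution
    WHERE PipelineId = @PipelineId
      AND PipelineStatus = 'Running'
)
BEGIN
    -- Alternatively, wait and re-check rather than failing outright.
    RAISERROR('This pipeline instance is already running with the same parameters.', 16, 1);
END;
```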

Let me know if you have questions / any issues testing it

@mrpaulandrew removed the Solution Area: Functions App (Azure Functions App - Middleware callers) label on Nov 13, 2020

@mrpaulandrew linked pull request #82 on Nov 19, 2020 that will close this issue