feat(ingest): improve ingest deploy command #10944

Merged: 10 commits, Jul 26, 2024
Changes from 5 commits
79 changes: 53 additions & 26 deletions docs/cli.md
@@ -102,6 +102,7 @@ Command Options:
--test-source-connection When set, ingestion will only test the source connection details from the recipe
--no-progress If enabled, mute intermediate progress ingestion reports
```

#### ingest --dry-run

The `--dry-run` option of the `ingest` command performs all of the ingestion steps, except writing to the sink. This is useful to validate that the
@@ -133,23 +134,8 @@ By default `--preview` creates 10 workunits. But if you wish to try producing mo
datahub ingest -c ./examples/recipes/example_to_datahub_rest.dhub.yaml -n --preview --preview-workunits=20
```

#### ingest deploy

The `ingest deploy` command instructs the cli to upload an ingestion recipe to DataHub to be run by DataHub's [UI Ingestion](./ui-ingestion.md).
This command can also be used to schedule the ingestion while uploading or even to update existing sources. It will upload to the remote instance the
CLI is connected to, not the sink of the recipe. Use `datahub init` to set the remote if not already set.

To schedule a recipe called "test", to run at 5am everyday, London time with the recipe configured in a local `recipe.yaml` file:
````shell
datahub ingest deploy --name "test" --schedule "5 * * * *" --time-zone "Europe/London" -c recipe.yaml
````

To update an existing recipe please use the `--urn` parameter to specify the id of the recipe to update.

**Note:** Updating a recipe will result in a replacement of the existing options with what was specified in the cli command.
I.e: Not specifying a schedule in the cli update command will remove the schedule from the recipe to be updated.

#### ingest --no-default-report

By default, the cli sends an ingestion report to DataHub, which allows you to see the result of all cli-based ingestion in the UI. This can be turned off with the `--no-default-report` flag.

```shell
@@ -180,6 +166,53 @@ failure_log:
filename: ./path/to/failure.json
```

### ingest deploy

The `ingest deploy` command instructs the CLI to upload an ingestion recipe to DataHub to be run by DataHub's [UI Ingestion](./ui-ingestion.md).
This command can also be used to schedule the ingestion while uploading, or even to update existing sources. It will upload to the remote instance the
CLI is connected to, not to the sink of the recipe. Use `datahub init` to set the remote if it is not already set.

Comment on lines +169 to +174

Contributor

Fix grammatical error: add a comma after "uploading" for clarity.

- This command can also be used to schedule the ingestion while uploading or even to update existing sources.
+ This command can also be used to schedule the ingestion while uploading, or even to update existing sources.

This command will automatically create a new recipe if it doesn't exist, or update it if it does.
Note that this is a complete update: any options that were previously set but are omitted from the update will be removed.
For example, not specifying a schedule in the CLI update command will remove the schedule from the recipe being updated.
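
A quick illustration of these replace semantics (the recipe name and file are illustrative):

```shell
# First deploy: creates the ingestion source with a daily 5am schedule
datahub ingest deploy --name "test" --schedule "0 5 * * *" --time-zone "Europe/London" -c recipe.yaml

# Second deploy without --schedule: the schedule is removed rather than preserved
datahub ingest deploy --name "test" -c recipe.yaml
```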

**Basic example**

To schedule a recipe called "Snowflake Integration" to run at 5am every day, London time, with the recipe configured in a local `recipe.yaml` file:

```shell
datahub ingest deploy --name "Snowflake Integration" --schedule "0 5 * * *" --time-zone "Europe/London" -c recipe.yaml
```

By default, the ingestion recipe's identifier is generated by hashing the name.
You can override the urn generation by passing the `--urn` flag to the CLI.
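
As a rough sketch, the default urn derivation mirrors the `_make_ingestion_urn` helper added in this PR (see the `ingest_cli.py` diff below):

```python
from datahub.emitter.mce_builder import datahub_guid

def make_ingestion_urn(name: str) -> str:
    # Hash the recipe name into a stable GUID so that repeated deploys
    # of the same name resolve to the same ingestion source.
    guid = datahub_guid({"name": name})
    return f"urn:li:dataHubIngestionSource:deploy-{guid}"

print(make_ingestion_urn("Snowflake Integration"))
```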

**Using `deployment` to avoid CLI args**

As an alternative to configuring settings from the CLI, all of these settings can also be set in the `deployment` field of the recipe.

```yml
# deployment_recipe.yml
deployment:
  name: "Snowflake Integration"
  schedule: "0 5 * * *"
  time_zone: "Europe/London"

source: ...
```

```shell
datahub ingest deploy -c deployment_recipe.yml
# Note that when deployment options are specified in the recipe, all other CLI options are ignored.
```

This can be particularly useful when you want all recipes to be stored in version control.

```shell
# Deploy every yml recipe in a directory
ls recipe_directory/*.yml | xargs -n 1 -I {} datahub ingest deploy -c {}
```

### init

The init command is used to tell `datahub` where your DataHub instance is located. By default, the CLI points to a DataHub instance on localhost.
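
A typical first-time setup might look like the following; the host and token are placeholders, and the exact prompts can vary by version:

```shell
datahub init
# Prompts for the DataHub host (e.g. http://localhost:8080)
# and, optionally, a personal access token.
```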
@@ -242,8 +275,6 @@ The [metadata deletion guide](./how/delete-metadata.md) covers the various optio

### exists

**🤝 Version compatibility** : `acryl-datahub>=0.10.2.4`

The exists command can be used to check if an entity exists in DataHub.

```shell
@@ -253,7 +284,6 @@ true
false
```


### get

The `get` command allows you to easily retrieve metadata from DataHub via the REST API. This works for both versioned aspects and timeseries aspects. For timeseries aspects, it fetches the latest value.
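
For example, to fetch a single aspect of a dataset (the urn and aspect name are illustrative):

```shell
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --aspect ownership
```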
@@ -314,6 +344,7 @@ Update succeeded with status 200
```

#### put platform

**🤝 Version Compatibility:** `acryl-datahub>0.8.44.4`

The **put platform** command instructs `datahub` to create or update metadata about a data platform. This is very useful if you are using a custom data platform, to set up its logo and display name for a native UI experience.
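
A hedged sketch of such an invocation; the platform name, display name, and logo URL are all placeholders:

```shell
datahub put platform --name my_custom_platform \
  --display_name "My Custom Platform" \
  --logo "https://example.com/my_logo.png"
```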
@@ -346,6 +377,7 @@ datahub timeline --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccou
The `dataset` command allows you to interact with the dataset entity.

The `get` operation can be used to read a dataset into a yaml file.

```shell
datahub dataset get --urn "$URN" --to-file "$FILE_NAME"
```
@@ -358,7 +390,6 @@ datahub dataset upsert -f dataset.yaml

An example `dataset.yaml` can be found in [dataset.yaml](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/cli_usage/dataset/dataset.yaml).


### user (User Entity)

The `user` command allows you to interact with the User entity.
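
For instance, upserting users from a local yaml definition (the file name is illustrative; see the repository's `cli_usage` examples for the full yaml shape):

```shell
datahub user upsert -f user.yaml
```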
@@ -411,7 +442,6 @@ members:
display_name: "Joe's Hub"
```


### dataproduct (Data Product Entity)

**🤝 Version Compatibility:** `acryl-datahub>=0.10.2.4`
@@ -566,14 +596,12 @@ Use this to delete a Data Product from DataHub. Default to `--soft` which preser
# > datahub dataproduct delete --urn "urn:li:dataProduct:pet_of_the_week" --hard
```


## Miscellaneous Admin Commands

### lite (experimental)

The lite group of commands allows you to run an embedded, lightweight DataHub instance for command-line exploration of your metadata. It is intended for developer-tool-oriented usage rather than as a production server instance for DataHub. See [DataHub Lite](./datahub_lite.md) for more information about how you can ingest metadata into DataHub Lite and explore your metadata easily.
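
A brief, hedged taste of the workflow, assuming metadata has already been ingested into DataHub Lite:

```shell
# Explore the local metadata tree from the command line
datahub lite ls
```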


### telemetry

To help us understand how people are using DataHub, we collect anonymous usage statistics on actions such as command invocations via Mixpanel.
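
Telemetry can be switched off, or back on, at any time:

```shell
datahub telemetry disable
datahub telemetry enable
```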
@@ -640,7 +668,6 @@ External Entities Affected: None
Old Entities Migrated = {'urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)', 'urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)'}
```


## Alternate Installation Options

### Using docker
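
A sketch of running the CLI from a container, assuming the `acryldata/datahub-ingestion` image; the tag is illustrative:

```shell
docker pull acryldata/datahub-ingestion:head
docker run --rm acryldata/datahub-ingestion:head version
```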
@@ -673,7 +700,7 @@ We use a plugin architecture so that you can install only the dependencies you a
Please see our [Integrations page](https://datahubproject.io/integrations) if you want to filter on the features offered by each source.

| Plugin Name | Install Command | Provides |
|------------------------------------------------------------------------------------------------| ---------------------------------------------------------- | --------------------------------------- |
| ---------------------------------------------------------------------------------------------- | ---------------------------------------------------------- | --------------------------------------- |
| [metadata-file](./generated/ingestion/sources/metadata-file.md) | _included by default_ | File source and sink |
| [athena](./generated/ingestion/sources/athena.md) | `pip install 'acryl-datahub[athena]'` | AWS Athena source |
| [bigquery](./generated/ingestion/sources/bigquery.md) | `pip install 'acryl-datahub[bigquery]'` | BigQuery source |
@@ -715,7 +742,7 @@ Please see our [Integrations page](https://datahubproject.io/integrations) if yo
### Sinks

| Plugin Name | Install Command | Provides |
|-------------------------------------------------------------------| -------------------------------------------- | -------------------------- |
| ----------------------------------------------------------------- | -------------------------------------------- | -------------------------- |
| [metadata-file](../metadata-ingestion/sink_docs/metadata-file.md) | _included by default_ | File source and sink |
| [console](../metadata-ingestion/sink_docs/console.md) | _included by default_ | Console sink |
| [datahub-rest](../metadata-ingestion/sink_docs/datahub.md) | `pip install 'acryl-datahub[datahub-rest]'` | DataHub sink over REST API |
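
Extras from the source and sink tables can be combined in a single install, for example pairing the BigQuery source with the REST sink:

```shell
pip install 'acryl-datahub[bigquery,datahub-rest]'
```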
173 changes: 113 additions & 60 deletions metadata-ingestion/src/datahub/cli/ingest_cli.py
@@ -16,7 +16,9 @@
import datahub as datahub_package
from datahub.cli import cli_utils
from datahub.cli.config_utils import CONDENSED_DATAHUB_CONFIG_PATH
from datahub.configuration.common import ConfigModel
from datahub.configuration.config_loader import load_config_file
from datahub.emitter.mce_builder import datahub_guid
from datahub.ingestion.graph.client import get_default_graph
from datahub.ingestion.run.connection import ConnectionManager
from datahub.ingestion.run.pipeline import Pipeline
@@ -204,6 +206,24 @@ async def run_ingestion_and_check_upgrade() -> int:
# don't raise SystemExit if there's no error


def _make_ingestion_urn(name: str) -> str:
    guid = datahub_guid(
        {
            "name": name,
        }
    )
    return f"urn:li:dataHubIngestionSource:deploy-{guid}"


class DeployOptions(ConfigModel):
    name: str
    description: Optional[str] = None
    schedule: Optional[str] = None
    time_zone: str = "UTC"
    cli_version: Optional[str] = None
    executor_id: str = "default"


@ingest.command()
@upgrade.check_upgrade
@telemetry.with_telemetry()
@@ -212,7 +232,12 @@ async def run_ingestion_and_check_upgrade() -> int:
"--name",
type=str,
help="Recipe Name",
required=True,
)
@click.option(
"--description",
type=str,
help="Recipe description",
required=False,
)
@click.option(
"-c",
@@ -224,7 +249,7 @@ async def run_ingestion_and_check_upgrade() -> int:
@click.option(
    "--urn",
    type=str,
    help="Urn of recipe to update. Creates recipe if provided urn does not exist",
    help="Urn of recipe to update. If not specified here or in the recipe's pipeline_name, this will create a new ingestion source.",
    required=False,
)
@click.option(
@@ -256,7 +281,8 @@ async def run_ingestion_and_check_upgrade() -> int:
default="UTC",
)
def deploy(
    name: str,
    name: Optional[str],
    description: Optional[str],
    config: str,
    urn: Optional[str],
    executor_id: str,
@@ -280,69 +306,96 @@ def deploy(
        resolve_env_vars=False,
    )

    deploy_options_raw = pipeline_config.pop("deployment", None)
    if deploy_options_raw is not None:
        deploy_options = DeployOptions.parse_obj(deploy_options_raw)

        logger.info(f"Using {repr(deploy_options)}")

        if urn:
            raise click.UsageError(
                "Cannot specify both --urn and deployment field in config"
            )
        elif name:
            raise click.UsageError(
                "Cannot specify both --name and deployment field in config"
            )
        else:
            logger.info(
                "The deployment field is set in the recipe, any CLI args will be ignored"
Collaborator
@darnaut darnaut Jul 19, 2024

@hsheth2 IMHO this is odd behavior. Typically command-line arguments override config options, which is probably what most would expect. It would also simplify the logic here - first initialize deploy_options from the config and then set any options that were provided on the command-line e.g. in pseudocode:

   deploy_options = load from config
   deploy_options merge from command line
   if not deploy_options.foobar:
      error - option foobar must be provided

Collaborator Author

yeah I agree - will make some tweaks
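
A runnable sketch of the precedence the reviewer describes (config first, explicit CLI flags win); the names here are hypothetical, not the shipped implementation:

```python
from typing import Optional
from pydantic import BaseModel

class DeployOptions(BaseModel):
    name: Optional[str] = None
    schedule: Optional[str] = None
    time_zone: str = "UTC"

def merge_options(from_config: DeployOptions, **cli_args: Optional[str]) -> DeployOptions:
    # Start from the recipe's deployment block, then overlay any
    # CLI flags that were explicitly provided.
    merged = from_config.copy(update={k: v for k, v in cli_args.items() if v is not None})
    if not merged.name:
        raise ValueError("a name must be provided via --name or the config")
    return merged

opts = merge_options(DeployOptions(name="Snowflake Integration"), schedule="0 5 * * *")
print(opts)
```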

            )

        # When urn/name is not specified, we will generate a unique urn based on the deployment name.
        urn = _make_ingestion_urn(deploy_options.name)
        logger.info(f"Will create or update a recipe with urn: {urn}")
    elif name:
        if not urn:
            # When the urn is not specified, generate an urn based on the name.
            urn = _make_ingestion_urn(name)
            logger.info(
                f"No urn was explicitly specified, will create or update the recipe with urn: {urn}"
            )

        deploy_options = DeployOptions(
            name=name,
            description=description,
            schedule=schedule,
            time_zone=time_zone,
            cli_version=cli_version,
            executor_id=executor_id,
        )

        logger.info(f"Using {repr(deploy_options)}")
    else:  # neither deployment_name nor name is set
        raise click.UsageError(
            "Either --name must be set or deployment_name specified in the config"
        )

    # Invariant - at this point, both urn and deploy_options are set.

    variables: dict = {
        "urn": urn,
        "name": name,
        "name": deploy_options.name,
        "description": deploy_options.description,
        "type": pipeline_config["source"]["type"],
        "recipe": json.dumps(pipeline_config),
        "executorId": executor_id,
        "version": cli_version,
        "executorId": deploy_options.executor_id,
        "version": deploy_options.cli_version,
    }

    if schedule is not None:
        variables["schedule"] = {"interval": schedule, "timezone": time_zone}

    if urn:

        graphql_query: str = textwrap.dedent(
            """
            mutation updateIngestionSource(
                $urn: String!,
                $name: String!,
                $type: String!,
                $schedule: UpdateIngestionSourceScheduleInput,
                $recipe: String!,
                $executorId: String!
                $version: String) {

                updateIngestionSource(urn: $urn, input: {
                    name: $name,
                    type: $type,
                    schedule: $schedule,
                    config: {
                        recipe: $recipe,
                        executorId: $executorId,
                        version: $version,
                    }
                })
            }
            """
        )
    else:
        logger.info("No URN specified recipe urn, will create a new recipe.")
        graphql_query = textwrap.dedent(
            """
            mutation createIngestionSource(
                $name: String!,
                $type: String!,
                $schedule: UpdateIngestionSourceScheduleInput,
                $recipe: String!,
                $executorId: String!,
                $version: String) {

                createIngestionSource(input: {
                    name: $name,
                    type: $type,
                    schedule: $schedule,
                    config: {
                        recipe: $recipe,
                        executorId: $executorId,
                        version: $version,
                    }
                })
            }
            """
        )
    if deploy_options.schedule is not None:
        variables["schedule"] = {
            "interval": deploy_options.schedule,
            "timezone": deploy_options.time_zone,
        }

    # The updateIngestionSource endpoint can actually do upserts as well.
    graphql_query: str = textwrap.dedent(
        """
        mutation updateIngestionSource(
            $urn: String!,
            $name: String!,
            $description: String,
            $type: String!,
            $schedule: UpdateIngestionSourceScheduleInput,
            $recipe: String!,
            $executorId: String!
            $version: String) {

            updateIngestionSource(urn: $urn, input: {
                name: $name,
                description: $description,
                type: $type,
                schedule: $schedule,
                config: {
                    recipe: $recipe,
                    executorId: $executorId,
                    version: $version,
                }
            })
        }
        """
    )

    response = datahub_graph.execute_graphql(graphql_query, variables=variables)
