From d6cff82a44dd2c9bc6ab0c04502f31335d259e83 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Thu, 3 Oct 2024 14:21:46 -0700 Subject: [PATCH 01/19] draft tutorial on extern: --- docs/tutorials/tutorial-extern.md | 72 +++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) create mode 100644 docs/tutorials/tutorial-extern.md diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md new file mode 100644 index 000000000000..7fa0ede2a9e7 --- /dev/null +++ b/docs/tutorials/tutorial-extern.md @@ -0,0 +1,72 @@ +--- +id: tutorial-extern +title: Export query results +sidebar_label: Export results +description: How to use EXTERN to export query results. +--- + + + +This tutorial demonstrates how to use the [EXTERN](..multi-stage-query/reference#extern-function) function Apache Druid® to export data. + + +## Prerequisites + +Before you follow the steps in this tutorial, download Druid as described in the [Local quickstart](index.md) and have it running on your local machine. You don't need to load any data into the Druid cluster. + +You should be familiar with data querying in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first. + +## Configure Druid local export directory + +``` +sed -i -e $'$a\\\n\\\n\\\n#\\\n###Local export\\\n#\\\ndruid.export.storage.baseDir=/tmp/druid/' conf/druid/auto/_common/common.runtime.properties +``` + +This adds the following to the Druid configuration: + +``` +# +###Local export +# +druid.export.storage.baseDir=/tmp/druid/ +``` + +## Start Druid + +## Load data + +## Export data + +```sql +INSERT INTO + EXTERN( + local(exportPath => '/tmp/druid/query1') + ) +AS CSV +SELECT APPROX_COUNT_DISTINCT_DS_THETA(theta_uid) FILTER(WHERE "show" = 'Bridgerton') AS users +FROM ts_tutorial +``` + +## Learn more + +See the following topics for more information: + +* [Update data](./tutorial-update-data.md) for a tutorial on updating data in Druid. +* [Data updates](../data-management/update.md) for an overview of updating data in Druid. From 843c04d4b876cc522dcb1b79b81baa9001f7c214 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Thu, 3 Oct 2024 17:55:28 -0700 Subject: [PATCH 02/19] updated draft --- docs/tutorials/tutorial-extern.md | 76 +++++++++++++++++++++++++++---- 1 file changed, 68 insertions(+), 8 deletions(-) diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md index 7fa0ede2a9e7..bc7554b03be3 100644 --- a/docs/tutorials/tutorial-extern.md +++ b/docs/tutorials/tutorial-extern.md @@ -26,20 +26,32 @@ description: How to use EXTERN to export query results. This tutorial demonstrates how to use the [EXTERN](..multi-stage-query/reference#extern-function) function Apache Druid® to export data. - ## Prerequisites -Before you follow the steps in this tutorial, download Druid as described in the [Local quickstart](index.md) and have it running on your local machine. You don't need to load any data into the Druid cluster. +Before you follow the steps in this tutorial, download Druid as described in the [Local quickstart](index.md). +Do not start Druid, you'll do that as part of the tutorial. -You should be familiar with data querying in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first. +You should be familiar with ingesting and querying data in Druid. +If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first. 
-## Configure Druid local export directory +## Export query reaults to the local file system -``` -sed -i -e $'$a\\\n\\\n\\\n#\\\n###Local export\\\n#\\\ndruid.export.storage.baseDir=/tmp/druid/' conf/druid/auto/_common/common.runtime.properties +This example demonstrates how to configure Druid to export to the local file system. +It is OK to learn about EXTERN syntax for exporting data. +It is not suitable for production scenarios. + +### Configure Druid local export directory + +The following commands set the base path for the Druid exports to `/tmp/druid/`. +If the account running Druid does not have access to `/tmp/druid/`, change the path. For example: `/Users/Example/druid`. +From the root of the Druid distribution, run the following: + +```bash +export export_path="/tmp/druid" +sed -i -e $'$a\\\n\\\n\\\n#\\\n###Local export\\\n#\\\ndruid.export.storage.baseDir='$export_path conf/druid/auto/_common/common.runtime.properties ``` -This adds the following to the Druid configuration: +This adds the following section to the Druid quicstart `common.runtime.properties`: ``` # @@ -48,10 +60,58 @@ This adds the following to the Druid configuration: druid.export.storage.baseDir=/tmp/druid/ ``` -## Start Druid +### Start Druid + +From the root of the Druid distribution, launch Druid as follows: + +```bash +/bin/start-druid +``` ## Load data +From the Query view, run the following command to load the Wikipedia example data set: + +```sql +REPLACE INTO "wikipedia" OVERWRITE ALL +WITH "ext" AS ( + SELECT * + FROM TABLE( + EXTERN( + '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}', + '{"type":"json"}' + ) + ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR) +) +SELECT + TIME_PARSE("timestamp") AS "__time", + "isRobot", + "channel", + "flags", + "isUnpatrolled", + "page", + "diffUrl", + "added", + "comment", + "commentLength", + "isNew", + "isMinor", + "delta", + "isAnonymous", + "user", + "deltaBucket", + "deleted", + "namespace", + "cityName", + "countryName", + "regionIsoCode", + "metroCode", + "countryIsoCode", + "regionName" +FROM "ext" +PARTITIONED BY DAY +``` + ## Export data ```sql From 40b134c178fac8c0f2eac868a147cba183336ff4 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Fri, 4 Oct 2024 14:59:01 -0700 Subject: [PATCH 03/19] updates --- docs/multi-stage-query/reference.md | 19 ++++++++------- docs/tutorials/tutorial-extern.md | 38 +++++++++++++++++++++-------- 2 files changed, 38 insertions(+), 19 deletions(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 50f07ff80b48..711da00d0360 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -72,9 +72,8 @@ FROM TABLE( `name` and a `type`. The type can be `string`, `long`, `double`, or `float`. This row signature is used to map the external data into the SQL layer. -Variation 2, with the input schema expressed in SQL using an `EXTEND` clause. (See the next -section for more detail on `EXTEND`). 
This format also uses named arguments to make the -SQL a bit easier to read: +Variation 2, with the input schema expressed in SQL using an `EXTEND` clause. See the next +section for more detail on `EXTEND`. This format also uses named arguments to make the SQL easier to read: ```sql SELECT @@ -95,12 +94,14 @@ For more information, see [Read external data with EXTERN](concepts.md#read-exte #### `EXTERN` to export to a destination -`EXTERN` can be used to specify a destination where you want to export data to. -This variation of EXTERN requires one argument, the details of the destination as specified below. +You can use `EXTERN` to specify a destination to export data to. +This variation of EXTERN acppets the details of the destination as the only argument. This variation additionally requires an `AS` clause to specify the format of the exported rows. -While exporting data, some metadata files will also be created at the destination in addition to the data. These files will be created in a directory `_symlink_format_manifest`. -- `_symlink_format_manifest/manifest`: Lists the files which were created as part of the export. The file is in the symlink manifest format, and consists of a list of absolute paths to the files created. +When you export data, Druid creates metadata files in a subdirectory of the destination directory named `_symlink_format_manifest`: + +- `_symlink_format_manifest/manifest`: Lists the exported files using the symlink manifest format. It consists of a list of absolute paths to the export files. For example: + ```text s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker2-partition2.csv s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker1-partition1.csv @@ -112,8 +113,8 @@ s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker0-par Keep the following in mind when using EXTERN to export rows: - Only INSERT statements are supported. - Only `CSV` format is supported as an export format. -- Partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) aren't supported with export statements. -- You can export to Amazon S3 or local storage. +- Partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) aren't supported with EXTERN statements. +- You can export to Amazon S3, Google CS, or local storage. - The destination provided should contain no other files or directories. When you export data, use the `rowsPerPage` context parameter to restrict the size of exported files. diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md index bc7554b03be3..1c1d5dcec095 100644 --- a/docs/tutorials/tutorial-extern.md +++ b/docs/tutorials/tutorial-extern.md @@ -34,7 +34,7 @@ Do not start Druid, you'll do that as part of the tutorial. You should be familiar with ingesting and querying data in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first. -## Export query reaults to the local file system +## Export query results to the local file system This example demonstrates how to configure Druid to export to the local file system. It is OK to learn about EXTERN syntax for exporting data. @@ -43,7 +43,10 @@ It is not suitable for production scenarios. ### Configure Druid local export directory The following commands set the base path for the Druid exports to `/tmp/druid/`. -If the account running Druid does not have access to `/tmp/druid/`, change the path. For example: `/Users/Example/druid`. 
+If the account running Druid does not have access to `/tmp/druid/`, change the path. +For example: `/Users/Example/druid`. +If you change the path in this step, use the updated path in all subsequent steps. + From the root of the Druid distribution, run the following: ```bash @@ -60,17 +63,15 @@ This adds the following section to the Druid quicstart `common.runtime.propertie druid.export.storage.baseDir=/tmp/druid/ ``` -### Start Druid +### Start Druid and load sample data From the root of the Druid distribution, launch Druid as follows: ```bash -/bin/start-druid +./bin/start-druid ``` -## Load data - -From the Query view, run the following command to load the Wikipedia example data set: +From the [Query view](http://localhost:8888/unified-console.html#workbench), run the following command to load the Wikipedia example data set: ```sql REPLACE INTO "wikipedia" OVERWRITE ALL @@ -114,16 +115,33 @@ PARTITIONED BY DAY ## Export data +Run the following query to export query results to the path: +`/tmp/druid/wiki_example`. +The path must be a subdirectory of the `druid.export.storage.baseDir`. + + ```sql INSERT INTO EXTERN( - local(exportPath => '/tmp/druid/query1') + local(exportPath => '/tmp/druid/wiki_example') ) AS CSV -SELECT APPROX_COUNT_DISTINCT_DS_THETA(theta_uid) FILTER(WHERE "show" = 'Bridgerton') AS users -FROM ts_tutorial +SELECT "channel", + SUM("delta") AS "changes" +FROM "wikipedia" +GROUP BY 1 +LIMIT 10 ``` +Druid exports the results of the qurey to the `/tmp/druid/wiki_example` dirctory. +Run the following comannd to list the contents of + +```bash +ls '/tmp/druid/wiki_example' +``` + +The results are a csv file export of the data and a directory + ## Learn more See the following topics for more information: From 0be19d33bef8e1bcfcae6fbc107d3f81eae94060 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Fri, 4 Oct 2024 16:47:26 -0700 Subject: [PATCH 04/19] add sidebar, fix reference wording --- docs/multi-stage-query/reference.md | 64 +++++++++++++++-------------- website/sidebars.json | 3 +- 2 files changed, 35 insertions(+), 32 deletions(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 711da00d0360..b9c01d87c3a1 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -94,13 +94,13 @@ For more information, see [Read external data with EXTERN](concepts.md#read-exte #### `EXTERN` to export to a destination -You can use `EXTERN` to specify a destination to export data to. +You can use `EXTERN` to specify a destination to export data. This variation of EXTERN acppets the details of the destination as the only argument. This variation additionally requires an `AS` clause to specify the format of the exported rows. When you export data, Druid creates metadata files in a subdirectory of the destination directory named `_symlink_format_manifest`: -- `_symlink_format_manifest/manifest`: Lists the exported files using the symlink manifest format. It consists of a list of absolute paths to the export files. For example: +- `_symlink_format_manifest/manifest`: Lists absolute paths to exported files using the symlink manifest format. For example: ```text s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker2-partition2.csv @@ -129,10 +129,11 @@ SELECT FROM ``` -##### S3 +##### Amazon S3 -Export results to S3 by passing the function `s3()` as an argument to the `EXTERN` function. Note that this requires the `druid-s3-extensions`. 
-The `s3()` function is a Druid function that configures the connection. Arguments for `s3()` should be passed as named parameters with the value in single quotes like the following example: +To export results to S3, pass the `s3()` function as an argument to the EXTERN function. S3 export requires the `druid-s3-extensions`. +The `s3()` function configures the connection to AWS. +Pass all arguments for `s3()` as named parameters with the value in single quote. For example: ```sql INSERT INTO @@ -147,9 +148,9 @@ FROM
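As a concrete sketch of this pattern — the bucket, prefix, and `wikipedia` datasource below are illustrative assumptions, not values from this change — an export of a small aggregation could look like the following. The bucket and prefix would also need to appear on the allow list described next.

```sql
INSERT INTO
  EXTERN(
    s3(bucket => 'example-export-bucket', prefix => 'druid-exports/wiki')
  )
AS CSV
SELECT
  "channel",
  COUNT(*) AS "edits"
FROM "wikipedia"
GROUP BY 1
```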
Supported arguments for the function: -| Parameter | Required | Description | Default | -|-------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------| -| `bucket` | Yes | The S3 bucket to which the files are exported to. The bucket and prefix combination should be whitelisted in `druid.export.storage.s3.allowedExportPaths`. | n/a | +| Parameter | Required | Description | Default | +|---|---|---|---| +| `bucket` | Yes | The S3 bucket to which the files are exported to. The bucket and prefix combination should be whitelisted in `druid.export.storage.s3.allowedExportPaths`. | n/a | | `prefix` | Yes | Path where the exported files would be created. The export query expects the destination to be empty. If the location includes other files, then the query will fail. The bucket and prefix combination should be whitelisted in `druid.export.storage.s3.allowedExportPaths`. | n/a | The following runtime parameters must be configured to export into an S3 destination: @@ -162,10 +163,11 @@ The following runtime parameters must be configured to export into an S3 destina | `druid.export.storage.s3.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls to S3, however it requires more disk space to store the temporary chunks. | 100MiB | -##### GS +##### Google Cloud Storage -Export results to GCS by passing the function `google()` as an argument to the `EXTERN` function. Note that this requires the `druid-google-extensions`. -The `google()` function is a Druid function that configures the connection. Arguments for `google()` should be passed as named parameters with the value in single quotes like the following example: +To export query results to Google Cloud Storage (GCS), passing the `google()` function as an argument to the `EXTERN` function. +This requires the `druid-google-extensions`. +The `google()` function configures the connection to Google Cloud Storage. Pass the arguments for `google()` as named parameters with the value in single quotes. For example: ```sql INSERT INTO @@ -180,29 +182,29 @@ FROM
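For the destination above to be usable, the Google connector extension must be loaded and the export path allowed. A hypothetical `common.runtime.properties` fragment — the directory and bucket here are assumptions for illustration:

```
#
###GCS export
#
druid.export.storage.google.tempLocalDir=/tmp/druid/gcs-export-staging
druid.export.storage.google.allowedExportPaths=["gs://your_bucket/prefix/to/files/"]
```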
Supported arguments for the function: -| Parameter | Required | Description | Default | -|-------------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------| -| `bucket` | Yes | The GS bucket to which the files are exported to. The bucket and prefix combination should be whitelisted in `druid.export.storage.google.allowedExportPaths`. | n/a | -| `prefix` | Yes | Path where the exported files would be created. The export query expects the destination to be empty. If the location includes other files, then the query will fail. The bucket and prefix combination should be whitelisted in `druid.export.storage.google.allowedExportPaths`. | n/a | +| Parameter | Required | Description | Default | +|---|---|---|---| +| `bucket` | Yes | The GCS bucket destination for exported files. You must add the bucket and prefix combination to the `druid.export.storage.google.allowedExportPaths` allow list. | n/a | +| `prefix` | Yes | Destination path in the bucket to create exported files. The export query expects the destination path to be empty. If the location includes other files, the query will fail. You must add the bucket and prefix combination to the `druid.export.storage.google.allowedExportPaths` allow list. | n/a | -The following runtime parameters must be configured to export into a GCS destination: +Configure the following runtime parameters to export query results to a GCS destination: -| Runtime Parameter | Required | Description | Default | -|--------------------------------------------------|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------| -| `druid.export.storage.google.tempLocalDir` | Yes | Directory used on the local storage of the worker to store temporary files required while uploading the data. | n/a | -| `druid.export.storage.google.allowedExportPaths` | Yes | An array of GS prefixes that are allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. Example: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` | n/a | -| `druid.export.storage.google.maxRetry` | No | Defines the max number times to attempt GS API calls to avoid failures due to transient errors. | 10 | -| `druid.export.storage.google.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempDir`. A large chunk size reduces the API calls to GS; however, it requires more disk space to store the temporary chunks. | 4MiB | +| Runtime Parameter | Required | Description | Default | +|---|---|---|---| +| `druid.export.storage.google.tempLocalDir` | Yes | Directory for local storage where the worker stores temporary files before uploading the data to GCS. | n/a | +| `druid.export.storage.google.allowedExportPaths` | Yes | An array of GCS prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For eample: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` | n/a | +| `druid.export.storage.google.maxRetry` | No | The maximum number of attempts for GCS API calls to avoid failures due to transient errors. 
| 10 | +| `druid.export.storage.google.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to GS, but require more disk space to store temporary chunks. | 4 MiB | -##### LOCAL +##### Local file storage -You can export to the local storage, which exports the results to the filesystem of the MSQ worker. +You can export queryies to local storage, which writes the results to the filesystem on the MSQ worker. This is useful in a single node setup or for testing but is not suitable for production use cases. -Export results to local storage by passing the function `LOCAL()` as an argument for the `EXTERN FUNCTION`. -To use local storage as an export destination, the runtime property `druid.export.storage.baseDir` must be configured on the Indexer/Middle Manager. -This value must be set to an absolute path on the local machine. Exporting data will be allowed to paths which match the prefix set by this value. -Arguments to `LOCAL()` should be passed as named parameters with the value in single quotes in the following example: +To export results to local storage, passing the `LOCAL()` function as an argument to the EXTERN function. +You must configure the runtime property `druid.export.storage.baseDir` must be configured as an absolute path on the Indexer/Middle Manager to use local storage as an export destination. +You can exporting data to paths that match this value as a prefix. +Pass all arguments to `LOCAL()` as named parameters with the value in single quotes. For example: ```sql INSERT INTO @@ -217,9 +219,9 @@ FROM
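For completeness, a sketch of the single property this depends on — the same setting the tutorial in this patch series appends with `sed` (the path is an example):

```
#
###Local export
#
druid.export.storage.baseDir=/tmp/druid/
```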
Supported arguments to the function: -| Parameter | Required | Description | Default | -|-------------|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| --| -| `exportPath` | Yes | Absolute path to a subdirectory of `druid.export.storage.baseDir` used as the destination to export the results to. The export query expects the destination to be empty. If the location includes other files or directories, then the query will fail. | n/a | +| Parameter | Required | Description | Default | +|---|---|---|---|---| +| `exportPath` | Yes | Absolute path to a subdirectory of `druid.export.storage.baseDir` where Druid exports the querey results. The destination must be empty. If the location includes other files or directories, the query will fail. | n/a | For more information, see [Read external data with EXTERN](concepts.md#write-to-an-external-destination-with-extern). diff --git a/website/sidebars.json b/website/sidebars.json index e53040063188..b7cf66750388 100644 --- a/website/sidebars.json +++ b/website/sidebars.json @@ -38,7 +38,8 @@ "tutorials/tutorial-sql-query-view", "tutorials/tutorial-unnest-arrays", "tutorials/tutorial-query-deep-storage", - "tutorials/tutorial-latest-by"] + "tutorials/tutorial-latest-by", + "tutorials/tutorial-extern"] }, "tutorials/tutorial-sketches-theta", "tutorials/tutorial-jdbc", From 3d01d80c46a88c7621c2c98426497d743026b496 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Wed, 9 Oct 2024 12:21:55 -0700 Subject: [PATCH 05/19] update reference --- docs/multi-stage-query/reference.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index b9c01d87c3a1..01a30db4a621 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -95,12 +95,12 @@ For more information, see [Read external data with EXTERN](concepts.md#read-exte #### `EXTERN` to export to a destination You can use `EXTERN` to specify a destination to export data. -This variation of EXTERN acppets the details of the destination as the only argument. -This variation additionally requires an `AS` clause to specify the format of the exported rows. +This variation of EXTERN: +- accepts the details of the destination as the only argument +- requires an `AS` clause to specify the format of the exported rows. -When you export data, Druid creates metadata files in a subdirectory of the destination directory named `_symlink_format_manifest`: - -- `_symlink_format_manifest/manifest`: Lists absolute paths to exported files using the symlink manifest format. For example: +When you export data, Druid creates metadata files in a subdirectory of the destination directory named `_symlink_format_manifest`. +The `manifest` file within that directory`_symlink_format_manifest/manifest` lists absolute paths to exported files using the symlink manifest format. For example: ```text s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker2-partition2.csv @@ -114,7 +114,7 @@ Keep the following in mind when using EXTERN to export rows: - Only INSERT statements are supported. - Only `CSV` format is supported as an export format. - Partitioning (`PARTITIONED BY`) and clustering (`CLUSTERED BY`) aren't supported with EXTERN statements. 
-- You can export to Amazon S3, Google CS, or local storage. +- You can export to Amazon S3, Google GCS, or local storage. - The destination provided should contain no other files or directories. When you export data, use the `rowsPerPage` context parameter to restrict the size of exported files. @@ -129,7 +129,7 @@ SELECT FROM
``` -##### Amazon S3 +##### S3 - Amazon S3 To export results to S3, pass the `s3()` function as an argument to the EXTERN function. S3 export requires the `druid-s3-extensions`. The `s3()` function configures the connection to AWS. @@ -163,7 +163,7 @@ The following runtime parameters must be configured to export into an S3 destina | `druid.export.storage.s3.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls to S3, however it requires more disk space to store the temporary chunks. | 100MiB | -##### Google Cloud Storage +##### GOOGLE - Google Cloud Storage To export query results to Google Cloud Storage (GCS), passing the `google()` function as an argument to the `EXTERN` function. This requires the `druid-google-extensions`. @@ -196,7 +196,7 @@ Configure the following runtime parameters to export query results to a GCS dest | `druid.export.storage.google.maxRetry` | No | The maximum number of attempts for GCS API calls to avoid failures due to transient errors. | 10 | | `druid.export.storage.google.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to GS, but require more disk space to store temporary chunks. | 4 MiB | -##### Local file storage +##### LOCAL - local file storage You can export queryies to local storage, which writes the results to the filesystem on the MSQ worker. This is useful in a single node setup or for testing but is not suitable for production use cases. From 94c9beb479e08941aea4848191976a00dbd848f9 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Wed, 16 Oct 2024 13:59:06 -0700 Subject: [PATCH 06/19] final updates to reference --- docs/multi-stage-query/reference.md | 30 ++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 01a30db4a621..bc810032f10a 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -150,22 +150,22 @@ Supported arguments for the function: | Parameter | Required | Description | Default | |---|---|---|---| -| `bucket` | Yes | The S3 bucket to which the files are exported to. The bucket and prefix combination should be whitelisted in `druid.export.storage.s3.allowedExportPaths`. | n/a | -| `prefix` | Yes | Path where the exported files would be created. The export query expects the destination to be empty. If the location includes other files, then the query will fail. The bucket and prefix combination should be whitelisted in `druid.export.storage.s3.allowedExportPaths`. | n/a | +| `bucket` | Yes | S3 bucket destination for exported files. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths`. | n/a | +| `prefix` | Yes | Destination path in the bucket to create exported files. The export query expects the destination path to be empty. If the location includes other files, the query will fail. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths`. 
| n/a | -The following runtime parameters must be configured to export into an S3 destination: +Configure following runtime parameters to export to an S3 destination: -| Runtime Parameter | Required | Description | Default | -|----------------------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----| -| `druid.export.storage.s3.tempLocalDir` | Yes | Directory used on the local storage of the worker to store temporary files required while uploading the data. | n/a | -| `druid.export.storage.s3.allowedExportPaths` | Yes | An array of S3 prefixes that are whitelisted as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. Example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a | -| `druid.export.storage.s3.maxRetry` | No | Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. | 10 | -| `druid.export.storage.s3.chunkSize` | No | Defines the size of each chunk to temporarily store in `tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls to S3, however it requires more disk space to store the temporary chunks. | 100MiB | +| Runtime Parameter | Required | Description | Default | +|---|---|---|---| +| `druid.export.storage.s3.tempLocalDir` | Yes | Directory for local storage where the worker stores temporary files before uploading the data to S3. | n/a | +| `druid.export.storage.s3.allowedExportPaths` | Yes | Array of S3 prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For eample: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a | +| `druid.export.storage.s3.maxRetry` | No | The maximum number of attempts for S3 API calls to avoid failures due to transient errors. | 10 | +| `druid.export.storage.s3.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to S3, but require more disk space to store temporary chunks. | 100MiB | ##### GOOGLE - Google Cloud Storage -To export query results to Google Cloud Storage (GCS), passing the `google()` function as an argument to the `EXTERN` function. +To export query results to Google Cloud Storage (GCS), pass the `google()` function as an argument to the `EXTERN` function. This requires the `druid-google-extensions`. The `google()` function configures the connection to Google Cloud Storage. Pass the arguments for `google()` as named parameters with the value in single quotes. For example: @@ -184,17 +184,17 @@ Supported arguments for the function: | Parameter | Required | Description | Default | |---|---|---|---| -| `bucket` | Yes | The GCS bucket destination for exported files. You must add the bucket and prefix combination to the `druid.export.storage.google.allowedExportPaths` allow list. | n/a | +| `bucket` | Yes | GCS bucket destination for exported files. You must add the bucket and prefix combination to the `druid.export.storage.google.allowedExportPaths` allow list. | n/a | | `prefix` | Yes | Destination path in the bucket to create exported files. The export query expects the destination path to be empty. If the location includes other files, the query will fail. 
You must add the bucket and prefix combination to the `druid.export.storage.google.allowedExportPaths` allow list. | n/a | Configure the following runtime parameters to export query results to a GCS destination: | Runtime Parameter | Required | Description | Default | |---|---|---|---| -| `druid.export.storage.google.tempLocalDir` | Yes | Directory for local storage where the worker stores temporary files before uploading the data to GCS. | n/a | -| `druid.export.storage.google.allowedExportPaths` | Yes | An array of GCS prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For eample: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` | n/a | -| `druid.export.storage.google.maxRetry` | No | The maximum number of attempts for GCS API calls to avoid failures due to transient errors. | 10 | -| `druid.export.storage.google.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to GS, but require more disk space to store temporary chunks. | 4 MiB | +| `druid.export.storage.google.tempLocalDir` | Yes | Directory for local storage where the worker stores temporary files before uploading the data to GCS | n/a | +| `druid.export.storage.google.allowedExportPaths` | Yes | Array of GCS prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For eample: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` | n/a | +| `druid.export.storage.google.maxRetry` | No | The maximum number of attempts for GCS API calls to avoid failures due to transient errors | 10 | +| `druid.export.storage.google.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to GS, but require more disk space to store temporary chunks | 4 MiB | ##### LOCAL - local file storage From 3732dd3c4fadb4cba4449183f424b3488b4e78c7 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Wed, 16 Oct 2024 16:17:50 -0700 Subject: [PATCH 07/19] update --- docs/tutorials/tutorial-extern.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md index 1c1d5dcec095..6a36d295a9e7 100644 --- a/docs/tutorials/tutorial-extern.md +++ b/docs/tutorials/tutorial-extern.md @@ -113,7 +113,7 @@ FROM "ext" PARTITIONED BY DAY ``` -## Export data +### Query to export data Run the following query to export query results to the path: `/tmp/druid/wiki_example`. @@ -140,7 +140,7 @@ Run the following comannd to list the contents of ls '/tmp/druid/wiki_example' ``` -The results are a csv file export of the data and a directory +The results are a csv file export of the data and a directory ## Learn more From 35c33560fa94fbd4abe80cc1916b18c19ab438a9 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Thu, 21 Nov 2024 16:01:29 -0800 Subject: [PATCH 08/19] add cloud info to tutorial --- docs/multi-stage-query/reference.md | 12 ++++-- docs/multi-stage-query/security.md | 4 +- docs/tutorials/tutorial-extern.md | 64 +++++++++++++++++++++++++++-- 3 files changed, 72 insertions(+), 8 deletions(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index bc810032f10a..3d7d79e3c147 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -131,7 +131,10 @@ FROM
##### S3 - Amazon S3 -To export results to S3, pass the `s3()` function as an argument to the EXTERN function. S3 export requires the `druid-s3-extensions`. +To export results to S3, pass the `s3()` function as an argument to the EXTERN function. +Export to S3 requires the `druid-s3-extensions`. +For a list of S3 permissions the MSQ task engine needs to perform export, see [Permissions for durable storage](./security.md#s3). + The `s3()` function configures the connection to AWS. Pass all arguments for `s3()` as named parameters with the value in single quote. For example: @@ -166,7 +169,8 @@ Configure following runtime parameters to export to an S3 destination: ##### GOOGLE - Google Cloud Storage To export query results to Google Cloud Storage (GCS), pass the `google()` function as an argument to the `EXTERN` function. -This requires the `druid-google-extensions`. +Export to GCS requires the `druid-google-extensions`. + The `google()` function configures the connection to Google Cloud Storage. Pass the arguments for `google()` as named parameters with the value in single quotes. For example: ```sql @@ -202,8 +206,8 @@ You can export queryies to local storage, which writes the results to the filesy This is useful in a single node setup or for testing but is not suitable for production use cases. To export results to local storage, passing the `LOCAL()` function as an argument to the EXTERN function. -You must configure the runtime property `druid.export.storage.baseDir` must be configured as an absolute path on the Indexer/Middle Manager to use local storage as an export destination. -You can exporting data to paths that match this value as a prefix. +You must configure the runtime property `druid.export.storage.baseDir` as an absolute path on the Indexer/Middle Manager to use local storage as an export destination. +You can export data to paths that match this value as a prefix. Pass all arguments to `LOCAL()` as named parameters with the value in single quotes. For example: ```sql diff --git a/docs/multi-stage-query/security.md b/docs/multi-stage-query/security.md index 2ec7956e7cd5..9cbc011141f3 100644 --- a/docs/multi-stage-query/security.md +++ b/docs/multi-stage-query/security.md @@ -80,4 +80,6 @@ The MSQ task engine needs the following permissions for pushing, fetching, and r - `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read` to read and list files in durable storage - `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write` to write files in durable storage. - `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/add/action` to create files in durable storage. -- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete` to delete files when they're no longer needed. \ No newline at end of file +- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete` to delete files when they're no longer needed. + + \ No newline at end of file diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md index 6a36d295a9e7..f595ff22fd2d 100644 --- a/docs/tutorials/tutorial-extern.md +++ b/docs/tutorials/tutorial-extern.md @@ -24,6 +24,9 @@ description: How to use EXTERN to export query results. ~ under the License. --> +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + This tutorial demonstrates how to use the [EXTERN](..multi-stage-query/reference#extern-function) function Apache Druid® to export data. 
## Prerequisites
@@ -140,11 +143,66 @@ Run the following comannd to list the contents of
 
 The results are a csv file export of the data and a directory 
 
+## Export query results to cloud storage
+
+The steps to export to cloud storage are similar to exporting to the local file system.
+Druid supports Amazon S3 and Google Cloud Storage (GCS) as cloud export destinations.
+
+1. Enable the extension for your cloud storage destination:
+   - **Amazon S3**: `druid-s3-extensions`
+   - **Google GCS**: `druid-google-extensions`
+   See [Loading core extensions](../configuration/extensions.md#loading-core-extensions).
+1. Configure the additional properties for your cloud storage destination. Replace {CLOUD} with `s3` or `google` accordingly:
+   - `druid.export.storage.{CLOUD}.tempLocalDir`: The local temp directory where the query engine stages files to export.
+   - `druid.export.storage.{CLOUD}.allowedExportPaths`: The S3 or GCS prefixes allowed as Druid export locations. For example: `[\"s3://bucket1/export/\",\"s3://bucket2/export/\"]` or `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]`
+   - `druid.export.storage.{CLOUD}.maxRetry`: The maximum number of times to attempt cloud API calls to avoid failures from transient errors.
+   - `druid.export.storage.{CLOUD}.chunkSize`: The maximum size of individual data chunks to store in the temp directory.
+1. Verify the instance role has the correct permissions to the bucket and folders: read, write, create, and delete. See [Permissions for durable storage](../multi-stage-query/security.md#permissions-for-durable-storage).
+1. Use the query syntax for your cloud storage type. For example:
+
+   <Tabs>
+
+   <TabItem value="s3" label="Amazon S3">
+
+   ```sql
+   INSERT INTO
+     EXTERN(
+       s3(bucket => 'your_bucket', prefix => 'prefix/to/files')
+     )
+   AS CSV
+   SELECT "channel",
+          SUM("delta") AS "changes"
+   FROM "wikipedia"
+   GROUP BY 1
+   LIMIT 10
+   ```
+
+   </TabItem>
+
+   <TabItem value="gs" label="Google GCS">
+
+   ```sql
+   INSERT INTO
+     EXTERN(
+       google(bucket => 'your_bucket', prefix => 'prefix/to/files')
+     )
+   AS CSV
+   SELECT "channel",
+          SUM("delta") AS "changes"
+   FROM "wikipedia"
+   GROUP BY 1
+   LIMIT 10
+   ```
+
+   </TabItem>
+
+   </Tabs>
+
+1. When querying, use the `rowsPerPage` query context parameter to restrict the output file size. You can force Druid to write a single file by adding a very large LIMIT to the end of your query, but this technique isn't recommended.
 
 ## Learn more
 
 See the following topics for more information:
 
-* [Update data](./tutorial-update-data.md) for a tutorial on updating data in Druid.
-* [Data updates](../data-management/update.md) for an overview of updating data in Druid.
+* [Export to a destination](../multi-stage-query/reference.md#extern-to-export-to-a-destination) for reference material on the `EXTERN` function.
From 9b6ea763688d9f30781aca742487dac3765f60fd Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Thu, 21 Nov 2024 16:08:16 -0800 Subject: [PATCH 09/19] fix conflict --- docs/multi-stage-query/reference.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 3d7d79e3c147..ef401621d016 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -160,8 +160,8 @@ Configure following runtime parameters to export to an S3 destination: | Runtime Parameter | Required | Description | Default | |---|---|---|---| -| `druid.export.storage.s3.tempLocalDir` | Yes | Directory for local storage where the worker stores temporary files before uploading the data to S3. | n/a | | `druid.export.storage.s3.allowedExportPaths` | Yes | Array of S3 prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For eample: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a | +| `druid.export.storage.s3.tempLocalDir` | No | Directory for local storage where the worker stores temporary files before uploading the data to S3. | n/a | | `druid.export.storage.s3.maxRetry` | No | The maximum number of attempts for S3 API calls to avoid failures due to transient errors. | 10 | | `druid.export.storage.s3.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to S3, but require more disk space to store temporary chunks. | 100MiB | @@ -195,8 +195,8 @@ Configure the following runtime parameters to export query results to a GCS dest | Runtime Parameter | Required | Description | Default | |---|---|---|---| -| `druid.export.storage.google.tempLocalDir` | Yes | Directory for local storage where the worker stores temporary files before uploading the data to GCS | n/a | | `druid.export.storage.google.allowedExportPaths` | Yes | Array of GCS prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For eample: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` | n/a | +| `druid.export.storage.google.tempLocalDir` | No | Directory for local storage where the worker stores temporary files before uploading the data to GCS | n/a | | `druid.export.storage.google.maxRetry` | No | The maximum number of attempts for GCS API calls to avoid failures due to transient errors | 10 | | `druid.export.storage.google.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to GS, but require more disk space to store temporary chunks | 4 MiB | @@ -227,7 +227,7 @@ Supported arguments to the function: |---|---|---|---|---| | `exportPath` | Yes | Absolute path to a subdirectory of `druid.export.storage.baseDir` where Druid exports the querey results. The destination must be empty. If the location includes other files or directories, the query will fail. | n/a | -For more information, see [Read external data with EXTERN](concepts.md#write-to-an-external-destination-with-extern). +For more information, see [Export external data with EXTERN](concepts.md#write-to-an-external-destination-with-extern). 
### `INSERT` From 78ee8d5b2e9b3a18a745ad0bfa047f30f9dfd0a7 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Thu, 21 Nov 2024 16:20:14 -0800 Subject: [PATCH 10/19] fix link --- docs/tutorials/tutorial-extern.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md index f595ff22fd2d..135539779d56 100644 --- a/docs/tutorials/tutorial-extern.md +++ b/docs/tutorials/tutorial-extern.md @@ -27,7 +27,7 @@ description: How to use EXTERN to export query results. import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -This tutorial demonstrates how to use the [EXTERN](..multi-stage-query/reference#extern-function) function Apache Druid® to export data. +This tutorial demonstrates how to use the [EXTERN](../multi-stage-query/reference.md#extern-function) function Apache Druid® to export data. ## Prerequisites From 1a4368a97343eca77f807e42143e92cb2d85f515 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Tue, 26 Nov 2024 11:19:26 -0800 Subject: [PATCH 11/19] Update docs/multi-stage-query/reference.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --- docs/multi-stage-query/reference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index e32a3c6672a7..d62024fae254 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -95,7 +95,7 @@ For more information, see [Read external data with EXTERN](concepts.md#read-exte #### `EXTERN` to export to a destination You can use `EXTERN` to specify a destination to export data. -This variation of EXTERN: +This variation of `EXTERN` accepts the details of the destination as the only argument and requires an `AS` clause to specify the format of the exported rows. - accepts the details of the destination as the only argument - requires an `AS` clause to specify the format of the exported rows. From d5837ba1d02865697b15b3b7f1cb6ce3cfafa65b Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Tue, 26 Nov 2024 11:20:50 -0800 Subject: [PATCH 12/19] Update docs/multi-stage-query/reference.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --- docs/multi-stage-query/reference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index d62024fae254..67de9a8348e5 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -100,7 +100,7 @@ This variation of `EXTERN` accepts the details of the destination as the only ar - requires an `AS` clause to specify the format of the exported rows. When you export data, Druid creates metadata files in a subdirectory of the destination directory named `_symlink_format_manifest`. -The `manifest` file within that directory`_symlink_format_manifest/manifest` lists absolute paths to exported files using the symlink manifest format. For example: +Within the `_symlink_format_manifest/manifest` directory, the `manifest` file lists absolute paths to exported files using the symlink manifest format. 
For example: ```text s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker2-partition2.csv From 89fd50adf88226a763d89233cc108fb238c7259e Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Tue, 26 Nov 2024 11:25:38 -0800 Subject: [PATCH 13/19] Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --- docs/multi-stage-query/reference.md | 26 ++++++++++++-------------- 1 file changed, 12 insertions(+), 14 deletions(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 67de9a8348e5..9cbb1a3eea05 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -96,10 +96,8 @@ For more information, see [Read external data with EXTERN](concepts.md#read-exte You can use `EXTERN` to specify a destination to export data. This variation of `EXTERN` accepts the details of the destination as the only argument and requires an `AS` clause to specify the format of the exported rows. -- accepts the details of the destination as the only argument -- requires an `AS` clause to specify the format of the exported rows. -When you export data, Druid creates metadata files in a subdirectory of the destination directory named `_symlink_format_manifest`. +When you export data, Druid creates metadata files in a subdirectory named `_symlink_format_manifest`. Within the `_symlink_format_manifest/manifest` directory, the `manifest` file lists absolute paths to exported files using the symlink manifest format. For example: ```text @@ -131,12 +129,12 @@ FROM
##### S3 - Amazon S3 -To export results to S3, pass the `s3()` function as an argument to the EXTERN function. -Export to S3 requires the `druid-s3-extensions`. -For a list of S3 permissions the MSQ task engine needs to perform export, see [Permissions for durable storage](./security.md#s3). +To export results to S3, pass the `s3()` function as an argument to the `EXTERN` function. +Export to S3 requires the `druid-s3-extensions` extension +For a list of S3 permissions the MSQ task engine requires to perform export, see [Permissions for durable storage](./security.md#s3). The `s3()` function configures the connection to AWS. -Pass all arguments for `s3()` as named parameters with the value in single quote. For example: +Pass all arguments for `s3()` as named parameters with their values enclosed in single quotes. For example: ```sql INSERT INTO @@ -153,22 +151,22 @@ Supported arguments for the function: | Parameter | Required | Description | Default | |---|---|---|---| -| `bucket` | Yes | S3 bucket destination for exported files. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths`. | n/a | -| `prefix` | Yes | Destination path in the bucket to create exported files. The export query expects the destination path to be empty. If the location includes other files, the query will fail. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths`. | n/a | +| `bucket` | Yes | S3 bucket destination for exported files. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths` allow list. | n/a | +| `prefix` | Yes | Destination path in the bucket to create exported files. The export query expects the destination path to be empty. If the location includes other files, the query will fail. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths` allow list. | n/a | Configure following runtime parameters to export to an S3 destination: -| Runtime Parameter | Required | Description | Default | +| Runtime parameter | Required | Description | Default | |---|---|---|---| -| `druid.export.storage.s3.allowedExportPaths` | Yes | Array of S3 prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For eample: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a | +| `druid.export.storage.s3.allowedExportPaths` | Yes | Array of S3 prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For example: `[\"s3://bucket1/export/\", \"s3://bucket2/export/\"]` | n/a | | `druid.export.storage.s3.tempLocalDir` | No | Directory for local storage where the worker stores temporary files before uploading the data to S3. | n/a | -| `druid.export.storage.s3.maxRetry` | No | The maximum number of attempts for S3 API calls to avoid failures due to transient errors. | 10 | -| `druid.export.storage.s3.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to S3, but require more disk space to store temporary chunks. | 100MiB | +| `druid.export.storage.s3.maxRetry` | No | Maximum number of attempts for S3 API calls to avoid failures due to transient errors. | 10 | +| `druid.export.storage.s3.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. 
Large chunk sizes reduce the number of API calls to S3, but require more disk space to store temporary chunks. | 100MiB | ##### GOOGLE - Google Cloud Storage To export query results to Google Cloud Storage (GCS), pass the `google()` function as an argument to the `EXTERN` function. -Export to GCS requires the `druid-google-extensions`. +Export to GCS requires the `druid-google-extensions` extension. The `google()` function configures the connection to Google Cloud Storage. Pass the arguments for `google()` as named parameters with the value in single quotes. For example: From bb4c2f40845af3237586465aa458316ff8a546fe Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Tue, 26 Nov 2024 11:29:00 -0800 Subject: [PATCH 14/19] Apply suggestions from code review Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --- docs/multi-stage-query/reference.md | 24 ++++++++-------- docs/tutorials/tutorial-extern.md | 43 ++++++++++++++--------------- 2 files changed, 33 insertions(+), 34 deletions(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 9cbb1a3eea05..a907a39c6f04 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -190,23 +190,23 @@ Supported arguments for the function: Configure the following runtime parameters to export query results to a GCS destination: -| Runtime Parameter | Required | Description | Default | +| Runtime parameter | Required | Description | Default | |---|---|---|---| -| `druid.export.storage.google.allowedExportPaths` | Yes | Array of GCS prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For eample: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` | n/a | -| `druid.export.storage.google.tempLocalDir` | No | Directory for local storage where the worker stores temporary files before uploading the data to GCS | n/a | -| `druid.export.storage.google.maxRetry` | No | The maximum number of attempts for GCS API calls to avoid failures due to transient errors | 10 | -| `druid.export.storage.google.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to GS, but require more disk space to store temporary chunks | 4 MiB | +| `druid.export.storage.google.allowedExportPaths` | Yes | Array of GCS prefixes allowed as export destinations. Export queries fail if the export destination does not match any of the configured prefixes. For example: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` | n/a | +| `druid.export.storage.google.tempLocalDir` | No | Directory for local storage where the worker stores temporary files before uploading the data to GCS. | n/a | +| `druid.export.storage.google.maxRetry` | No | Maximum number of attempts for GCS API calls to avoid failures due to transient errors. | 10 | +| `druid.export.storage.google.chunkSize` | No | Individual chunk size to store temporarily in `tempDir`. Large chunk sizes reduce the number of API calls to GS, but require more disk space to store temporary chunks. | 4 MiB | ##### LOCAL - local file storage -You can export queryies to local storage, which writes the results to the filesystem on the MSQ worker. +You can export queries to local storage. This process writes the results to the filesystem on the MSQ worker. This is useful in a single node setup or for testing but is not suitable for production use cases. 
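Per the `_symlink_format_manifest` description earlier, a completed local export leaves CSV part files plus a metadata subdirectory at the destination — roughly the following, with names varying by query ID (illustrative sketch):

```bash
ls -R /tmp/druid/wiki_example
# Expected layout (illustrative):
#   query-<queryId>-worker0-partition0.csv
#   _symlink_format_manifest/manifest
```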
-To export results to local storage, passing the `LOCAL()` function as an argument to the EXTERN function.
-You must configure the runtime property `druid.export.storage.baseDir` as an absolute path on the Indexer/Middle Manager to use local storage as an export destination.
+To export results to local storage, pass the `LOCAL()` function as an argument to the `EXTERN` function.
+You must configure the runtime property `druid.export.storage.baseDir` as an absolute path on the Indexer or Middle Manager to use local storage as an export destination.
 You can export data to paths that match this value as a prefix.
-Pass all arguments to `LOCAL()` as named parameters with the value in single quotes. For example:
+Pass all arguments to `LOCAL()` as named parameters with values enclosed in single quotes. For example:
 
 ```sql
 INSERT INTO
@@ -219,11 +219,11 @@ SELECT
 FROM
``` -Supported arguments to the function: +Supported arguments for the function: | Parameter | Required | Description | Default | -|---|---|---|---|---| -| `exportPath` | Yes | Absolute path to a subdirectory of `druid.export.storage.baseDir` where Druid exports the querey results. The destination must be empty. If the location includes other files or directories, the query will fail. | n/a | +|---|---|---|---| +| `exportPath` | Yes | Absolute path to a subdirectory of `druid.export.storage.baseDir` where Druid exports the query results. The destination must be empty. If the location includes other files or directories, the query will fail. | n/a | For more information, see [Export external data with EXTERN](concepts.md#write-to-an-external-destination-with-extern). diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md index 135539779d56..224662fe1682 100644 --- a/docs/tutorials/tutorial-extern.md +++ b/docs/tutorials/tutorial-extern.md @@ -27,26 +27,25 @@ description: How to use EXTERN to export query results. import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -This tutorial demonstrates how to use the [EXTERN](../multi-stage-query/reference.md#extern-function) function Apache Druid® to export data. +This tutorial demonstrates how to use the Apache Druid® SQL [EXTERN](../multi-stage-query/reference.md#extern-function) function to export data. ## Prerequisites Before you follow the steps in this tutorial, download Druid as described in the [Local quickstart](index.md). -Do not start Druid, you'll do that as part of the tutorial. +Don't start Druid, you'll do that as part of the tutorial. You should be familiar with ingesting and querying data in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first. ## Export query results to the local file system -This example demonstrates how to configure Druid to export to the local file system. -It is OK to learn about EXTERN syntax for exporting data. -It is not suitable for production scenarios. +This example demonstrates how to configure Druid to export data to the local file system. +While you can use this approach to learn about EXTERN syntax for exporting data, it's not suitable for production scenarios. ### Configure Druid local export directory The following commands set the base path for the Druid exports to `/tmp/druid/`. -If the account running Druid does not have access to `/tmp/druid/`, change the path. +If the account running Druid doesn't have access to `/tmp/druid/`, change the path. For example: `/Users/Example/druid`. If you change the path in this step, use the updated path in all subsequent steps. @@ -57,7 +56,7 @@ export export_path="/tmp/druid" sed -i -e $'$a\\\n\\\n\\\n#\\\n###Local export\\\n#\\\ndruid.export.storage.baseDir='$export_path conf/druid/auto/_common/common.runtime.properties ``` -This adds the following section to the Druid quicstart `common.runtime.properties`: +This adds the following section to the Druid `common.runtime.properties` configuration file located in `conf/druid/auto/_common`: ``` # @@ -118,7 +117,7 @@ PARTITIONED BY DAY ### Query to export data -Run the following query to export query results to the path: +Open a new tab and run the following query to export query results to the path: `/tmp/druid/wiki_example`. The path must be a subdirectory of the `druid.export.storage.baseDir`. 
@@ -136,29 +135,29 @@ GROUP BY 1 LIMIT 10 ``` -Druid exports the results of the qurey to the `/tmp/druid/wiki_example` dirctory. -Run the following comannd to list the contents of +Druid exports the results of the query to the `/tmp/druid/wiki_example` directory. +Run the following command to list the contents of the directory. ```bash -ls '/tmp/druid/wiki_example' +ls /tmp/druid/wiki_example ``` -The results are a csv file export of the data and a directory +The results are a CSV file export of the data and a directory. ## Export query results to cloud storage The steps to export to cloud storage are similar to exporting to the local file system. -Druid supports Amazon S3 or Google GCS as cloud export storage destinations. +Druid supports Amazon S3 or Google Cloud Storage (GCS) as cloud storage destinations. 1. Enable the extension for your cloud storage destination: - **Amazon S3**: `druid-s3-extensions` - - **Google GCS**: `google-extensions` - See [Loading core extensions](../configuration/extensions.md#loading-core-extensions). -1. Configure the additional properties for your cloud storage destination. Replace {CLOUD} with `s3` or `google` accordingly: - - `druid.export.storage.{CLOUD}.tempLocalDir`: The local temp directory where the query engine stages files to export. - - `druid.export.storage.{CLOUD}.allowedExportPaths`: The s3 or GS prefixes allowed as Druid export locations. For example `[\"s3://bucket1/export/\",\"s3://bucket2/export/\"]` or `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` - - `druid.export.storage.{CLOUD}.maxRetry`: The maximum number times to attempt cloud API calls to avoid failures from transient errors. - - `druid.export.storage.s3.chunkSize`: The maximum size of individual data chunks to store in the temp directory.' + - **GCS**: `google-extensions` + See [Loading core extensions](../configuration/extensions.md#loading-core-extensions) for more information. +1. Configure the additional properties for your cloud storage destination. Replace `{CLOUD}` with `s3` or `google` accordingly: + - `druid.export.storage.{CLOUD}.tempLocalDir`: Local temporary directory where the query engine stages files to export. + - `druid.export.storage.{CLOUD}.allowedExportPaths`: S3 or GS prefixes allowed as Druid export locations. For example `[\"s3://bucket1/export/\",\"s3://bucket2/export/\"]` or `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]`. + - `druid.export.storage.{CLOUD}.maxRetry`: Maximum number of times to attempt cloud API calls to avoid failures from transient errors. + - `druid.export.storage.s3.chunkSize`: Maximum size of individual data chunks to store in the temporary directory. 1. Verify the instance role has the correct permissions to the bucket and folders: read, write, create, and delete. See [Permissions for durable storage](../multi-stage-query/security.md#permissions-for-durable-storage). 1. Use the query syntax for your cloud storage type. For example: @@ -198,11 +197,11 @@ Druid supports Amazon S3 or Google GCS as cloud export storage destinations. -1. When querying, use the `rowsPerPage` query context parameter to restrict the output file size. It is possible to add very large LIMIT to the end of your query to force Druid to create one file, however this technique is not recommended. +1. When querying, use the `rowsPerPage` query context parameter to restrict the output file size. While it's possible to add a very large LIMIT at the end of your query to force Druid to create a single file, we don't recommend this technique. 
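+
+   As an illustrative sketch of that setting, the request below assumes the SQL task API endpoint on the quickstart Router; the export path `/tmp/druid/wiki_rows`, the selected columns, and the 50,000 row target are placeholder values rather than required settings:
+
+   ```bash
+   # Submit an export query with rowsPerPage set in the query context;
+   # the MSQ task engine targets roughly this many rows per output file.
+   curl -X POST http://localhost:8888/druid/v2/sql/task \
+     -H 'Content-Type: application/json' \
+     -d @- <<'EOF'
+   {
+     "query": "INSERT INTO EXTERN(local(exportPath => '/tmp/druid/wiki_rows')) AS CSV SELECT page, added FROM wikipedia",
+     "context": { "rowsPerPage": 50000 }
+   }
+   EOF
+   ```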
## Learn more
 
 See the following topics for more information:
 
-* [Export to a destination](../multi-stage-query/reference.md#extern-to-export-to-a-destination) for a reference of the EXTERN
+* [Export to a destination](../multi-stage-query/reference.md#extern-to-export-to-a-destination) for a reference of the EXTERN function.
 * [SQL-based ingestion security](../multi-stage-query/security.md/#permissions-for-durable-storage) for cloud permission requirements for MSQ.

From 7a77a46a81ef7eedd105190290158edc525fa3fd Mon Sep 17 00:00:00 2001
From: Charles Smith
Date: Tue, 26 Nov 2024 11:40:03 -0800
Subject: [PATCH 15/19] fixes

---
 docs/tutorials/tutorial-extern.md | 108 +++++++++++++++---------------
 1 file changed, 55 insertions(+), 53 deletions(-)

diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md
index 135539779d56..ec61513bd8d9 100644
--- a/docs/tutorials/tutorial-extern.md
+++ b/docs/tutorials/tutorial-extern.md
@@ -54,7 +54,7 @@ From the root of the Druid distribution, run the following:
 
 ```bash
 export export_path="/tmp/druid"
-sed -i -e $'$a\\\n\\\n\\\n#\\\n###Local export\\\n#\\\ndruid.export.storage.baseDir='$export_path conf/druid/auto/_common/common.runtime.properties
+sed -i -e $'$a\\\n\\\n\\\n#\\\n###Local export\\\n#\\\ndruid.export.storage.baseDir='"$export_path" conf/druid/auto/_common/common.runtime.properties
 ```
 
 This adds the following section to the Druid quicstart `common.runtime.properties`:
@@ -68,53 +68,52 @@ druid.export.storage.baseDir=/tmp/druid/
 
 ### Start Druid and load sample data
 
-From the root of the Druid distribution, launch Druid as follows:
-
-```bash
-./bin/start-druid
-```
-
-From the [Query view](http://localhost:8888/unified-console.html#workbench), run the following command to load the Wikipedia example data set:
-
-```sql
-REPLACE INTO "wikipedia" OVERWRITE ALL
-WITH "ext" AS (
-  SELECT *
-  FROM TABLE(
-    EXTERN(
-      '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}',
-      '{"type":"json"}'
-    )
-  ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR)
-)
-SELECT
-  TIME_PARSE("timestamp") AS "__time",
-  "isRobot",
-  "channel",
-  "flags",
-  "isUnpatrolled",
-  "page",
-  "diffUrl",
-  "added",
-  "comment",
-  "commentLength",
-  "isNew",
-  "isMinor",
-  "delta",
-  "isAnonymous",
-  "user",
-  "deltaBucket",
-  "deleted",
-  "namespace",
-  "cityName",
-  "countryName",
-  "regionIsoCode",
-  "metroCode",
-  "countryIsoCode",
-  "regionName"
-FROM "ext"
-PARTITIONED BY DAY
-```
+1. From the root of the Druid distribution, launch Druid as follows:
+
+   ```bash
+   ./bin/start-druid
+   ```
+1. After Druid starts, open http://localhost:8888/ in your browser to access the Web Console.
+1. 
From the [Query view](http://localhost:8888/unified-console.html#workbench), run the following command to load the Wikipedia example data set: + ```sql + REPLACE INTO "wikipedia" OVERWRITE ALL + WITH "ext" AS ( + SELECT * + FROM TABLE( + EXTERN( + '{"type":"http","uris":["https://druid.apache.org/data/wikipedia.json.gz"]}', + '{"type":"json"}' + ) + ) EXTEND ("isRobot" VARCHAR, "channel" VARCHAR, "timestamp" VARCHAR, "flags" VARCHAR, "isUnpatrolled" VARCHAR, "page" VARCHAR, "diffUrl" VARCHAR, "added" BIGINT, "comment" VARCHAR, "commentLength" BIGINT, "isNew" VARCHAR, "isMinor" VARCHAR, "delta" BIGINT, "isAnonymous" VARCHAR, "user" VARCHAR, "deltaBucket" BIGINT, "deleted" BIGINT, "namespace" VARCHAR, "cityName" VARCHAR, "countryName" VARCHAR, "regionIsoCode" VARCHAR, "metroCode" BIGINT, "countryIsoCode" VARCHAR, "regionName" VARCHAR) + ) + SELECT + TIME_PARSE("timestamp") AS "__time", + "isRobot", + "channel", + "flags", + "isUnpatrolled", + "page", + "diffUrl", + "added", + "comment", + "commentLength", + "isNew", + "isMinor", + "delta", + "isAnonymous", + "user", + "deltaBucket", + "deleted", + "namespace", + "cityName", + "countryName", + "regionIsoCode", + "metroCode", + "countryIsoCode", + "regionName" + FROM "ext" + PARTITIONED BY DAY + ``` ### Query to export data @@ -136,7 +135,7 @@ GROUP BY 1 LIMIT 10 ``` -Druid exports the results of the qurey to the `/tmp/druid/wiki_example` dirctory. +Druid exports the results of the query to the `/tmp/druid/wiki_example` dirctory. Run the following comannd to list the contents of ```bash @@ -150,15 +149,18 @@ The results are a csv file export of the data and a directory The steps to export to cloud storage are similar to exporting to the local file system. Druid supports Amazon S3 or Google GCS as cloud export storage destinations. -1. Enable the extension for your cloud storage destination: +1. Enable the extension for your cloud storage destination. See [Loading core extensions](../configuration/extensions.md#loading-core-extensions). - **Amazon S3**: `druid-s3-extensions` - **Google GCS**: `google-extensions` - See [Loading core extensions](../configuration/extensions.md#loading-core-extensions). 1. Configure the additional properties for your cloud storage destination. Replace {CLOUD} with `s3` or `google` accordingly: - `druid.export.storage.{CLOUD}.tempLocalDir`: The local temp directory where the query engine stages files to export. - - `druid.export.storage.{CLOUD}.allowedExportPaths`: The s3 or GS prefixes allowed as Druid export locations. For example `[\"s3://bucket1/export/\",\"s3://bucket2/export/\"]` or `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` + - `druid.export.storage.{CLOUD}.allowedExportPaths`: The s3 or GS prefixes allowed as Druid export locations. For example: + + **S3**: `[\"s3://bucket1/export/\",\"s3://bucket2/export/\"]` + + **GCS**: `[\"gs://bucket1/export/\", \"gs://bucket2/export/\"]` - `druid.export.storage.{CLOUD}.maxRetry`: The maximum number times to attempt cloud API calls to avoid failures from transient errors. - - `druid.export.storage.s3.chunkSize`: The maximum size of individual data chunks to store in the temp directory.' + - `druid.export.storage.s3.chunkSize`: The maximum size of individual data chunks to store in the temp directory. 1. Verify the instance role has the correct permissions to the bucket and folders: read, write, create, and delete. See [Permissions for durable storage](../multi-stage-query/security.md#permissions-for-durable-storage). 1. 
Use the query syntax for your cloud storage type. For example: From fdace3fc81fb2b1d46bd4e43f566aeedf4820dae Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Tue, 26 Nov 2024 11:41:53 -0800 Subject: [PATCH 16/19] make hyperlink to console --- docs/tutorials/tutorial-extern.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md index ec61513bd8d9..0648ed1ee89b 100644 --- a/docs/tutorials/tutorial-extern.md +++ b/docs/tutorials/tutorial-extern.md @@ -73,7 +73,7 @@ druid.export.storage.baseDir=/tmp/druid/ ```bash ./bin/start-druid ``` -1. After Druid starts, open http://localhost:8888/ in your browser to access the Web Console. +1. After Druid starts, open [http://localhost:8888/](http://localhost:8888/) in your browser to access the Web Console. 1. From the [Query view](http://localhost:8888/unified-console.html#workbench), run the following command to load the Wikipedia example data set: ```sql REPLACE INTO "wikipedia" OVERWRITE ALL From 634bf1f84dc0735eaba461649b0916beee016b58 Mon Sep 17 00:00:00 2001 From: Charles Smith Date: Tue, 26 Nov 2024 14:37:00 -0800 Subject: [PATCH 17/19] Update docs/multi-stage-query/reference.md Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com> --- docs/multi-stage-query/reference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index a907a39c6f04..06666f5cd1bc 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -130,7 +130,7 @@ FROM
##### S3 - Amazon S3
 
 To export results to S3, pass the `s3()` function as an argument to the `EXTERN` function.
-Export to S3 requires the `druid-s3-extensions` extension
+Export to S3 requires the `druid-s3-extensions` extension.
 For a list of S3 permissions the MSQ task engine requires to perform export, see [Permissions for durable storage](./security.md#s3).
 
 The `s3()` function configures the connection to AWS.

From 41cb3c9c5dab630d711db42267032387b3060a32 Mon Sep 17 00:00:00 2001
From: Charles Smith
Date: Tue, 26 Nov 2024 16:34:15 -0800
Subject: [PATCH 18/19] Apply suggestions from code review

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
---
 docs/multi-stage-query/reference.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md
index 06666f5cd1bc..d34ba1bdd4e3 100644
--- a/docs/multi-stage-query/reference.md
+++ b/docs/multi-stage-query/reference.md
@@ -154,7 +154,7 @@ Supported arguments for the function:
 | `bucket` | Yes | S3 bucket destination for exported files. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths` allow list. | n/a |
 | `prefix` | Yes | Destination path in the bucket to create exported files. The export query expects the destination path to be empty. If the location includes other files, the query will fail. You must add the bucket and prefix combination to the `druid.export.storage.s3.allowedExportPaths` allow list. | n/a |
 
-Configure following runtime parameters to export to an S3 destination:
+Configure the following runtime parameters to export to an S3 destination:
 
 | Runtime parameter | Required | Description | Default |
@@ -168,7 +168,7 @@
 To export query results to Google Cloud Storage (GCS), pass the `google()` function as an argument to the `EXTERN` function.
 Export to GCS requires the `druid-google-extensions` extension.
 
-The `google()` function configures the connection to Google Cloud Storage. Pass the arguments for `google()` as named parameters with the value in single quotes. For example:
+The `google()` function configures the connection to GCS. Pass the arguments for `google()` as named parameters with their values enclosed in single quotes. For example:
 
 ```sql
 INSERT INTO

From 1d29aff3fc80fb2fa372da3fa1f5ae6a02f1d8e4 Mon Sep 17 00:00:00 2001
From: Charles Smith
Date: Tue, 10 Dec 2024 16:42:57 -0800
Subject: [PATCH 19/19] fix typos

---
 docs/tutorials/tutorial-extern.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/tutorials/tutorial-extern.md b/docs/tutorials/tutorial-extern.md
index b4d7ab1fcc39..d44dd19a1542 100644
--- a/docs/tutorials/tutorial-extern.md
+++ b/docs/tutorials/tutorial-extern.md
@@ -134,8 +134,8 @@ GROUP BY 1
 LIMIT 10
 ```
 
-Druid exports the results of the query to the `/tmp/druid/wiki_example` dirctory.
-Run the following comannd to list the contents of
+Druid exports the results of the query to the `/tmp/druid/wiki_example` directory.
+Run the following command to list the contents of the directory:
 
 ```bash
 ls /tmp/druid/wiki_example
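+# As an optional follow-up sketch, preview the exported data. Druid generates
+# the CSV file names per query, so the wildcard here is an assumption for
+# illustration rather than an exact path.
+head -n 5 /tmp/druid/wiki_example/*.csv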