
🎉 Destination S3: parquet output #3908

Merged: 41 commits merged into master from liren/s3-destination-parquet on Jun 14, 2021

Conversation

@tuliren (Contributor) commented Jun 5, 2021:

What

  • Address Destination S3: support writing Parquet data format #3642.
  • Refactor S3 destination so that it is easier to add new formats.
    • Rename S3OutputFormatter to S3Writer, and add a writer package.
    • Extract shared writer logic to an abstract class BaseS3Writer.
    • Move CSV logic to its own package.
    • Move CSV-specific constants to their own constants file.
    • Add a util package.
    • Extract shared acceptance code to an abstract class S3DestinationAcceptanceTest.
  • Support parquet file output:
    • Use AvroParquetWriter to output Parquet files on S3.
      • The hadoop-aws dependency lets the writer stream data to S3 as it is generated, instead of buffering whole files locally.
    • Create JsonSchemaConverter to convert a JSON schema to an Avro schema.
    • Use json2avro.converter to convert each JSON object to an Avro record based on that schema (a rough sketch of this path follows below).
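
For readers new to this stack, here is a minimal, self-contained sketch of the Parquet path, assuming an S3A path, a hand-written Avro schema, and a hypothetical bucket name; the connector's real configuration and schema conversion differ from this:

// Minimal sketch: write Avro records to S3 as Parquet through the s3a filesystem.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.util.HadoopOutputFile;

public class ParquetUploadSketch {

  public static void main(String[] args) throws Exception {
    // hadoop-aws provides the "s3a" filesystem, so parts are uploaded to S3
    // while records are written, instead of buffering a whole file locally.
    Configuration hadoopConfig = new Configuration();
    hadoopConfig.set("fs.s3a.access.key", System.getenv("AWS_ACCESS_KEY_ID"));
    hadoopConfig.set("fs.s3a.secret.key", System.getenv("AWS_SECRET_ACCESS_KEY"));

    // In the connector, the schema comes from JsonSchemaConverter; here it is hand-written.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"exchange_rate\",\"fields\":["
            + "{\"name\":\"currency\",\"type\":\"string\"},"
            + "{\"name\":\"rate\",\"type\":\"double\"}]}");

    Path path = new Path("s3a://my-test-bucket/exchange_rate.parquet"); // hypothetical bucket
    try (ParquetWriter<GenericData.Record> writer = AvroParquetWriter
        .<GenericData.Record>builder(HadoopOutputFile.fromPath(path, hadoopConfig))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      GenericData.Record record = new GenericRecordBuilder(schema)
          .set("currency", "HKD")
          .set("rate", 10.0)
          .build();
      writer.write(record); // rows are grouped into Parquet row groups by the writer
    }
  }
}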

Recommended reading order

  1. spec.json
  2. S3ParquetWriter
  3. JsonSchemaConverter

Pre-merge Checklist

Expand the checklist which is relevant for this PR.

Connector checklist

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • Secrets are annotated with airbyte_secret in output spec
  • Unit & integration tests added as appropriate (and are passing)
    • Community members: please provide proof of this succeeding locally, e.g. a screenshot or copy-pasted acceptance test output. To run acceptance tests for a Python connector, follow the instructions in the README. For Java connectors, run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • /test connector=connectors/<name> command as documented here is passing.
    • Community members can skip this, Airbyters will run this for you.
  • Code reviews completed
  • Credentials added to GitHub CI if needed and not already present (see the instructions for injecting secrets into CI).
  • Documentation updated
    • README
    • CHANGELOG.md
    • Reference docs in the docs/integrations/ directory.
  • Build is successful
  • Connector version bumped as described here
  • New Connector version released on Dockerhub by running the /publish command described here
  • No major blockers
  • PR merged into master branch
  • Follow up tickets have been created
  • Associated tickets have been closed & stakeholders notified

@marcosmarxm marcosmarxm mentioned this pull request Jun 8, 2021
@tuliren tuliren force-pushed the liren/s3-destination-parquet branch from bd97fe8 to 176db13 on June 10, 2021 17:17
.put("HKD", 10)
.put("NZD", 700)
.put("HKD", 10.0)
.put("NZD", 700.0)
tuliren (PR author):
HKD and NZD are typed as number in the catalog. All other entries have decimal values for these two fields, so I'd like to change these values to decimals as well to keep the types consistent.

In Parquet (and probably other formats) we need strict type mappings, and number is mapped to double. If these two fields flip between integers and decimals, I have to do arbitrary conversions to pass the acceptance test, which seems unnecessary.

} catch (Exception e) {
return Optional.empty();
}
}
tuliren (PR author):
This is the arbitrary type conversion I mentioned in the comment about changing the integer HKD values to decimals. Once the test data has consistent typing, this conversion can be removed.
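
For illustration, the conversion being referred to looks roughly like this sketch (the class and method names here are hypothetical, not the PR's actual code): coerce a value declared as number into a double, and return Optional.empty() when that is not possible, which is what the quoted catch block does.

import java.util.Optional;
import com.fasterxml.jackson.databind.JsonNode;

public class LenientNumberConversionSketch {

  // Try to read a JSON value declared as "number" as a double.
  static Optional<Double> toDouble(JsonNode value) {
    try {
      if (value == null || value.isNull()) {
        return Optional.empty();
      }
      if (value.isNumber()) {
        // Covers both integer-looking (10) and decimal (10.0) test values.
        return Optional.of(value.asDouble());
      }
      // Fall back to parsing string representations such as "10".
      return Optional.of(Double.parseDouble(value.asText()));
    } catch (Exception e) {
      return Optional.empty();
    }
  }
}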

.numStreams(S3CsvConstants.DEFAULT_NUM_STREAMS)
.queueCapacity(S3DestinationConstants.DEFAULT_QUEUE_CAPACITY)
.numUploadThreads(S3DestinationConstants.DEFAULT_UPLOAD_THREADS)
.partSize(S3DestinationConstants.DEFAULT_PART_SIZE_MD);
Contributor:
Why don't we use the same Hadoop S3 uploader here?

tuliren (PR author):
Which uploader do you mean by "hadoop s3 uploader"?

Contributor:
If I am reading the PR correctly, it seems we are using two different ways to push data to S3:

  • ParquetWriter
  • StreamTransferManager

I am just curious whether it is possible to use a similar approach from the Hadoop package to push the CSV data.

tuliren (PR author):
Got it. The two writers output data in different data structures, and we do need them for different formats. The Parquet writer organizes data in Parquet row groups, while the stream transfer manager writes data line by line.
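
To make the contrast concrete, here is a rough sketch of the StreamTransferManager-based CSV path (illustrative settings and names, not the PR's exact code): records are printed to a multipart-upload output stream line by line, whereas the Parquet writer buffers rows into row groups before flushing.

import alex.mojaki.s3upload.MultiPartOutputStream;
import alex.mojaki.s3upload.StreamTransferManager;
import com.amazonaws.services.s3.AmazonS3;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

public class CsvUploadSketch {

  static void upload(AmazonS3 s3Client, String bucket, String objectKey) throws IOException {
    // The manager exposes output streams backed by an S3 multipart upload.
    StreamTransferManager uploadManager =
        new StreamTransferManager(bucket, objectKey, s3Client).numUploadThreads(2); // illustrative setting
    MultiPartOutputStream outputStream = uploadManager.getMultiPartOutputStreams().get(0);
    try (CSVPrinter csvPrinter = new CSVPrinter(
        new PrintWriter(outputStream, true, StandardCharsets.UTF_8), CSVFormat.DEFAULT)) {
      csvPrinter.printRecord("2021-06-14", "HKD", 10.0); // one CSV line per record
    }
    // Closing the printer closes the underlying stream; then finish the multipart upload.
    uploadManager.complete();
  }
}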

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Jun 10, 2021
@tuliren tuliren force-pushed the liren/s3-destination-parquet branch from a9790dd to b369dd7 on June 11, 2021 05:01
@tuliren tuliren changed the title 🎉 S3 destination parquet format 🎉 Destination S3: parquet output Jun 11, 2021
@tuliren tuliren marked this pull request as ready for review June 11, 2021 05:49
@github-actions github-actions bot added area/documentation Improvements or additions to documentation and removed area/connectors Connector related issues labels Jun 11, 2021
@@ -15,10 +15,18 @@ dependencies {
implementation project(':airbyte-integrations:connectors:destination-jdbc')
implementation files(project(':airbyte-integrations:bases:base-java').airbyteDocker.outputs)

// csv
Contributor:
I appreciate the comments here to make clear what dependencies are for what!

@davinchia (Contributor) left a comment:
Nice work! Appreciate the comments + the extensive tests.

My comments are:

  1. Minor readability changes.
  2. Some better commenting to help future OSS contributors. I know there are some who want to contribute other formats.
  3. Possibility of using an OSS tool to do the Json -> Avro conversion. Not a blocker. Thought it would be nice to not write our own tool.
  4. Possibility of sharing the CsvWriter stream transfer manager construction with the CopyConsumer.
  5. Possibility of using the PrimitiveJsonSchema class as an enum instead of having a separate listing in the S3 directory.

The last 2 points are more me thinking out loud. I'm not entirely sure they are good ideas. These changes can be done in follow-up PRs since this one is getting big as is.

Comment on lines 98 to 111
  if (hasFailed) {
    LOGGER.warn("Failure detected. Aborting upload of stream '{}'...", stream.getName());
    csvPrinter.close();
    outputStream.close();
    uploadManager.abort();
    LOGGER.warn("Upload of stream '{}' aborted.", stream.getName());
  } else {
    LOGGER.info("Uploading remaining data for stream '{}'.", stream.getName());
    csvPrinter.close();
    outputStream.close();
    uploadManager.complete();
    LOGGER.info("Upload completed for stream '{}'.", stream.getName());
  }
}
Contributor:
Suggested change (replacing the block above with the following):
  csvPrinter.close();
  outputStream.close();
  if (hasFailed) {
    LOGGER.warn("Failure detected. Aborting upload of stream '{}'...", stream.getName());
    uploadManager.abort();
    LOGGER.warn("Upload of stream '{}' aborted.", stream.getName());
    return;
  }
  LOGGER.info("Uploading remaining data for stream '{}'.", stream.getName());
  uploadManager.complete();
  LOGGER.info("Upload completed for stream '{}'.", stream.getName());
}

Contributor:
Just slightly easier to read.

tuliren (PR author):
I will move the two close statements before the if check.

Usually I am a fan of returning early. However, given how short the if blocks are, the code is already pretty readable, and I think returning early would make it slightly more confusing.

import java.util.Map;

/**
* This helper class tracks whether a Json has special field name that needs to be replaced with a
Contributor:
Nit: can we also add why this is required? As is, I'm not sure whether this is because Parquet expects it this way or because we want to standardise things to make them simpler (it looks like the latter).

@tuliren (PR author) commented Jun 14, 2021:
It is the former. Parquet only allows these characters in a record name: [a-zA-Z0-9_]. Otherwise I wouldn't have gone through all the trouble of doing this. The necessity of this tracker actually complicates things a lot.

Will update the comment to reflect that.
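
As a rough illustration of the tracking being discussed (hypothetical class and method names, not the PR's): replace any character outside [a-zA-Z0-9_] with an underscore and remember the original name so the substitution can be reported or reversed later.

import java.util.HashMap;
import java.util.Map;

public class FieldNameSanitizerSketch {

  private final Map<String, String> sanitizedToOriginal = new HashMap<>();

  // Replace characters Parquet does not allow in record/field names with underscores.
  public String sanitize(String originalName) {
    String sanitized = originalName.replaceAll("[^a-zA-Z0-9_]", "_");
    if (!sanitized.equals(originalName)) {
      sanitizedToOriginal.put(sanitized, originalName);
    }
    return sanitized;
  }

  // Tracks which fields were renamed, so the mapping can be surfaced later.
  public Map<String, String> getRenamedFields() {
    return sanitizedToOriginal;
  }
}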

@ChristopheDuong (Contributor) commented Jun 14, 2021:
Parquet only allows these characters in a record name: [a-zA-Z0-9_].

We currently have to deal with some naming conventions because some destinations allow different subsets of characters in identifier names. These are handled in classes deriving from airbyte-integrations/bases/base-java/src/main/java/io/airbyte/integrations/destination/NamingConventionTransformer.java.

It seems that for S3 Parquet, the difference is that it needs to apply those conventions to field names too. Other destinations only care about conventions for stream names and namespaces (table & schema).

But would it make sense to regroup them in the same kind of class/hierarchy too?

tuliren (PR author):
But would it make sense to regroup them in the same kind of class/hierarchy too?

@ChristopheDuong, can you elaborate on this? What do you mean by "regroup them in the same kind of class"?

@ChristopheDuong (Contributor) commented Jun 15, 2021:
@ChristopheDuong, can you elaborate on this? What do you mean by "regroup them in the same kind of class"?

Should we move some of the string-transformation logic in this code to a class named S3NameTransformer that extends NamingConventionTransformer, like we do with SnowflakeSQLNameTransformer or RedshiftSQLNameTransformer, etc.?

tuliren (PR author):
I see.

The name conversion logic for Parquet is exactly the same as the one in StandardNameTransformer. So there is nothing to override. It seems unnecessary to create a new class.


S3ParquetFormatConfig formatConfig = (S3ParquetFormatConfig) config.getFormatConfig();
Configuration hadoopConfig = getHadoopConfig(config);
this.parquetWriter = AvroParquetWriter.<GenericData.Record>builder(HadoopOutputFile.fromPath(path, hadoopConfig))
Contributor:
I am sad we have to rely on hadoop libraries to do this: https://issues.apache.org/jira/browse/PARQUET-1822

Gah.

Review thread on airbyte-integrations/connectors/destination-s3/README.md (outdated, resolved).

@Override
public void close(boolean hasFailed) throws IOException {
if (hasFailed) {
Contributor:
Same thought as on the CsvWriter close method. I prefer returning early instead of an else block.

@tuliren (PR author) commented Jun 14, 2021:

@davinchia, thanks for the code review. I know it's a long PR.

@subodh1810 (Contributor) left a comment:
This looks great. I just have one comment.
I would have loved it if we could somehow have reused the Hadoop library. Is there not a CSV version of HadoopOutputFile so that we could have reused a lot of code?

@@ -109,7 +109,8 @@ jobs:
ZENDESK_TALK_TEST_CREDS: ${{ secrets.ZENDESK_TALK_TEST_CREDS }}
ZOOM_INTEGRATION_TEST_CREDS: ${{ secrets.ZOOM_INTEGRATION_TEST_CREDS }}
PLAID_INTEGRATION_TEST_CREDS: ${{ secrets.PLAID_INTEGRATION_TEST_CREDS }}
DESTINATION_S3_INTEGRATION_TEST_CREDS: ${{ secrets.DESTINATION_S3_INTEGRATION_TEST_CREDS }}
DESTINATION_S3_CSV_INTEGRATION_TEST_CREDS: ${{ secrets.DESTINATION_S3_CSV_INTEGRATION_TEST_CREDS }}
Contributor:
I think we can have only one set of credentials and change the config method to read only the relevant information from the credentials. For instance, the format part can be hardcoded in the test class's config, i.e. for the CSV test it can be

"format": {
    "format_type": "CSV",
    "flattening": "Root level flattening"
  }

and for the Parquet test it can be

"format": {
    "format_type": "Parquet",
    "compression_codec": "GZIP"
  }

and the sensitive information can be populated from the credentials.
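
A sketch of how that could look in the acceptance test, under the assumption that a single shared S3 secret holds only the sensitive settings and the test hardcodes the non-sensitive format block (class and method names here are illustrative):

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.nio.file.Files;
import java.nio.file.Path;

public class S3TestConfigSketch {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  // baseConfigPath points at the shared S3 credentials JSON (bucket, keys, region).
  static ObjectNode parquetTestConfig(Path baseConfigPath) throws Exception {
    ObjectNode config = (ObjectNode) MAPPER.readTree(Files.readString(baseConfigPath));
    ObjectNode format = MAPPER.createObjectNode()
        .put("format_type", "Parquet")
        .put("compression_codec", "GZIP");
    config.set("format", format); // hardcode the non-sensitive format block in the test
    return config;
  }
}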

@github-actions github-actions bot added the area/connectors Connector related issues label Jun 14, 2021
@tuliren (PR author) commented Jun 14, 2021:

/test connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/937430136
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/937430136

@tuliren (PR author) commented Jun 14, 2021:

/publish connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/937443364
❌ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/937443364

@tuliren (PR author) commented Jun 14, 2021:

/publish connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/937491071
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/937491071

@davinchia (Contributor) left a comment:
Great stuff @tuliren!

Nothing else from me; feel free to merge whenever!

@tuliren (PR author) commented Jun 14, 2021:

/publish connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/937515876
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/937515876

@tuliren tuliren merged commit 87552b2 into master Jun 14, 2021
@tuliren tuliren deleted the liren/s3-destination-parquet branch June 14, 2021 23:49

static List<JsonSchemaType> getTypes(String fieldName, JsonNode typeProperty) {
if (typeProperty == null) {
throw new IllegalStateException(String.format("Field %s has no type", fieldName));
Contributor:
BTW some catalogs are producing fields without types, so it's not so uncommon...

For example, the source-facebook does this I think... Does this exception cancel the sync of such catalogs? Should the fields be ignored or defaulted to a string for example instead?

@tuliren (PR author) commented Jun 15, 2021:
Got it. This is good to know.

In general, a schemaless source is not suitable for Parquet. There is another big problem regarding how we are going to handle additionalProperties whose types are known.

I will submit a follow-up PR to take care of it.
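
For illustration, the fallback suggested above (defaulting an untyped field to string instead of failing the sync) could look roughly like the following; the enum and method here are stand-ins for the connector's JsonSchemaType and getTypes, not the change that eventually landed.

import com.fasterxml.jackson.databind.JsonNode;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MissingTypeFallbackSketch {

  enum JsonSchemaType { NULL, STRING, NUMBER, INTEGER, BOOLEAN, OBJECT, ARRAY }

  static List<JsonSchemaType> getTypes(String fieldName, JsonNode typeProperty) {
    if (typeProperty == null) {
      // Schemaless or under-specified catalogs can emit fields without a type;
      // default to string here rather than throwing and cancelling the sync.
      return Collections.singletonList(JsonSchemaType.STRING);
    }
    if (typeProperty.isArray()) {
      // e.g. "type": ["null", "string"]
      List<JsonSchemaType> types = new ArrayList<>();
      typeProperty.forEach(node -> types.add(JsonSchemaType.valueOf(node.asText().toUpperCase())));
      return types;
    }
    return Collections.singletonList(JsonSchemaType.valueOf(typeProperty.asText().toUpperCase()));
  }
}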

tuliren (PR author):
Created an issue: #4124
