
🎉 New destination: S3 #3672

Merged: 28 commits merged into master from liren/s3-destination-mvp on Jun 3, 2021
Conversation

@tuliren (Contributor) commented May 27, 2021

What

  • This PR addresses S3 as destination #573.
  • The MVP supports the CSV format. More formats (e.g. Apache Parquet) will be added soon.

How

  • This destination is similar to S3StreamCopier, but without DB operations.

Pre-merge Checklist

  • Run integration tests
  • Publish Docker images

Recommended reading order

  1. destination-s3/spec.json (a hedged config-check sketch follows this list)
  2. S3Destination.java
  3. S3Consumer.java
  4. S3OutputFormatter.java
  5. S3CsvOutputFormatter.java
  6. The rest
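
For orientation only, here is a minimal sketch of the kind of connection check an S3 destination typically performs with the values declared in spec.json: write and then delete a tiny test object. The class name and the config field names (`s3_bucket_name`, `s3_bucket_region`, `access_key_id`, `secret_access_key`) are assumptions for illustration, not necessarily what this PR uses; see the actual spec.json for the real schema.

```java
// Hedged sketch: verify S3 credentials and bucket access by writing and deleting a test object.
// Config field names below are assumptions; consult destination-s3/spec.json for the real ones.
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.fasterxml.jackson.databind.JsonNode;

public class S3CheckSketch {

  public static boolean checkConnection(JsonNode config) {
    final String bucket = config.get("s3_bucket_name").asText();       // assumed field name
    final String region = config.get("s3_bucket_region").asText();     // assumed field name
    final String accessKeyId = config.get("access_key_id").asText();   // assumed field name
    final String secretKey = config.get("secret_access_key").asText(); // assumed field name

    final AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withRegion(region)
        .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(accessKeyId, secretKey)))
        .build();

    final String testKey = "_airbyte_connection_test_" + System.currentTimeMillis();
    try {
      // A successful put + delete is a reasonable proxy for write access to the bucket.
      s3.putObject(bucket, testKey, "check-content");
      s3.deleteObject(bucket, testKey);
      return true;
    } catch (Exception e) {
      return false;
    }
  }
}
```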

@tuliren tuliren requested a review from sherifnada May 27, 2021 20:13
@tuliren tuliren self-assigned this May 27, 2021
@tuliren tuliren linked an issue May 27, 2021 that may be closed by this pull request
@sherifnada (Contributor) left a comment

Having looked into StreamTransferManager, I think it's unlikely that it will do the kind of partitioning we want (uploading a stream of data into multiple files). It seems to be solving the use case of uploading a file of unknown size to S3. The way it happens to do that is to split it up into multiple parts of known size in memory, upload them, then concatenate them via a native S3 operation: multipart upload.

Given this, I think we have a few options:

  1. Don't partition output within a stream.
  2. Partition by counting how many bytes have been uploaded on the current upload manager. Once we exceed the desired size, complete that manager's work, then create a new manager to upload a new part file. Repeat this process until done (roughly sketched below).
  3. Upload a single file, then split it after uploading. I don't like this approach because it has many moving parts, plus not all file formats are easily splittable (CSV or JSON, for example) -- it's possible to split those by reading N lines and counting bytes, then saving that offset, but it feels similar to approach 2.

I suspect we should do approach 1, then approach 2 in a follow-up release. I think this sequencing does not introduce any backwards incompatibilities and is congruent with an incremental value-delivery approach. WDYT?
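
A minimal sketch of approach 2, assuming the alexmojaki s3-stream-upload API that S3StreamCopier already relies on (StreamTransferManager, getMultiPartOutputStreams, complete). The class name, the part-size target, and the key naming are illustrative assumptions, not the PR's implementation.

```java
// Hedged sketch of approach 2: keep a byte counter per upload manager and roll over to a
// new StreamTransferManager (i.e. a new part file) once the target size is exceeded.
import alex.mojaki.s3upload.MultiPartOutputStream;
import alex.mojaki.s3upload.StreamTransferManager;
import com.amazonaws.services.s3.AmazonS3;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class RollingS3PartWriter {

  // Illustrative target; the real threshold would come from connector config.
  private static final long TARGET_PART_BYTES = 500L * 1024 * 1024;

  private final AmazonS3 s3Client;
  private final String bucket;
  private final String keyPrefix;

  private StreamTransferManager currentManager;
  private MultiPartOutputStream currentStream;
  private long bytesWritten;
  private int partNumber;

  public RollingS3PartWriter(AmazonS3 s3Client, String bucket, String keyPrefix) {
    this.s3Client = s3Client;
    this.bucket = bucket;
    this.keyPrefix = keyPrefix;
  }

  public void write(String line) throws IOException {
    if (currentManager == null) {
      openNextPart();
    }
    byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
    currentStream.write(bytes);
    bytesWritten += bytes.length;
    // Once the desired size is exceeded, complete this manager's work and start a new part file.
    if (bytesWritten >= TARGET_PART_BYTES) {
      closeCurrentPart();
    }
  }

  public void close() throws IOException {
    if (currentManager != null) {
      closeCurrentPart();
    }
  }

  private void openNextPart() {
    String key = String.format("%s/part-%05d.csv", keyPrefix, partNumber++);
    currentManager = new StreamTransferManager(bucket, key, s3Client);
    currentStream = currentManager.getMultiPartOutputStreams().get(0);
    bytesWritten = 0;
  }

  private void closeCurrentPart() throws IOException {
    currentStream.close();
    currentManager.complete();
    currentManager = null;
    currentStream = null;
  }
}
```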

Review threads on docs/integrations/destinations/s3.md (resolved).
import java.util.Map;
import java.util.UUID;

public class S3Consumer extends FailureTrackingAirbyteMessageConsumer {
Contributor commented:

Why is this a distinct class from the one in JDBC? Should it reuse that class?

@tuliren (Contributor, Author) replied:
The JDBC one has lots of database operations in it. I tried to reuse that one at the beginning of last week, but it was unnecessarily complicated. So I decided to create a separate one that only deals with S3 logic.
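
For context, a minimal sketch of what such an S3-only consumer could look like, assuming the FailureTrackingAirbyteMessageConsumer hooks (startTracked / acceptTracked / close). The StreamWriter interface is a hypothetical stand-in for the PR's formatter classes; this is illustrative, not the PR's actual S3Consumer.

```java
// Hedged sketch of an S3-only consumer with no database operations. StreamWriter is a
// hypothetical helper standing in for the per-stream CSV output formatter.
import io.airbyte.integrations.base.FailureTrackingAirbyteMessageConsumer;
import io.airbyte.protocol.models.AirbyteMessage;

import java.util.HashMap;
import java.util.Map;

public class S3OnlyConsumerSketch extends FailureTrackingAirbyteMessageConsumer {

  // Hypothetical interface: one writer per configured stream, each writing CSV parts to S3.
  public interface StreamWriter {
    void write(AirbyteMessage message) throws Exception;
    void close(boolean hasFailed) throws Exception;
  }

  private final Map<String, StreamWriter> writersByStream;

  public S3OnlyConsumerSketch(Map<String, StreamWriter> writersByStream) {
    this.writersByStream = new HashMap<>(writersByStream);
  }

  @Override
  protected void startTracked() {
    // Nothing to set up here: no schemas, tmp tables, or other DB bookkeeping.
  }

  @Override
  protected void acceptTracked(AirbyteMessage message) throws Exception {
    if (message.getType() != AirbyteMessage.Type.RECORD) {
      return;
    }
    StreamWriter writer = writersByStream.get(message.getRecord().getStream());
    if (writer != null) {
      writer.write(message);
    }
  }

  @Override
  protected void close(boolean hasFailed) throws Exception {
    // Complete or abort each in-flight S3 upload; again, no DB cleanup involved.
    for (StreamWriter writer : writersByStream.values()) {
      writer.close(hasFailed);
    }
  }
}
```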

@tuliren tuliren changed the title from "Implement s3 destination (csv only mvp)" to "🎉 New destination: S3" on Jun 1, 2021
@tuliren (Contributor, Author) commented Jun 1, 2021

/test connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/897761847
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/897761847

@tuliren tuliren marked this pull request as ready for review June 1, 2021 23:57
@auto-assign auto-assign bot requested review from cgardens and davinchia June 1, 2021 23:57
@tuliren tuliren requested a review from sherifnada June 1, 2021 23:57
@sherifnada sherifnada requested a review from subodh1810 June 2, 2021 00:05
@tuliren (Contributor, Author) commented Jun 2, 2021

/publish connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/897829029
❌ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/897829029

@sherifnada (Contributor) left a comment

A few comments, but we are almost there!

@tuliren (Contributor, Author) commented Jun 2, 2021

/test connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/898964579
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/898964579

@tuliren tuliren requested a review from sherifnada June 2, 2021 10:10
@tuliren tuliren mentioned this pull request Jun 2, 2021
@tuliren (Contributor, Author) commented Jun 3, 2021

/publish connector=connectors/destination-s3

🕑 connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/903551907
✅ connectors/destination-s3 https://github.com/airbytehq/airbyte/actions/runs/903551907

@tuliren tuliren merged commit c13b988 into master Jun 3, 2021
@tuliren tuliren deleted the liren/s3-destination-mvp branch June 3, 2021 16:40