New Source: AWS S3 #3965

Closed
dmdmishra opened this issue Jun 8, 2021 · 12 comments

Comments

@dmdmishra

Hello team,

I believe it would be best if Airbyte had a built-in AWS S3 connector. It would open up a lot of possibilities for people who want to build pipelines on top of AWS S3, for example to perform data ingestion or run data science and ML workloads directly on top of S3.

Thanks

@marcosmarxm
Member

marcosmarxm commented Jun 8, 2021

@dmdmishra Airbyte already has an AWS S3 destination connector. It supports CSV files, and #3908 is adding the Parquet format. Do you have another use case, or do you need S3 as a source?

@dmdmishra
Author

Hi,

Yes, we want to load data into AWS S3. We are extracting data from Oracle.

Does Airbyte currently support extracting data from Oracle or SQL Server in Avro format and loading it into AWS S3?

Also, I wanted to understand whether Airbyte supports moving files from on-premises into AWS S3.

Thanks,
Deepak

@marcosmarxm
Member

marcosmarxm commented Jun 10, 2021

@dmdmishra yes, Airbyte has Oracle and SQL Server source connectors, and you can add AWS S3 as the destination. The AWS S3 connector currently supports only the CSV format, but Parquet and other formats will be supported in the future.

I recommend reading the quick start guide.

@tuliren
Contributor

tuliren commented Jun 10, 2021

> Does airbyte currently support data extraction from oracle or sqlserver in avro format and get it loaded into aws S3.

@dmdmishra, we are working on the Avro format on S3 at this moment. Should have some updates there either late this week or early next week.

@dmdmishra
Author

That's great news. I will hold on until I hear more from the Airbyte team.

Can I also check with you whether, after extracting the data, we can perform data validation before we ingest it into S3?

@tuliren
Contributor

tuliren commented Jun 13, 2021

@dmdmishra, sorry about the delayed reply.

> if after extraction of data can we perform data validation

We will create an Avro schema based on the JSON schema of the Airbyte stream from the source. So if the source connector can provide a meaningful JSON schema, it will be transformed into a relatively good Avro schema, and each record will be validated against it. The Avro schema is only "relatively" good because not every JSON schema can be mapped to an Avro schema, and the initial version probably won't support keywords like allOf or oneOf.

You can find the documentation about the schema conversion here:

https://github.com/airbytehq/airbyte/blob/b5f5ca3939deac882a69e17353384dd088180534/docs/integrations/destinations/s3.md#data-schema

(It is the s3.md file in PR #3908.)
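
To make the idea of the schema conversion concrete, here is a minimal sketch of mapping a small subset of JSON Schema to an Avro schema. It is not the code from PR #3908: the function name `to_avro`, the type mapping, and the choice to make every field nullable are assumptions made for illustration only. Like the initial version described above, it rejects keywords such as allOf and oneOf.

```python
# Assumed mapping from JSON Schema primitive types to Avro types (illustrative only).
PRIMITIVES = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}


def to_avro(name, json_schema):
    """Convert a small subset of JSON Schema into an Avro schema (as a dict or str)."""
    # Combination keywords have no direct equivalent in this sketch, so fail loudly.
    for keyword in ("allOf", "oneOf", "anyOf"):
        if keyword in json_schema:
            raise ValueError(f"{keyword} is not supported in this sketch")

    json_type = json_schema.get("type")
    if not isinstance(json_type, str):
        # Covers a missing type (schemaless data) and type lists like ["null", "string"].
        raise ValueError(f"cannot map JSON type {json_type!r} to Avro")

    if json_type == "object":
        fields = []
        for field_name, field_schema in json_schema.get("properties", {}).items():
            # Assumption: every field is optional, so wrap it in a union with null.
            fields.append({
                "name": field_name,
                "type": ["null", to_avro(field_name, field_schema)],
                "default": None,
            })
        return {"type": "record", "name": name, "fields": fields}

    if json_type == "array":
        if "items" not in json_schema:
            raise ValueError("array schema without 'items' cannot be mapped")
        return {"type": "array", "items": to_avro(name, json_schema["items"])}

    if json_type in PRIMITIVES:
        return PRIMITIVES[json_type]

    raise ValueError(f"cannot map JSON type {json_type!r} to Avro")
```

For example, `to_avro("users", {"type": "object", "properties": {"id": {"type": "integer"}}})` yields a record schema with one nullable `long` field, while an empty schema `{}` raises `ValueError`.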

However, I am not sure if this answers your question. I think the validation will always pass, as long as the source connector generates the data according to its JSON schema.

Also, not all sources provide a meaningful JSON schema. For example, data in MongoDB is schemaless and can be any JSON object. In those cases, no Avro schema can be generated, and the data cannot be validated. We are still thinking about how to support sources like that.

@marcosmarxm changed the title from "Airbyte AWS S3 connector" to "New Source: AWS S3" on Jun 15, 2021

@blotouta2
Contributor

I want to sync an S3 bucket. My use case: many analytics providers (like Segment) offer a way to sync data to S3, and we could then fetch that data with an Airbyte source-s3 connector for further computation.

@tuliren
Contributor

tuliren commented Jun 15, 2021

> I want to sync s3 bucket and my use case is like many analytics provider (like segment) gives a facility to sync data on s3 and we can fetch that data using airbyte source-s3 for further computation

Hey @blotouta2, we already have a File source connector that can read from S3. Documentation here:
https://docs.airbyte.io/integrations/sources/file

That should work for your use case.

@blotouta2
Contributor

blotouta2 commented Jun 16, 2021

@tuliren Source-File reads only a single file from any source, whereas I want to sync a whole S3 bucket.
Source-File can read a path like s3://gdelt-open-data/events/20190914.export.csv
The requirement is to sync a path like s3://gdelt-open-data/events
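
As an illustration of the difference between reading one file and syncing a whole prefix, the following sketch lists every CSV object under an `s3://bucket/prefix` URL. It is not the Airbyte connector's code: the helper names `split_s3_url` and `list_keys` and the `suffix` parameter are hypothetical. The client is passed in and is expected to behave like a boto3 S3 client, whose `list_objects_v2` paginator is the only API call used.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/some/prefix' into (bucket, prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_keys(s3_client, url, suffix=".csv"):
    """Return all object keys under the URL's prefix that end with `suffix`."""
    bucket, prefix = split_s3_url(url)
    # list_objects_v2 returns at most 1000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                keys.append(obj["Key"])
    return keys
```

With boto3 installed and credentials configured, a call such as `list_keys(boto3.client("s3"), "s3://gdelt-open-data/events")` would return every matching key under that prefix rather than a single file.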

@tuliren
Contributor

tuliren commented Jun 16, 2021

I see. That makes sense. We will either make the File source connector able to read multiple files or create a dedicated S3 source.

@schlattk
Contributor

Yes, I agree with the above. We would like to sync the whole bucket so we can easily pick up a new file once it is added.
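
One possible way to pick up only newly added files is to keep a cursor on each object's last-modified timestamp between syncs. The sketch below assumes object listings shaped like the entries S3 returns (dicts with `Key` and `LastModified`); the function name `new_objects` and the cursor handling are hypothetical and do not describe how any Airbyte connector tracks state.

```python
from datetime import datetime, timezone


def new_objects(objects, cursor):
    """Return (keys modified after `cursor`, advanced cursor).

    `objects` is a list of dicts with "Key" and "LastModified" (a datetime).
    `cursor` is the newest timestamp seen so far, or None on the first sync.
    """
    fresh = sorted(
        (o for o in objects if cursor is None or o["LastModified"] > cursor),
        key=lambda o: o["LastModified"],
    )
    # Advance the cursor only when something new was found.
    next_cursor = fresh[-1]["LastModified"] if fresh else cursor
    return [o["Key"] for o in fresh], next_cursor


# Usage: the first sync sees everything, the second sync sees nothing new.
listing = [
    {"Key": "events/a.csv", "LastModified": datetime(2021, 6, 1, tzinfo=timezone.utc)},
    {"Key": "events/b.csv", "LastModified": datetime(2021, 6, 16, tzinfo=timezone.utc)},
]
keys, cursor = new_objects(listing, None)
keys_again, cursor = new_objects(listing, cursor)
```

The strict `>` comparison means a file written with exactly the same timestamp as the cursor would be skipped, so a real implementation would need to handle ties, for example by also remembering the keys already seen at the cursor timestamp.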

@Phlair
Contributor

Phlair commented Aug 2, 2021

Closed by #4990.
Separate issues exist for adding support for more file formats.

7 participants