s3 support for Hoodie #110

Closed
yssharma opened this issue Mar 22, 2017 · 34 comments

@yssharma

We have all our data on S3 and use HDFS only for intermediate data.
Can Hoodie work with S3 as well?

@prazanna
Contributor

Hoodie can work with S3 potentially if (that is a big if) we can

  1. Use the DistributedFileSystem API to access S3
  2. Achieve strong consistency guarantees (stricter than the default eventual consistency provided by S3)

Not an expert in S3, but in general the above are the requirements for plugging in any file system.
Hope this clarifies. We have spoken to multiple people who wanted S3 support, so this would be a great addition to Hoodie.
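
For illustration, a minimal sketch of point (1), using the generic Hadoop FileSystem API against an s3a:// URI (the bucket name, paths and credential values below are placeholders, not anything Hoodie ships):

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class S3ListingSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Placeholder credentials; in practice these come from the environment,
      // an instance profile, or core-site.xml.
      conf.set("fs.s3a.access.key", "<access-key>");
      conf.set("fs.s3a.secret.key", "<secret-key>");

      // The same generic FileSystem API that Hoodie already uses against HDFS.
      FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
      for (FileStatus status : fs.listStatus(new Path("s3a://my-bucket/tmp/hoodie/sample-table"))) {
        System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
      }
    }
  }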

@yssharma
Author

Are there any classes/docs I can start having a look about adding a new data source ?

@vinothchandar
Member

#1 is doable.. #2 is where we may need to invent some tricks..

Per docs, https://aws.amazon.com/s3/faqs/

Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.

http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel has more on this. Assuming a new file is a new object in S3, with the file name as key - what that doc says is that if you list the bucket after creating a new file, you may not see it.. The two statements seem to conflict a bit with each other..

@yssharma this is probably why you have HDFS for intermediate files? We really want to support s3 :) (and have AWS Athena run on such a dataset). Would you be interested in taking a stab at this?

Folks at Shopify are trying to do this on Google Cloud and it seems a lot easier, since it's strongly consistent like HDFS.

@vinothchandar changed the title from "Is HDFS the only supported data store for Hoodie? Can we use S3 with Hoodie?" to "s3 support for Hoodie" on Mar 22, 2017
@yssharma
Author

I am going through the docs and will have a look at the code soon.
We run Athena too, with all our data sitting on S3, so I might be able to test it there.
Thanks.

@yssharma
Author

Facing some issues while creating the Hoodie test table. Looks like the new code removed com.uber.hoodie.hadoop.HoodieOutputFormat.

CREATE EXTERNAL TABLE hoodie_test(_row_key string, _hoodie_commit_time string, _hoodie_commit_seqno string, rider string, driver string, begin_lat double, begin_lon double, end_lat double, end_lon double, fare double)
PARTITIONED BY (datestr string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat'
OUTPUTFORMAT 'com.uber.hoodie.hadoop.HoodieOutputFormat'
LOCATION 'hdfs:///tmp/hoodie/sample-table';

Also, I had to add parquet-tools to the Hive classpath. That might be worth mentioning in the quickstart guide.
hive> add jar ivy://com.twitter:parquet-tools:1.6.0;

@vinothchandar
Member

@yssharma Thanks for pointing that out. Will fix the docs shortly..

TBH, there is still work to be done in terms of easing registration into Hive.. We have not fully integrated hoodie-hive into tools/examples.. Ideally we want the ClientExample to register the table into Hive and have it queryable in Spark/Presto, all in a docker image:) .. Will get there..

Please keep us posted on s3..

@yssharma
Author

+1 for Docker image.
I was thinking of creating a minimal Hive external table with Hoodie, with its data on S3, to see how the data gets written to S3. What is the best way to create tables with Hoodie file formats?
A few thoughts in my mind -

  • Accessing the S3 file system through the API should be doable, similar to how Spark reads from S3.
  • Where do we rely on strong consistency in Hoodie? Do we write timestamps when files were written to S3? It sometimes takes a little longer to see files written to S3 because of eventual consistency (usually a few seconds for large files). Can we handle eventual consistency in code by adding waits/checks to ensure a file has been written and is visible? (see the sketch below)
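
A hedged sketch of the wait/check idea in the last point, polling until a freshly written path becomes visible, with a bounded number of attempts (the helper name and timings are illustrative only, not existing Hoodie code):

  import java.io.IOException;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class VisibilityCheck {
    /**
     * Polls fs.exists() until the path shows up or maxAttempts is exhausted.
     * Illustrative only: a real retry policy/backoff would need tuning for S3.
     */
    public static boolean waitUntilVisible(FileSystem fs, Path path, int maxAttempts, long sleepMs)
        throws IOException, InterruptedException {
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        if (fs.exists(path)) {
          return true;
        }
        Thread.sleep(sleepMs);
      }
      return false;
    }
  }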

@vinothchandar
Member

you can see #95 for adding s3 support.. That PR adds support for Google Cloud.. You can try running HoodieClientExample to write to S3.. basically the quick start.

Let me illustrate an example problem:

A Hoodie commit writes a .commit file under the .hoodie metadata folder to mark a successful ingestion. A query picks up the new commit, but can't find the files that were written during the commit.. This will result in some new files not being processed by the query.

In other places, as long as you give sufficient time between batches (however much time S3 needs to make things consistent), it should be okay (I am like 99% sure).

@prazanna ??

@vinothchandar
Member

we can add you to our slack channel, if you are interested in pursuing this more.. We can have a more interactive discussion on this too..

@yssharma
Author

Yes, I would like to check more on this. Please add me to your slack channel and I can ask my questions there.
Cheers.

@vinothchandar
Member

email?

@yssharma
Author

Emailed you.

@HenryCaiHaiying

For the S3 eventual consistency of PUTS followed by LIST, regarding the problem mentioned above:

"A Hoodie commit writes a .commit file under the .hoodie metadata folder to mark a successful ingestion. A query picks up the new commit, but can't find the files that were written during the commit.. This will result in some new files not being processed by the query."

It seems to me a minor problem: you don't get the latest data for this query, but you will get those records on the next query. Anything more serious than that?

@HenryCaiHaiying

Maybe one way to get stronger consistency is to use the fact that a PUT followed by a GET is strongly consistent.

When the ingestion component generates the abc.commit file, it also appends the abc.commit filename to an inFlight S3 file. When a query comes in later and HoodieTableMetaClient.scanFiles is called, it uses the content of that inFlight S3 file to amend the result it got from fs.listStatus(), by issuing an S3 GET for that abc.commit file.
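
Roughly what that amendment could look like, as a sketch (the side listing file and the merge logic here are purely illustrative, not existing Hoodie code):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.nio.charset.StandardCharsets;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class AmendedListing {
    /**
     * Lists metaFolder, then adds any file named in the side "inFlight" listing
     * file that the (possibly stale) LIST missed. The per-file getFileStatus()
     * calls are GETs, which S3 serves read-after-write for new objects.
     * A real implementation would also handle names of files deleted in the meantime.
     */
    public static List<FileStatus> listWithAmendment(FileSystem fs, Path metaFolder, Path inFlightFile)
        throws IOException {
      List<FileStatus> result = new ArrayList<>(Arrays.asList(fs.listStatus(metaFolder)));
      Set<String> seen = new HashSet<>();
      for (FileStatus status : result) {
        seen.add(status.getPath().getName());
      }
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(inFlightFile), StandardCharsets.UTF_8))) {
        String name;
        while ((name = reader.readLine()) != null) {
          if (!name.isEmpty() && !seen.contains(name)) {
            result.add(fs.getFileStatus(new Path(metaFolder, name)));
          }
        }
      }
      return result;
    }
  }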

@vinothchandar
Member

Off the top of my head, yes, it seems like the query will be stale until S3 syncs, and if you don't perform another commit within that window, it should be fine..

@HenryCaiHaiying if you are interested, we can add you to the slack group. We have a channel where we are toying around with this..

@prazanna
Contributor

prazanna commented Mar 23, 2017

Strong consistency on the underlying file system is an absolute requirement with the current way Hoodie is implemented. Query is just one part. Ingestion looks up the files to get the index and to check whether a record is already present or not. If the underlying file system only guarantees eventual consistency, you may end up with duplicates, and once you have duplicates for a row_key within a partition, all bets are off in terms of future updates. Durability will be impacted.
EDIT: As @vinothchandar said - if we can guarantee that the ingestion interval is always > the eventual consistency sync-up time, then this would not be a problem. But this has to be a 100% guarantee.

So in general, even though the impact of eventual consistency on querying is only a freshness/optimization concern, the index lookup relies on strict consistency guarantees.

@HenryCaiHaiying

What do you think of the idea of keeping a directory listing file of meta files under each partition folder, keeping its content current on each meta file add/delete? filesystem.listStatus would then consult this directory listing file for any new files.

S3 guarantees read-after-write consistency for a PUT followed by a GET.

@HenryCaiHaiying

Another thought on this problem: the code usually does listStatus() on dataFolder to get a list of data files, and another listStatus() on metaFolder to get a list of meta files; it is the LIST operation that has consistency problems in S3.

But we don't have to do it this way. When we get the list of data files, we have the file path for each data file and we know the corresponding meta path for each one. So instead of doing another listStatus on metaFolder, we can ask for the meta path of each file. That is a GET/HEAD operation, which should be strongly consistent. If we are worried about the performance of making an individual check per file, we can do a listStatus on metaFolder and only do the second check for files that appear in dataFolder but not in metaFolder.
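
A rough sketch of that idea, under the assumption that a data file's meta path can be derived from its name (metaFileFor below is a hypothetical stand-in for that mapping): list metaFolder once, then fall back to per-file existence checks (S3 GET/HEAD) only for meta files the listing did not return.

  import java.io.IOException;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class MetaFileCheck {
    /** Hypothetical mapping from a data file to its expected meta file name. */
    static String metaFileFor(FileStatus dataFile) {
      return dataFile.getPath().getName() + ".meta";
    }

    /** Returns the meta file names that are visible, using GETs to cover a stale LIST. */
    public static Set<String> visibleMetaFiles(FileSystem fs, Path metaFolder, FileStatus[] dataFiles)
        throws IOException {
      Set<String> listed = new HashSet<>();
      for (FileStatus status : fs.listStatus(metaFolder)) {
        listed.add(status.getPath().getName());
      }
      Set<String> visible = new HashSet<>();
      for (FileStatus dataFile : dataFiles) {
        String metaName = metaFileFor(dataFile);
        // Fall back to a per-file existence check (a strongly consistent GET/HEAD)
        // only when the possibly stale LIST did not return the meta file.
        if (listed.contains(metaName) || fs.exists(new Path(metaFolder, metaName))) {
          visible.add(metaName);
        }
      }
      return visible;
    }
  }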

@yssharma
Author

Can we add retries while Hoodie reads the files listed in the commit file? It would start processing the files that are present and later re-attempt the files not found in the last round.
Also, eventual consistency is not the only issue: S3 also throttles reads beyond a threshold, failing the read request. Retries should address that as well.

Spark has a similar retry count for Amazon Kinesis, since Kinesis aggressively throttles applications reading beyond a particular rate.

I haven't gotten familiar with the code and how we make these requests, so I am just thinking aloud before I actually see how things work under the hood.

@vinothchandar
Member

Interesting ideas. IMO such retry functionality can be built more elegantly outside, as a service that simply tracks up to which commit the dataset is consistent on the underlying filesystem (rather than pushing it down to Hoodie/query engines).. The hoodie commit file already contains the list of files that were written during the commit, and the service can simply verify that these files are in fact visible to listStatus and then mark the commit as consistent..

WDYT?
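
As a rough illustration of the check such a service could run: given the file paths recorded in a commit file (parsing of the commit metadata is omitted, and the names here are illustrative, not existing Hoodie code), verify that every one of them is visible before marking the commit consistent.

  import java.io.IOException;
  import java.util.Collection;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CommitConsistencyCheck {
    /**
     * Returns true only if every file recorded in the commit is visible on the
     * file system. An external service could poll this and only then advance the
     * "dataset is consistent up to commit X" watermark that queries rely on.
     */
    public static boolean isCommitVisible(FileSystem fs, Collection<Path> filesInCommit)
        throws IOException {
      for (Path file : filesInCommit) {
        if (!fs.exists(file)) {
          return false;
        }
      }
      return true;
    }
  }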

@yssharma
Author

Sounds good to me. Will have a look at adding the check to listStatus.
How long can the list of commit files typically be? Would we lose a lot on the performance side doing a huge scan?

@vinothchandar
Member

old commits get archived eventually.. so it will be bounded to < 50 or so.. (configurable)

@vinothchandar
Member

Hmmm.. Complication.. https://wiki.apache.org/hadoop/AmazonS3 - as per this, rename is not atomic.. We need at least one thing to be atomic for all of the above to be doable..

@vinothchandar
Member

Some more follow up, after a chat with someone from Netflix:

https://medium.com/netflix-techblog/s3mper-consistency-in-the-cloud-b6a1076aa4f8

we can potentially use something like this.

@yssharma
Author

Thanks @vinothchandar. Interesting blog. Will read some more on S3mper.

@yssharma
Author

As I see it, it would be a separate service on the cluster checking for inconsistencies and posting them to SQS queues, and the application then checks for inconsistencies before using the data. Yet to see a few examples of how applications use it. Will post back once I get more info.

@kainoa21

Another potential solution is the EMRFS directly from Amazon: http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html

@vinothchandar
Member

I heard it's NFS 4 (which seems to be vastly improved now) under the hood.. Any operational experience with EMRFS? Def worth checking out..

@vinothchandar
Member

Here is the summary from discussion with @yssharma

  • My concern that commit semantics might be compromised if the .inflight -> .commit rename fails midstream seems not to be a problem, since S3 guarantees the underlying copy is atomic.
  • The only pending issue is dealing with S3's eventual consistency itself, in cases where files that are part of a commit have not yet become available for listing.
  • The EMRFS route seems to have higher cost/operational overhead due to the use of a coupled DynamoDB instance.

Onward:

  • @yssharma is testing copy-on-write datasets with S3, to unearth any other issues.
  • A simple fix (albeit one adding more listing cost) is to have HoodieInputFormat optionally open the commit file and ensure the files in it are listable, before using that commit as the watermark for the query (i.e. a simple check to ensure queries wait long enough for files to show up).
  • A longer term fix could be building out the Hoodie Timeline server tier, which @prazanna and I have mostly discussed amongst ourselves.

@vinothchandar
Member

Per #293, we should be able to read/write to cloud stores properly now. Tested with S3.

Still need to add a check to ensure files are consistent on storage before publishing a commit.. This will handle the eventual consistency issue. Filed #405.

@vinothchandar
Member

Also merged support for ensuring file consistency before commit.. Closing this master issue.. Changes available in release 0.4.5 if you are still interested.

@tooptoop4

Is 0.4.5 released, or do you mean 0.4.4?

@vinothchandar
Member

Sorry I mistyped.. 0.4.4 has the consistency checks .. https://github.com/uber/hudi/blob/master/RELEASE_NOTES.md

@jeetgangele

Any idea when they will release Hudi with EMR in the Mumbai, India region?
