s3 support for Hoodie #110

Closed
yssharma opened this issue Mar 22, 2017 · 34 comments

@yssharma

We have all our data on S3 and use HDFS only for intermediate data.
Can Hoodie work with S3 as well?

@prazanna
Contributor

Hoodie can work with S3 potentially if (that is a big if) we can

  1. Use the DistributedFileSystem API to access S3
  2. Achieve strong consistency guarantees (stricter than the default eventual consistency provided by S3)

Not an expert in S3, but in general the above are the requirements for plugging in any file system.
Hope this clarifies. We have spoken to multiple people who wanted S3 support, so this would be a great addition to Hoodie.
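
For illustration, a minimal sketch of point (1), using the generic Hadoop FileSystem API against an s3a:// URI (the bucket name, paths and credential values below are placeholders, not anything Hoodie ships):

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class S3ListingSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Placeholder credentials; in practice these come from the environment,
      // an instance profile, or core-site.xml.
      conf.set("fs.s3a.access.key", "<access-key>");
      conf.set("fs.s3a.secret.key", "<secret-key>");

      // The same generic FileSystem API that Hoodie already uses against HDFS.
      FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
      for (FileStatus status : fs.listStatus(new Path("s3a://my-bucket/tmp/hoodie/sample-table"))) {
        System.out.println(status.getPath() + " : " + status.getLen() + " bytes");
      }
    }
  }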

@yssharma
Author

Are there any classes/docs I can start having a look about adding a new data source ?

@vinothchandar
Member

#1 is doable.. #2 is where we may need to invent some tricks..

Per docs, https://aws.amazon.com/s3/faqs/

Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.

http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel has more on this. Assuming a new file is a new object in S3, with the file name as key - what that doc says is that if you list the bucket after creating a new file, you may not see it.. The two statements seem to conflict a bit with each other..

@yssharma this is probably why you have HDFS for intermediate files? We really want to support s3 :) (and have AWS Athena run on such a dataset). Would you be interested in taking a stab at this?

Folks at Shopify are trying to do this on Google Cloud and it seems a lot easier, since it's strongly consistent like HDFS.

@vinothchandar changed the title from "Is HDFS the only supported data store for Hoodie? Can we use S3 with Hoodie?" to "s3 support for Hoodie" on Mar 22, 2017
@yssharma
Author

I am going through the docs and will have a look at the code soon.
We run Athena too, with all our data sitting on S3, so I might be able to test it there.
Thanks.

@yssharma
Author

Facing some issues while creating the Hoodie test table. Looks like the new code removed com.uber.hoodie.hadoop.HoodieOutputFormat.

CREATE EXTERNAL TABLE hoodie_test(_row_key string, _hoodie_commit_time string, _hoodie_commit_seqno string, rider string, driver string, begin_lat double, begin_lon double, end_lat double, end_lon double, fare double)
PARTITIONED BY (datestr string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'com.uber.hoodie.hadoop.HoodieInputFormat'
OUTPUTFORMAT 'com.uber.hoodie.hadoop.HoodieOutputFormat'
LOCATION 'hdfs:///tmp/hoodie/sample-table';

Also, I had to add parquet-tools to the Hive classpath. That might be worth mentioning in the quickstart guide.
hive> add jar ivy://com.twitter:parquet-tools:1.6.0;

@vinothchandar
Member

@yssharma Thanks for pointing that out. Will fix the docs shortly..

TBH, there is still work to be done in terms of easing registration into Hive.. We have not fully integrated hoodie-hive into tools/examples.. Ideally we want the ClientExample to register the table into Hive and have it queryable in Spark/Presto, all in a docker image:) .. Will get there..

Please keep us posted on s3..

@yssharma
Author

+1 for Docker image.
I was thinking of creating a minimal Hive external table with Hoodie, with its data on S3, to see how the data gets written to S3. What is the best way to create tables with Hoodie file formats?
A few thoughts in my mind -

  • Accessing the S3 file system through the API should be doable, similar to how Spark reads from S3.
  • Where do we rely on strong consistency in Hoodie? Do we write timestamps when files were written to S3? It sometimes takes a little longer to see files written to S3 because of eventual consistency (usually a few seconds for large files). Can we handle eventual consistency in code by adding waits/checks to ensure a file has been written and is visible? (see the sketch below)
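
A hedged sketch of the wait/check idea in the last point, polling until a freshly written path becomes visible, with a bounded number of attempts (the helper name and timings are illustrative only, not existing Hoodie code):

  import java.io.IOException;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class VisibilityCheck {
    /**
     * Polls fs.exists() until the path shows up or maxAttempts is exhausted.
     * Illustrative only: a real retry policy/backoff would need tuning for S3.
     */
    public static boolean waitUntilVisible(FileSystem fs, Path path, int maxAttempts, long sleepMs)
        throws IOException, InterruptedException {
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        if (fs.exists(path)) {
          return true;
        }
        Thread.sleep(sleepMs);
      }
      return false;
    }
  }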

@vinothchandar
Member

you can see #95 for adding s3 support.. That PR adds support for Google Cloud.. You can try running HoodieClientExample to write to S3.. basically the quick start.

Let me illustrate an example problem:

A Hoodie commit writes a .commit file under the .hoodie metadata folder to mark a successful ingestion. A query picks up the new commit, but can't find the files that were written during the commit.. This will result in some new files not being processed by the query.

In other places, as long as you give sufficient time between batches (however much time S3 needs to make things consistent), it should be okay (I am like 99% sure).

@prazanna ??

@vinothchandar
Member

we can add you to our slack channel, if you are interested in pursuing this more.. We can have a more interactive discussion on this too..

@yssharma
Author

Yes, I would like to check more on this. Please add me to your slack channel and I can ask my questions there.
Cheers.

@vinothchandar
Member

email?

@yssharma
Author

Emailed you.

@HenryCaiHaiying

For the S3 eventual consistency of PUTS followed by LIST, regarding the problem mentioned above:

"A Hoodie commit writes a .commit file under the .hoodie metadata folder to mark a successful ingestion. A query picks up the new commit, but can't find the files that were written during the commit.. This will result in some new files not being processed by the query."

It seems to me a minor problem: you don't get the latest data for this query, but you will get those records on the next query. Anything more serious than that?

@HenryCaiHaiying

Maybe one way to get stronger consistency is to use the fact that a PUT followed by a GET is strongly consistent.

When the ingestion component generates the abc.commit file, it also appends the abc.commit filename to an inFlight S3 file. When a query comes in later and HoodieTableMetaClient.scanFiles is called, it uses the content of that inFlight S3 file to amend the result it got from fs.listStatus(), by issuing an S3 GET for that abc.commit file.
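
Roughly what that amendment could look like, as a sketch (the side listing file and the merge logic here are purely illustrative, not existing Hoodie code):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.nio.charset.StandardCharsets;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class AmendedListing {
    /**
     * Lists metaFolder, then adds any file named in the side "inFlight" listing
     * file that the (possibly stale) LIST missed. The per-file getFileStatus()
     * calls are GETs, which S3 serves read-after-write for new objects.
     * A real implementation would also handle names of files deleted in the meantime.
     */
    public static List<FileStatus> listWithAmendment(FileSystem fs, Path metaFolder, Path inFlightFile)
        throws IOException {
      List<FileStatus> result = new ArrayList<>(Arrays.asList(fs.listStatus(metaFolder)));
      Set<String> seen = new HashSet<>();
      for (FileStatus status : result) {
        seen.add(status.getPath().getName());
      }
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(inFlightFile), StandardCharsets.UTF_8))) {
        String name;
        while ((name = reader.readLine()) != null) {
          if (!name.isEmpty() && !seen.contains(name)) {
            result.add(fs.getFileStatus(new Path(metaFolder, name)));
          }
        }
      }
      return result;
    }
  }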

@vinothchandar
Member

Off the top of my head, yes, it seems like the query will be stale until S3 syncs, and if you don't perform another commit within that window, it should be fine..

@HenryCaiHaiying if you are interested, we can add you to the slack group. We have a channel where we are toying around with this..

@prazanna
Contributor

prazanna commented Mar 23, 2017

Strong consistency on the underlying file system is an absolute requirement with the current way Hoodie is implemented. Query is just one part. Ingestion looks up the files to get the index and to check whether a record is already present or not. If the underlying file system only guarantees eventual consistency, you may end up with duplicates, and once you have duplicates for a row_key within a partition, all bets are off in terms of future updates. Durability will be impacted.
EDIT: As @vinothchandar said - if we can guarantee that the ingestion interval is always > the eventual consistency sync-up time, then this would not be a problem. But this has to be a 100% guarantee.

So in general, even though the impact of eventual consistency on querying is only a freshness/optimization concern, the index lookup relies on strict consistency guarantees.

@HenryCaiHaiying

What do you think of the idea of keeping a directory listing file of meta files under each partition folder, keeping its content current on each meta file add/delete? filesystem.listStatus would then consult this directory listing file for any new files.

S3 guarantees read-after-write consistency for a PUT followed by a GET.

@HenryCaiHaiying

Another thought on this problem: the code usually does listStatus() on dataFolder to get a list of data files, and another listStatus() on metaFolder to get a list of meta files; it is the LIST operation that has consistency problems in S3.

But we don't have to do it this way. When we get the list of data files, we have the file path for each data file and we know the corresponding meta path for each one. So instead of doing another listStatus on metaFolder, we can ask for the meta path of each file. That is a GET/HEAD operation, which should be strongly consistent. If we are worried about the performance of making an individual check per file, we can do a listStatus on metaFolder and only do the second check for files that appear in dataFolder but not in metaFolder.
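
A rough sketch of that idea, under the assumption that a data file's meta path can be derived from its name (metaFileFor below is a hypothetical stand-in for that mapping): list metaFolder once, then fall back to per-file existence checks (S3 GET/HEAD) only for meta files the listing did not return.

  import java.io.IOException;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class MetaFileCheck {
    /** Hypothetical mapping from a data file to its expected meta file name. */
    static String metaFileFor(FileStatus dataFile) {
      return dataFile.getPath().getName() + ".meta";
    }

    /** Returns the meta file names that are visible, using GETs to cover a stale LIST. */
    public static Set<String> visibleMetaFiles(FileSystem fs, Path metaFolder, FileStatus[] dataFiles)
        throws IOException {
      Set<String> listed = new HashSet<>();
      for (FileStatus status : fs.listStatus(metaFolder)) {
        listed.add(status.getPath().getName());
      }
      Set<String> visible = new HashSet<>();
      for (FileStatus dataFile : dataFiles) {
        String metaName = metaFileFor(dataFile);
        // Fall back to a per-file existence check (a strongly consistent GET/HEAD)
        // only when the possibly stale LIST did not return the meta file.
        if (listed.contains(metaName) || fs.exists(new Path(metaFolder, metaName))) {
          visible.add(metaName);
        }
      }
      return visible;
    }
  }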

@yssharma
Author

Can we add retries while Hoodie reads the files listed in the commit file? It would start processing the files that are present and later re-attempt the files not found in the last round.
Also, eventual consistency is not the only issue: S3 also throttles reads beyond a threshold, failing the read request. Retries should address that as well.

Spark has a similar retry count for Amazon Kinesis, since Kinesis aggressively throttles applications reading beyond a particular rate.

I haven't gotten familiar with the code and how we make these requests, so I am just thinking aloud before I actually see how things work under the hood.

@vinothchandar
Member

Interesting ideas. IMO such retry functionality can be built more elegantly outside, as a service that simply tracks up to which commit the dataset is consistent on the underlying filesystem (rather than pushing it down to Hoodie/query engines).. The hoodie commit file already contains the list of files that were written during the commit, and the service can simply verify that these files are in fact visible to listStatus and then mark the commit as consistent..

WDYT?
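
As a rough illustration of the check such a service could run: given the file paths recorded in a commit file (parsing of the commit metadata is omitted, and the names here are illustrative, not existing Hoodie code), verify that every one of them is visible before marking the commit consistent.

  import java.io.IOException;
  import java.util.Collection;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CommitConsistencyCheck {
    /**
     * Returns true only if every file recorded in the commit is visible on the
     * file system. An external service could poll this and only then advance the
     * "dataset is consistent up to commit X" watermark that queries rely on.
     */
    public static boolean isCommitVisible(FileSystem fs, Collection<Path> filesInCommit)
        throws IOException {
      for (Path file : filesInCommit) {
        if (!fs.exists(file)) {
          return false;
        }
      }
      return true;
    }
  }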

@yssharma
Author

Sounds good to me. Will have a look at adding the check to listStatus.
How long can the list of commit files typically be? Would we lose a lot on the performance side doing a huge scan?

@vinothchandar
Member

old commits get archived eventually.. so it will be bounded to < 50 or so.. (configurable)

@vinothchandar
Member

Hmmm.. Complication.. https://wiki.apache.org/hadoop/AmazonS3 - as per this, rename is not atomic.. We need at least one thing to be atomic for all of the above to be doable..

@vinothchandar
Member

Some more follow up, after a chat with someone from Netflix:

https://medium.com/netflix-techblog/s3mper-consistency-in-the-cloud-b6a1076aa4f8

we can potentially use something like this.

@yssharma
Author

Thanks @vinothchandar. Interesting blog. Will read some more on S3mper.

@yssharma
Author

As I see it, it would be a separate service on the cluster checking for inconsistencies and posting them to SQS queues, and the application then checks for inconsistencies before using the data. Yet to see a few examples of how applications use it. Will post back once I get more info.

@kainoa21

Another potential solution is the EMRFS directly from Amazon: http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html

@vinothchandar
Member

I heard it's NFS 4 (which seems to be vastly improved now) under the hood.. Any operational experience with EMRFS? Def worth checking out..

@vinothchandar
Member

Here is the summary from discussion with @yssharma

  • My concern that commit semantics might be compromised if the .inflight -> .commit rename fails midstream seems not to be a problem, since S3 guarantees the underlying copy is atomic.
  • The only pending issue is dealing with S3's eventual consistency itself, in cases where files that are part of a commit have not yet become available for listing.
  • The EMRFS route seems to have higher cost/operational overhead due to the use of a coupled DynamoDB instance.

Onward:

  • @yssharma is testing copy-on-write datasets with S3, to unearth any other issues.
  • A simple fix (albeit one adding more listing cost) is to have HoodieInputFormat optionally open the commit file and ensure the files in it are listable, before using that commit as the watermark for the query (i.e. a simple check to ensure queries wait long enough for files to show up).
  • A longer term fix could be building out the Hoodie Timeline server tier, which @prazanna and I have mostly discussed amongst ourselves.

@vinothchandar
Member

Per #293, we should be able to read/write to cloud stores properly now. Tested with S3.

Still need to add a check to ensure files are consistent on storage before publishing a commit.. This will handle the eventual consistency issue. Filed #405.

@vinothchandar
Member

Also merged support for ensuring file consistency before commit.. Closing this master issue.. Changes available in release 0.4.5 if you are still interested.

@tooptoop4

Is 0.4.5 released, or do you mean 0.4.4?

@vinothchandar
Member

Sorry I mistyped.. 0.4.4 has the consistency checks .. https://github.com/uber/hudi/blob/master/RELEASE_NOTES.md

@jeetgangele

Any idea when they will release Hudi with EMR in the Mumbai, India region?
