s3 support for Hoodie #110

We have all our data on S3 and use HDFS only for intermediate data.
Can Hoodie work with S3 as well?
Hoodie can work with S3 potentially if (that is a big if) we can
Not an expert in S3, but in general the above are the requirements for plugging in any file system.
Are there any classes/docs I can start looking at for adding a new data source?
#1 is doable.. #2 is where some tricks might be needed.. Per docs, https://aws.amazon.com/s3/faqs/
http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel has more on this. Assuming a new file is a new object in s3, with the file name as the key, what the doc says is: if you list the bucket after you created a new file, you may not see the file.. The two docs above seem to conflict a bit with each other.. @yssharma this is probably why you have HDFS for intermediate files? We really want to support s3 :) (and have AWS Athena run on such a dataset). Would you be interested in taking a stab at this? Folks at Shopify are trying to do this on Google Cloud and it seems a lot easier, since it's strongly consistent like HDFS
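To make the listing-vs-GET gap above concrete, here is a minimal sketch (not Hoodie code) against the Hadoop FileSystem API; the bucket and paths are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ConsistencyDemo {
  public static void main(String[] args) throws Exception {
    Path partition = new Path("s3a://some-bucket/hoodie_table/2017/03/01");
    Path newFile = new Path(partition, "new_data_file.parquet");

    FileSystem fs = FileSystem.get(newFile.toUri(), new Configuration());
    try (FSDataOutputStream out = fs.create(newFile)) {
      out.write("...".getBytes(StandardCharsets.UTF_8));
    }

    // GET-style check on the exact key: S3 documents read-after-write
    // consistency for new objects, so this is expected to succeed right away.
    System.out.println("exists? " + fs.exists(newFile));

    // LIST of the parent "folder": only eventually consistent, so the file
    // written above may or may not show up in this listing for a while.
    for (FileStatus status : fs.listStatus(partition)) {
      System.out.println(status.getPath());
    }
  }
}
```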
I am going through the docs and will have a look at the code soon.
Facing some issues while creating the Hoodie test table. Looks like the new code removed com.uber.hoodie.hadoop.HoodieOutputFormat.
Also, I had to add parquet-tools to the classpath for Hive. That might be worth mentioning in the quickstart guide.
@yssharma Thanks for pointing that out. Will fix the docs shortly.. TBH, there is still work to be done in terms of easing registration into Hive.. We have not fully integrated hoodie-hive into the tools/examples.. Ideally we want the ClientExample to register the table into Hive and have it queryable in Spark/Presto, all in a Docker image :) .. Will get there.. Please keep us posted on s3..
+1 for Docker image.
you can see #95 for adding s3 support.. That PR adds support for Google Cloud.. You can try running HoodieClientExample to write to S3.. basically the quick start. Let me illustrate an example problem: a Hoodie commit writes a .commit file under the .hoodie metadata folder to mark a successful ingestion. The query picks up the new commit, but can't find the files that were written during the commit.. This will result in some new files not being processed by the query. In other places, as long as you give sufficient time between batches (however much time s3 needs to make things consistent), it should be okay (I am like 99% sure). @prazanna ??
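A hedged sketch of that failure sequence from the query side, under stated assumptions: readFilesWrittenInCommit() is a hypothetical stand-in for parsing the commit metadata, and the bucket, commit id, and file names are made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StaleListingExample {
  public static void main(String[] args) throws Exception {
    Path table = new Path("s3a://some-bucket/hoodie_table");
    Path commitFile = new Path(table, ".hoodie/20170301120000.commit");
    FileSystem fs = FileSystem.get(table.toUri(), new Configuration());

    // Step 1: the query sees the new commit (GET on the exact key is consistent).
    List<String> filesInCommit = readFilesWrittenInCommit(fs, commitFile);

    // Step 2: the query lists the partition to locate those files. On S3 this
    // LIST is only eventually consistent, so a file recorded in the commit may
    // be missing from the listing and silently skipped by the query.
    Set<String> listed = new HashSet<>();
    for (FileStatus status : fs.listStatus(new Path(table, "2017/03/01"))) {
      listed.add(status.getPath().getName());
    }
    for (String file : filesInCommit) {
      if (!listed.contains(file)) {
        System.out.println("Written in commit but not visible to LIST yet: " + file);
      }
    }
  }

  // Hypothetical stand-in: in reality this would parse the .commit metadata,
  // which records the files written during the commit.
  static List<String> readFilesWrittenInCommit(FileSystem fs, Path commitFile) {
    return Arrays.asList("fileid1_001_20170301120000.parquet");
  }
}
```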
we can add you to our slack channel, if you are interested in pursuing this more.. We can have a more interactive discussion on this too..
Yes, I would like to check more on this. Please add me to your slack channel and I can ask my questions there.
email?
Emailed you.
For the S3 eventual consistency on PUTs followed by LIST, for the problem mentioned above: "Hoodie commit writes a .commit file under the .hoodie metadata folder to mark a successful ingestion. The query picks up the new commit, but can't find the files that were written during the commit.. This will result in some new files not being processed by the query." It seems to me a minor problem: you don't get the latest data for this query, but you will get those records in the next query. Anything more serious than that?
Maybe one way to get stronger consistency is to use the fact that a PUT followed by a GET is strongly consistent. When the ingestion component generates the abc.commit file, it also adds/appends the abc.commit filename into an inFlight S3 file. When a query comes in later and HoodieTableMetaClient.scanFiles is called, it will use the content of that inFlight S3 file to amend the result it got from fs.listStatus(), by calling S3 GET on that abc.commit file.
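A minimal sketch of that inFlight-manifest idea, assuming a small manifest object under .hoodie; the path and the class/method names are made up rather than Hoodie's actual API, and since S3 has no real append, the "append" is modeled as rewriting the small manifest object.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitManifest {
  private final FileSystem fs;
  private final Path manifest; // e.g. s3a://bucket/table/.hoodie/inflight_commits

  public CommitManifest(FileSystem fs, Path manifest) {
    this.fs = fs;
    this.manifest = manifest;
  }

  /** Writer side: record the commit file name once the commit has been written. */
  public void recordCommit(String commitFileName) throws Exception {
    Set<String> names = new LinkedHashSet<>(readNames());
    names.add(commitFileName);
    try (FSDataOutputStream out = fs.create(manifest, true)) { // overwrite whole manifest
      out.write(String.join("\n", names).getBytes(StandardCharsets.UTF_8));
    }
  }

  /** Reader side: amend a possibly stale listing with GETs on the known commit files. */
  public List<FileStatus> amendListing(Path metaFolder, FileStatus[] listed) throws Exception {
    List<FileStatus> result = new ArrayList<>(Arrays.asList(listed));
    Set<String> seen = new LinkedHashSet<>();
    for (FileStatus status : listed) {
      seen.add(status.getPath().getName());
    }
    for (String name : readNames()) {
      if (!seen.contains(name)) {
        // GET/HEAD on the exact key is read-after-write consistent, so a commit
        // file missed by the LIST can still be fetched here.
        result.add(fs.getFileStatus(new Path(metaFolder, name)));
      }
    }
    return result;
  }

  private List<String> readNames() throws Exception {
    List<String> names = new ArrayList<>();
    if (!fs.exists(manifest)) {
      return names;
    }
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(manifest), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (!line.isEmpty()) {
          names.add(line);
        }
      }
    }
    return names;
  }
}
```

In this sketch, the writer would call recordCommit right after generating the .commit file, and scanFiles would pass its fs.listStatus() result through amendListing before using it.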
Off the top of my head, yes, it seems like the query will be stale until s3 syncs, and if you don't perform another commit within that window, it should be fine.. @HenryCaiHaiying if you are interested, we can add you to the slack group. We have a channel where we are toying around with this..
Strong consistency on the underlying file system is an absolute requirement with the current way hoodie is implemented. Query is just one part. Ingestion looks up files for the index, to determine whether a record is already present or not. If the underlying file system only guarantees eventual consistency, you may end up with duplicates, and once you have duplicates for a row_key within a partition, all bets are off in terms of future updates. Durability will be impacted. So even though eventual consistency only affects querying as an optimization concern, the index lookup relies on strict consistency guarantees.
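To make the duplicate scenario concrete, here is a toy, self-contained simulation (not Hoodie's actual index code) of how a stale listing during index lookup turns an update into a second insert of the same row_key:

```java
import java.util.HashMap;
import java.util.Map;

public class StaleIndexDuplicateDemo {
  public static void main(String[] args) {
    // What is really on S3: row_key_1 already lives in file_A within the partition.
    Map<String, String> actualKeyToFile = new HashMap<>();
    actualKeyToFile.put("row_key_1", "file_A");

    // Index built from an eventually consistent LIST that did not see file_A yet.
    Map<String, String> indexFromStaleListing = new HashMap<>();

    // Upsert of row_key_1: the stale index says the key does not exist, so the
    // record is routed to a brand-new file instead of updating file_A ...
    String targetFile = indexFromStaleListing.containsKey("row_key_1")
        ? indexFromStaleListing.get("row_key_1")
        : "file_B (new insert)";
    System.out.println("row_key_1 written to " + targetFile
        + " (already present in " + actualKeyToFile.get("row_key_1") + ")");
    // ... leaving row_key_1 duplicated across file_A and file_B, after which
    // future updates to that key can no longer be applied cleanly.
  }
}
```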
What do you think of the idea of keeping a directory-listing file of meta files under each partition folder, keeping its content current on each meta file add/delete? filesystem.listStatus would consult this directory-listing file for any new files. S3 guarantees read-after-write consistency for a PUT followed by a GET.
Another thought on this problem: the code usually does listStatus() on the dataFolder to get a list of data files, and does another listStatus() on the metaFolder to get a list of meta files, and the LIST operation is what has consistency problems in S3. But we don't have to do it this way: when we get the list of data files, we have the file path for each data file and we know the corresponding meta path for each data file. So instead of doing another listStatus on the metaFolder, we can ask for the meta path of each file directly. That is a GET/HEAD operation, which should be strongly consistent. If we worry about the performance of making an individual check per file, we can do a listStatus on the metaFolder and only do the second check on the files that are in the dataFolder but not in the metaFolder.
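A sketch of that per-file GET/HEAD check, under the assumption that the meta path can be derived from the data path; metaPathFor() is a hypothetical mapping and the real Hoodie layout may differ.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetaFileChecker {

  static List<Path> findMetaFiles(FileSystem fs, Path dataFolder, Path metaFolder)
      throws Exception {
    // One LIST on the meta folder as a fast (but eventually consistent) first pass.
    Set<String> listedMeta = new HashSet<>();
    for (FileStatus status : fs.listStatus(metaFolder)) {
      listedMeta.add(status.getPath().getName());
    }

    List<Path> metaFiles = new ArrayList<>();
    for (FileStatus data : fs.listStatus(dataFolder)) {
      Path expectedMeta = metaPathFor(metaFolder, data.getPath());
      if (listedMeta.contains(expectedMeta.getName())) {
        metaFiles.add(expectedMeta);
      } else if (fs.exists(expectedMeta)) {
        // Second check only for the gap: an exists() check on the exact key is a
        // GET/HEAD, which is read-after-write consistent, unlike the LIST above.
        metaFiles.add(expectedMeta);
      }
    }
    return metaFiles;
  }

  // Hypothetical mapping, e.g. data file "xyz.parquet" -> meta file "xyz.meta".
  static Path metaPathFor(Path metaFolder, Path dataFile) {
    String base = dataFile.getName().replaceAll("\\.parquet$", "");
    return new Path(metaFolder, base + ".meta");
  }
}
```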
Can we add retries while Hoodie tries to read the files listed in the commit file? It would start processing the files that are present and later re-attempt the files not found in the last round. Spark has a similar retry count for Amazon Kinesis, since Kinesis brutally throttles applications reading beyond a particular rate. I haven't gotten familiar with the code and how we are making these requests, so just thinking aloud before I actually see how things are working under the hood.
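A hedged sketch of that retry idea: keep re-checking the files listed in the commit until they become visible or the retry budget runs out. The retry count and backoff are illustrative values, not Hoodie configuration.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetryingCommitReader {

  static List<Path> waitForCommitFiles(FileSystem fs, List<Path> filesInCommit)
      throws Exception {
    List<Path> pending = new ArrayList<>(filesInCommit);
    List<Path> visible = new ArrayList<>();
    int maxRetries = 5;
    long backoffMs = 1000;

    for (int attempt = 0; attempt < maxRetries && !pending.isEmpty(); attempt++) {
      List<Path> stillMissing = new ArrayList<>();
      for (Path file : pending) {
        if (fs.exists(file)) {      // check whether the file is visible yet
          visible.add(file);        // process files found in this round
        } else {
          stillMissing.add(file);   // re-attempt these in the next round
        }
      }
      pending = stillMissing;
      if (!pending.isEmpty()) {
        Thread.sleep(backoffMs);
        backoffMs *= 2;             // simple exponential backoff between rounds
      }
    }
    if (!pending.isEmpty()) {
      throw new IllegalStateException("Files still not visible: " + pending);
    }
    return visible;
  }
}
```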
Interesting ideas. IMO such retry functionality can be built more elegantly outside, as a service that simply tracks up to which commit the dataset is consistent on the underlying filesystem (rather than pushing it down into Hoodie/query engines).. The hoodie commit file already contains the list of files that were written during the commit, and the service can simply verify that these files are in fact visible to listStatus and mark the commit as consistent.. WDYT?
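A sketch of that external consistency-tracking service; readFilesInCommit() and markConsistentUpTo() are hypothetical hooks standing in for parsing the commit metadata and for however the service would publish the latest consistent commit to queries.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConsistencyTracker {

  /** Walk commits in order and advance the "consistent up to" marker. */
  void advance(FileSystem fs, List<String> commitTimesInOrder, Path tablePath)
      throws Exception {
    for (String commitTime : commitTimesInOrder) {
      Path commitFile = new Path(tablePath, ".hoodie/" + commitTime + ".commit");
      // Files the commit claims to have written, across partitions.
      for (Path writtenFile : readFilesInCommit(fs, commitFile)) {
        Set<String> listed = new HashSet<>();
        for (FileStatus status : fs.listStatus(writtenFile.getParent())) {
          listed.add(status.getPath().getName());
        }
        if (!listed.contains(writtenFile.getName())) {
          // This commit is not yet visible to LIST; stop here and try again later.
          return;
        }
      }
      markConsistentUpTo(commitTime);
    }
  }

  List<Path> readFilesInCommit(FileSystem fs, Path commitFile) {
    throw new UnsupportedOperationException("hypothetical: parse commit metadata");
  }

  void markConsistentUpTo(String commitTime) {
    throw new UnsupportedOperationException("hypothetical: publish to queries");
  }
}
```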
Sounds good to me. Will have a look at adding this check to listStatus.
old commits get archived eventually.. so it will be bounded to < 50 or so.. (configurable)
Hmmm.. Complication.. https://wiki.apache.org/hadoop/AmazonS3 as per this, rename is not atomic.. We need at least one operation to be atomic for all of the above to be doable..
One more follow up after a chat with someone from Netflix: https://medium.com/netflix-techblog/s3mper-consistency-in-the-cloud-b6a1076aa4f8 .. we can potentially use something like this.
Thanks @vinothchandar . Interesting blog. Will read some more on S3mper.
As I see it, it would be a separate service on the cluster checking for inconsistencies and posting them to SQS queues, and the application checks for inconsistencies before using the data. Yet to see a few examples of how applications use it. Will post back once I get more info.
Another potential solution is EMRFS, directly from Amazon: http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
I heard it's NFS 4 (which seems to be vastly improved now) under the hood.. Any operational experience with EMRFS? Def worth checking out..
Here is the summary from discussion with @yssharma
Onward:
Also merged support for ensuring file consistency before commit.. Closing this master issue.. Changes available in release 0.4.5
is 0.4.5 released, or did you mean 0.4.4?
Sorry, I mistyped.. 0.4.4 has the consistency checks .. https://github.com/uber/hudi/blob/master/RELEASE_NOTES.md
Any idea when they are releasing hudi with EMR in the Mumbai, India region?