
Multi/Cloud FS Support for Copy-On-Write tables #293

Merged
merged 3 commits into from
Jan 18, 2018

Conversation

vinothchandar
Member

@vinothchandar commented Jan 3, 2018

Addresses a few issues for S3 support (#110, #120) and subsumes #191. It touches a lot of files, but most of that is cleaning up some debt; the key changes are called out in the commit summaries.

Tested on some jobs reading from S3 and writing to HDFS, HDFS to HDFS, and so on. The CLI works with S3 datasets via env var settings as well.
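The env-var route the CLI takes can be sketched roughly as below. The HOODIE_ENV_ prefix comes from the commit summaries in this PR; the "_DOT_" placeholder for "." is an assumption here (env var names cannot contain dots, so some escape token is needed), and the class and method names are illustrative, not Hudi's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: translate HOODIE_ENV_-prefixed environment variables
// into Hadoop-style configuration properties. "_DOT_" stands in for "." since
// dots are not legal in environment-variable names.
class EnvVarConfig {
  static final String PREFIX = "HOODIE_ENV_";

  static Map<String, String> toConfigProps(Map<String, String> env) {
    Map<String, String> props = new HashMap<>();
    for (Map.Entry<String, String> e : env.entrySet()) {
      if (e.getKey().startsWith(PREFIX)) {
        // strip the prefix, then unescape the "." placeholder
        String key = e.getKey().substring(PREFIX.length()).replace("_DOT_", ".");
        props.put(key, e.getValue());
      }
    }
    return props;
  }
}
```

So an env var like `HOODIE_ENV_fs_DOT_defaultFS=s3a://bucket` would surface as the property `fs.defaultFS=s3a://bucket`, while unrelated variables are ignored.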

cc @n3nash @ovj @jianxu for double-checking that the changes to the archive log don't impact production at Uber.

cc @zqureshi @alunarbeach who may be interested in this.

@vinothchandar
Member Author

Tests pass locally; debugging why they fail on Travis.

@vinothchandar force-pushed the s3-support branch 6 times, most recently from 7245522 to 282a436 on January 4, 2018 03:56
Contributor

@prazanna left a comment


This looks good. Glad we were able to get this working. One caveat to be wary of: if the client and the Hadoop cluster run different distributions, serializing the Configuration between them may not work correctly.
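The caveat above stems from Hadoop's Configuration not being java.io.Serializable, so it has to cross the driver/executor boundary through a wrapper with custom serialization (Spark's SerializableWritable is the well-known instance of this pattern). A minimal sketch of that pattern, using java.util.Properties as a stand-in for Configuration so the example is self-contained; the class name and field layout are illustrative, not Hudi's code:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

// Sketch of a serializable wrapper around a non-serializable config object.
// The config field is transient; writeObject/readObject re-encode its entries
// explicitly, which is exactly where cross-version drift between the client
// and the cluster can bite.
class SerializableConf implements Serializable {
  private transient Properties props;

  SerializableConf(Properties props) {
    this.props = props;
  }

  Properties get() {
    return props;
  }

  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    out.writeInt(props.size());
    for (String name : props.stringPropertyNames()) {
      out.writeUTF(name);                       // key
      out.writeUTF(props.getProperty(name));    // value
    }
  }

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    props = new Properties();                   // rebuild on the receiving side
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      props.setProperty(in.readUTF(), in.readUTF());
    }
  }
}
```

Because the entries are re-encoded by hand rather than relying on default serialization of the config class, the wrapper stays stable even when the two sides run different library versions, so long as both sides agree on this wire format.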

Vinoth Chandar and others added 3 commits January 17, 2018 23:03
 - Reviving PR 191, to make FileSystem creation off actual path
 - Streamline all filesystem access to HoodieTableMetaClient
 - Hadoop Conf from Spark Context serialized & passed to executor code too
 - Pick up env vars prefixed with HOODIE_ENV_ into Configuration object
 - Cleanup usage of FSUtils.getFS, piggybacking off HoodieTableMetaClient.getFS
 - Adding s3a to supported schemes & support escaping "." in env vars
 - Tests use HoodieTestUtils.getDefaultHadoopConf
 - When append() is not supported, rollover to new file always (instead of failing)
 - Provide way to configure archive log folder (avoids small files inside .hoodie)
 - Datasets written via Spark datasource archive to .hoodie/archived
 - HoodieClientExample will now retain only 2-3 commits to exercise the archival path during dev cycles
 - Few tweaks to code structure around CommitArchiveLog
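The "rollover instead of fail" behavior from the commit list can be sketched as follows. The FileStore interface, the numeric version-suffix naming, and the method names are all hypothetical stand-ins for illustration, not Hudi's actual API; the point is only the fallback shape: try append(), and when the store rejects it (as object stores like S3 do), start a fresh rolled-over file rather than erroring out.

```java
// Hypothetical sketch of rolling over to a new log-file version when the
// underlying filesystem does not support append().
class RolloverWriter {
  interface FileStore {
    void append(String path, String data);  // may throw UnsupportedOperationException
    void create(String path, String data);
  }

  // "archive.1" -> "archive.2": bump the numeric version suffix
  static String nextVersion(String path) {
    int dot = path.lastIndexOf('.');
    int version = Integer.parseInt(path.substring(dot + 1));
    return path.substring(0, dot + 1) + (version + 1);
  }

  // Returns the path that actually received the data.
  static String write(FileStore store, String path, String data) {
    try {
      store.append(path, data);
      return path;                          // appended in place
    } catch (UnsupportedOperationException e) {
      String rolled = nextVersion(path);    // rollover to a brand-new file
      store.create(rolled, data);
      return rolled;
    }
  }
}
```

On HDFS the append succeeds and the same file keeps growing; on an append-less store every write after the first lands in a new version, which is why the configurable archive folder in this PR matters for keeping those small files out of .hoodie.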
@vinothchandar merged commit 21ce846 into apache:master Jan 18, 2018