Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(hive): Move WriterOptions update to file-formats #11956

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

majetideepak
Copy link
Collaborator

@majetideepak majetideepak commented Dec 25, 2024

WriterOptions of the file format should implement how to process
format-specific session and connector configs.
This will decouple the Hive Connector library from the file-format libraries.
This was previously changed here #10915

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 25, 2024
Copy link

netlify bot commented Dec 25, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 3a14e73
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6772f9c83371d40008e8f78c

@majetideepak majetideepak requested a review from JkSelf December 25, 2024 00:57
@majetideepak majetideepak changed the title refactor(hive): Restore WriterOptions to File Formats refactor(hive): Restore WriterOptions update to File Formats Dec 25, 2024
@majetideepak majetideepak changed the title refactor(hive): Restore WriterOptions update to File Formats refactor(hive): Move WriterOptions update to file-formats Dec 25, 2024
@majetideepak majetideepak marked this pull request as ready for review December 26, 2024 00:50
@@ -44,9 +44,6 @@ class HiveConfig {

/// Maximum number of (bucketed) partitions per a single table writer
/// instance.
///
/// TODO: remove hive_orc_use_column_names since it doesn't exist in presto,
/// right now this is only used for testing.
static constexpr const char* kMaxPartitionsPerWriters =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@majetideepak How about moving the following file format-related configurations in HiveConfig into the file format's configuration file as well?

https://github.com/facebookincubator/velox/blob/main/velox/connectors/hive/HiveConfig.h#L76-L84

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JkSelf I prefer to handle the ReaderOptions in a separate PR. The ReaderOptions are not specialized per file format similar to the WriterOptions

Copy link
Contributor

@pedroerp pedroerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, we were on-and-off talking about this refactor, thanks for taking a stab!

Overall it looks good to me, just a few minor nits. Please double check that we are not changing the default values of any these flags, or the precedence in which we apply these different sources of configs. I remember these looked quite messy last I looked.

Thanks for the cleanup!

@@ -78,6 +78,48 @@ class Config : public config::ConfigBase {
static Entry<uint64_t> RAW_DATA_SIZE_PER_BATCH;
static Entry<bool> MAP_STATISTICS;

/// Maximum stripe size in orc writer.
static constexpr const char* kOrcWriterMaxStripeSize =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in the future, we should probably make these consistent with the upper case ones above. But nice that we are taking small(er) steps.

options.processConfigs(base, emptySession);
auto config = options.config;
ASSERT_EQ(getConfig(config, Config::STRIPE_SIZE), 64L * 1024L * 1024L);
ASSERT_EQ(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose here we are copying the exact same tests from the hive connector? My only concern is that we don't change the default values (as these are currently all over the place), since this may have hard to predict effects. But I'm assuming all of these are preserved.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API changed a bit since we moved from HiveConfig to Dwrf::Config and processConfigs. But the body and default values are the same.
There are 8 configs and I confirmed the defaults are the same as previous.

@@ -71,6 +78,79 @@ struct ConfigTestParams {
std::vector<uint32_t> expectedCols{}; // do we expect the spec to be valid
};

TEST(ConfigTests, WriterOptionsDefaultConfig) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: test names are lower camel case

@@ -44,6 +44,9 @@ struct WriterOptions : public dwio::common::WriterOptions {
columnWriterFactory;
const tz::TimeZone* sessionTimezone{nullptr};
bool adjustTimestampToTimezone{false};
void processConfigs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: newline to separate method declaration.

@majetideepak
Copy link
Collaborator Author

@pedroerp thanks for the review. I addressed your comments. I limited the refactoring to only that is required to minimize risk.

Copy link
Collaborator

@JkSelf JkSelf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for your great work.

@pedroerp pedroerp added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Dec 30, 2024
@facebook-github-bot
Copy link
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@majetideepak thanks for the refactor % one comment. Thanks!

@@ -78,6 +78,48 @@ class Config : public config::ConfigBase {
static Entry<uint64_t> RAW_DATA_SIZE_PER_BATCH;
static Entry<bool> MAP_STATISTICS;

/// Maximum stripe size in orc writer.
static constexpr const char* kOrcWriterMaxStripeSize =
"hive.orc.writer.stripe-max-size";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need hive prefix in the config? If it is hive specific, then shall we pass connector id to processConfig API (this might cause the dependency from file format to connector config) or does hive data sink should convert hive specific config to the corresponding format ones?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiaoxmeng We have the same dilemma for others like storage adapters (s3, gcs) config as well. eg. We use hive.s3. for S3.
We should discuss this and modify all the configs in a separate PR.

Copy link
Collaborator Author

@majetideepak majetideepak Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a separate PR, I think we can remove the hive prefix for all and modify the config key in the HiveConnector before passing it to file-formats, storage-adapters.

@@ -136,48 +133,6 @@ class HiveConfig {
/// meta data together. Optimization to decrease the small IO requests
static constexpr const char* kFilePreloadThreshold = "file-preload-threshold";

/// Maximum stripe size in orc writer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update configs.rst and maybe have one section for each supported file format? Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@majetideepak
Copy link
Collaborator Author

@xiaoxmeng I updated the configs.rst. Thanks!

Copy link
Contributor

@xiaoxmeng xiaoxmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@majetideepak thanks for the update!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants