-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(hive): Move WriterOptions update to file-formats #11956
base: main
Are you sure you want to change the base?
refactor(hive): Move WriterOptions update to file-formats #11956
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@@ -44,9 +44,6 @@ class HiveConfig { | |||
|
|||
/// Maximum number of (bucketed) partitions per a single table writer | |||
/// instance. | |||
/// | |||
/// TODO: remove hive_orc_use_column_names since it doesn't exist in presto, | |||
/// right now this is only used for testing. | |||
static constexpr const char* kMaxPartitionsPerWriters = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majetideepak How about moving the following file format-related configurations in HiveConfig into the file format's configuration file as well?
https://github.com/facebookincubator/velox/blob/main/velox/connectors/hive/HiveConfig.h#L76-L84
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf I prefer to handle the ReaderOptions in a separate PR. The ReaderOptions are not specialized per file format similar to the WriterOptions
85eaa5f
to
269a98f
Compare
269a98f
to
bebc015
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, we were on-and-off talking about this refactor, thanks for taking a stab!
Overall it looks good to me, just a few minor nits. Please double check that we are not changing the default values of any these flags, or the precedence in which we apply these different sources of configs. I remember these looked quite messy last I looked.
Thanks for the cleanup!
@@ -78,6 +78,48 @@ class Config : public config::ConfigBase { | |||
static Entry<uint64_t> RAW_DATA_SIZE_PER_BATCH; | |||
static Entry<bool> MAP_STATISTICS; | |||
|
|||
/// Maximum stripe size in orc writer. | |||
static constexpr const char* kOrcWriterMaxStripeSize = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in the future, we should probably make these consistent with the upper case ones above. But nice that we are taking small(er) steps.
options.processConfigs(base, emptySession); | ||
auto config = options.config; | ||
ASSERT_EQ(getConfig(config, Config::STRIPE_SIZE), 64L * 1024L * 1024L); | ||
ASSERT_EQ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose here we are copying the exact same tests from the hive connector? My only concern is that we don't change the default values (as these are currently all over the place), since this may have hard to predict effects. But I'm assuming all of these are preserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API changed a bit since we moved from HiveConfig to Dwrf::Config and processConfigs
. But the body and default values are the same.
There are 8 configs and I confirmed the defaults are the same as previous.
velox/dwio/dwrf/test/ConfigTests.cpp
Outdated
@@ -71,6 +78,79 @@ struct ConfigTestParams { | |||
std::vector<uint32_t> expectedCols{}; // do we expect the spec to be valid | |||
}; | |||
|
|||
TEST(ConfigTests, WriterOptionsDefaultConfig) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: test names are lower camel case
@@ -44,6 +44,9 @@ struct WriterOptions : public dwio::common::WriterOptions { | |||
columnWriterFactory; | |||
const tz::TimeZone* sessionTimezone{nullptr}; | |||
bool adjustTimestampToTimezone{false}; | |||
void processConfigs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline to separate method declaration.
@pedroerp thanks for the review. I addressed your comments. I limited the refactoring to only that is required to minimize risk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for your great work.
@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majetideepak thanks for the refactor % one comment. Thanks!
@@ -78,6 +78,48 @@ class Config : public config::ConfigBase { | |||
static Entry<uint64_t> RAW_DATA_SIZE_PER_BATCH; | |||
static Entry<bool> MAP_STATISTICS; | |||
|
|||
/// Maximum stripe size in orc writer. | |||
static constexpr const char* kOrcWriterMaxStripeSize = | |||
"hive.orc.writer.stripe-max-size"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need hive prefix in the config? If it is hive specific, then shall we pass connector id to processConfig API (this might cause the dependency from file format to connector config) or does hive data sink should convert hive specific config to the corresponding format ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiaoxmeng We have the same dilemma for others like storage adapters (s3, gcs) config as well. eg. We use hive.s3.
for S3.
We should discuss this and modify all the configs in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a separate PR, I think we can remove the hive
prefix for all and modify the config key in the HiveConnector before passing it to file-formats, storage-adapters.
@@ -136,48 +133,6 @@ class HiveConfig { | |||
/// meta data together. Optimization to decrease the small IO requests | |||
static constexpr const char* kFilePreloadThreshold = "file-preload-threshold"; | |||
|
|||
/// Maximum stripe size in orc writer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update configs.rst and maybe have one section for each supported file format? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
@xiaoxmeng I updated the configs.rst. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majetideepak thanks for the update!
WriterOptions of the file format should implement how to process
format-specific session and connector configs.
This will decouple the Hive Connector library from the file-format libraries.
This was previously changed here #10915