Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix replConfig merging and evaluate values in Config.fromHadoop #1015

Merged
merged 10 commits into from
Aug 14, 2014

Conversation

bholt
Copy link
Contributor

@bholt bholt commented Aug 13, 2014

Adds Config.defaultFrom which will correctly merge Config.default with existing values in JobConf (keys besides "io.serializations" can be added as needed), and use this in replConfig so we pick up the correct values in Hadoop mode.

Also had to fix ReplTest because the Execution[T] changes had made it so it was only ever running in Local mode. Because ReplImplicits.mode is a mutable var which gets picked up dynamically, the repl tests have to be run sequentially (for now).

@bholt
Copy link
Contributor Author

bholt commented Aug 14, 2014

Any feelings about this @johnynek? Particularly the bit about forcing JobConf to evaluate everything when calling Config.fromHadoop?

@@ -353,7 +362,7 @@ object Config {
* made on calling here. If you need to update Config, you do it by modifying it.
*/
def fromHadoop(conf: Configuration): Config =
Config(conf.asScala.map { e => (e.getKey, e.getValue) }.toMap)
Config(conf.asScala.map { e => e.getKey -> conf.get(e.getKey) }.toMap)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment on this change? Why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has to do with #1014. This forces JobConf to evaluate expressions before handing over the value. I'll add a comment to this effect in the code to avoid future confusion.

@johnynek
Copy link
Collaborator

why are we causing that evaluation, Brandon? I know it bit us with strings we got out (like hadoop.tmp.dir).

I am fine with it as I didn't realize there was a difference. :)

@bholt
Copy link
Contributor Author

bholt commented Aug 14, 2014

why are we causing that evaluation, Brandon? I know it bit us with strings we got out (like hadoop.tmp.dir).

The only problem I know of so far is that hadoop.tmp.dir having an expression in it causes validateSources to fail (#1014). And it just seemed more straightforward, if we're "freezing" the config into an immutable structure, to evaluate them then.

Unfortunately because we now have two configs (one coming from the mode), this makes things a bit more complicated, requiring tricks like what I just pushed (f309138) to merge the values from JobConf with those from our custom config. It would be great to come up with a way to separate config from mode, if you can think of how to do it (but going in a different PR preferably)


I should clarify that this problem with two configs is unrelated to the changes in this PR.

*/
def defaultFrom(mode: Mode): Config =
default ++ (mode match {
case m: HadoopMode => Config.fromHadoop(m.jobConf) - IoSerializationsKey
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this change. To use scalding you must, in scalding, set the ioSerializations key. It is a pain that we have all these special cases in the config. But given the way hadoop works, it seems that special cases are the only way around.

@johnynek
Copy link
Collaborator

merge when green.

val conf = Config.default
// make this lazy so we don't call 'createReplCodeJar' until we've initialized ScaldingShell
// (mutability is the worst)
lazy val tmpReplJarConfig = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, can we do this just once? We might need to make a fresh jar for every run to ship to the mappers. I'm not 100% sure, but the safest course might be to have config be the way you have it, but a replConfig or executionConfig that is a def that adds the jar on every call.

This is just anxiety on my part. I really don't understand the createReplJar mechanism or what happens if it is not there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid concern. To avoid voodoo, I'll do some investigating to see what the deal is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, touché. We do need to re-create that jar each run. Can't replicate with "--hdfs-local", unfortunately, so don't know of an easy way to test it automatically.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, this is fine. We should add an Issue to test the repl with the minicluster.

Glad we caught it before it was an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #1020.

@johnynek
Copy link
Collaborator

merge when green

bholt added a commit that referenced this pull request Aug 14, 2014
Fix replConfig merging and evaluate values in Config.fromHadoop
@bholt bholt merged commit 82265e6 into twitter:develop Aug 14, 2014
@bholt bholt deleted the bholt/config-merging branch August 14, 2014 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants