Fix replConfig merging and evaluate values in Config.fromHadoop #1015
Conversation
…ainst the evaluated tmp dir
Any feelings about this @johnynek? Particularly the bit about forcing JobConf to evaluate everything when calling Config.fromHadoop?
```diff
@@ -353,7 +362,7 @@ object Config {
    * made on calling here. If you need to update Config, you do it by modifying it.
    */
   def fromHadoop(conf: Configuration): Config =
-    Config(conf.asScala.map { e => (e.getKey, e.getValue) }.toMap)
+    Config(conf.asScala.map { e => e.getKey -> conf.get(e.getKey) }.toMap)
```
can you comment on this change? Why is this needed?
It has to do with #1014. This forces JobConf to evaluate expressions before handing over the value. I'll add a comment to this effect in the code to avoid future confusion.
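The distinction can be illustrated with a small stand-alone sketch (plain Scala, not the actual Hadoop `Configuration` class; the resolver here is a toy): iterating a Hadoop config's entries yields the raw stored strings, while a `get`-style lookup expands `${var}` references at read time.

```scala
import scala.util.matching.Regex

// Illustrative sketch of why e.getValue and conf.get(key) differ:
// Hadoop stores raw values, and get() evaluates ${...} expressions.
object VariableExpansionSketch {
  // Toy stand-in for a Hadoop Configuration's raw key/value store.
  val raw: Map[String, String] = Map(
    "user.name"      -> "brandon",
    "hadoop.tmp.dir" -> "/tmp/hadoop-${user.name}"
  )

  private val varPattern = "\\$\\{([^}]+)\\}".r

  // Toy stand-in for Configuration.get(key): recursively expands ${...}.
  def get(key: String): Option[String] =
    raw.get(key).map { v =>
      varPattern.replaceAllIn(v, m =>
        Regex.quoteReplacement(get(m.group(1)).getOrElse(m.matched)))
    }

  def main(args: Array[String]): Unit = {
    println(raw("hadoop.tmp.dir"))     // raw entry keeps the expression
    println(get("hadoop.tmp.dir").get) // lookup evaluates it
  }
}
```

Iterating entries (as the old `fromHadoop` did) would capture the unexpanded `/tmp/hadoop-${user.name}`, which is exactly the `hadoop.tmp.dir` problem described below.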
why are we causing that evaluation, Brandon? I know it bit us with strings we got out (like hadoop.tmp.dir). I am fine with it as I didn't realize there was a difference. :)
The only problem I know of so far is that hadoop.tmp.dir having an expression in it causes

Unfortunately, because we now have two configs (one coming from the mode), this makes things a bit more complicated, requiring tricks like what I just pushed (f309138) to merge the values from JobConf with those from our custom config. It would be great to come up with a way to separate config from mode, if you can think of how to do it (but preferably in a different PR).

I should clarify that this problem with two configs is unrelated to the changes in this PR.
```scala
   */
  def defaultFrom(mode: Mode): Config =
    default ++ (mode match {
      case m: HadoopMode => Config.fromHadoop(m.jobConf) - IoSerializationsKey
```
+1 to this change. To use scalding you must, in scalding, set the ioSerializations key. It is a pain that we have all these special cases in the config. But given the way hadoop works, it seems that special cases are the only way around.
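The precedence in `defaultFrom` can be sketched with plain maps (simplified stand-ins, not Scalding's real `Config` type; the concrete values are illustrative): the Hadoop-derived entries layer on top of Scalding's defaults, except for the serializations key, which is dropped from the Hadoop side so Scalding's own list always survives.

```scala
// Illustrative sketch of the merge precedence in Config.defaultFrom.
object DefaultFromSketch {
  val IoSerializationsKey = "io.serializations"

  // Stand-in for Config.default (values illustrative).
  val default: Map[String, String] = Map(
    IoSerializationsKey -> "com.twitter.chill.hadoop.KryoSerialization",
    "scalding.flow.submitted.timestamp" -> "1234567890"
  )

  // Stand-in for Config.fromHadoop(m.jobConf) (values illustrative).
  val fromHadoop: Map[String, String] = Map(
    IoSerializationsKey -> "org.apache.hadoop.io.serializer.WritableSerialization",
    "hadoop.tmp.dir" -> "/tmp/hadoop-brandon"
  )

  // `++` lets the right-hand side win, so Hadoop values override defaults,
  // except the serializations key, which is removed from the Hadoop side first.
  val merged: Map[String, String] = default ++ (fromHadoop - IoSerializationsKey)
}
```

After the merge, `merged` carries `hadoop.tmp.dir` from the JobConf but keeps Scalding's serializations entry, which is the special case being +1'd here.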
merge when green.
…void initialization problems
```scala
  val conf = Config.default
  // make this lazy so we don't call 'createReplCodeJar' until we've initialized ScaldingShell
  // (mutability is the worst)
  lazy val tmpReplJarConfig = {
```
actually, can we do this just once? We might need to make a fresh jar for every run to ship to the mappers. I'm not 100% sure, but the safest course might be to have config be the way you have it, but a replConfig or executionConfig that is a def that adds the jar on every call.
This is just anxiety on my part. I really don't understand the createReplJar mechanism or what happens if it is not there.
Valid concern. To avoid voodoo, I'll do some investigating to see what the deal is.
Alright, touché. We do need to re-create that jar each run. I can't replicate this with "--hdfs-local", unfortunately, so I don't know of an easy way to test it automatically.
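The fix suggested above can be sketched like this (`createReplCodeJar` and the Map-based config are hypothetical stand-ins for the real repl mechanism): the base config stays a `val`, but the repl/execution config becomes a `def` so a fresh jar is created and added on every call, rather than once via `lazy val`.

```scala
// Illustrative sketch: def vs lazy val for the repl jar config.
object ReplConfigSketch {
  var jarsCreated = 0 // observe how often the jar is built

  // Stand-in for createReplCodeJar: in the real repl this writes a jar
  // containing the compiled classes from the current repl session.
  def createReplCodeJar(): String = {
    jarsCreated += 1
    s"/tmp/repl-session-$jarsCreated.jar"
  }

  // Computed once; safe to share across runs.
  val baseConfig: Map[String, String] = Map("scalding.mode" -> "hdfs")

  // A def, not a lazy val: each job run gets a freshly created jar,
  // so the mappers always ship the latest repl session code.
  def replConfig: Map[String, String] =
    baseConfig + ("tmpjars" -> createReplCodeJar())
}
```

Calling `replConfig` for each run yields a different jar path each time; with a `lazy val`, every run after the first would reuse the stale first jar.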
For now, this is fine. We should add an Issue to test the repl with the minicluster.
Glad we caught it before it was an issue.
Created #1020.
Needed to add the repl jar to the classpath for the case where no file/job is specified
merge when green
Adds `Config.defaultFrom`, which correctly merges `Config.default` with existing values in the JobConf (keys besides "io.serializations" can be added as needed), and uses it in `replConfig` so we pick up the correct values in Hadoop mode.

Also had to fix ReplTest because the `Execution[T]` changes had made it so it was only ever running in Local mode. Because `ReplImplicits.mode` is a mutable var which gets picked up dynamically, the repl tests have to be run sequentially (for now).