Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add options for verbose logging in FileSource #1613

Merged
merged 2 commits into from
Oct 28, 2016
Merged

Add options for verbose logging in FileSource #1613

merged 2 commits into from
Oct 28, 2016

Conversation

isnotinvain
Copy link
Contributor

For a long time we've had issues with getting the data availability checks right in FileSource. It's often hard to debug reports of intermittent errors in FileSource's behavior, because by the time we get to take a look, the HDFS file system may have changed already.

This PR adds a hadoop config variable that can be enabled which adds some very verbose logging, which sort of shows "what the file system looked like" when a job was run and when it was doing its various validations / sanity checks.

It's a little bit messy to have this logging code inside of FileSource but I do think it will help us going forward. Right now we are trying to debug some issues with the new behavior that allows for empty directories that contain success files. It may be doing exactly what it's supposed to, or it may have a bug, but we can't really catch it as it happens. This way we'll at least have the option to know what the FS looked like at run time.

@@ -119,6 +110,32 @@ object AcceptAllPathFilter extends PathFilter {
}

object FileSource {
val LOG = LoggerFactory.getLogger(this.getClass)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you tested this on the cluster? Seems like serialization woes are possible here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same worry. I'm in the middle of testing it out. It's also really hard to get these logs to show up in the tests. Something to do w/ slf4j configuration. So I also want to confirm these logs actually get printed on a real cluster.

I think this'll work, because I see it being done the same way in the reducer estimator and a few other places. Worse case I'll make it lazy transient.


val VerboseFileSourceLoggingKey = "com.twitter.scalding.filesource.verbose.logging"

def verboseLogEnabled(conf: Configuration): Boolean = conf.getBoolean(VerboseFileSourceLoggingKey, false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we try to use scalding.Config where possible. Can we use that instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that or make this private please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make it private. Is it better to use scalding.Config even if it's private? It seems that what I'll have access to is a Configuration -- does it make sense to convert that to a scalding.Config (probably not?)


def verboseLogEnabled(conf: Configuration): Boolean = conf.getBoolean(VerboseFileSourceLoggingKey, false)

def ifVerboseLog(conf: Configuration)(msgFn: => String): Unit = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this take shouldLog: Boolean instead? And maybe make a

private def ifVerboseConf(conf: Configuration)(msg: => String): Unit =
  ifVerboreLog(verboseLogEnabled(conf))

to minimize exposing Configuration on our APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these can actually all be private, or maybe private[scalding] -- would that be enough?

@@ -7,7 +7,7 @@ import java.net.URI
import java.nio.ByteBuffer

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.filecache.{DistributedCache => HDistributedCache}
import org.apache.hadoop.mapreduce.filecache.{ DistributedCache => HDistributedCache }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not commit style changes like this on the side?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so this is what happens when you run the unit tests (scalafmt) -- I wasn't sure whether to keep this. If I don't commit this, next time someone runs the unit tests, it'll do this again. So I thought I'd just keep the changes. I can back them out though and send a separate PR that only does the style fixes.

@@ -59,7 +59,6 @@ object RichPipeSpecification extends Properties("RichPipe") {
(newNum > basePipeNumber) && ap.getName.endsWith(s"-${lastGroup}")
}


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove this from the diff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, what do you think?

@@ -119,6 +110,32 @@ object AcceptAllPathFilter extends PathFilter {
}

object FileSource {
val LOG = LoggerFactory.getLogger(this.getClass)

val VerboseFileSourceLoggingKey = "com.twitter.scalding.filesource.verbose.logging"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we declare this in Config?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes maybe, let me see how that works


def ifVerboseLog(conf: Configuration)(msgFn: => String): Unit = {
if (verboseLogEnabled(conf)) {
val stack = new RuntimeException()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we instead do Thread.currentThread().getStackTrace() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe -- I need to check if there's a difference. I think there might be but I'm not 100% sure.

@@ -119,6 +110,32 @@ object AcceptAllPathFilter extends PathFilter {
}

object FileSource {
val LOG = LoggerFactory.getLogger(this.getClass)

val VerboseFileSourceLoggingKey = "com.twitter.scalding.filesource.verbose.logging"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a short line describing the use of this param? Add if we have a sense of the overhead it causes, that as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah good idea, i'll document this

@isnotinvain
Copy link
Contributor Author

OK -- updated to have comments + to use scalding.Config (to some extent) and to make the helper methods private. Let me know what you think about the style changes / if the should be removed.

@piyushnarang
Copy link
Collaborator

👍 if you were able to verify that this works on the cluster

@isnotinvain
Copy link
Contributor Author

OK I've tested this on the cluster as well. It's pretty noisy -- we could log this to a side file instead maybe.

@johnynek
Copy link
Collaborator

👍

@isnotinvain
Copy link
Contributor Author

Thanks! Any opinion on log-to-console vs log-to-side-file?

@johnynek
Copy link
Collaborator

I think we should just use normal terrible java logging. If you want to log
to a file, you set that up in the terrible configuration of slf4j.

I don't see myself running this option, so I am not too worried about it as
long as the default is a no-op.

On Thu, Oct 27, 2016 at 1:27 PM Alex Levenson notifications@github.com
wrote:

Thanks! Any opinion on log-to-console vs log-to-side-file?


You are receiving this because you commented.

Reply to this email directly, view it on GitHub
#1613 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEJdqP48ScKLHPgnyNEzt6WRNTdHo8kks5q4TNngaJpZM4Kh0QF
.

@isnotinvain
Copy link
Contributor Author

Ya makes sense. I could have done all the turning on/off of this via terrible slf4j configs but I didn't want to do that because it never really works / often gets misconfigured outside of your control. So that's why I used the hadoop conf. I'll leave this as is.

@isnotinvain isnotinvain merged commit c2dba3b into twitter:develop Oct 28, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants