[SPARK-6990] [Build] Add Java linting script; fix minor warnings #9867
Conversation
Closing '>' and method name shouldn't have whitespace between them, according to http://checkstyle.sourceforge.net/config_whitespace.html#GenericWhitespace
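For illustration, a hedged sketch (not code from the PR) of the pattern the GenericWhitespace check flags — whitespace between the closing `>` of an explicit type argument and the method name:

```java
import java.util.Collections;
import java.util.List;

public class GenericWhitespaceExample {
    static List<String> bad() {
        // Non-compliant per GenericWhitespace: a space sits between the
        // closing '>' of the explicit type argument and the method name.
        return Collections.<String> emptyList();
    }

    static List<String> good() {
        // Compliant: the closing '>' is flush against the method name.
        return Collections.<String>emptyList();
    }

    public static void main(String[] args) {
        // Both compile and behave identically; only the style differs.
        System.out.println(bad().equals(good()));
    }
}
```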
Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide Some important checks are disabled (see TODOs in checkstyle.xml) due to multiple violations being present in the codebase.
The code was copied from a third-party source and needs to be in sync with that, so we shouldn't make our own modifications to it. The file contains some style violations, so suppress the checks.
This makes sure all case statements end with a break. See http://checkstyle.sourceforge.net/config_coding.html#FallThrough
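As a hedged sketch (not code from the PR), the kind of switch statement the FallThrough check cares about — an unterminated case is reported unless a relief comment marks the fall-through as intentional:

```java
public class FallThroughExample {
    // Intentional fall-through: case 1 continues into case 2. Checkstyle's
    // FallThrough check would report this unless the relief comment below
    // is present.
    static int classifyLenient(int code) {
        int result = 0;
        switch (code) {
            case 1:
                result += 1;
                // fall through
            case 2:
                result += 2;
                break;
            default:
                result = -1;
                break;
        }
        return result;
    }

    // Compliant without any relief comment: every case ends with a break.
    static int classifyStrict(int code) {
        int result;
        switch (code) {
            case 1:
                result = 1;
                break;
            case 2:
                result = 2;
                break;
            default:
                result = -1;
                break;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classifyLenient(1)); // falls through: 1 + 2
        System.out.println(classifyStrict(1));
    }
}
```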
The check fails in UnsafeRowParquetRecordReader.java; let's enable it in a separate change.
Jenkins, this is ok to test.
Test build #46433 has finished for PR 9867 at commit
Test build #46482 has finished for PR 9867 at commit
Added some more commits so that the new changes are in line with the style guide.
Ping? I've just merged the latest changes locally and verified the checks still pass.
I haven't looked at it super closely yet, but I think this is definitely good to have!
Test build #46692 has finished for PR 9867 at commit
Thanks @rxin. Would appreciate a "Ship it" on this (unless there are issues). Don't mean to whine, but the longer we wait, the bigger this PR becomes, as I have to fix any new code that doesn't pass the checks. Right now the latest merge from upstream passes successfully. Added some more details in the description.
Jenkins, retest this please.
Hey @dskrvk, sorry to let this slip through the cracks. I'm going to shepherd this today to try to get it merged.
```shell
SPARK_ROOT_DIR="$(dirname $SCRIPT_DIR)"
...
$SCRIPT_DIR/../build/mvn -Pkinesis-asl -Phive -Phive-thriftserver checkstyle:check > checkstyle.txt
$SCRIPT_DIR/../build/mvn -Pkinesis-asl -Pyarn -Phadoop-2.2 checkstyle:check >> checkstyle.txt
```
Quick question: why do you need to run twice with different profiles? AFAIK the set of source files should be the same under all of the Hadoop profiles, so I don't think we need to set -Phadoop-2.2 here. Why can't we just use one Maven run with the profiles -Pkinesis-asl -Phive -Phive-thriftserver -Pyarn?
I didn't actually observe any differences between the two profiles in terms of Checkstyle warnings, but decided to add the second run just to be thorough. My reasoning was that since some profiles omit some of the modules, we need to exercise all of the possible ones, even though at the moment the set of Java sources may be the same. In any case, this only adds a few seconds to the build - negligible compared to the overall run-tests time.
You should be able to have a single run with "-Pkinesis-asl -Pyarn -Phive -Phive-thriftserver" - I even think "-Phive" is unnecessary; I believe it only affects packaging right now. "-Phadoop-2.2" is unnecessary; that's the default.
Good point; changed.
Changes look good to me and ready to merge today; my only question concerns why we need to run Checkstyle twice with different sets of profiles.
Test build #47074 has finished for PR 9867 at commit
Minor, but there's a typo in the title: "mix" -> "fix".
Add -Pyarn to the first run so that all modules are covered at once. No need to execute checkstyle twice any more.
Test build #47183 has finished for PR 9867 at commit
Jenkins, retest this please.
retest this please
Test build #47201 has finished for PR 9867 at commit
LGTM, so I'm going to merge this into master. Thanks for being so patient, @dskrvk!
Hey @dskrvk, what's your Apache JIRA username? I need it in order to assign the JIRA to you so that you are properly credited by our release-notes generation script.
Hey @JoshRosen, thanks, my handle is dskrvk there as well. |
This replaces #9696
Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules, modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide. Some important checks are disabled (see TODOs in checkstyle.xml) due to multiple violations being present in the codebase. I suggest fixing those TODOs in separate PR(s). More on Checkstyle can be found on the official website.
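As a rough illustration of what such a configuration looks like (a hedged sketch, not the actual checkstyle.xml from this PR; the module and file names here are assumptions based on Checkstyle's documented configuration format):

```xml
<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
    "-//Puppy Crawl//DTD Check Configuration 1.3//EN"
    "http://www.puppycrawl.com/dtds/configuration_1_3.dtd">
<module name="Checker">
  <module name="TreeWalker">
    <!-- Checks enabled by this change -->
    <module name="GenericWhitespace"/>
    <module name="FallThrough"/>
    <!-- TODO: enable once the existing violations in the codebase are fixed -->
    <!-- <module name="LineLength"/> -->
  </module>
  <!-- Third-party code kept in sync with upstream is exempted via a
       suppressions file (path here is illustrative) -->
  <module name="SuppressionFilter">
    <property name="file" value="dev/checkstyle-suppressions.xml"/>
  </module>
</module>
```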
Sample output (from build 46345), duplicated because I run the build twice with different profiles:
Also fix some of the minor violations that didn't require sweeping changes.
Apologies for the previous botched PRs - I finally figured out the issue.
cr: @JoshRosen, @pwendell