
Improve building with maven docs #70

Closed
wants to merge 1 commit

Conversation


@witgo witgo commented Mar 4, 2014

 mvn -Dhadoop.version=... -Dsuites=spark.repl.ReplSuite test

to

 mvn -Dhadoop.version=... -Dsuites=org.apache.spark.repl.ReplSuite test

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell pwendell commented Mar 6, 2014

Thanks, merged.

@asfgit asfgit closed this in 51ca7bd Mar 6, 2014
@witgo witgo deleted the building_with_maven branch March 6, 2014 03:27
jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
Fast, memory-efficient hash set, hash table implementations optimized for primitive data types.

This pull request adds two hash table implementations optimized for primitive data types. For primitive types, the new hash tables are much faster than the current Spark AppendOnlyMap (about 3X faster; note that the current AppendOnlyMap is already much better than the Java map) while using much less space (about 1/4 of the space).

Details:

This PR first adds an open hash set implementation (OpenHashSet) optimized for primitive types (using Scala's specialization feature). This OpenHashSet is designed to serve as a building block for more advanced structures. It is currently used to build the following two hash tables, but it could also be used in the future to build multi-valued hash tables (GraphX has this use case). Note that there are some peculiarities in the code for working around some Scala compiler bugs.

Building on top of OpenHashSet, this PR adds two different hash table implementations:
1. OpenHashMap: for nullable keys, with optional specialization for primitive values
2. PrimitiveKeyOpenHashMap: for primitive keys that are not nullable, with optional specialization for primitive values
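
To make the idea concrete, here is a minimal, illustrative sketch of an open-addressed hash set for primitive Int keys (power-of-two capacity, linear probing, no boxing). It is only a toy stand-in for the technique: it is not Spark's OpenHashSet, and the name TinyIntOpenHashSet is invented for this sketch.

```scala
// Toy open-addressed hash set for primitive Int keys, for illustration only.
// Not Spark's OpenHashSet: it just demonstrates open addressing with a
// power-of-two capacity, linear probing, and no boxing of keys.
class TinyIntOpenHashSet(initialCapacity: Int = 16) {
  private var capacity = Integer.highestOneBit((initialCapacity max 2) - 1) << 1
  private var keys = new Array[Int](capacity)
  private var occupied = new Array[Boolean](capacity)
  private var count = 0

  def size: Int = count

  def contains(k: Int): Boolean = {
    var pos = k & (capacity - 1)
    while (occupied(pos) && keys(pos) != k) {
      pos = (pos + 1) & (capacity - 1) // linear probing
    }
    occupied(pos)
  }

  def add(k: Int): Unit = {
    if (count * 2 >= capacity) grow() // keep the load factor below 0.5
    var pos = k & (capacity - 1)
    while (occupied(pos) && keys(pos) != k) {
      pos = (pos + 1) & (capacity - 1)
    }
    if (!occupied(pos)) {
      keys(pos) = k
      occupied(pos) = true
      count += 1
    }
  }

  private def grow(): Unit = {
    val oldKeys = keys
    val oldOccupied = occupied
    capacity *= 2
    keys = new Array[Int](capacity)
    occupied = new Array[Boolean](capacity)
    count = 0
    var i = 0
    while (i < oldKeys.length) {
      if (oldOccupied(i)) add(oldKeys(i)) // rehash into the larger arrays
      i += 1
    }
  }
}
```

With Scala's specialization (which the PR uses for its key and value types), the same layout can avoid boxing for Long and Double keys as well.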

I tested the update speed of these two implementations using the changeValue function (which is what Aggregator and cogroup would use). Runtime relative to AppendOnlyMap for inserting 10 million items:

Int to Int: ~30%
java.lang.Integer to java.lang.Integer: ~100%
Int to java.lang.Integer: ~50%
java.lang.Integer to Int: ~85%
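
For reference, the changeValue-style update the benchmark exercises follows roughly the pattern below. This sketch uses scala.collection.mutable.HashMap as a stand-in rather than the new specialized maps, and countOccurrences is an invented helper, so the exact Spark method signature may differ.

```scala
import scala.collection.mutable

// Sketch of the "change value" update used by aggregation-style workloads:
// insert a default if the key is absent, otherwise merge with the old value.
// mutable.HashMap stands in for OpenHashMap / PrimitiveKeyOpenHashMap here.
def countOccurrences(xs: Iterator[Int]): mutable.HashMap[Int, Int] = {
  val counts = mutable.HashMap.empty[Int, Int]
  xs.foreach { x =>
    val merged = counts.get(x) match {
      case Some(v) => v + 1 // mergeValue: key already present
      case None    => 1     // defaultValue: first time the key is seen
    }
    counts.update(x, merged)
  }
  counts
}

// Example: countOccurrences(Iterator(1, 2, 2, 3)) == Map(1 -> 1, 2 -> 2, 3 -> 1)
```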

(cherry picked from commit b5dc339)
Signed-off-by: Reynold Xin <rxin@apache.org>
robert3005 pushed a commit to robert3005/spark that referenced this pull request Jan 12, 2017
ashangit pushed a commit to ashangit/spark that referenced this pull request Jul 13, 2018
[SPARK-18364][YARN] Expose metrics for YarnShuffleService
cloud-fan pushed a commit to cloud-fan/spark that referenced this pull request Jan 16, 2019
…es when call executeCollect, executeToIterator and executeTake action multi-times (apache#70)

* Avoid executing the QueryStage#prepareExecuteStage method multiple times when the executeCollect, executeToIterator and executeTake actions are called multiple times

* Only add the check in the prepareExecuteStage method, to avoid duplicating the check in the other methods

* small fix
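
The fix described above is essentially an idempotence guard. A minimal sketch of that pattern, using hypothetical names (QueryStageLike is invented here; this is not the actual QueryStage code), might look like:

```scala
// Hypothetical sketch of the guard described above: the preparation work runs
// at most once, no matter how many actions trigger it.
class QueryStageLike {
  private var prepared = false

  private def prepareExecuteStage(): Unit = synchronized {
    if (!prepared) {
      // ... plan the stage, submit child stages, etc. (elided) ...
      prepared = true
    }
  }

  // Each action calls the guard first; only the first call does the preparation.
  def executeCollect(): Unit    = { prepareExecuteStage(); /* collect rows */ }
  def executeToIterator(): Unit = { prepareExecuteStage(); /* build an iterator */ }
  def executeTake(n: Int): Unit = { prepareExecuteStage(); /* take n rows */ }
}
```

The synchronized block keeps the one-time preparation safe even if two actions race to trigger it.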
clems4ever pushed a commit to clems4ever/spark that referenced this pull request Feb 11, 2019
[SPARK-18364][YARN] Expose metrics for YarnShuffleService
weixiuli pushed a commit to weixiuli/spark that referenced this pull request Jun 18, 2019
…es when call executeCollect, executeToIterator and executeTake action multi-times (apache#70)

hejian991 pushed a commit to growingio/spark that referenced this pull request Jun 24, 2019
…es when call executeCollect, executeToIterator and executeTake action multi-times (apache#70)

bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Update the opentelekom password of credentials
hn5092 pushed a commit to hn5092/spark that referenced this pull request Nov 4, 2019
HyukjinKwon added a commit that referenced this pull request Aug 25, 2021
…ence index for optimization

### What changes were proposed in this pull request?

This PR proposes to move the distributed-sequence index implementation to the SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage the optimizations of the SQL engine and avoid an unnecessary shuffle when creating the default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests were added. This PR also runs all unit tests in the pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Aug 25, 2021
…ence index for optimization

(cherry picked from commit 93cec49)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>