Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23668][K8S] Add config option for passing through k8s Pod.spec.imagePullSecrets #20811

Closed

Conversation

andrusha
Copy link
Contributor

@andrusha andrusha commented Mar 13, 2018

What changes were proposed in this pull request?

Pass through the imagePullSecrets option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

How was this patch tested?

Unit tests + manual testing.

Manual testing procedure:

  1. Have private image registry.
  2. Spark-submit application with no spark.kubernetes.imagePullSecret set. Do kubectl describe pod .... See the error message:
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
  1. Create secret kubectl create secret docker-registry ...
  2. Spark-submit with spark.kubernetes.imagePullSecret set to the new secret. See that deployment was successful.

@foxish
Copy link
Contributor

foxish commented Mar 13, 2018

ok to test

@foxish
Copy link
Contributor

foxish commented Mar 13, 2018

cc/ @mccheah @liyinan926 @vanzin

Copy link
Contributor

@liyinan926 liyinan926 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some minor comments.

@@ -54,6 +54,12 @@ private[spark] object Config extends Logging {
.checkValues(Set("Always", "Never", "IfNotPresent"))
.createWithDefault("IfNotPresent")

val IMAGE_PULL_SECRET =
ConfigBuilder("spark.kubernetes.imagePullSecret")
.doc("Specifies the Kubernetes image secret used to access private image registry.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first image can be removed.


import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVarBuilder, EnvVarSourceBuilder, PodBuilder, QuantityBuilder}

import io.fabric8.kubernetes.api.model._
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be an empty line between third-party imports and org.apache.spark.* imports.

@@ -108,6 +109,8 @@ private[spark] class ExecutorPodFactory(
nodeToLocalTaskCount: Map[String, Int]): Pod = {
val name = s"$executorPodNamePrefix-exec-$executorId"

val imagePullSecrets = imagePullSecret.map(new LocalObjectReference(_)).toList
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the same code is used to configure both the driver and executor pods, it can be extracted out into a utility method in KubernetesUtil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liyinan926 I made the change, but imo it looks a bit awkward, as what is done here is just type conversion Option[String] ~> List[LocalObjectReference] to make config option palatable for the k8s.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems a bit awkward. I guess the original form looks better. Sorry for that, but looks like it's better if you revert this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liyinan926 reverted! Should be good to merge now. Rebased against master too.

@SparkQA
Copy link

SparkQA commented Mar 13, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1476/

@SparkQA
Copy link

SparkQA commented Mar 13, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1476/

@andrusha andrusha force-pushed the spark-23668-image-pull-secrets branch from dc2c185 to d4633d7 Compare March 20, 2018 10:06
@SparkQA
Copy link

SparkQA commented Mar 20, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1623/

@SparkQA
Copy link

SparkQA commented Mar 20, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1623/

@andrusha andrusha force-pushed the spark-23668-image-pull-secrets branch from d4633d7 to 0987bb7 Compare March 23, 2018 10:05
@SparkQA
Copy link

SparkQA commented Mar 23, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1701/

@SparkQA
Copy link

SparkQA commented Mar 23, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1701/

Andrew Korzhuev added 2 commits March 23, 2018 16:00
This will allow users to access images from private registries.
@andrusha andrusha force-pushed the spark-23668-image-pull-secrets branch from 0987bb7 to 4b8b39a Compare March 23, 2018 15:00
@SparkQA
Copy link

SparkQA commented Mar 23, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1705/

@liyinan926
Copy link
Contributor

LGTM. @foxish.

@SparkQA
Copy link

SparkQA commented Mar 23, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1705/

@liyinan926
Copy link
Contributor

@foxish @mccheah can you help merge this?

@felixcheung
Copy link
Member

Only integration tests result here, I think?

@felixcheung
Copy link
Member

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Mar 30, 2018

Test build #88737 has finished for PR 20811 at commit 4b8b39a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

looks like legit style issue, @andrusha could you update?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88737/console

Copy link
Member

@felixcheung felixcheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need style change

@SparkQA
Copy link

SparkQA commented Mar 30, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1829/

@SparkQA
Copy link

SparkQA commented Mar 30, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1829/

@foxish
Copy link
Contributor

foxish commented Apr 2, 2018

@andrusha, can you address the style check failure that @felixcheung pointed out. This should be good to merge after that.

@andrusha
Copy link
Contributor Author

andrusha commented Apr 3, 2018

@foxish fixed

@andrusha
Copy link
Contributor Author

andrusha commented Apr 3, 2018

Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Apr 3, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1888/

@andrusha
Copy link
Contributor Author

andrusha commented Apr 4, 2018 via email

@SparkQA
Copy link

SparkQA commented Apr 4, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1934/

@SparkQA
Copy link

SparkQA commented Apr 4, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1935/

@foxish
Copy link
Contributor

foxish commented Apr 4, 2018

Jenkins, test this please

@foxish
Copy link
Contributor

foxish commented Apr 4, 2018

@andrusha, no worries. I think this is going to be rather important for folks with private registries. Thanks for following up on it!

@SparkQA
Copy link

SparkQA commented Apr 4, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1935/

@foxish
Copy link
Contributor

foxish commented Apr 4, 2018

@mccheah - heads up, this will likely lead to a rebase on #20910.

@SparkQA
Copy link

SparkQA commented Apr 4, 2018

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1937/

@SparkQA
Copy link

SparkQA commented Apr 4, 2018

Test build #88902 has finished for PR 20811 at commit 0291f0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@foxish
Copy link
Contributor

foxish commented Apr 4, 2018

LGTM.
Merging to master

@SparkQA
Copy link

SparkQA commented Apr 4, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/1937/

@asfgit asfgit closed this in cccaaa1 Apr 4, 2018
@andrusha andrusha deleted the spark-23668-image-pull-secrets branch April 5, 2018 07:40
mshtelma pushed a commit to mshtelma/spark that referenced this pull request Apr 5, 2018
….imagePullSecrets

## What changes were proposed in this pull request?

Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

## How was this patch tested?

Unit tests + manual testing.

Manual testing procedure:
1. Have private image registry.
2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message:
```
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
```
3. Create secret `kubectl create secret docker-registry ...`
4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful.

Author: Andrew Korzhuev <andrew.korzhuev@klarna.com>
Author: Andrew Korzhuev <korzhuev@andrusha.me>

Closes apache#20811 from andrusha/spark-23668-image-pull-secrets.
andrusha pushed a commit to andrusha/spark that referenced this pull request Apr 5, 2018
….imagePullSecrets

Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

Unit tests + manual testing.

Manual testing procedure:
1. Have private image registry.
2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message:
```
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
```
3. Create secret `kubectl create secret docker-registry ...`
4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful.

Author: Andrew Korzhuev <andrew.korzhuev@klarna.com>
Author: Andrew Korzhuev <korzhuev@andrusha.me>

Closes apache#20811 from andrusha/spark-23668-image-pull-secrets.
robert3005 pushed a commit to palantir/spark that referenced this pull request Apr 7, 2018
….imagePullSecrets

## What changes were proposed in this pull request?

Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

## How was this patch tested?

Unit tests + manual testing.

Manual testing procedure:
1. Have private image registry.
2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message:
```
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
```
3. Create secret `kubectl create secret docker-registry ...`
4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful.

Author: Andrew Korzhuev <andrew.korzhuev@klarna.com>
Author: Andrew Korzhuev <korzhuev@andrusha.me>

Closes apache#20811 from andrusha/spark-23668-image-pull-secrets.
rdjy pushed a commit to rdjy/spark that referenced this pull request May 9, 2018
….imagePullSecrets

Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

Unit tests + manual testing.

Manual testing procedure:
1. Have private image registry.
2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message:
```
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
```
3. Create secret `kubectl create secret docker-registry ...`
4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful.

Author: Andrew Korzhuev <andrew.korzhuev@klarna.com>
Author: Andrew Korzhuev <korzhuev@andrusha.me>

Closes apache#20811 from andrusha/spark-23668-image-pull-secrets.
gitfy pushed a commit to gitfy/spark that referenced this pull request May 21, 2018
….imagePullSecrets

## What changes were proposed in this pull request?

Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

## How was this patch tested?

Unit tests + manual testing.

Manual testing procedure:
1. Have private image registry.
2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message:
```
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
```
3. Create secret `kubectl create secret docker-registry ...`
4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful.

Author: Andrew Korzhuev <andrew.korzhuev@klarna.com>
Author: Andrew Korzhuev <korzhuev@andrusha.me>

Closes apache#20811 from andrusha/spark-23668-image-pull-secrets.
ArturLakizo added a commit to ArturLakizo/spark that referenced this pull request Jul 15, 2018
* [SPARK-24381][TESTING] Add unit tests for NOT IN subquery around null values

## What changes were proposed in this pull request?
This PR adds several unit tests along the `cols NOT IN (subquery)` pathway. There are a scattering of tests here and there which cover this codepath, but there doesn't seem to be a unified unit test of the correctness of null-aware anti joins anywhere. I have also added a brief explanation of how this expression behaves in SubquerySuite. Lastly, I made some clarifying changes in the NOT IN pathway in RewritePredicateSubquery.

## How was this patch tested?
Added unit tests! There should be no behavioral change in this PR.

Author: Miles Yucht <miles@databricks.com>

Closes #21425 from mgyucht/spark-24381.

* [SPARK-24334] Fix race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

## What changes were proposed in this pull request?

There is a race condition of closing Arrow VectorSchemaRoot and Allocator in the writer thread of ArrowPythonRunner.

The race results in memory leak exception when closing the allocator. This patch removes the closing routine from the TaskCompletionListener and make the writer thread responsible for cleaning up the Arrow memory.

This issue be reproduced by this test:

```
def test_memory_leak(self):
    from pyspark.sql.functions import pandas_udf, col, PandasUDFType, array, lit, explode

   # Have all data in a single executor thread so it can trigger the race condition easier
    with self.sql_conf({'spark.sql.shuffle.partitions': 1}):
        df = self.spark.range(0, 1000)
        df = df.withColumn('id', array([lit(i) for i in range(0, 300)])) \
                   .withColumn('id', explode(col('id'))) \
                   .withColumn('v',  array([lit(i) for i in range(0, 1000)]))

       pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
       def foo(pdf):
           xxx
           return pdf

       result = df.groupby('id').apply(foo)

       with QuietTest(self.sc):
           with self.assertRaises(py4j.protocol.Py4JJavaError) as context:
               result.count()
           self.assertTrue('Memory leaked' not in str(context.exception))
```

Note: Because of the race condition, the test case cannot reproduce the issue reliably so it's not added to test cases.

## How was this patch tested?

Because of the race condition, the bug cannot be unit test easily. So far it has only happens on large amount of data. This is currently tested manually.

Author: Li Jin <ice.xelloss@gmail.com>

Closes #21397 from icexelloss/SPARK-24334-arrow-memory-leak.

* [SPARK-24373][SQL] Add AnalysisBarrier to RelationalGroupedDataset's and KeyValueGroupedDataset's child

## What changes were proposed in this pull request?

When we create a `RelationalGroupedDataset` or a `KeyValueGroupedDataset` we set its child to the `logicalPlan` of the `DataFrame` we need to aggregate. Since the `logicalPlan` is already analyzed, we should not analyze it again. But this happens when the new plan of the aggregate is analyzed.

The current behavior in most of the cases is likely to produce no harm, but in other cases re-analyzing an analyzed plan can change it, since the analysis is not idempotent. This can cause issues like the one described in the JIRA (missing to find a cached plan).

The PR adds an `AnalysisBarrier` to the `logicalPlan` which is used as child of `RelationalGroupedDataset` or a `KeyValueGroupedDataset`.

## How was this patch tested?

added UT

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21432 from mgaido91/SPARK-24373.

* [SPARK-24392][PYTHON] Label pandas_udf as Experimental

## What changes were proposed in this pull request?

The pandas_udf functionality was introduced in 2.3.0, but is not completely stable and still evolving.  This adds a label to indicate it is still an experimental API.

## How was this patch tested?

NA

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #21435 from BryanCutler/arrow-pandas_udf-experimental-SPARK-24392.

* [SPARK-19613][SS][TEST] Random.nextString is not safe for directory namePrefix

## What changes were proposed in this pull request?

`Random.nextString` is good for generating random string data, but it's not proper for directory name prefix in `Utils.createDirectory(tempDir, Random.nextString(10))`. This PR uses more safe directory namePrefix.

```scala
scala> scala.util.Random.nextString(10)
res0: String = 馨쭔ᎰႻ穚䃈兩㻞藑並
```

```scala
StateStoreRDDSuite:
- versioning and immutability
- recovering from files
- usage with iterators - only gets and only puts
- preferred locations using StateStoreCoordinator *** FAILED ***
  java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts!
  at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295)
  at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:152)
  at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:149)
  at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
  at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149)
  at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149)
...
- distributed test *** FAILED ***
  java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts!
  at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295)
```

## How was this patch tested?

Pass the existing tests.StateStoreRDDSuite:

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #21446 from dongjoon-hyun/SPARK-19613.

* [SPARK-24377][SPARK SUBMIT] make --py-files work in non pyspark application

## What changes were proposed in this pull request?

For some Spark applications, though they're a java program, they require not only jar dependencies, but also python dependencies. One example is Livy remote SparkContext application, this application is actually an embedded REPL for Scala/Python/R, it will not only load in jar dependencies, but also python and R deps, so we should specify not only "--jars", but also "--py-files".

Currently for a Spark application, --py-files can only be worked for a pyspark application, so it will not be worked in the above case. So here propose to remove such restriction.

Also we tested that "spark.submit.pyFiles" only supports quite limited scenario (client mode with local deps), so here also expand the usage of "spark.submit.pyFiles" to be alternative of --py-files.

## How was this patch tested?

UT added.

Author: jerryshao <sshao@hortonworks.com>

Closes #21420 from jerryshao/SPARK-24377.

* [SPARK-24250][SQL][FOLLOW-UP] support accessing SQLConf inside tasks

## What changes were proposed in this pull request?
We should not stop users from calling `getActiveSession` and `getDefaultSession` in executors. To not break the existing behaviors, we should simply return None.

## How was this patch tested?
N/A

Author: Xiao Li <gatorsmile@gmail.com>

Closes #21436 from gatorsmile/followUpSPARK-24250.

* [SPARK-23991][DSTREAMS] Fix data loss when WAL write fails in allocateBlocksToBatch

When blocks tried to get allocated to a batch and WAL write fails then the blocks will be removed from the received block queue. This fact simply produces data loss because the next allocation will not find the mentioned blocks in the queue.

In this PR blocks will be removed from the received queue only if WAL write succeded.

Additional unit test.

Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>

Closes #21430 from gaborgsomogyi/SPARK-23991.

Change-Id: I5ead84f0233f0c95e6d9f2854ac2ff6906f6b341

* [SPARK-24371][SQL] Added isInCollection in DataFrame API for Scala and Java.

## What changes were proposed in this pull request?

Implemented **`isInCollection `** in DataFrame API for both Scala and Java, so users can do

```scala
val profileDF = Seq(
  Some(1), Some(2), Some(3), Some(4),
  Some(5), Some(6), Some(7), None
).toDF("profileID")

val validUsers: Seq[Any] = Seq(6, 7.toShort, 8L, "3")

val result = profileDF.withColumn("isValid", $"profileID". isInCollection(validUsers))

result.show(10)
"""
+---------+-------+
|profileID|isValid|
+---------+-------+
|        1|  false|
|        2|  false|
|        3|   true|
|        4|  false|
|        5|  false|
|        6|   true|
|        7|   true|
|     null|   null|
+---------+-------+
 """.stripMargin
```
## How was this patch tested?

Several unit tests are added.

Author: DB Tsai <d_tsai@apple.com>

Closes #21416 from dbtsai/optimize-set.

* [SPARK-24365][SQL] Add Data Source write benchmark

## What changes were proposed in this pull request?

Add Data Source write benchmark. So that it would be easier to measure the writer performance.

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes #21409 from gengliangwang/parquetWriteBenchmark.

* [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR

## What changes were proposed in this pull request?

The PR adds functions `arrays_overlap`, `array_repeat`, `map_entries` to SparkR.

## How was this patch tested?

Tests added into R/pkg/tests/fulltests/test_sparkSQL.R

## Examples
### arrays_overlap
```
df <- createDataFrame(list(list(list(1L, 2L), list(3L, 1L)),
                           list(list(1L, 2L), list(3L, 4L)),
                           list(list(1L, NA), list(3L, 4L))))
collect(select(df, arrays_overlap(df[[1]], df[[2]])))
```
```
  arrays_overlap(_1, _2)
1                   TRUE
2                  FALSE
3                     NA
```
### array_repeat
```
df <- createDataFrame(list(list("a", 3L), list("b", 2L)))
collect(select(df, array_repeat(df[[1]], df[[2]])))
```
```
  array_repeat(_1, _2)
1              a, a, a
2                 b, b
```
```
collect(select(df, array_repeat(df[[1]], 2L)))
```
```
  array_repeat(_1, 2)
1                a, a
2                b, b
```
### map_entries
```
df <- createDataFrame(list(list(map = as.environment(list(x = 1, y = 2)))))
collect(select(df, map_entries(df$map)))
```
```
  map_entries(map)
1       x, 1, y, 2
```

Author: Marek Novotny <mn.mikke@gmail.com>

Closes #21434 from mn-mikke/SPARK-24331.

* [SPARK-23754][PYTHON] Re-raising StopIteration in client code

## What changes were proposed in this pull request?

Make sure that `StopIteration`s raised in users' code do not silently interrupt processing by spark, but are raised as exceptions to the users. The users' functions are wrapped in `safe_iter` (in `shuffle.py`), which re-raises `StopIteration`s as `RuntimeError`s

## How was this patch tested?

Unit tests, making sure that the exceptions are indeed raised. I am not sure how to check whether a `Py4JJavaError` contains my exception, so I simply looked for the exception message in the java exception's `toString`. Can you propose a better way?

## License

This is my original work, licensed in the same way as spark

Author: e-dorigatti <emilio.dorigatti@gmail.com>
Author: edorigatti <emilio.dorigatti@gmail.com>

Closes #21383 from e-dorigatti/fix_spark_23754.

* [SPARK-24419][BUILD] Upgrade SBT to 0.13.17 with Scala 2.10.7 for JDK9+

## What changes were proposed in this pull request?

Upgrade SBT to 0.13.17 with Scala 2.10.7 for JDK9+

## How was this patch tested?

Existing tests

Author: DB Tsai <d_tsai@apple.com>

Closes #21458 from dbtsai/sbt.

* [SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set

## What changes were proposed in this pull request?
This pr fixed an issue when having multiple distinct aggregations having the same argument set, e.g.,
```
scala>: paste
val df = sql(
  s"""SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
     | FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y)
   """.stripMargin)

java.lang.RuntimeException
You hit a query analyzer bug. Please report your query to Spark user mailing list.
```
The root cause is that `RewriteDistinctAggregates` can't detect multiple distinct aggregations if they have the same argument set. This pr modified code so that `RewriteDistinctAggregates` could count the number of aggregate expressions with `isDistinct=true`.

## How was this patch tested?
Added tests in `DataFrameAggregateSuite`.

Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #21443 from maropu/SPARK-24369.

* [SPARK-24384][PYTHON][SPARK SUBMIT] Add .py files correctly into PythonRunner in submit with client mode in spark-submit

## What changes were proposed in this pull request?

In client side before context initialization specifically,  .py file doesn't work in client side before context initialization when the application is a Python file. See below:

```
$ cat /home/spark/tmp.py
def testtest():
    return 1
```

This works:

```
$ cat app.py
import pyspark
pyspark.sql.SparkSession.builder.getOrCreate()
import tmp
print("************************%s" % tmp.testtest())

$ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py
...
************************1
```

but this doesn't:

```
$ cat app.py
import pyspark
import tmp
pyspark.sql.SparkSession.builder.getOrCreate()
print("************************%s" % tmp.testtest())

$ ./bin/spark-submit --master yarn --deploy-mode client --py-files /home/spark/tmp.py app.py
Traceback (most recent call last):
  File "/home/spark/spark/app.py", line 2, in <module>
    import tmp
ImportError: No module named tmp
```

### How did it happen?

In client mode specifically, the paths are being added into PythonRunner as are:

https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L430

https://github.com/apache/spark/blob/628c7b517969c4a7ccb26ea67ab3dd61266073ca/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L49-L88

The problem here is, .py file shouldn't be added as are since `PYTHONPATH` expects a directory or an archive like zip or egg.

### How does this PR fix?

We shouldn't simply just add its parent directory because other files in the parent directory could also be added into the `PYTHONPATH` in client mode before context initialization.

Therefore, we copy .py files into a temp directory for .py files and add it to `PYTHONPATH`.

## How was this patch tested?

Unit tests are added and manually tested in both standalond and yarn client modes with submit.

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21426 from HyukjinKwon/SPARK-24384.

* [SPARK-23161][PYSPARK][ML] Add missing APIs to Python GBTClassifier

## What changes were proposed in this pull request?

Add featureSubsetStrategy in GBTClassifier and GBTRegressor.  Also make GBTClassificationModel inherit from JavaClassificationModel instead of prediction model so it will have numClasses.

## How was this patch tested?

Add tests in doctest

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #21413 from huaxingao/spark-23161.

* [SPARK-23901][SQL] Add masking functions

## What changes were proposed in this pull request?

The PR adds the masking function as they are described in Hive's documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions.
This means that only `string`s are accepted as parameter for the masking functions.

## How was this patch tested?

added UTs

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21246 from mgaido91/SPARK-23901.

* [SPARK-24276][SQL] Order of literals in IN should not affect semantic equality

## What changes were proposed in this pull request?

When two `In` operators are created with the same list of values, but different order, we are considering them as semantically different. This is wrong, since they have the same semantic meaning.

The PR adds a canonicalization rule which orders the literals in the `In` operator so the semantic equality works properly.

## How was this patch tested?

added UT

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21331 from mgaido91/SPARK-24276.

* [SPARK-24337][CORE] Improve error messages for Spark conf values

## What changes were proposed in this pull request?

Improve the exception messages when retrieving Spark conf values to include the key name when the value is invalid.

## How was this patch tested?

Unit tests for all get* operations in SparkConf that require a specific value format

Author: William Sheu <william.sheu@databricks.com>

Closes #21454 from PenguinToast/SPARK-24337-spark-config-errors.

* [SPARK-24146][PYSPARK][ML] spark.ml parity for sequential pattern mining - PrefixSpan: Python API

## What changes were proposed in this pull request?

spark.ml parity for sequential pattern mining - PrefixSpan: Python API

## How was this patch tested?

doctests

Author: WeichenXu <weichen.xu@databricks.com>

Closes #21265 from WeichenXu123/prefix_span_py.

* [WEBUI] Avoid possibility of script in query param keys

As discussed separately, this avoids the possibility of XSS on certain request param keys.

CC vanzin

Author: Sean Owen <srowen@gmail.com>

Closes #21464 from srowen/XSS2.

* [SPARK-24414][UI] Calculate the correct number of tasks for a stage.

This change takes into account all non-pending tasks when calculating
the number of tasks to be shown. This also means that when the stage
is pending, the task table (or, in fact, most of the data in the stage
page) will not be rendered.

I also fixed the label when the known number of tasks is larger than
the recorded number of tasks (it was inverted).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21457 from vanzin/SPARK-24414.

* [SPARK-24397][PYSPARK] Added TaskContext.getLocalProperty(key) in Python

## What changes were proposed in this pull request?

This adds a new API `TaskContext.getLocalProperty(key)` to the Python TaskContext. It mirrors the Java TaskContext API of returning a string value if the key exists, or None if the key does not exist.

## How was this patch tested?
New test added.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #21437 from tdas/SPARK-24397.

* [SPARK-23900][SQL] format_number support user specifed format as argument

## What changes were proposed in this pull request?

`format_number` support user specifed format as argument. For example:
```sql
spark-sql> SELECT format_number(12332.123456, '##################.###');
12332.123
```

## How was this patch tested?

unit test

Author: Yuming Wang <yumwang@ebay.com>

Closes #21010 from wangyum/SPARK-23900.

* [SPARK-24232][K8S] Add support for secret env vars

## What changes were proposed in this pull request?

* Allows to refer a secret as an env var.
* Introduces new config properties in the form: spark.kubernetes{driver,executor}.secretKeyRef.ENV_NAME=name:key
  ENV_NAME is case sensitive.

* Updates docs.
* Adds required unit tests.

## How was this patch tested?
Manually tested and confirmed that the secrets exist in driver's and executor's container env.
Also job finished successfully.
First created a secret with the following yaml:
```
apiVersion: v1
kind: Secret
metadata:
  name: test-secret
data:
  username: c3RhdnJvcwo=
  password: Mzk1MjgkdmRnN0pi

-------

$ echo -n 'stavros' | base64
c3RhdnJvcw==
$ echo -n '39528$vdg7Jb' | base64
MWYyZDFlMmU2N2Rm
```
Run a job as follows:
```./bin/spark-submit \
      --master k8s://http://localhost:9000 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=1 \
      --conf spark.kubernetes.container.image=skonto/spark:k8envs3 \
      --conf spark.kubernetes.driver.secretKeyRef.MY_USERNAME=test-secret:username \
      --conf spark.kubernetes.driver.secretKeyRef.My_password=test-secret:password \
      --conf spark.kubernetes.executor.secretKeyRef.MY_USERNAME=test-secret:username \
      --conf spark.kubernetes.executor.secretKeyRef.My_password=test-secret:password \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0-SNAPSHOT.jar 10000
```

Secret loaded correctly at the driver container:
![image](https://user-images.githubusercontent.com/7945591/40174346-7fee70c8-59dd-11e8-8705-995a5472716f.png)

Also if I log into the exec container:

kubectl exec -it spark-pi-1526555613156-exec-1 bash
bash-4.4# env

> SPARK_EXECUTOR_MEMORY=1g
> SPARK_EXECUTOR_CORES=1
> LANG=C.UTF-8
> HOSTNAME=spark-pi-1526555613156-exec-1
> SPARK_APPLICATION_ID=spark-application-1526555618626
> **MY_USERNAME=stavros**
>
> JAVA_HOME=/usr/lib/jvm/java-1.8-openjdk
> KUBERNETES_PORT_443_TCP_PROTO=tcp
> KUBERNETES_PORT_443_TCP_ADDR=10.100.0.1
> JAVA_VERSION=8u151
> KUBERNETES_PORT=tcp://10.100.0.1:443
> PWD=/opt/spark/work-dir
> HOME=/root
> SPARK_LOCAL_DIRS=/var/data/spark-b569b0ae-b7ef-4f91-bcd5-0f55535d3564
> KUBERNETES_SERVICE_PORT_HTTPS=443
> KUBERNETES_PORT_443_TCP_PORT=443
> SPARK_HOME=/opt/spark
> SPARK_DRIVER_URL=spark://CoarseGrainedSchedulerspark-pi-1526555613156-driver-svc.default.svc:7078
> KUBERNETES_PORT_443_TCP=tcp://10.100.0.1:443
> SPARK_EXECUTOR_POD_IP=9.0.9.77
> TERM=xterm
> SPARK_EXECUTOR_ID=1
> SHLVL=1
> KUBERNETES_SERVICE_PORT=443
> SPARK_CONF_DIR=/opt/spark/conf
> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/jvm/java-1.8-openjdk/jre/bin:/usr/lib/jvm/java-1.8-openjdk/bin
> JAVA_ALPINE_VERSION=8.151.12-r0
> KUBERNETES_SERVICE_HOST=10.100.0.1
> **My_password=39528$vdg7Jb**
> _=/usr/bin/env
>

Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>

Closes #21317 from skonto/k8s-fix-env-secrets.

* [MINOR][YARN] Add YARN-specific credential providers in debug logging message

This PR adds a debugging log for YARN-specific credential providers which is loaded by service loader mechanism.

It took me a while to debug if it's actually loaded or not. I had to explicitly set the deprecated configuration and check if that's actually being loaded.

The change scope is manually tested. Logs are like:

```
Using the following builtin delegation token providers: hadoopfs, hive, hbase.
Using the following YARN-specific credential providers: yarn-test.
```

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21466 from HyukjinKwon/minor-log.

Change-Id: I18e2fb8eeb3289b148f24c47bb3130a560a881cf

* [SPARK-24330][SQL] Refactor ExecuteWriteTask and Use `while` in writing files

## What changes were proposed in this pull request?
1. Refactor ExecuteWriteTask in FileFormatWriter to reduce common logic and improve readability.
After the change, callers only need to call `commit()` or `abort` at the end of task.
Also there is less code in `SingleDirectoryWriteTask` and `DynamicPartitionWriteTask`.
Definitions of related classes are moved to a new file, and `ExecuteWriteTask` is renamed to `FileFormatDataWriter`.

2. As per code style guide: https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex , we avoid using `for` for looping in [FileFormatWriter](https://github.com/apache/spark/pull/21381/files#diff-3b69eb0963b68c65cfe8075f8a42e850L536) , or `foreach` in [WriteToDataSourceV2Exec](https://github.com/apache/spark/pull/21381/files#diff-6fbe10db766049a395bae2e785e9d56eL119).
In such critical code path, using `while` is good for performance.

## How was this patch tested?
Existing unit test.
I tried the microbenchmark in https://github.com/apache/spark/pull/21409

| Workload | Before changes(Best/Avg Time(ms)) | After changes(Best/Avg Time(ms)) |
| --- | --- | -- |
|Output Single Int Column|    2018 / 2043   |    2096 / 2236 |
|Output Single Double Column| 1978 / 2043 | 2013 / 2018 |
|Output Int and String Column| 6332 / 6706 | 6162 / 6298 |
|Output Partitions| 4458 / 5094 |  3792 / 4008  |
|Output Buckets|           5695 / 6102 |    5120 / 5154 |

Also a microbenchmark on my laptop for general comparison among while/foreach/for :
```
class Writer {
  var sum = 0L
  def write(l: Long): Unit = sum += l
}

def testWhile(iterator: Iterator[Long]): Long = {
  val w = new Writer
  while (iterator.hasNext) {
    w.write(iterator.next())
  }
  w.sum
}

def testForeach(iterator: Iterator[Long]): Long = {
  val w = new Writer
  iterator.foreach(w.write)
  w.sum
}

def testFor(iterator: Iterator[Long]): Long = {
  val w = new Writer
  for (x <- iterator) {
    w.write(x)
  }
  w.sum
}

val data = 0L to 100000000L
val start = System.nanoTime
(0 to 10).foreach(_ => testWhile(data.iterator))
println("benchmark while: " + (System.nanoTime - start)/1000000)

val start2 = System.nanoTime
(0 to 10).foreach(_ => testForeach(data.iterator))
println("benchmark foreach: " + (System.nanoTime - start2)/1000000)

val start3 = System.nanoTime
(0 to 10).foreach(_ => testForeach(data.iterator))
println("benchmark for: " + (System.nanoTime - start3)/1000000)
```
Benchmark result:
`while`: 15401 ms
`foreach`: 43034 ms
`for`: 41279 ms

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes #21381 from gengliangwang/refactorExecuteWriteTask.

* [SPARK-24444][DOCS][PYTHON] Improve Pandas UDF docs to explain column assignment

## What changes were proposed in this pull request?

Added sections to pandas_udf docs, in the grouped map section, to indicate columns are assigned by position.

## How was this patch tested?

NA

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #21471 from BryanCutler/arrow-doc-pandas_udf-column_by_pos-SPARK-21427.

* [SPARK-24326][MESOS] add support for local:// scheme for the app jar

## What changes were proposed in this pull request?

* Adds support for local:// scheme like in k8s case for image based deployments where the jar is already in the image. Affects cluster mode and the mesos dispatcher.  Covers also file:// scheme. Keeps the default case where jar resolution happens on the host.

## How was this patch tested?

Dispatcher image with the patch, use it to start DC/OS Spark service:
skonto/spark-local-disp:test

Test image with my application jar located at the root folder:
skonto/spark-local:test

Dockerfile for that image.

From mesosphere/spark:2.3.0-2.2.1-2-hadoop-2.6
COPY spark-examples_2.11-2.2.1.jar /
WORKDIR /opt/spark/dist

Tests:

The following work as expected:

* local normal example
```
dcos spark run --submit-args="--conf spark.mesos.appJar.local.resolution.mode=container --conf spark.executor.memory=1g --conf spark.mesos.executor.docker.image=skonto/spark-local:test
 --conf spark.executor.cores=2 --conf spark.cores.max=8
 --class org.apache.spark.examples.SparkPi local:///spark-examples_2.11-2.2.1.jar"
```

* make sure the flag does not affect other uris
```
dcos spark run --submit-args="--conf spark.mesos.appJar.local.resolution.mode=container --conf spark.executor.memory=1g  --conf spark.executor.cores=2 --conf spark.cores.max=8
 --class org.apache.spark.examples.SparkPi https://s3-eu-west-1.amazonaws.com/fdp-stavros-test/spark-examples_2.11-2.1.1.jar"
```

* normal example no local
```
dcos spark run --submit-args="--conf spark.executor.memory=1g  --conf spark.executor.cores=2 --conf spark.cores.max=8
 --class org.apache.spark.examples.SparkPi https://s3-eu-west-1.amazonaws.com/fdp-stavros-test/spark-examples_2.11-2.1.1.jar"

```

The following fails

 * uses local with no setting, default is host.
```
dcos spark run --submit-args="--conf spark.executor.memory=1g --conf spark.mesos.executor.docker.image=skonto/spark-local:test
  --conf spark.executor.cores=2 --conf spark.cores.max=8
  --class org.apache.spark.examples.SparkPi local:///spark-examples_2.11-2.2.1.jar"
```
![image](https://user-images.githubusercontent.com/7945591/40283021-8d349762-5c80-11e8-9d62-2a61a4318fd5.png)

Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>

Closes #21378 from skonto/local-upstream.

* [SPARK-23920][SQL] add array_remove to remove all elements that equal element from array

## What changes were proposed in this pull request?

add array_remove to remove all elements that equal element from array

## How was this patch tested?

add unit tests

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #21069 from huaxingao/spark-23920.

* [SPARK-24351][SS] offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

## What changes were proposed in this pull request?
Compute the thresholdBatchId to purge metadata based on current committed epoch instead of currentBatchId in CP mode to avoid cleaning all the committed metadata in some case as described in the jira [SPARK-24351](https://issues.apache.org/jira/browse/SPARK-24351).

## How was this patch tested?
Add new unit test.

Author: Huang Tengfei <tengfei.h@gmail.com>

Closes #21400 from ivoson/branch-cp-meta.

* Revert "[SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set"

This reverts commit 1e46f92f956a00d04d47340489b6125d44dbd47b.

* [INFRA] Close stale PRs.

Closes #21444

* [SPARK-24340][CORE] Clean up non-shuffle disk block manager files following executor exits on a Standalone cluster

## What changes were proposed in this pull request?

Currently we only clean up the local directories on application removed. However, when executors die and restart repeatedly, many temp files are left untouched in the local directories, which is undesired behavior and could cause disk space used up gradually.

We can detect executor death in the Worker, and clean up the non-shuffle files (files not ended with ".index" or ".data") in the local directories, we should not touch the shuffle files since they are expected to be used by the external shuffle service.

Scope of this PR is limited to only implement the cleanup logic on a Standalone cluster, we defer to experts familiar with other cluster managers(YARN/Mesos/K8s) to determine whether it's worth to add similar support.

## How was this patch tested?

Add new test suite to cover.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #21390 from jiangxb1987/cleanupNonshuffleFiles.

* [SPARK-23668][K8S] Added missing config property in running-on-kubernetes.md

## What changes were proposed in this pull request?
PR https://github.com/apache/spark/pull/20811 introduced a new Spark configuration property `spark.kubernetes.container.image.pullSecrets` for specifying image pull secrets. However, the documentation wasn't updated accordingly. This PR adds the property introduced into running-on-kubernetes.md.

## How was this patch tested?
N/A.

foxish mccheah please help merge this. Thanks!

Author: Yinan Li <ynli@google.com>

Closes #21480 from liyinan926/master.

* [SPARK-24356][CORE] Duplicate strings in File.path managed by FileSegmentManagedBuffer

This patch eliminates duplicate strings that come from the 'path' field of
java.io.File objects created by FileSegmentManagedBuffer. That is, we want
to avoid the situation when multiple File instances for the same pathname
"foo/bar" are created, each with a separate copy of the "foo/bar" String
instance. In some scenarios such duplicate strings may waste a lot of memory
(~ 10% of the heap). To avoid that, we intern the pathname with
String.intern(), and before that we make sure that it's in a normalized
form (contains no "//", "///" etc.) Otherwise, the code in java.io.File
would normalize it later, creating a new "foo/bar" String copy.
Unfortunately, the normalization code that java.io.File uses internally
is in the package-private class java.io.FileSystem, so we cannot call it
here directly.

## What changes were proposed in this pull request?

Added code to ExternalShuffleBlockResolver.getFile(), that normalizes and then interns the pathname string before passing it to the File() constructor.

## How was this patch tested?

Added unit test

Author: Misha Dmitriev <misha@cloudera.com>

Closes #21456 from countmdm/misha/spark-24356.

* [SPARK-24455][CORE] fix typo in TaskSchedulerImpl comment

change runTasks to submitTasks  in the TaskSchedulerImpl.scala 's comment

Author: xueyu <xueyu@yidian-inc.com>
Author: Xue Yu <278006819@qq.com>

Closes #21485 from xueyumusic/fixtypo1.

* [SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set

## What changes were proposed in this pull request?

bring back https://github.com/apache/spark/pull/21443

This is a different approach: just change the check to count distinct columns with `toSet`

## How was this patch tested?

a new test to verify the planner behavior.

Author: Wenchen Fan <wenchen@databricks.com>
Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes #21487 from cloud-fan/back.

* [SPARK-23786][SQL] Checking column names of csv headers

## What changes were proposed in this pull request?

Currently column names of headers in CSV files are not checked against provided schema of CSV data. It could cause errors like showed in the [SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786) and https://github.com/apache/spark/pull/20894#issuecomment-375957777. I introduced new CSV option - `enforceSchema`. If it is enabled (by default `true`), Spark forcibly applies provided or inferred schema to CSV files. In that case, CSV headers are ignored and not checked against the schema. If `enforceSchema` is set to `false`, additional checks can be performed. For example, if column in CSV header and in the schema have different ordering, the following exception is thrown:

```
java.lang.IllegalArgumentException: CSV file header does not contain the expected fields
 Header: depth, temperature
 Schema: temperature, depth
CSV file: marina.csv
```

## How was this patch tested?

The changes were tested by existing tests of CSVSuite and by 2 new tests.

Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>

Closes #20894 from MaxGekk/check-column-names.

* [SPARK-23903][SQL] Add support for date extract

## What changes were proposed in this pull request?

Add support for date `extract` function:
```sql
spark-sql> SELECT EXTRACT(YEAR FROM TIMESTAMP '2000-12-16 12:21:13');
2000
```
Supported field same as [Hive](https://github.com/apache/hive/blob/rel/release-2.3.3/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g#L308-L316): `YEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `DAYOFWEEK`, `HOUR`, `MINUTE`, `SECOND`.

## How was this patch tested?

unit tests

Author: Yuming Wang <yumwang@ebay.com>

Closes #21479 from wangyum/SPARK-23903.

* [SPARK-21896][SQL] Fix StackOverflow caused by window functions inside aggregate functions

## What changes were proposed in this pull request?

This PR explicitly prohibits window functions inside aggregates. Currently, this will cause StackOverflow during analysis. See PR #19193 for previous discussion.

## How was this patch tested?

This PR comes with a dedicated unit test.

Author: aokolnychyi <anton.okolnychyi@sap.com>

Closes #21473 from aokolnychyi/fix-stackoverflow-window-funcs.

* [SPARK-24290][ML] add support for Array input for instrumentation.logNamedValue

## What changes were proposed in this pull request?

Extend instrumentation.logNamedValue to support Array input
change the logging for "clusterSizes" to new method

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Lu WANG <lu.wang@databricks.com>

Closes #21347 from ludatabricks/SPARK-24290.

* [SPARK-24300][ML] change the way to set seed in ml.cluster.LDASuite.generateLDAData

## What changes were proposed in this pull request?

Using different RNG in all different partitions.

## How was this patch tested?

manually

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Lu WANG <lu.wang@databricks.com>

Closes #21492 from ludatabricks/SPARK-24300.

* [SPARK-24215][PYSPARK] Implement _repr_html_ for dataframes in PySpark

## What changes were proposed in this pull request?

Implement `_repr_html_` for PySpark while in notebook and add config named "spark.sql.repl.eagerEval.enabled" to control this.

The dev list thread for context: http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html

## How was this patch tested?

New ut in DataFrameSuite and manual test in jupyter. Some screenshot below.

**After:**
![image](https://user-images.githubusercontent.com/4833765/40268422-8db5bef0-5b9f-11e8-80f1-04bc654a4f2c.png)

**Before:**
![image](https://user-images.githubusercontent.com/4833765/40268431-9f92c1b8-5b9f-11e8-9db9-0611f0940b26.png)

Author: Yuanjian Li <xyliyuanjian@gmail.com>

Closes #21370 from xuanyuanking/SPARK-24215.

* [SPARK-16451][REPL] Fail shell if SparkSession fails to start.

Currently, in spark-shell, if the session fails to start, the
user sees a bunch of unrelated errors which are caused by code
in the shell initialization that references the "spark" variable,
which does not exist in that case. Things like:

```
<console>:14: error: not found: value spark
       import spark.sql
```

The user is also left with a non-working shell (unless they want
to just write non-Spark Scala or Python code, that is).

This change fails the whole shell session at the point where the
failure occurs, so that the last error message is the one with
the actual information about the failure.

For the python error handling, I moved the session initialization code
to session.py, so that traceback.print_exc() only shows the last error.
Otherwise, the printed exception would contain all previous exceptions
with a message "During handling of the above exception, another
exception occurred", making the actual error kinda hard to parse.

Tested with spark-shell, pyspark (with 2.7 and 3.5), by forcing an
error during SparkContext initialization.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21368 from vanzin/SPARK-16451.

* [SPARK-15784] Add Power Iteration Clustering to spark.ml

## What changes were proposed in this pull request?

According to the discussion on JIRA. I rewrite the Power Iteration Clustering API in `spark.ml`.

## How was this patch tested?

Unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #21493 from WeichenXu123/pic_api.

* [SPARK-24453][SS] Fix error recovering from the failure in a no-data batch

## What changes were proposed in this pull request?

The error occurs when we are recovering from a failure in a no-data batch (say X) that has been planned (i.e. written to offset log) but not executed (i.e. not written to commit log). Upon recovery the following sequence of events happen.

1. `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. Since there was no data in the batch, the `availableOffsets` is same as `committedOffsets`, so `isNewDataAvailable` is `false`.
2. When `MicroBatchExecution.constructNextBatch` is called, ideally it should immediately return true because the next batch has already been constructed. However, the check of whether the batch has been constructed was `if (isNewDataAvailable) return true`. Since the planned batch is a no-data batch, it escaped this check and proceeded to plan the same batch X *once again*.

The solution is to have an explicit flag that signifies whether a batch has already been constructed or not. `populateStartOffsets` is going to set the flag appropriately.

## How was this patch tested?

new unit test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #21491 from tdas/SPARK-24453.

* [SPARK-22384][SQL] Refine partition pruning when attribute is wrapped in Cast

## What changes were proposed in this pull request?

Sql below will get all partitions from metastore, which put much burden on metastore;
```
CREATE TABLE `partition_test`(`col` int) PARTITIONED BY (`pt` byte)
SELECT * FROM partition_test WHERE CAST(pt AS INT)=1
```
The reason is that the the analyzed attribute `dt` is wrapped in `Cast` and `HiveShim` fails to generate a proper partition filter.
This pr proposes to take `Cast` into consideration when generate partition filter.

## How was this patch tested?
Test added.
This pr proposes to use analyzed expressions in `HiveClientSuite`

Author: jinxing <jinxing6042@126.com>

Closes #19602 from jinxing64/SPARK-22384.

* [SPARK-24187][R][SQL] Add array_join function to SparkR

## What changes were proposed in this pull request?

This PR adds array_join function to SparkR

## How was this patch tested?

Add unit test in test_sparkSQL.R

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #21313 from huaxingao/spark-24187.

* [SPARK-23803][SQL] Support bucket pruning

## What changes were proposed in this pull request?
support bucket pruning when filtering on a single bucketed column on the following predicates -
EqualTo, EqualNullSafe, In, And/Or predicates

## How was this patch tested?
refactored unit tests to test the above.

based on gatorsmile work in https://github.com/apache/spark/commit/e3c75c6398b1241500343ff237e9bcf78b5396f9

Author: Asher Saban <asaban@palantir.com>
Author: asaban <asaban@palantir.com>

Closes #20915 from sabanas/filter-prune-buckets.

* [SPARK-24119][SQL] Add interpreted execution to SortPrefix expression

## What changes were proposed in this pull request?

Implemented eval in SortPrefix expression.

## How was this patch tested?

- ran existing sbt SQL tests
- added unit test
- ran existing Python SQL tests
- manual tests: disabling codegen -- patching code to disable beyond what spark.sql.codegen.wholeStage=false can do -- and running sbt SQL tests

Author: Bruce Robbins <bersprockets@gmail.com>

Closes #21231 from bersprockets/sortprefixeval.

* [SPARK-24224][ML-EXAMPLES] Java example code for Power Iteration Clustering in spark.ml

## What changes were proposed in this pull request?

Java example code for Power Iteration Clustering  in spark.ml

## How was this patch tested?

Locally tested

Author: Shahid <shahidki31@gmail.com>

Closes #21283 from shahidki31/JavaPicExample.

* [SPARK-24191][ML] Scala Example code for Power Iteration Clustering

## What changes were proposed in this pull request?

Added example code for Power Iteration Clustering in Spark ML examples

Author: Shahid <shahidki31@gmail.com>

Closes #21248 from shahidki31/sparkCommit.

* [SPARK-24477][SPARK-24454][ML][PYTHON] Imports submodule in ml/__init__.py and add ImageSchema into __all__

## What changes were proposed in this pull request?

This PR attaches submodules to ml's `__init__.py` module.

Also, adds `ImageSchema` into `image.py` explicitly.

## How was this patch tested?

Before:

```python
>>> from pyspark import ml
>>> ml.image
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'image'
>>> ml.image.ImageSchema
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'image'
```

```python
>>> "image" in globals()
False
>>> from pyspark.ml import *
>>> "image" in globals()
False
>>> image
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'image' is not defined
```

After:

```python
>>> from pyspark import ml
>>> ml.image
<module 'pyspark.ml.image' from '/.../spark/python/pyspark/ml/image.pyc'>
>>> ml.image.ImageSchema
<pyspark.ml.image._ImageSchema object at 0x10d973b10>
```

```python
>>> "image" in globals()
False
>>> from pyspark.ml import *
>>> "image" in globals()
True
>>> image
<module 'pyspark.ml.image' from  #'/.../spark/python/pyspark/ml/image.pyc'>
```

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21483 from HyukjinKwon/SPARK-24454.

* [SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s

## What changes were proposed in this pull request?

Introducing Python Bindings for PySpark.

- [x] Running PySpark Jobs
- [x] Increased Default Memory Overhead value
- [ ] Dependency Management for virtualenv/conda

## How was this patch tested?

This patch was tested with

- [x] Unit Tests
- [x] Integration tests with [this addition](https://github.com/apache-spark-on-k8s/spark-integration/pull/46)
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run SparkPi with a test secret mounted into the driver and executor pods
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
Run completed in 4 minutes, 28 seconds.
Total number of tests run: 11
Suites: completed 2, aborted 0
Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Author: Ilan Filonenko <if56@cornell.edu>
Author: Ilan Filonenko <ifilondz@gmail.com>

Closes #21092 from ifilonenko/master.

* [SPARK-17756][PYTHON][STREAMING] Workaround to avoid return type mismatch in PythonTransformFunction

## What changes were proposed in this pull request?

This PR proposes to wrap the transformed rdd within `TransformFunction`. `PythonTransformFunction` looks requiring to return `JavaRDD` in `_jrdd`.

https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/streaming/util.py#L67

https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala#L43

However, this could be `JavaPairRDD` by some APIs, for example, `zip` in PySpark's RDD API.
`_jrdd` could be checked as below:

```python
>>> rdd.zip(rdd)._jrdd.getClass().toString()
u'class org.apache.spark.api.java.JavaPairRDD'
```

So, here, I wrapped it with `map` so that it ensures returning `JavaRDD`.

```python
>>> rdd.zip(rdd).map(lambda x: x)._jrdd.getClass().toString()
u'class org.apache.spark.api.java.JavaRDD'
```

I tried to elaborate some failure cases as below:

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]) \
    .transform(lambda rdd: rdd.cartesian(rdd)) \
    .pprint()
ssc.start()
```

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.cartesian(rdd))
ssc.start()
```

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd))
ssc.start()
```

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).union(rdd.zip(rdd)))
ssc.start()
```

```python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 10)
ssc.queueStream([sc.range(10)]).foreachRDD(lambda rdd: rdd.zip(rdd).coalesce(1))
ssc.start()
```

## How was this patch tested?

Unit tests were added in `python/pyspark/streaming/tests.py` and manually tested.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19498 from HyukjinKwon/SPARK-17756.

* [SPARK-23010][K8S] Initial checkin of k8s integration tests.

These tests were developed in the https://github.com/apache-spark-on-k8s/spark-integration repo
by several contributors. This is a copy of the current state into the main apache spark repo.
The only changes from the current spark-integration repo state are:
* Move the files from the repo root into resource-managers/kubernetes/integration-tests
* Add a reference to these tests in the root README.md
* Fix a path reference in dev/dev-run-integration-tests.sh
* Add a TODO in include/util.sh

## What changes were proposed in this pull request?

Incorporation of Kubernetes integration tests.

## How was this patch tested?

This code has its own unit tests, but the main purpose is to provide the integration tests.
I tested this on my laptop by running dev/dev-run-integration-tests.sh --spark-tgz ~/spark-2.4.0-SNAPSHOT-bin--.tgz

The spark-integration tests have already been running for months in AMPLab, here is an example:
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-scheduled-spark-integration-master/

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Sean Suchter <sean-github@suchter.com>
Author: Sean Suchter <ssuchter@pepperdata.com>

Closes #20697 from ssuchter/ssuchter-k8s-integration-tests.

* [SPARK-24412][SQL] Adding docs about automagical type casting in `isin` and `isInCollection` APIs

## What changes were proposed in this pull request?
Update documentation for `isInCollection` API to clealy explain the "auto-casting" of elements if their types are different.

## How was this patch tested?
No-Op

Author: Thiruvasakan Paramasivan <thiru@apple.com>

Closes #21519 from trvskn/sql-doc-update.

* [SPARK-24468][SQL] Handle negative scale when adjusting precision for decimal operations

## What changes were proposed in this pull request?

In SPARK-22036 we introduced the possibility to allow precision loss in arithmetic operations (according to the SQL standard). The implementation was drawn from Hive's one, where Decimals with a negative scale are not allowed in the operations.

The PR handles the case when the scale is negative, removing the assertion that it is not.

## How was this patch tested?

added UTs

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21499 from mgaido91/SPARK-24468.

* [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor

## What changes were proposed in this pull request?
SPARK-23754 was fixed in #21383 by changing the UDF code to wrap the user function, but this required a hack to save its argspec. This PR reverts this change and fixes the `StopIteration` bug in the worker

## How does this work?

The root of the problem is that when an user-supplied function raises a `StopIteration`, pyspark might stop processing data, if this function is used in a for-loop. The solution is to catch `StopIteration`s exceptions and re-raise them as `RuntimeError`s, so that the execution fails and the error is reported to the user. This is done using the `fail_on_stopiteration` wrapper, in different ways depending on where the function is used:
 - In RDDs, the user function is wrapped in the driver, because this function is also called in the driver itself.
 - In SQL UDFs, the function is wrapped in the worker, since all processing happens there. Moreover, the worker needs the signature of the user function, which is lost when wrapping it, but passing this signature to the worker requires a not so nice hack.

## How was this patch tested?

Same tests, plus tests for pandas UDFs

Author: edorigatti <emilio.dorigatti@gmail.com>

Closes #21467 from e-dorigatti/fix_udf_hack.

* [SPARK-19826][ML][PYTHON] add spark.ml Python API for PIC

## What changes were proposed in this pull request?

add spark.ml Python API for PIC

## How was this patch tested?

add doctest

Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #21513 from huaxingao/spark--19826.

* [MINOR][CORE] Log committer class used by HadoopMapRedCommitProtocol

## What changes were proposed in this pull request?

When HadoopMapRedCommitProtocol is used (e.g., when using saveAsTextFile() or
saveAsHadoopFile() with RDDs), it's not easy to determine which output committer
class was used, so this PR simply logs the class that was used, similarly to what
is done in SQLHadoopMapReduceCommitProtocol.

## How was this patch tested?

Built Spark then manually inspected logging when calling saveAsTextFile():

```scala
scala> sc.setLogLevel("INFO")
scala> sc.textFile("README.md").saveAsTextFile("/tmp/out")
...
18/05/29 10:06:20 INFO HadoopMapRedCommitProtocol: Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
```

Author: Jonathan Kelly <jonathak@amazon.com>

Closes #21452 from ejono/master.

* [SPARK-24520] Double braces in documentations

There are double braces in the markdown, which break the link.

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Fokko Driesprong <fokkodriesprong@godatadriven.com>

Closes #21528 from Fokko/patch-1.

* [SPARK-24134][DOCS] A missing full-stop in doc "Tuning Spark".

## What changes were proposed in this pull request?

In the document [Tuning Spark -> Determining Memory Consumption](https://spark.apache.org/docs/latest/tuning.html#determining-memory-consumption), a full stop was missing in the second paragraph.

It's `...use SizeEstimator’s estimate method This is useful for experimenting...`, while there is supposed to be a full stop before `This`.

Screenshot showing before change is attached below.
<img width="1033" alt="screen shot 2018-05-01 at 5 22 32 pm" src="https://user-images.githubusercontent.com/11539188/39468206-778e3d8a-4d64-11e8-8a92-38464952b54b.png">

## How was this patch tested?

This is a simple change in doc. Only one full stop was added in plain text.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Xiaodong <11539188+XD-DENG@users.noreply.github.com>

Closes #21205 from XD-DENG/patch-1.

* [SPARK-22144][SQL] ExchangeCoordinator combine the partitions of an 0 sized pre-shuffle to 0

## What changes were proposed in this pull request?
when the length of pre-shuffle's partitions is 0, the length of post-shuffle's partitions should be 0 instead of spark.sql.shuffle.partitions.

## How was this patch tested?
ExchangeCoordinator converted a  pre-shuffle that partitions is 0 to a post-shuffle that partitions is 0 instead of one that partitions is spark.sql.shuffle.partitions.

Author: liutang123 <liutang123@yeah.net>

Closes #19364 from liutang123/SPARK-22144.

* [SPARK-23732][DOCS] Fix source links in generated scaladoc.

Apply the suggestion on the bug to fix source links. Tested with
the 2.3.1 release docs.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #21521 from vanzin/SPARK-23732.

* [SPARK-24502][SQL] flaky test: UnsafeRowSerializerSuite

## What changes were proposed in this pull request?

`UnsafeRowSerializerSuite` calls `UnsafeProjection.create` which accesses `SQLConf.get`, while the current active SparkSession may already be stopped, and we may hit exception like this

```
sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped.
	at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97)
	at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80)
	at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93)
	at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
	at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120)
	at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286)
	at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
	at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
	at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
	at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95)
	at scala.Option.map(Option.scala:146)
	at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95)
	at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94)
	at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126)
	at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54)
	at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157)
	at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150)
	at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54)
	at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49)
	at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63)
	at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60)
...
```

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #21518 from cloud-fan/test.

* docs: fix typo

no => no[t]

## What changes were proposed in this pull request?

Fixing a typo.

## How was this patch tested?

Visual check of the docs.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tom Saleeba <tom.saleeba@gmail.com>

Closes #21496 from tomsaleeba/patch-1.

* [SPARK-15064][ML] Locale support in StopWordsRemover

## What changes were proposed in this pull request?

Add locale support for `StopWordsRemover`.

## How was this patch tested?

[Scala|Python] unit tests.

Author: Lee Dongjin <dongjin@apache.org>

Closes #21501 from dongjinleekr/feature/SPARK-15064.

* [SPARK-24531][TESTS] Remove version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite

## What changes were proposed in this pull request?

Removing version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite as it is not present anymore in the mirrors and this is blocking all the open PRs.

## How was this patch tested?

running UTs

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21540 from mgaido91/SPARK-24531.

* [SPARK-24416] Fix configuration specification for killBlacklisted executors

## What changes were proposed in this pull request?

spark.blacklist.killBlacklistedExecutors is defined as

(Experimental) If set to "true", allow Spark to automatically kill, and attempt to re-create, executors when they are blacklisted. Note that, when an entire node is added to the blacklist, all of the executors on that node will be killed.

I presume the killing of blacklisted executors only happens after the stage completes successfully and all tasks have completed or on fetch failures (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is confusing because the definition states that the executor will be attempted to be recreated as soon as it is blacklisted. This is not true while the stage is in progress and an executor is blacklisted, it will not attempt to cleanup until the stage finishes.

Author: Sanket Chintapalli <schintap@yahoo-inc.com>

Closes #21475 from redsanket/SPARK-24416.

* [SPARK-23931][SQL] Adds arrays_zip function to sparksql

Signed-off-by: DylanGuedes <djmgguedesgmail.com>

## What changes were proposed in this pull request?

Addition of arrays_zip function to spark sql functions.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Unit tests that checks if the results are correct.

Author: DylanGuedes <djmgguedes@gmail.com>

Closes #21045 from DylanGuedes/SPARK-23931.

* [SPARK-24216][SQL] Spark TypedAggregateExpression uses getSimpleName that is not safe in scala

## What changes were proposed in this pull request?

When user create a aggregator object in scala and pass the aggregator to Spark Dataset's agg() method, Spark's will initialize TypedAggregateExpression with the nodeName field as aggregator.getClass.getSimpleName. However, getSimpleName is not safe in scala environment, depending on how user creates the aggregator object. For example, if the aggregator class full qualified name is "com.my.company.MyUtils$myAgg$2$", the getSimpleName will throw java.lang.InternalError "Malformed class name". This has been reported in scalatest https://github.com/scalatest/scalatest/pull/1044 and discussed in many scala upstream jiras such as SI-8110, SI-5425.

To fix this issue, we follow the solution in https://github.com/scalatest/scalatest/pull/1044 to add safer version of getSimpleName as a util method, and TypedAggregateExpression will invoke this util method rather than getClass.getSimpleName.

## How was this patch tested?
added unit test

Author: Fangshi Li <fli@linkedin.com>

Closes #21276 from fangshil/SPARK-24216.

* [SPARK-23933][SQL] Add map_from_arrays function

## What changes were proposed in this pull request?

The PR adds the SQL function `map_from_arrays`. The behavior of the function is based on Presto's `map`. Since SparkSQL already had a `map` function, we prepared the different name for this behavior.

This function returns returns a map from a pair of arrays for keys and values.

## How was this patch tested?

Added UTs

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #21258 from kiszk/SPARK-23933.

* [SPARK-23010][BUILD][FOLLOWUP] Fix java checkstyle failure of kubernetes-integration-tests

## What changes were proposed in this pull request?

Fix java checkstyle failure of kubernetes-integration-tests

## How was this patch tested?

Checked manually on my local environment.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #21545 from jiangxb1987/k8s-checkstyle.

* [SPARK-24506][UI] Add UI filters to tabs added after binding

## What changes were proposed in this pull request?

Currently, `spark.ui.filters` are not applied to the handlers added after binding the server. This means that every page which is added after starting the UI will not have the filters configured on it. This can allow unauthorized access to the pages.

The PR adds the filters also to the handlers added after the UI starts.

## How was this patch tested?

manual tests (without the patch, starting the thriftserver with `--conf spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter --conf spark.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params="type=simple"` you can access `http://localhost:4040/sqlserver`; with the patch, 401 is the response as for the other pages).

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #21523 from mgaido91/SPARK-24506.

* [SPARK-22239][SQL][PYTHON] Enable grouped aggregate pandas UDFs as window functions with unbounded window frames

## What changes were proposed in this pull request?
This PR enables using a grouped aggregate pandas UDFs as window functions. The semantics is the same as using SQL aggregation function as window functions.

```
       >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
       >>> from pyspark.sql import Window
       >>> df = spark.createDataFrame(
       ...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
       ...     ("id", "v"))
       >>> pandas_udf("double", PandasUDFType.GROUPED_AGG)
       ... def mean_udf(v):
       ...     return v.mean()
       >>> w = Window.partitionBy('id')
       >>> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show…
matkam pushed a commit to evie/spark that referenced this pull request Aug 3, 2018
….imagePullSecrets

Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

Unit tests + manual testing.

Manual testing procedure:
1. Have private image registry.
2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message:
```
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
```
3. Create secret `kubectl create secret docker-registry ...`
4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful.

Author: Andrew Korzhuev <andrew.korzhuev@klarna.com>
Author: Andrew Korzhuev <korzhuev@andrusha.me>

Closes apache#20811 from andrusha/spark-23668-image-pull-secrets.
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
….imagePullSecrets (apache#355)

Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

Unit tests + manual testing.

Manual testing procedure:
1. Have private image registry.
2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message:
```
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
```
3. Create secret `kubectl create secret docker-registry ...`
4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful.

Author: Andrew Korzhuev <andrew.korzhuev@klarna.com>
Author: Andrew Korzhuev <korzhuev@andrusha.me>

Closes apache#20811 from andrusha/spark-23668-image-pull-secrets.
Agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
…cript

K8S-1077 (apache#598)

* K8S-1077 - use single k8s secret with user info

MapR [SPARK-651] Replacing joda-time-*.jar with joda-time-2.10.3.jar.

MapR [SPARK-638] Wrong permissions when creating files under directory
with GID bit set.

MapR [SPARK-627] SparkHistoryServer-2.4 is getting 403 Unauthorized home page for users(spark.ui.view.acls) via spark-submit

MapR [SPARK-639] Default headers are adding two times

MapR [SPARK-629] Spark UI for job lose CSS styles

MapR [MS-925] After upgrade to MEP 6.2 (Spark 2.4.0) can no longer
consume Kafka / MapR Streams.

MapR [SPARK-626] Update kafka dependencies for Spark 2.4.4.0 in release MEP-6.3.0

MapR [SPARK-340] Jetty web server version at Spark should be updated tp v9.4.X

MapR [SPARK-617] an't use ssl via spark beeline

MapR [SPARK-617] Can't use ssl via spark beeline

MapR [SPARK-620] Replace core dependency in Spark-2.4.4

MapR [SPARK-621] Fix multiple XML configuration initialization for (apache#575)

custom headers. Use X-XSS-Protection, X-Content-Type-Options
Content-Security-Policy and Strict-Transport-Security configuration
only in case: cluster security is enabled OR
spark.ui.security.headers.enabled set to true.

MapR [SPARK-595] Spark cannot access hs2 through zookeeper

Revert "MapR [SPARK-595] Spark cannot access hs2 through zookeeper (apache#577)"

MapR [SPARK-595] Spark cannot access hs2 through zookeeper

MapR [SPARK-620] Replace core dependency in Spark-2.4.

MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4 (apache#574)

* Adding SQL API to write to kafka from Spark (apache#567)

* Branch 2.4.3 extended kafka and examples (apache#569)

* The v2 API is in its own package

- the v2 api is in a different package
- the old functionality is available in a separated package

* v2 API examples

- All the examples are using the newest API.
- I have removed the old examples since they are not relevant any more and the same functionality is shown in the new examples usin the new API.

* MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4

CORE-321. Add custom http header support for jetty.

MapR [SPARK-609] Port Apache Spark-2.4.4 changes to the MapR Spark-2.4.4 branch

Adding multi table loader (apache#560)

* Adding multi table loader

- This allows us to load multiple matching tables into one Union DataFrame.

If we have the fallowing MFS structure:

```
/clients/client_1/data.table
/clients/client_2/data.table
```
we can load a union dataframe by doing `loadFromMapRDB("/clients/*/*.table")`

* Fixing the path to the reader

MapR [SPARK-588] Spark thriftserver fails when work with hive-maprdb json table

MapR [SPARK-598] Spark can't add needed properties to hive-site.xml

MAPR-SPARK-596: Change HBase compatible version for Spark 2.4.3

MapR [SPARK-592] Add possibility to use start-thriftserver.sh script with 2304 port

MapR [SPARK-584] MaprDB connector's setHintUsingIndex method doesn't work as expected

MapR [SPARK-583] MaprDB connector's loadFromMaprDB function for Java API doesn't work as expected

SPARK-579 info about ssl_trustore is added for metrics

MapR [SPARK-552] Failed to get broadcast_11_piece0 of broadcast_11

SPARK-569 Generation of SSL ceritificates for spark UI

MapR [SPARK-575] Warning messages in spark workspace after the second attempt to login to job's UI

Update zookeeper version

Adding `joinWithMapRDBTable` function (apache#529)

The related documentation of this function is here https://github.com/anicolaspp/MapRDBConnector#joinwithmaprdbtable.

The main idea is that having a dataframe (no matter how was it constructed) we can join it with a MapR-DB table. This functions looks at the join query and load only those records from MapR-DB that will join instead of loading the full table and then join in memory. In other words, we only load what we know will be joint.

Adding DataSource Reader Support (apache#525)

* Adding DataSource Reader Support

* Update SparkSessionExt.scala

* creating a package object

* Update MapRDBSpark.scala

* fully path to avoid name collition

* refactorings

MapR [SPARK-451] Spark hadoop/core dependency updates

MapR [SPARK-566] Move absent commits from 2.4.0 branch

MapR [SPARK-561] Spark 2.4.3 porting to MapR

MapR [SPARK-561] Spark 2.4.3 porting to MapR

MapR [SPARK-558] Render application UI init page if driver is not up

MapR [SPARK-541] Avoid duplication of the first unexpired record

MapR [COLD-150][K8S] Fix metrics copy

MapR [K8S-893] Hide plain text password from logs

MapR [SPARK-540] Include 'avro' artifacts

MapR [SPARK-536] PySpark streaming package for kafka-0-10 added

K8S-853: Enable spark metrics for external tenant

MapR [SPARK-531] Remove duplicating entries from classpath in ClasspathFilter

MapR [SPARK-516] Spark jobs failure using yarn mode on kerberos fixed

MapR [SPARK-462] Spark and SparkHistoryServer allow week ciphers, which can allow man in the middle attack

[SPARK-508] MapR-DB OJAI Connector for Spark isNull condition returns incorrect result

MapR [SPARK-510] nonmapr "admin" users not able to view other user logs in SHS

SPARK-460: Spark Metrics for CollectD Configuration for collecting Spark metrics

SPARK-463 MAPR_MAVEN_REPO variable for specifying mapR repository

MapR [SPARK-492] Spark 2.4.0.0 configure.sh has error messages

MapR [SPARK-515][K8S] Remove configure.sh call for k8s

MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg

MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg

MapR [SPARK-514] Recovery from checkpoint is broken

MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka09 package (apache#460)

* MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already TTL in the first batch (apache#376)"

This reverts commit e8d59b9.

* MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already ttl in first batch (apache#368)"

This reverts commit b282a8b.

MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka10 package

MapR [SPARK-469] Fix NPE in generated classes by reverting "[SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection" (apache#455)

This reverts commit c5583fd.

MapR [SPARK-482] Spark streaming app fails to start by UnknownTopicOrPartitionException with checkpoint

MapR [SPARK-496] Spark HS UI doesn't work

MapR [SPARK-416] CVE-2018-1320 vulnerability in Apache Thrift

MapR [SPARK-486][K8S] Fix sasl encryption error on Kubernetes

MapR [SPARK-481] Cannot run spark configure.sh on Client node

MapR [K8S-637][K8S] Add configure.sh configuration in spark-defaults.conf for job runtime

MapR [SPARK-465] Error messages after update of spark 2.4

MapR [SPARK-465] Error messages after update of spark 2.4

MapR [SPARK-464] Can't submit spark 2.4 jobs from mapr-client

[SPARK-466] SparkR errors fixed

MapR [SPARK-456] Spark shell can't be started

SPARK-417 impersonation fixes for spark executor. Impersonation is mo… (apache#433)

* SPARK-417 impersonation fixes for spark executor. Impersonation is moved from HadoopRDD.compute() method to org.apache.spark.executor.Executor.run() method

* SPARK-363 Hive version changed to '1.2.0-mapr-spark-MEP-6.0.0'

[SPARK-449] Kafka offset commit issue fixed

MapR [SPARK-287] Move logic of creating /apps/spark folder from installer's scripts to the configure.sh

MapR [SPARK-221] Investigate possibility to move creating of the spark-env.sh from private-pkg to configure.sh

MapR [SPARK-430] PID files should be under /opt/mapr/pid

MapR [SPARK-446] Spark configure.sh doesn't start/stop Spark services

MapR [SPARK-434] Move absent commits from 2.3.2 branch (apache#425)

* MapR [SPARK-352] Spark shell fails with "NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream" if java is not available in PATH

* MapR [SPARK-350] Deprecate Spark Kafka-09 package

* MapR [SPARK-326] Investigate possibility of writing Java example for the MapRDB OJAI connector

* [SPARK-356] Merge mapr changes from kafka-09 package into the kafka-10

* SPARK-319 Fix for sparkR version check

* MapR [SPARK-349] Update OJAI client to v3 for Spark MapR-DB JSON connector

* MapR [SPARK-367] Move absent commits from 2.3.1 branch

* MapR [SPARK-137] Analyze the warning during compilation of OJAI connector

* MapR [SPARK-369] Spark 2.3.2 fails with error related to zookeeper

* [MAPR-26258] hbasecontext.HBaseDistributedScanExample fails

* [SPARK-24355] Spark external shuffle server improvement to better handle block fetch requests

* MapR [SPARK-374] Spark Hive example fails when we submit job from another(simple) cluster user

* MapR [SPARK-434] Move absent commits from 2.3.2 branch

* MapR [SPARK-434] Move absent commits from 2.3.2 branch

* MapR [SPARK-373] Unexpected behavior during job running in standalone cluster mode

* MapR [SPARK-419] Update hive-maprdb-json-handler jar for spark 2.3.2.0 and spark 2.2.1

* MapR [SPARK-396] Interface change of sendToKafka

* MapR [SPARK-357] consumer groups are prepeneded with a "service_" prefix

* MapR [SPARK-429] Changes in maprdb connector are the cause of broken backward compatibility

* MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr

* MapR [SPARK-434] Move absent commits from 2.3.2 branch

* Move absent commits from 2.3.2 branch

* MapR [SPARK-434] Move absent commits from 2.3.2 branch

* Move absent commits from 2.3.2 branch

* Move absent commits from 2.3.2 branch

MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr

MapR [SPARK-379] Spark 2.4 4-gidit version

MapR [PIC-48][K8S] Port k8s changes to 2.4.0

[PIC-48] Create user for k8s driver and executor if required

[PIC-48] Create user for k8s driver and executor if required

Revert "Remove spark.ui.filters property"

This reverts commit d8941ba36c3451cdce15d18d6c1a52991de3b971.

[SPARK-351] Copy kubernetes start scripts anyway

PIC-34: Rename default configmap name to be consistent with mapr-kubernetes

[SPARK-23668][K8S] Add config option for passing through k8s Pod.spec.imagePullSecrets (apache#355)

Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

Unit tests + manual testing.

Manual testing procedure:
1. Have private image registry.
2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message:
```
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
```
3. Create secret `kubectl create secret docker-registry ...`
4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful.

Author: Andrew Korzhuev <andrew.korzhuev@klarna.com>
Author: Andrew Korzhuev <korzhuev@andrusha.me>

Closes apache#20811 from andrusha/spark-23668-image-pull-secrets.

[SPARK-321] Change default value of spark.mapr.ssl.secret.prefix property

[PIC-32] Spark on k8s with MapR secure cluster

Update entrypoint.sh with correct spark version (apache#340)

This PR has minor fix to correct the spark version string

[SPARK-274] Create home directory for user who submitted job

[MAPR-SPARK-230] Implement security for Spark on Kubernetes

Run Spark job with specify the username for driver and executor

Read cluster configs from configMap

Run configure.sh script form entrypoint.sh

Remove spark.kubernetes.driver.pod.commands property

Add Spark properties for executor and driver environment variable

MapR [SPARK-296] Structured Streaming memory leak

Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.tem…" (apache#252)

* Revert "[MAPR-SPARK-176] Fix Spark Project Catalyst unit tests (apache#251)"

This reverts commit 5de05075cd14abf8ac65046a57a5d76617818fbe.

* Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template (apache#249)"

This reverts commit 1baa677d727e89db7c605ffbae9a9eba00337ad0.

[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template

MapR [SPARK-379] Port Spark to 2.4.0

MapR [SPARK-341] Spark 2.3.2 porting

[MAPR-32290] Spark processing offsets when messages are already TTL in the first batch

* Bug 32263 - Seek called on unsubscribed partitions

[MSPARK-331] Remove snapshot versions of mapr dependencies from Spark-2.3.1

[MAPR-32290] Spark processing offsets when messages are already ttl in first batch

MapR [SPARK-325] Add examples for work with the MapRDB JSON connector into the Spark project

[ATS-449] Unit test for EBF 32013 created.

MAPR-SPARK-311: Spark beeline uses default ssl truststore instead of mapr ssl truststore

Bug 32355 - Executor tab empty on Spark UI

[SPARK-318] Submitting Spark jobs from Oozie fails due to ClassNotFoundException

Bug 32014 - Spark Consumer fails with java.lang.AssertionError

Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1" (apache#341)

* Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1 (apache#335)"

This reverts commit 832411e.

Bug 32014 - Spark Consumer fails with java.lang.AssertionError (apache#326) (apache#336)

* MapR [32014] Spark Consumer fails with java.lang.AssertionError

[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1

DEVOPS-2768 temporarily removed curl for file downloading

[SPARK-302] Local privilege escalation

MapR [SPARK-297] Added unit test for empty value conversion

MapR [SPARK-297] Empty values are loaded as non-null

MapR [SPARK-296] Structured Streaming memory leak

2.3.1 spark 289 (apache#318)

* MapR [SPARK-289] Fix unit test for Spark-2.3.1

[SPARK-130] MapRDB connector - NPE while saving Pair RDD with 'null' values

MapR [SPARK-283] Unit tests fail during initialization SSL properties.

[SPARK-212] SparkHiveExample fails when we run it twice

MapR [SPARK-282] Remove maprfs and hadoop jars from mapr spark package

MapR [SPARK-278] Spark submit fails for jobs with python

MapR [SPARK-279] Can't connect to spark thrift server with new spark and hive packages

MapR [SPARK-276] Update zookeeper dependency to v.3.4.11 for spark 2.3.1

MapR [SPARK-272] Use only client passwords from ssl-client.xml

MapR [SPARK-266] Spark jobs can't finish correctly, when there is an error during job running

MapR [SPARK-263] Add possibility to use keyPassword which is different from keyStorePassword

[MSPARK-31632] RM UI showing broken page for Spark jobs

MapR [SPARK-261] Use mapr-security-web for getting passwords.

MapR [SPARK-259] Spark application doesn't finish correctly

MapR [SPARK-268] Update Spark version for Warden

change project version to 2.3.1-mapr-SNAPSHOT

MapR [SPARK-256] Spark doesn't work on yarn mode

MapR [SPARK-255] Installer fresh install 610/600 secure fails to start "mapr-spark-thriftserver", "mapr-spark-historyserver"

Mapr [SPARK-248] MapRDBTableScanRDD fails to convert to Scala Dataframe when using where clause

MapR [SPARK-225] Hadoop credentials provider usage for hiding passwords at spark-defaults

MapR [SPARK-214] Hive-2.1 poperties can't be read from a hive-site.xml as Spark uses Hive-1.2

MapR [SPARK-216] Spark thriftserver fails when work with hive-maprdb json table

SPARK-244 (apache#278)

Provide ability to use MapR-Negotiation authentication for Spark HistoryServer

MapR [SPARK-226] Spark - pySpark Security Vulnerability

MapR [SPARK-220] SparkR fails with UDF functions bug fixed

MapR [SPARK-227] KafkaUtils.createDirectStream fails with kafka-09

MapR [SPARK-183] Spark Integration for Kafka 0.10 unit tests disabled

MapR [SPARK-182] Spark Project External Kafka Producer v09 unit tests fixed

MapR [SPARK-179] Spark Integration for Kafka 0.9 unit tests fixed

MapR [SPARK-181] Kafka 0.10 Structured Streaming unit tests fixed

[MSPARK-31305] Spark History server NOT loading applications submitted by users other than 'mapr'

MapR [SPARK-175] Fix Spark Project Streaming unit tests

[MAPR-SPARK-176] Fix Spark Project Catalyst unit tests

[MAPR-SPARK-178] Fix Spark Project Hive unit tests

MapR [SPARK-174] Spark Core unit tests fixed

Changed version for spark-kafka connector.

MapR [SPARK-202] Update MapR Spark to 2.3.0

Fixed compile time errors in tests

Change project version

[SPARK-198] Update hadoop dependency version to 2.7.0-mapr-1803 for Spark 2.2.1

MapR [SPARK-188] Couldn't connect to thrift server via spark beeline on kerberos cluster

MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters

MapR [SPARK-186] Update OJAI versions to the latest for Spark-2.2.1 OJAI Connector

MapR [SPARK-191] Incorrect work of MapR-DB Sink 'complete' and 'update' modes fixed

MapR [SPARK-170] StackOverflowException in equals method in DBMapValue

2.2.1 build fixed (apache#231)

* MapR [SPARK-164] Update Kafka version to 1.0.1-mapr in Spark Kafka Producer module

MapR [SPARK-161] Include Kafka Structured streaming jar to Spark package.

MapR [SPARK-155] Change Spark Master port from 8080

MapR [SPARK-153] Exception in spark job with configured labels on yarn-client mode

MapR [SPARK-152] Incorrect date string parsing fixed

MapR [SPARK-21] Structured Streaming MapR-DB Sink created

MapR [SPARK-135]  Spark 2.2 with MapR Streams ( Kafka 1.0) (apache#218)

* MapR [SPARK-135] Spark 2.2 with MapR Streams (Kafka 1.0)
Added functionality of MapR-Streams specific EOF handling.

MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters

Disable build failing if scalastyle checking is fall.

MapR [SPARK-16] Change Spark version in Warden files and configure.sh

MapR [SPARK-144] Add insertToMapRDB method for rdd for Java API

[MAPR-30536]  Spark SQL queries on Map column fails after upgrade

MapR [SPARK-139] Remove "update" related APIs from connector

MapR [SPARK-140] Change the option name "tableName" to "tablePath" in the Spark/MapR-DB connectors.

MapR [SPARK-121] Spark OJAI JAVA: update functionality removed

MapR [SPARK-118] Spark OJAI Python: missed DataFrame import while moving imports in order to fix MapR [ZEP-101] interpreter issue

MapR [SPARK-118] Spark OJAI Python: move MapR DB Connector class importing in order to fix MapR [ZEP-101] interpreter issue

MapR [SPARK-117] Spark OJAI Python: Save functionality implementation

MapR [SPARK-131] Exception when try to save JSON table with Binary _id field

Spark OJAI JAVA: load to RDD, save from RDD implementation (apache#195)

* MapR [SPARK-124] Loading to JavaRDD implemented
* MapR [SPARK-124] MapRDBJavaSparkContext constructor changed
* MapR [SPARK-124] implemented RDD[Row] saving

MapR [SPARK-118] Spark OJAI Python: Read implementation

MapR [SPARK-128] MapRDB connector - wrong handle of null fields when nullable is false

* MapR [SPARK-121] Spark OJAI JAVA: Read to Dataset functionality implementation
* Minor refactoring

MapR [SPARK-125] Default value of idFieldPath parameter is not handle

MapR [SPARK-113] Hit java.lang.UnsupportedOperationException: empty.reduceLeft during loadFromMapRDB

Spark Mapr-DB connector was refactored according to Scala style
Removed code duplication

[MSPARK-107]idField information is lost in MapRDBDataFrameWriterFunctions.saveToMapRDB

configure.sh takes options to change ports

Kafka client excluded from package because correct version is located in "mapr classpath"

Changed Kafka version in Kafka producer module.

Branch spark 69 (apache#170)

* Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame.

* SPARK-69: Problem with license when we try to read from json and write to maprdb

remove creatin /usr/local/spark link from configure.sh. This link will be creates by private-pkg

remove include-maprdb from default profiles

added profiles in maprdb pom file instead of two pom files

Fixed maprdb connector dependencies.

Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame.

changed port for spark-thriftserver as it conflicts with hive server

changed port for spark-thriftserver as it conflicts with hive server

remove .not_configured_yet file after success

Ojai connector fixed required java version

[MSPARK-45] Move Spark-OJAI connector code to Spark github repo (apache#132)

* SPARK-45 Move Spark-OJAI connector code to Spark github repo

* Fixing pom versions for maprdb spark connector.

* Changes made to the connector code to be compatible with 5.2.* and 6.0 clients.

Spark 2.1.0 mapr 29106 (apache#150)

* [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher.

Blindly deserializing classes using Java serialization opens the code up to
issues in other libraries, since just deserializing data from a stream may
end up execution code (think readObject()).

Since the launcher protocol is pretty self-contained, there's just a handful
of classes it legitimately needs to deserialize, and they're in just two
packages, so add a filter that throws errors if classes from any other
package show up in the stream.

This also maintains backwards compatibility (the updated launcher code can
still communicate with the backend code in older Spark releases).

Tested with new and existing unit tests.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#18166 from vanzin/SPARK-20922.

(cherry picked from commit 8efc6e9)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

(cherry picked from commit 772a9b9)

* [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#18178 from vanzin/SPARK-20922-hotfix.

Added security by default for historyserver

use waitForConsumerAssignment() instead of consumer.poll(0) for spark-29052

change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh

change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh

changes for mapr-classpath.sh

changes for mapr-classpath.sh

configure.sh changes

[SPARK-39] Classpath filter was added

Fixed impersonation when data read from MapR-DB via Spark-Hive.

added configure.sh and warden.spark-thriftserver.conf

hive-hbase-handler added to Spark jars

Fixed "Single message comes late"

28339 bug fixed

Spark streaming skipped message with zero offset from Kafka 0.9

[MSPARK-9] Initial fix for Spark unit tests

Bump dependencies after ECO-1703 release

[SPARK-33] Streaming example fixed

[MAPR-26060] Fixed case when mapr-streams make gaps in offsets

ported features from kafka 10 to kafka 9

[MAPR-26289][SPARK-2.1] Streaming general improvements (apache#93)

* Added include-kafka-09 profile to Assembly
* Set default poll timeout to 120s

Set default HBase verison to 1.1.8

Changes from Kafka10  package were ported to Kafka09 package.

[MAPR-26053] Include MapR Classes to the default value of spark.sql.hive.metastore.sharedPrefixes

[MAPR-25807] Spark-Warehouse path computes incorrectly

Add MapR-SASL support for Thrift Server

Adding scala library.

[MAPR-25713] Spark might try to load MapR Class Loader multiple times and fail

[MAPR-25311] Bump Spark dependencies after ECO-1611 release

[MINOR] Fix spark-jars.sh script

[MAPR-24603] Could not launch beeline shell after starting spark thrift server

fixed syntax error in V09DirectKafkaWordCount example

Spark 2.0.1 MAPR-streams Python API

[MAPR-24415] SPARK_JAVA_OPTS is deprecated

Kafka streaming producer added.

Minor fix for previous commit

Added script for MAPR-24374

Some minor changes to spark-defaults.conf

Changed default HBase version to 1.1.1 in compatibility.version

Streaming example was refactored

[MAPR-24470] HiveFromSpark test fails in yarn-cluster mode

Added MapR Repo

[MAPR-22940] Failed to connect spark beeline (after spark thrift server is started) on Kerberos cluster

[MAPR-18865] Unable to submit spark apps from Windows client

Skip maven clean task on the parent module

New: Issue with running Hive commands in Spark

This is fixed in SPARK-7819
Isolated Hive Client Loader appears to cause Native Library
libMapRClient.4.0.2-mapr.so already loaded in another classloader error

Spark warden.services.conf should have dependency on cldb

Remove DFS shuffle settings.

These settings are not used right now.

Copy every file in the conf directory into the distribution package.

Create spark-defaults.conf for MapR

Settings to enable DFS shuffle on MapR.

Support hbase classpath computation in util script.

Adding external conf and scripts.

Enable SPARK_HIVE mode while building.

This is needed to bundle datanucleus jars needed for hive table creation.

Build Spark on MapR.
- make-distribution.sh takes an environment variable to enable profiles -
  MVN_PROFILE_ARG
- Added warden conf files under ext-conf.
- Updated pom.xml to use right set of jars and version.

Spark Master failed to start in HA mode

Updated Apache Curator version

Added spark streaming integration with kafka 0.9 and mapr-streams

Added MapR Repo
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
…cript

K8S-1077 (apache#598)

* K8S-1077 - use single k8s secret with user info

MapR [SPARK-651] Replacing joda-time-*.jar with joda-time-2.10.3.jar.

MapR [SPARK-638] Wrong permissions when creating files under directory
with GID bit set.

MapR [SPARK-627] SparkHistoryServer-2.4 is getting 403 Unauthorized home page for users(spark.ui.view.acls) via spark-submit

MapR [SPARK-639] Default headers are adding two times

MapR [SPARK-629] Spark UI for job lose CSS styles

MapR [MS-925] After upgrade to MEP 6.2 (Spark 2.4.0) can no longer
consume Kafka / MapR Streams.

MapR [SPARK-626] Update kafka dependencies for Spark 2.4.4.0 in release MEP-6.3.0

MapR [SPARK-340] Jetty web server version at Spark should be updated tp v9.4.X

MapR [SPARK-617] an't use ssl via spark beeline

MapR [SPARK-617] Can't use ssl via spark beeline

MapR [SPARK-620] Replace core dependency in Spark-2.4.4

MapR [SPARK-621] Fix multiple XML configuration initialization for (apache#575)

custom headers. Use X-XSS-Protection, X-Content-Type-Options
Content-Security-Policy and Strict-Transport-Security configuration
only in case: cluster security is enabled OR
spark.ui.security.headers.enabled set to true.

MapR [SPARK-595] Spark cannot access hs2 through zookeeper

Revert "MapR [SPARK-595] Spark cannot access hs2 through zookeeper (apache#577)"

MapR [SPARK-595] Spark cannot access hs2 through zookeeper

MapR [SPARK-620] Replace core dependency in Spark-2.4.

MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4 (apache#574)

* Adding SQL API to write to kafka from Spark (apache#567)

* Branch 2.4.3 extended kafka and examples (apache#569)

* The v2 API is in its own package

- the v2 api is in a different package
- the old functionality is available in a separated package

* v2 API examples

- All the examples are using the newest API.
- I have removed the old examples since they are not relevant any more and the same functionality is shown in the new examples usin the new API.

* MapR [SPARK-619] Move absent commits from 2.4.3 branch to 2.4.4

CORE-321. Add custom http header support for jetty.

MapR [SPARK-609] Port Apache Spark-2.4.4 changes to the MapR Spark-2.4.4 branch

Adding multi table loader (apache#560)

* Adding multi table loader

- This allows us to load multiple matching tables into one Union DataFrame.

If we have the fallowing MFS structure:

```
/clients/client_1/data.table
/clients/client_2/data.table
```
we can load a union dataframe by doing `loadFromMapRDB("/clients/*/*.table")`

* Fixing the path to the reader

MapR [SPARK-588] Spark thriftserver fails when work with hive-maprdb json table

MapR [SPARK-598] Spark can't add needed properties to hive-site.xml

MAPR-SPARK-596: Change HBase compatible version for Spark 2.4.3

MapR [SPARK-592] Add possibility to use start-thriftserver.sh script with 2304 port

MapR [SPARK-584] MaprDB connector's setHintUsingIndex method doesn't work as expected

MapR [SPARK-583] MaprDB connector's loadFromMaprDB function for Java API doesn't work as expected

SPARK-579 info about ssl_trustore is added for metrics

MapR [SPARK-552] Failed to get broadcast_11_piece0 of broadcast_11

SPARK-569 Generation of SSL ceritificates for spark UI

MapR [SPARK-575] Warning messages in spark workspace after the second attempt to login to job's UI

Update zookeeper version

Adding `joinWithMapRDBTable` function (apache#529)

The related documentation of this function is here https://github.com/anicolaspp/MapRDBConnector#joinwithmaprdbtable.

The main idea is that having a dataframe (no matter how was it constructed) we can join it with a MapR-DB table. This functions looks at the join query and load only those records from MapR-DB that will join instead of loading the full table and then join in memory. In other words, we only load what we know will be joint.

Adding DataSource Reader Support (apache#525)

* Adding DataSource Reader Support

* Update SparkSessionExt.scala

* creating a package object

* Update MapRDBSpark.scala

* fully path to avoid name collition

* refactorings

MapR [SPARK-451] Spark hadoop/core dependency updates

MapR [SPARK-566] Move absent commits from 2.4.0 branch

MapR [SPARK-561] Spark 2.4.3 porting to MapR

MapR [SPARK-561] Spark 2.4.3 porting to MapR

MapR [SPARK-558] Render application UI init page if driver is not up

MapR [SPARK-541] Avoid duplication of the first unexpired record

MapR [COLD-150][K8S] Fix metrics copy

MapR [K8S-893] Hide plain text password from logs

MapR [SPARK-540] Include 'avro' artifacts

MapR [SPARK-536] PySpark streaming package for kafka-0-10 added

K8S-853: Enable spark metrics for external tenant

MapR [SPARK-531] Remove duplicating entries from classpath in ClasspathFilter

MapR [SPARK-516] Spark jobs failure using yarn mode on kerberos fixed

MapR [SPARK-462] Spark and SparkHistoryServer allow week ciphers, which can allow man in the middle attack

[SPARK-508] MapR-DB OJAI Connector for Spark isNull condition returns incorrect result

MapR [SPARK-510] nonmapr "admin" users not able to view other user logs in SHS

SPARK-460: Spark Metrics for CollectD Configuration for collecting Spark metrics

SPARK-463 MAPR_MAVEN_REPO variable for specifying mapR repository

MapR [SPARK-492] Spark 2.4.0.0 configure.sh has error messages

MapR [SPARK-515][K8S] Remove configure.sh call for k8s

MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg

MapR [SPARK-515] Move configuring spark-env.sh back to the private-pkg

MapR [SPARK-514] Recovery from checkpoint is broken

MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka09 package (apache#460)

* MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already TTL in the first batch (apache#376)"

This reverts commit e8d59b9.

* MapR [SPARK-445] Revert "[MAPR-32290] Spark processing offsets when messages are already ttl in first batch (apache#368)"

This reverts commit b282a8b.

MapR [SPARK-445] Messages loss fixed by reverting [MAPR-32290] changes from kafka10 package

MapR [SPARK-469] Fix NPE in generated classes by reverting "[SPARK-23466][SQL] Remove redundant null checks in generated Java code by GenerateUnsafeProjection" (apache#455)

This reverts commit c5583fd.

MapR [SPARK-482] Spark streaming app fails to start by UnknownTopicOrPartitionException with checkpoint

MapR [SPARK-496] Spark HS UI doesn't work

MapR [SPARK-416] CVE-2018-1320 vulnerability in Apache Thrift

MapR [SPARK-486][K8S] Fix sasl encryption error on Kubernetes

MapR [SPARK-481] Cannot run spark configure.sh on Client node

MapR [K8S-637][K8S] Add configure.sh configuration in spark-defaults.conf for job runtime

MapR [SPARK-465] Error messages after update of spark 2.4

MapR [SPARK-465] Error messages after update of spark 2.4

MapR [SPARK-464] Can't submit spark 2.4 jobs from mapr-client

[SPARK-466] SparkR errors fixed

MapR [SPARK-456] Spark shell can't be started

SPARK-417 impersonation fixes for spark executor. Impersonation is mo… (apache#433)

* SPARK-417 impersonation fixes for spark executor. Impersonation is moved from HadoopRDD.compute() method to org.apache.spark.executor.Executor.run() method

* SPARK-363 Hive version changed to '1.2.0-mapr-spark-MEP-6.0.0'

[SPARK-449] Kafka offset commit issue fixed

MapR [SPARK-287] Move logic of creating /apps/spark folder from installer's scripts to the configure.sh

MapR [SPARK-221] Investigate possibility to move creating of the spark-env.sh from private-pkg to configure.sh

MapR [SPARK-430] PID files should be under /opt/mapr/pid

MapR [SPARK-446] Spark configure.sh doesn't start/stop Spark services

MapR [SPARK-434] Move absent commits from 2.3.2 branch (apache#425)

* MapR [SPARK-352] Spark shell fails with "NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream" if java is not available in PATH

* MapR [SPARK-350] Deprecate Spark Kafka-09 package

* MapR [SPARK-326] Investigate possibility of writing Java example for the MapRDB OJAI connector

* [SPARK-356] Merge mapr changes from kafka-09 package into the kafka-10

* SPARK-319 Fix for sparkR version check

* MapR [SPARK-349] Update OJAI client to v3 for Spark MapR-DB JSON connector

* MapR [SPARK-367] Move absent commits from 2.3.1 branch

* MapR [SPARK-137] Analyze the warning during compilation of OJAI connector

* MapR [SPARK-369] Spark 2.3.2 fails with error related to zookeeper

* [MAPR-26258] hbasecontext.HBaseDistributedScanExample fails

* [SPARK-24355] Spark external shuffle server improvement to better handle block fetch requests

* MapR [SPARK-374] Spark Hive example fails when we submit job from another(simple) cluster user

* MapR [SPARK-434] Move absent commits from 2.3.2 branch

* MapR [SPARK-434] Move absent commits from 2.3.2 branch

* MapR [SPARK-373] Unexpected behavior during job running in standalone cluster mode

* MapR [SPARK-419] Update hive-maprdb-json-handler jar for spark 2.3.2.0 and spark 2.2.1

* MapR [SPARK-396] Interface change of sendToKafka

* MapR [SPARK-357] consumer groups are prepeneded with a "service_" prefix

* MapR [SPARK-429] Changes in maprdb connector are the cause of broken backward compatibility

* MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr

* MapR [SPARK-434] Move absent commits from 2.3.2 branch

* Move absent commits from 2.3.2 branch

* MapR [SPARK-434] Move absent commits from 2.3.2 branch

* Move absent commits from 2.3.2 branch

* Move absent commits from 2.3.2 branch

MapR [SPARK-427] Update kafka in Spark-2.4.0 to the 1.1.1-mapr

MapR [SPARK-379] Spark 2.4 4-gidit version

MapR [PIC-48][K8S] Port k8s changes to 2.4.0

[PIC-48] Create user for k8s driver and executor if required

[PIC-48] Create user for k8s driver and executor if required

Revert "Remove spark.ui.filters property"

This reverts commit d8941ba36c3451cdce15d18d6c1a52991de3b971.

[SPARK-351] Copy kubernetes start scripts anyway

PIC-34: Rename default configmap name to be consistent with mapr-kubernetes

[SPARK-23668][K8S] Add config option for passing through k8s Pod.spec.imagePullSecrets (apache#355)

Pass through the `imagePullSecrets` option to the k8s pod in order to allow user to access private image registries.

See https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

Unit tests + manual testing.

Manual testing procedure:
1. Have private image registry.
2. Spark-submit application with no `spark.kubernetes.imagePullSecret` set. Do `kubectl describe pod ...`. See the error message:
```
Error syncing pod, skipping: failed to "StartContainer" for "spark-kubernetes-driver" with ErrImagePull: "rpc error: code = 2 desc = Error: Status 400 trying to pull repository ...: \"{\\n  \\\"errors\\\" : [ {\\n    \\\"status\\\" : 400,\\n    \\\"message\\\" : \\\"Unsupported docker v1 repository request for '...'\\\"\\n  } ]\\n}\""
```
3. Create secret `kubectl create secret docker-registry ...`
4. Spark-submit with `spark.kubernetes.imagePullSecret` set to the new secret. See that deployment was successful.

Author: Andrew Korzhuev <andrew.korzhuev@klarna.com>
Author: Andrew Korzhuev <korzhuev@andrusha.me>

Closes apache#20811 from andrusha/spark-23668-image-pull-secrets.

[SPARK-321] Change default value of spark.mapr.ssl.secret.prefix property

[PIC-32] Spark on k8s with MapR secure cluster

Update entrypoint.sh with correct spark version (apache#340)

This PR has minor fix to correct the spark version string

[SPARK-274] Create home directory for user who submitted job

[MAPR-SPARK-230] Implement security for Spark on Kubernetes

Run Spark job with specify the username for driver and executor

Read cluster configs from configMap

Run configure.sh script form entrypoint.sh

Remove spark.kubernetes.driver.pod.commands property

Add Spark properties for executor and driver environment variable

MapR [SPARK-296] Structured Streaming memory leak

Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.tem…" (apache#252)

* Revert "[MAPR-SPARK-176] Fix Spark Project Catalyst unit tests (apache#251)"

This reverts commit 5de05075cd14abf8ac65046a57a5d76617818fbe.

* Revert "[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template (apache#249)"

This reverts commit 1baa677d727e89db7c605ffbae9a9eba00337ad0.

[MAPR-SPARK-210] Rename sprk-defaults.conf to spark-defaults.conf.template

MapR [SPARK-379] Port Spark to 2.4.0

MapR [SPARK-341] Spark 2.3.2 porting

[MAPR-32290] Spark processing offsets when messages are already TTL in the first batch

* Bug 32263 - Seek called on unsubscribed partitions

[MSPARK-331] Remove snapshot versions of mapr dependencies from Spark-2.3.1

[MAPR-32290] Spark processing offsets when messages are already ttl in first batch

MapR [SPARK-325] Add examples for work with the MapRDB JSON connector into the Spark project

[ATS-449] Unit test for EBF 32013 created.

MAPR-SPARK-311: Spark beeline uses default ssl truststore instead of mapr ssl truststore

Bug 32355 - Executor tab empty on Spark UI

[SPARK-318] Submitting Spark jobs from Oozie fails due to ClassNotFoundException

Bug 32014 - Spark Consumer fails with java.lang.AssertionError

Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1" (apache#341)

* Revert "[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1 (apache#335)"

This reverts commit 832411e.

Bug 32014 - Spark Consumer fails with java.lang.AssertionError (apache#326) (apache#336)

* MapR [32014] Spark Consumer fails with java.lang.AssertionError

[SPARK-306] Kafka clients 1.0.1 present in jars directory for Spark 2.3.1

DEVOPS-2768 temporarily removed curl for file downloading

[SPARK-302] Local privilege escalation

MapR [SPARK-297] Added unit test for empty value conversion

MapR [SPARK-297] Empty values are loaded as non-null

MapR [SPARK-296] Structured Streaming memory leak

2.3.1 spark 289 (apache#318)

* MapR [SPARK-289] Fix unit test for Spark-2.3.1

[SPARK-130] MapRDB connector - NPE while saving Pair RDD with 'null' values

MapR [SPARK-283] Unit tests fail during initialization SSL properties.

[SPARK-212] SparkHiveExample fails when we run it twice

MapR [SPARK-282] Remove maprfs and hadoop jars from mapr spark package

MapR [SPARK-278] Spark submit fails for jobs with python

MapR [SPARK-279] Can't connect to spark thrift server with new spark and hive packages

MapR [SPARK-276] Update zookeeper dependency to v.3.4.11 for spark 2.3.1

MapR [SPARK-272] Use only client passwords from ssl-client.xml

MapR [SPARK-266] Spark jobs can't finish correctly, when there is an error during job running

MapR [SPARK-263] Add possibility to use keyPassword which is different from keyStorePassword

[MSPARK-31632] RM UI showing broken page for Spark jobs

MapR [SPARK-261] Use mapr-security-web for getting passwords.

MapR [SPARK-259] Spark application doesn't finish correctly

MapR [SPARK-268] Update Spark version for Warden

change project version to 2.3.1-mapr-SNAPSHOT

MapR [SPARK-256] Spark doesn't work on yarn mode

MapR [SPARK-255] Installer fresh install 610/600 secure fails to start "mapr-spark-thriftserver", "mapr-spark-historyserver"

Mapr [SPARK-248] MapRDBTableScanRDD fails to convert to Scala Dataframe when using where clause

MapR [SPARK-225] Hadoop credentials provider usage for hiding passwords at spark-defaults

MapR [SPARK-214] Hive-2.1 poperties can't be read from a hive-site.xml as Spark uses Hive-1.2

MapR [SPARK-216] Spark thriftserver fails when work with hive-maprdb json table

SPARK-244 (apache#278)

Provide ability to use MapR-Negotiation authentication for Spark HistoryServer

MapR [SPARK-226] Spark - pySpark Security Vulnerability

MapR [SPARK-220] SparkR fails with UDF functions bug fixed

MapR [SPARK-227] KafkaUtils.createDirectStream fails with kafka-09

MapR [SPARK-183] Spark Integration for Kafka 0.10 unit tests disabled

MapR [SPARK-182] Spark Project External Kafka Producer v09 unit tests fixed

MapR [SPARK-179] Spark Integration for Kafka 0.9 unit tests fixed

MapR [SPARK-181] Kafka 0.10 Structured Streaming unit tests fixed

[MSPARK-31305] Spark History server NOT loading applications submitted by users other than 'mapr'

MapR [SPARK-175] Fix Spark Project Streaming unit tests

[MAPR-SPARK-176] Fix Spark Project Catalyst unit tests

[MAPR-SPARK-178] Fix Spark Project Hive unit tests

MapR [SPARK-174] Spark Core unit tests fixed

Changed version for spark-kafka connector.

MapR [SPARK-202] Update MapR Spark to 2.3.0

Fixed compile time errors in tests

Change project version

[SPARK-198] Update hadoop dependency version to 2.7.0-mapr-1803 for Spark 2.2.1

MapR [SPARK-188] Couldn't connect to thrift server via spark beeline on kerberos cluster

MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters

MapR [SPARK-186] Update OJAI versions to the latest for Spark-2.2.1 OJAI Connector

MapR [SPARK-191] Incorrect work of MapR-DB Sink 'complete' and 'update' modes fixed

MapR [SPARK-170] StackOverflowException in equals method in DBMapValue

2.2.1 build fixed (apache#231)

* MapR [SPARK-164] Update Kafka version to 1.0.1-mapr in Spark Kafka Producer module

MapR [SPARK-161] Include Kafka Structured streaming jar to Spark package.

MapR [SPARK-155] Change Spark Master port from 8080

MapR [SPARK-153] Exception in spark job with configured labels on yarn-client mode

MapR [SPARK-152] Incorrect date string parsing fixed

MapR [SPARK-21] Structured Streaming MapR-DB Sink created

MapR [SPARK-135]  Spark 2.2 with MapR Streams ( Kafka 1.0) (apache#218)

* MapR [SPARK-135] Spark 2.2 with MapR Streams (Kafka 1.0)
Added functionality of MapR-Streams specific EOF handling.

MapR [SPARK-143] Spark History Server does not require login for secured-by-default clusters

Disable build failing if scalastyle checking is fall.

MapR [SPARK-16] Change Spark version in Warden files and configure.sh

MapR [SPARK-144] Add insertToMapRDB method for rdd for Java API

[MAPR-30536]  Spark SQL queries on Map column fails after upgrade

MapR [SPARK-139] Remove "update" related APIs from connector

MapR [SPARK-140] Change the option name "tableName" to "tablePath" in the Spark/MapR-DB connectors.

MapR [SPARK-121] Spark OJAI JAVA: update functionality removed

MapR [SPARK-118] Spark OJAI Python: missed DataFrame import while moving imports in order to fix MapR [ZEP-101] interpreter issue

MapR [SPARK-118] Spark OJAI Python: move MapR DB Connector class importing in order to fix MapR [ZEP-101] interpreter issue

MapR [SPARK-117] Spark OJAI Python: Save functionality implementation

MapR [SPARK-131] Exception when try to save JSON table with Binary _id field

Spark OJAI JAVA: load to RDD, save from RDD implementation (apache#195)

* MapR [SPARK-124] Loading to JavaRDD implemented
* MapR [SPARK-124] MapRDBJavaSparkContext constructor changed
* MapR [SPARK-124] implemented RDD[Row] saving

MapR [SPARK-118] Spark OJAI Python: Read implementation

MapR [SPARK-128] MapRDB connector - wrong handle of null fields when nullable is false

* MapR [SPARK-121] Spark OJAI JAVA: Read to Dataset functionality implementation
* Minor refactoring

MapR [SPARK-125] Default value of idFieldPath parameter is not handle

MapR [SPARK-113] Hit java.lang.UnsupportedOperationException: empty.reduceLeft during loadFromMapRDB

Spark Mapr-DB connector was refactored according to Scala style
Removed code duplication

[MSPARK-107]idField information is lost in MapRDBDataFrameWriterFunctions.saveToMapRDB

configure.sh takes options to change ports

Kafka client excluded from package because correct version is located in "mapr classpath"

Changed Kafka version in Kafka producer module.

Branch spark 69 (apache#170)

* Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame.

* SPARK-69: Problem with license when we try to read from json and write to maprdb

remove creatin /usr/local/spark link from configure.sh. This link will be creates by private-pkg

remove include-maprdb from default profiles

added profiles in maprdb pom file instead of two pom files

Fixed maprdb connector dependencies.

Fixing the wrong type casting of TimeStamp to OTimeStamp when read from spark dataFrame.

changed port for spark-thriftserver as it conflicts with hive server

changed port for spark-thriftserver as it conflicts with hive server

remove .not_configured_yet file after success

Ojai connector fixed required java version

[MSPARK-45] Move Spark-OJAI connector code to Spark github repo (apache#132)

* SPARK-45 Move Spark-OJAI connector code to Spark github repo

* Fixing pom versions for maprdb spark connector.

* Changes made to the connector code to be compatible with 5.2.* and 6.0 clients.

Spark 2.1.0 mapr 29106 (apache#150)

* [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher.

Blindly deserializing classes using Java serialization opens the code up to
issues in other libraries, since just deserializing data from a stream may
end up execution code (think readObject()).

Since the launcher protocol is pretty self-contained, there's just a handful
of classes it legitimately needs to deserialize, and they're in just two
packages, so add a filter that throws errors if classes from any other
package show up in the stream.

This also maintains backwards compatibility (the updated launcher code can
still communicate with the backend code in older Spark releases).

Tested with new and existing unit tests.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#18166 from vanzin/SPARK-20922.

(cherry picked from commit 8efc6e9)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

(cherry picked from commit 772a9b9)

* [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#18178 from vanzin/SPARK-20922-hotfix.

Added security by default for historyserver

use waitForConsumerAssignment() instead of consumer.poll(0) for spark-29052

change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh

change MAPR_HADOOP_CLASSPATH in configure.sh for creating it by mapr-classpath.sh

changes for mapr-classpath.sh

changes for mapr-classpath.sh

configure.sh changes

[SPARK-39] Classpath filter was added

Fixed impersonation when data read from MapR-DB via Spark-Hive.

added configure.sh and warden.spark-thriftserver.conf

hive-hbase-handler added to Spark jars

Fixed "Single message comes late"

28339 bug fixed

Spark streaming skipped message with zero offset from Kafka 0.9

[MSPARK-9] Initial fix for Spark unit tests

Bump dependencies after ECO-1703 release

[SPARK-33] Streaming example fixed

[MAPR-26060] Fixed case when mapr-streams make gaps in offsets

ported features from kafka 10 to kafka 9

[MAPR-26289][SPARK-2.1] Streaming general improvements (apache#93)

* Added include-kafka-09 profile to Assembly
* Set default poll timeout to 120s

Set default HBase verison to 1.1.8

Changes from Kafka10  package were ported to Kafka09 package.

[MAPR-26053] Include MapR Classes to the default value of spark.sql.hive.metastore.sharedPrefixes

[MAPR-25807] Spark-Warehouse path computes incorrectly

Add MapR-SASL support for Thrift Server

Adding scala library.

[MAPR-25713] Spark might try to load MapR Class Loader multiple times and fail

[MAPR-25311] Bump Spark dependencies after ECO-1611 release

[MINOR] Fix spark-jars.sh script

[MAPR-24603] Could not launch beeline shell after starting spark thrift server

fixed syntax error in V09DirectKafkaWordCount example

Spark 2.0.1 MAPR-streams Python API

[MAPR-24415] SPARK_JAVA_OPTS is deprecated

Kafka streaming producer added.

Minor fix for previous commit

Added script for MAPR-24374

Some minor changes to spark-defaults.conf

Changed default HBase version to 1.1.1 in compatibility.version

Streaming example was refactored

[MAPR-24470] HiveFromSpark test fails in yarn-cluster mode

Added MapR Repo

[MAPR-22940] Failed to connect spark beeline (after spark thrift server is started) on Kerberos cluster

[MAPR-18865] Unable to submit spark apps from Windows client

Skip maven clean task on the parent module

New: Issue with running Hive commands in Spark

This is fixed in SPARK-7819
Isolated Hive Client Loader appears to cause Native Library
libMapRClient.4.0.2-mapr.so already loaded in another classloader error

Spark warden.services.conf should have dependency on cldb

Remove DFS shuffle settings.

These settings are not used right now.

Copy every file in the conf directory into the distribution package.

Create spark-defaults.conf for MapR

Settings to enable DFS shuffle on MapR.

Support hbase classpath computation in util script.

Adding external conf and scripts.

Enable SPARK_HIVE mode while building.

This is needed to bundle datanucleus jars needed for hive table creation.

Build Spark on MapR.
- make-distribution.sh takes an environment variable to enable profiles -
  MVN_PROFILE_ARG
- Added warden conf files under ext-conf.
- Updated pom.xml to use right set of jars and version.

Spark Master failed to start in HA mode

Updated Apache Curator version

Added spark streaming integration with kafka 0.9 and mapr-streams

Added MapR Repo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants