[MINOR][DOCS] Miscellaneous documentation improvements
### What changes were proposed in this pull request?

- Improve the formatting of various code snippets.
- Fix some broken links in the documentation.
- Clarify the non-intuitive behavior of `displayValue` in `getAllDefinedConfs()`.

### Why are the changes needed?

These are minor quality-of-life improvements for users and developers alike.

### Does this PR introduce _any_ user-facing change?

Yes, it tweaks some of the links in user-facing documentation.

### How was this patch tested?

Not tested beyond CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#44919 from nchammas/misc-doc-fixes.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
nchammas authored and HyukjinKwon committed Jan 29, 2024
1 parent 901850c commit f078998
Showing 5 changed files with 26 additions and 12 deletions.
16 changes: 10 additions & 6 deletions docs/configuration.md
@@ -88,10 +88,14 @@ val sc = new SparkContext(new SparkConf())
{% endhighlight %}

Then, you can supply configuration values at runtime:
{% highlight bash %}
./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
{% endhighlight %}
```sh
./bin/spark-submit \
--name "My app" \
--master local[4] \
--conf spark.eventLog.enabled=false \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
myApp.jar
```
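
On the application side (a hypothetical sketch, not part of this patch): because the `SparkConf` above is created empty, the values supplied with `--name`, `--master`, and `--conf` at launch are picked up automatically and can be inspected at runtime.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The empty SparkConf inherits whatever spark-submit passed on the command line.
val sc = new SparkContext(new SparkConf())
println(sc.getConf.get("spark.eventLog.enabled"))           // "false" from --conf
println(sc.getConf.get("spark.executor.extraJavaOptions"))  // the GC flags string
```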

The Spark shell and [`spark-submit`](submitting-applications.html)
tool support two ways to load configurations dynamically. The first is command line options,
@@ -3708,9 +3712,9 @@ Also, you can modify or add configurations at runtime:
GPUs and other accelerators have been widely used for accelerating special workloads, e.g.,
deep learning and signal processing. Spark now supports requesting and scheduling generic resources, such as GPUs, with a few caveats. The current implementation requires that the resource have addresses that can be allocated by the scheduler. It requires your cluster manager to support and be properly configured with the resources.

There are configurations available to request resources for the driver: <code>spark.driver.resource.{resourceName}.amount</code>, request resources for the executor(s): <code>spark.executor.resource.{resourceName}.amount</code> and specify the requirements for each task: <code>spark.task.resource.{resourceName}.amount</code>. The <code>spark.driver.resource.{resourceName}.discoveryScript</code> config is required on YARN, Kubernetes and a client side Driver on Spark Standalone. <code>spark.executor.resource.{resourceName}.discoveryScript</code> config is required for YARN and Kubernetes. Kubernetes also requires <code>spark.driver.resource.{resourceName}.vendor</code> and/or <code>spark.executor.resource.{resourceName}.vendor</code>. See the config descriptions above for more information on each.
There are configurations available to request resources for the driver: `spark.driver.resource.{resourceName}.amount`, request resources for the executor(s): `spark.executor.resource.{resourceName}.amount` and specify the requirements for each task: `spark.task.resource.{resourceName}.amount`. The `spark.driver.resource.{resourceName}.discoveryScript` config is required on YARN, Kubernetes and a client side Driver on Spark Standalone. `spark.executor.resource.{resourceName}.discoveryScript` config is required for YARN and Kubernetes. Kubernetes also requires `spark.driver.resource.{resourceName}.vendor` and/or `spark.executor.resource.{resourceName}.vendor`. See the config descriptions above for more information on each.

Spark will use the configurations specified to first request containers with the corresponding resources from the cluster manager. Once it gets the container, Spark launches an Executor in that container which will discover what resources the container has and the addresses associated with each resource. The Executor will register with the Driver and report back the resources available to that Executor. The Spark scheduler can then schedule tasks to each Executor and assign specific resource addresses based on the resource requirements the user specified. The user can see the resources assigned to a task using the <code>TaskContext.get().resources</code> api. On the driver, the user can see the resources assigned with the SparkContext <code>resources</code> call. It's then up to the user to use the assignedaddresses to do the processing they want or pass those into the ML/AI framework they are using.
Spark will use the configurations specified to first request containers with the corresponding resources from the cluster manager. Once it gets the container, Spark launches an Executor in that container which will discover what resources the container has and the addresses associated with each resource. The Executor will register with the Driver and report back the resources available to that Executor. The Spark scheduler can then schedule tasks to each Executor and assign specific resource addresses based on the resource requirements the user specified. The user can see the resources assigned to a task using the `TaskContext.get().resources` api. On the driver, the user can see the resources assigned with the SparkContext `resources` call. It's then up to the user to use the assigned addresses to do the processing they want or pass those into the ML/AI framework they are using.

See your cluster manager specific page for requirements and details on each of - [YARN](running-on-yarn.html#resource-allocation-and-configuration-overview), [Kubernetes](running-on-kubernetes.html#resource-allocation-and-configuration-overview) and [Standalone Mode](spark-standalone.html#resource-allocation-and-configuration-overview). It is currently not available with local mode. And please also note that local-cluster mode with multiple workers is not supported(see Standalone documentation).
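
As a rough, hypothetical sketch of the flow described above (not part of this patch, and assuming the `gpu` resource amounts and, where required, the discovery script and vendor configs have been set up for your cluster manager), the assigned addresses can be inspected like this:

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

// Assumes spark.{driver,executor,task}.resource.gpu.amount (plus discoveryScript/vendor
// where the cluster manager requires them) were configured at submit time.
val sc = new SparkContext(new SparkConf().setAppName("resource-demo"))

// Driver-side view of the resources assigned to the driver.
sc.resources.foreach { case (name, info) =>
  println(s"driver $name -> ${info.addresses.mkString(",")}")
}

// Task-side view: each task sees only the addresses the scheduler assigned to it.
sc.parallelize(1 to 4, 2).foreach { _ =>
  TaskContext.get().resources.get("gpu").foreach { info =>
    println(s"task gpu -> ${info.addresses.mkString(",")}")
  }
}
```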

4 changes: 3 additions & 1 deletion docs/mllib-dimensionality-reduction.md
@@ -66,10 +66,12 @@ first and then compute its top eigenvalues and eigenvectors locally on the drive
This requires a single pass with $O(n^2)$ storage on each executor and on the driver, and
$O(n^2 k)$ time on the driver.
* Otherwise, we compute $(A^T A) v$ in a distributive way and send it to
<a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK</a> to
[ARPACK][arpack] to
compute $(A^T A)$'s top eigenvalues and eigenvectors on the driver node. This requires $O(k)$
passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.

[arpack]: https://web.archive.org/web/20210503024933/http://www.caam.rice.edu/software/ARPACK
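
The strategy above is applied internally when you call `computeSVD` on a `RowMatrix`; a minimal sketch of the call (illustrative data, not from this patch, with `sc` assumed to be an existing `SparkContext`):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A tiny illustrative matrix; real input would be an RDD of feature vectors.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 7.0),
  Vectors.dense(2.0, 3.0, 5.0),
  Vectors.dense(4.0, 6.0, 8.0)))
val mat = new RowMatrix(rows)

// Top 2 singular values/vectors; U is a distributed RowMatrix, s and V are local.
val svd = mat.computeSVD(2, computeU = true)
println(svd.s)
```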

### SVD Example

`spark.mllib` provides SVD functionality to row-oriented matrices, provided in the
6 changes: 4 additions & 2 deletions docs/rdd-programming-guide.md
@@ -877,11 +877,13 @@ The most common ones are distributed "shuffle" operations, such as grouping or a
by a key.

In Scala, these operations are automatically available on RDDs containing
[Tuple2](http://www.scala-lang.org/api/{{site.SCALA_VERSION}}/index.html#scala.Tuple2) objects
[Tuple2][tuple2] objects
(the built-in tuples in the language, created by simply writing `(a, b)`). The key-value pair operations are available in the
[PairRDDFunctions](api/scala/org/apache/spark/rdd/PairRDDFunctions.html) class,
which automatically wraps around an RDD of tuples.

[tuple2]: https://www.scala-lang.org/api/{{site.SCALA_VERSION}}/scala/Tuple2.html

For example, the following code uses the `reduceByKey` operation on key-value pairs to count how
many times each line of text occurs in a file:
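The snippet itself sits outside this hunk; a minimal sketch of such a per-line count (assuming `sc` is an existing `SparkContext` and `data.txt` exists) would look like:

```scala
val lines = sc.textFile("data.txt")
val pairs = lines.map(line => (line, 1))
// reduceByKey is available because the elements are Tuple2 key-value pairs.
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect().foreach { case (line, n) => println(s"$n\t$line") }
```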

Expand Down Expand Up @@ -909,7 +911,7 @@ The most common ones are distributed "shuffle" operations, such as grouping or a
by a key.

In Java, key-value pairs are represented using the
[scala.Tuple2](http://www.scala-lang.org/api/{{site.SCALA_VERSION}}/index.html#scala.Tuple2) class
[scala.Tuple2][tuple2] class
from the Scala standard library. You can simply call `new Tuple2(a, b)` to create a tuple, and access
its fields later with `tuple._1()` and `tuple._2()`.

5 changes: 3 additions & 2 deletions docs/sql-data-sources-avro.md
@@ -438,10 +438,11 @@ built-in but external module, both implicit classes are removed. Please use `.fo

If you prefer using your own build of `spark-avro` jar file, you can simply disable the configuration
`spark.sql.legacy.replaceDatabricksSparkAvro.enabled`, and use the option `--jars` on deploying your
applications. Read the [Advanced Dependency Management](https://spark.apache
.org/docs/latest/submitting-applications.html#advanced-dependency-management) section in Application
applications. Read the [Advanced Dependency Management][adm] section in the Application
Submission Guide for more details.

[adm]: submitting-applications.html#advanced-dependency-management
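
With the built-in module, Avro data goes through the generic `avro` format in the DataFrame reader and writer. A minimal sketch (not part of this patch; paths are placeholders and `spark` is assumed to be an existing `SparkSession`):

```scala
// Read an Avro file and write a projection back out as Avro.
val df = spark.read.format("avro").load("/path/to/users.avro")
df.select("name").write.format("avro").save("/path/to/names_only.avro")
```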

## Supported types for Avro -> Spark SQL conversion
Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.11.3/specification/#primitive-types) and [complex types](https://avro.apache.org/docs/1.11.3/specification/#complex-types) under records of Avro.
<table>
@@ -5684,7 +5688,12 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
def getAllDefinedConfs: Seq[(String, String, String, String)] = {
loadDefinedConfs()
getConfigEntries().asScala.filter(_.isPublic).map { entry =>
val displayValue = Option(getConfString(entry.key, null)).getOrElse(entry.defaultValueString)
val displayValue =
// We get the display value in this way rather than call getConfString(entry.key)
// because we want the default _definition_ and not the computed value.
// e.g. `<undefined>` instead of `null`
// e.g. `<value of spark.buffer.size>` instead of `65536`
Option(getConfString(entry.key, null)).getOrElse(entry.defaultValueString)
(entry.key, displayValue, entry.doc, entry.version)
}.toSeq
}
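
For context (not part of the patch): these display values are what commands like `SET -v` report for each config. A rough sketch, assuming an active `SparkSession` named `spark`:

```scala
// Unset configs render as "<undefined>", and defaults defined in terms of another
// config render as e.g. "<value of spark.buffer.size>" rather than a computed number.
spark.sql("SET -v").show(5, truncate = false)
```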
