[SPARK-44997][DOCS] Align example order (Python -> Scala/Java -> R) in all Spark Doc Content #42712

Closed

wants to merge 4 commits

Changes from 1 commit
6 changes: 3 additions & 3 deletions docs/README.md
@@ -28,8 +28,8 @@ whichever version of Spark you currently have checked out of revision control.

## Prerequisites

The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, Java,
Python, R and SQL.
The Spark documentation build uses a number of tools to build HTML docs and API docs in Python, Scala,
Java, R and SQL.

You need to have [Ruby](https://www.ruby-lang.org/en/documentation/installation/) and
[Python](https://docs.python.org/2/using/unix.html#getting-and-installing-the-latest-version-of-python)
@@ -129,6 +129,6 @@ The jekyll plugin also generates the PySpark docs using [Sphinx](http://sphinx-d
using [roxygen2](https://cran.r-project.org/web/packages/roxygen2/index.html) and SQL docs
using [MkDocs](https://www.mkdocs.org/).

NOTE: To skip the step of building and copying over the Scala, Java, Python, R and SQL API docs, run `SKIP_API=1
NOTE: To skip the step of building and copying over the Python, Scala, Java, R and SQL API docs, run `SKIP_API=1
bundle exec jekyll build`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, `SKIP_RDOC=1` and `SKIP_SQLDOC=1` can be used
to skip a single step of the corresponding language. `SKIP_SCALADOC` indicates skipping both the Scala and Java docs.
4 changes: 2 additions & 2 deletions docs/_layouts/global.html
@@ -71,9 +71,9 @@
<li class="nav-item dropdown">
<a href="#" class="nav-link dropdown-toggle" id="navbarAPIDocs" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">API Docs</a>
<div class="dropdown-menu" aria-labelledby="navbarAPIDocs">
<a class="dropdown-item" href="api/python/index.html">Python</a>
Review comment (Contributor Author): Before / After screenshots of the change (images not shown).

<a class="dropdown-item" href="api/scala/org/apache/spark/index.html">Scala</a>
<a class="dropdown-item" href="api/java/index.html">Java</a>
<a class="dropdown-item" href="api/python/index.html">Python</a>
<a class="dropdown-item" href="api/R/index.html">R</a>
<a class="dropdown-item" href="api/sql/index.html">SQL, Built-in Functions</a>
</div>
@@ -128,7 +128,7 @@ <h1 style="max-width: 680px;">Apache Spark - A Unified engine for large-scale da
<div class="row mt-5">
<div class="col-12 col-lg-6 no-gutters">
Apache Spark is a unified analytics engine for large-scale data processing.
It provides high-level APIs in Java, Scala, Python and R,
It provides high-level APIs in Python, Scala, Java and R,
Review comment (Member): Also, I'm -1 with this.

and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including
<a href="sql-programming-guide.html">Spark SQL</a> for SQL and structured data processing,
9 changes: 5 additions & 4 deletions docs/index.md
@@ -34,7 +34,7 @@ source, visit [Building Spark](building-spark.html).

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS), and it should run on any platform that runs a supported version of Java. This should include JVMs on x86_64 and ARM64. It's easy to run locally on one machine --- all you need is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation.

Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.8+, and R 3.5+.
Spark runs on Python 3.8+, Scala 2.12/2.13, Java 8/11/17 and R 3.5+.
Review comment (Member): -1 for this change.

Java 8 prior to version 8u371 support is deprecated as of Spark 3.5.0.
When using the Scala API, it is necessary for applications to use the same version of Scala that Spark was compiled for.
For example, when using Scala 2.13, use Spark compiled for 2.13, and compile code/applications for Scala 2.13 as well.
@@ -120,9 +120,9 @@ options for deployment:

**API Docs:**

* [Spark Python API (Sphinx)](api/python/index.html)
* [Spark Scala API (Scaladoc)](api/scala/org/apache/spark/index.html)
* [Spark Java API (Javadoc)](api/java/index.html)
* [Spark Python API (Sphinx)](api/python/index.html)
* [Spark R API (Roxygen2)](api/R/index.html)
* [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)

@@ -163,7 +163,8 @@ options for deployment:
* AMP Camps: a series of training camps at UC Berkeley that featured talks and
exercises about Spark, Spark Streaming, Mesos, and more. [Videos](https://www.youtube.com/user/BerkeleyAMPLab/search?query=amp%20camp),
are available online for free.
* [Code Examples](https://spark.apache.org/examples.html): more are also available in the `examples` subfolder of Spark ([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
* [Code Examples](https://spark.apache.org/examples.html): more are also available in the `examples` subfolder of Spark (
[Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
[Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
[R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r))
8 changes: 4 additions & 4 deletions docs/ml-pipeline.md
@@ -208,7 +208,7 @@ This is useful if there are two algorithms with the `maxIter` parameter in a `Pi
Often times it is worth it to save a model or a pipeline to disk for later use. In Spark 1.6, a model import/export functionality was added to the Pipeline API.
As of Spark 2.3, the DataFrame-based API in `spark.ml` and `pyspark.ml` has complete coverage.

ML persistence works across Scala, Java and Python. However, R currently uses a modified format,
ML persistence works across Python, Scala and Java. However, R currently uses a modified format,
so models saved in R can only be loaded back in R; this should be fixed in the future and is
tracked in [SPARK-15572](https://issues.apache.org/jira/browse/SPARK-15572).
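
For illustration, a minimal PySpark sketch of the save/load round trip described above; the `training` DataFrame, its columns, and the output path are placeholder assumptions, not part of the patch:

{% highlight python %}
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Assumes `training` is an existing DataFrame with "text" and "label" columns.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, hashingTF, lr]).fit(training)

# Save the fitted PipelineModel and load it back later.
model.write().overwrite().save("/tmp/spark-logistic-regression-model")
sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
{% endhighlight %}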

@@ -238,9 +238,9 @@ notes, then it should be treated as a bug to be fixed.

This section gives code examples illustrating the functionality discussed above.
For more info, please refer to the API documentation
([Scala](api/scala/org/apache/spark/ml/package.html),
[Java](api/java/org/apache/spark/ml/package-summary.html),
and [Python](api/python/reference/pyspark.ml.html)).
([Python](api/python/reference/pyspark.ml.html),
[Scala](api/scala/org/apache/spark/ml/package.html),
and [Java](api/java/org/apache/spark/ml/package-summary.html)).

## Example: Estimator, Transformer, and Param

10 changes: 5 additions & 5 deletions docs/quick-start.md
@@ -470,19 +470,19 @@ Congratulations on running your first Spark application!
* For an in-depth overview of the API, start with the [RDD programming guide](rdd-programming-guide.html) and the [SQL programming guide](sql-programming-guide.html), or see "Programming Guides" menu for other components.
* For running applications on a cluster, head to the [deployment overview](cluster-overview.html).
* Finally, Spark includes several samples in the `examples` directory
([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
([Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
[Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
[Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
[R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r)).
You can run them as follows:

{% highlight bash %}
# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py

# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R
{% endhighlight %}
20 changes: 10 additions & 10 deletions docs/rdd-programming-guide.md
@@ -945,9 +945,9 @@ documentation](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#h

The following table lists some of the common transformations supported by Spark. Refer to the
RDD API doc
([Scala](api/scala/org/apache/spark/rdd/RDD.html),
([Python](api/python/reference/api/pyspark.RDD.html#pyspark.RDD),
[Scala](api/scala/org/apache/spark/rdd/RDD.html),
[Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
[Python](api/python/reference/api/pyspark.RDD.html#pyspark.RDD),
[R](api/R/reference/index.html))
and pair RDD functions doc
([Scala](api/scala/org/apache/spark/rdd/PairRDDFunctions.html),
@@ -1059,9 +1059,9 @@ for details.

The following table lists some of the common actions supported by Spark. Refer to the
RDD API doc
([Scala](api/scala/org/apache/spark/rdd/RDD.html),
([Python](api/python/reference/api/pyspark.RDD.html#pyspark.RDD),
[Scala](api/scala/org/apache/spark/rdd/RDD.html),
[Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
[Python](api/python/reference/api/pyspark.RDD.html#pyspark.RDD),
[R](api/R/reference/index.html))

and pair RDD functions doc
@@ -1207,9 +1207,9 @@ In addition, each persisted RDD can be stored using a different *storage level*,
to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space),
replicate it across nodes.
These levels are set by passing a
`StorageLevel` object ([Scala](api/scala/org/apache/spark/storage/StorageLevel.html),
[Java](api/java/index.html?org/apache/spark/storage/StorageLevel.html),
[Python](api/python/reference/api/pyspark.StorageLevel.html#pyspark.StorageLevel))
`StorageLevel` object ([Python](api/python/reference/api/pyspark.StorageLevel.html#pyspark.StorageLevel),
[Scala](api/scala/org/apache/spark/storage/StorageLevel.html),
[Java](api/java/index.html?org/apache/spark/storage/StorageLevel.html))
to `persist()`. The `cache()` method is a shorthand for using the default storage level,
which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The full set of
storage levels is:
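
As a quick illustration, a hedged PySpark sketch of passing a `StorageLevel` explicitly to `persist()`; the input path is a placeholder:

{% highlight python %}
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "persist-example")

lines = sc.textFile("data.txt")                      # placeholder input file
words = lines.flatMap(lambda line: line.split())

# Pass an explicit storage level instead of relying on cache()'s default.
words.persist(StorageLevel.MEMORY_AND_DISK)

print(words.count())             # first action computes and persists the RDD
print(words.distinct().count())  # later actions reuse the persisted data
{% endhighlight %}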
@@ -1596,9 +1596,9 @@ as Spark does not support two contexts running concurrently in the same program.

You can see some [example Spark programs](https://spark.apache.org/examples.html) on the Spark website.
In addition, Spark includes several samples in the `examples` directory
([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
([Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
[Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
[Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
[R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r)).
You can run Java and Scala examples by passing the class name to Spark's `bin/run-example` script; for instance:

@@ -1619,4 +1619,4 @@ For help on deploying, the [cluster mode overview](cluster-overview.html) descri
in distributed operation and supported cluster managers.

Finally, full API documentation is available in
[Scala](api/scala/org/apache/spark/), [Java](api/java/), [Python](api/python/) and [R](api/R/).
[Python](api/python/), [Scala](api/scala/org/apache/spark/), [Java](api/java/) and [R](api/R/).
2 changes: 1 addition & 1 deletion docs/sql-getting-started.md
@@ -108,7 +108,7 @@ As an example, the following creates a DataFrame based on the content of a JSON

## Untyped Dataset Operations (aka DataFrame Operations)

DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/org/apache/spark/sql/Dataset.html), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html) and [R](api/R/reference/SparkDataFrame.html).
DataFrames provide a domain-specific language for structured data manipulation in [Python](api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html), [Scala](api/scala/org/apache/spark/sql/Dataset.html), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html) and [R](api/R/reference/SparkDataFrame.html).

As mentioned above, in Spark 2.0, DataFrames are just Dataset of `Row`s in Scala and Java API. These operations are also referred as "untyped transformations" in contrast to "typed transformations" come with strongly typed Scala/Java Datasets.
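
As a brief, non-authoritative sketch of the kind of untyped operations meant here, assuming an existing DataFrame `df` with `name` and `age` columns:

{% highlight python %}
# Assumes `df` is an existing DataFrame, e.g. loaded via spark.read.json(...).
df.printSchema()
df.select("name").show()
df.select(df["name"], df["age"] + 1).show()
df.filter(df["age"] > 21).show()
df.groupBy("age").count().show()
{% endhighlight %}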

5 changes: 3 additions & 2 deletions docs/sql-programming-guide.md
@@ -55,8 +55,9 @@ A DataFrame is a *Dataset* organized into named columns. It is conceptually
equivalent to a table in a relational database or a data frame in R/Python, but with richer
optimizations under the hood. DataFrames can be constructed from a wide array of [sources](sql-data-sources.html) such
as: structured data files, tables in Hive, external databases, or existing RDDs.
The DataFrame API is available in Scala,
Java, [Python](api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
The DataFrame API is available in
[Python](api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame), Scala,
Java and [R](api/R/index.html).
In Scala and Java, a DataFrame is represented by a Dataset of `Row`s.
In [the Scala API][scala-datasets], `DataFrame` is simply a type alias of `Dataset[Row]`.
While, in [Java API][java-datasets], users need to use `Dataset<Row>` to represent a `DataFrame`.
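
To make the construction paths concrete, a hedged PySpark sketch of building DataFrames from a structured data file and from an existing RDD; the file path and sample rows are placeholders:

{% highlight python %}
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("dataframe-sources").getOrCreate()

# From a structured data file (placeholder path).
people = spark.read.json("examples/src/main/resources/people.json")

# From an existing RDD of Rows.
rdd = spark.sparkContext.parallelize(
    [Row(name="Alice", age=34), Row(name="Bob", age=45)])
fromRdd = spark.createDataFrame(rdd)
fromRdd.show()
{% endhighlight %}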
24 changes: 12 additions & 12 deletions docs/streaming-programming-guide.md
@@ -68,7 +68,7 @@ operations on other DStreams. Internally, a DStream is represented as a sequence
[RDDs](api/scala/org/apache/spark/rdd/RDD.html).

This guide shows you how to start writing Spark Streaming programs with DStreams. You can
write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2),
write Spark Streaming programs in Python (introduced in Spark 1.2), Scala or Java,
all of which are presented in this guide.
You will find tabs throughout this guide that let you choose between code snippets of
different languages.
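
For orientation, a minimal PySpark DStream word count, given here as a hedged sketch; the host and port are placeholders:

{% highlight python %}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)                       # 1-second batch interval

lines = ssc.socketTextStream("localhost", 9999)     # placeholder host/port
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
{% endhighlight %}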
@@ -762,9 +762,9 @@ DStreams can be created with data streams received through custom receivers. See
For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using `streamingContext.queueStream(queueOfRDDs)`. Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
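
A short, hedged sketch of that testing pattern, assuming an already constructed StreamingContext `ssc`:

{% highlight python %}
# Push a few pre-built RDDs into a queue; each one is processed as a batch.
rdd_queue = [ssc.sparkContext.parallelize(range(i * 100, (i + 1) * 100))
             for i in range(5)]
input_stream = ssc.queueStream(rdd_queue)
input_stream.map(lambda x: (x % 10, 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .pprint()
{% endhighlight %}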

For more details on streams from sockets and files, see the API documentations of the relevant functions in
[StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) for
Scala, [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html)
for Java, and [StreamingContext](api/python/reference/api/pyspark.streaming.StreamingContext.html#pyspark.streaming.StreamingContext) for Python.
[StreamingContext](api/python/reference/api/pyspark.streaming.StreamingContext.html#pyspark.streaming.StreamingContext) for Python,
[StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) for Scala,
and [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html) for Java.

### Advanced Sources
{:.no_toc}
@@ -1265,12 +1265,12 @@ JavaPairDStream<String, String> joinedStream = windowedStream.transform(rdd -> r

In fact, you can also dynamically change the dataset you want to join against. The function provided to `transform` is evaluated every batch interval and therefore will use the current dataset that `dataset` reference points to.
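
A hedged PySpark counterpart of the Java snippet above; `windowed_stream` and `spam_rdd` are assumed to be existing pair DStream/RDD objects:

{% highlight python %}
# The dict lets the driver swap in a new RDD; transform() re-evaluates the
# function for every batch, so each batch joins against the current RDD.
dataset = {"rdd": spam_rdd}

def join_with_current(rdd):
    return rdd.join(dataset["rdd"])

joined_stream = windowed_stream.transform(join_with_current)
{% endhighlight %}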

The complete list of DStream transformations is available in the API documentation. For the Scala API,
see [DStream](api/scala/org/apache/spark/streaming/dstream/DStream.html)
The complete list of DStream transformations is available in the API documentation. For the Python API,
see [DStream](api/python/reference/api/pyspark.streaming.DStream.html#pyspark.streaming.DStream).
For the Scala API, see [DStream](api/scala/org/apache/spark/streaming/dstream/DStream.html)
and [PairDStreamFunctions](api/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.html).
For the Java API, see [JavaDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html)
and [JavaPairDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaPairDStream.html).
For the Python API, see [DStream](api/python/reference/api/pyspark.streaming.DStream.html#pyspark.streaming.DStream).

***

@@ -2150,7 +2150,7 @@ application left off. Note that this can be done only with input sources that su
(like Kafka) as data needs to be buffered while the previous application was down and
the upgraded application is not yet up. And restarting from earlier checkpoint
information of pre-upgrade code cannot be done. The checkpoint information essentially
contains serialized Scala/Java/Python objects and trying to deserialize objects with new,
contains serialized Python/Scala/Java objects and trying to deserialize objects with new,
modified classes may lead to errors. In this case, either start the upgraded app with a different
checkpoint directory, or delete the previous checkpoint directory.
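
One common way to express the "different checkpoint directory" option in PySpark, sketched with placeholder names (`sc`, the HDFS path, and the DStream setup are assumptions):

{% highlight python %}
from pyspark.streaming import StreamingContext

checkpoint_dir = "hdfs:///checkpoints/myapp-v2"   # fresh directory for the upgraded app

def create_context():
    ssc = StreamingContext(sc, 10)                # assumes an existing SparkContext `sc`
    # ... build the upgraded application's DStream graph here ...
    ssc.checkpoint(checkpoint_dir)
    return ssc

# Recovers from a checkpoint if one exists in the new directory,
# otherwise builds a fresh context via create_context().
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
{% endhighlight %}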

@@ -2564,6 +2564,8 @@ additional effort may be necessary to achieve exactly-once semantics. There are
- [Custom Receiver Guide](streaming-custom-receivers.html)
* Third-party DStream data sources can be found in [Third Party Projects](https://spark.apache.org/third-party-projects.html)
* API documentation
- Python docs
* [StreamingContext](api/python/reference/api/pyspark.streaming.StreamingContext.html#pyspark.streaming.StreamingContext) and [DStream](api/python/reference/api/pyspark.streaming.DStream.html#pyspark.streaming.DStream)
- Scala docs
* [StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) and
[DStream](api/scala/org/apache/spark/streaming/dstream/DStream.html)
@@ -2575,10 +2577,8 @@ additional effort may be necessary to achieve exactly-once semantics. There are
[JavaPairDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaPairDStream.html)
* [KafkaUtils](api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html),
[KinesisUtils](api/java/index.html?org/apache/spark/streaming/kinesis/KinesisInputDStream.html)
- Python docs
* [StreamingContext](api/python/reference/api/pyspark.streaming.StreamingContext.html#pyspark.streaming.StreamingContext) and [DStream](api/python/reference/api/pyspark.streaming.DStream.html#pyspark.streaming.DStream)

* More examples in [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming)
* More examples in [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming)
and [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming)
and [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/streaming)
and [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python/streaming)
* [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) and [video](http://youtu.be/g171ndOHgJ0) describing Spark Streaming.