From ecdb38e2d0592987991cde46081c4de9b3c0fe64 Mon Sep 17 00:00:00 2001
From: Nicholas Chammas
Date: Wed, 31 Jan 2024 09:47:33 +0900
Subject: [PATCH] [SPARK-46923][DOCS] Limit width of configuration tables

### What changes were proposed in this pull request?

- Assign all config tables in the documentation to the new CSS class `spark-config`.
- Migrate the config table in `docs/sql-ref-ansi-compliance.md` from Markdown to HTML and assign it to this new CSS class as well.
- Limit the width of the config tables to the width of the main content, and force words to break and wrap if necessary.
- Remove a styling workaround for the documentation of `spark.worker.resourcesFile` that is no longer needed thanks to these changes.
- Remove some `.global` CSS rules that, due to their [specificity][specificity], interfere with our ability to assign simple rules that apply directly to elements.

[specificity]: https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity

### Why are the changes needed?

Many configs and config defaults have very long names that normally cannot wrap. This causes tables to overflow the viewport. An egregious example is `spark.scheduler.listenerbus.eventqueue.executorManagement.capacity`, which has a default of `spark.scheduler.listenerbus.eventqueue.capacity`. This change forces these long strings to break and wrap, which keeps the table widths limited to the width of the overall content.

Because we are hard-coding the column widths, some tables will look slightly worse with this new layout due to extra whitespace. I couldn't figure out a practical way to prevent that while also solving the main problem of table overflow.

In #44755 or #44756 (whichever approach gets accepted), these config tables will be generated automatically. This will give us the opportunity to improve the styling further by setting the column width dynamically based on the content. (This should be possible in CSS, but table styling in CSS is limited and we cannot use properties like `max-width`.) We will also be able to insert [word break opportunities][wbo] so that config names wrap in a more visually pleasing manner.

[wbo]: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/wbr

### Does this PR introduce _any_ user-facing change?

Yes, it changes the presentation of tables, especially config tables, in the main documentation.

### How was this patch tested?

I built the docs and compared the following pages visually across `master` (left) and this branch (right):

- `sql-ref-ansi-compliance.html`
- `configuration.html#scheduling`
- `configuration.html#barrier-execution-mode`
- `spark-standalone.html`
- `structured-streaming-kafka-integration.html#configuration`

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44955 from nchammas/table-styling.
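For illustration only, here is a minimal sketch of the fixed-layout approach described above. The selectors and widths mirror the rules this patch adds, but the authoritative version is the `docs/css/custom.css` hunk further down in the diff:

```css
/* Illustrative sketch -- see the docs/css/custom.css changes below for the
 * rules this patch actually adds. A fixed table layout makes the browser honor
 * the declared column widths instead of sizing columns to their longest word,
 * and overflow-wrap lets long config names break at any point. */
table.spark-config {
  width: 100%;          /* never exceed the main content area */
  table-layout: fixed;  /* column widths come from CSS, not from content */
}

table.spark-config th,
table.spark-config td,
table.spark-config code {
  white-space: normal;        /* allow wrapping inside cells and <code> */
  overflow-wrap: break-word;  /* break long tokens like spark.scheduler.listenerbus... */
}

/* CSS tables ignore max-width on cells, so each column gets an explicit width. */
table.spark-config th:nth-child(1) { width: 30%; }  /* Property Name */
table.spark-config th:nth-child(2) { width: 20%; }  /* Default */
```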
Authored-by: Nicholas Chammas
Signed-off-by: Hyukjin Kwon
---
 connector/profiler/README.md                  |  2 +-
 docs/configuration.md                         | 38 ++++++-------
 docs/css/custom.css                           | 56 +++++++++++++++----
 docs/monitoring.md                            |  2 +-
 docs/running-on-kubernetes.md                 |  2 +-
 docs/running-on-yarn.md                       |  6 +-
 docs/security.md                              | 18 +++---
 docs/spark-standalone.md                      | 20 +++----
 docs/sql-data-sources-avro.md                 |  8 +--
 docs/sql-data-sources-hive-tables.md          |  2 +-
 docs/sql-data-sources-orc.md                  |  4 +-
 docs/sql-data-sources-parquet.md              |  2 +-
 docs/sql-performance-tuning.md                | 16 +++---
 docs/sql-ref-ansi-compliance.md               | 38 +++++++++++--
 .../structured-streaming-kafka-integration.md |  8 +--
 sql/gen-sql-config-docs.py                    |  4 +-
 16 files changed, 143 insertions(+), 83 deletions(-)

diff --git a/connector/profiler/README.md b/connector/profiler/README.md
index 3512dadb07913..527f8b487d4d4 100644
--- a/connector/profiler/README.md
+++ b/connector/profiler/README.md
@@ -40,7 +40,7 @@ Then enable the profiling in the configuration.
 ### Code profiling configuration
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.executor.profiling.enabled</code></td>
diff --git a/docs/configuration.md b/docs/configuration.md
index 7fef09781a15a..0f80a892c0679 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -139,7 +139,7 @@ of the most common options to set are:
 ### Application Properties
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.app.name</code></td>
@@ -553,7 +553,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Runtime Environment
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.driver.extraClassPath</code></td>
@@ -940,7 +940,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Shuffle Behavior
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.reducer.maxSizeInFlight</code></td>
@@ -1315,7 +1315,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Spark UI
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.eventLog.logBlockUpdates.enabled</code></td>
@@ -1755,7 +1755,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Compression and Serialization
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.broadcast.compress</code></td>
@@ -1972,7 +1972,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Memory Management
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.memory.fraction</code></td>
@@ -2097,7 +2097,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Execution Behavior
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.broadcast.blockSize</code></td>
@@ -2342,7 +2342,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Executor Metrics
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.eventLog.logStageExecutorMetrics</code></td>
@@ -2410,7 +2410,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Networking
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.rpc.message.maxSize</code></td>
@@ -2573,7 +2573,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Scheduling
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.cores.max</code></td>
@@ -3054,7 +3054,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Barrier Execution Mode
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.barrier.sync.timeout</code></td>
@@ -3101,7 +3101,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Dynamic Allocation
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.dynamicAllocation.enabled</code></td>
@@ -3243,7 +3243,7 @@ finer granularity starting from driver and executor. Take RPC module as example
 like shuffle, just replace "rpc" with "shuffle" in the property names except
 spark.{driver|executor}.rpc.netty.dispatcher.numThreads, which is only for RPC module.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.{driver|executor}.rpc.io.serverThreads</code></td>
@@ -3281,7 +3281,7 @@ the driver or executor, or, in the absence of that value, the number of cores av
 Server configurations are set in Spark Connect server, for example, when you start the Spark Connect server with `./sbin/start-connect-server.sh`.
 They are typically set via the config file and command-line options with `--conf/-c`.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.connect.grpc.binding.port</code></td>
@@ -3373,7 +3373,7 @@ External users can query the static sql config values via `SparkSession.conf` or
 ### Spark Streaming
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.streaming.backpressure.enabled</code></td>
@@ -3505,7 +3505,7 @@ External users can query the static sql config values via `SparkSession.conf` or
 ### SparkR
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.r.numRBackendThreads</code></td>
@@ -3561,7 +3561,7 @@ External users can query the static sql config values via `SparkSession.conf` or
 ### GraphX
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.graphx.pregel.checkpointInterval</code></td>
@@ -3735,7 +3735,7 @@ Push-based shuffle helps improve the reliability and performance of spark shuffl
 ### External Shuffle service(server) side configuration options
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.shuffle.push.server.mergedShuffleFileManagerImpl</code></td>
@@ -3769,7 +3769,7 @@ Push-based shuffle helps improve the reliability and performance of spark shuffl
 ### Client side configuration options
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.shuffle.push.enabled</code></td>
diff --git a/docs/css/custom.css b/docs/css/custom.css
index 1fabf7be3ac8a..22175068023b7 100644
--- a/docs/css/custom.css
+++ b/docs/css/custom.css
@@ -557,7 +557,6 @@ pre {
   border-radius: 4px;
 }
 
-code,
 pre {
   font: 1em Menlo, Monaco, Consolas, "Courier New", monospace;
 }
@@ -741,7 +740,6 @@ h3 {
   margin: 0;
 }
 
-.global code,
 .global pre {
   font: 1em Menlo, Monaco, Consolas, "Courier New", monospace;
 }
@@ -761,15 +759,6 @@ h3 {
   border-radius: 4px;
 }
 
-.global code {
-  font: 90% "Menlo", "Lucida Console", Consolas, monospace;
-  white-space: nowrap;
-  background: transparent;
-  border-radius: 4px;
-  padding: 0;
-  color: inherit;
-}
-
 .global pre code {
   padding: 0;
   font-size: inherit;
@@ -936,8 +925,14 @@ img {
 table {
   width: 100%;
-  overflow-wrap: normal;
+  overflow-wrap: break-word;
   border-collapse: collapse;
+  white-space: normal;
+}
+
+table code {
+  overflow-wrap: break-word;
+  white-space: normal;
 }
 
 table th,
@@ -956,3 +951,40 @@ table tr {
 table tr:nth-child(2n) {
   background-color: #F1F4F5;
 }
+
+table.spark-config {
+  width: 100%;
+  table-layout: fixed;
+  white-space: normal;
+  overflow-wrap: break-word;
+}
+
+/* We have long config names and formulas that often show up in tables. To prevent
+ * any table column from becoming super wide, we allow the browser to break words at
+ * any point.
+ */
+table.spark-config code,
+table.spark-config th,
+table.spark-config td {
+  white-space: normal;
+  overflow-wrap: break-word;
+}
+
+/* CSS does not respect max-width on tables or table parts (like cells, columns, etc.),
+   so we have to pick a fixed width for each column.
+   See: https://stackoverflow.com/a/8465980
+ */
+table.spark-config th:nth-child(1),
+table.spark-config td:nth-child(1) {
+  width: 30%;
+}
+
+table.spark-config th:nth-child(2),
+table.spark-config td:nth-child(2) {
+  width: 20%;
+}
+
+table.spark-config th:nth-child(4),
+table.spark-config td:nth-child(4) {
+  width: 90px;
+}
diff --git a/docs/monitoring.md b/docs/monitoring.md
index 8d3dbe375b82c..79bbb93e50d17 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -145,7 +145,7 @@ Use it with caution.
 Security options for the Spark History Server are covered more detail in the
 [Security](security.html#web-ui) page.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md
index 4b4dc9d304fbd..4cdb450ffd743 100644
--- a/docs/running-on-kubernetes.md
+++ b/docs/running-on-kubernetes.md
@@ -592,7 +592,7 @@ See the [configuration page](configuration.html) for information on Spark config
 #### Spark Properties
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.kubernetes.context</code></td>
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 02547b30d2e50..aab8ee60a256c 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -143,7 +143,7 @@ To use a custom metrics.properties for the application master and executors, upd
 #### Spark Properties
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.yarn.am.memory</code></td>
@@ -766,7 +766,7 @@ staging directory of the Spark application.
 ## YARN-specific Kerberos Configuration
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.kerberos.keytab</code></td>
@@ -865,7 +865,7 @@ to avoid garbage collection issues during shuffle.
 The following extra configuration options are available when the shuffle service is running on YARN:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.yarn.shuffle.stopOnFailure</code></td>
diff --git a/docs/security.md b/docs/security.md
index 00e35ce2f4990..61d5bf8e9d3ae 100644
--- a/docs/security.md
+++ b/docs/security.md
@@ -60,7 +60,7 @@ distributing the shared secret. Each application will use a unique shared secret
 the case of YARN, this feature relies on YARN RPC encryption being enabled for the distribution
 of secrets to be secure.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.yarn.shuffle.server.recovery.disabled</code></td>
@@ -82,7 +82,7 @@ that any user that can list pods in the namespace where the Spark application is
 also see their authentication secret. Access control rules should be properly set up by the
 Kubernetes admin to ensure that Spark authentication is secure.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.authenticate</code></td>
@@ -103,7 +103,7 @@ Kubernetes admin to ensure that Spark authentication is secure.
 Alternatively, one can mount authentication secrets using files and Kubernetes secrets that
 the user mounts into their pods.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.authenticate.secret.file</code></td>
@@ -178,7 +178,7 @@ is still required when talking to shuffle services from Spark versions older tha
 The following table describes the different options available for configuring this feature.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.network.crypto.enabled</code></td>
@@ -249,7 +249,7 @@ encrypting output data generated by applications with APIs such as `saveAsHadoop
 The following settings cover enabling encryption for data written to disk:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.io.encryption.enabled</code></td>
@@ -317,7 +317,7 @@ below.
 The following options control the authentication of Web UIs:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.ui.allowFramingFrom</code></td>
@@ -421,7 +421,7 @@ servlet filters.
 To enable authorization in the SHS, a few extra options are used:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.history.ui.acls.enable</code></td>
@@ -734,7 +734,7 @@ Apache Spark can be configured to include HTTP headers to aid in preventing Cros
 (XSS), Cross-Frame Scripting (XFS), MIME-Sniffing, and also to enforce HTTP Strict Transport
 Security.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.ui.xXssProtection</code></td>
@@ -917,7 +917,7 @@ deployment-specific page for more information.
 The following options provides finer-grained control for this feature:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.security.credentials.${service}.enabled</code></td>
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index fdc28aac934dc..a21d16419fd10 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -200,7 +200,7 @@ You can optionally configure the cluster further by setting environment variable
 SPARK_MASTER_OPTS supports the following system properties:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.master.ui.port</code></td>
@@ -403,9 +403,7 @@ SPARK_MASTER_OPTS supports the following system properties:
   Path to resources file which is used to find various resources while worker starting up.
   The content of resources file should be formatted like
-  [{"id":{"componentName":
-  "spark.worker", "resourceName":"gpu"},
-  "addresses":["0","1","2"]}].
+  [{"id":{"componentName": "spark.worker", "resourceName":"gpu"}, "addresses":["0","1","2"]}].
   If a particular resource is not found in the resources file, the discovery script would be
   used to find that resource. If the discovery script also does not find the resources, the
   worker will fail to start up.
@@ -416,7 +414,7 @@ SPARK_MASTER_OPTS supports the following system properties:
 SPARK_WORKER_OPTS supports the following system properties:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.worker.initialRegistrationRetries</code></td>
@@ -549,8 +547,8 @@ You can also pass an option `--total-executor-cores ` to control the n
 Spark applications supports the following configuration properties specific to standalone mode:
-
-
+<table class="spark-config">
+<thead><tr><th>Property Name</th><th>Default Value</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.standalone.submit.waitAppCompletion</code></td><td><code>false</code></td>
@@ -599,8 +597,8 @@ via http://[host:port]/[version]/submissions/[action] where version is a protocol
 version, v1 as of today, and action is one of the following supported actions.
-
-
+<table class="spark-config">
+<thead><tr><th>Command</th><th>Description</th><th>HTTP METHOD</th><th>Since Version</th></tr></thead>
 <tr><td><code>create</code></td><td>Create a Spark driver via cluster mode.</td>
@@ -778,8 +776,8 @@ ZooKeeper is the best way to go for production-level high availability, but if y
 In order to enable this recovery mode, you can set SPARK_DAEMON_JAVA_OPTS in spark-env using this configuration:
-
-
+<table class="spark-config">
+<thead><tr><th>System property</th><th>Default Value</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.deploy.recoveryMode</code></td><td>NONE</td>
diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md
index ddfdc89370b1f..cbc3367e5f852 100644
--- a/docs/sql-data-sources-avro.md
+++ b/docs/sql-data-sources-avro.md
@@ -233,8 +233,8 @@ Data source options of Avro can be set via:
 * the `.option` method on `DataFrameReader` or `DataFrameWriter`.
 * the `options` parameter in function `from_avro`.
-
-
+<table class="spark-config">
+<thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Scope</th><th>Since Version</th></tr></thead>
 <tr><td><code>avroSchema</code></td><td>None</td>
@@ -331,8 +331,8 @@ Data source options of Avro can be set via:
 ## Configuration
 Configuration of Avro can be done via `spark.conf.set` or by running `SET key=value` commands using SQL.
-
-
+<table class="spark-config">
+<thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.legacy.replaceDatabricksSparkAvro.enabled</code></td><td>true</td>
diff --git a/docs/sql-data-sources-hive-tables.md b/docs/sql-data-sources-hive-tables.md
index 0d16272ed6f86..b51cde53bd8fd 100644
--- a/docs/sql-data-sources-hive-tables.md
+++ b/docs/sql-data-sources-hive-tables.md
@@ -123,7 +123,7 @@ will compile against built-in Hive and use those classes for internal execution
 The following options can be used to configure the version of Hive that is used to retrieve metadata:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.hive.metastore.version</code></td>
diff --git a/docs/sql-data-sources-orc.md b/docs/sql-data-sources-orc.md
index abd1901d24e4b..8267d39e949e5 100644
--- a/docs/sql-data-sources-orc.md
+++ b/docs/sql-data-sources-orc.md
@@ -129,8 +129,8 @@ When reading from Hive metastore ORC tables and inserting to Hive metastore ORC
 ### Configuration
-
-
+<table class="spark-config">
+<thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.orc.impl</code></td><td><code>native</code></td>
diff --git a/docs/sql-data-sources-parquet.md b/docs/sql-data-sources-parquet.md
index 7d80343214815..e944db24d76be 100644
--- a/docs/sql-data-sources-parquet.md
+++ b/docs/sql-data-sources-parquet.md
@@ -434,7 +434,7 @@ Other generic options can be found in
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.parquet.binaryAsString</code></td>
diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 4ede18d1938bf..1dbe1bb7e1a26 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -34,7 +34,7 @@ memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableNam
 Configuration of in-memory caching can be done via `spark.conf.set` or by running `SET key=value` commands using SQL.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
@@ -62,7 +62,7 @@ Configuration of in-memory caching can be done via `spark.conf.set` or by runnin
 The following options can also be used to tune the performance of query execution. It is possible that these options will be deprecated in future release as more optimizations are performed automatically.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.files.maxPartitionBytes</code></td>
@@ -253,7 +253,7 @@ Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that ma
 ### Coalescing Post Shuffle Partitions
 This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configurations are true. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
@@ -298,7 +298,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
 ### Splitting skewed shuffle partitions
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled</code></td>
@@ -320,7 +320,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
 ### Converting sort-merge join to broadcast join
 AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the adaptive broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it's better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic(if `spark.sql.adaptive.localShuffleReader.enabled` is true)
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.adaptive.autoBroadcastJoinThreshold</code></td>
@@ -342,7 +342,7 @@ AQE converts sort-merge join to broadcast hash join when the runtime statistics
 ### Converting sort-merge join to shuffled hash join
 AQE converts sort-merge join to shuffled hash join when all post shuffle partitions are smaller than a threshold, the max threshold can see the config `spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold`.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold</code></td>
@@ -356,7 +356,7 @@ AQE converts sort-merge join to shuffled hash join when all post shuffle partiti
 ### Optimizing Skew Join
 Data skew can severely downgrade the performance of join queries. This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. It takes effect when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled` configurations are enabled.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.adaptive.skewJoin.enabled</code></td>
@@ -393,7 +393,7 @@ Data skew can severely downgrade the performance of join queries. This feature d
 ### Misc
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.sql.adaptive.optimizer.excludedRules</code></td>
diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md
index 93af3e6698474..9b933ec1f65c1 100644
--- a/docs/sql-ref-ansi-compliance.md
+++ b/docs/sql-ref-ansi-compliance.md
@@ -28,10 +28,40 @@ The casting behaviours are defined as store assignment rules in the standard.
 When `spark.sql.storeAssignmentPolicy` is set to `ANSI`, Spark SQL complies with the ANSI store assignment rules. This is a separate configuration because its default value is `ANSI`, while the configuration `spark.sql.ansi.enabled` is disabled by default.
 
-|Property Name|Default| Meaning |Since Version|
-|-------------|-------|---------|-------------|
-|`spark.sql.ansi.enabled`|false| When true, Spark tries to conform to the ANSI SQL specification: 1. Spark SQL will throw runtime exceptions on invalid operations, including integer overflow errors, string parsing errors, etc. 2. Spark will use different type coercion rules for resolving conflicts among data types. The rules are consistently based on data type precedence. |3.0.0|
-|`spark.sql.storeAssignmentPolicy`|ANSI| When inserting a value into a column with different data type, Spark will perform type conversion. Currently, we support 3 policies for the type coercion rules: ANSI, legacy and strict. 1. With ANSI policy, Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting string to int or double to boolean. On inserting a numeric type column, an overflow error will be thrown if the value is out of the target data type's range. 2. With legacy policy, Spark allows the type coercion as long as it is a valid Cast, which is very loose. e.g. converting string to int or double to boolean is allowed. It is also the only behavior in Spark 2.x and it is compatible with Hive. 3. With strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. converting double to int or decimal to double is not allowed. |3.0.0|
+<table class="spark-config">
+<thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
+<tr>
+  <td><code>spark.sql.ansi.enabled</code></td>
+  <td>false</td>
+  <td>
+    When true, Spark tries to conform to the ANSI SQL specification:
+    1. Spark SQL will throw runtime exceptions on invalid operations, including integer overflow
+    errors, string parsing errors, etc.
+    2. Spark will use different type coercion rules for resolving conflicts among data types.
+    The rules are consistently based on data type precedence.
+  </td>
+  <td>3.0.0</td>
+</tr>
+<tr>
+  <td><code>spark.sql.storeAssignmentPolicy</code></td>
+  <td>ANSI</td>
+  <td>
+    When inserting a value into a column with different data type, Spark will perform type
+    conversion. Currently, we support 3 policies for the type coercion rules: ANSI, legacy and
+    strict.
+    1. With ANSI policy, Spark performs the type coercion as per ANSI SQL. In practice, the behavior
+    is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as
+    converting string to int or double to boolean. On inserting a numeric type column, an overflow
+    error will be thrown if the value is out of the target data type's range.
+    2. With legacy policy, Spark allows the type coercion as long as it is a valid Cast, which is
+    very loose. e.g. converting string to int or double to boolean is allowed. It is also the only
+    behavior in Spark 2.x and it is compatible with Hive.
+    3. With strict policy, Spark doesn't allow any possible precision loss or data truncation in
+    type coercion, e.g. converting double to int or decimal to double is not allowed.
+  </td>
+  <td>3.0.0</td>
+</tr>
+</table>
 
 The following subsections present behaviour changes in arithmetic operations, type conversions, and SQL parsing when the ANSI mode enabled. For type conversions in Spark SQL, there are three kinds of them and this article will introduce them one by one: cast, store assignment and type coercion.
diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md
index c5ffdf025b173..37846216fc758 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -607,7 +607,7 @@ The caching key is built up from the following information:
 The following properties are available to configure the consumer pool:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.kafka.consumer.cache.capacity</code></td>
@@ -657,7 +657,7 @@ Note that it doesn't leverage Apache Commons Pool due to the difference of chara
 The following properties are available to configure the fetched data pool:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.kafka.consumer.fetchedData.cache.timeout</code></td>
@@ -912,7 +912,7 @@ It will use different Kafka producer when delegation token is renewed; Kafka pro
 The following properties are available to configure the producer pool:
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.kafka.producer.cache.timeout</code></td>
@@ -1039,7 +1039,7 @@ When none of the above applies then unsecure connection assumed.
 Delegation tokens can be obtained from multiple clusters and ${cluster} is an arbitrary unique identifier which helps to group different configurations.
-
+<table class="spark-config">
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr><td><code>spark.kafka.clusters.${cluster}.auth.bootstrap.servers</code></td>
diff --git a/sql/gen-sql-config-docs.py b/sql/gen-sql-config-docs.py
index 83334b6a1f539..b69a903b44f90 100644
--- a/sql/gen-sql-config-docs.py
+++ b/sql/gen-sql-config-docs.py
@@ -56,7 +56,7 @@ def generate_sql_configs_table_html(sql_configs, path):
     The table will look something like this:
 
     ```html
-
+    <table class="spark-config">
     <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
@@ -76,7 +76,7 @@ def generate_sql_configs_table_html(sql_configs, path):
     with open(path, 'w') as f:
         f.write(dedent(
             """
-
+            <table class="spark-config">
             <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
             """
         ))