forked from dotnet/spark
Merge remote-tracking branch 'upstream/master'
Showing 133 changed files with 5,520 additions and 1,254 deletions.
@@ -367,3 +367,6 @@ hs_err_pid*

# The target folder contains the output of building
**/target/**

# F# vs code
.ionide/
@@ -0,0 +1,92 @@
# Guide to using Broadcast Variables

This guide shows how to use broadcast variables in .NET for Apache Spark.

## What are Broadcast Variables

[Broadcast variables in Apache Spark](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) are a mechanism for sharing read-only variables across executors. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.

### How to use broadcast variables in .NET for Apache Spark

Broadcast variables are created from a variable `v` by calling `SparkContext.Broadcast(v)`. The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `Value()` method.

Example:

```csharp
string v = "Variable to be broadcasted";
Broadcast<string> bv = SparkContext.Broadcast(v);

// Using the broadcast variable in a UDF:
Func<Column, Column> udf = Udf<string, string>(
    str => $"{str}: {bv.Value()}");
```

The type parameter for `Broadcast` should be the type of the variable being broadcasted.
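For reference, a complete, minimal sketch of the above is shown below. It is not part of the original guide: it assumes a Microsoft.Spark version where `SparkSession.CreateDataFrame(IEnumerable<string>)` is available (the resulting column is named `_1`), and the app name and sample data are purely illustrative.

```csharp
using System;
using Microsoft.Spark;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace BroadcastExample
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) a SparkSession.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("Broadcast variable example") // illustrative app name
                .GetOrCreate();

            // A single-column DataFrame; the column is exposed as "_1".
            DataFrame df = spark.CreateDataFrame(new[] { "Alice", "Bob" });

            string v = "Variable to be broadcasted";
            Broadcast<string> bv = spark.SparkContext.Broadcast(v);

            // The UDF captures only the broadcast wrapper; the value is
            // read on the executors via bv.Value().
            Func<Column, Column> udf = Udf<string, string>(
                str => $"{str}: {bv.Value()}");

            df.Select(udf(df["_1"])).Show();

            spark.Stop();
        }
    }
}
```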
### Deleting broadcast variables

The broadcast variable can be deleted from all executors by calling the `Destroy()` method on it.

```csharp
// Destroying the broadcast variable bv:
bv.Destroy();
```

> Note: `Destroy()` deletes all data and metadata related to the broadcast variable. Use this with caution: once a broadcast variable has been destroyed, it cannot be used again.

#### Caveat of using Destroy

One important thing to keep in mind while using broadcast variables in UDFs is to limit the scope of the variable to only the UDF that references it. The [guide to using UDFs](udf-guide.md) describes this phenomenon in detail. This is especially crucial when calling `Destroy` on the broadcast variable. If the destroyed broadcast variable is visible to or accessible from other UDFs, it gets picked up for serialization by all of those UDFs, even if they do not reference it. This throws an error, as .NET for Apache Spark is not able to serialize the destroyed broadcast variable.

Example demonstrating this behavior:

```csharp
string v = "Variable to be broadcasted";
Broadcast<string> bv = SparkContext.Broadcast(v);

// Using the broadcast variable in a UDF:
Func<Column, Column> udf1 = Udf<string, string>(
    str => $"{str}: {bv.Value()}");

// Destroying bv
bv.Destroy();

// Calling udf1 after destroying bv throws the following expected exception:
// org.apache.spark.SparkException: Attempted to use Broadcast(0) after it was destroyed
df.Select(udf1(df["_1"])).Show();

// Different UDF udf2 that does not reference bv
Func<Column, Column> udf2 = Udf<string, string>(
    str => $"{str}: not referencing broadcast variable");

// Calling udf2 throws the following (unexpected) exception:
// [Error] [JvmBridge] org.apache.spark.SparkException: Task not serializable
df.Select(udf2(df["_1"])).Show();
```

The recommended way of implementing the desired behavior:

```csharp
string v = "Variable to be broadcasted";
// Restricting the visibility of bv to only the UDF that references it
{
    Broadcast<string> bv = SparkContext.Broadcast(v);

    // Using the broadcast variable in a UDF:
    Func<Column, Column> udf1 = Udf<string, string>(
        str => $"{str}: {bv.Value()}");

    // Destroying bv
    bv.Destroy();
}

// Different UDF udf2 that does not reference bv
Func<Column, Column> udf2 = Udf<string, string>(
    str => $"{str}: not referencing broadcast variable");

// Calling udf2 works fine as expected
df.Select(udf2(df["_1"])).Show();
```

This ensures that destroying `bv` doesn't affect calling `udf2` through unexpected serialization behavior.

Broadcast variables are useful for transmitting read-only data to all executors: the data is sent only once, which can give performance benefits compared with local variables that are shipped to the executors with each task. Please refer to the [official documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) to get a deeper understanding of broadcast variables and why they are used.
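As a rough illustration of that difference, the following sketch (not from the original guide; it assumes `spark` is an existing `SparkSession` and the same usings as the examples above) contrasts capturing a plain local variable in a UDF with using a broadcast variable:

```csharp
string data = "Some read-only data"; // illustrative value

// Closure capture: the full value of `data` is serialized as part of the
// UDF closure and shipped to the executors with every task.
Func<Column, Column> closureUdf = Udf<string, string>(
    str => $"{str}: {data}");

// Broadcast: only the small Broadcast<string> wrapper is serialized with
// the UDF; the value itself is distributed to each executor once and cached.
Broadcast<string> bv = spark.SparkContext.Broadcast(data);
Func<Column, Column> broadcastUdf = Udf<string, string>(
    str => $"{str}: {bv.Value()}");
```

For small values the difference is negligible; broadcasting pays off when the shared data is large or reused across many tasks.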
@@ -0,0 +1,110 @@
# .NET for Apache Spark 0.12.1 Release Notes

### New Features/Improvements

* Expose `JvmException` to capture JVM error messages separately ([#566](https://github.com/dotnet/spark/pull/566))

### Bug Fixes

* AssemblyLoader should use absolute assembly path when loading assemblies ([#570](https://github.com/dotnet/spark/pull/570))

### Infrastructure / Documentation / Etc.

* None

### Breaking Changes

* None

### Known Issues

* Broadcast variables do not work with [dotnet-interactive](https://github.com/dotnet/interactive) ([#561](https://github.com/dotnet/spark/pull/561))

### Compatibility

#### Backward compatibility

The following table describes the oldest version of the worker that the current version is compatible with, along with new features that are incompatible with the worker.

<table>
    <thead>
        <tr>
            <th>Oldest compatible Microsoft.Spark.Worker version</th>
            <th>Incompatible features</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td rowspan=4>v0.9.0</td>
            <td>DataFrame with Grouped Map UDF <a href="https://github.com/dotnet/spark/pull/277">(#277)</a></td>
        </tr>
        <tr>
            <td>DataFrame with Vector UDF <a href="https://github.com/dotnet/spark/pull/277">(#277)</a></td>
        </tr>
        <tr>
            <td>Support for Broadcast Variables <a href="https://github.com/dotnet/spark/pull/414">(#414)</a></td>
        </tr>
        <tr>
            <td>Support for TimestampType <a href="https://github.com/dotnet/spark/pull/428">(#428)</a></td>
        </tr>
    </tbody>
</table>

#### Forward compatibility

The following table describes the oldest .NET for Apache Spark release that the current worker is compatible with.

<table>
    <thead>
        <tr>
            <th>Oldest compatible .NET for Apache Spark release version</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td>v0.9.0</td>
        </tr>
    </tbody>
</table>

### Supported Spark Versions

The following table outlines the supported Spark versions along with the microsoft-spark JAR to use with each:

<table>
    <thead>
        <tr>
            <th>Spark Version</th>
            <th>microsoft-spark JAR</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td>2.3.*</td>
            <td>microsoft-spark-2.3.x-0.12.1.jar</td>
        </tr>
        <tr>
            <td>2.4.0</td>
            <td rowspan=6>microsoft-spark-2.4.x-0.12.1.jar</td>
        </tr>
        <tr>
            <td>2.4.1</td>
        </tr>
        <tr>
            <td>2.4.3</td>
        </tr>
        <tr>
            <td>2.4.4</td>
        </tr>
        <tr>
            <td>2.4.5</td>
        </tr>
        <tr>
            <td>2.4.6</td>
        </tr>
        <tr>
            <td>2.4.2</td>
            <td><a href="https://github.com/dotnet/spark/issues/60">Not supported</a></td>
        </tr>
    </tbody>
</table>