Add Alluxio auto mount feature #5925
Conversation
Force-pushed from eeebe92 to ba58787
This change simplifies the usage of Alluxio with Spark-Rapids when the Alluxio master node is the same node that runs the Spark driver app. E.g. when you run a notebook on Databricks, the Spark driver runs on the master node and the Alluxio master is also installed on the master node. Prerequisites: For a quick way to use this feature, you just need to set spark.rapids.alluxio.automount.enabled=true (and don't set spark.rapids.alluxio.pathsToReplace).
I forgot to consider the case where the driver context is restarted and the bucket has already been mounted; mounting it again will fail. Here we can still use the command line; another option would be to use an Alluxio Java class, but that would depend on the Alluxio jar.
build
Get mounted points by parsing the output of "alluxio fs mount"
docs/configs.md
Outdated
<a name="alluxio.pathsToReplace"></a>spark.rapids.alluxio.pathsToReplace|List of paths to be replaced with the corresponding Alluxio scheme. E.g. when the config is set to "s3:/foo->alluxio://0.1.2.3:19998/foo,gcs:/bar->alluxio://0.1.2.3:19998/bar", it means: s3:/foo/a.csv will be replaced with alluxio://0.1.2.3:19998/foo/a.csv and gcs:/bar/b.csv will be replaced with alluxio://0.1.2.3:19998/bar/b.csv|None
<a name="alluxio.automount.enabled"></a>spark.rapids.alluxio.automount.enabled|Enable the feature of auto-mounting the cloud storage to Alluxio. It requires the Alluxio master to be on the same node as the Spark driver. When it's true, it requires the environment variable ALLUXIO_HOME to be set properly. The Alluxio master's host and port will be read from alluxio.master.hostname and alluxio.master.rpc.port (default: 19998) in ALLUXIO_HOME/conf/alluxio-site.properties; then a cloud path matching spark.rapids.alluxio.bucket.regex, like "s3://bar/b.csv", is replaced with "alluxio://0.1.2.3:19998/bar/b.csv", and the bucket "s3://bar" is mounted to "/bar" in Alluxio automatically.|false
<a name="alluxio.bucket.regex"></a>spark.rapids.alluxio.bucket.regex|A regex to decide which buckets should be auto-mounted to Alluxio. E.g. when set to "^s3://bucket.*", buckets starting with "s3://bucket" will be mounted to Alluxio and the path "s3://bucket-foo/a.csv" will be replaced with "alluxio://0.1.2.3:19998/bucket-foo/a.csv". It's only valid when spark.rapids.alluxio.automount.enabled=true. The default value matches all buckets in the "s3://" or "s3a://" scheme.|^s3a{0,1}://.*
<a name="alluxio.cmd"></a>spark.rapids.alluxio.cmd|Provide the Alluxio command, which is used to mount or get information. E.g. "su,ubuntu,-c,/opt/alluxio-2.8.0/bin/alluxio" means: run Process(Seq("su", "ubuntu", "-c", "/opt/alluxio-2.8.0/bin/alluxio fs mount --readonly /bucket-foo s3://bucket-foo")) to mount s3://bucket-foo to /bucket-foo. The delimiter "," is used to convert the value to a Seq[String] when you need a special user to run the mount command.|None
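The path rewrite these configs describe can be sketched as follows. This is an illustrative sketch, not the plugin's actual code; `alluxioMaster` stands in for the host:port read from alluxio-site.properties.

```scala
import java.net.URI

// Illustrative sketch of the automount rewrite; not the plugin's actual code.
val bucketRegex = "^s3a{0,1}://.*".r        // default spark.rapids.alluxio.bucket.regex
val alluxioMaster = "0.1.2.3:19998"         // assumed host:port from alluxio-site.properties

def replacePath(path: String): String = {
  if (bucketRegex.findFirstIn(path).isDefined) {
    val uri = new URI(path)
    // the bucket "s3://bucket-foo" is mounted at "/bucket-foo" in Alluxio
    s"alluxio://$alluxioMaster/${uri.getHost}${uri.getPath}"
  } else {
    path   // non-matching schemes are left untouched
  }
}
```

Under this sketch, `replacePath("s3://bucket-foo/a.csv")` yields `alluxio://0.1.2.3:19998/bucket-foo/a.csv`, while a `gs://` path passes through unchanged.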
It allows defining a customized command. We may remove it due to security concerns.
Will these configurations be displayed on the Spark UI? And if yes, is that OK?
var alluxio_master: String = null
var buffered_source: Source = null
try {
buffered_source = Source.fromFile(alluxio_home + "/conf/alluxio-site.properties")
Here, maybe we can use withResources in Arm to wrap a closeable object.
I considered it once, but it doesn't seem worth creating an AutoCloseable class just for this one use.
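For reference, the standard library already covers this pattern: a minimal sketch using scala.util.Using (Scala 2.13+) to read the properties file with automatic close, without a custom AutoCloseable wrapper. The path layout and key are illustrative.

```scala
import java.io.FileInputStream
import java.util.Properties
import scala.util.Using

// Sketch only: reads one key from alluxio-site.properties, closing the
// stream automatically even if load() throws.
def readAlluxioProperty(alluxioHome: String, key: String): Option[String] =
  Using.resource(new FileInputStream(s"$alluxioHome/conf/alluxio-site.properties")) { in =>
    val props = new Properties()
    props.load(in)
    Option(props.getProperty(key))
  }
```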
// This function will only read once from ALLUXIO/conf.
private def initAlluxioInfo(conf: RapidsConf): Unit = {
this.synchronized {
alluxio_home = scala.util.Properties.envOrElse("ALLUXIO_HOME", "/opt/alluxio-2.8.0")
Is there any reason not to put the alluxio_home/alluxio_cmd assignments into the if (!isInit) block?
Just considering the case where the user changes spark.rapids.alluxio.cmd at runtime.
This way it reads in and uses the new command. Mainly easier for the debug case.
alluxioMasterHost = Some(alluxio_master + ":" + alluxio_port)
// load mounted points by calling the Alluxio mount command.
// We can also get them from the REST API http://alluxio_master:alluxio_web_port/api/v1/master/info.
val (ret, output) = runAlluxioCmd(" fs mount")
I saw some exceptions are thrown when failing to detect the Alluxio configuration or something else. But here, if "fs mount" fails to run, do we need to throw an exception as well?
Here I want to get the mounted paths from the alluxio command, but if that fails, I regard it as no mounted paths, and the exception will be thrown when mounting a new path later.
It's weird to have a space in front of "fs"... have runAlluxioCmd add it if needed.
The alluxio command seq is like ("su", "ubuntu", "-c", "/opt/alluxio-2.8.0/bin/alluxio").
The string " fs mount" is supposed to be appended to the last item in the seq, generating ("su", "ubuntu", "-c", "/opt/alluxio-2.8.0/bin/alluxio fs mount").
The original command would be su ubuntu -c "/opt/alluxio-2.8.0/bin/alluxio fs mount".
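That append-to-last-element behavior can be sketched as a small helper (buildCommand is a hypothetical name, not the PR's actual function):

```scala
// Hypothetical helper sketching the command assembly described above: the
// subcommand is appended to the LAST element of the configured Seq, so the
// shell started by `su user -c "..."` receives it inside the quoted string.
def buildCommand(alluxioCmd: Seq[String], subCommand: String): Seq[String] =
  alluxioCmd.init :+ (alluxioCmd.last + " " + subCommand)

val cmd = buildCommand(
  Seq("su", "ubuntu", "-c", "/opt/alluxio-2.8.0/bin/alluxio"), "fs mount")
// cmd == Seq("su", "ubuntu", "-c", "/opt/alluxio-2.8.0/bin/alluxio fs mount")
```

With a single-element Seq (no su), the subcommand is still appended to that one element, which is the case discussed further below.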
A couple of tests done on my local workstation; here are my thoughts:
- The error is like `java.lang.RuntimeException: Mount bucket s3a://mybucket/ to /mybucket failed 1`. Here the `1` is the stdout, which does not give much debug info. I hope we can print stderr and stdout to show the reason why the mount failed.
- In my env, there is no need to "su" because the Alluxio cluster and Spark users are the same. So if I try to set:
  `spark.conf.set("spark.rapids.alluxio.cmd", "/home/xxx/alluxio-2.8.0/bin/alluxio")`
  then it fails with:
  `java.io.IOException: Cannot run program "/home/xxx/alluxio-2.8.0/bin/alluxio fs mount --readonly --option s3a.accessKeyId=xxx --option s3a.secretKey=yyy /mybucket s3a://mybucket/": error=2, No such file or directory`
I think the latest code has fixed the above issue.
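The "No such file or directory" failure above comes from how scala.sys.process interprets its argument: each Seq element is one argv entry, so a single element containing spaces is looked up as a program with that literal name, while a plain String is tokenized on whitespace. A minimal sketch of the difference, using echo as a stand-in for the alluxio binary:

```scala
import scala.sys.process._

// A String command is tokenized on whitespace, so this runs echo
// with the argument "hello".
val out = Process("echo hello").!!.trim

// By contrast, Process(Seq("echo hello")) would try to exec a program
// literally named "echo hello" and fail with error=2 (No such file or
// directory), which is exactly the failure reported above.
```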
I still think runAlluxioCmd should deal with putting a space in between, rather than having the caller do it. We can do it later as a followup, though.
Changed to add the space in runAlluxioCmd.
@GaryShen2008 I just confirmed that the issue regarding spark.rapids.alluxio.cmd is now fixed, based on my test on my local workstation. Now it does not need su.
``` shell
--conf spark.rapids.alluxio.automount.enabled=true
```
If Alluxio is not installed in /opt/alluxio-2.8.0, you should set the environment variable `ALLUXIO_HOME`.
I think we should rephrase this to just state that Alluxio must be installed and ALLUXIO_HOME must be set to the installation location.
Just hope the user doesn't need to set it explicitly when alluxio is installed in /opt/alluxio-2.8.0.
Otherwise, the user must remember to set it when creating the DB cluster.
More user-friendly, I think.
@viadea What do you think?
I am fine with the current setting.
In the future when the latest alluxio is 2.9 for example, we can update the default value+docs as well.
I removed this line, since now we only read the alluxio command path from the alluxio.cmd config.
The ALLUXIO_HOME is only used for reading Alluxio configurations.
One question:
val df = spark.read.parquet("s3:/bucket1/xxx")
df.show()
val df2 = spark.read.parquet("s3:/bucket2/xxx")
df2.show()
Will this PR auto-mount s3:/bucket1 and s3:/bucket2?
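A sketch of how per-bucket mounting could answer this: each distinct bucket extracted from a read path triggers one mount, and a set of already-mounted buckets prevents re-mounting. ensureMounted and its mountBucket callback are hypothetical names, not the PR's actual code.

```scala
import java.net.URI
import scala.collection.mutable

// Hypothetical sketch: track mounted buckets so each one mounts only once.
val mounted = mutable.Set[String]()

def ensureMounted(path: String, mountBucket: String => Unit): Unit = {
  val bucket = new URI(path).getHost   // e.g. "bucket1" from "s3://bucket1/xxx"
  if (bucket != null && !mounted.contains(bucket)) {
    mountBucket(bucket)                // stand-in for `alluxio fs mount ...`
    mounted += bucket
  }
}
```

Under this sketch, reading from s3://bucket1/xxx and then s3://bucket2/xxx would mount both buckets, each exactly once.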
Force-pushed from 43a06c9 to c32d5e7
Is this all working and ready for another review?
Yes, I think so.
docs/configs.md
Outdated
@@ -29,7 +29,10 @@ scala> spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)
Name | Description | Default Value
-----|-------------|--------------
<a name="alluxio.pathsToReplace"></a>spark.rapids.alluxio.pathsToReplace|List of paths to be replaced with the corresponding Alluxio scheme. E.g. when the config is set to "s3:/foo->alluxio://0.1.2.3:19998/foo,gcs:/bar->alluxio://0.1.2.3:19998/bar", it means: s3:/foo/a.csv will be replaced with alluxio://0.1.2.3:19998/foo/a.csv and gcs:/bar/b.csv will be replaced with alluxio://0.1.2.3:19998/bar/b.csv|None
<a name="alluxio.automount.enabled"></a>spark.rapids.alluxio.automount.enabled|Enable the feature of auto-mounting the cloud storage to Alluxio. It requires the Alluxio master to be on the same node as the Spark driver. When it's true, it requires the environment variable ALLUXIO_HOME to be set properly. The Alluxio master's host and port will be read from alluxio.master.hostname and alluxio.master.rpc.port (default: 19998) in ALLUXIO_HOME/conf/alluxio-site.properties; then a cloud path matching spark.rapids.alluxio.bucket.regex, like "s3://bar/b.csv", is replaced with "alluxio://0.1.2.3:19998/bar/b.csv", and the bucket "s3://bar" is mounted to "/bar" in Alluxio automatically.|false
How should one set ALLUXIO_HOME? For instance, does it need to be in spark-env.sh, or can it be set on the command line when launching spark-submit/spark-shell? In YARN cluster mode one might need to use spark.yarn.appMasterEnv..... I'm fine with leaving these details out, but it would be nice to put them in the docs at some point later.
Yes, I think both can work for ALLUXIO_HOME. We read it on the driver side, so ALLUXIO_HOME should be set in the driver environment. On Databricks, it's added into the environment variables under the Spark config. Let me update the doc with the suggested way to set it.
// And we'll append --option to set access_key and secret_key if they exist.
// Assume the keys don't exist when using something like Databricks' instance profile.
private def autoMountBucket(scheme: String, bucket: String,
access_key: Option[String],
indentation should be 4 spaces
Updated.
Signed-off-by: Gary Shen <gashen@nvidia.com>
Check both access key and secret
Update document to refer to auto mount section
Explain more about limitation
Use /bucket in mountedBucket to match fs mount output
Use camel case to name variable
Use URI to parse the fs mount output
Use logDebug
Write new functions to return the replaceFunc
Use URI to parse the scheme and bucket
Support to run the alluxio command without su by Process(String)
….scala Remove risk log
Update docs
Add a space in runAlluxioCmd
Signed-off-by: Gary Shen <gashen@nvidia.com>
Force-pushed from cf06c7a to e0174b1
Fix a bug in scheme replacement
Correct indent
build
// And we'll append --option to set access_key and secret_key if existing.
// Suppose the key doesn't exist when using like Databricks's instance profile
private def autoMountBucket(scheme: String, bucket: String,
access_key: Option[String],
indentation still off here, should be 4 spaces from the left
}

private def genFuncForPathReplacement(replaceMapOption: Option[Map[String, String]]
) : Option[Path => Path] = {
spacing off here as well
}

private def genFuncForAutoMountReplacement(conf: RapidsConf, relation: HadoopFsRelation,
alluxioBucketRegex: String) : Option[Path => Path] = {
spacing off
Mount the cloud bucket to Alluxio when the driver converts FileSourceScanExec to a GPU plan.
The Alluxio master should be on the same node as the Spark driver when using this feature.
Introduce new configs:
spark.rapids.alluxio.automount.enabled
spark.rapids.alluxio.bucket.regex
spark.rapids.alluxio.cmd
Signed-off-by: Gary Shen gashen@nvidia.com
Close #5872
Close #5890