[SPARK-7481] [build] Add spark-hadoop-cloud module to pull in object store access. #17834
Conversation
…azure, hadoop-aws, hadoop-openstack and implicitly their transitive dependencies are resolved. They don't verify all dependency setup, specifically that Jackson versions are consistent; that needs integration testing.
…eam tests which pulled in spark-cloud but not spark-hive were ending up with inconsistent versions. Add a test for the missing class being there too.
…s; link to remaining docs from top level.
…CP is set up consistently for the later versions of the AWS SDK
…ct references to AWS classes failing. Cut them and rely on transitive load through FS class instantiation to force the load. All that happens is that failures to link will be slightly less easy to debug.
…ack JARs, transitive dependencies on aws and azure SDKs
…ranch-2, s3 ireland
… as well as 2.7, keeping azure the 2.7.x dependency. All dependencies are scoped @ hadoop.scope

Change-Id: I80bd95fd48e21cf2eb4d94907ac99081cd3bd375
…urity, spark-specific options + links elsewhere

Change-Id: I7e9efe20d116802a403af875b241b91178078d78
OK, but does this address the comments from the other PR? It didn't look like it (yet) from a quick look.
Test build #76390 has finished for PR 17834 at commit
The last one was on all the doc comments, and I believe I've addressed them, both by fixing the little typos and by focusing the docs on the main points for Spark users: how stores differ from filesystems, and what that means in practice. The big issue for Spark users is "the commit problem", where I've listed the object store behaviours and said "this means things may not work; consult the docs". I'm not being explicit about what works and what doesn't, as that's a moving target. Right now, I don't trust commits to S3a using the V1 or V2 FileOutputCommitter algorithms to work 100% of the time, because they rely on a list consistency which Amazon S3 doesn't guarantee. I could make that a lot clearer, something like: "You cannot reliably use the FileOutputCommitter to commit work to Amazon S3 or OpenStack Swift unless there is some form of consistency layer on top." That is probably the core concept people need to know: it's not safe without something (EMR, S3Guard, Databricks Commit Service) to give you that consistent view. Then add pointers to the talks by myself and Eric Liang on the topic.
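The list-consistency hazard described here can be shown with a toy model. This is plain Python invented purely for illustration (the `LaggyStore` class and its `lag` parameter are not real S3 behaviour): a rename-based committer that discovers task output via a directory listing will silently drop files the listing hasn't yet caught up with.

```python
# Toy model of an eventually consistent object store listing. An
# illustration of the failure mode only, not a model of real S3 internals.

class LaggyStore:
    """Writes land immediately, but listings only see a key after `lag`
    subsequent list() calls have been made."""

    def __init__(self, lag=1):
        self.objects = {}     # key -> data: the store's true state
        self.visible = set()  # keys that list() can currently see
        self.pending = []     # [countdown, key] entries not yet listable
        self.lag = lag

    def put(self, key, data):
        self.objects[key] = data
        self.pending.append([self.lag, key])

    def list(self, prefix):
        # Age the pending entries; promote the ones whose delay expired.
        for entry in self.pending:
            entry[0] -= 1
        self.visible.update(k for c, k in self.pending if c <= 0)
        self.pending = [e for e in self.pending if e[0] > 0]
        return sorted(k for k in self.visible if k.startswith(prefix))

    def rename(self, src, dst):
        self.objects[dst] = self.objects.pop(src)
        self.visible.discard(src)
        self.visible.add(dst)

def commit_by_rename(store, attempt_dir, final_dir):
    # Sketch of a rename-based commit: list the task attempt directory,
    # rename whatever the listing reports into the final destination.
    for key in store.list(attempt_dir):
        store.rename(key, final_dir + key[len(attempt_dir):])

# Inconsistent listing: the commit silently loses both output files,
# even though both writes succeeded.
store = LaggyStore(lag=2)
store.put("attempt/part-0000", b"a")
store.put("attempt/part-0001", b"b")
commit_by_rename(store, "attempt/", "final/")
committed = [k for k in store.objects if k.startswith("final/")]
print("inconsistent store committed:", committed)  # -> []

# Consistent listing (lag=0): the identical commit works as intended.
store2 = LaggyStore(lag=0)
store2.put("attempt/part-0000", b"a")
store2.put("attempt/part-0001", b"b")
commit_by_rename(store2, "attempt/", "final/")
committed2 = sorted(k for k in store2.objects if k.startswith("final/"))
print("consistent store committed:", committed2)
```

The point of the toy: the commit step reports success in both cases; only the consistent listing actually moves the data, which is why "it works in my test" is no guarantee against this failure mode.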
OK, still a few comments, but yes, you're right, most were addressed.
assembly/pom.xml
Outdated
   Pull in spark-hadoop-cloud and its associated JARs,
-->
<profile>
  <id>cloud</id>
Continuing from #12004 (comment): I still think this should be `hadoop-cloud`, just like the artifact name is `spark-hadoop-cloud`. The profile name itself doesn't cause an artifact to be generated called `hadoop-cloud`.
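For illustration, the distinction being made here, sketched from the fragments quoted in this review (the profile and artifact ids are the ones under discussion; the surrounding POM structure is abbreviated):

```xml
<!-- assembly/pom.xml: the profile id is only a build switch (-P<id>);
     it never names a published artifact. -->
<profile>
  <id>hadoop-cloud</id>
  <!-- dependencies pulled into the assembly when the profile is active -->
</profile>

<!-- hadoop-cloud/pom.xml: the artifact id is what actually gets published. -->
<artifactId>spark-hadoop-cloud_2.11</artifactId>
```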
pom.xml
Outdated
   the AWS module pulls in jackson; its transitive dependencies can create
   intra-jackson-module version problems.
-->
<dependency>
Continuing #12004 (comment): I think it's useful to put things in the parent when they're used across multiple modules. For something only referenced in one child POM, it is probably less overhead to just put the dependency in the child rather than split and repeat part of it.
OK, I'll do that. It's easily enough revisited in future.
The one thing that will probably need revisiting is the jackson dependency, as that's got the highest risk that someone will add it to another module. Though as it is driven by the same `jackson.version` variable as everywhere else, a duplicate declaration elsewhere isn't going to cause any harm.
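A sketch of what that looks like in the child POM. The `jackson-dataformat-cbor` coordinates here are an assumption based on the jackson-cbor mention later in this thread; the point is the shared `${jackson.version}` property:

```xml
<!-- hadoop-cloud/pom.xml: declared only in the child module, but the
     version is still governed by the parent's jackson.version property,
     so a duplicate declaration in another module resolves identically. -->
<dependency>
  <groupId>com.fasterxml.jackson.dataformat</groupId>
  <artifactId>jackson-dataformat-cbor</artifactId>
  <version>${jackson.version}</version>
</dependency>
```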
cloud/pom.xml
Outdated
@@ -0,0 +1,106 @@
<?xml version="1.0" encoding="UTF-8"?>
Oh, just noticed this -- should this perhaps be in a directory `hadoop-cloud` to match the profile and module name?
OK
docs/cloud-integration.md
Outdated
This uses the "version 2" algorithm for committing files, which does less
renaming than the "version 1" algorithm, though as it still uses `rename()`
to commit files, it may be unsafe to use.
It's hard to give advice here, as we discussed. It's possible to link to external documentation. Here, however, it sounds like you're saying version 2 is recommended, then immediately saying it could be unsafe to use. What does the reader make of that? What's the upside? What's the nature and likelihood of failure?
If there's no clear answer, is this really a reliable recommendation?
Should this perhaps call out the issue and link to a larger discussion instead?
I'll try to think of a better phrasing, something like "if your object store is consistent enough, use v2 for speed".
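For reference, the option under discussion is Hadoop's committer algorithm setting, which Spark users would pass through via the `spark.hadoop.` configuration prefix; a sketch of selecting v2, sensible only when the store's listings are consistent:

```
# spark-defaults.conf (or --conf on spark-submit)
# use the v2 commit algorithm only if the object store gives a
# consistent listing view of written data
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
```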
* module is renamed hadoop-cloud in POMs, sbt, docs
* hadoop-aws/azure/openstack declarations pushed down to the hadoop-cloud pom, along with jackson-cbor
* docs around the commit algorithm option make clear that you should only worry about v1 vs v2 if the blobstore is consistent

Change-Id: Ia114bc8cd2ef731d54a83774d9dc2cf9e4c6e7d4
I've just pushed up an update which changes the module name; tested in Maven and SBT; hadoop cloud JAR dependencies pulled down. A JAR is created; it's just a stub one. As a result, when you build the assembly with -Phadoop-cloud, you get something from Spark mixed in with the hadoop modules.
Like I've said before, that's a bit dangerous, especially with a hadoop-cloud-projects POM module now upstream. We won't ever produce a hadoop-cloud-projects JAR, so there won't be a direct conflict, more the potential for confusion if people see a JAR beginning hadoop-* with different version info.
Test build #76426 has finished for PR 17834 at commit
That's because you also changed the artifact name to …
OK, now I understand. Let me revert that bit of the patch.
…oud. Added an extra callout to the docs: "without some form of consistency layer, Amazon S3 cannot be safely used as the direct destination of work with the normal rename-based committer". That holds for all Spark-in-EC2 deployments; EMR has a consistency option, as do/will others; different committers can work reliably.

Change-Id: Ibbf8b1b9de10b5bb83b647cc663ceb970f70ff2d

Testing: SBT and Maven, with & without the hadoop-2.7 option.
Test build #76457 has finished for PR 17834 at commit
Just some minor changes now; this looks fine.
docs/cloud-integration.md
Outdated
* Rename operations may be very slow and, on failure, leave the store in an unknown state.
* Seeking within a file may require new HTTP calls, hurting performance.

How does affect Spark?
Just noticed a small typo: How does this affect Spark?
Fixed: "how does this affect Spark?"
docs/cloud-integration.md
Outdated
They cannot be used as a direct replacement for a cluster filesystem such as HDFS
*except where this is explicitly stated*.

Key differences are
Nit: I'd end the line with a colon to make it clear it's not dangling
docs/cloud-integration.md
Outdated
### Important: Cloud Object Stores are Not Real Filesystems

While the stores appear to be filesystems, underneath
they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
Now that we're down to tiny nits, I'd link everywhere to https:// URLs
Good catch. Done throughout the file, except in the Apache license header. Rendered the HTML and clicked through the links.
hadoop-cloud/pom.xml
Outdated
<artifactId>spark-hadoop-cloud_2.11</artifactId>
<packaging>jar</packaging>
<name>Spark Project Cloud Integration</name>
Maybe "Spark Project Cloud Object Store Integration for Hadoop"? I feel like getting "Hadoop" in there is useful, to somehow emphasize this is adding connectivity from HDFS APIs to object stores in particular.
As it works just as well standalone & on mesos, I'll try " through Hadoop Libraries" & see how that reads.
…uple of superfluous blank lines

Change-Id: Iee9f0e0527de7bb875d1c2a805a0847702bb4e11
Test build #76492 has finished for PR 17834 at commit
Merged to master.
Author: Steve Loughran <stevel@apache.org>
Author: Steve Loughran <stevel@hortonworks.com>

Closes apache#17834 from steveloughran/cloud/SPARK-7481-current.
… `jvm-profiler` modules

### What changes were proposed in this pull request?

This PR aims to fix `dev/scalastyle` to check the `hadoop-cloud` and `jvm-profiler` modules. Also, the detected scalastyle issues are fixed.

### Why are the changes needed?

To prevent future scalastyle issues. A Scala style violation was introduced here, but we missed it because we didn't check all optional modules.

- #46022 The `jvm-profiler` module was added newly at Apache Spark 4.0.0, but we missed adding it to `dev/scalastyle`. Note that there were no Scala style issues in that module at that time.
- #44021 The `hadoop-cloud` module was added at Apache Spark 2.3.0.
- #17834

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly revised `dev/scalastyle`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46376 from dongjoon-hyun/SPARK-48127.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.

It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.

(This is the successor to #12004; I can't re-open it.)

How was this patch tested?

Downstream tests exist in https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples

Those verify that the dependencies are sufficient to allow downstream applications to work with the s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.

Manually: clean build & verify that the assembly contains the relevant aws-* and hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`

Maven build: `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.