FileSystem extension skeleton #787
Conversation
/// An abstract base class for a fairly generic filesystem. It may be implemented as a distributed filesystem, or
/// as a "local" one that reflects the locally-connected disk. The local version exists for small Hadoop instances
/// and for testing.
///
/// All user code that may potentially use the Hadoop Distributed File System should be written to use a FileSystem
/// object. The Hadoop DFS is a multi-machine system that appears as a single disk. It's useful because of its fault
/// tolerance and potentially very large capacity.
/// </summary>
Please reformat to keep each line within the 110 character limit
Oops, I thought we increased it to 120 for some reason. Fixed.
src/csharp/Extensions/Microsoft.Spark.Extensions.FileSystem/FileSystem.cs (outdated; resolved)
src/csharp/Extensions/Microsoft.Spark.Extensions.FileSystem/FileSystem.cs (outdated; resolved)
JvmObjectReference hadoopConfiguration = (JvmObjectReference)
    ((IJvmObjectReferenceProvider)sparkContext).Reference.Invoke("hadoopConfiguration");

return new JvmReferenceFileSystem(
Why do we need `JvmReferenceFileSystem` to encapsulate the JVM object? Why can't we do that within `FileSystem` itself?
In the JVM implementation, `FileSystem` is an abstract class. I wanted to keep that same pattern here.
/// Delete a file.
/// </summary>
/// <param name="path">The path to delete.</param>
/// <param name="recursive">If path is a directory and set to true, the directory is deleted else throws an
Exceeding 110 character limit.
Fixed.
src/csharp/Extensions/Microsoft.Spark.Extensions.FileSystem/JvmReferenceFileSystem.cs (outdated; resolved)
/// <summary>
/// Constants related to the FileSystem test suite.
/// </summary>
internal class Constants
Is this file needed? What else are we expecting to be added here?
This is the pattern we're using in all the test projects.
Assert.True(fs.Delete(path, true));
Assert.False(fs.Delete(path, true));
    }
}
Can we add another test that validates the `Delete` API by checking if the file got deleted from the file system?
Changed `TestSignatures()` to only test the signatures, and moved the functional testing into a new test that validates that the file is properly deleted.
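A functional test along those lines might look roughly like this sketch (assumes xUnit, the repo's `TemporaryDirectory` test helper, and a `_spark` session fixture; the test name and file contents are illustrative):

```csharp
[Fact]
public void TestFileIsProperlyDeleted()
{
    using var tempDirectory = new TemporaryDirectory();
    using FileSystem fs = FileSystem.Get(_spark.SparkContext);

    // Create a real file, then verify Delete() both reports success
    // and actually removes the file from the local file system.
    string path = Path.Combine(tempDirectory.Path, "file.txt");
    File.WriteAllText(path, "contents");

    Assert.True(fs.Delete(path, recursive: false));
    Assert.False(File.Exists(path));
}
```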
<PropertyGroup>
  <TargetFramework>netcoreapp3.1</TargetFramework>
  <IsPackable>false</IsPackable>
Why are we setting this to false?
Hmmm, I just copied this from the other extension E2ETest projects. I will remove it everywhere.
...oft.Spark.Extensions.FileSystem.E2ETest/Microsoft.Spark.Extensions.FileSystem.E2ETest.csproj (outdated; resolved)
/// </summary>
/// <param name="sparkContext">The SparkContext whose configuration will be used.</param>
/// <returns>The FileSystem.</returns>
public static FileSystem Get(SparkContext sparkContext)
This is weird. Is this like a factory? How can I create a new type of FileSystem?
Why not just expose the Hadoop FileSystem directly?
This is the pattern from the Hadoop `FileSystem` class: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#get-org.apache.hadoop.conf.Configuration-
`FileSystem` is an abstract class with static `get()` factory methods that return concrete implementations based on the configuration parameters.
For my .NET implementation, I've added an override of `Get()` that takes `SparkContext` so that we don't have to expose `SparkContext.hadoopConfiguration`.
If we expose the `Configuration` class in the future, we can expose `SparkContext.hadoopConfiguration` and add an override `FileSystem Get(Configuration conf)`.
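For illustration, user code against the `Get(SparkContext)` overload described in this thread might look like the following sketch (assumes a live Spark session; the namespace reflects the `Hadoop` rename discussed later in this review):

```csharp
// Sketch only: assumes a running Spark session and the API shape
// proposed in this PR (FileSystem.Get taking a SparkContext).
using Microsoft.Spark.Extensions.Hadoop.FileSystem;
using Microsoft.Spark.Sql;

SparkSession spark = SparkSession.Builder().GetOrCreate();

// Get() reads the SparkContext's Hadoop configuration and returns the
// configured concrete FileSystem implementation.
using FileSystem fs = FileSystem.Get(spark.SparkContext);
```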
If you are mimicking the Hadoop `FileSystem`, shall we follow the same signature (expose a minimal `Configuration`)? Also, should we add `Hadoop` to the namespace? (And let's add the link to the comment as well.)
Added `Hadoop` to the namespace.
For this PR, I just wanted to provide an MVP skeleton so that it would be easy for community members to contribute APIs in additional PRs. I'm thinking that we can invite others to contribute `Configuration` if they want to. Thoughts?
Looks like we cannot define the `Configuration` class in the extension package, since `SparkContext.hadoopConfiguration` will be inside `Microsoft.Spark`.
What if we add a `Hadoop` directory under https://github.com/dotnet/spark/tree/master/src/csharp/Microsoft.Spark and add `FileSystem.cs` and `Configuration.cs`? Note that we don't have to expose any of the APIs for `Configuration`. I just want to be able to create a `FileSystem` via `FileSystem.Get(sparkContext.HadoopConfiguration)`. Since we are at 1.0, we want to avoid breaking public APIs if possible.
Thoughts @rapoth? I know you wanted to put `FileSystem` in an extension.
…mReferenceFileSystem.cs Co-authored-by: Niharika Dutta <nidutta@microsoft.com>
…leSystem.cs Co-authored-by: Niharika Dutta <nidutta@microsoft.com>
…leSystem.cs Co-authored-by: Niharika Dutta <nidutta@microsoft.com>
/// <summary>
/// <see cref="FileSystem"/> implementation that wraps a corresponding FileSystem object in the JVM.
/// </summary>
public class JvmReferenceFileSystem : FileSystem, IJvmObjectReferenceProvider
Why is `JvmReferenceFileSystem` public?
I am thinking that we should just put APIs into `FileSystem`, since `get` handles getting the right concrete class. In what scenarios do you see that we need a concrete implementation of `FileSystem` other than the Hadoop filesystem?
And just name the package `Microsoft.Spark.Extensions.Hadoop.FileSystem` so that there is no confusion; i.e., we are just wrapping `org.apache.hadoop.fs.FileSystem`, more or less.
Sure, I'll just make `FileSystem` concrete. I was just trying to keep it as similar to the JVM implementation as possible. But I suppose we won't have any POCO `FileSystem` implementations.
Made `FileSystem` concrete and removed `JvmReferenceFileSystem`.
Changed implementation to use `Configuration`.
A few small nits but otherwise LGTM. Thanks @AFFogarty !
src/csharp/Extensions/Microsoft.Spark.Extensions.Hadoop.FileSystem/FileSystem.cs (outdated; resolved)
/// Returns the configured FileSystem implementation.
/// </summary>
/// <param name="sparkContext">The SparkContext whose configuration will be used.</param>
/// <returns>The FileSystem.</returns>
nit: The FileSystem object.
Made a reference.
/// A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse.
/// </summary>
/// <returns>The Hadoop Configuration.</returns>
public Configuration HadoopConfiguration() =>
@imback82 Not sure if this should be a function or a property.
Since this is a `def` on the Scala side, this will be a method on the C# side. So the current approach is good.
LGTM (few minor comments), thanks @AFFogarty!
/// <returns>The FileSystem.</returns>
public static FileSystem Get(Configuration conf) =>
    new FileSystem((JvmObjectReference)SparkEnvironment.JvmBridge.CallStaticJavaMethod(
        "org.apache.hadoop.fs.FileSystem", "get", conf));
nit: I believe you need to break for each param.
Fixed
{
    using var tempDirectory = new TemporaryDirectory();

    using FileSystem fs = Assert.IsAssignableFrom<FileSystem>(
You can remove `IsAssignableFrom` here since it's not going to compile if it doesn't return an assignable `FileSystem`.
Removed.
@AFFogarty Can you also update the title/description now that this PR provides bindings to the Hadoop FileSystem?
Co-authored-by: Terry Kim <yuminkim@gmail.com>
…spark into anfog/1201_filesystem
src/csharp/TestApp/Program.cs (outdated)
using Microsoft.Spark.Extensions.Hadoop.FileSystem;
using Microsoft.Spark.Sql;
using System;
Can we remove this file?
Oops, didn't mean to commit that. Removed.
Addressed nits and updated description.
@AFFogarty can you check the failed tests if it's transient?
@imback82 I merged with
LGTM, thanks @AFFogarty!
This PR provides the skeleton for an extension that implements `FileSystem`. This PR only implements the `Delete()` API, which allows users to delete files. The rest of the APIs are intentionally omitted so that they can be implemented in future PRs.

This PR also provides the skeleton for `Configuration`, which is returned by a new method `SparkContext.HadoopConfiguration()`. This `Configuration` skeleton does not currently implement any methods, but can be passed to `FileSystem.Get()` to get the `FileSystem` for a given `SparkContext`.

Example 1: Constructing the `FileSystem` object

Example 2: Deleting a file

Example 3: Deleting a directory and its contents

This PR relates to #328.
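A sketch of what the three examples above could look like, assuming the final API shape from this PR (`SparkContext.HadoopConfiguration()` returning a `Configuration` that `FileSystem.Get()` accepts); the paths are illustrative:

```csharp
using Microsoft.Spark.Extensions.Hadoop.FileSystem;
using Microsoft.Spark.Sql;

SparkSession spark = SparkSession.Builder().GetOrCreate();

// Example 1: Constructing the FileSystem object from the SparkContext's
// Hadoop Configuration.
using FileSystem fs = FileSystem.Get(spark.SparkContext.HadoopConfiguration());

// Example 2: Deleting a single file (recursive is false).
fs.Delete("hdfs://path/to/file.txt", recursive: false);

// Example 3: Deleting a directory and its contents (recursive is true).
fs.Delete("hdfs://path/to/directory", recursive: true);
```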