FileSystem extension skeleton #787

Merged · 33 commits · Jan 21, 2021

Conversation

@AFFogarty (author) commented Dec 2, 2020:

This PR provides the skeleton for an extension that implements FileSystem. It only implements the Delete() API, which allows users to delete files; the remaining APIs are intentionally omitted so that they can be implemented in future PRs.

This PR also provides the skeleton for Configuration, which is returned by a new method, SparkContext.HadoopConfiguration(). The Configuration skeleton does not currently implement any methods of its own, but it can be passed to FileSystem.Get() to obtain the FileSystem for a given SparkContext.

Example 1: Constructing the FileSystem object

SparkSession spark = ...

FileSystem fs = FileSystem.Get(spark.SparkContext.HadoopConfiguration());

Example 2: Deleting a file

FileSystem fs = ...

fs.Delete("abfss://mycontainer@mydatalake.dfs.core.windows.net/myfolder/myfile.csv");

Example 3: Deleting a directory and its contents

FileSystem fs = ...

fs.Delete("abfss://mycontainer@mydatalake.dfs.core.windows.net/myfolder/", true);
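Putting the three examples together, an illustrative end-to-end sketch (the storage account, container, and paths are placeholders; Delete() returns a bool indicating whether anything was removed, as the tests below exercise):

SparkSession spark = SparkSession.Builder().GetOrCreate();
FileSystem fs = FileSystem.Get(spark.SparkContext.HadoopConfiguration());

// Delete() returns false if there was nothing to delete.
bool deleted = fs.Delete(
    "abfss://mycontainer@mydatalake.dfs.core.windows.net/myfolder/",
    true);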

This PR relates to #328.

@AFFogarty AFFogarty changed the title [WIP] FileSystem implementation skeleton [WIP] FileSystem extension skeleton Dec 2, 2020
@AFFogarty AFFogarty changed the title [WIP] FileSystem extension skeleton FileSystem extension skeleton Dec 2, 2020
@AFFogarty AFFogarty marked this pull request as ready for review December 2, 2020 05:45
Comment on lines 12 to 19
/// An abstract base class for a fairly generic filesystem. It may be implemented as a distributed filesystem, or
/// as a "local" one that reflects the locally-connected disk. The local version exists for small Hadoop instances
/// and for testing.
///
/// All user code that may potentially use the Hadoop Distributed File System should be written to use a FileSystem
/// object. The Hadoop DFS is a multi-machine system that appears as a single disk. It's useful because of its fault
/// tolerance and potentially very large capacity.
/// </summary>
Collaborator:
Please reformat to keep each line within the 110 character limit

Author:
Oops, I thought we increased it to 120 for some reason. Fixed.

JvmObjectReference hadoopConfiguration = (JvmObjectReference)
((IJvmObjectReferenceProvider)sparkContext).Reference.Invoke("hadoopConfiguration");

return new JvmReferenceFileSystem(
Collaborator:
Why do we need JvmReferenceFileSystem to encapsulate the JVM object, why can't we do that within FileSystem itself?

Author:
In the JVM implementation, FileSystem is an abstract class. I wanted to keep that same pattern here.
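For illustration, the split being discussed looks roughly like the sketch below; the signatures and the JVM-bridge calls are assumptions based on the excerpts in this PR, not the exact code (and, as the later discussion shows, this shape was eventually replaced by a single concrete FileSystem):

// Abstract base mirroring org.apache.hadoop.fs.FileSystem.
public abstract class FileSystem
{
    public abstract bool Delete(string path, bool recursive);
}

// Concrete wrapper that holds the JVM-side FileSystem object.
public class JvmReferenceFileSystem : FileSystem, IJvmObjectReferenceProvider
{
    private readonly JvmObjectReference _jvmObject;

    internal JvmReferenceFileSystem(JvmObjectReference jvmObject) => _jvmObject = jvmObject;

    public JvmObjectReference Reference => _jvmObject;

    public override bool Delete(string path, bool recursive) =>
        // Assumes the string path is wrapped in an org.apache.hadoop.fs.Path before the JVM call.
        (bool)_jvmObject.Invoke(
            "delete",
            SparkEnvironment.JvmBridge.CallConstructor("org.apache.hadoop.fs.Path", path),
            recursive);
}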

/// Delete a file.
/// </summary>
/// <param name="path">The path to delete.</param>
/// <param name="recursive">If path is a directory and set to true, the directory is deleted else throws an
Collaborator:
Exceeding 110 character limit.

Author:
Fixed.

/// <summary>
/// Constants related to the FileSystem test suite.
/// </summary>
internal class Constants
Collaborator:
Is this file needed? What else are we expecting to be added here?

Author:
This is the pattern we're using in all the test projects.

Assert.True(fs.Delete(path, true));
Assert.False(fs.Delete(path, true));
}
}
Collaborator:
Can we add another test that validates the Delete API by checking if the file got deleted from the file system?

Author:
Changed TestSignatures() to only test the signatures, and moved the functional testing into a new test that validates that the file is properly deleted.
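A functional test along those lines might look roughly like this sketch; the _spark session field, the parquet write, and the method name are illustrative assumptions rather than the exact test added in this PR (System.IO and xunit usings omitted):

[Fact]
public void TestDeleteRemovesDirectory()
{
    using var tempDirectory = new TemporaryDirectory();
    string path = Path.Combine(tempDirectory.Path, "temp-table");

    // Write something to disk so there is a real directory to delete.
    _spark.Range(25).Write().Format("parquet").Save(path);
    Assert.True(Directory.Exists(path));

    using FileSystem fs = FileSystem.Get(_spark.SparkContext.HadoopConfiguration());
    Assert.True(fs.Delete(path, true));

    // The directory should be gone, and deleting it again should report false.
    Assert.False(Directory.Exists(path));
    Assert.False(fs.Delete(path, true));
}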


<PropertyGroup>
<TargetFramework>netcoreapp3.1</TargetFramework>
<IsPackable>false</IsPackable>
Collaborator:
Why are we setting this to false?

Author:
Hmmm, I just copied this from the other extension E2ETest projects. I will remove it everywhere.

/// </summary>
/// <param name="sparkContext">The SparkContext whose configuration will be used.</param>
/// <returns>The FileSystem.</returns>
public static FileSystem Get(SparkContext sparkContext)
Contributor:
This is weird. Is this like a factory? How can I create a new type of FileSystem?

Why not just expose the Hadoop FileSystem directly?

Author, Dec 3, 2020:
This is the pattern from the Hadoop FileSystem class: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#get-org.apache.hadoop.conf.Configuration-

FileSystem is an abstract class with static get() factory methods that return concrete implementations based on the configuration parameters.

For my .NET implementation, I've added an overload of Get() that takes SparkContext so that we don't have to expose SparkContext.hadoopConfiguration.

If we expose the Configuration class in the future, we can expose SparkContext.hadoopConfiguration and add an overload FileSystem Get(Configuration conf).

Contributor:
If you are mimicking the Hadoop FileSystem, shall we follow the same signature (expose a minimal Configuration)? Also, should we add Hadoop to the namespace? (And let's add the link to the comment as well.)

Author:
Added Hadoop to the namespace.

For this PR, I just wanted to provide an MVP skeleton so that it would be easy for community members to contribute APIs in additional PRs. I'm thinking that we can invite others to contribute Configuration if they want to. Thoughts?

Contributor:
Looks like we cannot define the Configuration class in the extension package, since SparkContext.hadoopConfiguration will be inside Microsoft.Spark.

What if we add a Hadoop directory under https://github.com/dotnet/spark/tree/master/src/csharp/Microsoft.Spark and add FileSystem.cs and Configuration.cs? Note that we don't have to expose any of the APIs for Configuration; I just want to be able to create a FileSystem via FileSystem.Get(sparkContext.HadoopConfiguration). Since we are at 1.0, we want to avoid breaking public APIs if possible.
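For illustration, a minimal Configuration along these lines could be little more than a holder for the JVM object reference; the constructor visibility and members below are assumptions, since the PR intentionally exposes no Configuration APIs yet:

/// <summary>
/// Thin wrapper around org.apache.hadoop.conf.Configuration; exposes no APIs of its own yet.
/// </summary>
public sealed class Configuration : IJvmObjectReferenceProvider
{
    private readonly JvmObjectReference _jvmObject;

    internal Configuration(JvmObjectReference jvmObject) => _jvmObject = jvmObject;

    public JvmObjectReference Reference => _jvmObject;
}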

Author:
Thoughts @rapoth? I know you wanted to put FileSystem in an extension.

Author:
Synced with @rapoth. We will go with @imback82's approach.

/// <summary>
/// <see cref="FileSystem"/> implementation that wraps a corresponding FileSystem object in the JVM.
/// </summary>
public class JvmReferenceFileSystem : FileSystem, IJvmObjectReferenceProvider
Contributor:
Why is JvmReferenceFileSystem public?

Contributor:
I am thinking that we should just put the APIs into FileSystem, since Get() handles getting the right concrete class. In what scenarios do you see us needing a concrete implementation of FileSystem other than the Hadoop FileSystem?

Contributor:
And just name the package as Microsoft.Spark.Extensions.Hadoop.FileSystem so that there is no confusion; i.e., we are just wrapping org.apache.hadoop.fs.FileSystem more or less.

Author:
Sure, I'll just make FileSystem concrete. I was just trying to keep it as similar to the JVM implementation as possible. But I suppose we won't have any POCO FileSystem implementations.

Author:
Made FileSystem concrete and removed JvmReferenceFileSystem.

Author:
Changed implementation to use Configuration.

@Niharikadutta (Collaborator) left a comment:
A few small nits, but otherwise LGTM. Thanks @AFFogarty!

/// Returns the configured FileSystem implementation.
/// </summary>
/// <param name="sparkContext">The SparkContext whose configuration will be used.</param>
/// <returns>The FileSystem.</returns>
Collaborator:
nit: The FileSystem object.

Author:
Made a reference.

/// A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse.
/// </summary>
/// <returns>The Hadoop Configuration.</returns>
public Configuration HadoopConfiguration() =>
Author:
@imback82 Not sure if this should be a function or a property.

Contributor:
Since this is a def on the Scala side, this will be a method on the C# side. So the current approach is good.
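That is, HadoopConfiguration() stays a method on the C# side that simply forwards to the Scala def, roughly as sketched here (the Configuration constructor and the _jvmObject field name are assumptions):

public Configuration HadoopConfiguration() =>
    // _jvmObject is the SparkContext's JVM-side reference; a Configuration
    // constructor that wraps a JvmObjectReference is assumed.
    new Configuration((JvmObjectReference)_jvmObject.Invoke("hadoopConfiguration"));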

@imback82 (Contributor) left a comment:
LGTM (few minor comments), thanks @AFFogarty!

/// <returns>The FileSystem.</returns>
public static FileSystem Get(Configuration conf) =>
new FileSystem((JvmObjectReference)SparkEnvironment.JvmBridge.CallStaticJavaMethod(
"org.apache.hadoop.fs.FileSystem", "get", conf));
Contributor:
nit: I believe you need to break for each param.

Author:
Fixed
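For reference, the call with each argument broken onto its own line reads roughly:

public static FileSystem Get(Configuration conf) =>
    new FileSystem((JvmObjectReference)SparkEnvironment.JvmBridge.CallStaticJavaMethod(
        "org.apache.hadoop.fs.FileSystem",
        "get",
        conf));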

{
using var tempDirectory = new TemporaryDirectory();

using FileSystem fs = Assert.IsAssignableFrom<FileSystem>(
Contributor:
You can remove IsAssignableFrom here since it's not going to compile if the call doesn't return an assignable FileSystem.

Author:
Removed.
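For reference, the simplified line just uses the factory result directly (illustrative; _spark stands in for the test's SparkSession):

using FileSystem fs = FileSystem.Get(_spark.SparkContext.HadoopConfiguration());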

@imback82 (Contributor) commented Jan 5, 2021:
@AFFogarty Can you also update the title/description now that this PR provides bindings to the Hadoop FileSystem?

using Microsoft.Spark.Extensions.Hadoop.FileSystem;
using Microsoft.Spark.Sql;
using System;

Contributor:
Can we remove this file?

Author:
Oops, didn't mean to commit that. Removed.

@AFFogarty (author):
Addressed nits and updated description.

@imback82 (Contributor) commented Jan 6, 2021:
@AFFogarty can you check the failed tests to see whether the failures are transient?

@AFFogarty (author):
@imback82 I merged with master and now the tests are passing.

@imback82 (Contributor) left a comment:
LGTM, thanks @AFFogarty!
