FileSystem extension skeleton #787

Merged · 33 commits · Jan 21, 2021

Conversation

@AFFogarty (author) commented Dec 2, 2020:

This PR provides the skeleton for an extension that implements FileSystem. It only implements the Delete() API, which allows users to delete files; the remaining APIs are intentionally omitted so that they can be implemented in future PRs.

This PR also provides the skeleton for Configuration, which is returned by a new method, SparkContext.HadoopConfiguration(). The Configuration skeleton does not currently implement any methods of its own, but it can be passed to FileSystem.Get() to obtain the FileSystem for a given SparkContext.

Example 1: Constructing the FileSystem object

SparkSession spark = ...

FileSystem fs = FileSystem.Get(spark.SparkContext.HadoopConfiguration());

Example 2: Deleting a file

FileSystem fs = ...

fs.Delete("abfss://mycontainer@mydatalake.dfs.core.windows.net/myfolder/myfile.csv");

Example 3: Deleting a directory and its contents

FileSystem fs = ...

fs.Delete("abfss://mycontainer@mydatalake.dfs.core.windows.net/myfolder/", true);
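Putting the three examples together, an illustrative end-to-end sketch (the storage account, container, and paths are placeholders; Delete() returns a bool indicating whether anything was removed, as the tests below exercise):

SparkSession spark = SparkSession.Builder().GetOrCreate();
FileSystem fs = FileSystem.Get(spark.SparkContext.HadoopConfiguration());

// Delete() returns false if there was nothing to delete.
bool deleted = fs.Delete(
    "abfss://mycontainer@mydatalake.dfs.core.windows.net/myfolder/",
    true);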

This PR relates to #328.

@AFFogarty AFFogarty changed the title [WIP] FileSystem implementation skeleton [WIP] FileSystem extension skeleton Dec 2, 2020
@AFFogarty AFFogarty changed the title [WIP] FileSystem extension skeleton FileSystem extension skeleton Dec 2, 2020
@AFFogarty AFFogarty marked this pull request as ready for review December 2, 2020 05:45
Comment on lines 12 to 19
/// An abstract base class for a fairly generic filesystem. It may be implemented as a distributed filesystem, or
/// as a "local" one that reflects the locally-connected disk. The local version exists for small Hadoop instances
/// and for testing.
///
/// All user code that may potentially use the Hadoop Distributed File System should be written to use a FileSystem
/// object. The Hadoop DFS is a multi-machine system that appears as a single disk. It's useful because of its fault
/// tolerance and potentially very large capacity.
/// </summary>
Collaborator:
Please reformat to keep each line within the 110 character limit

Author:
Oops, I thought we increased it to 120 for some reason. Fixed.

JvmObjectReference hadoopConfiguration = (JvmObjectReference)
((IJvmObjectReferenceProvider)sparkContext).Reference.Invoke("hadoopConfiguration");

return new JvmReferenceFileSystem(
Collaborator:
Why do we need JvmReferenceFileSystem to encapsulate the JVM object, why can't we do that within FileSystem itself?

Author:
In the JVM implementation, FileSystem is an abstract class. I wanted to keep that same pattern here.
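For illustration, the split being discussed looks roughly like the sketch below; the signatures and the JVM-bridge calls are assumptions based on the excerpts in this PR, not the exact code (and, as the later discussion shows, this shape was eventually replaced by a single concrete FileSystem):

// Abstract base mirroring org.apache.hadoop.fs.FileSystem.
public abstract class FileSystem
{
    public abstract bool Delete(string path, bool recursive);
}

// Concrete wrapper that holds the JVM-side FileSystem object.
public class JvmReferenceFileSystem : FileSystem, IJvmObjectReferenceProvider
{
    private readonly JvmObjectReference _jvmObject;

    internal JvmReferenceFileSystem(JvmObjectReference jvmObject) => _jvmObject = jvmObject;

    public JvmObjectReference Reference => _jvmObject;

    public override bool Delete(string path, bool recursive) =>
        // Assumes the string path is wrapped in an org.apache.hadoop.fs.Path before the JVM call.
        (bool)_jvmObject.Invoke(
            "delete",
            SparkEnvironment.JvmBridge.CallConstructor("org.apache.hadoop.fs.Path", path),
            recursive);
}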

/// Delete a file.
/// </summary>
/// <param name="path">The path to delete.</param>
/// <param name="recursive">If path is a directory and set to true, the directory is deleted else throws an
Collaborator:
Exceeding 110 character limit.

Author:
Fixed.

/// <summary>
/// Constants related to the FileSystem test suite.
/// </summary>
internal class Constants
Collaborator:
Is this file needed? What else are we expecting to be added here?

Author:
This is the pattern we're using in all the test projects.

Assert.True(fs.Delete(path, true));
Assert.False(fs.Delete(path, true));
}
}
Collaborator:
Can we add another test that validates the Delete API by checking if the file got deleted from the file system?

Author:
Changed TestSignatures() to only test the signatures, and moved the functional testing into a new test that validates that the file is properly deleted.
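A functional test along those lines might look roughly like this sketch; the _spark session field, the parquet write, and the method name are illustrative assumptions rather than the exact test added in this PR (System.IO and xunit usings omitted):

[Fact]
public void TestDeleteRemovesDirectory()
{
    using var tempDirectory = new TemporaryDirectory();
    string path = Path.Combine(tempDirectory.Path, "temp-table");

    // Write something to disk so there is a real directory to delete.
    _spark.Range(25).Write().Format("parquet").Save(path);
    Assert.True(Directory.Exists(path));

    using FileSystem fs = FileSystem.Get(_spark.SparkContext.HadoopConfiguration());
    Assert.True(fs.Delete(path, true));

    // The directory should be gone, and deleting it again should report false.
    Assert.False(Directory.Exists(path));
    Assert.False(fs.Delete(path, true));
}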


<PropertyGroup>
<TargetFramework>netcoreapp3.1</TargetFramework>
<IsPackable>false</IsPackable>
Collaborator:
Why are we setting this to false?

Author:
Hmmm, I just copied this from the other extension E2ETest projects. I will remove it everywhere.

/// </summary>
/// <param name="sparkContext">The SparkContext whose configuration will be used.</param>
/// <returns>The FileSystem.</returns>
public static FileSystem Get(SparkContext sparkContext)
Contributor:
This is weird. Is this like a factory? How can I create a new type of FileSystem?

Why not just expose the Hadoop FileSystem directly?

Author, Dec 3, 2020:
This is the pattern from the Hadoop FileSystem class: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#get-org.apache.hadoop.conf.Configuration-

FileSystem is an abstract class with static get() factory methods that return concrete implementations based on the configuration parameters.

For my .NET implementation, I've added an overload of Get() that takes SparkContext so that we don't have to expose SparkContext.hadoopConfiguration.

If we expose the Configuration class in the future, we can expose SparkContext.hadoopConfiguration and add an overload FileSystem Get(Configuration conf).

Contributor:
If you are mimicking the Hadoop FileSystem, shall we follow the same signature (expose a minimal Configuration)? Also, should we add Hadoop to the namespace? (And let's add the link to the comment as well.)

Author:
Added Hadoop to the namespace.

For this PR, I just wanted to provide an MVP skeleton so that it would be easy for community members to contribute APIs in additional PRs. I'm thinking that we can invite others to contribute Configuration if they want to. Thoughts?

Contributor:
Looks like we cannot define the Configuration class in the extension package, since SparkContext.hadoopConfiguration will be inside Microsoft.Spark.

What if we add a Hadoop directory under https://github.com/dotnet/spark/tree/master/src/csharp/Microsoft.Spark and add FileSystem.cs and Configuration.cs? Note that we don't have to expose any of the APIs for Configuration; I just want to be able to create a FileSystem via FileSystem.Get(sparkContext.HadoopConfiguration). Since we are at 1.0, we want to avoid breaking public APIs if possible.
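For illustration, a minimal Configuration along these lines could be little more than a holder for the JVM object reference; the constructor visibility and members below are assumptions, since the PR intentionally exposes no Configuration APIs yet:

/// <summary>
/// Thin wrapper around org.apache.hadoop.conf.Configuration; exposes no APIs of its own yet.
/// </summary>
public sealed class Configuration : IJvmObjectReferenceProvider
{
    private readonly JvmObjectReference _jvmObject;

    internal Configuration(JvmObjectReference jvmObject) => _jvmObject = jvmObject;

    public JvmObjectReference Reference => _jvmObject;
}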

Author:
Thoughts @rapoth? I know you wanted to put FileSystem in an extension.

Author:
Synced with @rapoth. We will go with @imback82's approach.

/// <summary>
/// <see cref="FileSystem"/> implementation that wraps a corresponding FileSystem object in the JVM.
/// </summary>
public class JvmReferenceFileSystem : FileSystem, IJvmObjectReferenceProvider
Contributor:
Why is JvmReferenceFileSystem public?

Contributor:
I am thinking that we should just put the APIs into FileSystem, since Get() handles getting the right concrete class. In what scenarios do you see us needing a concrete implementation of FileSystem other than the Hadoop FileSystem?

Contributor:
And just name the package as Microsoft.Spark.Extensions.Hadoop.FileSystem so that there is no confusion; i.e., we are just wrapping org.apache.hadoop.fs.FileSystem more or less.

Author:
Sure, I'll just make FileSystem concrete. I was just trying to keep it as similar to the JVM implementation as possible. But I suppose we won't have any POCO FileSystem implementations.

Author:
Made FileSystem concrete and removed JvmReferenceFileSystem.

Author:
Changed implementation to use Configuration.

@Niharikadutta (Collaborator) left a comment:
A few small nits, but otherwise LGTM. Thanks @AFFogarty!

/// Returns the configured FileSystem implementation.
/// </summary>
/// <param name="sparkContext">The SparkContext whose configuration will be used.</param>
/// <returns>The FileSystem.</returns>
Collaborator:
nit: The FileSystem object.

Author:
Made a reference.

/// A default Hadoop Configuration for the Hadoop code (e.g. file systems) that we reuse.
/// </summary>
/// <returns>The Hadoop Configuration.</returns>
public Configuration HadoopConfiguration() =>
Author:
@imback82 Not sure if this should be a function or a property.

Contributor:
Since this is a def on the Scala side, this will be a method on the C# side. So the current approach is good.
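That is, HadoopConfiguration() stays a method on the C# side that simply forwards to the Scala def, roughly as sketched here (the Configuration constructor and the _jvmObject field name are assumptions):

public Configuration HadoopConfiguration() =>
    // _jvmObject is the SparkContext's JVM-side reference; a Configuration
    // constructor that wraps a JvmObjectReference is assumed.
    new Configuration((JvmObjectReference)_jvmObject.Invoke("hadoopConfiguration"));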

@imback82 (Contributor) left a comment:
LGTM (few minor comments), thanks @AFFogarty!

/// <returns>The FileSystem.</returns>
public static FileSystem Get(Configuration conf) =>
new FileSystem((JvmObjectReference)SparkEnvironment.JvmBridge.CallStaticJavaMethod(
"org.apache.hadoop.fs.FileSystem", "get", conf));
Contributor:
nit: I believe you need to break for each param.

Author:
Fixed
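For reference, the call with each argument broken onto its own line reads roughly:

public static FileSystem Get(Configuration conf) =>
    new FileSystem((JvmObjectReference)SparkEnvironment.JvmBridge.CallStaticJavaMethod(
        "org.apache.hadoop.fs.FileSystem",
        "get",
        conf));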

{
using var tempDirectory = new TemporaryDirectory();

using FileSystem fs = Assert.IsAssignableFrom<FileSystem>(
Contributor:
You can remove IsAssignableFrom here since it's not going to compile if the call doesn't return an assignable FileSystem.

Author:
Removed.
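For reference, the simplified line just uses the factory result directly (illustrative; _spark stands in for the test's SparkSession):

using FileSystem fs = FileSystem.Get(_spark.SparkContext.HadoopConfiguration());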

@imback82 (Contributor) commented Jan 5, 2021:
@AFFogarty Can you also update the title/description now that this PR provides bindings to the Hadoop FileSystem?

using Microsoft.Spark.Extensions.Hadoop.FileSystem;
using Microsoft.Spark.Sql;
using System;

Contributor:
Can we remove this file?

Author:
Oops, didn't mean to commit that. Removed.

@AFFogarty (author):
Addressed nits and updated description.

@imback82 (Contributor) commented Jan 6, 2021:
@AFFogarty can you check the failed tests to see whether the failures are transient?

@AFFogarty (author):
@imback82 I merged with master and now the tests are passing.

@imback82 (Contributor) left a comment:
LGTM, thanks @AFFogarty!
