FileSystem extension skeleton #787

Merged: 33 commits, merged Jan 21, 2021. The diff below shows changes from 5 commits.

Commits:
7c25fcb - Basic FileSystem implementation (Dec 2, 2020)
cc8237f - Move to extension (Dec 2, 2020)
f4e141b - Todo (Dec 2, 2020)
bcde071 - Tests should run (Dec 2, 2020)
06d8b24 - Don't use interface (Dec 2, 2020)
dfad910 - Update src/csharp/Extensions/Microsoft.Spark.Extensions.FileSystem/Jv… (AFFogarty, Dec 3, 2020)
50bc611 - Remove IsPackable (Dec 3, 2020)
00d68a6 - Fixed: Line length 110 (Dec 3, 2020)
ec52232 - Update src/csharp/Extensions/Microsoft.Spark.Extensions.FileSystem/Fi… (AFFogarty, Dec 3, 2020)
4adbb64 - Update src/csharp/Extensions/Microsoft.Spark.Extensions.FileSystem/Fi… (AFFogarty, Dec 3, 2020)
16abe0a - Separate testing for signature and functionality (Dec 3, 2020)
3310897 - Merge (Dec 3, 2020)
d79e702 - Fixed: Test the type of FileSystem (Dec 3, 2020)
bd8579b - Fixed: Empty line (Dec 3, 2020)
a3bf508 - Rename to Hadoop.FileSystem (Dec 4, 2020)
4fc9718 - Rename references (Dec 4, 2020)
1bacdb2 - Move to correct dir (Dec 4, 2020)
53351a6 - Merge implementation with abstract class (Dec 4, 2020)
ac57039 - Better comment (Dec 4, 2020)
8a85d60 - Fixed: Line length (Dec 7, 2020)
9936e9f - Merge new files into existing project (Dec 11, 2020)
c896d63 - Fix references (Dec 11, 2020)
e1f33e8 - Fixed: return message (Dec 11, 2020)
a856829 - Merge branch 'master' of https://github.com/dotnet/spark into anfog/1… (Jan 5, 2021)
6433b3c - Update src/csharp/Microsoft.Spark/Hadoop/FS/FileSystem.cs (AFFogarty, Jan 5, 2021)
4b906e0 - Merge branch 'anfog/1201_filesystem' of https://github.com/AFFogarty/… (Jan 5, 2021)
dcf07d5 - Compiles (Jan 5, 2021)
e41e1ef - Casing fix -- part 1 (Jan 5, 2021)
f798618 - Casing fix -- part 2 (Jan 5, 2021)
8cc2378 - Don't need to test assignable (Jan 5, 2021)
74ce69d - Formatting (Jan 5, 2021)
257ce6c - Fixed: Don't commi test app (Jan 5, 2021)
9f92c7d - Merge branch 'master' of https://github.com/dotnet/spark into anfog/1… (Jan 21, 2021)
@@ -0,0 +1,14 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

namespace Microsoft.Spark.Extensions.FileSystem.E2ETest
{
/// <summary>
/// Constants related to the FileSystem test suite.
/// </summary>
internal class Constants
Collaborator:

Is this file needed? What else are we expecting to be added here?

Contributor Author:

This is the pattern we're using in all the test projects.

{
public const string FileSystemTestContainerName = "FileSystem Tests";
}
}
@@ -0,0 +1,17 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using Microsoft.Spark.E2ETest;
using Xunit;

namespace Microsoft.Spark.Extensions.FileSystem.E2ETest
{
[CollectionDefinition(Constants.FileSystemTestContainerName)]
public class FileSystemTestCollection : ICollectionFixture<SparkFixture>
{
// This class has no code, and is never created. Its purpose is simply
// to be the place to apply [CollectionDefinition] and all the
// ICollectionFixture<> interfaces.
}
}
@@ -0,0 +1,39 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System.IO;
using Microsoft.Spark.E2ETest;
using Microsoft.Spark.Sql;
using Microsoft.Spark.UnitTest.TestUtils;
using Xunit;

namespace Microsoft.Spark.Extensions.FileSystem.E2ETest
{
[Collection(Constants.FileSystemTestContainerName)]
public class FileSystemTests
{
private readonly SparkSession _spark;

public FileSystemTests(SparkFixture fixture)
{
_spark = fixture.Spark;
}

/// <summary>
/// Test that methods return the expected signature.
/// </summary>
[Fact]
public void TestSignatures()
{
using FileSystem fs = FileSystem.Get(_spark.SparkContext);

using var tempDirectory = new TemporaryDirectory();
string path = Path.Combine(tempDirectory.Path, "temp-table");
_spark.Range(25).Write().Format("parquet").Save(path);

Assert.True(fs.Delete(path, true));
Assert.False(fs.Delete(path, true));
}
}
Collaborator:

Can we add another test that validates the Delete API by checking if the file got deleted from the file system?

Contributor Author:

Changed TestSignatures() to only test the signatures, and moved the functional testing into a new test that validates that the file is properly deleted.

}
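
For reference, a functional delete test along the lines discussed above might look like the sketch below, added to the FileSystemTests class shown in this diff. It assumes the E2E suite writes to a local temporary directory (as TestSignatures does), so System.IO can observe the deletion; the test name and exact assertions are illustrative, not necessarily the code merged later in this PR.

[Fact]
public void TestDelete()
{
    using FileSystem fs = FileSystem.Get(_spark.SparkContext);

    using var tempDirectory = new TemporaryDirectory();
    string path = Path.Combine(tempDirectory.Path, "temp-table");
    _spark.Range(25).Write().Format("parquet").Save(path);

    // The parquet output exists on disk before the delete call.
    Assert.True(Directory.Exists(path));

    // Delete reports success, and the directory is gone afterwards.
    Assert.True(fs.Delete(path, recursive: true));
    Assert.False(Directory.Exists(path));
}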
@@ -0,0 +1,13 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<TargetFramework>netcoreapp3.1</TargetFramework>
<IsPackable>false</IsPackable>
Collaborator:

Why are we setting this to false?

Contributor Author:

Hmmm, I just copied this from the other extension E2ETest projects. I will remove it everywhere.

</PropertyGroup>

<ItemGroup>
<ProjectReference Include="..\..\Microsoft.Spark.E2ETest\Microsoft.Spark.E2ETest.csproj" />
<ProjectReference Include="..\..\Microsoft.Spark\Microsoft.Spark.csproj" />
<ProjectReference Include="..\Microsoft.Spark.Extensions.FileSystem\Microsoft.Spark.Extensions.FileSystem.csproj" />
</ItemGroup>
</Project>
@@ -0,0 +1,51 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System;
using Microsoft.Spark.Interop;
using Microsoft.Spark.Interop.Ipc;

namespace Microsoft.Spark.Extensions.FileSystem
{
/// <summary>
/// An abstract base class for a fairly generic filesystem. It may be implemented as a distributed filesystem, or
/// as a "local" one that reflects the locally-connected disk. The local version exists for small Hadoop instances
/// and for testing.
///
/// All user code that may potentially use the Hadoop Distributed File System should be written to use a FileSystem
/// object. The Hadoop DFS is a multi-machine system that appears as a single disk. It's useful because of its fault
/// tolerance and potentially very large capacity.
/// </summary>
Collaborator:

Please reformat to keep each line within the 110 character limit

Contributor Author:

Oops, I thought we increased it to 120 for some reason. Fixed.

public abstract class FileSystem : IDisposable
{
/// <summary>
/// Returns the configured FileSystem implementation.
/// </summary>
/// <param name="sparkContext">The SparkContext whose configuration will be used.</param>
/// <returns>The FileSystem.</returns>
public static FileSystem Get(SparkContext sparkContext)
Contributor:

This is weird. Is this like a factory? How can I create a new type of FileSystem?

Why not just expose the Hadoop FileSystem directly?

Contributor Author (AFFogarty, Dec 3, 2020):

This is the pattern from the Hadoop FileSystem class: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#get-org.apache.hadoop.conf.Configuration-

FileSystem is an abstract class with static get() factory methods that return concrete implementations based on the configuration parameters.

For my .NET implementation, I've added an override of Get() that takes SparkContext so that we don't have to expose SparkContext.hadoopConfiguration.

If we expose the class Configuration in the future, we can expose SparkContext.hadoopConfiguration and add an override FileSystem Get(Configuration conf).

Contributor:

If you are mimicking the hadoop FileSystem, shall we follow the same signature (expose minimal Configuration)? Also, we should add Hadoop to the namespace? (and let's add the link to the comment as well)

Contributor Author:

Added Hadoop to the namespace.

For this PR, I just wanted to provide an MVP skeleton so that it would be easy for community members to contribute APIs in additional PRs. I'm thinking that we can invite others to contribute Configuration if they want to. Thoughts?

Contributor:

Looks like we cannot define Configuration class in the extension package, since SparkContext.hadoopConfiguration will be inside Microsoft.Spark..

What if we add Hadoop directory under https://github.com/dotnet/spark/tree/master/src/csharp/Microsoft.Spark and add FileSystem.cs and Configuration.cs? Note that we don't have to expose any of the APIs for Configuration. I just want to be able to create FileSystem by FileSystem.Get(sparkContext.HadoopConfiguration). Since we are at 1.0, we want to avoid breaking public APIs if possible.

Contributor Author:

Thoughts @rapoth? I know you wanted to put FileSystem in an extension.

Contributor Author:

Synced with @rapoth. We will go with @imback82 's approach.

{
// TODO: Expose hadoopConfiguration as a .NET class and add an override for Get() that takes it.
JvmObjectReference hadoopConfiguration = (JvmObjectReference)
((IJvmObjectReferenceProvider)sparkContext).Reference.Invoke("hadoopConfiguration");

return new JvmReferenceFileSystem(
Collaborator:

Why do we need JvmReferenceFileSystem to encapsulate the JVM object, why can't we do that within FileSystem itself?

Contributor Author:

In the JVM implementation, FileSystem is an abstract class. I wanted to keep that same pattern here.

(JvmObjectReference)SparkEnvironment.JvmBridge.CallStaticJavaMethod(
"org.apache.hadoop.fs.FileSystem",
"get",
hadoopConfiguration));
}

/// <summary>
/// Delete a file.
/// </summary>
/// <param name="path">The path to delete.</param>
/// <param name="recursive">If path is a directory and set to true, the directory is deleted else throws an
Collaborator:

Exceeding 110 character limit.

Contributor Author:

Fixed.

/// exception. In case of a file the recursive can be set to either true or false.</param>
/// <returns>True if delete is successful else false.</returns>
public abstract bool Delete(string path, bool recursive = true);

public abstract void Dispose();
}
}
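
The thread above on Get(SparkContext) resolved to moving FileSystem into Microsoft.Spark (under a Hadoop directory, alongside a thin Configuration wrapper) so the factory can mirror Hadoop's FileSystem.get(Configuration). The sketch below illustrates that direction; the Configuration class, its namespace, and the SparkContext.HadoopConfiguration property are assumptions drawn from the discussion, not code present in this diff.

// Hypothetical sketch only: the shape discussed in the review thread, not the code in this diff.
using Microsoft.Spark.Interop.Ipc;

namespace Microsoft.Spark.Hadoop.Conf
{
    /// <summary>
    /// Thin wrapper around org.apache.hadoop.conf.Configuration. No members are exposed yet;
    /// it exists only so that FileSystem.Get can accept a Configuration, mirroring the Hadoop API.
    /// </summary>
    public sealed class Configuration : IJvmObjectReferenceProvider
    {
        private readonly JvmObjectReference _jvmObject;

        internal Configuration(JvmObjectReference jvmObject) => _jvmObject = jvmObject;

        JvmObjectReference IJvmObjectReferenceProvider.Reference => _jvmObject;
    }
}

// Intended usage, mirroring org.apache.hadoop.fs.FileSystem.get(Configuration):
//     using FileSystem fs = FileSystem.Get(sparkContext.HadoopConfiguration);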
@@ -0,0 +1,44 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using Microsoft.Spark.Interop;
using Microsoft.Spark.Interop.Ipc;

namespace Microsoft.Spark.Extensions.FileSystem
{
/// <summary>
/// <see cref="FileSystem"/> implementation that wraps a corresponding FileSystem object in the JVM.
/// </summary>
public class JvmReferenceFileSystem : FileSystem, IJvmObjectReferenceProvider
Contributor:

Why is JvmReference public?

Contributor:

I am thinking that we should just put APIs into FileSystem since get handles the getting the right concrete class. In what scenarios, do you see we need a concrete implementation of FileSystem other than hadoop filesystem?

Contributor:

And just name the package as Microsoft.Spark.Extensions.Hadoop.FileSystem so that there is no confusion; i.e., we are just wrapping org.apache.hadoop.fs.FileSystem more or less.

Contributor Author:

Sure, I'll just make FileSystem concrete. I was just trying to keep it as similar to the JVM implementation as possible. But I suppose we won't have any POCO FileSystem implementations.

Contributor Author:

Made FileSystem concrete and removed JvmReferenceFileSystem.

Contributor Author:

Changed implementation to use Configuration.

{
private readonly JvmObjectReference _jvmObject;

internal JvmReferenceFileSystem(JvmObjectReference jvmObject)
{
_jvmObject = jvmObject;
}

JvmObjectReference IJvmObjectReferenceProvider.Reference => _jvmObject;

/// <summary>
/// Delete a file.
/// </summary>
/// <param name="path">The path to delete.</param>
/// <param name="recursive">If path is a directory and set to true, the directory is deleted else throws an
Collaborator:

Exceeding 110 character limit

/// exception. In case of a file the recursive can be set to either true or false.</param>
/// <returns>True if delete is successful else false.</returns>
public override bool Delete(string path, bool recursive = true)
{
JvmObjectReference pathObject =
SparkEnvironment.JvmBridge.CallConstructor("org.apache.hadoop.fs.Path", path);

return (bool)_jvmObject.Invoke("delete", pathObject, recursive);
}

public override void Dispose()
{
_jvmObject.Invoke("close");
}
}
}
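
The threads above resolved to making FileSystem concrete and removing JvmReferenceFileSystem. The sketch below shows what that fold-in might look like, reusing the method bodies from this diff; the Get(Configuration) overload assumes the Configuration wrapper sketched earlier, and the namespaces and details of the merged code may differ.

// Illustrative sketch: FileSystem made concrete, absorbing JvmReferenceFileSystem's members.
using System;
using Microsoft.Spark.Interop;
using Microsoft.Spark.Interop.Ipc;

public class FileSystem : IDisposable, IJvmObjectReferenceProvider
{
    private readonly JvmObjectReference _jvmObject;

    internal FileSystem(JvmObjectReference jvmObject) => _jvmObject = jvmObject;

    JvmObjectReference IJvmObjectReferenceProvider.Reference => _jvmObject;

    /// <summary>Returns the configured FileSystem implementation.</summary>
    public static FileSystem Get(Configuration conf) =>
        new FileSystem(
            (JvmObjectReference)SparkEnvironment.JvmBridge.CallStaticJavaMethod(
                "org.apache.hadoop.fs.FileSystem",
                "get",
                ((IJvmObjectReferenceProvider)conf).Reference));

    /// <summary>Deletes a file or directory, recursively if requested.</summary>
    public bool Delete(string path, bool recursive = true)
    {
        JvmObjectReference pathObject =
            SparkEnvironment.JvmBridge.CallConstructor("org.apache.hadoop.fs.Path", path);

        return (bool)_jvmObject.Invoke("delete", pathObject, recursive);
    }

    public void Dispose() => _jvmObject.Invoke("close");
}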
@@ -0,0 +1,11 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<TargetFramework>netstandard2.0</TargetFramework>
</PropertyGroup>

<ItemGroup>
<ProjectReference Include="..\..\Microsoft.Spark\Microsoft.Spark.csproj" />
</ItemGroup>

</Project>
@@ -12,10 +12,12 @@

<ItemGroup>
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.Delta.E2ETest" />
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.FileSystem.E2ETest" />
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.Hyperspace.E2ETest" />
</ItemGroup>

<ItemGroup>
<ProjectReference Include="..\Extensions\Microsoft.Spark.Extensions.FileSystem\Microsoft.Spark.Extensions.FileSystem.csproj" />
<ProjectReference Include="..\Microsoft.Spark.E2ETest.ExternalLibrary\Microsoft.Spark.E2ETest.ExternalLibrary.csproj" />
<ProjectReference Include="..\Microsoft.Spark.Worker\Microsoft.Spark.Worker.csproj" />
<ProjectReference Include="..\Microsoft.Spark\Microsoft.Spark.csproj" />
14 changes: 14 additions & 0 deletions src/csharp/Microsoft.Spark.sln
@@ -39,6 +39,10 @@ Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "Microsoft.Spark.Extensions.
EndProject
Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "Microsoft.Spark.Extensions.Hyperspace.E2ETest", "Extensions\Microsoft.Spark.Extensions.Hyperspace.E2ETest\Microsoft.Spark.Extensions.Hyperspace.E2ETest.csproj", "{C6019E44-C777-4DE2-B70E-EA025B7D044D}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Microsoft.Spark.Extensions.FileSystem", "Extensions\Microsoft.Spark.Extensions.FileSystem\Microsoft.Spark.Extensions.FileSystem.csproj", "{30FC2E08-5944-47B1-9441-90D6D5D91896}"
EndProject
Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "Microsoft.Spark.Extensions.FileSystem.E2ETest", "Extensions\Microsoft.Spark.Extensions.FileSystem.E2ETest\Microsoft.Spark.Extensions.FileSystem.E2ETest.csproj", "{33DF254A-C238-40D8-9A15-B0CCC4A2CE95}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
@@ -101,6 +105,14 @@ Global
{C6019E44-C777-4DE2-B70E-EA025B7D044D}.Debug|Any CPU.Build.0 = Debug|Any CPU
{C6019E44-C777-4DE2-B70E-EA025B7D044D}.Release|Any CPU.ActiveCfg = Release|Any CPU
{C6019E44-C777-4DE2-B70E-EA025B7D044D}.Release|Any CPU.Build.0 = Release|Any CPU
{30FC2E08-5944-47B1-9441-90D6D5D91896}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{30FC2E08-5944-47B1-9441-90D6D5D91896}.Debug|Any CPU.Build.0 = Debug|Any CPU
{30FC2E08-5944-47B1-9441-90D6D5D91896}.Release|Any CPU.ActiveCfg = Release|Any CPU
{30FC2E08-5944-47B1-9441-90D6D5D91896}.Release|Any CPU.Build.0 = Release|Any CPU
{33DF254A-C238-40D8-9A15-B0CCC4A2CE95}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{33DF254A-C238-40D8-9A15-B0CCC4A2CE95}.Debug|Any CPU.Build.0 = Debug|Any CPU
{33DF254A-C238-40D8-9A15-B0CCC4A2CE95}.Release|Any CPU.ActiveCfg = Release|Any CPU
{33DF254A-C238-40D8-9A15-B0CCC4A2CE95}.Release|Any CPU.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
@@ -114,6 +126,8 @@ Global
{7BDE09ED-04B3-41B2-A466-3D6F7225291E} = {71A19F75-8279-40AB-BEA0-7D4B153FC416}
{70DDA4E9-1195-4A29-9AA1-96A8223A6D4F} = {71A19F75-8279-40AB-BEA0-7D4B153FC416}
{C6019E44-C777-4DE2-B70E-EA025B7D044D} = {71A19F75-8279-40AB-BEA0-7D4B153FC416}
{30FC2E08-5944-47B1-9441-90D6D5D91896} = {71A19F75-8279-40AB-BEA0-7D4B153FC416}
{33DF254A-C238-40D8-9A15-B0CCC4A2CE95} = {71A19F75-8279-40AB-BEA0-7D4B153FC416}
EndGlobalSection
GlobalSection(ExtensibilityGlobals) = postSolution
SolutionGuid = {FD15FFDB-EA1B-436F-841D-3386DDF94538}
2 changes: 2 additions & 0 deletions src/csharp/Microsoft.Spark/Microsoft.Spark.csproj
@@ -18,6 +18,8 @@
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.Delta.E2ETest" />
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.DotNet.Interactive" />
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.DotNet.Interactive.UnitTest" />
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.FileSystem" />
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.FileSystem.E2ETest" />
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.Hyperspace" />
<InternalsVisibleTo Include="Microsoft.Spark.Extensions.Hyperspace.E2ETest" />
<InternalsVisibleTo Include="Microsoft.Spark.UnitTest" />