Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace DV data type system with .NET standard type system. #863

Merged
merged 44 commits into from
Sep 19, 2018

Conversation

codemzs
Copy link
Member

@codemzs codemzs commented Sep 7, 2018

This change replaces DvType system with .NET standard data type system and fixes #673

Old Type New Type
DvInt1 sbyte
DvInt2 short
DvInt4 int
DvInt8 long
DvBool bool
DvDateTime DateTime
DvDateTimeZone DateTimeOffset
DvTimeSpan TimeSpan
DvText ReadOnlyMemory<char>

…nto typesystem

# Conflicts:
#	src/Microsoft.ML.Api/ApiUtils.cs
#	src/Microsoft.ML.Data/Data/Conversion.cs
#	test/BaselineOutput/SingleDebug/Command/Datatypes-datatypes.txt
#	test/BaselineOutput/SingleRelease/Command/Datatypes-datatypes.txt
#	test/Microsoft.ML.Core.Tests/UnitTests/DvTypes.cs
#	test/Microsoft.ML.Tests/CollectionDataSourceTests.cs
…into typesystem

# Conflicts:
#	src/Microsoft.ML.Api/ApiUtils.cs
#	src/Microsoft.ML.Api/DataViewConstructionUtils.cs
#	src/Microsoft.ML.Api/TypedCursor.cs
#	src/Microsoft.ML.Core/Data/DataKind.cs
#	src/Microsoft.ML.Core/Data/DvInt1.cs
#	src/Microsoft.ML.Core/Data/DvInt2.cs
#	src/Microsoft.ML.Core/Data/DvInt4.cs
#	src/Microsoft.ML.Core/Data/DvInt8.cs
#	src/Microsoft.ML.Core/Data/DvText.cs
#	src/Microsoft.ML.Core/Data/TypeUtils.cs
#	src/Microsoft.ML.Data/Data/Conversion.cs
#	src/Microsoft.ML.Data/Evaluators/BinaryClassifierEvaluator.cs
#	src/Microsoft.ML.Data/Evaluators/ClusteringEvaluator.cs
#	src/Microsoft.ML.Data/Evaluators/MultiOutputRegressionEvaluator.cs
#	src/Microsoft.ML.Data/Evaluators/MulticlassClassifierEvaluator.cs
#	src/Microsoft.ML.Data/Evaluators/RankerEvaluator.cs
#	src/Microsoft.ML.Data/Evaluators/RegressionEvaluatorBase.cs
#	test/BaselineOutput/SingleDebug/Command/Datatypes-datatypes.txt
#	test/BaselineOutput/SingleRelease/Command/Datatypes-datatypes.txt
#	test/Microsoft.ML.Core.Tests/UnitTests/DvTypes.cs
#	test/Microsoft.ML.Core.Tests/UnitTests/TestCSharpApi.cs
#	test/Microsoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs
#	test/Microsoft.ML.StaticPipelineTesting/StaticPipeTests.cs
#	test/Microsoft.ML.TestFramework/TestSparseDataView.cs
#	test/Microsoft.ML.Tests/CollectionDataSourceTests.cs
 into typesystem

# Conflicts:
#	src/Microsoft.ML.Data/Data/Conversion.cs
#	src/Microsoft.ML.Data/DataLoadSave/Binary/CodecFactory.cs
#	src/Microsoft.ML.Data/DataLoadSave/Binary/Codecs.cs
#	src/Microsoft.ML.Data/DataLoadSave/Binary/UnsafeTypeOps.cs
#	src/Microsoft.ML.Transforms/NAReplaceUtils.cs
#	test/BaselineOutput/SingleDebug/Command/Datatypes-datatypes.txt
#	test/BaselineOutput/SingleRelease/Command/Datatypes-datatypes.txt
@@ -120,47 +120,38 @@ public bool IsBool
}

/// <summary>
/// Whether this type is the standard timespan type.
/// Whether this type is the standard <see cref="TimeSpan"/> type.
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TimeSpan [](start = 57, length = 8)

TimeSpanType, not TimeSpan right? Certainly any instance of a ColumnType could not be a System.TimeSpan. #Resolved

Contracts.Assert(this == TimeSpanType.Instance);
return true;
Contracts.Assert((this == TimeSpanType.Instance) == (this is TimeSpanType));
return this is TimeSpanType;
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice simplification of some, frankly, previously ridiculous code. Thank you for adding it. #Resolved

@@ -11,4 +11,8 @@
<Folder Include="Properties\" />
</ItemGroup>

<ItemGroup>
<PackageReference Include="System.Memory" Version="4.5.1" />
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're referencing this package in many places, the version should be a variable SystemMemoryVersion in build/Dependencies.props. #Resolved

@@ -42,12 +42,12 @@ public static class Kinds
public const string ScoreColumnSetId = "ScoreColumnSetId";

/// <summary>
/// Metadata kind that indicates the prediction kind as a string. E.g. "BinaryClassification". The value is typically a DvText.
/// Metadata kind that indicates the prediction kind as a string. E.g. "BinaryClassification". The value is typically a ReadOnlyMemory.
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadOnlyMemory [](start = 132, length = 14)

Not just any, but specifically ReadOnlyMemory<char>. #Resolved

@@ -295,7 +295,7 @@ public static IEnumerable<int> GetColumnSet(this ISchema schema, string metadata
/// Returns <c>true</c> if the specified column:
/// * is a vector of length N (including 0)
/// * has a SlotNames metadata
/// * metadata type is VBuffer&lt;DvText&gt; of length N
/// * metadata type is VBuffer&lt;ReadOnlyMemory&gt; of length N
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadOnlyMemory [](start = 43, length = 14)

If you wanted to read it out it would be not ReadOnlyMemory but ReadOnlyMemory&ltchar&gt, but this is why we have <see tags in the XML docs. #Resolved

{

/// <summary>
/// This implements IEquatable's Equals method.
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implements IEquatable's Equals method. [](start = 12, length = 43)

It couldn't possibly be implementing an interface method since this is a static method on a static class, and even if it wasn't you couldn't make ReadOnlyMemory<char> implement any interface anyway, since it is not a type under your control.

I think you meant to say it was a utility function out of which one could make an IEquatityComparer<ReadOnlyMemory<char>>, but it's difficult to be certain. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please try to use <see tags.


In reply to: 216126127 [](ancestors = 216126127)

/// <summary>
/// Compare equality with the given system string value.
/// </summary>
public static bool EqualsStr(string s, ReadOnlyMemory<char> memory)
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EqualsStr [](start = 27, length = 9)

Is this different method necessary due to perf reasons of s.AsMemory? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can say that.


In reply to: 216126482 [](ancestors = 216126482)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you say that?


In reply to: 216189037 [](ancestors = 216189037,216126482)


for (int i = 0; i < memory.Length; i++)
{
if (memory.Span[i] != b.Span[i])
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Span [](start = 27, length = 4)

Please store the spans as a variable upfront, then operate on them. These calls to .Span property are not free. Here and everywhere.

Consider this code.

class Program
{
    private static void TimeTest(string str)
    {
        long temp = 0;

        var sw = new Stopwatch();
        const int trials = 1_000_000_000;

        sw.Restart();
        for (int i = 0; i < trials; ++i)
            temp += str[0];
        sw.Stop();
        Console.WriteLine($"{sw.Elapsed} on string");
        temp = 0;

        sw.Restart();
        var mem = str.AsMemory();
        for (int i = 0; i < trials; ++i)
            temp += mem.Span[0];
        sw.Stop();
        Console.WriteLine($"{sw.Elapsed} on memory.Span");
        temp = 0;

        sw.Restart();
        var span = mem.Span;
        for (int i = 0; i < trials; ++i)
            temp += span[0];
        sw.Stop();
        Console.WriteLine($"{sw.Elapsed} on a stored span");
    }

    private static void Main(string[] args)
    {
        TimeTest("hello");
    }
}

When compiled and run in .NET Core 2.1, in a release build, the timings are this:

00:00:00.4667924 on string
00:00:01.8801733 on memory.Span
00:00:00.2327934 on a stored span
``` #Resolved

yield break;
}

int ichMin = 0;
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ichMin [](start = 16, length = 6)

As far as I see both ichMin and ichLim could, here and most methods in this file, be replaced with direct usage of the 0 literal and memory.Length (or, more precisely, span.Length since we should be storing the span upfront, see above) literal. This would both simplify the code and make it more clear.

Look at it this way. If we had code over an array a that looked like this:

for (int i=0; i<a.Length; ++i)
    DoSomething(a[i]);

we certainly would not welcome any movement to change the code to this.

int min = 0;
int lim = a.Length;
for (int i=min; i<lim; ++i)
    DoSomething(a[i]);

That's less clear and needlessly verbose. #Resolved

Copy link
Member Author

@codemzs codemzs Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ichLim might be worth keeping since we don't want to call memory.Length multiple times. There are places where ichMin is incremented and even there we should keep it.


In reply to: 216126883 [](ancestors = 216126883)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ichLim might be worth keeping since we don't want to call memory.Length multiple times.

Why? Are you under the impression that calling this on a ReadOnlySpan<char> is a perf hit? Have you checked?

Just reactivating, since it reads like you aren't going to do this.


In reply to: 216189476 [](ancestors = 216189476,216126883)

Copy link
Member Author

@codemzs codemzs Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I did not check but based on your comment for Span(and other places in the codebase I have seen such fields cached locally) I assumed it might also be the case for Length but may be I'm wrong. I will replace ichLim as well.


In reply to: 216192013 [](ancestors = 216192013,216189476,216126883)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I did not check but based on your comment for Span(and other places in the codebase I have seen such fields cached locally) I assumed it might also be the case for Length but may be I'm wrong. I will replace ichLim as well.

Why would you? Are you under the impression that an array .Length involves some heavy computation as well?

The point of caching is to avoid computation or memory allocations. In this case .Length is an immutable property of an immutable struct. By caching its value you're storing more on the stack.

The point is that a call to Span involves some actual work whereas Length does not.


In reply to: 216192140 [](ancestors = 216192140,216192013,216189476,216126883)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Span doing additional computation is also concerning to me. We need to take this up with the .NET team and provide this feedback.


In reply to: 216194473 [](ancestors = 216194473,216192140,216192013,216189476,216126883)

}

// Note that we don't use any fields of "this" here in case one
// of the out parameters is the same as "this".
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy pasta detected. #Resolved

}

/// <summary>
/// Splits this instance on the left-most occurrence of an element of separators character array and
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this instance [](start = 19, length = 13)

This is not an instance method, so saying "this instance" is unclear. A <paramref however would not go amiss. #Resolved


/// <summary>
/// Splits this instance on the left-most occurrence of an element of separators character array and
/// produces the left and right ReadOnlyMemory values. If this instance does not contain any of the
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadOnlyMemory [](start = 40, length = 14)

Another textual reference to a non-generic ReadOnlyMemory structure, which does not exist. I think these can be replaced more or less well with <see cref="ReadOnlyMemory{char}"/>. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gives an error but apparently it needs to be of


In reply to: 216127057 [](ancestors = 216127057)

// REVIEW: Can this be faster?
private static bool ContainsChar(char ch, char[] rgch)
{
Contracts.CheckNonEmpty(rgch, nameof(rgch));
Copy link
Contributor

@TomFinley TomFinley Sep 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contracts.CheckNonEmpty(rgch, nameof(rgch)); [](start = 12, length = 44)

Since there's that review comment (though I suspect it was just copy/pasted over), yes, this can be faster.

First this is a private method that you're calling repeatedly in tight loops in the split code. An assert on the non-emptyness of the input is appropriate for sure, and you ought to check that the code is unreachable in any situation where the separators fed in should not be null, but other than that it seems fine. A check is code run all the time even in a release build, which is not needed here. #Resolved

@codemzs
Copy link
Member Author

codemzs commented Sep 19, 2018

public class TextLoaderTestPipe : TestDataPipeBase

I have added test for this in DataTypesTest


In reply to: 419637006 [](ancestors = 419637006)


Refers to: test/Microsoft.ML.Tests/TextLoaderTests.cs:19 in 0511d74. [](commit_id = 0511d74, deletion_comment = False)

…to typesystem

# Conflicts:
#	src/Microsoft.ML.Transforms/Text/WordEmbeddingsTransform.cs
#	test/Microsoft.ML.Tests/Scenarios/Api/Estimators/Visibility.cs
#	test/Microsoft.ML.Tests/Scenarios/Api/Visibility.cs
…to typesystem

# Conflicts:
#	src/Microsoft.ML.Transforms/NAReplaceTransform.cs
#	src/Microsoft.ML.Transforms/NAReplaceUtils.cs
@eerhardt
Copy link
Member

        var ex = Assert.ThrowsAny<Exception>(() => TestCore(pathData, false, new[] { "loader=Parquet{bigIntDates=+}" }, forceDense: true));

That's not really the same thing. When you test for exceptions you should verify the exception type and optionally you can verify its message.

The code could start throwing MyCrazyException("Nullable object must have a value.") and this test would still pass. However, that could break user code that was catching a specific exception type, but is no longer thrown.


In reply to: 422599427 [](ancestors = 422599427,420677248)


Refers to: test/Microsoft.ML.TestFramework/DataPipe/Parquet.cs:36 in d45bc2c. [](commit_id = d45bc2c, deletion_comment = False)

@codemzs
Copy link
Member Author

codemzs commented Sep 19, 2018

        var ex = Assert.ThrowsAny<Exception>(() => TestCore(pathData, false, new[] { "loader=Parquet{bigIntDates=+}" }, forceDense: true));

By that logic the code could also throw another exception of the same type and even with the same message..... :)


In reply to: 422847966 [](ancestors = 422847966,422599427,420677248)


Refers to: test/Microsoft.ML.TestFramework/DataPipe/Parquet.cs:36 in d45bc2c. [](commit_id = d45bc2c, deletion_comment = False)

/// <summary>
/// Returns a <see cref="ReadOnlyMemory{T}"/> of <see cref="char"/> with trailing whitespace trimmed.
/// </summary>
public static ReadOnlyMemory<char> TrimEndWhiteSpace(ReadOnlyMemory<char> memory, ReadOnlySpan<char> span)
Copy link
Member

@eerhardt eerhardt Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method is a little awkward and should be removed. It is assuming that the memory and span are pointing to the same location. I see it is only called in 1 spot, which can just be moved to the above overload that only takes a ROM<char>. #Resolved

{
Contracts.Assert(scan.IchMinBuf <= scan.IchMinNext && scan.IchMinNext <= scan.IchLimBuf);

var text = scan.TextBuf;
int ichLim = scan.IchLimBuf;
int ichCur = scan.IchMinNext;

//var span = text.Span;
Copy link
Member

@eerhardt eerhardt Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) can be removed. #Resolved

return pool.Get(memory);
}

public static void AddToStringBuilder(ReadOnlyMemory<char> memory, StringBuilder sb)
Copy link
Member

@eerhardt eerhardt Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can get rid of this method and just have all the callers call sb.Append(memory) themselves. That way we don't have yet another public util method that does the same thing. #Resolved


for (; ; ichMin++)
public static Result Parse(out Single value, ReadOnlySpan<char> span)
Copy link
Member

@eerhardt eerhardt Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are in here fixing this, can we switch the parameters around so the out variable is at the end? #Resolved

@eerhardt
Copy link
Member

eerhardt commented Sep 19, 2018

I think this is looking really good. Just a couple last comments to address and I think this will be ready to merge. #Resolved

@codemzs
Copy link
Member Author

codemzs commented Sep 19, 2018

Thanks for reviewing!


In reply to: 422861081 [](ancestors = 422861081)

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.NET data type system instead of DvTypes
5 participants