Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace DvInt* with .NET standard data types. WIP #683

Closed
wants to merge 56 commits into from

Conversation

codemzs
Copy link
Member

@codemzs codemzs commented Aug 16, 2018

This change also removes missing value handling for sbyte, short, int and long because default of these values is a null and that does not fit well with sparse vector architecture where default for missing values is a zero. fixes #673

RegisterSimpleCodec(new UnsafeTypeCodec<byte>(this));
RegisterSimpleCodec(new UnsafeTypeCodec<DvInt2>(this));
RegisterSimpleCodec(new UnsafeTypeCodec<Int16>(this));
Copy link
Contributor

@TomFinley TomFinley Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int16 [](start = 52, length = 5)

short, int, long, not Int16, Int32, Int64. #Closed

@@ -417,9 +417,9 @@ protected override void GetMetadataCore<TValue>(string kind, int iinfo, ref TVal
}
}

private void GetCategoricalSlotRanges(int iiinfo, ref VBuffer<DvInt4> dst)
private void GetCategoricalSlotRanges(int iiinfo, ref VBuffer<Int32> dst)
Copy link
Contributor

@TomFinley TomFinley Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int32 [](start = 74, length = 5)

Please just use the keywords for types (so, in this case int), here and everywhere. #Closed

Copy link
Member Author

@codemzs codemzs Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change across the code base using find-replace. #Resolved


public void Conv(ref long? src, ref DvInt8 dst) => dst = src ?? DvInt8.NA;
public void Conv(ref long? src, ref Int64? dst) => dst = src;

Copy link
Contributor

@TomFinley TomFinley Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there are no conversions from float to float, or double to double. So: are these conversions actually necessary? #Resolved

Copy link
Member Author

@codemzs codemzs Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated this code to not assume there can be nullables but I do see a conversion between byte and byte. @mandyshieh can you please take a look? #Resolved

@TomFinley
Copy link
Contributor

TomFinley commented Aug 16, 2018

    public RefPredicate<T> GetIsDefaultPredicate<T>(ColumnType type)

You will need to be careful here. We want the "default" of the new nullable integer types to be 0, since that's both a more sensible choice and it retains sparsity preservation for conversion from int? to int. (For that reason it may be beneficial to choose a better name than "default" since it will no longer be identical with C# sparsity.)

The implications for things like VBufferUtils.Densify are interesting.

The sparsity question will be quite far reaching, and will be by far the most challenging part of this change. #Resolved


Refers to: src/Microsoft.ML.Data/Data/Conversion.cs:703 in a0ceabb. [](commit_id = a0ceabb, deletion_comment = False)

RegisterSimpleCodec(new UnsafeTypeCodec<byte>(this));
RegisterSimpleCodec(new UnsafeTypeCodec<DvInt2>(this));
RegisterSimpleCodec(new UnsafeTypeCodec<Int16>(this));
Copy link
Contributor

@TomFinley TomFinley Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RegisterSimpleCodec(new UnsafeTypeCodec(this)); [](start = 12, length = 54)

We will want the nullable types to be serializable. #Closed

Copy link
Member Author

@codemzs codemzs Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed since there will no nullable types 👍 #Resolved

Copy link
Contributor

@TomFinley TomFinley Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, thanks @codemzs. Incidentally were you going to update the description of #673 pursuant to our discussions yesterday? It is still the old description.

Edit: Actually I don't think you've linked to that issue from this PR at all. You should, maybe "fixes #673" in your description.


In reply to: 210744044 [](ancestors = 210744044)

@codemzs codemzs changed the title WIP Replace DvInt* with .NET standard data types. Replace DvInt* with .NET standard data types and remove missing value handling for sbyte, short, int and long. Aug 16, 2018
@codemzs codemzs changed the title Replace DvInt* with .NET standard data types and remove missing value handling for sbyte, short, int and long. Replace DvInt* with .NET standard data types. Aug 16, 2018
@@ -525,7 +525,7 @@ private void UpdatePeg(Peg peg)

namespace Microsoft.ML.Runtime.Internal.Utilities
{
// Reasonable choices are Double and System.Int64.
// Reasonable choices are Double and System.long.
Copy link
Member Author

@codemzs codemzs Aug 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long [](start = 48, length = 4)

fix this #Resolved

{
Ch.Assert(colType.ItemType == NumberType.I4);
return CreateConvertingArrayGetterDelegate<int?, DvInt4>(index, x => x ?? DvInt4.NA);
return CreateConvertingArrayGetterDelegate<int, int>(index, x => x);
Copy link
Member

@eerhardt eerhardt Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a "converting" delegate here to return the same value? #Resolved

@@ -185,25 +185,25 @@ public static bool TryGetDataKind(this Type type, out DataKind kind)
Contracts.CheckValueOrNull(type);

// REVIEW: Make this more efficient. Should we have a global dictionary?
if (type == typeof(DvInt1) || type == typeof(sbyte) || type == typeof(sbyte?))
if (type == typeof(sbyte))
kind = DataKind.I1;
else if (type == typeof(byte) || type == typeof(byte?))
Copy link
Member

@eerhardt eerhardt Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byte? should be removed here as well, to follow the rest of the changes. #Resolved

Copy link
Member Author

@codemzs codemzs Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is "byte?" mapped to a U1? I thought U1 just mapped to a byte and did not support nullables ....this needs to go away anyways


In reply to: 210963225 [](ancestors = 210963225)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone made a mistake somewhere, didn't realize this was supposed to be inverse to above function. Who knows who, but we ought to correct now.


In reply to: 210965477 [](ancestors = 210965477,210963225)

using RawI2 = Int16;
using RawI4 = Int32;
using RawI8 = Int64;
using I1 = SByte;
Copy link
Member

@eerhardt eerhardt Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to just start using the C# keywords inline in the code? That way readers didn't have to map I2 => short in their heads? #Resolved

Copy link
Contributor

@TomFinley TomFinley Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is generally the case except in this one file, if we consider its nature I feel like this would actually impede readability, at least for me.

So:

AddStd<I2, I1>(Convert);
AddStd<I2, I2>(Convert);
AddStd<I2, I4>(Convert);
AddStd<I2, I8>(Convert);
AddStd<I2, R4>(Convert);
AddStd<I2, R8>(Convert);
AddAux<I2, SB>(Convert);

becomes

AddStd<short, byte>(Convert);
AddStd<short, short>(Convert);
AddStd<short, int>(Convert);
AddStd<short, long>(Convert);
AddStd<short, float>(Convert);
AddStd<short, double>(Convert);
AddAux<short, SB>(Convert);

Maybe. That's hardly disastrous I suppose, and might be easier for some people. Perhaps we ought to revisit though once we've done all the type changes we want to do...


In reply to: 210964331 [](ancestors = 210964331)

private BoolType()
: base(typeof(DvBool), DataKind.BL)
private readonly string _name;

Copy link
Contributor

@TomFinley TomFinley Aug 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. What's this for? Is it still relevant? #Closed

@TomFinley
Copy link
Contributor

TomFinley commented Aug 29, 2018

Change looks OK for now, let's hold off till after 0.5 is cut then we can perhaps do all the changes to the type system in one go. #Resolved

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @codemzs let's hold off till 0.5 though.

@codemzs codemzs changed the title Replace DvInt* with .NET standard data types. Replace DvInt* with .NET standard data types. WIP Aug 29, 2018
@@ -731,6 +731,12 @@ public void GetMetadata<TValue>(string kind, int col, ref TValue value)
/// </summary>
private const ulong ReaderVersion = MissingTextVersion;
Copy link
Contributor

@TomFinley TomFinley Aug 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReaderVersion = MissingTextVersion [](start = 28, length = 34)

This was not updated, it should have been. We can read StandardDataTypesVersion so it should be that. #Resolved

@@ -1090,10 +1096,10 @@ private unsafe Header InitHeader()
throw _host.Except("Cannot read version {0} data, earliest that can be handled is {1}",
Header.VersionToString(header.CompatibleVersion), Header.VersionToString(MetadataVersion));
}
if (header.CompatibleVersion > ReaderVersion)
if (header.CompatibleVersion > StandardDataTypesVersion)
Copy link
Contributor

@TomFinley TomFinley Aug 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompatibleVersion > StandardDataTypesVersion [](start = 23, length = 44)

Revert these two lines please. #Resolved

/// The first version that removes DvTypes and uses .NET standard
/// data types.
/// </summary>
private const ulong StandardDataTypesVersion = 0x0001000100010006;
Copy link
Contributor

@TomFinley TomFinley Aug 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StandardDataTypesVersion [](start = 28, length = 24)

Also will need to update the constant in Header.cs. #Resolved

@codemzs
Copy link
Member Author

codemzs commented Aug 30, 2018

Sounds good, thanks for reviewing. I was wondering if we should bump the version number for every type conversion PR check-in as this will allow us to check-in PRs without waiting for other PRs to close, thoughts?


In reply to: 417051605 [](ancestors = 417051605)

@codemzs
Copy link
Member Author

codemzs commented Aug 30, 2018

    /// Missing values are mapped to zero with a true return.

Fix this #Resolved


Refers to: src/Microsoft.ML.Data/Data/Conversion.cs:1199 in 40c01fc. [](commit_id = 40c01fc, deletion_comment = False)


Assert.True(error);

//5. Missing value in text to int.
Copy link
Member Author

@codemzs codemzs Aug 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//5. Missing value in text to int. [](start = 12, length = 34)

Empty string #Resolved

mapper(ref src, ref dst);
Assert.Equal(default, dst);

//6. Empty string in text to sbyte.
Copy link
Member Author

@codemzs codemzs Aug 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Empty string in text to sbyte. [](start = 14, length = 33)

Missing value #Resolved

…to types

# Conflicts:
#	test/Microsoft.ML.TestFramework/DataPipe/TestDataPipeBase.cs
@TomFinley
Copy link
Contributor

I think we can avoid in those cases since I feel like we can easily just have separate codecs probably. In this case we could not avoid I think because we were using the same codec for both purposes.


In reply to: 417434524 [](ancestors = 417434524,417051605)

@TomFinley
Copy link
Contributor

TomFinley commented Aug 31, 2018

datatypes.idv is still shown as being modified in latest commit #Resolved

…to types

# Conflicts:
#	test/BaselineOutput/SingleDebug/SavePipe/TestParquetPrimitiveDataTypes-Data.txt
#	test/BaselineOutput/SingleDebug/SavePipe/TestParquetPrimitiveDataTypes-Schema.txt
#	test/BaselineOutput/SingleRelease/SavePipe/TestParquetPrimitiveDataTypes-Data.txt
#	test/BaselineOutput/SingleRelease/SavePipe/TestParquetPrimitiveDataTypes-Schema.txt
#	test/data/Parquet/alltypes.parquet
@codemzs codemzs changed the title Replace DvInt* with .NET standard data types. WIP Replace DvInt* with .NET standard data types. Sep 4, 2018
@codemzs codemzs changed the title Replace DvInt* with .NET standard data types. Replace DvInt* with .NET standard data types. WIP Sep 4, 2018
@TomFinley
Copy link
Contributor

TomFinley commented Sep 4, 2018

Hi @codemzs you may want to merge master again, I think our work on APIs and additional tests for pigsty may interfere. #Resolved

@codemzs
Copy link
Member Author

codemzs commented Sep 7, 2018

Done.


In reply to: 418545763 [](ancestors = 418545763)

@codemzs
Copy link
Member Author

codemzs commented Sep 20, 2018

This change was committed via #863

@codemzs codemzs closed this Sep 20, 2018
@codemzs codemzs deleted the types branch September 20, 2018 18:17
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.NET data type system instead of DvTypes
4 participants