Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert PcaTransform to Estimator API #1333

Merged
merged 30 commits into from
Oct 25, 2018

Conversation

shmoradims
Copy link

This PR is proper conversion of PcaTransform to estimator API (a work item related to #754):

  • Overhauled PcaTransfrom to be OneToOneTransformerBase
  • Deleted the Wrapped versions
  • Extended the unit tests
  • Lots of code cleanup/simplifications

separator: ';', hasHeader: true)
.Read(dataSource);
.Read(new MultiFileSource(_dataSource));
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new MultiFileSource(_dataSource) [](start = 22, length = 32)

I would prefer to have string path rather than MultiFileSource #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All snippets I had seen were using MultiFileSource. Didn't know that string input was an option, which is much cleaner. Will change to string.


In reply to: 227046972 [](ancestors = 227046972)

public PcaTests(ITestOutputHelper helper)
: base(helper)
{
_env = new ConsoleEnvironment(seed: 1, conc: 1);
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_env = new ConsoleEnvironment(seed: 1, conc: 1); [](start = 12, length = 48)

How necessary is to have this environment?
I believe you specify seed in arguments, so you don't have to specify seed here.
Does concurrency affect results of PCA? #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concurrency didn't have an effect, on my Windows machine at least, so I removed it.

The seed value is needed and it does change the output. PCA seed doesn't seem to work and I opened an issue #1337 for it.
#1337


In reply to: 227047586 [](ancestors = 227047586)

@@ -15,19 +13,29 @@
using Microsoft.ML.Runtime.Internal.Utilities;
using Microsoft.ML.Runtime.Model;
using Microsoft.ML.Runtime.Numeric;
using Microsoft.ML.Core.Data;
using Microsoft.ML.StaticPipe;
using Microsoft.ML.StaticPipe.Runtime;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ctrl+K+G #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorted them


In reply to: 227047810 [](ancestors = 227047810)

[assembly: LoadableClass(PcaTransform.Summary, typeof(PcaTransform), null, typeof(SignatureLoadModel),
PcaTransform.UserName, PcaTransform.LoaderSignature)]

[assembly: LoadableClass(typeof(IRowMapper), typeof(PcaTransform), null, typeof(SignatureLoadRowMapper),
PcaTransform.UserName, PcaTransform.LoaderSignature)]

[assembly: LoadableClass(typeof(void), typeof(PcaTransform), null, typeof(SignatureEntryPointModule), PcaTransform.LoaderSignature)]

namespace Microsoft.ML.Runtime.Data
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtime.Data [](start = 23, length = 12)

Can you rename it to Microsoft.ML.Transforms? #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


In reply to: 227047966 [](ancestors = 227047966)

if (!inputSchema.TryFindColumn(colInfo.Input, out var col))
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input);

if (!(col.Kind == SchemaShape.Column.VectorKind.Vector && col.ItemType.IsNumber))
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

col.ItemType.IsNumber [](start = 74, length = 21)

Just want to double check. Do we really support all number types? Doubles, longs? #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the original check was with .IsNumber, but you're right, it only works with floats. I changed the check.


In reply to: 227048656 [](ancestors = 227048656)

_transformInfos[i] = new TransformInfo(args.Column[i], args, Infos[i].TypeSrc.ValueCount);
_center[i] = args.Column[i].Center ?? args.Center;
_oversampling[i] = args.Column[i].Oversampling ?? args.Oversampling;
Host.CheckUserArg(_oversampling[i] >= 0, nameof(args.Oversampling), "Oversampling must be non-negative");
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oversampling must be non-negative" [](start = 85, length = 34)

can you put this check inside ColumnInfo constructor? #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to ctor


In reply to: 227051899 [](ancestors = 227051899)

Oversampling = overSampling;
Center = center;
Seed = seed;
Contracts.CheckUserArg(Oversampling >= 0, nameof(Oversampling), "Oversampling must be non-negative.");
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contracts [](start = 16, length = 9)

Can you add
Contracts.CheckParam(0 < Rank && Rank <= Dimension, nameof(Rank), "Rank must be positive, and at most the dimension of untransformed data");
as well?
Also change it to CheckParam
#Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That check already exists in TransformInfo constructor. The Dimension information is kept in that class and is not available in ColumnInfo class, because it depends on the input schema.


In reply to: 227573435 [](ancestors = 227573435)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's part of TransformInfo constructor, but in same time we hit that constructor only during training stage.
You can create estimator, put it as part of chain, and never call it, or call it few screens later, and only after it hit TransformInfo it would throw exception.
If you put it in ColumnInfo as well, it will throw immediately.


In reply to: 227624838 [](ancestors = 227624838,227573435)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Dimension is equal to the vector size of the column, and it's only known at the time of Fit, not before. So there is no way to throw early.


In reply to: 227869834 [](ancestors = 227869834,227624838,227573435)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least you can check rank negativity.


In reply to: 227869834 [](ancestors = 227869834,227624838,227573435)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True :)


In reply to: 227877885 [](ancestors = 227877885,227869834,227624838,227573435)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Positivity


In reply to: 227879084 [](ancestors = 227879084,227877885,227869834,227624838,227573435)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added positivity check for rank


In reply to: 227879168 [](ancestors = 227879168,227879084,227877885,227869834,227624838,227573435)

/// <summary>
/// Describes how the transformer handles one column pair.
/// </summary>
public ColumnInfo(string input,
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you copy comments about params from pigstension to here? #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


In reply to: 227573714 [](ancestors = 227573714)

string dataSource = GetDataPath("generated_regression_dataset.csv");
var data = TextLoader.CreateReader(env,
c => (label: c.LoadFloat(11), features: c.LoadFloat(0, 10)),
var data = TextLoader.CreateReader(_env,
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data [](start = 16, length = 4)

nit: you can create third data with same schema but loading floats from 1 to 6, and pass it to validForFitNotValidForTransformInput #Pending

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what you mean?
c.LoadFloat(11), weight: c.LoadFloat(0), features: c.LoadText(1, 6)
that sounds like validFitInput to me.


In reply to: 227574746 [](ancestors = 227574746)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you replace features:c.LoadText(1,6) to features:c.LoadFloat(1,6) then yes (I doubt you can fit on top of text vector)
And that's exactly my point, it's a valid fit but if you already train you data on bigger array, it should fail on smaller one, this is why it called validForFitNotValidForTransformInput :)


In reply to: 227626841 [](ancestors = 227626841,227574746)

return dstGetter;
}

private static void TransformFeatures(IExceptionContext ectx, ref VBuffer<float> src, ref VBuffer<float> dst, TransformInfo transformInfo)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static void TransformFeatures [](start = 20, length = 29)

nit: I don't see reason why it's static, you can just make it part of class and stop accept IExceptionContext and use Host instead #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or keep as is :)


In reply to: 227575467 [](ancestors = 227575467)

}

protected override ColumnType GetColumnTypeCore(int iinfo)
internal static void ValidatePcaInput(IHost host, string name, ColumnType type)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static [](start = 17, length = 6)

any reason why it's static? #ByDesign

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it reflects the fact that the PCA check doesn't depends any instance state. It only cares about a column type. It also gets used inside the Mapper.


In reply to: 227576143 [](ancestors = 227576143)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

}

protected override ColumnType GetColumnTypeCore(int iinfo)
internal static void ValidatePcaInput(IHost host, string name, ColumnType type)
Copy link
Contributor

@Zruty0 Zruty0 Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IHost [](start = 46, length = 5)

let's use the narrowest possible interface. In this case it should be IExceptionContext #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


In reply to: 227587653 [](ancestors = 227587653)

Host.Check(0 <= iinfo & iinfo < Utils.Size(_types));
return _types[iinfo];
if (!type.IsVector)
throw host.Except($"Pca transform can only be applied to vector columns. Column ${name} is of type ${type}");
Copy link
Contributor

@Zruty0 Zruty0 Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host.Except( [](start = 22, length = 12)

I suggest to refactor this slightly to fit into ExceptSchemaMismatch #Resolved

Copy link
Author

@shmoradims shmoradims Oct 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There wasn't a good type to to put for the expected type. The best I could come up with was this, which didn't look good. Do you have a better version?

        string inputSchema; // just used for the excpections

        if (!type.IsVector)
            throw ectx.ExceptSchemaMismatch(nameof(inputSchema), "input", name, "Vector", type.ToString());

In reply to: 227587808 [](ancestors = 227587808)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you are expecting a fixed-size vector of float. You can either spell it out 'fixed size vector of float', or create a SchemaShape.Column and call GetTypeString


In reply to: 227630278 [](ancestors = 227630278,227587808)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consolidated all the checks.


In reply to: 227877917 [](ancestors = 227877917,227630278,227587808)

}
}

public ColumnType InputType => _schema[_input].Type;
Copy link
Contributor

@Zruty0 Zruty0 Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public ColumnType InputType => _schema[_input].Type; [](start = 16, length = 52)

instead of properties over internal state, let's adopt a pattern where all of the properties are computed once in the ctor, and there is no private fields. #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense. done!


In reply to: 227589703 [](ancestors = 227589703)

if (colSchemaInfo.InputType.VectorSize != _parent._transformInfos[i].Dimension)
{
var msg = $"Dimension of column ${colPair.input} is ${colSchemaInfo.InputType.VectorSize}, which doesn't match the expected size ${_parent._transformInfos[i].Dimension}";
throw Host.Except(msg);
Copy link
Contributor

@Zruty0 Zruty0 Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except [](start = 35, length = 6)

this message also should be ExceptSchemaMismatch #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExceptSchemaMismatch only produces this type of message $"Schema mismatch for {columnRole} column '{columnName}': expected {expectedType}, got {actualType}"; I don't know how to turn this message to fit that.


In reply to: 227591272 [](ancestors = 227591272)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema mismatch for input column 'Foo': expected V<R4, 4>, got V<R4, 5>
Produced by ExceptSchemaMismatch("input", colPair.input, new VectorType(NumberType.R4, _parent._transformInfos[i].Dimension).ToString(), colSchemaInfo.InputType.ToString())


In reply to: 227631956 [](ancestors = 227631956,227591272)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool. fixed it :)


In reply to: 227878757 [](ancestors = 227878757,227631956,227591272)

if (!inputSchema.TryFindColumn(colInfo.Input, out var col))
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input", colInfo.Input);

if (col.Kind != SchemaShape.Column.VectorKind.Vector || col.ItemType.RawKind != DataKind.R4)
Copy link
Contributor

@Zruty0 Zruty0 Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ItemType [](start = 76, length = 8)

!col.ItemType.Equals(NumberType.R4) is a more direct comparison to R4 #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


In reply to: 227593906 [](ancestors = 227593906)

IReadOnlyDictionary<PipelineColumn, string> outputNames,
IReadOnlyCollection<string> usedNames)
{
// Only one column is allowed.
Copy link
Contributor

@Zruty0 Zruty0 Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Only one column is allowed. [](start = 15, length = 31)

remove this comment #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


In reply to: 227594435 [](ancestors = 227594435)

}
}

/// <summary>Compute the principal components of the input column. Can significantly reduce size of vector.</summary>
Copy link
Contributor

@Zruty0 Zruty0 Oct 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compute the principal components of the input column. [](start = 21, length = 53)

describe the transformation, not the training procedure. 'Replaces vector with its projection to the principal component subspace' or something along these lines #Closed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated


In reply to: 227594833 [](ancestors = 227594833)

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@shmoradims shmoradims merged commit b056d08 into dotnet:master Oct 25, 2018
@shmoradims shmoradims deleted the PCA_api_conversion branch December 12, 2018 22:08
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants