Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transform wrappers and a reference implementation for tokenizers #931

Merged
merged 8 commits into from
Sep 19, 2018

Conversation

Zruty0
Copy link
Contributor

@Zruty0 Zruty0 commented Sep 17, 2018

No description provided.

@Zruty0 Zruty0 added the API Issues pertaining the friendly API label Sep 17, 2018
@Zruty0 Zruty0 self-assigned this Sep 17, 2018
text: ctx.LoadFloat(1)), hasHeader: true)
.Read(new MultiFileSource(sentimentDataPath));

var est = new WordTokenizer(Env, "text", "words")
Copy link
Contributor

@zeahmed zeahmed Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WordTokenizer [](start = 26, length = 13)

Whats the purpose of TransformWrapper? Is it to quickly manufacture transformers from existing transform? because I don't see the Pigsty extensions to it? Will it require additional work? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is exactly for that.

Pigsty comes extra, see the subsequent iterations


In reply to: 218245812 [](ancestors = 218245812)

/// <param name="env">The environment.</param>
/// <param name="columns">Pairs of columns to run the tokenization on.</param>
/// <param name="advancedSettings">Any advanced settings to be applied.</param>
public CharacterTokenizer(IHostEnvironment env, (string input, string output)[] columns,
Copy link
Contributor

@zeahmed zeahmed Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(string input, string output) [](start = 56, length = 29)

name-tuple array is sometime quite confusing when passing arguments e.g. when there are 2 or more tuples.

Also, maybe you are missing params keyword. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the pattern for the transformer conversion so far.
params is not possible here because of the useMarkerCharacters


In reply to: 218246352 [](ancestors = 218246352)


/// <summary>
/// A fake schema that is manufactured out of a SchemaShape.
/// It will pretend that all vector sizes are equal to 10, all key value counts are equal to 10,
Copy link
Contributor

@zeahmed zeahmed Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 [](start = 59, length = 2)

Is there any specific reason for choosing 10?...:) #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No


In reply to: 218248145 [](ancestors = 218248145)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be more than one, that's the only limitation I'm aware of.


In reply to: 218253037 [](ancestors = 218253037,218248145)

{
ColumnType curType = inputCol.ItemType;
if (inputCol.IsKey)
curType = new KeyType(curType.AsPrimitive.RawKind, 0, 10);
Copy link
Contributor

@zeahmed zeahmed Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 [](start = 70, length = 2)

I see it being used at multiple places. Can we define a const for it? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure


In reply to: 218248484 [](ancestors = 218248484)


var dataPipe = _xf;
var transforms = new List<IDataTransform>();
while (dataPipe is IDataTransform xf)
Copy link
Contributor

@zeahmed zeahmed Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataPipe is IDataTransform xf [](start = 19, length = 29)

It seems only for linear chain like pipeline. Is it possible that pipelines have graph like transformation structure? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No


In reply to: 218249827 [](ancestors = 218249827)

@@ -2,6 +2,7 @@
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using Microsoft.ML.Transforms.Text;
using Microsoft.ML.Data.StaticPipe;
Copy link
Contributor

@TomFinley TomFinley Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhhh. So nice to see a nice fresh PR. And you know what I'm going to say as my very first comment?

You know as well as I do. I always do it of course. I'd even go so far as to call it a tradition. A hoary tradition, by this point.

Just pausing for a moment, to relish the moment. You understand. :D Because, you know, I figure I deserve it at this point. It may seem a little self indulgent, and I suppose it is, but you know, I have to do it. Because, without it, I just wouldn't be me, you feel me?

So here it is:

Do you want to sort your usings? :) #Resolved

/// <param name="inputColumn">The column containing text to tokenize.</param>
/// <param name="outputColumn">The column containing output tokens. Null means <paramref name="inputColumn"/> is replaced.</param>
/// <param name="separators">The separators to use (comma separated).</param>
public WordTokenizer(IHostEnvironment env, string inputColumn, string outputColumn = null, string separators = "space")
Copy link
Contributor

@TomFinley TomFinley Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string separators = "space" [](start = 99, length = 27)

This is a great thing to have as the arguments to a command line. To an API, we should use actual char or char[]. #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with char, then I discovered that multiple separators can be used.
I am all for making opportunistic changes to the transforms to make them friendlier and more API-appropriate, but only if this doesn't take too much extra time / lines of code.

I also put a REVIEW somewhere saying exactly that.


In reply to: 218261276 [](ancestors = 218261276)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to not block this PR on this issue though.


In reply to: 218288071 [](ancestors = 218288071,218261276)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just looked at the code changes. They are not too big, but I'd rather file an issue and do it later than do it now. Are you OK with that?


In reply to: 218288153 [](ancestors = 218288153,218288071,218261276)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment on line 55 and will create a GH issue


In reply to: 218289358 [](ancestors = 218289358,218288153,218288071,218261276)

using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Runtime.Internal.Utilities;
using System.Collections.Generic;
using System.Linq;
Copy link
Member

@sfilipi sfilipi Sep 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to follow on the list of people that get picky about using sorting, i have seen both Microsoft.ML than System, and vice-versa.
we gotta sort this out #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we use what VS does when you do Ctrl-K-G (sort usings). It sorts alphabetically.


In reply to: 218328459 [](ancestors = 218328459)

/// <summary>
/// Estimator for untrained wrapped transformers.
/// </summary>
public abstract class TrivialWrapperEstimator : TrivialEstimator<TransformWrapper>
Copy link
Member

@sfilipi sfilipi Sep 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TrivialWrapperEstimator [](start = 26, length = 23)

why not call it like the comment: UntrainedWrapperEstimator #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this name. 'Untrained' could mean 'not yet trained', or 'not trainable at all'. The current name 'trivial wrapper estimator' makes more sense as it's a TrivialEstimator of a Wrapper.


In reply to: 218329322 [](ancestors = 218329322)

/// </summary>
/// <param name="input">The column to apply to.</param>
/// <param name="useMarkerCharacters">Whether to use marker characters to separate words.</param>
public static VarVector<Key<ushort, string>> TokenizeIntoCharacters(this Scalar<string> input, bool useMarkerCharacters = true) => new OutPipelineColumn(input, useMarkerCharacters);
Copy link
Contributor

@zeahmed zeahmed Sep 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this Scalar [](start = 76, length = 20)

I think char tokenizer also works on vector of string. Do you plan to support that extension as well? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like that implicit behavior. For now, I would rather only have uncontroversial Pigsty extensions, go by what makes sense vs. what is currently supported/possible


In reply to: 218547643 [](ancestors = 218547643)

/// </summary>
/// <param name="input">The column to apply to.</param>
/// <param name="separators">The separators to use (comma separated).</param>
public static VarVector<string> TokenizeText(this Scalar<string> input, string separators = "space") => new OutPipelineColumn(input, separators);
Copy link
Contributor

@zeahmed zeahmed Sep 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"space" [](start = 100, length = 7)

I think we should make them actual space(' ') or tab ('\t') on the front facing API instead of using keywords. What do you think? #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to just do #935 and not have some form of half-solutions here. I am ok with replacing "space" with " ", but it just begs the question: why not ' ', and that's a tougher change.


In reply to: 218548057 [](ancestors = 218548057)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know there is an issue already to handle that. I am ok with that.


In reply to: 218574836 [](ancestors = 218574836,218548057)

Copy link
Contributor

@zeahmed zeahmed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// <param name="columns">Pairs of columns to run the tokenization on.</param>
/// <param name="useMarkerCharacters">Whether to use marker characters to separate words.</param>
public CharacterTokenizer(IHostEnvironment env, (string input, string output)[] columns, bool useMarkerCharacters = true)
: base(Contracts.CheckRef(env, nameof(env)).Register(nameof(WordTokenizer)), MakeTransformer(env, columns, useMarkerCharacters))
Copy link
Contributor

@zeahmed zeahmed Sep 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WordTokenizer [](start = 72, length = 13)

CharacterTokenizer??? #Resolved

/// <summary>
/// Estimator for trained wrapped transformers.
/// </summary>
internal abstract class TrainedWrapperEstimatorBase : IEstimator<TransformWrapper>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal [](start = 4, length = 8)

Is it internal by design? I need to use it to convert Ngram transform.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, let's make it public for now


In reply to: 218629635 [](ancestors = 218629635)


public abstract TransformWrapper Fit(IDataView input);

public SchemaShape GetOutputSchema(SchemaShape inputSchema)
Copy link
Member

@sfilipi sfilipi Sep 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public [](start = 8, length = 6)

we should all make it a habit of documenting everything public .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it apply to interface methods? IEstimator.GetOutputSchema is pretty well documented.


In reply to: 218630155 [](ancestors = 218630155)

public void Tokenize()
{
var env = new ConsoleEnvironment(seed: 0);
var dataPath = GetDataPath("wikipedia-detox-250-line-data.tsv");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h("wikipedia-detox-250-line-data.tsv" [](start = 37, length = 37)

should use the datasets from the TestDatasets.cs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I merged before I saw your comments. Will do in one of the next PRs.


In reply to: 218631455 [](ancestors = 218631455)

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Zruty0 Zruty0 merged commit f063151 into dotnet:master Sep 19, 2018
@Zruty0 Zruty0 deleted the feature/transform-wrapper branch September 19, 2018 00:07
@Zruty0 Zruty0 mentioned this pull request Sep 24, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants