Clean up of TextLoader constructor #1784

Merged
merged 9 commits into from
Dec 7, 2018

Conversation

artidoro
Contributor

Fixes #1611.

  1. Hid the constructor of TextLoader that takes Arguments, and exposed HasHeader and SeparatorChars as non-advanced parameters.
  2. Made Create methods internal and modified the code accordingly.
  3. Added comments for the public facing constructor that was retained.

/// <param name="env">The environment to use.</param>
/// <param name="columns">Defines a mapping between input columns in the file and IDataView columns.</param>
/// <param name="hasHeader">Whether the file has a header.</param>
/// <param name="separatorChars">Defines the characters used as separators between data points in a row. By default the tab character is taken as separator.</param>

Contributor

@Ivanidzo4ka Ivanidzo4ka Nov 29, 2018

By default the tab character

This statement and char[] separatorChars = null are a bit weird together.
I know that later down the line we probably check for null in the separators and use tab as the default, but still. #Resolved

Contributor Author

Yes, it is not ideal; that's why I have an explanation in the documentation above. But if you have a better idea, I would be happy to take it!


In reply to: 237683995

// We read the first 11 values as a single float vector.
new TextLoader.Column("FeatureVector", DataKind.R4, 0, 10),

// Separately, read the target variable.
new TextLoader.Column("Target", DataKind.R4, 11),
},
// First line of the file is a header, not a data row.
-HasHeader = true,
+true,

Member

@eerhardt eerhardt Nov 30, 2018

I'd still qualify hasHeader: here. #Resolved

Member

Apply this everywhere - especially in docs/samples/etc.


In reply to: 238006767
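
A minimal sketch of why the qualified form reads better at the call site. SketchLoader is a hypothetical stand-in written for this illustration, not the real TextLoader; only the parameter name hasHeader is taken from the diff above.

```csharp
using System;

// Hypothetical stand-in for a loader with an optional boolean parameter.
public sealed class SketchLoader
{
    public string[] Columns { get; }
    public bool HasHeader { get; }

    public SketchLoader(string[] columns, bool hasHeader = false)
    {
        Columns = columns ?? throw new ArgumentNullException(nameof(columns));
        HasHeader = hasHeader;
    }
}

public static class NamedArgumentDemo
{
    // Bare literal: the reader must look up the signature to learn what 'true' means.
    public static SketchLoader Positional() =>
        new SketchLoader(new[] { "FeatureVector", "Target" }, true);

    // Qualified with the parameter name: the intent is visible at the call site.
    public static SketchLoader Named() =>
        new SketchLoader(new[] { "FeatureVector", "Target" }, hasHeader: true);
}
```

Both calls build the same object; the difference is purely in how much the call site explains itself, which matters most in docs and samples.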

// Default separator is tab, but we need a semicolon.
Separator = ";"
});
new[] { ';' }

Member

@eerhardt eerhardt Nov 30, 2018

Isn't the single separator the more common case? Maybe the "simple" constructor just takes a single character separator. And the "advanced" case can support multiple separators. #Resolved
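
One way the suggestion could look, as a sketch only. DelimitedLineSplitter and SeparatorOptions are hypothetical names invented for this example and do not mirror the ML.NET types; the tab default is the one described in the parameter documentation above.

```csharp
using System;

// Hypothetical options object for the "advanced" case.
public sealed class SeparatorOptions
{
    // Any number of separator characters; tab by default.
    public char[] SeparatorChars { get; set; } = { '\t' };
}

// Hypothetical splitter, not the ML.NET implementation.
public sealed class DelimitedLineSplitter
{
    private readonly char[] _separators;

    // "Simple" constructor: a single separator character, tab by default.
    public DelimitedLineSplitter(char separatorChar = '\t')
        : this(new SeparatorOptions { SeparatorChars = new[] { separatorChar } })
    {
    }

    // "Advanced" constructor: the options object can carry several separators.
    public DelimitedLineSplitter(SeparatorOptions options)
    {
        if (options == null)
            throw new ArgumentNullException(nameof(options));
        if (options.SeparatorChars == null || options.SeparatorChars.Length == 0)
            throw new ArgumentException("At least one separator is required.", nameof(options));
        _separators = (char[])options.SeparatorChars.Clone();
    }

    public string[] Split(string line) => line.Split(_separators);
}
```

With this shape the common call is new DelimitedLineSplitter(';'), and a non-nullable char default removes the need to document what null means.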

/// <param name="catalog">The catalog.</param>
/// <param name="args">The arguments to text reader, describing the data schema.</param>
/// <param name="dataSample">The optional location of a data sample.</param>
public static TextLoader TextReader(this DataOperations catalog,

Member

@eerhardt eerhardt Nov 30, 2018

Yesterday in the API design review, we decided against this approach. See the notes from the review here: https://github.com/dotnet/apireviews/pull/81/files

But basically, the pattern decided will be:

  1. Make a simple constructor/factory that has the most common parameters.
  2. If there are advanced parameters that we don't want exposed in (1), then make another constructor/factory that takes only the nested Arguments class (to be renamed to "Options").

We are going to move away from the Action<Arguments> advancedSettings approach. One main reason is that there can be conflicts between the "simple" parameters and the "advanced" parameters, and it is unclear which one should win. Another reason is that it is simpler and more understandable to construct and pass an object to a method.

I'd say, for this change, let's not move away from where we are going. You don't need to rename Arguments to Options. But let's leave this overload, and remove the "advancedSettings" parameter below instead. #Resolved
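
The conflict described above can be sketched as follows. ReaderOptions and ReaderFactory are hypothetical names used only for illustration; the factories return the resolved options so the outcome is easy to inspect, which is a simplification and not how the real API behaves.

```csharp
using System;

// Hypothetical stand-in for the nested Arguments ("Options") class.
public sealed class ReaderOptions
{
    public bool HasHeader { get; set; }
    public char Separator { get; set; } = '\t';
}

public static class ReaderFactory
{
    // Pattern being moved away from: simple parameters plus a delegate.
    // The delegate runs last here, so it silently overrides hasHeader.
    public static ReaderOptions CreateWithDelegate(bool hasHeader, Action<ReaderOptions> advancedSettings = null)
    {
        var options = new ReaderOptions { HasHeader = hasHeader };
        advancedSettings?.Invoke(options);
        return options;
    }

    // Agreed pattern, overload 1: only the most common parameters.
    public static ReaderOptions Create(bool hasHeader = false, char separator = '\t') =>
        new ReaderOptions { HasHeader = hasHeader, Separator = separator };

    // Agreed pattern, overload 2: only the options object, so there is a single source of truth.
    public static ReaderOptions Create(ReaderOptions options) =>
        options ?? throw new ArgumentNullException(nameof(options));
}
```

In the delegate version, CreateWithDelegate(true, o => o.HasHeader = false) compiles without complaint and the caller's first argument is lost; with the two overloads that situation cannot be expressed.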

Contributor Author

I have left the overload as you suggested and removed the advanced arguments parameter from the constructors.


In reply to: 238009119

@artidoro
Contributor Author

artidoro commented Nov 30, 2018

Thanks @eerhardt! I'll update accordingly #Resolved

@eerhardt
Member

eerhardt commented Nov 30, 2018

Sorry for the "late breaking" change here, but I thought it would be good to note where we landed yesterday, and not have to redo some of this work. #Resolved

@artidoro
Contributor Author

artidoro commented Dec 3, 2018

I am doing the changes and it looks a lot better, it's good to see that!


In reply to: 443354726

/// <param name="advancedSettings">The delegate to set additional settings</param>
/// <param name="path">The path to the file</param>
/// <param name="hasHeader">Whether the file has a header.</param>
/// <param name="separatorChar"> The character used as separator between data points in a row. By default the tab character is used as separator.</param>

Contributor

@Ivanidzo4ka Ivanidzo4ka Dec 5, 2018

extra space, here and in constructor above. #Resolved


var env = catalog.GetEnvironment();

// REVIEW: it is almost always a mistake to have a 'trainable' text loader here.

Contributor

@Ivanidzo4ka Ivanidzo4ka Dec 5, 2018

// REVIEW: it is almost always a mistake to have a 'trainable' text loader here

Does it work well if you set header to true and don't pass dataSample? #Resolved

Contributor Author

Yes, check this test: public void CustomTransformer()


In reply to: 239161201

@@ -283,8 +283,7 @@ internal static void SaveRoleMappings(IHostEnvironment env, IChannel ch, RoleMap
{
// REVIEW: Should really validate the schema here, and consider
// ignoring this stream if it isn't as expected.
-var loader = TextLoader.ReadFile(env, new TextLoader.Arguments(),
-new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile));
+var loader = TextLoader.ReadFile(env, new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile), new TextLoader.Arguments());

Contributor

@Ivanidzo4ka Ivanidzo4ka Dec 5, 2018

, new TextLoader.Arguments()

Can we change the method so that the args parameter defaults to null, and use new TextLoader.Arguments() when it is null? #Resolved

Contributor Author

Sure, I can do that. Do you think I should change that in the constructor of TextLoader too?


In reply to: 239162303

Contributor Author

Yes, I think it makes sense.


In reply to: 239258487
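
A rough sketch of the null-default idea discussed in this thread. LoadArguments and FileReading.ReadFile are hypothetical stand-ins with a simplified signature (a string source and a string result), chosen only so the fallback is easy to observe; they are not the ML.NET API.

```csharp
using System;

// Hypothetical stand-in for TextLoader.Arguments.
public sealed class LoadArguments
{
    public bool HasHeader { get; set; }
    public char Separator { get; set; } = '\t';
}

public static class FileReading
{
    // 'args' is optional: callers that want the defaults simply omit it.
    public static string ReadFile(string source, LoadArguments args = null)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));

        // Fall back to default arguments when none were supplied.
        args = args ?? new LoadArguments();

        // Describe what would be read, so the effect of the fallback is visible.
        return $"{source}|header={args.HasHeader}|sep={(int)args.Separator}";
    }
}
```

Call sites that previously passed new TextLoader.Arguments() explicitly would shrink to a single argument under this approach, while callers with custom settings keep passing their own object.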

@@ -64,30 +64,29 @@ public void TrainSentiment()
{
var env = new MLContext(seed: 1);
// Pipeline
-var loader = TextLoader.ReadFile(env,
-new TextLoader.Arguments()
+var arguemnts = new TextLoader.Arguments()

Contributor

@Ivanidzo4ka Ivanidzo4ka Dec 5, 2018

arguemnts

mistype #Resolved

@@ -369,7 +364,11 @@ public TermLookupTransformer(IHostEnvironment env, IDataView input, IDataView lo
var txtArgs = new TextLoader.Arguments();
bool parsed = CmdParser.ParseArguments(host, "col=Term:TX:0 col=Value:TX:1", txtArgs);

Contributor

@Ivanidzo4ka Ivanidzo4ka Dec 5, 2018

you don't need this anymore #Resolved

@@ -595,7 +595,7 @@ public void RankingLightGBMTest()
public void TestTreeEnsembleCombiner()
{
var dataPath = GetDataPath("breast-cancer.txt");
-var dataView = TextLoader.Create(Env, new TextLoader.Arguments(), new MultiFileSource(dataPath));
+var dataView = TextLoader.ReadFile(Env, new MultiFileSource(dataPath), new TextLoader.Arguments());

Contributor

@Ivanidzo4ka Ivanidzo4ka Dec 5, 2018

ReadFile

Can you use ReadTextFile instead? #ByDesign

Member

@eerhardt eerhardt Dec 5, 2018

Isn't text implied by the name of TextLoader? #Resolved

@@ -617,7 +617,7 @@ public void TestTreeEnsembleCombiner()
public void TestTreeEnsembleCombinerWithCategoricalSplits()
{
var dataPath = GetDataPath("adult.tiny.with-schema.txt");
-var dataView = TextLoader.Create(Env, new TextLoader.Arguments(), new MultiFileSource(dataPath));
+var dataView = TextLoader.ReadFile(Env, new MultiFileSource(dataPath), new TextLoader.Arguments());

Contributor

@Ivanidzo4ka Ivanidzo4ka Dec 5, 2018

ReadFile

ReadTextFile #ByDesign

Contributor

across whole file


In reply to: 239164058

Member

@eerhardt eerhardt Dec 5, 2018

Isn't text implied by the name of TextLoader? #Resolved

@@ -438,7 +438,7 @@ protected void VerifyArgParsing(IHostEnvironment env, string[] strs)

// Note that we don't pass in "args", but pass in a default args so we test
// the auto-schema parsing.
-var loadedData = TextLoader.ReadFile(env, new TextLoader.Arguments(), new MultiFileSource(pathData));
+var loadedData = TextLoader.ReadFile(env, new MultiFileSource(pathData), new TextLoader.Arguments());

Contributor

@Ivanidzo4ka Ivanidzo4ka Dec 5, 2018

ReadFile

ReadTextFile? #ByDesign

Member

@eerhardt eerhardt Dec 5, 2018

Isn't text implied by the name of TextLoader? #Resolved

Contributor Author

@artidoro artidoro Dec 5, 2018

I agree with Eric: since it's a method of TextLoader, it should already be implied that it loads a text file. #Resolved

-var reader = mlContext.Data.TextReader(new TextLoader.Arguments
-{
-Column = new[] {
+var reader = mlContext.Data.TextReader(new[] {

Contributor

@Ivanidzo4ka Ivanidzo4ka Dec 5, 2018

TextReader

Please update https://github.com/dotnet/machinelearning/blob/master/docs/code/MlNetCookBook.md with your changes. #Resolved

Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

:shipit:

/// <param name="args">The arguments to text reader, describing the data schema.</param>
/// <param name="columns">The columns of the schema.</param>
/// <param name="hasHeader">Whether the file has a header.</param>
/// <param name="separatorChar">The character used as separator between data points in a row. By default the tab character is used as separator.</param>
/// <param name="dataSample">The optional location of a data sample.</param>
public static TextLoader TextReader(this DataOperations catalog,

Member

@eerhardt eerhardt Dec 6, 2018

I think we want these named CreateTextReader. See https://github.com/dotnet/apireviews/pull/81/files #Resolved

/// <param name="hasHeader">Whether the file has a header.</param>
/// <param name="separatorChar"> The character used as separator between data points in a row. By default the tab character is used as separator.</param>
/// <param name="fileSource">Specifies a file from which to read.</param>
public static IDataView ReadFile(IHostEnvironment env, IMultiStreamSource fileSource, Column[] columns, bool hasHeader = false, char separatorChar = '\t')

Member

@eerhardt eerhardt Dec 6, 2018

Why do we have both IDataView ReadFromTextFile(this DataOperations catalog, and these methods? I think we should only have 1. #Resolved

@artidoro
Contributor Author

artidoro commented Dec 6, 2018

@eerhardt I just removed the ReadFile method from TextLoader, as you suggested. #Resolved

@artidoro
Contributor Author

artidoro commented Dec 6, 2018

I am actually updating the cookbook again, to reflect that and the new name for the MlContext extension.


In reply to: 444982206

{
var result = new Arguments { Column = columns };
advancedSettings?.Invoke(result);
separatorChars = separatorChars ?? new[] { '\t' };

Member

@eerhardt eerhardt Dec 7, 2018

(nit) separatorChars can never be null, right? This is a private method and only called in 1 spot that ensures it won't be null.

Maybe just add an Assert it won't be null and you can remove the null check here. #Resolved
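
A sketch of the assert-instead-of-null-check idea, under stated assumptions. ArgumentBuilder and MakeSeparators are hypothetical names, and System.Diagnostics.Debug.Assert is used here as a generic stand-in for whatever assertion helper the codebase actually uses.

```csharp
using System;
using System.Diagnostics;

// Hypothetical class; not the ML.NET implementation.
public sealed class ArgumentBuilder
{
    public char[] SeparatorChars { get; }

    // Public entry point: the only caller of MakeSeparators, and it always passes a non-null array.
    public ArgumentBuilder(char separatorChar = '\t')
    {
        SeparatorChars = MakeSeparators(new[] { separatorChar });
    }

    private static char[] MakeSeparators(char[] separatorChars)
    {
        // Internal invariant rather than user input: assert it instead of
        // silently substituting a default with '??'.
        Debug.Assert(separatorChars != null, "Caller must supply separators.");
        return separatorChars;
    }
}
```

The assertion documents the invariant and fails loudly in debug builds if a future caller breaks it, whereas a '??' fallback would hide the mistake.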

-var loader = TextLoader.ReadFile(env, new TextLoader.Arguments(),
-new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile));
+var loader = new TextLoader(env, dataSample: new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile))
+.Read(new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile));

Member

@eerhardt eerhardt Dec 7, 2018

We shouldn't create two instances of new RepositoryStreamWrapper(rep, DirTrainingInfo, RoleMappingFile). #Resolved

Member

It may still be valuable to have an internal TextLoader.ReadFile helper method for our internal code.


In reply to: 239856914
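
A sketch of how such a helper could avoid the duplicate instance. StreamSource, LoadedData, and SketchTextLoader are hypothetical stand-ins (for the stream wrapper, the loaded view, and the loader respectively); the helper is public here only so the sketch is easy to exercise, and would presumably be internal in the library.

```csharp
using System;

// Hypothetical stand-in for RepositoryStreamWrapper / MultiFileSource.
public sealed class StreamSource
{
    public string Name { get; }
    public StreamSource(string name) { Name = name; }
}

// Hypothetical result type that records which sources were used.
public sealed class LoadedData
{
    public StreamSource Sample { get; }
    public StreamSource Source { get; }

    public LoadedData(StreamSource sample, StreamSource source)
    {
        Sample = sample;
        Source = source;
    }
}

public sealed class SketchTextLoader
{
    private readonly StreamSource _dataSample;

    public SketchTextLoader(StreamSource dataSample = null)
    {
        _dataSample = dataSample;
    }

    public LoadedData Read(StreamSource source)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));
        return new LoadedData(_dataSample, source);
    }

    // Helper: takes the source once and uses the same instance
    // both as the data sample and as the input to Read.
    public static LoadedData ReadFile(StreamSource source)
    {
        var loader = new SketchTextLoader(dataSample: source);
        return loader.Read(source);
    }
}
```

A call site then constructs the wrapper a single time, for example SketchTextLoader.ReadFile(new StreamSource("RoleMapping.txt")), and the helper guarantees that the sample and the read input are the same object.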

new TextLoader.Column("Term", DataKind.TX, 0),
new TextLoader.Column("Value", DataKind.TX, 1)
},
dataSample: new MultiFileSource(filename)

Member

@eerhardt eerhardt Dec 7, 2018

Same comment here about creating duplicate objects - new MultiFileSource(filename) #Resolved

Member

@eerhardt eerhardt left a comment

Looks good. Just a couple minor comments to clean up.

@artidoro artidoro self-assigned this Dec 7, 2018
@artidoro artidoro added the API Issues pertaining the friendly API label Dec 7, 2018
@artidoro artidoro merged commit 14c7a47 into dotnet:master Dec 7, 2018
@artidoro artidoro deleted the textloader branch January 5, 2019 00:01
@ghost ghost locked as resolved and limited conversation to collaborators Mar 26, 2022