Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParquetLoader - Save Schema to context to support loading the model without files. #472

Merged
merged 5 commits into from
Jul 18, 2018

Conversation

tyclintw
Copy link
Contributor

@tyclintw tyclintw commented Jul 2, 2018

This changes address issue #471

The Schema is added to the model context when saving. The context model can then be properly loaded without the need of additional files in order to inspect the schema. A file is still required on initial construction and an error will be thrown if a RowCursor is created without a file.

//verWrittenCur: 0x00010001, // Initial
verWrittenCur: 0x00010002, // Add Schema to Model Context
verReadableCur: 0x00010002,
verWeCanReadBack: 0x00010002,
Copy link
Contributor

@TomFinley TomFinley Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already released this, so I think breaking backwards compatibility would be a bad thing. Is that really necessary to do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we did our end to end testing with kmeans we found that models were being loaded without files just to look at the schema. Since we require a file with this loader, this caused a failure on distributed models. As such, this is necessary to get distributed models to work properly and appears to be a requirement as documented in BinaryLoader.cs line 916.


In reply to: 199610470 [](ancestors = 199610470)

Copy link
Contributor Author

@tyclintw tyclintw Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I see what you're asking. I'm fixing so it's backwards compatible.


In reply to: 200417526 [](ancestors = 200417526,199610470)

throw new InvalidDataException("Cannot read Parquet file", ex);
}

_columnsLoaded = InitColumns(schemaDataSet);
Copy link
Contributor

@TomFinley TomFinley Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_columnsLoaded = InitColumns(schemaDataSet); [](start = 16, length = 44)

What happens if you load a schema from the model, but then the parquet loader does not "agree" with that schema? I don't see how this case is handled here. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added schema check.


In reply to: 199612477 [](ancestors = 199612477)

throw _host.ExceptDecode();
BinaryLoader loader = null;
var strm = new MemoryStream(buffer, writable: false);
loader = new BinaryLoader(_host, new BinaryLoader.Arguments(), strm);
Copy link
Contributor

@TomFinley TomFinley Jul 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BinaryLoader loader = null;
loader = new BinaryLoader(_host, new BinaryLoader.Arguments(), strm);

Probably a bit more comprehensible as

var loader = new BinaryLoader(_host, new BinaryLoader.Arguments(), strm);
``` #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


In reply to: 201382279 [](ancestors = 201382279)

@TomFinley
Copy link
Contributor

Hi @tyclintw sorry for the delay reviewing, was pretty sick last week.

/// <param name="schema">The schema</param>
/// <param name="otherSchema">The schema to compare against</param>
/// <returns>true if the schema columns match</returns>
public static bool EqualColumns(this ISchema schema, ISchema otherSchema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EqualColumns [](start = 27, length = 12)

I don't know that this is a terribly common case. Maybe move this into the Parquet loader. If it turns out that lots of code uses stuff like this we can probably elevate it, but honestly I'm not sure this sort of thing happens that often.

Copy link
Contributor Author

@tyclintw tyclintw Jul 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to ParquetLoader


In reply to: 201385299 [](ancestors = 201385299)

}
else if (!Schema.EqualColumns(streamSchema))
{
throw _host.Except("File schema does not match the model schema");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like the idea of a hard fail. Can we make this like BinaryLoader? From memory:

  1. If there are no files, it loads the internal schema (what you are doing here more or less),
  2. If there is a file, it uses that instead, and ignores the internal schema.

I might prefer to be consistent. Is there any harm with that scheme?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a hard fail though, I might prefer that it be more specific. That is just saying "they're not the same" isn't as actionable feedback as we can give. Maybe if it somehow printed out where it differed that would work?


In reply to: 201386238 [](ancestors = 201386238)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to objection to your request. Falling in line with BinaryLoader.


In reply to: 201386608 [](ancestors = 201386608,201386238)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to objection

A Freudian slip!! :D :D


In reply to: 201761047 [](ancestors = 201761047,201386608,201386238)

// This ensures that the two schemas map names to the same column indices.
int col1, col2;
bool f1 = schema.TryGetColumnIndex(name1, out col1);
bool f2 = otherSchema.TryGetColumnIndex(name2, out col2);
Copy link
Contributor

@TomFinley TomFinley Jul 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incidentally this can be simplified as out int col rather than declaring the integers above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function has been removed. Thank you for the callout on convention.


In reply to: 201388092 [](ancestors = 201388092)


private static VersionInfo GetVersionInfo()
{
return new VersionInfo(
modelSignature: ModelSignature,
verWrittenCur: 0x00010001, // Initial
//verWrittenCur: 0x00010001, // Initial
verWrittenCur: 0x00010002, // Add Schema to Model Context
verReadableCur: 0x00010001,
Copy link
Contributor

@TomFinley TomFinley Jul 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0x00010001 [](start = 32, length = 10)

Hi Tyler, so is this correct? It may be I just don't understand the scheme. Are you certain that a model written in this latest "2" format could be interpreted (if lossily) by the "1" reader that existed? Maybe it is, since it's just another file, not part of the stream written to the model ... just want to double check that this was intentional was all. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since previous version functions without the Schema it can just be dropped and function as normal. However I ran some tests due to your concern an there was a failure due to the header sizes being different. As such, I'm just going to set to 2.


In reply to: 201396161 [](ancestors = 201396161)

@tyclintw
Copy link
Contributor Author

No problem.


In reply to: 403859077 [](ancestors = 403859077)

Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:


private static VersionInfo GetVersionInfo()
{
return new VersionInfo(
modelSignature: ModelSignature,
verWrittenCur: 0x00010001, // Initial
verReadableCur: 0x00010001,
//verWrittenCur: 0x00010001, // Initial
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this commented out code provide value? Can it be removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we generally prefer to avoid checking in commented in code, the version history of past versions is different, since we want to know why we had to bump the version number each time, and we like to have that first version in there. See e.g., the text loader model, which is the most extreme example I am aware of.

_columnsLoaded = InitColumns(schemaDataSet);
Schema = CreateSchema(_host, _columnsLoaded);
}
else if (Schema == null && files.Count == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is files.Count == 0 redundant because we are checking if (files.Count > 0) above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm certainly...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tyclintw hope you don't mind, pushed on your branch just so we can close this out more rapidly.

using (var strm = new MemoryStream())
{
var allColumns = Enumerable.Range(0, Schema.ColumnCount).ToArray();
saver.SaveData(strm, noRows, allColumns);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this possible to refactor? It seems inefficient to first save it to a MemoryStream, and then write that memory stream out. Can we just do it in a single step?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this specific case not necessarily, since the binary saver requires a seekable writable stream. (E.g., it writes data, then seeks back to the header so it can store in the header the offsets of various records in the file.) The repository writer, on the other hand, is based on a zip archive, which AFAIK does not provide seekable writable streams.

@TomFinley TomFinley merged commit 5e0a40e into dotnet:master Jul 18, 2018
eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
…ithout files. (dotnet#472)

* Save Schema to context to support loading the model without files.
* Use the input file's schema if the file is available.
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants