Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable TensorFlowTransform to work with pre-trained models that are not frozen #853

Merged
merged 54 commits into from
Sep 25, 2018

Conversation

abgoswam
Copy link
Member

@abgoswam abgoswam commented Sep 7, 2018

Fixes #784

  1. Added ability in the TensorFlowTransform to to handle TensorFlow's SavedModel format
  2. For doing this, we use a temporary directory for TF SavedModel files. It looks like this is a TensorFlow requirement. TensorFlow requires the model to be in a directory and doesn't provide a mechanism for loading from some sort of stream/memory representation.
  3. The temp directory is created with appropriate ACL's for high-priviledge processes. We take care of deleting the temporary directory
  4. Added unit tests.

@yaeldekel
Copy link

yaeldekel commented Sep 7, 2018

Should the model files go in the separate repo that we use for the other TensorFlow models? #Resolved

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Sep 7, 2018

// Licensed to the .NET Foundation under one or more agreements.

Can you merge it with master?
I've just push changes in this file to introduce estimator and transformer objects #Resolved


Refers to: src/Microsoft.ML.TensorFlow/TensorflowTransform.cs:1 in 57508d3. [](commit_id = 57508d3, deletion_comment = False)

}

public void Save(ModelSaveContext ctx)
{
_host.AssertValue(ctx);
ctx.CheckAtModel();
ctx.SetVersionInfo(GetVersionInfo());
ctx.Writer.Write(_isFrozen ? 1 : 0);
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write [](start = 27, length = 5)

I believe you can write bool #Resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx.Writer.WriteBoolByte(IsFrozen);


In reply to: 216071878 [](ancestors = 216071878)

}

public void Save(ModelSaveContext ctx)
{
_host.AssertValue(ctx);
ctx.CheckAtModel();
ctx.SetVersionInfo(GetVersionInfo());
ctx.Writer.Write(_isFrozen ? 1 : 0);
if (_isFrozen)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_isFrozen [](start = 20, length = 9)

Please remain saving order, so you can read old version of model files. #Resolved

@@ -235,6 +235,8 @@ private string GetBuildPrefix()
}

[Fact(Skip = "Execute this test if you want to regenerate ep-list and _manifest.json")]
// [Fact]
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// [Fact] [](start = 7, length = 10)

not necessary #Resolved

public string[] OutputColumns;
}

private readonly IHost _host;
private const string RegistrationName = "TensorFlowTransform";

internal readonly TFSession Session;
internal readonly bool IsFrozen;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsFrozen [](start = 31, length = 8)

You don't need them. All you care about is session, and you already have it. This is just details of how session been loaded. #Resolved

return new TensorFlowTransform(env, modelBytes, io.Item1, io.Item2, (isFrozen == 1));
}
else
{
Copy link
Member Author

@abgoswam abgoswam Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 11, length = 2)

[observation] The Train() method of the LearningPipeline API ends up calling Save() and Create() multiple times.

For un-frozen models we need a particular directory structure so we can create the TFSession correctly using the TFSession.FromSavedModel API (line 215)

In this call to Create() we do the following :
(a) load the binary stream
(b) write the binary stream out to a zip file
(c) extract the zip file to a temporary folder TempSavedModelDirName (_savedModel)
(d) create the session
(e) delete the zip file

We do not delete the temporary folder because in a subsequent call to Save() we are writing out the contents of the temporary folder as a byte stream.

So for this scenario, what we have currently is that we define a unique location for the temporary folder -- and it gets cleaned up only when required, inside the Create() call. Line 139.

Is there some way to ensure the temporary folder is always cleaned up ? Or perhaps some way to not have a temporary folder at all.

@ericstj @zeahmed @yaeldekel #Resolved

Copy link
Member

@ericstj ericstj Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should avoid the temporary zip file. If you can get a seekable stream from ctx, just pass that into the ZipArchive constructor: https://msdn.microsoft.com/es-es/library/system.io.compression.ziparchive.ziparchive(v=vs.110).aspx. It looks like you can do this by accessing the BaseStream on the binary reader passed into the TryLoadBinaryStream action.

You should also avoid allocating an extra byte array for the zip and instead do all the reading within the callback from TryLoadBinaryStream, since it looks like TryLoadBinaryStream will close the stream:

. So within this callback, new up the archive, then call archive.ExtractToDirectory extension method: https://docs.microsoft.com/en-us/dotnet/api/system.io.compression.zipfileextensions.extracttodirectory?view=netframework-4.7.2.

I think you still need to have a temporary directory for the model files themselves. It looks like this is a requirement of TensorFlow https://github.com/tensorflow/tensorflow/blob/3be04971716fcaf0c11ad9262e60efa428553e14/tensorflow/c/c_api.h#L1367. It requires the model to be in a directory and doesn't provide a mechanism for loading from some sort of stream/memory representation. So you'll still need a temp path that you'll have to clean up.

Ideally you should provide a way for this path to be specified, rather than always call GetTempPath behind the scenes. I think ML.NET has a pattern for this, if not, perhaps consider a constructor overload that allows for specification of a temp path root. You may also consider some way to say "don't clean up the temp path".

Typically with things like this you should treat the temporary folder as state managed by the object and then cleaned up in the dispose/finalizer.


In reply to: 217123233 [](ancestors = 217123233)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried both of the suggestions above, but they seem to have some limitations with respect to this transform. I have since modified the logic of load&restore to do things in-memory.

Am documenting the issues hit with the above suggestion:

(1) the ZipArchive approach looked promising to avoid the temporary zip file. However it works on the entire stream, rather than on a subset of the stream.
(2) currently the TFTransform does not dispose the TFSession object. So attempts to cleanup the temp folder fail with 'file still in use' errors. The approach of using unique location for the temporary folder is problematic in general (e.g. during CV or if pipeline contains multiple TFTransforms)

I have modified the logic so we do the ML.NET serialization/de-serialization using in-memory data structures.


In reply to: 217145506 [](ancestors = 217145506,217123233)

Copy link
Member

@ericstj ericstj Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should file an issue on #2.

I think this approach looks much better than zip-in-a-zip approach before. One change I'd suggest is that instead of reading to byte array and storing in memory, you create the file within the callback (var fs = new FileStream(...)) and use br.BaseStream.CopyTo(fs, fileLength); To do this you'll need to replace readbytearray/writebytearray with a your own calls that store file length, followed by the stream copy.

One thing I notice that could be problematic for the frozen case above (not related to this change) is that it seems to use Graph.Import to import a entire model as a managed byte array. I can imagine that this will potentially hit the upper limit of managed array sizes for large models. I see that TF's graph import functionality expects a single buffer (https://github.com/tensorflow/tensorflow/blob/3be04971716fcaf0c11ad9262e60efa428553e14/tensorflow/c/c_api.h#L1018-L1020).


In reply to: 217447210 [](ancestors = 217447210,217145506,217123233)

Copy link
Member Author

@abgoswam abgoswam Sep 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for reading to byte array and storing in memory is to handle the scenario where we saw the sequence of Save1() -> Create1() -> Save2() -> Create2() being called

When Create1() is called, we load up the contents into a dictionary. So when Save2() is called we use the dictionary to write out the contents.

Because of issue #906 (not being able to clean-up unmanaged resources ) I am not sure we can do away with the in-memory approach. Thoughts ?


In reply to: 217462971 [](ancestors = 217462971,217447210,217145506,217123233)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adopted the stream copy based approach instead of reading to byte array

also added a finalizer that closes the session (if it isn’t closed) and deletes the temporary directory


In reply to: 217545311 [](ancestors = 217545311,217462971,217447210,217145506,217123233)

Copy link

@yaeldekel yaeldekel Sep 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eerhardt , @danmosemsft , would this be a security risk? To load the TensorFlow model here, we create a temporary directory and copy some files to it, and then load the model from that directory. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

followed the recommendation of the .NET security team to address this risk


In reply to: 218159541 [](ancestors = 218159541)

private const string RegistrationName = "TensorFlowTransform";
private const string TempSavedModelDirName = "MLNET_TensorFlowTransform_SavedModel";
private const string TempSavedModelZipName = "SavedModel.zip";
Copy link
Contributor

@zeahmed zeahmed Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private const string TempSavedModelZipName = "SavedModel.zip"; [](start = 8, length = 62)

What will happen if I am working on two different pipelines with TensorflowTransforms executing in parallel? I think having same name for .zip (SavedModel.zip) and model_dir (MLNET_TensorFlowTransform_SavedModel)will create a problem.

Let see what does @eric suggest. However, when I did initially, my approach was as follows

  • In the constructor, load all the files in un-frozen model into a dictionary (called model dictionary) where keys are the file name and value is the stream of file.
  • Load the model from the location that user provided.
  • When saving, save the streams in the model dictionary into ML.Net stream.
  • When loading from ML.Net stream, load the streams into the model dictionary, create temporary folder and write all the stream into the folder. Load the model into TF using that temporary folder. Delete the model once loading is done.

#Resolved

Copy link
Contributor

@zeahmed zeahmed Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This obviously has a limitation on the size of model.


In reply to: 217133381 [](ancestors = 217133381)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I have modified the logic so we do the load&restore in-memory (using a dictionary).


In reply to: 217134254 [](ancestors = 217134254,217133381)

subDirInfo.Delete(true);
}
}
internal static TFSession GetSession(IHostEnvironment env, string modelPath)
Copy link

@yaeldekel yaeldekel Sep 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal [](start = 8, length = 8)

nit: Add an empty line between methods. #Closed

@yaeldekel
Copy link

    public void Dispose()

@TomFinley or @Zruty0 might be able to give more insight. I think that with the Estimator/Transformer API it might cause problems: The TensorFlowEstimator instantiates a TensorFlowTransform in its constructor, which is returned to the user when Fit() is called. The estimator still owns the TensorFlowTransform though, and it will return the same one if Fit() is called again, so we wouldn't want someone to dispose it.


In reply to: 423241990 [](ancestors = 423241990,423236224)


Refers to: src/Microsoft.ML.TensorFlow/TensorflowTransform.cs:401 in ae672d6. [](commit_id = ae672d6, deletion_comment = True)

@ericstj
Copy link
Member

ericstj commented Sep 20, 2018

    public void Dispose()

Is the Estimator the better place for this state to live? At the end of the day, the directory is representing the "training state" and it needs to be managed by some object. That object should be responsible for cleaning it up and should have a definite lifetime.


In reply to: 423310984 [](ancestors = 423310984,423241990,423236224)


Refers to: src/Microsoft.ML.TensorFlow/TensorflowTransform.cs:401 in ae672d6. [](commit_id = ae672d6, deletion_comment = True)

<ItemGroup>
<PackageReference Include="System.IO.FileSystem.AccessControl" Version="4.5.0" />
<PackageReference Include="System.Security.Principal.Windows" Version="4.5.0" />
</ItemGroup>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A) Do we really need them? Because presence of Principal.Windows, for multiplatform code is suspicious.
B) If we really need them you need to make them part of nuget dependency, and also we need to modify https://github.com/dotnet/machinelearning/blob/master/build/Dependencies.props instead of specify versions here. #Resolved

Copy link
Member Author

@abgoswam abgoswam Sep 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use this as a security measure on windows platform. We are using temp directory for loading TFSession for SavedModels. Models are considered executable code, so for this temp directory we need to ACL it in the high-rights process so low-rights process can’t access it.

I will look into making it part of nuget dependency


In reply to: 219572600 [](ancestors = 219572600)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use this one as reference implementation https://github.com/dotnet/machinelearning/blob/master/pkg/Microsoft.ML.ImageAnalytics/Microsoft.ML.ImageAnalytics.nupkgproj


In reply to: 219577320 [](ancestors = 219577320,219572600)

return CheckFileAndRead(env, modelPath);
}

private static TFSession CheckFileAndRead(IHostEnvironment env, string modelFile)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private static TFSession CheckFileAndRead(IHostEnvironment env, string modelFile) [](start = 8, length = 81)

can you get rid of this method? no one using it any more except GetSession, and I think it would be much better to move checks in start of GetSesssion #Resolved

return attr.HasFlag(FileAttributes.Directory);
}

internal static void CreateTempDirectory(string tempDirPath)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateTempDirectory [](start = 29, length = 19)

What's the main reason for this? Why we treat Windows differently?
If this is necessary, why it's part of tensorflow only and not part of utility which we use among whole project? #Resolved

Copy link
Member Author

@abgoswam abgoswam Sep 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(replied on other comment about its necessity)

Currently used in TFTransform only to protect SavedModels which are considered executable code by .NET security team


In reply to: 219577130 [](ancestors = 219577130)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put comment with description why we doing this?
I would be really tempted to remove this code if I wouldn't know story behind it.


In reply to: 219578542 [](ancestors = 219578542,219577130)

@abgoswam
Copy link
Member Author

abgoswam commented Sep 21, 2018

    public void Dispose()

As suggested marking as Pending for now. We do not want block this PR on this. Will track it as part of issue #906


In reply to: 423320325 [](ancestors = 423320325,423310984,423241990,423236224)


Refers to: src/Microsoft.ML.TensorFlow/TensorflowTransform.cs:401 in ae672d6. [](commit_id = ae672d6, deletion_comment = True)

/// Given a folder path, create it with proper ACL if it doesn't exist.
/// Fails if the folder name is empty, or can't create the folder.
/// </summary>
internal static void CreateFolderWithAclIfNotExists(IHostEnvironment env, string folder)
Copy link
Contributor

@zeahmed zeahmed Sep 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateFolderWithAclIfNotExists [](start = 29, length = 30)

How does creation of ACL folder work in Linux/Mac? #Closed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catches the PlatformNotSupported exception and does the normal create.


In reply to: 219903672 [](ancestors = 219903672)

Copy link
Contributor

@zeahmed zeahmed Sep 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Eric, got it.


In reply to: 219906478 [](ancestors = 219906478,219903672)

Copy link
Contributor

@zeahmed zeahmed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

<PackageReference Include="System.IO.FileSystem.AccessControl" Version="$(SystemIOFileSystemAccessControl)" />
<PackageReference Include="System.Security.Principal.Windows" Version="$(SystemSecurityPrincipalWindows)" />
<ProjectReference Include="../Microsoft.ML/Microsoft.ML.nupkgproj" />
<ProjectReference Include="../Microsoft.ML.TensorFlow.Redist/Microsoft.ML.TensorFlow.Redist.nupkgproj" />
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, but please use spaces #Resolved

Copy link

@yaeldekel yaeldekel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

}

internal static TFSession GetSession(IHostEnvironment env, string modelPath)
{
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contracts.Check(env, nameof(env); #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added this check.

btw i believe you meant Contracts.Check(env != null, nameof(env));


In reply to: 220031028 [](ancestors = 220031028)

{
env.CheckValue(exportDirSavedModel, nameof(exportDirSavedModel));
var sessionOptions = new TFSessionOptions();
var exportDir = exportDirSavedModel;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var exportDir = exportDirSavedModel; [](start = 11, length = 37)

not needed. #Resolved

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@@ -2,7 +2,7 @@

<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>netcoreapp2.1</TargetFramework>
<TargetFramework>netcoreapp2.0</TargetFramework>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be netcoreapp2.1. Note that netcoreapp and netstandard are different things. We want netcoreapp2.1 for any and all executables and tests. We want netstandard2.0 for any libraries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, we don't even want this in the lib folder as it is just a commandline tool.

@abgoswam abgoswam deleted the abgoswam/tf_savedmodel branch November 2, 2018 21:07
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants