
Fixed the TextTransform bug where chargrams were being computed differently when using with/without word tokenizer. #548

Merged · 8 commits into dotnet:master · Jul 26, 2018

Conversation

zeahmed
Contributor

@zeahmed zeahmed commented Jul 17, 2018

This PR fixes #530. The cause of the problem was the word tokenizer being applied before the char tokenizer, causing scalar-valued text (e.g. This is a cat) to become a vector (e.g. <This, is, a, cat>).

Previously, the char tokenizer treated every vector item as a separate text item (e.g. computing chargrams on each item by placing start and end markers as in <STX>token<ETX>, instead of taking This is a cat as a single text item).

The fix is in CharTokenizeTransform. The CharTokenizeTransform can take either a scalar or a vector column as input. The processing of chargrams is done as follows.

  • If the input column is a scalar with text type then chargrams are computed by prepending <STX> and appending <ETX> characters at the start and at the end of the text respectively. For example, if the input is This is a cat then chargrams will be computed on <STX>This is a cat<ETX>.

  • If the input column is a vector with text type (it could be the result of concatenating several text columns together, or of applying the word tokenizer before the char tokenizer) then chargrams will be computed by prepending <STX> and appending <ETX> characters at the start and at the end of the vector respectively. Also, a <US> character is inserted between adjacent items in the vector. For example, if the input is <This, is, a, cat> then chargrams will be computed on <STX>This<US>is<US>a<US>cat<ETX>.

  • To be backward compatible, the CharTokenizerTransform version was bumped and support for loading models saved with the previous version was added.

Moving forward, the chargrams will be computed as follows:

  • If stop word removal is requested, chargrams will be computed after StopwordRemovalTransform is applied, e.g. <STX>This<US>is<US>a<US>cat<ETX>.
  • Otherwise, the raw text after text normalization will be used for chargram computation, e.g. This is a cat.
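The two cases described above can be sketched in a few lines of Python (an illustrative sketch, not the actual C# implementation; `chargram_input` is a hypothetical helper, and STX/ETX/US stand for the control characters U+0002, U+0003, U+001F):

```python
STX, ETX, US = "\x02", "\x03", "\x1f"

def chargram_input(column):
    """Build the text the char tokenizer would see (illustrative sketch)."""
    if isinstance(column, str):
        # Scalar text: wrap the whole string in start/end markers.
        return STX + column + ETX
    # Vector of text: join items with the unit separator, then wrap.
    return STX + US.join(column) + ETX

print(chargram_input("This is a cat"))             # '\x02This is a cat\x03'
print(chargram_input(["This", "is", "a", "cat"]))  # '\x02This\x1fis\x1fa\x1fcat\x03'
```

Chargrams are then taken over these marked-up strings, so both the scalar and vector paths see a single contiguous text.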

@Ivanidzo4ka
Contributor

Ivanidzo4ka commented Jul 17, 2018

            verWrittenCur: 0x00010001, // Initial

Shall we bump the version if we change the transform implementation? @TomFinley #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/CharTokenizeTransform.cs:71 in 868a72f. [](commit_id = 868a72f, deletion_comment = False)

values[index++] = TextStartMarker;

if (i > 0)
values[index++] = ' ';
Contributor

@Ivanidzo4ka Ivanidzo4ka Jul 17, 2018

' ' [](start = 50, length = 3)

I'm struggling a bit to understand this solution.
Correct me if I'm wrong:
the sentence gets broken into pieces by the word extractor, and you assume it was broken on the space character, so you just add a space back into the chargram.

You distinguish the processed/not-processed-by-wordgram case by the size of your vector; is that always true?
You handle only space; what if I broke words on a comma, tab, or semicolon?
#Resolved

Contributor
Right. Space was not selected as the breaking character between tokens in a vector explicitly because it was quite likely to occur within individual tokens. This is why we chose a character unlikely to occur in text (vs. one of the most frequently observed characters) in the first place. :)


In reply to: 203213571 [](ancestors = 203213571)

Contributor Author

@zeahmed zeahmed Jul 18, 2018

So, there is no way to get the token that was used for tokenization? #Resolved

Contributor Author

@zeahmed zeahmed Jul 18, 2018

In reality, if a statistical technique for word tokenization is applied, then there is no reliable way to recover the actual text after it has been tokenized.

So, the use of space is just an approximation. Otherwise, it is not possible to recover the actual text string. #Resolved

Contributor

@TomFinley TomFinley Jul 18, 2018

So, there is no way to get the token that was used for tokenization?

No. Indeed you have absolutely no idea whether the tokenization scheme involved breaking on single characters at all -- a vector of text could have come from anywhere. Indeed, more elaborate tokenization schemes exist, that involve an actual language model and so are considerably more elaborate than a simple string.Split(char), which of course is the assumption made by this "fix."

I suggest that we not get into the business of trying to guess how the data fed into a transform was produced, since doing so degrades the system. It would do more harm than good: we've built a modular system of transforms and data pipelines, but the consequence of a modular system is that you must accept that your data could come from literally anywhere.

What I believe is a reasonable compromise is that some sort of special character be used instead of space... perhaps a character not used often in text, but that is nonetheless specially reserved to indicate the start and end of a given block of text.

Happily, I believe that was the state of affairs of this transform in its existing implementation, before it was "fixed." 😄 #Resolved

Contributor

@TomFinley TomFinley left a comment

Sorry, I misclicked and marked this as approve, which is not correct.

Contributor

@TomFinley TomFinley left a comment

Thanks @zeahmed. This code is problematic. In particular, the fact that a vector of text is treated as a distinct scenario from a scalar input text is actually deliberate, as we may guess from the fact that it was coded that way.

The solution to the problem of "make transform X equivalent to whether transform Y was applied ahead of it or not," is for the user to simply not freakin' apply transform Y. That is unambiguously the correct solution, since it will also be more clear, and more efficient.

@zeahmed
Contributor Author

zeahmed commented Jul 18, 2018

@TomFinley, actually that's not the solution in the context of TextTransform...:)

In TextTransform, it is quite often desirable to compute both wordgrams and chargrams. If wordgrams are being computed, then the text will be tokenized and the character tokenizer will see a vector of text instead of just scalar text.

Text will also be tokenized when the "stop words removal" option is set, where the user wants to remove stop words before computing chargrams.

So, the character tokenizer will compute chargrams that are not valid chargrams (cf. #530). The first case can be avoided by applying the character tokenizer directly on the original text column (before the word tokenizer). However, the 2nd case cannot be avoided. That's the only solution I see right now. #Closed

@TomFinley
Contributor

            verWrittenCur: 0x00010001, // Initial

Certainly if we were to change the transform along these lines we would have to do that. Backcompat is important.


In reply to: 405757792 [](ancestors = 405757792)


Refers to: src/Microsoft.ML.Transforms/Text/CharTokenizeTransform.cs:71 in 868a72f. [](commit_id = 868a72f, deletion_comment = False)

@justinormont
Contributor

justinormont commented Jul 18, 2018

The original issue was: the chargrams produced differed depending on if n-grams were also produced. This oddity should be fixed.

I'm unsure of the order in which we run the transforms, but I assume we can run as:

  1. Language Detection [not often used]
  2. Text normalization (lowercase, diacritics, punctuation, numbers)
  3. Word Tokenization
  4. Stopword removal on tokenized text
  5. Character tokenizer (on #4 if stop word removal is enabled, otherwise on #2)
  6. Hashing (TF) / dictionary (TF, IDF, TF-IDF) of n-grams & Lp vector normalizer (LInf, L1 or L2)
  7. Hashing (TF) / dictionary (TF, IDF, TF-IDF) of chargrams & Lp vector normalizer (LInf, L1 or L2)
  8. Concat n-gram & chargram vectors
  9. Return #8 & optionally #3 for tokens=+

Let me know if I'm wrong...
Currently, the character tokenizer is reading from #4 when n-grams are enabled, and #2 when not. We can improve on this.

I think to produce consistency we can have the character tokenizer read from #4 when the stop words are enabled, and #2 when stop words are not enabled. [Option A]

This still produces the artifact of <STX>This<ETX><STX>is<ETX><STX>a<ETX><STX>cat<ETX> when the stop word remover is enabled. I'm ok leaving this specific oddity in place, as when you turn on the stop word remover, the user is expecting the output tokens to change (though this exact change is odd). The downside to this method is the wasted char of having both <ETX> & <STX> between tokens, which is lost signal. For instance from This is a cat, a 5-char-gram should get a cat but instead we produce <ETX><STX>cat. Perhaps replacing <ETX><STX> w/ <US> to reduce this waste.

One stronger way to produce consistency is to always have the character tokenizer read from #2. The effect of this is that the user's choice for stop words is ignored. [Option B]

I prefer [Option A]. Thoughts @Ivanidzo4ka, @TomFinley, @zeahmed?

#Resolved
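The lost-signal argument above can be made concrete with a small Python sketch (illustrative only; `char_ngrams` is a hypothetical helper, using the literal ETX/STX/US characters):

```python
ETX_STX = "\x03\x02"  # old scheme: end marker + start marker between tokens
US = "\x1f"           # proposed scheme: a single unit separator

def char_ngrams(text, n):
    """All contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tokens = ["This", "is", "a", "cat"]
old = ETX_STX.join(tokens)  # 'This\x03\x02is\x03\x02a\x03\x02cat'
new = US.join(tokens)       # 'This\x1fis\x1fa\x1fcat'

# With one separator, a single 5-gram still spans 'a' plus all of 'cat';
# with two marker chars between tokens, no 5-gram covers both.
print("a\x1fcat" in char_ngrams(new, 5))                                           # True
print(any(g.startswith("a") and g.endswith("cat") for g in char_ngrams(old, 5)))   # False
```

This is exactly the `a cat` vs. `<ETX><STX>cat` example in the comment: the double marker burns one character of every n-gram window that crosses a token boundary.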

@Ivanidzo4ka
Contributor

I'm not so familiar with the Text transform pipeline, but I would prefer not to apply the char tokenizer on top of the word tokenizer. They should be treated independently and consume the same data source.


In reply to: 405858371 [](ancestors = 405858371)

@TomFinley
Contributor

TomFinley commented Jul 18, 2018

Perhaps replacing <ETX><STX> w/ <US> to reduce this waste.

Hi @justinormont , a single character delimiter to represent a transition vs. two seems fine and beneficial to me. The unit separator seems like a fine choice, considering its description. Of course we must ensure that we don't break backcompat of previously serialized and composed models.

One stronger way to produce consistency, is to always have the character tokenizer read from #2

Then you have one path that has stopwords removed, and another that does not. I don't see how that is more consistent. I don't even know what we're arguing is inconsistent in the current scheme.

I'm not so familiar with the Text transform pipeline, but I would prefer not to apply the char tokenizer on top of the word tokenizer. They should be treated independently and consume the same data source.

@Ivanidzo4ka , why? I wasn't involved in engineering this, but it seems like the right choice for the n-grams for words or chars to happen over the post-processed data (including stopword removal, if requested), rather than prior to processing. #Resolved

@zeahmed
Contributor Author

zeahmed commented Jul 19, 2018

Hi @justinormont, a single character delimiter to represent a transition vs. two seems fine and beneficial to me. The unit separator seems like a fine choice, considering its description. Of course we must ensure that we don't break backcompat of previously serialized and composed models.

I think this is in my current fix in this PR except that

  • space is being used instead of the unit separator character.
  • data is always read from #4 instead of making a choice based on stop word removal.

The only problem I see from the previous comments is whether it is good to do this fix in CharTokenizeTransform or not. In my opinion, it should be done in TextTransform. However, that does not seem possible unless there is some kind of string concat transform that can concatenate strings with a separator (let me know if there is a way to do that).

@justinormont, yes, [Option A] does seem a reasonable approach with <ETX><STX> replaced by <US>. Let's have total agreement on it before I proceed with further changes. I also have to look at the backcompat issue.

Thanks @justinormont , @Ivanidzo4ka and @TomFinley for your useful insights, comments and feedback.


In reply to: 406017378 [](ancestors = 406017378)

@TomFinley
Contributor

TomFinley commented Jul 19, 2018

I think this is in my current fix in this PR except that

  • space is being used instead of unit separator character.

Right. That's a big difference, and a poor choice since the space is likely to occur in actual text. <US> is better from that perspective.

Not to harp on this, but the major problem with your PR of course is that you broke backwards compatibility for models. #Resolved

@zeahmed
Contributor Author

zeahmed commented Jul 19, 2018

The original issue was: the chargrams produced differed depending on if n-grams were also produced. This oddity should be fixed.

I'm unsure of the order in which we run the transforms, but I assume we can run as:

  1. Language Detection [not often used]
  2. Text normalization (lowercase, diacritics, punctuation, numbers)
  3. Word Tokenization
  4. Stopword removal on tokenized text
  5. Character tokenizer (on #4 if stop word removal is enabled, otherwise on #2)
  6. Hashing (TF) / dictionary (TF, IDF, TF-IDF) of n-grams & Lp vector normalizer (LInf, L1 or L2)
  7. Hashing (TF) / dictionary (TF, IDF, TF-IDF) of chargrams & Lp vector normalizer (LInf, L1 or L2)
  8. Concat n-gram & chargram vectors
  9. Return #8 & optionally #3 for tokens=+

Let me know if I'm wrong...

Currently, if word tokenization is requested, TextNormalization is applied after word tokenization. Should it be reversed? Is there any significance to TextNormalization being more effective when applied after word tokenization?


In reply to: 405858371 [](ancestors = 405858371)

@zeahmed
Contributor Author

zeahmed commented Jul 19, 2018

Following are the changes in the new commit.

  • Bumped the version in CharTokenizerTransform so that the new changes are backward compatible.
  • Char tokenizer reads from the output of NormalizedTextTransform when stop word removal is not requested otherwise from the output of StopwordRemovalTransform.
  • Moved TextNormalization before WordTokenizer. Let me know if there is any adverse effect of this.
    #Resolved


dst = new VBuffer<ushort>(len, values, dst.Indices);
};
return CurrentModelVersion < version.VerReadableCur ? valueGetterOldVersion : valueGetterCurrentVersion;
Contributor Author

@zeahmed zeahmed Jul 19, 2018

@TomFinley and @Ivanidzo4ka, is this the correct way to maintain backward compatibility? #Resolved

Contributor

Not exactly. For one thing you've stored this in a static variable, which is not correct, since this behavior will be specific to the instance. So if you read two instances of the transform, one with the old version, one with the new version, you'd break one or the other (depending on the order in which you read them). As a general rule, not just in ML.NET but in any coding anywhere, storing instance specific data inside a static variable is generally a bad idea. :)

For the sake of resaving an old model to the new format, you ought to store a boolean variable in the new version of the file, to control whether it is the old or new version. So the saver will have ctx.Writer.WriteBoolByte(_isStartEnd) and in the reader have something like _isStartEnd = ctx.Header.ModelVerReadable < 0x00010002 || ctx.Reader.ReadBoolByte(). This hypothetical _isStartEnd will be private readonly, and emphatically not public static.

If anything about that is not clear let me know, thanks @zeahmed!
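The suggested pattern can be sketched in Python (a rough transliteration; the real code uses ML.NET's `ModelSaveContext`/`ModelLoadContext` with `WriteBoolByte`/`ReadBoolByte`, and the byte layout here is a stand-in):

```python
import io
import struct

NEW_VERSION = 0x00010002  # version that introduced the new separator behavior

def save(stream, is_start_end):
    # New-format models always persist the flag explicitly after the version.
    stream.write(struct.pack("<I?", NEW_VERSION, is_start_end))

def load(stream):
    (version,) = struct.unpack("<I", stream.read(4))
    if version < NEW_VERSION:
        # Old models never wrote the flag; they always used start/end markers.
        return True
    (flag,) = struct.unpack("?", stream.read(1))
    return flag

buf = io.BytesIO()
save(buf, False)
buf.seek(0)
print(load(buf))  # False: new model, explicit per-instance flag
```

The key point mirrored here is that the flag is plain per-instance state, read once at load time, never a static/global.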


In reply to: 203896119 [](ancestors = 203896119)

@@ -262,6 +262,30 @@ public static IDataTransform Create(IHostEnvironment env, Arguments args, IDataV
view = new ConcatTransform(h, new ConcatTransform.Arguments() { Column = xfCols }, view);
}

if (tparams.NeedsNormalizeTransform)
Contributor Author

@zeahmed zeahmed Jul 19, 2018

@justinormont, I moved TextNormalizer above WordTokenizer. Let me know if it has any adverse effects. #Resolved

Contributor

@justinormont justinormont Jul 20, 2018

I don't know if there are adverse effects of running TextNormalizer before WordTokenizer. I expect the output is different, but assume not adversely so. For instance (depending on the specifics of the tokenizer), I don't can be split into {I, do, n't}, but if punctuation is removed first, it may be split into {I, dont}.

If you make me a build, I'll run a manual regression test. Unfortunately we don't currently have running nightly regression tests to tell us if the text datasets decreased in their core metrics.

#Resolved
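The order effect can be illustrated with a toy tokenizer and normalizer in Python (hypothetical rules, not ML.NET's actual ones):

```python
import re

def tokenize(text):
    # Toy Treebank-style rule: split contractions like "don't" -> "do", "n't".
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    return text.split()

def normalize(text):
    # Toy normalization: lowercase and strip punctuation.
    return re.sub(r"[^\w\s]", "", text.lower())

s = "I don't"
tokenize_then_normalize = [normalize(t) for t in tokenize(s)]
normalize_then_tokenize = tokenize(normalize(s))
print(tokenize_then_normalize)  # ['i', 'do', 'nt']
print(normalize_then_tokenize)  # ['i', 'dont']
```

The two orders disagree on contractions, which is exactly the kind of difference a regression run over the text datasets would surface.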

Contributor

I don't like this change. It's far from clear to me that if someone asks to remove stopwords that they meant for the content of stopwords to be retained in the chargrams. At the very least this ought to be a configurable option.


In reply to: 204000786 [](ancestors = 204000786)

Contributor Author

@zeahmed zeahmed Jul 21, 2018

@TomFinley, can you elaborate on this? Your point is not clear to me. Don't you think it is good to have TextNormalizer applied before WordTokenizer?


In reply to: 204201054 [](ancestors = 204201054,204000786)

Contributor

@TomFinley TomFinley Jul 25, 2018

Sorry, I thought you were saying something else and got confused for some reason. #Resolved

@@ -64,12 +64,13 @@ public sealed class Arguments : TransformInputBase
public const string LoaderSignature = "CharToken";
public const string UserName = "Character Tokenizer Transform";

public static uint CurrentModelVersion = 0x00010002;
Contributor

@TomFinley TomFinley Jul 21, 2018

CurrentModelVersion [](start = 27, length = 19)

Hmmm. Just to clarify something: because having this be a constant is essential for the correctness of your serialization and deserialization routines, turning it into a public static mutable field is probably not a good idea.

I question the worth of making this a variable *at all*, but even if you do, please don't make it public, and it should definitely be a const. Also, "current" version is not descriptive. It is the version where you changed the behavior of the separating characters to be the single character <US> vs. having start and end characters. If you really want a variable, it should be something like VariableSeparatorIsUnitSeparator or somesuch, to actually describe what version 0x00010002 is. But again, I'd question how much worth such a const would provide. #Resolved

private static VersionInfo GetVersionInfo()
{
return new VersionInfo(
modelSignature: "CHARTOKN",
verWrittenCur: 0x00010001, // Initial
verReadableCur: 0x00010001,
verWrittenCur: CurrentModelVersion,
Contributor

@TomFinley TomFinley Jul 21, 2018

verWrittenCur: CurrentModelVersion, [](start = 16, length = 35)

Please refer to other components where we bumped the version number. (If I search for 0x00010002 I see 43 matching files, so you should be able to find examples easily.) You will see that in those other versions we have a comment that describes why the version was changed. This is blank. #Resolved

private static VersionInfo GetVersionInfo()
{
return new VersionInfo(
modelSignature: "CHARTOKN",
verWrittenCur: 0x00010001, // Initial
Contributor

@TomFinley TomFinley Jul 21, 2018

verWrittenCur: 0x00010001, // Initial [](start = 16, length = 37)

See other examples. You will see that we keep commented out the version strings from old versions, to track them and so we know why we bumped the version each time. This is important: we have live code in our deserializers that is meant to handle those version bumps, and for that reason we like to have documentation on what exactly changed, so that when we have tests on this or that version in our deserializers, we know why they changed. #Resolved

@@ -119,6 +121,7 @@ private CharTokenizeTransform(IHost host, ModelLoadContext ctx, IDataView input)
// <base>
// byte: _useMarkerChars value.
_useMarkerChars = ctx.Reader.ReadBoolByte();
CurrentModelVersion = ctx.Header.ModelVerReadable;
Contributor

@TomFinley TomFinley Jul 21, 2018

CurrentModelVersion [](start = 12, length = 19)

Oh, I see what you've done here... this is not good; static variables are essentially global state. They are not particular to an instance. #Resolved

@@ -120,6 +125,9 @@ private CharTokenizeTransform(IHost host, ModelLoadContext ctx, IDataView input)
// byte: _useMarkerChars value.
_useMarkerChars = ctx.Reader.ReadBoolByte();

var version = GetVersionInfo();
_isSeparatorStartEnd = ctx.Header.ModelVerReadable < version.VerReadableCur || ctx.Reader.ReadBoolByte();
Contributor

@TomFinley TomFinley Jul 25, 2018

version.VerReadableCur [](start = 65, length = 22)

This is fragile and will break as soon as we rev the version again. Please just compare to 0x00010002. #Closed

@TomFinley
Contributor

TomFinley commented Jul 25, 2018

Hi @zeahmed looking pretty good, I think my only concern is the fragile version check I mentioned, otherwise looks fine. #Resolved

@@ -102,6 +107,7 @@ public CharTokenizeTransform(IHostEnvironment env, Arguments args, IDataView inp

_type = GetOutputColumnType();
SetMetadata();
_isSeparatorStartEnd = false;
Contributor

@TomFinley TomFinley Jul 25, 2018

FYI there is no need for this. The default value of bool is false. #Resolved

Contributor

@TomFinley TomFinley left a comment

Thanks @zeahmed !

Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

:shipit:

@zeahmed zeahmed merged commit 0e0f702 into dotnet:master Jul 26, 2018
@zeahmed
Contributor Author

zeahmed commented Jul 26, 2018

Thanks @TomFinley, @justinormont and @Ivanidzo4ka!

eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
codemzs pushed a commit to codemzs/machinelearning that referenced this pull request Aug 1, 2018
@zeahmed zeahmed deleted the texttransform_bug branch January 30, 2019 21:30
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022

Successfully merging this pull request may close these issues.

[TextTransform] Char n-grams are different when using with/without word n-grams.
4 participants