
Get words in original sentence #420

Closed
linas opened this issue Sep 26, 2016 · 12 comments

@linas
Member

linas commented Sep 26, 2016

See pull req #416 for discussion.
To summarize: a Stack Overflow question (how to use LG as a grammar checker) asks for an API to find which words were not included in a parse. The issue is non-trivial due to the following:

  • word-numbering is tokenization-dependent, e.g. a word may be split: it's -> it 's
  • the spell-guesser can change the length of words
  • in UTF-8 languages, character-based word begin/end positions differ from byte positions.
@ampli
Member

ampli commented Sep 26, 2016

>> I guess you mean "that token in the original sentence that I would get, if I split the original sentence by using blank spaces as word-separators"

> Approximately but not exactly - see below.
>
> For the case of "it's" being split into two words, the user could write a utility themselves; it's not hard. For spell-guessing, it gets hard, because string compares are no longer possible.

It may get complicated - there are also punctuation splits, morpheme splits, and unit splits.

> Yes, I guess something like this might work:
>
> char * linkage_get_orig_word(const Linkage linkage, WordIdx word_num);
>
> and maybe also
>
> int linkage_get_orig_index(const Linkage linkage, WordIdx word_num);

It is more complicated than that. Take for example the following sentence:

Linkparser> This is a, test
...
LEFT-WALL this.p is.v a [,] test.n

For "[,]", you don't want to return "a,".
There are similar, and also more complicated, examples.

To remove any ambiguity, it seems we may need to return (start, end) string positions. This will also enable the application to mark (underline, color, etc.) the offending tokens, and more.
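For illustration, a minimal sketch of how an application could use such (start, end) byte offsets to mark a token; the helper below is hypothetical, not part of the LG API:

#include <stdio.h>

/* Print the sentence and underline the token that spans byte
   offsets [start, end). */
static void underline_token(const char *sentence, size_t start, size_t end)
{
    printf("%s\n", sentence);
    for (size_t i = 0; i < start; i++) putchar(' ');
    for (size_t i = start; i < end; i++) putchar('^');
    putchar('\n');
}

For the sentence above, underline_token("This is a, test", 9, 10) would put a caret under the comma. (The column alignment is only correct for single-byte characters; byte vs. character offsets are discussed further below.)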

Only after trying to implement it, including writing a demo that uses it, can we see what is actually needed. I can try to do that.

> Can you think of a better name for "orig_word" and "orig_index"? There must be some nicer way of saying this...

Maybe use the word sentence instead of orig.

@linas
Member Author

linas commented Sep 27, 2016

Yes, (start, end) string positions are exactly the right thing! Then linkage_get_word_start() and linkage_get_word_end() would be good names.
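A hedged sketch of the corresponding declarations, following the existing linkage_get_* conventions (the size_t return type is an assumption; whether the offsets are in bytes or characters is discussed below):

size_t linkage_get_word_start(const Linkage linkage, WordIdx word_num);
size_t linkage_get_word_end(const Linkage linkage, WordIdx word_num);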

@ampli
Member

ampli commented Sep 27, 2016

I will implement it and add it to the sentence-check.py example program.

@ampli
Member

ampli commented Dec 7, 2016

Hi Linas,
I implemented it. However, I encountered a problem.
I need your comments on these things before I finish polishing the code (including test cases) and send a PR.

Currently, my implementation returns start and end offsets in bytes.
When I wanted to use it in a demo program, it looked fine for simple English sentences.
But when I tried Russian, I realized that I really would like character offsets, not byte offsets.
On the other hand, for programs that work on the raw byte strings, character offsets are useless.

Possible solutions:

  1. Provide APIs for byte offsets and for character offsets (different functions, or the same functions with an additional Boolean argument).
    In that case, there are two options:
    -- Internally compute both the byte offset and character offset for each token. Here there is a slight overhead of computing a representation that will not be used by the API user.
    -- Compute only one of them for each token and convert it to the other one on demand. Here there is much more overhead if the "wrong" representation is the one which is initially computed.

  2. Provide only character offsets, since modern programs work with Unicode.

  3. Define a new parse-option with 4 possible values (sketched after this list):
    -- 0: No offset computations. This has the benefit of slightly less overhead when offsets are not needed (but the overhead is small - I could not see it in timing checks on en/corpus-basic.batch).
    -- 1: Provide only byte offsets.
    -- 2: Provide only Unicode offsets.
    -- 3: Provide both kinds of offsets. For that, solution (1) above needs to be implemented too.
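For concreteness, a sketch of what the 4-valued parse-option of solution (3) could look like; the enum and its names are made up for illustration:

/* Hypothetical values for the proposed parse-option (illustration only). */
typedef enum {
    OFFSETS_NONE = 0,  /* no offset computations */
    OFFSETS_BYTE = 1,  /* byte offsets only */
    OFFSETS_CHAR = 2,  /* Unicode character offsets only */
    OFFSETS_BOTH = 3   /* both kinds; needs solution (1) as well */
} offset_mode;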

My preferred solution for now:
I will change my implementation to option (2) above, unless you think another option is better.

Another problem was the C types for the linkage word argument and the returned result.
The type for the linkage word seems clear: WordIdx (which is currently defined as size_t).
The returned result should be size_t if we look at other functions that return a number of items, like linkage_get_num_words() and linkage_get_num_links().

On x64 Linux, size_t seems like overkill (compared to unsigned int), but nevertheless it is the appropriate type, by definition, to use as a positive index.

But note that in link_grammar.i, in almost all cases, int replaces size_t and WordIdx.
This is fine from the standpoint of the SWIG interface, as these prototypes are only used to represent the target-language types, while the proper C prototypes are used for the library interface; the int to size_t promotion is just fine for positive numbers (only), and size_t to int is fine because these are small numbers. Moreover, using int instead of size_t is nicer when debugging Python on x64 Linux, because otherwise the corresponding numbers are displayed with an L suffix (e.g. 1L). So I also specified the parameters/returned values in link_grammar.i as int.

I also had a problem with the positions of the wall words. My current solution is to set, for each of them, the start offset equal to the end offset.
The start offset of the LEFT-WALL is set to 0, and this doesn't seem problematic.
However, the start offset of the RIGHT-WALL is set to sentence_length+1, which has the serious drawback of not being a valid offset in the sentence!

@linas
Member Author

linas commented Dec 8, 2016

So:

  • Some 98% of users are not interested in these offsets or this API (right?), so slowing down the implementation for all users seems like a bad idea.

  • If you internally store pointers into the original string, then offsets are easy to compute: byte offsets are just a pointer difference, while character offsets require looping and counting characters from the beginning of the string to the start/end character (a sketch follows this list). Since the pointers should always point to the first byte of a character, this should be easy to do. This computation could be done, live, at the time that the user asks for it. In almost all cases, I think byte offsets are easier to work with, as otherwise the user is forced to count characters even if they don't want to (!)

  • size_t vs int -- I don't care much, one way or the other, today. I thought it would help improve the API by making calls more strict, but then size_t is overkill, and makes printing harder. I keep changing my mind about whether it is a good idea.

  • RIGHT-WALL -- if it is a pointer, then it points at the null byte that terminates a C string. This seems like a better idea than making it be -1 or something like that...
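A minimal sketch of the character counting described in the second bullet, assuming the stored pointers always land on the first byte of a UTF-8 character (illustrative code, not the library's implementation):

#include <stddef.h>

/* Count the UTF-8 characters in [s, p): continuation bytes have the
   bit pattern 10xxxxxx, so count only bytes that start a character. */
static size_t utf8_char_offset(const char *s, const char *p)
{
    size_t n = 0;
    for (; s < p; s++)
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    return n;
}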

@ampli
Member

ampli commented Dec 8, 2016

> Some 98% of users are not interested in these offsets or this API (right?), so slowing down the implementation for all users seems like a bad idea.

Indeed most users may not be interested in it. It is intended for programs like a link-parser GUI and for syntax checkers.
However, its overhead is so minimal (a few assignments/additions per token) that I could not measure its impact.

> If you internally store pointers into the original string, then offsets are easy to compute: byte offsets are just a pointer difference, while character offsets require looping and counting characters from the beginning of the string to the start/end character. Since the pointers should always point to the first byte of a character, this should be easy to do.

Indeed, my implementation just stores pointers to the subword start/end (for each wordgraph node).
Each time a new node is inserted, its start/end positions are computed from the start/end of the previous token or unsplit_word and the length of the inserted subword. (This takes care of spell corrections.)
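Roughly, the bookkeeping described above amounts to something like the following; the names are made up, and the real wordgraph code also handles spell corrections, where the lengths differ:

#include <string.h>

/* Illustrative only: a new subword starts right after the previous
   token (plus any separating blanks) and spans its own length. */
static void set_subword_span(size_t prev_end, size_t nblanks,
                             const char *subword,
                             size_t *start, size_t *end)
{
    *start = prev_end + nblanks;
    *end   = *start + strlen(subword);
}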

BTW, it turned out that the start/end positions for morphology=1 are already perfectly handled by the existing linkage->wg_path_display (the wordgraph path for display, which matches the returned linkage words). So I just removed the comment "Experimental" from its description.

> This computation could be done, live, at the time that the user asks for it. In almost all cases, I think byte offsets are easier to work with, as otherwise the user is forced to count characters even if they don't want to (!)

OK. So I should provide some API to request character offsets.
I can think of these possibilities (repeating what I already mentioned):

  1. Different API calls:
    linkage_get_word_byte_start(linkage, wordidx)
    linkage_get_word_byte_end(linkage, wordidx)
    linkage_get_word_char_start(linkage, wordidx)
    linkage_get_word_char_end(linkage, wordidx)

  2. Something like:
    linkage_get_word_start(linkage, wordidx, flag)
    where flag is false for a byte offset and true for a character offset.

  3. A parse-option to control what linkage_get_word_start() / linkage_get_word_end() return.

  4. A helper API to convert a byte offset to a character offset, something like:
    convert_byte_to_char_offset(sent, offset)
    The problem is that such a function would be the first one that is not LG-library specific, and the user can easily write it themselves. Also, if we depend on the existence of such a function, then only byte offsets can be provided.

(Of course more solutions can be thought of, but they may be too different from the current API.)

Computing the character offset on demand will be "expensive" due to its quadratic nature (unless it is requested in word order and internal caching is done). I can just implement it in its simplest form (counting from the sentence start each time), and if needed it can be reimplemented in a more efficient (and complex) way.

So please indicate the preferred solution.

BTW, I encountered the need for character offsets in the Python demo I'm writing for the word-offset feature (since the same code is for both Python 2 and 3, and Python 3 uses Unicode internally).
(However, I guess I can solve the indexing problem by converting to bytes and back.)

@ampli
Member

ampli commented Dec 8, 2016

I forgot to reply on that:

> RIGHT-WALL -- if it is a pointer, then it points at the null byte that terminates a C string. This seems like a better idea than making it be -1 or something like that...

For C there is no problem, since the said null byte is part of the string.
However, in Python this position is out of the string, and special care should be taken to handle it.
(But this is not really important, and an implementation should also consider several other things.)

@linas
Member Author

linas commented Dec 8, 2016 via email

I'm starting to think that perhaps having four routines is best.

@ampli
Member

ampli commented Dec 8, 2016

> I'm starting to think that perhaps having four routines is best.

I can implement it using the four routines, and we can declare this API "experimental and subject to change" until we think it is good enough or find a better idea.

The same can also be said of the new error facility.

If this is fine, I will send PRs for both.

(Note that the current pending PRs are not related to the above, and in any case can be applied right away. This will make it easy for me to ensure that the PRs I am about to send can be applied with no problems.)

@linas
Member Author

linas commented Dec 8, 2016

that sounds good.

@linas
Member Author

linas commented Jul 1, 2017

Having the four functions

linkage_get_word_byte_start(linkage, wordidx) 
linkage_get_word_byte_end(linkage, wordidx) 
linkage_get_word_char_start(linkage, wordidx) 
linkage_get_word_char_end(linkage, wordidx)

seems like the best idea.
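A hedged usage sketch, assuming these follow the existing linkage_get_* conventions (a Linkage plus a word index, returning a size_t offset):

#include <stdio.h>
#include "link-grammar/link-includes.h"

/* Print each linkage word together with its byte and character spans. */
static void print_word_spans(Linkage linkage)
{
    size_t nwords = linkage_get_num_words(linkage);
    for (size_t w = 0; w < nwords; w++) {
        printf("%zu: %s  bytes %zu-%zu  chars %zu-%zu\n", w,
               linkage_get_word(linkage, w),
               linkage_get_word_byte_start(linkage, w),
               linkage_get_word_byte_end(linkage, w),
               linkage_get_word_char_start(linkage, w),
               linkage_get_word_char_end(linkage, w));
    }
}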

ampli added a commit to ampli/link-grammar that referenced this issue Jul 11, 2017
linas added a commit that referenced this issue Jul 13, 2017
Implement linkage_get_word_*() (issue #420)
@linas
Member Author

linas commented Jul 26, 2017

Closing; I think pull reqs #564 and #565 resolve this. Also, @glicerico states in #568 that this is now covered.

@linas linas closed this as completed Jul 26, 2017