
Get words in original sentence #420

Closed
linas opened this issue Sep 26, 2016 · 12 comments

@linas
Member

linas commented Sep 26, 2016

See pull req #416 for discussion.
To summarize: a Stack Overflow question (how to use LG as a grammar checker) asks for an API to find which words were not included in a parse. The issue is non-trivial due to the following:

  • word-numbering is tokenization-dependent, e.g. a word may be split: it's -> it 's
  • the spell-guesser can change the length of words
  • in UTF-8 languages, character-based word begin/end positions differ from byte positions.
@ampli
Member

ampli commented Sep 26, 2016

>> I guess you mean "that token in the original sentence that I would get, if I split the original sentence by using blank spaces as word-separators"

> Approximately but not exactly - see below.
>
> For the case of "it's" being split into two words, the user could write a utility themselves; it's not hard. For spell-guessing, it gets hard, because string compares are no longer possible.

It may get complicated - there are also punctuation splits, morpheme splits, and unit splits.

> Yes, I guess something like this might work:
>
> char * linkage_get_orig_word(const Linkage linkage, WordIdx word_num);
>
> and maybe also
>
> int linkage_get_orig_index(const Linkage linkage, WordIdx word_num);

It is more complicated than that. Take for example the following sentence:

Linkparser> This is a, test
...
LEFT-WALL this.p is.v a [,] test.n

For "[,]", you don't want to return "a,".
There are similar, and also more complicated, examples.

To remove any ambiguity, it seems we may need to return (start, end) string positions. This will also enable the application to mark (underline, color, etc.) the offending tokens, and more.
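For illustration, a minimal sketch of how an application could use such (start, end) byte offsets to mark a token; the helper below is hypothetical, not part of the LG API:

#include <stdio.h>

/* Print the sentence and underline the token that spans byte
   offsets [start, end). */
static void underline_token(const char *sentence, size_t start, size_t end)
{
    printf("%s\n", sentence);
    for (size_t i = 0; i < start; i++) putchar(' ');
    for (size_t i = start; i < end; i++) putchar('^');
    putchar('\n');
}

For the sentence above, underline_token("This is a, test", 9, 10) would put a caret under the comma. (The column alignment is only correct for single-byte characters; byte vs. character offsets are discussed further below.)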

Only after trying to implement it, including writing a demo that uses it, can we see what is actually needed. I can try to do that.

> Can you think of a better name for "orig_word" and "orig_index"? There must be some nicer way of saying this...

Maybe use the word sentence instead of orig.

@linas
Member Author

linas commented Sep 27, 2016

Yes, (start, end) string positions are exactly the right thing! Then linkage_get_word_start() and linkage_get_word_end() would be good names.
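A hedged sketch of the corresponding declarations, following the existing linkage_get_* conventions (the size_t return type is an assumption; whether the offsets are in bytes or characters is discussed below):

size_t linkage_get_word_start(const Linkage linkage, WordIdx word_num);
size_t linkage_get_word_end(const Linkage linkage, WordIdx word_num);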

@ampli
Member

ampli commented Sep 27, 2016

I will implement it and add it to the sentence-check.py example program.

@ampli
Member

ampli commented Dec 7, 2016

Hi Linas,
I implemented it. However, I encountered a problem.
I need your comments on these things before I finish polishing the code (including test cases) and send a PR.

Currently, my implementation returns start and end offsets in bytes.
When I wanted to use it in a demo program, it looked fine for simple English sentences.
But when I tried Russian, I realized that I really would like character offsets, not byte offsets.
On the other hand, for programs that work on the raw byte strings, character offsets are useless.

Possible solutions:

  1. Provide APIs for byte offsets and for character offsets (different functions, or the same functions with an additional Boolean argument).
    In that case, there are two options:
    -- Internally compute both the byte offset and character offset for each token. Here there is a slight overhead of computing a representation that will not be used by the API user.
    -- Compute only one of them for each token and convert it to the other one on demand. Here there is much more overhead if the "wrong" representation is the one which is initially computed.

  2. Provide only character offsets, since modern programs work with Unicode.

  3. Define a new parse-option with 4 possible values (sketched after this list):
    -- 0: No offset computations. This has the benefit of slightly less overhead when offsets are not needed (but the overhead is small - I could not see it in timing checks on en/corpus-basic.batch).
    -- 1: Provide only byte offsets.
    -- 2: Provide only Unicode offsets.
    -- 3: Provide both kinds of offsets. For that, solution (1) above needs to be implemented too.
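For concreteness, a sketch of what the 4-valued parse-option of solution (3) could look like; the enum and its names are made up for illustration:

/* Hypothetical values for the proposed parse-option (illustration only). */
typedef enum {
    OFFSETS_NONE = 0,  /* no offset computations */
    OFFSETS_BYTE = 1,  /* byte offsets only */
    OFFSETS_CHAR = 2,  /* Unicode character offsets only */
    OFFSETS_BOTH = 3   /* both kinds; needs solution (1) as well */
} offset_mode;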

My preferred solution for now:
I will change my implementation to option (2) above, unless you think another option is better.

Another problem was the C types for the linkage word argument and the returned result.
The type for the linkage word seems clear: WordIdx (which is currently defined as size_t).
The returned result should be size_t if we look at other functions that return a number of items, like linkage_get_num_words() and linkage_get_num_links().

On x64 Linux, size_t seems like overkill (compared to unsigned int), but nevertheless it is the appropriate type, by definition, to use as a positive index.

But note that in link_grammar.i, in almost all cases, int replaces size_t and WordIdx.
This is fine from the standpoint of the SWIG interface, as these prototypes are only used to represent the target-language types, while the proper C prototypes are used for the library interface; the int to size_t promotion is just fine for positive numbers (only), and size_t to int is fine because these are small numbers. Moreover, using int instead of size_t is nicer when debugging Python on x64 Linux, because otherwise the corresponding numbers are displayed with an L suffix (e.g. 1L). So I also specified the parameters/returned values in link_grammar.i as int.

I also had a problem with the positions of the wall words. My current solution is to set, for each of them, the start offset equal to the end offset.
The start offset of the LEFT-WALL is set to 0, and this doesn't seem problematic.
However, the start offset of the RIGHT-WALL is set to sentence_length+1, which has the serious drawback of not being a valid offset in the sentence!

@linas
Member Author

linas commented Dec 8, 2016

So:

  • Some 98% of users are not interested in these offsets or this API (right?), so slowing down the implementation for all users seems like a bad idea.

  • If you internally store pointers into the original string, then offsets are easy to compute: byte offsets are just a pointer difference, while character offsets require looping and counting characters from the beginning of the string to the start/end character (a sketch follows this list). Since the pointers should always point to the first byte of a character, this should be easy to do. This computation could be done, live, at the time that the user asks for it. In almost all cases, I think byte offsets are easier to work with, as otherwise the user is forced to count characters even if they don't want to (!)

  • size_t vs int -- I don't care much, one way or the other, today. I thought it would help improve the API by making calls more strict, but then size_t is overkill, and makes printing harder. I keep changing my mind about whether it is a good idea.

  • RIGHT-WALL -- if it is a pointer, then it points at the null byte that terminates a C string. This seems like a better idea than making it be -1 or something like that...
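A minimal sketch of the character counting described in the second bullet, assuming the stored pointers always land on the first byte of a UTF-8 character (illustrative code, not the library's implementation):

#include <stddef.h>

/* Count the UTF-8 characters in [s, p): continuation bytes have the
   bit pattern 10xxxxxx, so count only bytes that start a character. */
static size_t utf8_char_offset(const char *s, const char *p)
{
    size_t n = 0;
    for (; s < p; s++)
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    return n;
}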

@ampli
Member

ampli commented Dec 8, 2016

> Some 98% of users are not interested in these offsets or this API (right?), so slowing down the implementation for all users seems like a bad idea.

Indeed most users may not be interested in it. It is intended for programs like a link-parser GUI and for syntax checkers.
However, its overhead is so minimal (a few assignments/additions per token) that I could not measure its impact.

> If you internally store pointers into the original string, then offsets are easy to compute: byte offsets are just a pointer difference, while character offsets require looping and counting characters from the beginning of the string to the start/end character. Since the pointers should always point to the first byte of a character, this should be easy to do.

Indeed, my implementation just stores pointers to the subword start/end (for each wordgraph node).
Each time a new node is inserted, its start/end positions are computed from the start/end of the previous token or unsplit_word and the length of the inserted subword. (This takes care of spell corrections.)
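Roughly, the bookkeeping described above amounts to something like the following; the names are made up, and the real wordgraph code also handles spell corrections, where the lengths differ:

#include <string.h>

/* Illustrative only: a new subword starts right after the previous
   token (plus any separating blanks) and spans its own length. */
static void set_subword_span(size_t prev_end, size_t nblanks,
                             const char *subword,
                             size_t *start, size_t *end)
{
    *start = prev_end + nblanks;
    *end   = *start + strlen(subword);
}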

BTW, it turned out that the start/end positions for morphology=1 are already perfectly handled by the existing linkage->wg_path_display (the wordgraph path for display, which matches the returned linkage words). So I just removed the comment "Experimental" from its description.

> This computation could be done, live, at the time that the user asks for it. In almost all cases, I think byte offsets are easier to work with, as otherwise the user is forced to count characters even if they don't want to (!)

OK. So I should provide some API to request character offsets.
I can think of these possibilities (repeating what I already mentioned):

  1. Different API calls:
    linkage_get_word_byte_start(linkage, wordidx)
    linkage_get_word_byte_end(linkage, wordidx)
    linkage_get_word_char_start(linkage, wordidx)
    linkage_get_word_char_end(linkage, wordidx)

  2. Something like:
    linkage_get_word_start(linkage, wordidx, flag)
    where flag is false for a byte offset and true for a character offset.

  3. A parse-option to control what linkage_get_word_start() / linkage_get_word_end() return.

  4. A helper API to convert a byte offset to a character offset, something like:
    convert_byte_to_char_offset(sent, offset)
    The problem is that such a function would be the first one that is not LG-library specific, and the user can easily write it themselves. Also, if we depend on the existence of such a function, then only byte offsets can be provided.

(Of course more solutions can be thought of, but they may be too different from the current API.)

Computing the character offset on demand will be "expensive" due to its quadratic nature (unless it is requested in word order and internal caching is done). I can just implement it in its simplest form (counting from the sentence start each time), and if needed it can be reimplemented in a more efficient (and complex) way.

So please indicate the preferred solution.

BTW, I encountered the need for character offsets in the Python demo I'm writing for the word-offset feature (since the same code is for both Python 2 and 3, and Python 3 uses Unicode internally).
(However, I guess I can solve the indexing problem by converting to bytes and back.)

@ampli
Member

ampli commented Dec 8, 2016

I forgot to reply on that:

> RIGHT-WALL -- if it is a pointer, then it points at the null byte that terminates a C string. This seems like a better idea than making it be -1 or something like that...

For C there is no problem, since the said null byte is part of the string.
However, in Python this position is out of the string, and special care should be taken to handle it.
(But this is not really important, and an implementation should also consider several other things.)

@linas
Member Author

linas commented Dec 8, 2016 via email

I'm starting to think that perhaps having four routines is best.

@ampli
Member

ampli commented Dec 8, 2016

> I'm starting to think that perhaps having four routines is best.

I can implement it using the four routines, and we can declare this API "experimental and subject to change" until we think it is good enough or find a better idea.

The same can also be said of the new error facility.

If this is fine, I will send PRs for both.

(Note that the current pending PRs are not related to the above, and in any case can be applied right away. This will make it easy for me to ensure that the PRs I am about to send can be applied with no problems.)

@linas
Member Author

linas commented Dec 8, 2016

that sounds good.

@linas
Member Author

linas commented Jul 1, 2017

Having the four functions

linkage_get_word_byte_start(linkage, wordidx) 
linkage_get_word_byte_end(linkage, wordidx) 
linkage_get_word_char_start(linkage, wordidx) 
linkage_get_word_char_end(linkage, wordidx)

seems like the best idea.
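A hedged usage sketch, assuming these follow the existing linkage_get_* conventions (a Linkage plus a word index, returning a size_t offset):

#include <stdio.h>
#include "link-grammar/link-includes.h"

/* Print each linkage word together with its byte and character spans. */
static void print_word_spans(Linkage linkage)
{
    size_t nwords = linkage_get_num_words(linkage);
    for (size_t w = 0; w < nwords; w++) {
        printf("%zu: %s  bytes %zu-%zu  chars %zu-%zu\n", w,
               linkage_get_word(linkage, w),
               linkage_get_word_byte_start(linkage, w),
               linkage_get_word_byte_end(linkage, w),
               linkage_get_word_char_start(linkage, w),
               linkage_get_word_char_end(linkage, w));
    }
}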

ampli added a commit to ampli/link-grammar that referenced this issue Jul 11, 2017
linas added a commit that referenced this issue Jul 13, 2017
Implement linkage_get_word_*() (issue #420)
@linas
Member Author

linas commented Jul 26, 2017

Closing; I think pull reqs #564 and #565 resolve this. Also, @glicerico states in #568 that this is now covered.

@linas linas closed this as completed Jul 26, 2017