Get words in original sentence #420
Approximately but not exactly - see below.
It may get complicated - there are also punctuation splits, morpheme splits and unit splits.
It is more complicated than that. Take for example the following sentence:
To remove any ambiguity, it seems there may be a need to return (start, end) string positions. This will also enable the application to mark (underline, color, etc.) the offending tokens, and more. Only after trying to implement it, including writing a demo using it, can we see what is actually needed. I can try to do that.
Maybe use the word sentence instead of orig. |
Yes, (start, end) string positions are exactly the right thing! |
I will implement it and add it to the sentence-check.py example program. |
Hi Linas, Currently, my implementation returns start and end offsets in bytes. Possible solutions:
My preferred solution for now: Another problem was the C types of the linkage word offsets and the returned result on Linux using x64. I also had a problem with the position of the wall words. My current solution is to set, for each of them, the start offset to be equal to the end offset. |
So:
*LEFT_WALL -- if it is a pointer, then it points at the null byte that terminates a C string. This seems like a better idea than making it be -1 or something like that... |
Indeed most users may not be interested in it. It is intended for programs like a link-parser GUI and for syntax checkers.
Indeed my implementation just stores pointers to the subword start/end (for each wordgraph node). BTW, it turned out that the start/end points for morphology=1 are already perfectly handled by the existing
Ok. So I should provide some API to request character offset.
linkage_get_word_byte_start(linkage, wordidx)
linkage_get_word_byte_end(linkage, wordidx)
linkage_get_word_char_start(linkage, wordidx)
linkage_get_word_char_end(linkage, wordidx)
(Of course more solutions can be thought of, but they may be too different from the current API.) Computing the character offset on demand will be "expensive" due to its quadratic nature (unless it is requested in word order and internal caching is done). I can just implement it in its simplest form (counting from sentence start each time), and if needed it can be reimplemented in a more efficient (and complex) way. So please indicate the preferred solution. BTW, I encountered the need for character offsets in the Python demo I'm writing for the word-offset feature (since the same code serves both Python 2 and 3, and Python 3 uses Unicode internally). |
I forgot to reply on that:
(We both actually meant the RIGHT_WALL...) |
On Thu, Dec 8, 2016 at 5:20 PM, Amir Plivatsky ***@***.***> wrote:
Different API calls:
linkage_get_word_byte_start(linkage, wordidx)
linkage_get_word_byte_end(linkage, wordidx)
linkage_get_word_char_start(linkage, wordidx)
linkage_get_word_char_end(linkage, wordidx)
How about
```
void linkage_get_word_byte_span(linkage, wordidx, int* start, int* end)
```
with data returned in the mem locations start, end, so the intended use would be:
```
int start_offset;
int end_offset;
for (int n = 0; n < nwords; n++)
    linkage_get_word_byte_span(lkg, n, &start_offset, &end_offset);
```
```
void linkage_get_word_char_span(linkage, wordidx, int*, int*)
```
Another possibility would be to malloc an entire array (of start-end positions for each word) and return that. We already have other routines that return malloced arrays. ...
Hmm. I'm starting to think that perhaps having four routines is best.
2. Something like:
linkage_get_word_start(linkage, wordidx, flag)
where flag is false for a byte offset and true for a character offset.
yuck.
3. A parse-option to control what linkage_get_word_start() /
linkage_get_word_end() returns.
yuck
4. Provide a helper API to convert byte offsets to character offsets, something
like:
convert_byte_to_char_offset(sent, offset)
The problem is that such a function would be the first one that is not
LG-library specific, and also the user can easily write it. If we depend on
the existence of such a function, then only byte offsets can be provided.
|
I can implement it using option 4, and we can declare this API "experimental and subject to change" until we think it is good enough or find a better idea. The same can be said about the new error facility. If this is fine I will send PRs for both. (Note that the current pending PRs are not related to the above, and in any case can be applied right away. This will make it easy for me to ensure that the PRs I am about to send can be applied with no problems.) |
that sounds good. |
Having the four functions seems like the best idea. |
API functions per issue opencog#420.
Implement linkage_get_word_*() (issue #420)
Closing, I think pull reqs #564 and #565 resolve this. Also, @glicerico states in #568 that this is now covered. |
See pull req #416 for discussion.
To summarize: a stackoverflow question - how to use LG as a grammar checker - asks for an API to find which words were not included in a parse. The issue is non-trivial due to the following: