Use a list of tokens obtained by any external tokenizer in Sentence class #640

Closed · wants to merge 1 commit
flair/data.py (17 changes: 11 additions & 6 deletions)
@@ -273,7 +273,7 @@ class Sentence:
    A Sentence is a list of Tokens and is used to represent a sentence or text fragment.
    """

-    def __init__(self, text: str = None, use_tokenizer: bool = False, labels: Union[List[Label], List[str]] = None):
+    def __init__(self, text: str = None, use_tokenizer: bool = False, tokens_from_list: List[str] = None, labels: Union[List[Label], List[str]] = None):

        super(Sentence, self).__init__()

@@ -290,12 +290,17 @@ def __init__(self, text: str = None, use_tokenizer: bool = False, labels: Union[
            # tokenize the text first if option selected
            if use_tokenizer:
Member

By default, use_tokenizer is False. In order to use tokens_from_list, it must be manually set to True, which I think is a bit complicated.
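
To make this concrete, here is an illustrative, editor-added snippet (not from the PR) based on the diff below: because the new branch only runs inside if use_tokenizer:, leaving use_tokenizer at its default means tokens_from_list is never consulted.

from flair.data import Sentence

tokens = ["Berlin", "and", "Munich", "are", "nice", "cities", "."]

# tokens_from_list is silently ignored here: use_tokenizer defaults to False
ignored = Sentence("Berlin and Munich are nice cities .", tokens_from_list=tokens)

# only this call actually uses the externally produced tokens
used = Sentence("Berlin and Munich are nice cities .", use_tokenizer=True, tokens_from_list=tokens)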

Member

Another problem: the code only works when you specify text:

        # if text is passed, instantiate sentence with tokens (words)
        if text is not None:

I think this is not necessary, because you already pass the tokenized text.

With the current implementation, it is possible to use it with:

from typing import List

from flair.data import Sentence


def test_sentence_external_tokenization():
    text: str = "Berlin and Munich are nice cities ."

    tokens: List[str] = text.split()

    sentence: Sentence = Sentence(text=text, use_tokenizer=True, tokens_from_list=tokens)

    assert "Berlin and Munich are nice cities ." == sentence.to_tokenized_string()

which is too complicated. I think neither text nor use_tokenizer is really needed here.

Contributor Author

Thank you for your review.

For the first point, I thought that tokens_from_list should be tied to use_tokenizer=True, because we choose either segtok or an external tokenizer. But you are right, it is a bit complicated. It can certainly be an independent argument.
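
For reference, a rough, editor-added sketch of what an independent argument could look like; this is hypothetical (neither the PR's diff nor flair's eventual API). choose_tokens is an invented helper name; split_single, split_contractions, and word_tokenizer are the segtok functions already imported in flair/data.py.

from typing import List, Optional

from segtok.segmenter import split_single
from segtok.tokenizer import split_contractions, word_tokenizer


def choose_tokens(text: str, use_tokenizer: bool = False,
                  tokens_from_list: Optional[List[str]] = None) -> List[str]:
    if tokens_from_list:
        # externally produced tokens win, regardless of use_tokenizer
        return tokens_from_list
    if use_tokenizer:
        # built-in segtok tokenization, unchanged from the current code
        tokens: List[str] = []
        for sentence in split_single(text):
            tokens.extend(split_contractions(word_tokenizer(sentence)))
        return tokens
    # otherwise assume the text is already whitespace-tokenized
    return text.split(" ")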

For the second point, the original text is needed because I wanted to keep the same behavior and the same code as when we use the segtok tokenizer. It makes it easy to use entity span indexes to work on the original text. Let me clarify this with an example using the spaCy French tokenizer:

from spacy.lang.fr import French

from flair.data import Sentence

nlp = French()  # use only the tokenizer
text = "J'aime vivre à Montréal lorsqu'il neige."
doc = nlp(text)
tokens = [token.text for token in doc]  # ["J'", 'aime', 'vivre', 'à', 'Montréal', "lorsqu'", 'il', 'neige', '.']
sentence = Sentence(text, use_tokenizer=True, tokens_from_list=tokens)

assert text == sentence.to_original_text()
assert "J' aime vivre à Montréal lorsqu' il neige ." == sentence.to_tokenized_string()

In my opinion, it is valuable to keep this behavior, but maybe it could be optional?
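
For context, a simplified, editor-added sketch of the kind of offset bookkeeping that keeping the original text makes possible (the real logic sits just below this hunk, around the index = text.index line; align_tokens is a hypothetical name, not flair code):

from typing import List, Tuple


def align_tokens(text: str, tokens: List[str]) -> List[Tuple[str, int]]:
    """Locate each token in the original text and record its character offset."""
    offsets = []
    search_from = 0
    for token in tokens:
        start = text.index(token, search_from)  # find the token in the original string
        offsets.append((token, start))
        search_from = start + len(token)
    return offsets


# With the spaCy example above:
# align_tokens("J'aime vivre à Montréal lorsqu'il neige.", ["J'", "aime", "vivre", ...])
# returns [("J'", 0), ("aime", 2), ("vivre", 7), ...]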


+                if tokens_from_list is not None and len(tokens_from_list) > 0:
+                    # use a list of tokens obtained by any tokenizer of your choice
+                    tokens = tokens_from_list
+                else:
+                    # use segtok for tokenization
+                    tokens = []
+                    sentences = split_single(text)
+                    for sentence in sentences:
+                        contractions = split_contractions(word_tokenizer(sentence))
+                        tokens.extend(contractions)
-                tokens = []
-                sentences = split_single(text)
-                for sentence in sentences:
-                    contractions = split_contractions(word_tokenizer(sentence))
-                    tokens.extend(contractions)

                # determine offsets for whitespace_after field
                index = text.index