PatternParserLemmatizer: tagging errors negatively affecting sentiment analysis #6

Open
markuskiller opened this issue Aug 15, 2014 · 0 comments

@markuskiller
Owner

Tagging errors in PatternParser output can lead to incorrect lemmatization of frequent German adjectives. As a consequence, every tool that relies on the parser's output (POS tagging, sentiment analysis, noun phrase extraction, etc.) may return unexpected results:

Example (using IPython):

In [1]: from textblob_de import TextBlobDE
In [2]: TextBlobDE(u"Peter hat einen schönen Hund.").sentiment
Out[2]: Sentiment(polarity=0.0, subjectivity=0.0)
Out[EXPECTED]: Sentiment(polarity=1.0, subjectivity=0.0)

In [3]: TextBlobDE(u"Peter hat einen schönen Hund.").noun_phrases
Out[3]: WordList([])
Out[EXPECTED]: WordList([u'schönen Hund'])

In [4]: TextBlobDE(u"Peter hat einen schönen Hund.").tags
Out[4]: [('Peter', 'NNP'), ('hat', 'VB'), ('einen', 'DT'),  (u'schönen', 'PRP$'),  ('Hund', 'NN')]
Out[EXPECTED]: [...,  (u'schönen', 'JJ'), ...]
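
To make regressions like this easier to spot, the actual and expected tags can be compared programmatically. The following is only a minimal sketch, written for Python 3: `find_tag_mismatches` is a hypothetical helper (it is not part of textblob_de or pattern), and the `actual` list is simply copied from `Out[4]` above so that the snippet runs without either library installed.

```python
def find_tag_mismatches(actual_tags, expected_tags):
    """Return (word, actual_tag, expected_tag) for every word whose tag
    differs from the expected one.

    actual_tags   -- list of (word, tag) tuples, e.g. the value of `.tags`
    expected_tags -- dict mapping a word to the tag it should receive;
                     words missing from the dict are not checked
    """
    mismatches = []
    for word, tag in actual_tags:
        expected = expected_tags.get(word)
        if expected is not None and tag != expected:
            mismatches.append((word, tag, expected))
    return mismatches


# Tags copied from Out[4] above (hard-coded, not computed here).
actual = [("Peter", "NNP"), ("hat", "VB"), ("einen", "DT"),
          ("schönen", "PRP$"), ("Hund", "NN")]

# Only the adjective is checked, matching Out[EXPECTED] above.
expected = {"schönen": "JJ"}

mismatches = find_tag_mismatches(actual, expected)
print(mismatches)
```

With the tags reported in this issue, the helper flags `schönen` as tagged `PRP$` where `JJ` is expected. To check live output instead, `TextBlobDE(...).tags` could be passed as `actual_tags`.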

Root cause:

In [5]: from pattern.de import parse, pprint

In [6]: pprint(parse(u"Peter hat einen schönen Hund.", lemmata=True))

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

         Peter   NNP    NP      -      -      -      peter
           hat   VB     VP      -      -      -      haben
         einen   DT     NP      -      -      -      ein
       schönen > PRP$ < NP ^    -      -      -    > schön[en] <
          Hund   NN     NP ^    -      -      -      hund
             .   .      -       -      -      -      .

Please send suggestions for improvement directly to the pattern project (see e.g. clips/pattern#63). The version of pattern.text.de included in textblob-de will be updated regularly.

I am also working on the integration of additional lemmatizers into textblob_de, but PatternParserLemmatizer will remain the default choice, as it is implemented in Python.

@markuskiller changed the title from "PatternParserLemmatizer: tagging errors" to "PatternParserLemmatizer: tagging errors negatively affecting sentiment analysis" on Aug 15, 2014
@markuskiller self-assigned this on Aug 15, 2014