Strange tagging for some verb forms of "sein" and for "du" #9

Open
CJAnti opened this issue Oct 25, 2014 · 8 comments
@CJAnti

CJAnti commented Oct 25, 2014

I stumbled over some strange tagging and was wondering why "bist" and "seid" aren't recognized as verb forms of "sein", even though they are listed in the "de-verbs.txt" file. Tagging the personal pronoun "du" as an adjective doesn't make much sense either.

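>>> # Imports assumed from the textblob-de README; they were not part of the original report.
>>> from textblob_de import TextBlobDE as TextBlob
>>> from textblob_de import PatternParser
>>> blob = TextBlob("Ich bin. Du bist. Er ist. Wir sind. Ihr seid. Sie sind.",
...                 parser=PatternParser(pprint=True, lemmata=True))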
>>> blob.parse()
          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA   

           Ich   PRP    NP      -      -      -      ich     
           bin   VB     VP      -      -      -      sein    
             .   .      -       -      -      -      .       
            Du   JJ     NP      -      -      -      du      
          bist   NN     NP ^    -      -      -      bist    
             .   .      -       -      -      -      .       
            Er   PRP    NP      -      -      -      er      
           ist   VB     VP      -      -      -      sein    
             .   .      -       -      -      -      .       
           Wir   PRP    NP      -      -      -      wir     
          sind   VB     VP      -      -      -      sein    
             .   .      -       -      -      -      .       
           Ihr   PRP$   NP      -      -      -      ihr     
          seid   NN     NP ^    -      -      -      seid    
             .   .      -       -      -      -      .       
           Sie   PRP    NP      -      -      -      sie     
          sind   VB     VP      -      -      -      sein    
             .   .      -       -      -      -      .     
@markuskiller
Owner

Thanks for the report. Unfortunately, this seems to be an issue in the pattern library, which textblob-de uses without changes to the source code other than making it Python 3 compatible. It would be great if you could report it directly to the pattern project at: https://github.com/clips/pattern/issues

You could use the following test, or link to this issue, so that they can verify the strange behaviour:

# Tested on Python2.7.8, 32bit, on Windows 8.1 (64bit)

# pattern.__version__ 
# '2.6'

In [1]: from pattern.de import parse, pprint

In [2]: pprint(parse("Ich bin. Du bist. Er ist. Wir sind. Ihr seid. Sie sind.", lemmata=True))
          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Ich   PRP    NP      -      -      -      ich
           bin   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

            Du   PRP    NP      -      -      -      du
          bist   NN     NP ^    -      -      -      bist
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

            Er   PRP    NP      -      -      -      er
           ist   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Wir   PRP    NP      -      -      -      wir
          sind   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Ihr   PRP$   NP      -      -      -      ihr
          seid   NN     NP ^    -      -      -      seid
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Sie   PRP    NP      -      -      -      sie
          sind   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

In [3]: pprint(parse("Ihr seid alle herzlich eingeladen zu meinem Geburtstagsfest.", lemmata=True))
           WORD   TAG    CHUNK    ROLE   ID     PNP    LEMMA

            Ihr   PRP$   NP       -      -      -      ihr
           seid   NN     NP ^     -      -      -      seid
           alle   RB     ADJP     -      -      -      alle
       herzlich   JJ     ADJP ^   -      -      -      herzlich
     eingeladen   VBN    VP       -      -      -      einladen
             zu   IN     PP       -      -      PNP    zu
         meinem   PRP$   NP       -      -      PNP    meinem
Geburtstagsfest   NN     NP ^     -      -      PNP    geburtstagsfest
              .   .      -        -      -      -      .

In [4]: pprint(parse("Du bist herzlich eingeladen zu meinem Geburtstagsfest.", lemmata=True))
           WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

             Du   PRP    NP      -      -      -      du
           bist   NN     NP ^    -      -      -      bist
       herzlich   JJ     ADJP    -      -      -      herzlich
     eingeladen   VBN    VP      -      -      -      einladen
             zu   IN     PP      -      -      PNP    zu
         meinem   PRP$   NP      -      -      PNP    meinem
Geburtstagsfest   NN     NP ^    -      -      PNP    geburtstagsfest
              .   .      -       -      -      -      .
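
A check like the one above can be automated so that a fix is easy to confirm later. The following is only a minimal sketch and not part of pattern or textblob-de: `find_mistagged` and `SEIN_FORMS` are hypothetical names, and the code assumes the table layout printed by `pprint` above (whitespace-separated columns, WORD first, TAG second, LEMMA last).

```python
# Hypothetical helper: scan a pretty-printed parse table (layout as shown above)
# and report present-tense forms of "sein" that were not tagged as verbs.
# Assumption: whitespace-separated columns with WORD first, TAG second, LEMMA last.

SEIN_FORMS = {"bin", "bist", "ist", "sind", "seid"}


def find_mistagged(table_text):
    """Return (word, tag, lemma) for each form of 'sein' lacking a VB* tag or lemma 'sein'."""
    problems = []
    for line in table_text.splitlines():
        cols = line.split()
        # Skip blank lines, header rows, and rows too short to hold a tag and a lemma.
        if len(cols) < 3 or cols[0] == "WORD":
            continue
        word, tag, lemma = cols[0], cols[1], cols[-1]
        if word.lower() in SEIN_FORMS and (not tag.startswith("VB") or lemma != "sein"):
            problems.append((word, tag, lemma))
    return problems


sample = """
          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

            Du   PRP    NP      -      -      -      du
          bist   NN     NP ^    -      -      -      bist
             .   .      -       -      -      -      .
            Er   PRP    NP      -      -      -      er
           ist   VB     VP      -      -      -      sein
"""

print(find_mistagged(sample))  # prints [('bist', 'NN', 'bist')]
```

The helper works on the printed text only, so it can be run against output pasted from either pattern or textblob-de without importing either library.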

@CJAnti
Author

CJAnti commented Oct 25, 2014

Thanks for your fast reply. I'm using Python 3.4 64bit on Windows 8.1 and need to investigate further.
(I'm halfway through the NLTK book and was thinking about starting such a project myself when I saw that you had already started one for German. First of all, thanks for that. ^^ I'll still need to adjust it for German dialects anyway, as we use many different ones in our German chat room. Many smilies are also missing, at least for us.)

@markuskiller
Owner

Thanks for investigating the issue further and contributing your results to the pattern project. They seem to be rather stretched for resources. What I like about the pattern implementation is that it is pure Python and its lemmatization is quite fast compared to other taggers. However, accuracy is a major problem. I've been working on textblob-rftagger for a while and the results are promising. It's not ready for public release yet, but if you contact me via email, I could invite you to the Bitbucket repo (if you're interested).
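
Until the tagger itself is fixed, one possible stopgap is to post-correct the forms reported in this thread. This is a minimal sketch, assuming the tagger output is available as a list of `(word, tag)` tuples; `OVERRIDES` and `apply_overrides` are hypothetical names and not part of pattern or textblob-de.

```python
# Hypothetical post-correction step; not part of pattern or textblob-de.
# Maps lower-cased words to the tag expected for the examples in this thread.
OVERRIDES = {
    "bist": "VB",  # 2nd person singular of "sein", reported as NN
    "seid": "VB",  # 2nd person plural of "sein", reported as NN
    "du": "PRP",   # personal pronoun, reported as JJ by textblob-de
}


def apply_overrides(tagged, overrides=OVERRIDES):
    """Return a new list of (word, tag) pairs with the known mis-taggings replaced."""
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]


tagged = [("Du", "JJ"), ("bist", "NN"), ("herzlich", "JJ"), ("eingeladen", "VBN")]
print(apply_overrides(tagged))
# prints [('Du', 'PRP'), ('bist', 'VB'), ('herzlich', 'JJ'), ('eingeladen', 'VBN')]
```

A blanket override ignores context, so it only suits forms that are unambiguous. "Ihr" is deliberately left out because it can be either a personal or a possessive pronoun.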

@mk270

mk270 commented Jul 22, 2015

By the way, do you know whether rftagger is open source?

@markuskiller
Owner

@mk270 rftagger is open source and its source code is available under the following links:

project page: http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/

source code: http://www.cis.uni-muenchen.de/~schmid/tools/RFTagger/data/RFTagger.tar.gz

However, it is "freely available for education, research and other non-commercial purposes" only. If this is not a problem for your project, feel free to contact me via email so that I can invite you to the Bitbucket repository of textblob-rftagger. It is fully functional and works on Windows, OS X and Linux. The only reasons I'm holding it back are that I haven't had the time to sort out a sensible and secure way of distributing the included binaries and that I'm unsure about the licensing of these binaries.

@mk270

mk270 commented Jul 22, 2015

So, "available for education, research and other non-commercial purposes" is fairly canonically NOT open source, see for instance http://opensource.org/osd-annotated section 6.

It's a shame. I presume, since they've chosen to exclude commercial use, that they're not going to be biddable.

@markuskiller
Owner

Thanks for the link. I interpreted 'open source' as 'is the source code available/accessible' (i.e. can it be modified/tweaked, etc.), which it is.

On other projects they released under a similarly restrictive license, they added:

"In order to use the TreeTagger commercially, you need to obtain a commercial license (see contact address below)!" (Source: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). So I assume that they might be willing to make a decision/offer/quote on a per-project basis and that they want to know what the project is about if you intend to use their software commercially.

@mk270

mk270 commented Jul 22, 2015

Yes, indeed, it's not remotely open source in that term's conventional acceptation - it's proprietary.

I am asking as it's a dependency of another project I'm interested in. Ah well.
