PatternParser problem #17

Open
mobileunit opened this issue Apr 9, 2018 · 2 comments

mobileunit commented Apr 9, 2018

Hi there,

thanks a lot for textblob-de.
I ran into an issue when trying to get only the chunk information. My goal is to count the number of VP, NP, PP, etc.
To do that, I am trying to extract only the chunks, using the following code:

from textblob_de import TextBlobDE as TextBlob
from textblob_de import PatternParser
# Options as originally tried; this combination produces the output shown below.
blob = TextBlob("Das ist ein schönes Auto, das du dir da gekauft hast. Das finde ich richtig klasse!", parser=PatternParser(pprint=True, chunks=True, tags=False, relations=True, lemmata=False, tokenize=False, tagset="UNIVERSAL"))
blob.parse()

But with the pprint option I get the POS tags in place of the chunk tags.
I could not find a way to get the chunks by type in order to count them. Is there a trick to do so?

          WORD   TAG    CHUNK      ROLE   ID     PNP    LEMMA

           Das   -      PDS        -      -      -      -
           ist   -      VVFIN      -      -      -      -
           ein   -      ARTIND     -      -      -      -
       schönes   -      NN         -      -      -      -
         Auto,   -      NN ^       -      -      -      -
           das   -      ARTDEF     -      -      -      -
            du   -      PPOSAT     -      -      -      -
           dir   -      PPER       -      -      -      -
            da   -      KOUS       -      -      -      -
       gekauft   -      VVFIN      PNP    -      -      -
         hast.   -      VVFIN ^    PNP    -      -      -
           Das   -      ARTDEF     -      -      -      -
         finde   -      NN         -      -      -      -
           ich   -      PPER       -      -      -      -
       richtig   -      ADJA       -      -      -      -
       klasse!   -      NN         -      -      -      -

Obviously, simply counting the chunk tags gives wrong results, because every token in a chunk carries the same chunk tag. How can I find the chunk boundaries so that I can count properly? Any suggestions?

Many thanks and best regards, Andy


markuskiller commented Apr 17, 2018

Hi Andy,

My apologies for the late reply. It seems to work if you use the standard options that are passed on to the pattern parser (for a list of the default values, see http://textblob-de.readthedocs.io/en/stable/api_reference.html#module-textblob_de.parsers). The main problem in your example is that the text is not tokenised properly (punctuation sticks to the previous token), which leads to a number of additional mistakes in the tagging process. In addition, the chunks are not computed properly if you use the tags=False option. If I try this:

from textblob_de import TextBlobDE as TextBlob
from textblob_de import PatternParser
blob = TextBlob("Das ist ein schönes Auto, das du dir da gekauft hast. Das finde ich richtig klasse!", parser=PatternParser(pprint=True))
blob.parse()

I get:

          WORD   TAG    CHUNK    ROLE   ID     PNP    LEMMA

           Das   DT     -        -      -      -      -
           ist   VB     VP       -      -      -      -
           ein   DT     NP       -      -      -      -
       schönes   NN     NP ^     -      -      -      -
          Auto   NN     NP ^     -      -      -      -
             ,   ,      -        -      -      -      -
           das   WDT    NP       -      -      -      -
            du   PRP    NP ^     -      -      -      -
           dir   PRP    NP ^     -      -      -      -
            da   IN     PP       -      -      -      -
       gekauft   VB     VP       -      -      -      -
          hast   NN     NP       -      -      -      -
             .   .      -        -      -      -      -
           Das   DT     NP       -      -      -      -
         finde   NN     NP ^     -      -      -      -
           ich   PRP    NP ^     -      -      -      -
       richtig   JJ     ADJP     -      -      -      -
        klasse   JJ     ADJP ^   -      -      -      -
             !   .      -        -      -      -      -

For counting purposes, you need to exclude the rows whose chunk tag is followed by a ^ sign in the pretty_print layout, since these continue the chunk started on an earlier row. However, it might be easier to use the standard layout for counting (pprint=False):

'Das/DT/O/O ist/VB/B-VP/O ein/DT/B-NP/O schönes/NN/I-NP/O Auto/NN/I-NP/O ,/,/O/O das/WDT/B-NP/O du/PRP/I-NP/O dir/PRP/I-NP/O da/IN/B-PP/O gekauft/VB/B-VP/O hast/NN/B-NP/O ././O/O Das/DT/B-NP/O finde/NN/I-NP/O ich/PRP/I-NP/O richtig/JJ/B-ADJP/O klasse/JJ/I-ADJP/O !/./O/O'

This gives you the option of counting only the chunk tags that start with B-, i.e. the first token of each chunk. Unfortunately, there are still quite a few tagging and chunking mistakes in this output, but this is about as accurate as you can get using the pattern library.
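
As a rough sketch of the B- counting idea, the snippet below works on the tagged string shown above, pasted in as a literal so that it runs without textblob-de installed. The helper names (iter_tokens, count_chunks, chunk_spans) are made up for illustration and are not part of textblob-de or pattern. It assumes every token has exactly the four slash-separated fields word/TAG/CHUNK/PNP seen in the default layout.

```python
from collections import Counter

# Tagged output in the default (pprint=False) layout, copied from above.
# In practice this would presumably be the string returned by blob.parse().
tagged = (
    "Das/DT/O/O ist/VB/B-VP/O ein/DT/B-NP/O schönes/NN/I-NP/O Auto/NN/I-NP/O "
    ",/,/O/O das/WDT/B-NP/O du/PRP/I-NP/O dir/PRP/I-NP/O da/IN/B-PP/O "
    "gekauft/VB/B-VP/O hast/NN/B-NP/O ././O/O Das/DT/B-NP/O finde/NN/I-NP/O "
    "ich/PRP/I-NP/O richtig/JJ/B-ADJP/O klasse/JJ/I-ADJP/O !/./O/O"
)


def iter_tokens(tagged_text):
    """Yield (word, tag, chunk, pnp) for each whitespace-separated token."""
    for token in tagged_text.split():
        # Split from the right so the word itself may contain a slash.
        word, tag, chunk, pnp = token.rsplit("/", 3)
        yield word, tag, chunk, pnp


def count_chunks(tagged_text):
    """Count chunks by type, counting only tokens that begin a chunk (B-)."""
    counts = Counter()
    for _, _, chunk, _ in iter_tokens(tagged_text):
        if chunk.startswith("B-"):
            counts[chunk[2:]] += 1
    return counts


def chunk_spans(tagged_text):
    """Group words into (chunk_type, [words]) using the B-/I-/O tags."""
    spans = []
    current = None  # the chunk currently being extended, if any
    for word, _, chunk, _ in iter_tokens(tagged_text):
        if chunk.startswith("B-"):
            current = (chunk[2:], [word])
            spans.append(current)
        elif chunk.startswith("I-") and current and current[0] == chunk[2:]:
            current[1].append(word)
        else:
            current = None  # "O" (outside any chunk) ends the current chunk
    return spans


if __name__ == "__main__":
    print(count_chunks(tagged))
    for chunk_type, words in chunk_spans(tagged):
        print(chunk_type, " ".join(words))
```

chunk_spans also recovers the chunk boundaries, so the same pass can be used to inspect which words ended up in each chunk.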

Hope this helps.

Best wishes, Markus

markuskiller self-assigned this Apr 17, 2018
mobileunit (Author) commented

This helps a lot! Thank you :)
