
How to get dependency tags/tree in a CoNLL-format? #533

Closed
redstar12 opened this issue Oct 19, 2016 · 19 comments
Labels
usage General spaCy usage

Comments

@redstar12

How to get dependency tags/tree in a CoNLL-format like this:
1 Bob _ NOUN NNP _ 2 nsubj _ _
2 brought _ VERB VBD _ 0 ROOT _ _
3 the _ DET DT _ 4 det _ _
4 pizza _ NOUN NN _ 2 dobj _ _
5 to _ ADP IN _ 2 prep _ _
6 Alice _ NOUN NNP _ 5 pobj _ _
?

@honnibal
Member

honnibal commented Oct 19, 2016

import spacy

nlp = spacy.load('en', vectors=False)
doc = nlp(u'Bob bought the pizza to Alice')
for sent in doc:
    for i, word in enumerate(sent):
        if word.head is word:
            head_idx = 0
        else:
            head_idx = word.i - sent[0].i + 1
        print(
            i+1, # There's a word.i attr that's position in *doc*
            word.pos_, # Coarse-grained tag
            word.tag_, # Fine-grained tag
            head_idx,
            word.dep_, # Relation
            '_', '_')

Should have had this snippet up from the start --- thanks.

@honnibal honnibal added the usage General spaCy usage label Oct 19, 2016
@evanmiltenburg

For common formats, I feel like this should be a method on the Doc object returned by nlp(). Your snippet would then shorten to:

import spacy

nlp = spacy.load('en', vectors=False)
doc = nlp(u'Bob bought the pizza to Alice')
doc.save_conll('bob_and_alice.conll')

And others could implement methods like save_CoreNLP (Stanford parser XML) and save_conllu (Universal Dependencies). Another option would be to have the method work like doc.save('bob_and_alice.conll', format='conll'). Either way, the function should at least have the following keywords to include/exclude particular layers (suggested default values in parentheses):

  • pos (default True)
  • tag (default True)
  • deps (default True if available, there should be a check)
  • entities (default False)
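A rough sketch of what such a helper could look like. Note that `doc_to_conll` and the `Tok` stand-in class below are hypothetical names used purely for illustration, not part of spaCy's API, and the `entities` flag is omitted for brevity:

```python
def doc_to_conll(sents, pos=True, tag=True, deps=True):
    """Format sentences of spaCy-like tokens as CoNLL rows.

    Each token is expected to expose .i, .head, .pos_, .tag_ and .dep_
    (as spaCy tokens do); disabled fields are emitted as '_'.
    """
    lines = []
    for sent in sents:
        offset = sent[0].i  # doc-level index of the sentence's first token
        for word in sent:
            head_idx = 0 if word.head is word else word.head.i - offset + 1
            lines.append('\t'.join([
                str(word.i - offset + 1),        # ID: 1-based within sentence
                str(word),                       # FORM
                '_',                             # LEMMA (not handled here)
                word.pos_ if pos else '_',       # coarse-grained tag
                word.tag_ if tag else '_',       # fine-grained tag
                '_',                             # FEATS
                str(head_idx) if deps else '_',  # HEAD
                word.dep_ if deps else '_',      # DEPREL
                '_', '_',                        # DEPS, MISC
            ]))
        lines.append('')  # sentences are separated by a blank line
    return '\n'.join(lines)


# Quick demo with a stand-in token class, so no model download is needed:
class Tok:
    def __init__(self, i, text, pos, tag, dep):
        self.i, self.text = i, text
        self.pos_, self.tag_, self.dep_ = pos, tag, dep
        self.head = self  # tokens are their own head (ROOT) by default

    def __str__(self):
        return self.text


bob = Tok(0, 'Bob', 'PROPN', 'NNP', 'nsubj')
bought = Tok(1, 'bought', 'VERB', 'VBD', 'ROOT')
bob.head = bought
print(doc_to_conll([[bob, bought]]))
# 1	Bob	_	PROPN	NNP	_	2	nsubj	_	_
# 2	bought	_	VERB	VBD	_	0	ROOT	_	_
```

The same function would accept `doc.sents` from a real parsed Doc, since it only relies on the token attributes listed in the docstring.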

@evanmiltenburg

evanmiltenburg commented Oct 20, 2016

@redstar12 messaged me to say that @honnibal's code is not working for her on Python 2.7. Reproducing it on my machine, the problem seems to be this part:

>>> nlp = spacy.load('en', vectors=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Applications/anaconda/envs/python2/lib/python2.7/site-packages/spacy/__init__.py", line 16, in load
    vectors_package = get_package_by_name(vectors, via=via)
  File "/Applications/anaconda/envs/python2/lib/python2.7/site-packages/spacy/util.py", line 41, in get_package_by_name
    lang = get_lang_class(name)
  File "/Applications/anaconda/envs/python2/lib/python2.7/site-packages/spacy/util.py", line 26, in get_lang_class
    lang = re.split('[^a-zA-Z0-9_]', name, 1)[0]
  File "/Applications/anaconda/envs/python2/lib/python2.7/re.py", line 171, in split
    return _compile(pattern, flags).split(string, maxsplit)
TypeError: expected string or buffer

This is easily solved by using the old approach:

from spacy.en import English
nlp = English()
doc = nlp(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head is word:
            head_idx = 0
        else:
            head_idx = word.i - sent[0].i + 1
        print(
            i+1, # There's a word.i attr that's position in *doc*
            word.pos_, # Coarse-grained tag
            word.tag_, # Fine-grained tag
            head_idx,
            word.dep_, # Relation
            '_', '_')

I don't know whether this has been fixed in the meantime (I'm using an older version of spaCy). If not, there should be a new GitHub issue to address it.

@redstar12
Author

redstar12 commented Oct 20, 2016

I upgraded and tested all these snippets, and I'm getting the error:
TypeError: 'spacy.tokens.token.Token' object is not iterable
(AttributeError: 'spacy.tokens.doc.Doc' object has no attribute 'save_conll')

@evanmiltenburg

Updated the code (apparently we forgot to write doc.sents instead of doc). The AttributeError makes sense because it was a feature request rather than a reference to an existing method.

@redstar12
Author

Thank you! It works. But now I am getting:
1 PUNCT "" 0 ROOT _ _
1 PUNCT "" 0 ROOT _ _
1 PUNCT "" 0 ROOT _ _
1 PUNCT "" 0 ROOT _ _
1 PUNCT "" 0 ROOT _ _
1 PUNCT "" 0 ROOT _ _

@evanmiltenburg

I got this (in Python 2.7), using the example code as given above.

(1, u'PROPN', u'NNP', 1, u'nsubj', '_', '_')
(2, u'VERB', u'VBD', 0, u'ROOT', '_', '_')
(3, u'DET', u'DT', 3, u'det', '_', '_')
(4, u'NOUN', u'NN', 4, u'dobj', '_', '_')
(5, u'ADP', u'IN', 5, u'prep', '_', '_')
(6, u'PROPN', u'NNP', 6, u'pobj', '_', '_')

So the code is definitely working. Did you change anything? (At some point you should start figuring this out yourself, though; it's your problem.)

@redstar12
Author

OK! Thank you very much!

@evanmiltenburg

  1. As I already told you: doc.save_conll doesn't work because it isn't implemented yet; it was just a suggestion to create this function in the future, so it will not work for you.
  2. I'm not sure about your indentation. But let's assume that's not the problem.
  3. I checked it again, in Python 2, and the code really works:
Python 2.7.11 |Continuum Analytics, Inc.| (default, Jun 15 2016, 16:09:16)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(u'Bob bought the pizza to Alice')
>>> for sent in doc.sents:
...     for i, word in enumerate(sent):
...         if word.head is word:
...             head_idx = 0
...         else:
...              head_idx = word.i-sent[0].i+1
...         print(
...             i+1, # There's a word.i attr that's position in *doc*
...             word.pos_, # Coarse-grained tag
...             word.tag_, # Fine-grained tag
...             head_idx,
...             word.dep_, # Relation
...             '_', '_')
...
(1, u'PROPN', u'NNP', 1, u'nsubj', '_', '_')
(2, u'VERB', u'VBD', 0, u'ROOT', '_', '_')
(3, u'DET', u'DT', 3, u'det', '_', '_')
(4, u'NOUN', u'NN', 4, u'dobj', '_', '_')
(5, u'ADP', u'IN', 5, u'prep', '_', '_')
(6, u'PROPN', u'NNP', 6, u'pobj', '_', '_')
>>>

But now I'm wondering: did you download and install the spaCy data? If not, do this on the command line: python -m spacy.en.download. Then run the code, and make sure the indentation is indeed correct.

If that doesn't work, then please try to investigate where the code breaks down. Just coming here and saying "it doesn't work" is not good enough. Try to see if other things do work, e.g. tagging:

for token in doc:
    print('\t'.join([token.orth_, token.pos_, token.tag_]))

@redstar12
Author

Thank you very much for your reply. I don't know why, but we had a problem with parsing after the upgrade. We reinstalled spaCy and now the code works. BUT! As you can see, the HEAD value is wrong. It is the same as the ID value:

(1, u'PROPN', u'NNP', 1, u'nsubj', '', '')
(2, u'VERB', u'VBD', 0, u'ROOT', '', '')
(3, u'DET', u'DT', 3, u'det', '', '')
(4, u'NOUN', u'NN', 4, u'dobj', '', '')
(5, u'ADP', u'IN', 5, u'prep', '', '')
(6, u'PROPN', u'NNP', 6, u'pobj', '', '')

And it has to be like this:

(1, u'PROPN', u'NNP', 2, u'nsubj', '', '')
(2, u'VERB', u'VBD', 0, u'ROOT', '', '')
(3, u'DET', u'DT', 4, u'det', '', '')
(4, u'NOUN', u'NN', 2, u'dobj', '', '')
(5, u'ADP', u'IN', 2, u'prep', '', '')
(6, u'PROPN', u'NNP', 5, u'pobj', '', '')

@evanmiltenburg

OK, then there was a small mistake in the code @honnibal wrote. I don't have time to fix it. All I wanted to say in this thread was that it'd be nice to have a method for the parsed document to save it in CoNLL format.

Hints I can give you to fix the code:

  • Each token has a token.i attribute that gives you the index.
  • Each token has a token.head attribute that gives you the head token (which also has the i attribute).

@redstar12
Author

redstar12 commented Oct 24, 2016

Thank you very much for your hints. I fixed the code. It works!!! Thank you!

@ines ines closed this as completed Jan 9, 2017
@mosynaq

mosynaq commented Jul 26, 2017

Here's the code that works for me:

doc = nlp(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head is word:
            head_idx = 0
        else:
            head_idx = doc[i].head.i + 1

        print("%d\t%s\t%s\t%s\t%s\t%s\t%d\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            '_',
            word.pos_,  # Coarse-grained tag
            word.tag_,  # Fine-grained tag
            '_',
            head_idx,
            word.dep_,  # Relation
            '_', '_'))

And the output:

1	Bob	_	PROPN	NNP	_	2	nsubj	_	_
2	bought	_	VERB	VBD	_	0	ROOT	_	_
3	the	_	DET	DT	_	4	det	_	_
4	pizza	_	NOUN	NN	_	2	dobj	_	_
5	to	_	ADP	IN	_	2	prep	_	_
6	Alice	_	PROPN	NNP	_	5	pobj	_	_

You can test the output here (online) or use this (offline, .Net-based).

Sample output visualization from the latter: [screenshot]

@GruffPrys

@mosynaq #1215 might be of interest to you for testing output in displaCy :)

@Nou2017

Nou2017 commented Aug 23, 2017

Is it possible to get the same result using MaltParser and Python?

@flackbash

flackbash commented Sep 4, 2017

The computation of the head id is not entirely correct in either one of the code snippets.
This is how it should be:

if word.head is word:
    head_idx = 0
else:
    # this is the corrected line:
    head_idx = word.head.i - sent[0].i + 1

Aside from that the comments here were really helpful, thanks!

@Nou2017

Nou2017 commented Sep 4, 2017

Does it work with French sentences?

@Imane0

Imane0 commented Jan 18, 2018

@flackbash Could you please explain why you add "- sent[0].i + 1" ? Why isn't word.head.i enough ?
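The reason, briefly: token.i is the token's position in the whole Doc, while CoNLL IDs (including the HEAD column) are 1-based and local to each sentence, so word.head.i alone is only correct for a document's first sentence. A toy calculation with plain integers (no spaCy needed) illustrates the offset:

```python
# Imagine a Doc with two sentences occupying these doc-level indices:
#   sentence 1: tokens 0..5, sentence 2: tokens 6..11
sent_start = 6       # sent[0].i for the second sentence
head_doc_index = 7   # word.head.i: the head is the sentence's 2nd token

# Writing word.head.i (7) into the HEAD column would point at the 7th
# token *of the sentence* -- wrong. CoNLL expects the sentence-local,
# 1-based ID instead:
head_idx = head_doc_index - sent_start + 1
print(head_idx)  # -> 2
```

For the first sentence sent[0].i is 0, which is why the shorter word.head.i + 1 appears to work on single-sentence examples.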

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018