Problem using doc array() and bytes() functions together #3012

Closed
clippered opened this issue Dec 5, 2018 · 5 comments
Labels
bug: Bugs and behaviour differing from documentation
feat / serialize: Feature: Serialization, saving and loading

Comments

@clippered
Contributor

How to reproduce the behaviour

import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE
from spacy.tokens import Doc


nlp = spacy.load('en_core_web_sm')
doc = nlp('This is 10%.')
print(doc[2].orth_, doc[2].pos_, doc[2].tag_, doc[2].ent_type_)  # displays "10 NUM CD PERCENT"

# removing '10%' entity
header = [ENT_IOB, ENT_TYPE]
ent_array = doc.to_array(header)
idx = 2
ent_array[idx, 0] = 0
ent_array[idx, 1] = 0
doc.from_array(header, ent_array)

print(doc[2].orth_, doc[2].pos_, doc[2].tag_, doc[2].ent_type_)  # displays "10 NUM CD "

# serializing then deserializing
doc_bytes = doc.to_bytes()  # named doc_bytes to avoid shadowing the built-in bytes type
doc2 = Doc(nlp.vocab).from_bytes(doc_bytes)
print(doc2[2].orth_, doc2[2].pos_, doc2[2].tag_, doc2[2].ent_type_)  # displays "10   "

My use case is to change some of the tagged entities while keeping the others. In the small example above, the problem appears when the doc is serialized and deserialized after the entity removal step. Running it prints:

10 NUM CD PERCENT
10 NUM CD 
10

However, with the entity removal part commented out (everything from the to_array call through the from_array call), serialization works fine and the script prints:

10 NUM CD PERCENT
10 NUM CD PERCENT
10 NUM CD PERCENT

Environment

  • Operating System: OSX 10.14
  • Python Version Used: Python 3.6.6
  • spaCy Version Used: spaCy 2.0.18
  • Environment Information:
@clippered
Contributor Author

Apologies if I left out some information.
In the example above, I want the POS/TAG attributes to be kept after serialization. The expected output would be:

10 NUM CD PERCENT
10 NUM CD 
10 NUM CD 

honnibal added the bug label Dec 6, 2018
@honnibal
Member

honnibal commented Dec 6, 2018

Thanks for the example, this definitely looks suspicious. Would you mind making a pull request with your example as an xfail-ed test? It would be in the spacy/tests/regression directory.

clippered pushed a commit to clippered/spaCy that referenced this issue Dec 6, 2018
@honnibal
Member

It looks to me like the problem might be with the is_tagged attribute. If this is set to False, the POS tags are not serialized, leading to the mismatch we're seeing here.
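
The suspected mechanism can be illustrated without spaCy. The sketch below is a toy model, not spaCy's implementation: `ToyDoc`, its JSON payload, and the rule inside `from_array` are all assumptions made up for illustration. It only shows how a flag that gates serialization can cause tags that are still present on the doc to go missing after a round trip.

```python
import json

# Stand-ins for the spacy.attrs constants; plain strings keep the sketch dependency-free.
TAG, ENT_TYPE = "TAG", "ENT_TYPE"


class ToyDoc:
    """Hypothetical, heavily simplified document; not spaCy's Doc."""

    def __init__(self, words, tags, ents, is_tagged=True):
        self.words = list(words)
        self.tags = list(tags)
        self.ents = list(ents)
        self.is_tagged = is_tagged

    def from_array(self, attrs, rows):
        # Write each column of `rows` into the matching per-token attribute.
        for i, row in enumerate(rows):
            for attr, value in zip(attrs, row):
                if attr == TAG:
                    self.tags[i] = value
                elif attr == ENT_TYPE:
                    self.ents[i] = value
        # Assumed faulty step: the flag is recomputed from the header alone,
        # so a header without TAG resets it even though tags are still set.
        self.is_tagged = TAG in attrs
        return self

    def to_bytes(self):
        payload = {"words": self.words, "ents": self.ents}
        if self.is_tagged:  # tags are only written when the flag is set
            payload["tags"] = self.tags
        return json.dumps(payload).encode("utf8")

    @classmethod
    def from_bytes(cls, data):
        payload = json.loads(data.decode("utf8"))
        tags = payload.get("tags")
        empty = [""] * len(payload["words"])
        return cls(payload["words"], tags or empty, payload["ents"], is_tagged=tags is not None)


doc = ToyDoc(
    words=["This", "is", "10", "%", "."],
    tags=["DT", "VBZ", "CD", "NN", "."],
    ents=["", "", "PERCENT", "PERCENT", ""],
)
# Clear the entity type on token 2 only, as in the report.
doc.from_array([ENT_TYPE], [[""], [""], [""], ["PERCENT"], [""]])
restored = ToyDoc.from_bytes(doc.to_bytes())
print(repr(doc.tags[2]), repr(restored.tags[2]))  # 'CD' ''
```

In this toy model the tag survives on the in-memory doc but is absent after the round trip, which matches the pattern in the reported output.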

ines added the feat / serialize label Dec 18, 2018
ines pushed a commit that referenced this issue Dec 18, 2018
* issue #3012: add test

* add contributor agreement

* Make test work without models and fix typos

ten.pos_ instead of ten.orth_ and comparison against "10" instead of integer 10
honnibal added a commit that referenced this issue Dec 30, 2018
honnibal added a commit that referenced this issue Dec 30, 2018
If doc.from_array() was called with say, only entity information, this
would cause doc.is_tagged to be set to False, even if tags were set.
This caused tags to be dropped from serialisation. The same was true for
doc.is_parsed.

Closes #3012.
@honnibal
Member

Yep, if we do doc.from_array([ENT_IOB, ENT_TYPE]), it was seeing that no TAG was set and then clobbering doc.is_tagged, setting it to False. This led us to not serialize the tag data.

Fixed now 🎉
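
The idea behind the fix can be sketched as a flag update that never switches an existing True back off. This is a minimal illustration under stated assumptions, not the code from the commit: `updated_is_tagged`, `serialize`, and `deserialize` are hypothetical helpers, the string constants stand in for `spacy.attrs`, and the actual change in spaCy may differ in detail.

```python
import json

# Stand-ins for the spacy.attrs constants (assumption: plain strings for illustration).
TAG, ENT_IOB, ENT_TYPE = "TAG", "ENT_IOB", "ENT_TYPE"


def updated_is_tagged(was_tagged, header):
    """Preserving rule: keep an existing True; only ever switch the flag on."""
    return was_tagged or TAG in header


def serialize(tags, is_tagged):
    # Hypothetical payload: tags are written only when the flag is set.
    payload = {"n": len(tags)}
    if is_tagged:
        payload["tags"] = tags
    return json.dumps(payload).encode("utf8")


def deserialize(data):
    payload = json.loads(data.decode("utf8"))
    # Missing tags come back as empty strings, one per token.
    return payload.get("tags", [""] * payload["n"])


tags = ["DT", "VBZ", "CD", "NN", "."]
is_tagged = True
# Simulate from_array being called with entity columns only.
is_tagged = updated_is_tagged(is_tagged, [ENT_IOB, ENT_TYPE])
restored_tags = deserialize(serialize(tags, is_tagged))
print(restored_tags[2])  # CD
```

Per the commit message above, the same reasoning applies to doc.is_parsed; the sketch covers only the tag flag.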

@lock

lock bot commented Jan 29, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators Jan 29, 2019