I'm using the Spanish es_core_web_md model and am finding that it mutates the supplied text such that the phrase 'al' gets transformed into 'ael'. I'm relatively new to the world of spaCy, but there seems to be no record made of this transformation. This is a feature request to record these substitutions so that it's possible to map back from the spaCy token.idx attribute to an index in the raw text supplied to the spaCy pipeline. The cumulative effect of these extra 'e's means that tokens towards the end of large documents can have their index in spaCy shifted quite dramatically from the original one supplied.
Currently we can bypass this issue by supplying a special_case to the tokenizer that maps 'al' to itself. This however ignores the fact that normally this gets tokenized into two tokens, 'a' and 'el'.
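To illustrate the kind of bookkeeping this feature request asks for (this is a sketch, not part of spaCy's API): if the pipeline recorded each substitution as a (start, original, replacement) tuple, a small helper could translate character indices in the mutated text back to the raw input. The function name and substitution format here are hypothetical.

```python
def map_to_raw(idx, substitutions):
    """Translate a character index in the mutated text back to the raw text.

    substitutions: list of (start, original, replacement) tuples, where
    `start` is the character offset of the substitution in the *raw* text,
    sorted by `start`. E.g. an 'al' -> 'a el' expansion at offset 4 is
    recorded as (4, 'al', 'a el').
    """
    shift = 0  # cumulative length difference introduced so far
    for start, original, replacement in substitutions:
        mutated_start = start + shift  # where the substitution sits in the mutated text
        if idx < mutated_start:
            break  # index precedes this (and all later) substitutions
        if idx < mutated_start + len(replacement):
            # index falls inside inserted material; clamp to the original span
            return start
        shift += len(replacement) - len(original)
    return idx - shift

raw = "Fui al mercado"          # raw input
mutated = "Fui a el mercado"    # after a hypothetical 'al' -> 'a el' expansion
subs = [(4, "al", "a el")]
print(map_to_raw(9, subs))      # 'mercado' starts at 9 in mutated, 7 in raw
```

With a record like this, token.idx values taken from the mutated text could be translated back to offsets in the caller's original string.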
If I'm understanding correctly, that's definitely a bug. The following should be true for any unicode string:
text == nlp(text).text
Any case that breaks this invariant is a bug.
On the current v2 the problem seems to be solved:
>>> import spacy
>>> nlp = spacy.blank('es')
>>>nlp(u'al')
al
The v2 model also performs quite a bit better on Spanish than the v1 model. You can get it with pip install spacy-nightly. Docs are available at https://alpha.spacy.io
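For reference, the special-case workaround mentioned in the issue can be checked against the round-trip invariant using a blank pipeline (a sketch using spaCy's public tokenizer API; `spacy.blank('es')` needs no model download, and the example sentence is arbitrary):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("es")  # bare Spanish tokenizer, no statistical model

# Workaround: register 'al' as a single token so the tokenizer
# never rewrites or splits it.
nlp.tokenizer.add_special_case("al", [{ORTH: "al"}])

text = "Fui al mercado"
doc = nlp(text)
assert text == doc.text  # round-trip invariant: tokenization must not mutate text
print([t.text for t in doc])
```

The invariant assertion is exactly the `text == nlp(text).text` check from the reply above; any pipeline that fails it is exhibiting the bug.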