v2.0.0 alpha: Neural network models, Pickle, better training & lots of API improvements
Pre-release · Last update: 2.0.0rc2, 2017-11-07
This is an alpha pre-release of spaCy v2.0.0, available on pip as `spacy-nightly`. It's not intended for production use. The alpha documentation is available at alpha.spacy.io. Please note that the docs reflect the library's intended state on release, not the current state of the implementation. For bug reports, feedback and questions, see the spaCy v2.0.0 alpha thread.
Before installing v2.0.0 alpha, we recommend setting up a clean environment.
```bash
pip install spacy-nightly
```
The models are still under development and will keep improving. For more details, see the benchmarks below. There will also be additional models for German, French and Spanish.
Name | Lang | Capabilities | Size | spaCy | Info |
---|---|---|---|---|---|
`en_core_web_sm-2.0.0a4` | en | Parser, Tagger, NER | 42MB | >=2.0.0a14 | ℹ️ |
`en_vectors_web_lg-2.0.0a0` | en | Vectors (GloVe) | 627MB | >=2.0.0a10 | ℹ️ |
`xx_ent_wiki_sm-2.0.0a0` | multi | NER | 12MB | <=2.0.0a9 | ℹ️ |
You can download a model by using its name or shortcut. To load a model, use spaCy's loader, e.g. `nlp = spacy.load('en_core_web_sm')`, or import it as a module (`import en_core_web_sm`) and call its `load()` method, e.g. `nlp = en_core_web_sm.load()`.
```bash
python -m spacy download en_core_web_sm
```
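Both loading styles produce the same pipeline. A minimal sketch, assuming the `en_core_web_sm` model above is installed:
```python
import spacy

# load via spaCy's loader, using the package name or shortcut link
nlp = spacy.load('en_core_web_sm')

# or import the installed model package directly and call its load() method
import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp(u'This is a sentence.')
```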
📈 Benchmarks
The evaluation was conducted on raw text with no gold standard information. Speed and accuracy are currently comparable to the v1.x models: speed on CPU is slightly lower, while accuracy is slightly higher. We expect performance to improve quickly between now and the release date, as we run more experiments and optimise the implementation.
Model | spaCy | Type | UAS | LAS | NER F | POS | Words/s |
---|---|---|---|---|---|---|---|
`en_core_web_sm-2.0.0a4` | v2.x | neural | 91.9 | 90.0 | 85.0 | 97.1 | 10,000 |
`en_core_web_sm-2.0.0a3` | v2.x | neural | 91.2 | 89.2 | 85.3 | 96.9 | 10,000 |
`en_core_web_sm-2.0.0a2` | v2.x | neural | 91.5 | 89.5 | 84.7 | 96.9 | 10,000 |
`en_core_web_sm-1.1.0` | v1.x | linear | 86.6 | 83.8 | 78.5 | 96.6 | 25,700 |
`en_core_web_md-1.2.1` | v1.x | linear | 90.6 | 88.5 | 81.4 | 96.7 | 18,800 |
✨ Major features and improvements
- NEW: Neural network model for English (comparable performance to the >1GB v1.x models) and multi-language NER (still experimental).
- NEW: GPU support via Chainer's CuPy module.
- NEW: Strings are now resolved to hash values, instead of mapped to integer IDs. This means that the string-to-int mapping no longer depends on the vocabulary state.
- NEW: Trainable document vectors and contextual similarity via convolutional neural networks.
- NEW: Built-in text classification component.
- NEW: Built-in displaCy visualizers with Jupyter notebook support.
- NEW: Alpha tokenization for Danish, Polish and Indonesian.
- Improved language data, support for lazy loading and simple, lookup-based lemmatization for English, German, French, Spanish, Italian, Hungarian, Portuguese and Swedish.
- Improved language processing pipelines and support for custom, model-specific components.
- Improved and consistent saving, loading and serialization across objects, plus Pickle support.
- Revised matcher API to make it easier to add and manage patterns and callbacks in one step (see the sketch after this list).
- Support for multi-language models and new `MultiLanguage` class (`xx`).
- Entry point for `spacy` command to use instead of `python -m spacy`.
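To illustrate the revised matcher API mentioned above, here's a minimal sketch; the `'HelloWorld'` pattern is a made-up example, not something shipped with this release:
```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
# a hypothetical pattern: the two case-insensitive tokens "hello world"
pattern = [{'LOWER': 'hello'}, {'LOWER': 'world'}]
# patterns and an optional on_match callback (None here) are added in one step
matcher.add('HelloWorld', None, pattern)

doc = nlp(u'Hello world!')
matches = matcher(doc)  # list of (match_id, start, end) tuples
```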
🚧 Work in progress (not yet implemented)
- NEW: Neural network models for German, French and Spanish.
- NEW: `Binder`, a container class for serializing collections of `Doc` objects.
🔴 Bug fixes
- Fix issue #125, #228, #299, #377, #460, #606, #930: Add full Pickle support.
- Fix issue #152, #264, #322, #343, #437, #514, #636, #785, #927, #985, #992, #1011: Fix and improve serialization and deserialization of `Doc` objects.
- Fix issue #512: Improve parser to prevent it from returning two `ROOT` objects.
- Fix issue #524: Improve parser and handling of noun chunks.
- Fix issue #621: Prevent double spaces from changing the parser result.
- Fix issue #664, #999, #1026: Fix bugs that would prevent loading trained NER models.
- Fix issue #671, #809, #856: Fix importing and loading of word vectors.
- Fix issue #753: Resolve bug that would tag OOV items as personal pronouns.
- Fix issue #905, #1021, #1042: Improve parsing model and allow faster accuracy updates.
- Fix issue #995: Improve punctuation rules for Hebrew and other non-Latin languages.
- Fix issue #1008: `train` command finally works correctly if used without `dev_data`.
- Fix issue #1012: Improve documentation on model saving and loading.
- Fix issue #1043: Improve NER models and allow faster accuracy updates.
- Fix issue #1051: Improve error messages if functionality needs a model to be installed.
- Fix issue #1071: Correct typo of "whereve" in English tokenizer exceptions.
- Fix issue #1088: Emoji are now split into separate tokens wherever possible.
📖 Documentation and examples
- NEW: spacy 101 guide with simple explanations and illustrations of the most important concepts and an overview of spaCy's features and capabilities.
- NEW: Visualizing spaCy guide on how to use the built-in `displacy` module (see the sketch after this list).
- NEW: API docs for top-level functions, `spacy.displacy`, `spacy.util` and `spacy.gold.GoldCorpus`.
- NEW: Full code example for text classification (sentiment analysis).
- Improved rule-based matching guide with examples for matching entities and phone numbers, and social media analysis.
- Improved processing pipelines guide with examples for custom sentence segmentation logic and hooking in a sentiment analysis model.
- Re-wrote all API and usage docs and added more examples.
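To give a flavour of the built-in visualizers, here's a minimal sketch of the `displacy` module; the example text and the loaded `nlp` object are assumptions:
```python
from spacy import displacy

doc = nlp(u'Apple is looking at buying a U.K. startup.')

# serve the dependency parse visualization in the browser
displacy.serve(doc, style='dep')

# or render named entities inline in a Jupyter notebook
# displacy.render(doc, style='ent', jupyter=True)
```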
🚧 Work in progress (not yet implemented)
- NEW: Usage guide on scaling spaCy for production.
- NEW: Usage guide on text classification.
- NEW: API docs for `spacy.pipeline.TextCategorizer`, `spacy.pipeline.Tensorizer`, `spacy.tokens.binder.Binder` and `spacy.vectors.Vectors`.
- Improved training, NER training and deep learning usage guides with examples.
⚠️ Backwards incompatibilities
Note that the old v1.x models are not compatible with spaCy v2.0.0. If you've trained your own models, you'll have to re-train them to be able to use them with the new version. For a full overview of changes in v2.0, see the alpha documentation and guide on migrating from spaCy 1.x.
Loading models
`spacy.load()` is now only intended for loading models – if you need an empty language class, import it directly instead, e.g. `from spacy.lang.en import English`. If the model you're loading is a shortcut link or package name, spaCy will expect it to be a model package, import it and call its `load()` method. If you supply a path, spaCy will expect it to be a model data directory and use the `meta.json` to initialise a language class and call `nlp.from_disk()` with the data path.
```python
nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')
nlp = spacy.load('/model-data')
nlp = English().from_disk('/model-data')
# OLD: nlp = spacy.load('en', path='/model-data')
```
Hash values instead of integer IDs
The `StringStore` now resolves all strings to hash values instead of integer IDs. This means that the string-to-int mapping no longer depends on the vocabulary state, making a lot of workflows much simpler, especially during training. However, you still need to make sure all objects have access to the same `Vocab`. Otherwise, spaCy won't be able to resolve hashes back to their string values.
```python
nlp.vocab.strings[u'coffee']        # 3197928453018144401
other_nlp.vocab.strings[u'coffee']  # 3197928453018144401
```
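The lookup also works in reverse, which is why all objects need access to the same `Vocab`. A minimal sketch, assuming an `nlp` object is already loaded:
```python
doc = nlp(u'I love coffee')
coffee_hash = nlp.vocab.strings[u'coffee']    # 3197928453018144401
coffee_text = nlp.vocab.strings[coffee_hash]  # u'coffee'
```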
Serialization
spaCy's serialization API is now consistent across objects. All containers and pipeline components have `.to_disk()`, `.from_disk()`, `.to_bytes()` and `.from_bytes()` methods.
```python
nlp.to_disk('/model')
nlp.vocab.to_disk('/vocab')
# OLD: nlp.save_to_directory('/model')
```
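The byte variants follow the same pattern. A minimal sketch, assuming an existing `nlp` object and a `doc` created from it:
```python
from spacy.tokens import Doc

# serialize a Doc to a byte string and restore it with the same Vocab
doc_bytes = doc.to_bytes()
new_doc = Doc(nlp.vocab).from_bytes(doc_bytes)
```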
Processing pipelines
Models can now define their own processing pipelines as a list of strings, mapping to component names. Components receive a `Doc`, modify it and return it to be processed by the next component in the pipeline. You can add custom components to `nlp.pipeline`, and disable components by adding their name to the `disable` keyword argument. The tokenizer can simply be overwritten with a custom function.
```python
nlp = spacy.load('en', disable=['tagger', 'ner'])
nlp.tokenizer = my_custom_tokenizer
nlp.pipeline.append(my_custom_component)
doc = nlp(u"I don't want parsed", disable=['parser'])
```
Comparison table
For the complete table and more details, see the alpha guide on what's new in v2.0.
Old | New | Notes |
---|---|---|
`spacy.en`, `spacy.de`, ... | `spacy.lang.en`, ... | Language data moved to `lang`. |
`.save_to_directory`, `.dump`, `.dump_vectors` | `.to_disk`, `.to_bytes` | Consistent serialization. |
`.load`, `.load_lexemes`, `.load_vectors`, `.load_vectors_from_bin_loc` | `.from_disk`, `.from_bytes` | Consistent serialization. |
`Language.create_make_doc` | `Language.tokenizer` | Tokenizer can now be replaced via `nlp.tokenizer`. |
`Matcher.add_pattern`, `Matcher.add_entity` | `Matcher.add` | Simplified API. |
`Matcher.get_entity`, `Matcher.has_entity` | `Matcher.get`, `Matcher.__contains__` | Simplified API. |
`Doc.read_bytes` | `Binder` | Consistent API. |
`Token.is_ancestor_of` | `Token.is_ancestor` | Duplicate method. |
👥 Contributors
This release is brought to you by @honnibal and @ines. Thanks to @Gregory-Howard, @luvogels, @ferdous-al-imran, @uetchy, @akYoung, @kengz, @raphael0202, @ardeego, @yuvalpinter, @dvsrepo, @frascuchon, @oroszgy, @v3t3a, @Tpt, @thinline72, @jarle, @jimregan, @nkruglikov, @delirious-lettuce and @geovedi for the pull requests and contributions. Also thanks to everyone who submitted bug reports and took the spaCy user survey – your feedback made a big difference!