Add spaCy analyzer #527

Merged: 13 commits merged on Jan 24, 2022

Conversation

osma
Member

@osma osma commented Sep 1, 2021

Initial draft PR of new spaCy based (optional) analyzer.
Fixes #374

TODO items:

  • Option to force lowercasing of lemmas
  • Test with Swedish (which doesn't have a complete pretrained model) and adapt the code as necessary (out of scope for now)
  • Adapt YAKE backend so it doesn't do word-by-word normalization
  • Add English simple model to Docker image
  • Add wiki documentation: what the analyzer is, how to install, how to add more languages (native install & Docker case)
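
To make the idea of a lemmatizing analyzer with optional lowercasing concrete, here is a minimal sketch. It is not the code in this PR: the names `LemmaAnalyzer` and `load_spacy_pipeline` are hypothetical, and excluding the `ner` and `parser` components is an assumption about what is unnecessary for lemmatization. It also assumes the named spaCy model is already installed.

```python
from typing import Callable, List


def load_spacy_pipeline(model_name: str):
    """Load a spaCy pipeline (hypothetical helper, not part of Annif)."""
    import spacy  # imported lazily because spaCy is an optional dependency

    # Assumption: named entities and dependency parses are not needed here.
    return spacy.load(model_name, exclude=["ner", "parser"])


class LemmaAnalyzer:
    """Hypothetical analyzer that turns a text into a list of lemmas."""

    def __init__(self, nlp: Callable, lowercase: bool = False):
        # nlp is any callable that returns tokens with a .lemma_ attribute
        self.nlp = nlp
        self.lowercase = lowercase

    def tokenize_words(self, text: str) -> List[str]:
        lemmas = [token.lemma_ for token in self.nlp(text.strip())]
        # Drop tokens with no alphanumeric character (punctuation, whitespace)
        lemmas = [lemma for lemma in lemmas if any(ch.isalnum() for ch in lemma)]
        if self.lowercase:
            lemmas = [lemma.lower() for lemma in lemmas]
        return lemmas
```

With a model installed, something like `LemmaAnalyzer(load_spacy_pipeline("en_core_web_sm"), lowercase=True).tokenize_words(text)` would return lowercased lemmas for `text`.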

@osma osma self-assigned this Sep 1, 2021
@codecov

codecov bot commented Sep 1, 2021

Codecov Report

Merging #527 (8ad7445) into master (ccb9982) will increase coverage by 0.01%.
The diff coverage is 98.55%.

@@            Coverage Diff             @@
##           master     #527      +/-   ##
==========================================
+ Coverage   99.49%   99.50%   +0.01%     
==========================================
  Files          80       82       +2     
  Lines        5340     5458     +118     
==========================================
+ Hits         5313     5431     +118     
  Misses         27       27              
| Impacted Files | Coverage Δ |
|---|---|
| annif/analyzer/voikko.py | 94.73% <0.00%> (ø) |
| annif/analyzer/__init__.py | 100.00% <100.00%> (ø) |
| annif/analyzer/analyzer.py | 100.00% <100.00%> (ø) |
| annif/analyzer/simple.py | 100.00% <100.00%> (ø) |
| annif/analyzer/snowball.py | 100.00% <100.00%> (ø) |
| annif/analyzer/spacy.py | 100.00% <100.00%> (ø) |
| annif/backend/yake.py | 98.23% <100.00%> (-0.05%) ⬇️ |
| tests/test_analyzer.py | 100.00% <100.00%> (ø) |
| tests/test_analyzer_spacy.py | 100.00% <100.00%> (ø) |
| tests/test_analyzer_voikko.py | 100.00% <100.00%> (ø) |

... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@osma
Member Author

osma commented Sep 1, 2021

It's working now (at least for the tfidf backend) but pretty slow - at least an order of magnitude slower than the Snowball analyzer. I think some batching must be used to make it more efficient, but that requires changes to the Analyzer API as well as to individual backends.
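
As an illustration of the batching idea, spaCy's `Language.pipe()` processes a stream of texts in batches instead of one call per text. The sketch below is only an assumption about how batching could look; the function name `lemmatize_batch` and the default batch size are hypothetical, and it does not show the Analyzer API changes that would be needed in Annif.

```python
from typing import Iterable, Iterator, List


def lemmatize_batch(nlp, texts: Iterable[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield one list of lemmas per input text, processing texts in batches."""
    # nlp.pipe() streams the texts through the pipeline batch by batch,
    # which is usually faster than calling nlp(text) once per text.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield [token.lemma_ for token in doc]
```

The catch is that callers have to hand over many texts at once, which is why batching would affect both the analyzer interface and the backends that call it.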

@sonarcloud

sonarcloud bot commented Sep 3, 2021

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 0 Code Smells (rating A)
  • No Coverage information
  • 0.0% Duplication

@osma
Member Author

osma commented Nov 22, 2021

Rebased and force-pushed.

There is a new release of spaCy (3.2.0) available; I should test with that.

@sonarcloud

sonarcloud bot commented Nov 22, 2021

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 0 Code Smells (rating A)
  • No Coverage information
  • 0.0% Duplication

@osma
Member Author

osma commented Jan 18, 2022

Rebased on current master, fixed conflicts and force-pushed.

@osma osma marked this pull request as ready for review January 19, 2022 09:03
@osma osma requested a review from juhoinkinen January 19, 2022 09:03
@osma
Member Author

osma commented Jan 19, 2022

Ready for review!

Things I'm a bit unsure about:

  • Does this actually work with all relevant backends? Especially YAKE, where I adjusted the way the backend calls the analyzer.
  • Are the Dockerfile changes sensible? I.e. adding spacy as a default extension and downloading spacy models based on a build argument.

@juhoinkinen
Member

> Things I'm a bit unsure about:
>
> * Does this actually work with all relevant backends? Especially YAKE, where I adjusted the way the backend calls the analyzer.
> * Are the Dockerfile changes sensible? I.e. adding spacy as a default extension and downloading spacy models based on a build argument.

Works with YAKE, and the spaCy analyzer also somewhat improved evaluation results compared to the Snowball analyzer (on the JYU test set, F1@5 0.1706 -> 0.1870).

Dockerfile looks good to me. (The three retries for timeouts when downloading NLTK data were originally added for builds in Drone, as there were some network problems in Drone at that time, but I think the situation has improved now.)

Just one point to consider: if the spaCy model has not been downloaded, a rather lengthy traceback is shown, ending with "OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory." The ending is quite self-explanatory, so no big deal.
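
One possible way to shorten that error, sketched under assumptions: catch the `OSError` raised by `spacy.load` and re-raise a short message that names the download command. The function name `load_model_or_explain` and its `loader` parameter are hypothetical (the parameter exists only so the function can be exercised without spaCy installed), and this is not necessarily how the PR ended up handling it.

```python
def load_model_or_explain(model_name, loader=None):
    """Load a spaCy model, turning a missing-model error into a short hint."""
    if loader is None:
        import spacy  # imported lazily because spaCy is an optional dependency

        loader = spacy.load
    try:
        return loader(model_name)
    except OSError as err:
        # spaCy raises OSError (E050) when the model cannot be found
        raise RuntimeError(
            f"spaCy model '{model_name}' is not installed; "
            f"try: python -m spacy download {model_name}"
        ) from err
```

A command-line entry point could then catch the `RuntimeError` and print only its message instead of the full traceback.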

@osma
Member Author

osma commented Jan 19, 2022

I compared this to Snowball using the Annif-tutorial yso-nlf data sets and the three backend configurations (tfidf, mllm, omikuji-parabel) used in the tutorial. Results (best score for each backend type highlighted):

| backend | train set | analyzer | train jobs | train time | train RSS | eval time | eval RSS | F1@5 | NDCG |
|---|---|---|---|---|---|---|---|---|---|
| tfidf | yso-finna | snowball | 1 | 384.65 | 1528848 | 166.86 | 590304 | **0.1812** | **0.2551** |
| tfidf | yso-finna | spacy | 1 | 4,676.27 | 1614240 | 1,171.48 | 3580224 | 0.1749 | 0.2476 |
| tfidf | yso-finna | spacy-lower | 1 | 4,692.22 | 1596052 | 1,184.68 | 3570112 | 0.1749 | 0.2476 |
| mllm | jyu (n=400) | snowball | 4 | 771.82 | 2082856 | 414.50 | 572140 | **0.2933** | **0.4043** |
| mllm | jyu (n=400) | spacy | 4 | 3,755.26 | 2124704 | 2,321.21 | 677976 | 0.2871 | 0.3966 |
| mllm | jyu (n=400) | spacy-lower | 4 | 3,724.93 | 2125776 | 2,234.06 | 687532 | 0.2843 | 0.3868 |
| omikuji | yso-finna | snowball | 8 | 2,874.95 | 5285764 | 166.33 | 1561728 | 0.3078 | 0.4253 |
| omikuji | yso-finna | spacy | 8 | 7,045.99 | 5513900 | 1,169.09 | 4713760 | 0.3148 | 0.4312 |
| omikuji | yso-finna | spacy-lower | 8 | 7,062.35 | 5830028 | 1,129.53 | 4708832 | **0.3290** | **0.4451** |

Observations:

  • spaCy is much much slower than Snowball (not entirely unexpected as it is a lot more sophisticated)
  • train memory usage was a bit higher for spaCy (vs Snowball) but the differences were quite small
  • eval memory usage of spaCy was much higher (compared to Snowball) for tfidf and omikuji, but not much higher for MLLM - not sure why but maybe it will become clear after some thinking
  • the eval results for tfidf and mllm were worse than Snowball, but with Omikuji there was a small improvement (and the lowercased version was better than plain)

The point of these experiments was to check that the analyzer works reasonably well with those backends, not that the results are necessarily better in terms of F1 scores etc. spaCy has other advantages, especially the many languages it supports.
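
To put a number on "much slower", the slowdown factors can be computed directly from the train and eval times in the table above. This is a small worked calculation using the plain spacy rows only; the names `TIMINGS` and `slowdown` are just for illustration.

```python
# Timings copied from the table above: (train time, eval time) per analyzer.
TIMINGS = {
    "tfidf": {"snowball": (384.65, 166.86), "spacy": (4676.27, 1171.48)},
    "mllm": {"snowball": (771.82, 414.50), "spacy": (3755.26, 2321.21)},
    "omikuji": {"snowball": (2874.95, 166.33), "spacy": (7045.99, 1169.09)},
}


def slowdown(backend):
    """Return (train ratio, eval ratio) of spaCy time over Snowball time."""
    base_train, base_eval = TIMINGS[backend]["snowball"]
    spacy_train, spacy_eval = TIMINGS[backend]["spacy"]
    return spacy_train / base_train, spacy_eval / base_eval


for backend in TIMINGS:
    train_ratio, eval_ratio = slowdown(backend)
    print(f"{backend}: train x{train_ratio:.1f}, eval x{eval_ratio:.1f}")
```

Training comes out roughly 12x slower for tfidf, 5x for mllm and 2.5x for omikuji, while evaluation is between about 5.6x and 7x slower across the three backends.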

@osma
Member Author

osma commented Jan 19, 2022

I also tested svc and fasttext backends using the 20news data set in Annif-corpora. Results:

| backend | analyzer | train jobs | train time | train RSS | eval set | eval time | eval RSS | P@1 | NDCG |
|---|---|---|---|---|---|---|---|---|---|
| svc | snowball | not set | 30.14 | 557936 | 20news-test | 40.73 | 458092 | 0.6819 | 0.7999 |
| svc | spacy | not set | 203.95 | 795692 | 20news-test | 156.83 | 706480 | 0.6840 | 0.8015 |
| svc | spacy-lower | not set | 203.23 | 789280 | 20news-test | 155.10 | 698340 | 0.6840 | 0.8015 |
| fasttext | snowball | not set | 91.96 | 631684 | 20news-test | 32.81 | 477348 | 0.4859 | 0.6924 |
| fasttext | spacy | not set | 259.54 | 769736 | 20news-test | 157.06 | 501760 | 0.4792 | 0.6862 |
| fasttext | spacy-lower | not set | 251.19 | 775568 | 20news-test | 157.20 | 497220 | 0.4846 | 0.6891 |

Observations:

  • spaCy was again much slower than Snowball
  • memory usage wasn't dramatically higher
  • results improved a bit for SVC, but were worse for fastText

I'd say this is good enough. I will check a few final things (including the error shown when a model doesn't exist, thanks @juhoinkinen!) and then merge this.

@sonarcloud

sonarcloud bot commented Jan 20, 2022

Kudos, SonarCloud Quality Gate passed!

  • 0 Bugs (rating A)
  • 0 Vulnerabilities (rating A)
  • 0 Security Hotspots (rating A)
  • 1 Code Smell (rating A)
  • No Coverage information
  • 0.0% Duplication

@juhoinkinen juhoinkinen added this to the 0.56 milestone Jan 21, 2022
@osma osma merged commit 1b47624 into master Jan 24, 2022
@osma osma deleted the issue374-spacy-analyzer branch January 24, 2022 08:05
@osma osma mentioned this pull request Feb 4, 2022
Development

Successfully merging this pull request may close these issues.

spaCy analyzer
2 participants