Add nn_ensemble backend #331

Merged · 31 commits · Oct 28, 2019
Changes from 27 commits
0cf82c5
First rough version of neural network ensemble backend
osma Sep 23, 2019
9b71ba7
Merge branch 'master' into issue330-nn-ensemble-backend
osma Sep 23, 2019
f0180a5
clean up imports
osma Sep 23, 2019
2a422b1
Install Keras and TensorFlow under Travis CI
osma Sep 23, 2019
069dab6
Add basic unit tests for nn_ensemble backend
osma Sep 23, 2019
fbf3419
Add more nn_ensemble unit tests + fix bug in suggest method
osma Sep 23, 2019
2691716
remove print statements
osma Sep 23, 2019
036027f
Merge branch 'master' into issue330-nn-ensemble-backend
osma Sep 24, 2019
10b0b00
Merge branch 'master' into issue330-nn-ensemble-backend
osma Sep 25, 2019
6ea609f
Specify explicit Keras and tensorflow versions. Use TF 1.15 RC to avo…
osma Sep 30, 2019
4d3a83d
fix setup.py syntax
osma Sep 30, 2019
1d08613
upgrade pip under Travis CI before installing anything (tensorflow ne…
osma Sep 30, 2019
c88c218
make Keras and tensorflow core dependencies, not optional; pin numpy …
osma Sep 30, 2019
1b94abe
upgrade pip on scrutinizer (tensorflow needs pip 19.*)
osma Sep 30, 2019
a789f46
nn_ensemble is now a core backend, remove conditional imports
osma Sep 30, 2019
94992b4
Merge branch 'master' into issue330-nn-ensemble-backend
osma Sep 30, 2019
bcb779b
Make hyperparameters configurable in nn_ensemble (with defaults)
osma Sep 30, 2019
9b7148b
fix syntax, pep8 and tests (doh)
osma Sep 30, 2019
3d6aff2
Merge branch 'master' into issue330-nn-ensemble-backend
osma Oct 4, 2019
5b90d25
Turn nn_ensemble into an optional feature again. I've had some trouble
osma Oct 4, 2019
f12929a
Adjust Pipfile, setup.py and .travis.yml to make nn feature optional
osma Oct 4, 2019
40c8af3
fix syntax (oops)
osma Oct 4, 2019
7c38754
Upgrade to TensorFlow 2.0
osma Oct 4, 2019
db8f893
Merge branch 'master' into issue330-nn-ensemble-backend
osma Oct 7, 2019
50e8202
more elegant handling of file name prefixes in annif.util.atomic_save
osma Oct 7, 2019
3c081f8
Refactor: Split learn method in nn_ensemble backend
osma Oct 7, 2019
8331a16
Avoid testing nn features on Python 3.6, to increase overall test cov…
osma Oct 7, 2019
bd6649b
set explicit dtype=float32 for numpy arrays to avoid wasting memory
osma Oct 9, 2019
b9093e7
Merge branch 'master' into issue330-nn-ensemble-backend
osma Oct 28, 2019
f0014df
Up the default to 100 nodes since it may produce better results
osma Oct 28, 2019
ce49174
Install tensorflow in Docker image
juhoinkinen Oct 28, 2019
2 changes: 1 addition & 1 deletion .scrutinizer.yml
@@ -8,7 +8,7 @@ build:
dependencies:
override:
- pip install pipenv
- pipenv run pip install pip==18.0
- pipenv run pip install pip==19.*
- pipenv install --dev --skip-lock
tests:
override:
4 changes: 4 additions & 0 deletions .travis.yml
@@ -20,10 +20,14 @@ cache: pip
before_install:
- export BOTO_CONFIG=/dev/null
install:
- pip install --upgrade pip
- pip install pipenv
- pip install --upgrade pytest
- pipenv install --dev --skip-lock
- travis_wait 30 python -m nltk.downloader punkt
# Install the optional neural network dependencies (Keras and TensorFlow)
# - except for one Python version (3.6) so that we can test also without them
- if [[ $TRAVIS_PYTHON_VERSION != '3.6' ]]; then pip install .[nn]; fi
# For Python 3.5, also install optional dependencies that were not specified in Pipfile
# For other Python versions we will only run the tests that depend on pure Python modules
# - fastText dependencies
3 changes: 2 additions & 1 deletion Pipfile
@@ -15,7 +15,7 @@ sphinx = "*"
sphinx-rtd-theme = "*"

[packages]
"e1839a8" = {editable = true, path = "."}
"e1839a8" = {path = ".", editable = true}
connexion = {extras = ["swagger-ui"]}
swagger-ui-bundle = "*"
flask-cors = "*"
@@ -27,5 +27,6 @@ scikit-learn = "==0.21.*"
rdflib = "*"
gunicorn = "*"
sphinxcontrib-apidoc = "*"
numpy = "==1.17.*"

[requires]
7 changes: 7 additions & 0 deletions annif/backend/__init__.py
@@ -43,3 +43,10 @@ def get_backend(backend_id):
except ImportError:
annif.logger.debug("vowpalwabbit not available, not enabling " +
"vw_multi & vw_ensemble backends")

try:
from . import nn_ensemble
register_backend(nn_ensemble.NNEnsembleBackend)
except ImportError:
annif.logger.debug("Keras and TensorFlow not available, not enabling " +
"nn_ensemble backend")
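
With the conditional registration above, enabling the backend is a matter of project configuration. A hedged sketch of a projects.cfg entry — the section name, source project ids and weights are purely illustrative, while the tunable keys (nodes, dropout_rate, optimizer, epochs) come from DEFAULT_PARAMS in nn_ensemble.py and "sources" from the ensemble base class:

```ini
; Hypothetical projects.cfg section; the hyperparameter lines override
; the built-in defaults and can be omitted.
[nn-ensemble-en]
name=NN ensemble (English)
language=en
backend=nn_ensemble
sources=tfidf-en:1,fasttext-en:1
nodes=100
dropout_rate=0.2
optimizer=adam
epochs=10
```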
126 changes: 126 additions & 0 deletions annif/backend/nn_ensemble.py
@@ -0,0 +1,126 @@
"""Neural network based ensemble backend that combines results from multiple
projects."""


import os.path
import numpy as np
from tensorflow.keras.layers import Input, Dense, Add, Flatten, Lambda, Dropout
from tensorflow.keras.models import Model, load_model
import tensorflow.keras.backend as K
import annif.corpus
import annif.project
import annif.util
from annif.exception import NotInitializedException
from annif.suggestion import VectorSuggestionResult
from . import ensemble


class NNEnsembleBackend(ensemble.EnsembleBackend):
"""Neural network ensemble backend that combines results from multiple
projects"""

name = "nn_ensemble"

MODEL_FILE = "nn-model.h5"

DEFAULT_PARAMS = {
'nodes': 60,
'dropout_rate': 0.2,
'optimizer': 'adam',
'epochs': 10,
}

# defaults for uninitialized instances
_model = None

def default_params(self):
params = {}
params.update(super().default_params())
params.update(self.DEFAULT_PARAMS)
return params

def initialize(self):
if self._model is not None:
return # already initialized
model_filename = os.path.join(self.datadir, self.MODEL_FILE)
if not os.path.exists(model_filename):
raise NotInitializedException(
'model file {} not found'.format(model_filename),
backend_id=self.backend_id)
self.debug('loading Keras model from {}'.format(model_filename))
self._model = load_model(model_filename)

def _merge_hits_from_sources(self, hits_from_sources, project, params):
score_vector = np.array([hits.vector * weight
for hits, weight in hits_from_sources])
results = self._model.predict(
np.expand_dims(score_vector.transpose(), 0))
return VectorSuggestionResult(results[0], project.subjects)

def _create_model(self, sources, project):
Review comment (osma, Member, Author): Any comments @juhoinkinen or @mvsjober on the Keras model defined here? Would you do something differently?

Reply (Member): Looks good

self.info("creating NN ensemble model")

inputs = Input(shape=(len(project.subjects), len(sources)))

flat_input = Flatten()(inputs)
drop_input = Dropout(
rate=float(
self.params['dropout_rate']))(flat_input)
hidden = Dense(int(self.params['nodes']),
activation="relu")(drop_input)
drop_hidden = Dropout(rate=float(self.params['dropout_rate']))(hidden)
delta = Dense(len(project.subjects),
kernel_initializer='zeros',
bias_initializer='zeros')(drop_hidden)

mean = Lambda(lambda x: K.mean(x, axis=2))(inputs)

predictions = Add()([mean, delta])

self._model = Model(inputs=inputs, outputs=predictions)
self._model.compile(optimizer=self.params['optimizer'],
loss='binary_crossentropy',
metrics=['top_k_categorical_accuracy'])

summary = []
self._model.summary(print_fn=summary.append)
self.debug("Created model: \n" + "\n".join(summary))

def train(self, corpus, project):
sources = annif.util.parse_sources(self.params['sources'])
self._create_model(sources, project)
self.learn(corpus, project)

def _corpus_to_vectors(self, corpus, project):
# pass corpus through all source projects
sources = [(annif.project.get_project(project_id), weight)
for project_id, weight
in annif.util.parse_sources(self.params['sources'])]

score_vectors = []
true_vectors = []
for doc in corpus.documents:
doc_scores = []
for source_project, weight in sources:
hits = source_project.suggest(doc.text)
doc_scores.append(hits.vector * weight)
score_vectors.append(np.array(doc_scores).transpose())
subjects = annif.corpus.SubjectSet((doc.uris, doc.labels))
true_vectors.append(subjects.as_vector(project.subjects))
# collect the results into a single vector, considering weights
scores = np.array(score_vectors)
# collect the gold standard values into another vector
true = np.array(true_vectors)
return (scores, true)

def learn(self, corpus, project):
scores, true = self._corpus_to_vectors(corpus, project)

# fit the model
self._model.fit(scores, true, batch_size=32, verbose=True,
epochs=int(self.params['epochs']))

annif.util.atomic_save(
self._model,
self.datadir,
self.MODEL_FILE)
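
As a sanity check on the shapes involved: `_corpus_to_vectors` stacks one weighted score vector per source project and transposes it to (subjects, sources), which is what the `Input` layer expects, and since the Dense "delta" branch is zero-initialized, an untrained model's Lambda/Add combination reduces to the per-subject mean over sources. A minimal numpy sketch — the scores and weights are made-up illustrations, not values from the PR:

```python
import numpy as np

# Hypothetical scores from two source projects over a 4-subject vocabulary.
# Each row mirrors `hits.vector * weight` in _corpus_to_vectors.
doc_scores = [
    np.array([0.9, 0.1, 0.0, 0.3]) * 1.0,   # source project A, weight 1.0
    np.array([0.7, 0.2, 0.1, 0.5]) * 0.5,   # source project B, weight 0.5
]

# Stack to (sources, subjects), then transpose to (subjects, sources),
# matching Input(shape=(len(project.subjects), len(sources))).
score_vector = np.array(doc_scores, dtype=np.float32).transpose()
assert score_vector.shape == (4, 2)

# _merge_hits_from_sources adds a batch dimension before predicting;
# the Lambda layer then takes the mean over the sources axis (axis=2).
batch = np.expand_dims(score_vector, 0)      # shape (1, 4, 2)
mean = batch.mean(axis=2)                    # shape (1, 4)
print(mean[0])
```

The zero-initialized delta branch is a safe starting point: before any training the ensemble behaves exactly like a plain averaging ensemble, and training only learns corrections to that baseline.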
5 changes: 4 additions & 1 deletion annif/util.py
@@ -2,6 +2,7 @@

import glob
import os
import os.path
import tempfile
import numpy as np
from annif import logger
@@ -14,7 +15,9 @@ def atomic_save(obj, dirname, filename, method=None):
filename, using a temporary file and renaming the temporary file to the
final name."""

tempfd, tempfilename = tempfile.mkstemp(prefix=filename, dir=dirname)
prefix, suffix = os.path.splitext(filename)
tempfd, tempfilename = tempfile.mkstemp(
prefix=prefix, suffix=suffix, dir=dirname)
os.close(tempfd)
logger.debug('saving %s to temporary file %s', str(obj), tempfilename)
if method is not None:
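
The splitext change above matters because the Keras model file carries an .h5 extension: with the old code, mkstemp appended its random characters after ".h5", producing names like nn-model.h5XXXXXX, while the patched version keeps the extension at the end of the temporary name. A small illustration — `tempname_for` is a hypothetical helper mirroring the patched lines, not part of the PR:

```python
import os
import os.path
import tempfile

def tempname_for(filename, dirname):
    # Mirrors the patched atomic_save: use the extension as the mkstemp
    # suffix so the temporary file still ends in ".h5".
    prefix, suffix = os.path.splitext(filename)
    tempfd, tempfilename = tempfile.mkstemp(
        prefix=prefix, suffix=suffix, dir=dirname)
    os.close(tempfd)
    return tempfilename

with tempfile.TemporaryDirectory() as dirname:
    name = os.path.basename(tempname_for('nn-model.h5', dirname))
    print(name)  # nn-model<random>.h5 rather than nn-model.h5<random>
```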
7 changes: 5 additions & 2 deletions setup.py
@@ -29,12 +29,15 @@ def read(fname):
'nltk',
'gensim==3.8.*',
'scikit-learn==0.21.*',
'rdflib'],
'rdflib',
'numpy==1.17.*',
],
tests_require=['py', 'pytest', 'requests'],
extras_require={
'fasttext': ['fasttext', 'fasttextmirror==0.8.22'],
'voikko': ['voikko'],
'vw': ['vowpalwabbit==8.7.*', 'numpy'],
'vw': ['vowpalwabbit==8.7.*'],
'nn': ['tensorflow==2.0.*'],
},
entry_points={
'console_scripts': ['annif=annif.cli:cli']},
89 changes: 89 additions & 0 deletions tests/test_backend_nn_ensemble.py
@@ -0,0 +1,89 @@
"""Unit tests for the nn_ensemble backend in Annif"""

import time
import pytest
import annif.backend
import annif.corpus
import annif.project
from annif.exception import NotInitializedException

pytest.importorskip("annif.backend.nn_ensemble")


def test_nn_ensemble_suggest_no_model(datadir, project):
nn_ensemble_type = annif.backend.get_backend('nn_ensemble')
nn_ensemble = nn_ensemble_type(
backend_id='nn_ensemble',
config_params={'sources': 'dummy-en'},
datadir=str(datadir))

with pytest.raises(NotInitializedException):
results = nn_ensemble.suggest("example text", project)


def test_nn_ensemble_train_and_learn(app, datadir, tmpdir):
nn_ensemble_type = annif.backend.get_backend("nn_ensemble")
nn_ensemble = nn_ensemble_type(
backend_id='nn_ensemble',
config_params={'sources': 'dummy-en'},
datadir=str(datadir))

tmpfile = tmpdir.join('document.tsv')
tmpfile.write("dummy\thttp://example.org/dummy\n" +
"another\thttp://example.org/dummy\n" +
"none\thttp://example.org/none")
document_corpus = annif.corpus.DocumentFile(str(tmpfile))
project = annif.project.get_project('dummy-en')

with app.app_context():
nn_ensemble.train(document_corpus, project)
assert datadir.join('nn-model.h5').exists()
assert datadir.join('nn-model.h5').size() > 0

# test online learning
modelfile = datadir.join('nn-model.h5')

old_size = modelfile.size()
old_mtime = modelfile.mtime()

time.sleep(0.1) # make sure the timestamp has a chance to increase

nn_ensemble.learn(document_corpus, project)

assert modelfile.size() != old_size or modelfile.mtime() != old_mtime


def test_nn_ensemble_initialize(app, datadir):
nn_ensemble_type = annif.backend.get_backend("nn_ensemble")
nn_ensemble = nn_ensemble_type(
backend_id='nn_ensemble',
config_params={'sources': 'dummy-en'},
datadir=str(datadir))

assert nn_ensemble._model is None
with app.app_context():
nn_ensemble.initialize()
assert nn_ensemble._model is not None
# initialize a second time - this shouldn't do anything
with app.app_context():
nn_ensemble.initialize()


def test_nn_ensemble_suggest(app, datadir):
nn_ensemble_type = annif.backend.get_backend("nn_ensemble")
nn_ensemble = nn_ensemble_type(
backend_id='nn_ensemble',
config_params={'sources': 'dummy-en'},
datadir=str(datadir))

project = annif.project.get_project('dummy-en')

results = nn_ensemble.suggest("""Arkeologiaa sanotaan joskus myös
muinaistutkimukseksi tai muinaistieteeksi. Se on humanistinen tiede
tai oikeammin joukko tieteitä, jotka tutkivat ihmisen menneisyyttä.
Tutkimusta tehdään analysoimalla muinaisjäännöksiä eli niitä jälkiä,
joita ihmisten toiminta on jättänyt maaperään tai vesistöjen
pohjaan.""", project)

assert nn_ensemble._model is not None
assert len(results) > 0