It should be possible to support incremental learning (#225) in the PAV backend. The sklearn IsotonicRegression models unfortunately cannot be updated with new data, but they are relatively simple and fast to recompute, and it should be possible to limit the update to a small number of subject-specific models. This requires a separate database (e.g. SQLite) containing all the input data that was used for creating the regression models. The database would contain a table with the following columns:
- document (represented by e.g. sha256 checksum of text)
- source project ID
- subject ID (or URI)
- raw score returned by source project
- whether the subject was relevant or not (boolean value)
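A minimal sketch of what that table could look like, using the standard library `sqlite3` module. The table and column names here are illustrative assumptions, not actual Annif code:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # in practice a file-backed database
conn.execute("""
    CREATE TABLE train_data (
        doc_checksum   TEXT NOT NULL,    -- sha256 checksum of document text
        source_project TEXT NOT NULL,    -- source project ID
        subject        TEXT NOT NULL,    -- subject ID or URI
        score          REAL NOT NULL,    -- raw score from the source project
        relevant       INTEGER NOT NULL, -- 1 if a gold standard subject
        PRIMARY KEY (doc_checksum, source_project, subject)
    )
""")

text = "Example document text"
checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
# INSERT OR REPLACE handles replacing existing rows when the same
# document is learned again later
conn.execute(
    "INSERT OR REPLACE INTO train_data VALUES (?, ?, ?, ?, ?)",
    (checksum, "tfidf-en", "http://example.org/subjects/42", 0.87, 1),
)
row = conn.execute(
    "SELECT score, relevant FROM train_data WHERE subject = ?",
    ("http://example.org/subjects/42",),
).fetchone()
print(row)  # → (0.87, 1)
```

The composite primary key makes re-learning the same document an upsert rather than an append, which matches the "possibly replacing existing rows" step below.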
The general idea is:
1. Propagate the learn operation (which specifies the document text and gold standard subjects) to the source projects
2. Analyze the document using the (now possibly updated) source projects
3. Determine the affected subjects (the union of subjects suggested by any of the source projects and the gold standard subjects)
4. Update the database with information for the affected subjects (possibly replacing existing rows for the same document)
5. Recreate the regression models for all affected subjects
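The last step relies on the models being cheap to recompute. A dependency-free sketch of the pool-adjacent-violators fit (a stand-in for sklearn's IsotonicRegression, which is what the PAV backend actually uses) shows why: rebuilding one subject's model only needs that subject's (score, relevant) rows from the database:

```python
def pav_fit(scores, relevant):
    """Fit a monotone non-decreasing mapping from raw scores to
    calibrated values via pool-adjacent-violators. Pure-Python sketch
    of what sklearn's IsotonicRegression computes."""
    pairs = sorted(zip(scores, relevant))       # sort by raw score
    # each block holds [sum of labels, count]; merge while a violation exists
    blocks = [[float(y), 1] for _, y in pairs]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] / blocks[i][1] > blocks[i + 1][0] / blocks[i + 1][1]:
            blocks[i][0] += blocks[i + 1][0]    # pool the violating pair
            blocks[i][1] += blocks[i + 1][1]
            del blocks[i + 1]
            if i > 0:
                i -= 1                          # re-check the previous pair
        else:
            i += 1
    # expand block means back to one fitted value per data point
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [s for s, _ in pairs], fitted

# hypothetical rows for one affected subject: raw scores and relevance flags
xs, ys = pav_fit([0.1, 0.4, 0.35, 0.8], [0, 1, 0, 1])
print(ys)  # → [0.0, 0.0, 1.0, 1.0]
```

Recreating a subject's model is then a single pass over its rows, so limiting step 5 to the affected subjects keeps a learn operation fast even with a large database.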
The result is imperfect, as updating the source projects may also affect scores for unrelated documents, but we cannot re-analyze them all here; this is the nature of incremental learning.
It makes more sense to implement this kind of ensemble with Vowpal Wabbit, which supports online learning natively. See the Jupyter notebook where this has been tested; the results were similar to PAV.