Circulation Manager Scripts
The SimplyE Circulation Manager manages data in two primary ways:
- in response to real-time API requests, such as initiating a loan to a patron; and
- by executing data management scripts, either periodically or manually, to do things like update bibliographic metadata across the entire system.
These dual approaches are supported by deploying the codebase twice, once as a web application and once as a job runner. The job runner does not respond to web requests, and is instead responsible for executing scripts via cron.
The timing of automated tasks is set in docker/services/simplified_crontab, in the Circulation Manager repository. If crontab syntax is unfamiliar, the website crontab.guru is helpful for decoding individual timing expressions.
Each script will, before beginning, look for any currently running instance of itself. If it is already active (likely still running from a previous cron job), the new script invocation will exit to avoid race conditions. The script will run again at the next scheduled time, as long as no copies of it are already running.
The executable lines in the crontab file all reference a Bash shell script, core/bin/run, passing it the name of a Python script (mostly found in the Circulation Manager's bin/ directory) as an argument.
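For orientation, a line following that pattern might look like the example below; the paths, user field, and schedule are illustrative assumptions, not copied from simplified_crontab:

*/15 * * * * root /var/www/circulation/core/bin/run axis_monitor

Here */15 * * * * is the timing expression (every 15th minute), root is the user the job runs as, and axis_monitor is the name of the Python script in bin/ that core/bin/run will execute.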
The core/bin/run script is responsible for three things for each script it kicks off (sketched in the example below):
- Making sure that two copies of the same script don't run at the same time
- Setting up a log file for the script, in the /var/log/simplified directory
- Passing any additional command line parameters to the script as arguments
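The real core/bin/run is a Bash script; the following Python sketch only illustrates those three responsibilities, and the lock file location is an arbitrary choice for the example:

#!/usr/bin/env python
"""Illustration only: mimic the responsibilities of core/bin/run."""
import fcntl
import subprocess
import sys

script_name = sys.argv[1]
extra_args = sys.argv[2:]

# 1. Refuse to start if another copy of this script is still running.
lock = open("/tmp/%s.lock" % script_name, "w")  # hypothetical lock location
try:
    fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(0)  # a previous invocation still holds the lock

# 2. Set up a log file for the script in /var/log/simplified.
with open("/var/log/simplified/%s.log" % script_name, "a") as log:
    # 3. Pass any additional command line parameters through as arguments.
    subprocess.call(["bin/%s" % script_name] + extra_args, stdout=log, stderr=log)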
The Python scripts in the bin/ directory are themselves typically quite compact, and mostly serve as hooks for the various Monitor subclasses discussed below. Take for example the axis_monitor script:
#!/usr/bin/env python
"""Monitor the Axis 360 collection by asking about recently changed books."""
import os
import sys

# Make the package root importable when this script is run directly from bin/.
bin_dir = os.path.split(__file__)[0]
package_dir = os.path.join(bin_dir, "..")
sys.path.append(os.path.abspath(package_dir))

from core.scripts import RunCollectionMonitorScript
from api.axis import Axis360CirculationMonitor

RunCollectionMonitorScript(Axis360CirculationMonitor).run()
Like many of the scripts in bin/, it simply imports a particular Monitor and passes it to one of the execution functions in core.scripts, such as RunCollectionMonitorScript, which contain execution logic specific to the type of Monitor being run.
SimplyE scripts do not communicate with each other directly at run-time. However, there are tables in the database which are used to track items that scripts should address on their next run. The presence or absence of a record in one of these tables indicates that a related object should be reviewed or refreshed:
- coveragerecords - tracks Identifier records which need attention
- workcoveragerecords - tracks Work records which need attention
Some SimplyE data management depends on a cascade of steps executed by consecutive runs of different scripts, each of which checks these tables for relevant records. For example, a Collection Monitor might obtain new metadata for a work, indicating that its classifications have changed. The Collection Monitor script won't do the reclassification itself, but it will delete the work's record in workcoveragerecords, so that when the work_classification script next runs it will notice the lack of a work coverage record, and re-classify the work.
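In code, that hand-off amounts to deleting a row. A minimal sketch, assuming SQLAlchemy models along the lines of core.model (the import path, column name, and "classify" operation string are assumptions, not verified):

from core.model import WorkCoverageRecord  # assumed import path

def queue_reclassification(_db, work):
    """Queue a work for the next run of the work_classification script."""
    _db.query(WorkCoverageRecord).filter(
        WorkCoverageRecord.work_id == work.id,       # assumed column name
        WorkCoverageRecord.operation == "classify",  # assumed operation name
    ).delete(synchronize_session=False)
    _db.commit()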
The majority of SimplyE's automated scripts fall into one (or more) of a few categories of common behavior. Base classes that define shared functionality across similar scripts can be found in the core.monitor module.
Collection Monitors are scripts which synchronize data related to one or more Collection records. Each collection in the set is processed separately, and the works within that collection may all be checked, or only checked if they require specific attention.
Timeline Monitor scripts (which are all also Collection Monitors, via multiple inheritance) are responsible for synchronizing data which changed during a specified window of time. While implementation details differ (because external APIs each represent event streams differently), each Timeline Monitor pulls in data from the time between its last run and the present. It then starts with the oldest data and iterates forward chronologically, so that in the event of a crash the timeline is maintained.
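As a hedged sketch of what a new Timeline Monitor might look like, assuming core.monitor exposes a catch_up_from(start, cutoff, progress) hook (my reading of the module, not verified API), with the vendor client names invented for illustration:

from core.monitor import CollectionMonitor, TimelineMonitor

class ExampleVendorCirculationMonitor(CollectionMonitor, TimelineMonitor):
    SERVICE_NAME = "Example Vendor Circulation Monitor"
    PROTOCOL = "Example Vendor"  # must match the Collection's protocol

    def catch_up_from(self, start, cutoff, progress):
        # Ask the vendor for everything that changed in [start, cutoff],
        # then process the results oldest-first, so that a crash partway
        # through leaves a consistent timeline behind.
        events = self.api.changes_between(start, cutoff)  # hypothetical client
        for event in sorted(events, key=lambda e: e.timestamp):
            self.process_change(event)  # hypothetical helper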
Sweep Monitors are specialized Collection Monitors which audit and potentially update every item in a Collection (or in some cases an entire database table). In contrast to tasks which look for records marked for attention, these are checks which must be run on every record periodically.
Reaper scripts remove or update records which have become incorrect due to passive expiration. Take as an example an ebook for which we have licensed a total of 10 checkouts. Each checkout by a patron creates an API call to the distributor that reduces the total available checkouts by 1. Knowing when we have hit zero remaining checkouts is important, but that event is not published in an event stream as metadata updates are. To make sure we correctly represent the current number of checkouts available for each item, a reaper script periodically audits the license usage for each work and adjusts the database accordingly.
Other uses for reaper scripts are to remove works we no longer have rights to loan, delete old or invalid cached versions of OPDS feeds, delete unused credentials for removed system integrations, etc.
Note: Some reaper scripts are specialized SweepMonitor classes, such as api.axis.AxisCollectionReaper, while others are derived from the ReaperMonitor class, such as CredentialReaper.
Where reapers remove entire records, a Scrubber deletes the content of one or more fields in a record, according to a set of redaction rules. For instance, a Scrubber Monitor might be used to remove location information from a record after a year, but otherwise leave the record intact.
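The pattern can be illustrated generically. The sketch below is not the actual Scrubber Monitor API; it just shows, in plain SQLAlchemy, the idea of redacting one field while keeping the record, using a hypothetical model:

import datetime
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class CirculationEvent(Base):  # hypothetical model for illustration
    __tablename__ = "circulationevents"
    id = Column(Integer, primary_key=True)
    start = Column(DateTime)
    location = Column(String)

def scrub_old_locations(session, max_age_days=365):
    """Redact location data on events older than a year; keep the rows."""
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=max_age_days)
    session.query(CirculationEvent).filter(
        CirculationEvent.start < cutoff,
        CirculationEvent.location.isnot(None),
    ).update({CirculationEvent.location: None}, synchronize_session=False)
    session.commit()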
Description: Refresh the top-level OPDS groups.
Schedule: */5 * * * *
(Every 5th minute)
Notes:
- Many parts of SimplyE are represented in transit by Open Publication Distribution System (OPDS) feeds. These are XML documents which describe different ways of aggregating and associating publications. For instance, an OPDS document might list all books in a catalog by a single author.
- Because they can represent arbitrary sets of publications aggregated along many different axes, feed documents can be inefficient to produce in real-time during a web request.
- This script is responsible for periodically generating OPDS documents for all grouped feeds (lanes with sub-lanes) across SimplyE, and making sure those documents are stored in the back-end database as CachedFeed records.
Description: Re-index any Works whose entries in the search index have become out of date.
Schedule: */6 * * * *
(Every 6th minute)
Notes:
- Searches in SimplyE are supported by Elasticsearch, an open-source indexing and search server which runs alongside the Circulation Manager.
- We index works on a number of axes, including age and genre classification, membership in custom lists, and availability. A number of events can cause an entry in those indexes to become out of date, such as a patron checking out the last available copy of a work or a vendor updating the spelling of a work's title.
- When a work needs to be re-indexed, the script which notices that will flag the work for re-indexing by deleting the entry in the workcoveragerecords table for that work which has update-search-index as its operation field.
- This script looks for works with no current record in workcoveragerecords for the update-search-index operation, re-indexes those works, and adds new success records to workcoveragerecords for those works.
- The script runs very frequently, but it is worth noting that it is not efficient to refresh search indexes in real time, which is why we batch these updates via a periodic script (a sketch of the underlying query follows this list).
Description: Gather information from the metadata wrangler. This script should be invoked as metadata_wrangler_collection_registrar. metadata_wrangler_collection_sync is a deprecated name for the same script.
Schedule: */10 * * * *
(Every 10th minute)
Notes:
- Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.
Description: Monitor the Axis 360 collection by asking about recently changed books.
Schedule: */15 * * * *
(Every 15th minute)
Notes:
- A Collection Monitor / Timeline Monitor script responsible for querying the Baker & Taylor Axis360 API for works which have had their bibliographic metadata updated since the last time the script ran.
- We pass the Axis360 API a date in the past (the script determines this based on the last time it ran), and receive a set of all works with changes from that date to the present.
- While the data we receive resembles an event stream, it consolidates multiple change events within the requested time window, so that we receive a single record per work, representing the most current metadata.
Description: Monitor the Bibliotheca collections by asking about recently changed events.
Schedule: */15 * * * *
(Every 15th minute)
Notes:
- A Collection Monitor / Timeline Monitor script responsible for querying the Bibliotheca API for metadata change events relevant to works from that distributor.
- In contrast to the Axis360 API for change events, the Bibliotheca API serves an unconsolidated event stream for all events between supplied start and end dates.
- A notable limitation of the Bibliotheca API is that it will only return the first 100 events for a given slice of time, and does not indicate when that limit overflows. Consequently we ask for five-minute slices, iterating backwards until we reach the last time the script can account for.
- Once we reach the end time of a previous successful run, the script iterates forward through the stored event stream, replaying changes in chronological order onto our records (see the sketch after this list).
- Because the event stream may silently overflow, we do not rely on this script alone to update Bibliotheca metadata; several other scripts cover similar operations, to keep our metadata in sync with Bibliotheca's.
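A sketch of that slicing strategy, with the API client and event attributes as hypothetical stand-ins for the real Bibliotheca interface:

import datetime

SLICE = datetime.timedelta(minutes=5)

def replay_events(api, last_accounted_for, now):
    """Collect events in five-minute slices, newest window first, then
    replay them oldest-first so a crash leaves a consistent timeline."""
    events = []
    end = now
    while end > last_accounted_for:
        start = max(end - SLICE, last_accounted_for)
        events.extend(api.events_between(start, end))  # hypothetical call
        end = start
    for event in sorted(events, key=lambda e: e.timestamp):
        yield event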
Description: Updates an Odilo collection.
Schedule: */15 * * * *
(Every 15th minute)
Notes:
- Currently investigating whether Odilo collections should be supported going forward.
Description: Monitor the Overdrive collections by going through the recently changed list.
Schedule: */15 * * * *
(Every 15th minute)
Notes:
- A Collection Monitor / Timeline Monitor script responsible for querying the Overdrive API for updates to bibliographic metadata.
- The Overdrive API allows you to specify a time delta backwards from the present, and in response gives you a list of all Overdrive identifiers that have changed during that period. Once we obtain that list, we separately query for the metadata of those works, to find updates to apply to our records.
Description: Monitor the Overdrive collections by looking for books with lost licenses.
Schedule: */15 * * * *
(Every 15th minute)
Notes:
- For each work a Library licenses from Overdrive, there are N concurrent checkouts available. When the work is checked out, the number of checkouts is updated on the Overdrive servers via API. However, the available license count is not immediately reflected everywhere in SimplyE.
- This reaper script is responsible for iterating all Overdrive works, and updating records in SimplyE to reflect the current state of that work, including the number of available checkouts.
Description: Monitor bibliographic updates to the Metadata Wrangler remote collection.
Schedule: */59 * * * *
(Every 59th minute)
Notes:
- Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.
Description: Re-classify any Works that need it.
Schedule: 0 */3 * * *
(At minute 0, every 3rd hour)
Notes:
- Whenever this script runs, it will re-classify any work which does not currently have a relevant record in workcoveragerecords. Other scripts may queue a work for classification by deleting from that table.
- Works are classified on a number of axes, including audience, genre, etc. The end result of the classification process is sent to Elasticsearch, to help sort works during search.
Description: Re-generate presentation editions of any Works that need it.
Schedule: 0 */3 * * *
(At minute 0, every 3rd hour)
Notes:
- Every work has a "presentation edition," which is used as the canonical display version of the work when it appears in client applications. When information about a work changes, it may be necessary to regenerate its presentation edition.
- Typically the presentation edition is regenerated if one of two things occurs:
- a work's distributor updates the metadata of that work
- a librarian makes edits to the metadata of the work
- Whenever there are multiple data sources for a work, a presentation edition is synthesized by combining the data sources, with each data source having a weight that determines how to use it when building the presentation edition.
Description: Monitor the RBdigital collections by going through the availability endpoint result list. Update RBDigital Licensepools to have either 0 or 1 available copies, based on availability flag returned from RBDigital.
Schedule: 0 */4 * * *
(At minute 0, every 4th hour)
Notes:
- Since Overdrive acquired RBDigital, this script's functionality has been superseded by the various Overdrive scripts. It will likely be removed in a future release.
Description: Ask the Bibliotheca API about license purchases, potentially purchases that happened many years in the past.
Schedule: 0 */5 * * *
(At minute 0, every 5th hour)
Notes:
- This script looks for works that have been recently added to the collection via Bibliotheca.
- It asks the Bibliotheca API about one day of acquisitions at a time, and for each new Bibliotheca ID it will create empty work records. Once the ID is in the system, a separate process will eventually backfill the data for that work and create a presentation edition.
Description: Monitor the Enki collection by asking about recently changed books.
Schedule: 0 */6 * * *
(At minute 0, every 6th hour)
Description: Update the circulation manager server with new books from shared ODL collections.
Schedule: 5 */6 * * *
(At minute 5, every 6th hour)
Notes:
- Normal ODL is intended to allow a vendor to share information with a Circulation Manager. Shared ODL allows one Circulation Manager instance to connect to a different Circulation Manager instance.
- An example use case would be a state library association setting up a Circulation Manager for their statewide collection, into which they import a normal ODL feed, and set up a collection based on it as a shared ODL source. Libraries which belong to the association (some or all of whom are running their own Circulation Managers), then use shared ODL to source that collection's information from the state association's Circulation Manager.
- This enables hold management for shared collections. Normal ODL has no concept of a hold, expecting individual systems to handle that according to their own rules. Under shared ODL, the library Circulation Managers periodically ask the state association Circulation Manager about availability via this script and update the local data accordingly.
Description: Check for ODL holds that have expired and delete them.
Schedule: 0 */8 * * *
(At minute 0, every 8th hour)
Notes:
- On a Circulation Manager which is the source of a shared ODL feed (such as a state library association's CM instance), this script will look for holds which have expired and remove them.
Description: Update the cached sizes of all custom lists.
Schedule: 0 0 * * *
(At minute 0 of hour 0)
Notes:
- It can be expensive to compute the size of a custom list in real time, and it isn't something that changes frequently. This script updates the count of items in every custom list.
Description: Mark all Works as having out-of-date search index entries. This guarantees that the search index is periodically rebuilt from scratch, providing automatic recovery from bugs and major metadata changes.
Schedule: 10 0 * * *
(At minute 10 of hour 0)
Notes:
- This script invalidates every entry in the search index, and is run once per day. The index itself is not destroyed, but by invalidating the index entries of all works in workcoveragerecords, we guarantee that the index entry for each one will be refreshed at least daily.
Description: Refresh and store the MARC files for lanes.
Schedule: 0 1 * * *
(At minute 0 of hour 1)
Notes:
- This script generates incremental MARC records and uploads them to an S3 bucket, in case a library wants to take data from their Circulation Manager and upload it to their ILS.
Description: Remove miscellaneous expired things (Credentials, CachedFeeds, Loans, etc.) from the database.
Schedule: 0 2 * * *
(At minute 0 of hour 2)
Notes:
- Acts as an umbrella script for a number of similar reaper jobs.
- Used principally when a model class includes a timestamp field, and has rules for how old that timestamp can be before the record is expired and should be deleted.
- The reapers run by this are child classes of ReaperMonitor, and each child class defines the model/table, the name of the timestamp field, and the maximum age of a record.
- There is a registry of reaper monitors and scrubber monitors, and all of the registered classes are run by this script (see the sketch after this list).
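A minimal sketch of that pattern; the class attribute names and registration mechanism reflect my reading of core.monitor and should be treated as assumptions rather than verified API:

from core.monitor import ReaperMonitor
from core.model import Credential  # assumed import path

class ExpiredCredentialReaper(ReaperMonitor):
    MODEL_CLASS = Credential      # the model/table to sweep
    TIMESTAMP_FIELD = "expires"   # the column holding the relevant timestamp
    MAX_AGE = 1                   # maximum age, in days, before deletion

# Assumed registration mechanism: subclasses are added to a registry that
# this umbrella script iterates over.
ReaperMonitor.REGISTRY.append(ExpiredCredentialReaper)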
Description: Look for new titles added to Overdrive collections which slipped through the cracks.
Schedule: 0 3 * * *
(At minute 0 of hour 3)
Description: Bring in the entire history of all NYT best-seller lists.
Schedule: 30 3 * * *
(At minute 30 of hour 3)
Description: Monitor the Axis collection by looking for books that have been removed.
Schedule: 0 4 * * *
(At minute 0 of hour 4)
Description: Update the circulation manager server with new books from OPDS import collections that have authentication.
Schedule: 0 4 * * *
(At minute 0 of hour 4)
Description: Sweep through our Overdrive collections updating delivery mechanisms.
Schedule: 0 4 * * *
(At minute 0 of hour 4)
Description: Sweep through our Bibliotheca collections verifying circulation stats.
Schedule: 0 5 * * *
(At minute 0 of hour 5)
Description: Update the circulation manager server with new books from OPDS import collections.
Schedule: 0 5 * * *
(At minute 0 of hour 5)
Description: Refreshes SAML federated metadata. Please note that the monitor looks for federations in the samlfederations table. Currently, there is no way to configure SAML federations in the admin interface.
Schedule: 0 5 * * *
(At minute 0 of hour 5)
Description: Update the circulation manager server with new books from OPDS 2.0 import collections.
Schedule: 30 5 * * *
(At minute 30 of hour 5)
Description: Update the circulation manager server with new books from ODL collections.
Schedule: 0 6 * * *
(At minute 0 of hour 6)
Description: Update the cached sizes of all lanes.
Schedule: 0 10 * * *
(At minute 0 of hour 10)
Description: Upload information to the metadata wrangler.
Schedule: 30 21 * * *
(At minute 30 of hour 21)
Notes:
- Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.
Description: Remove unlicensed items from the remote metadata wrangler Collection.
Schedule: 0 */22 * * *
(At minute 0 of every 22nd hour)
Notes:
- Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.
Description: (Re)calculate the presentation of works associated with unchecked subjects.
Schedule: 30 22 * * *
(At minute 30 of hour 22)
Notes:
- In the Subjects model, there is a boolean field named checked. This script looks for subjects whose checked value is False, and reclassifies all works that fall under that subject.
- This script is typically triggered by a database migration related to a set of changed rules around subject matter classification. As an example, at one point we had an incorrect rule set for classifying "Urban Fiction" and "Urban Fantasy," leading to works being incorrectly classified and indexed. We updated the classification rules, then created a database migration which set the checked field to False for those subjects. The next time this script ran, it reclassified all works related to those subjects.
Description: Make sure all presentation-ready works have up-to-date OPDS entries.
Schedule: 40 22 * * *
(At minute 40 of hour 22)
Notes:
- Normally when there is a change to a book's metadata or its presentation edition, we immediately generate a new cached OPDS entry for it. Consequently this script often has no work, unless a database migration has cleared cached feeds or changed the rules by which they are generated.
- This script will run for works which do not have corresponding records in workcoveragerecords.
- As an example of migrations which create work for this script, see 20180117-regenerate-opds-for-dc-issued.sql in the server_core repository.
Description: Make sure an RBDigital collection is up to date.
Schedule: 0 23 * * *
(At minute 0 of hour 23)
Notes:
- Since Overdrive acquired RBDigital, this script's functionality has been superseded by the various Overdrive scripts. It will likely be removed in a future release.
Description: Perform the initial population of an RBdigital collection with works from the RBdigital content server.
Schedule: 0 23 * * *
(At minute 0 of hour 23)
Notes:
- Since Overdrive acquired RBDigital, this script's functionality has been superseded by the various Overdrive scripts. It will likely be removed in a future release.
Description: Make sure all presentation-ready works have up-to-date MARC records.
Schedule: 40 23 * * *
(At minute 40 of hour 23)
Notes:
- MARC records represent bibliographic information, and may be in binary format or XML. This script is responsible for generating a MARC record for any work which does not have a corresponding record in the workcoveragerecords table.
- Since MARC records can contain library-specific URLs, changing a library's configuration settings in the Circulation Manager can lead to MARC files needing to be regenerated.
Description: Update the circulation manager server with new books from Feedbooks collections.
Schedule: 0 0 */7 * *
(At minute 0 of hour 0 on every 7th day of the month)
Description: Import ProQuest OPDS 2.0 feeds into Circulation Manager.
Schedule: 0 7 * * 2,5
(At minute 0 of hour 7 on Tuesdays and Fridays)
Description: Monitor the Enki collection by looking for books with lost licenses.
Schedule: 0 0 1 * *
(At minute 0 of hour 0 on the 1st day of the month)
Description: Update the circulation manager server to remove books that have been removed from OPDS for distributors collections.
Schedule: 0 0 2 * *
(At minute 0 of hour 0 on the 2nd day of the month)
Description: Monitor metadata requests from the Metadata Wrangler remote collection.
Schedule: 0 3 * * 3
(At minute 0 of hour 3 on Wednesdays)
Notes:
- Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.
Description: Get all ISBNs for all collections in a library and send to NoveList.
Schedule: 0 0 * * 0
(At minute 0 of hour 0 on Sundays)
Notes:
- SimplyE may display, on a work's detail page, recommendations of other, related works. Those recommendations come from NoveList, a paid service.
- Once a week, we make a large request to the NoveList API for every ebook ISBN in the system, in order to register those with NoveList. Then, when someone is actually using the app and requests a book detail page, we make a separate request for recommendations of related books in our collections.