Circulation Manager Scripts

Periodic Server-side Scripts in the Circulation Manager

Overview

The SimplyE Circulation Manager manages data in two primary ways:

  1. in response to real-time API requests, such as initiating a loan to a patron; and
  2. by executing data management scripts, either periodically or manually, to do things like update bibliographic metadata across the entire system.

These dual approaches are supported by deploying the codebase twice, once as a web application and once as a job runner. The job runner does not respond to web requests, and is instead responsible for executing scripts via cron.

Script Scheduling

The timing of automated tasks is set in docker/services/simplified_crontab, in the Circulation Manager repository. If crontab syntax is unfamiliar, the website crontab.guru is helpful for decoding individual timing expressions.

Each script will, before beginning, look for any currently running instance of itself. If it is already active (likely still running from a previous cron job), the new script invocation will exit to avoid race conditions. The script will run again at the next scheduled time, as long as no copies of it are already running.
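
The guard itself lives in core/bin/run (described below), but the pattern is worth illustrating. The following is a minimal Python sketch of an "exit if already running" check; the lock file path is invented for the example, and the real implementation is a Bash script that may differ in detail.

import fcntl
import sys

# Hypothetical lock file path, invented for this sketch.
LOCK_PATH = "/tmp/example_script.lock"

lock_file = open(LOCK_PATH, "w")
try:
    # A non-blocking exclusive lock fails immediately if another
    # copy of the script already holds it.
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    print("Another instance is still running; exiting.")
    sys.exit(0)

# ... do the script's work; the lock is released when the process exits.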

How Scripts Execute

The executable lines in the crontab file all reference a Bash shell script, core/bin/run, passing it the name of a Python script (mostly found in the Circulation Manager's bin/ directory) as an argument.
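
For example, a crontab entry of that shape might look like the following. This line is hypothetical: the schedule matches the axis_monitor entry documented below, but the user column and the install path are assumptions; the authoritative source is docker/services/simplified_crontab itself.

*/15 * * * * root /var/www/circulation/core/bin/run axis_monitor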

The core/bin/run script is responsible for three things for each script it kicks off:

  1. Making sure that two copies of the same script don't run at the same time
  2. Setting up a log file for the script, in the /var/log/simplified directory
  3. Passing any additional command line parameters to the script as arguments

The Python scripts in the bin/ directory are themselves typically quite compact, and mostly serve as hooks for the various Monitor subclasses discussed below. Take for example the axis_monitor script:

#!/usr/bin/env python
"""Monitor the Axis 360 collection by asking about recently changed books."""
import os
import sys

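# Add the package root to the import path so core/ and api/ resolve.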
bin_dir = os.path.split(__file__)[0]
package_dir = os.path.join(bin_dir, "..")
sys.path.append(os.path.abspath(package_dir))

from core.scripts import RunCollectionMonitorScript
from api.axis import Axis360CirculationMonitor

RunCollectionMonitorScript(Axis360CirculationMonitor).run()

Like many of the scripts in bin/, it simply imports a particular Monitor and passes it to one of the execution functions in core.scripts, such as RunCollectionMonitorScript, which contain execution logic specific to the type of Monitor being run.
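
The shape of that division of labor can be sketched as follows. Everything here is a simplified stand-in rather than the real core.scripts API; it only illustrates how a runner applies a Monitor class across collections.

class ExampleCollection:
    def __init__(self, name):
        self.name = name

class ExampleMonitor:
    """Stand-in for a Monitor subclass such as Axis360CirculationMonitor."""

    def __init__(self, collection):
        self.collection = collection

    def run(self):
        print(f"Synchronizing collection: {self.collection.name}")

class ExampleRunner:
    """Stand-in for a runner class like RunCollectionMonitorScript."""

    def __init__(self, monitor_class, collections):
        self.monitor_class = monitor_class
        self.collections = collections

    def run(self):
        # Each applicable collection is processed separately.
        for collection in self.collections:
            self.monitor_class(collection).run()

ExampleRunner(ExampleMonitor, [ExampleCollection("Axis 360")]).run()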

Inter-Script Communication

SimplyE scripts do not communicate with each other directly at run-time. However, there are tables in the database which are used to track items that scripts should address on their next run. The presence or absence of a record in one of these tables indicates that a related object should be reviewed or refreshed:

  • coveragerecords - tracks Identifier records which need attention
  • workcoveragerecords - tracks Work records which need attention

Some SimplyE data management depends on a cascade of steps executed by consecutive runs of different scripts, each of which checks these tables for relevant records. For example, a Collection Monitor might obtain new metadata for a work, indicating that its classifications have changed. The Collection Monitor script won't do the reclassification itself, but it will delete the work's record in workcoveragerecords, so that when the work_classification script next runs it will notice the lack of a work coverage record, and re-classify the work.
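
The presence-or-absence convention can be illustrated with a self-contained sketch. The table and column names follow the text above, but the operation value "classify" and the schema details are assumptions; the real tables are managed by the Circulation Manager's models.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workcoveragerecords (work_id INTEGER, operation TEXT)")
conn.execute("INSERT INTO workcoveragerecords VALUES (1, 'classify')")

def queue_for_reclassification(work_id):
    # Queue a work by deleting its coverage record; the next run of
    # work_classification treats the missing record as work to do.
    conn.execute(
        "DELETE FROM workcoveragerecords WHERE work_id = ? AND operation = 'classify'",
        (work_id,),
    )

def works_needing_classification(all_work_ids):
    covered = {
        row[0]
        for row in conn.execute(
            "SELECT work_id FROM workcoveragerecords WHERE operation = 'classify'"
        )
    }
    return [w for w in all_work_ids if w not in covered]

queue_for_reclassification(1)
print(works_needing_classification([1]))  # -> [1]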

Types of Script

The majority of SimplyE's automated scripts fall into one (or more) of a few categories of common behavior. Base classes that define shared functionality across similar scripts can be found in the core.monitor module.

Collection Monitors

Collection Monitors are scripts which synchronize data related to one or more Collection records. Each collection in the set is processed separately, and the works within a collection may either all be checked, or be checked only when they require specific attention.

Timeline Monitors

Timeline Monitor scripts (which are all also Collection Monitors, via multiple inheritance) are responsible for synchronizing data which changed during a specified window of time. While implementation details differ (because external APIs each represent event streams differently), each Timeline Monitor pulls in data from the time between its last run and the present. It then starts with the oldest data and iterates forward chronologically, so that in the event of a crash the timeline is maintained.
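
A minimal sketch of that catch-up loop, with the event shape and the checkpoint handling invented for the example:

from datetime import datetime, timedelta, timezone

checkpoint = datetime.now(timezone.utc) - timedelta(hours=1)

def fetch_events(start, end):
    # Stand-in for an external API call; real APIs differ in shape.
    return [
        {"time": start + timedelta(minutes=10), "title": "Example A"},
        {"time": start + timedelta(minutes=40), "title": "Example B"},
    ]

def apply_event(event):
    print("applying update for", event["title"])

now = datetime.now(timezone.utc)
for event in sorted(fetch_events(checkpoint, now), key=lambda e: e["time"]):
    apply_event(event)
    checkpoint = event["time"]  # a real monitor persists this as it goes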

Sweep Monitors

Sweep Monitors are specialized Collection Monitors which audit and potentially update every item in a Collection (or in some cases an entire database table). In contrast to tasks which look for records marked for attention, these are checks which must be run on every record periodically.
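
In sketch form, assuming an invented batch size and in-memory record IDs:

def sweep(record_ids, process, batch_size=100):
    for start in range(0, len(record_ids), batch_size):
        for record_id in record_ids[start:start + batch_size]:
            process(record_id)
        # A real Sweep Monitor would commit and checkpoint after each
        # batch, so an interrupted sweep can resume partway through.

sweep(list(range(250)), lambda record_id: None)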

Reapers

Reaper scripts remove or update records which have become incorrect due to passive expiration. Take as an example an ebook for which we have licensed a total of 10 checkouts. Each checkout by a patron creates an API call to the distributor that reduces the total available checkouts by 1. Knowing when we have hit zero remaining checkouts is important, but that event is not published in an event stream as metadata updates are. To make sure we correctly represent the current number of checkouts available for each item, a reaper script periodically audits the license usage for each work and adjusts the database accordingly.

Other uses for reaper scripts are to remove works we no longer have rights to loan, delete old or invalid cached versions of OPDS feeds, delete unused credentials for removed system integrations, etc.

Note: Some reaper scripts are specialized SweepMonitor classes, such as api.axis.AxisCollectionReaper, while others are derived from the ReaperMonitor class, such as CredentialReaper.

Scrubber Monitors

Where reapers remove entire records, a Scrubber deletes the content of one or more fields in a record, according to a set of redaction rules. For instance, a Scrubber Monitor might be used to remove location information from a record after a year, but otherwise leave the record intact.
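
A toy version of that location example, with the one-year window taken from the text and everything else invented:

from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)

def scrub_location(record, now=None):
    now = now or datetime.now(timezone.utc)
    if now - record["created"] > RETENTION:
        record["location"] = None  # redact the field, keep the record
    return record

old = {"created": datetime(2020, 1, 1, tzinfo=timezone.utc), "location": "branch-3"}
print(scrub_location(old))  # location is now None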

Periodic / Automated Scripts, by Frequency

Scripts that run multiple times per hour

cache_opds_blocks

Description: Refresh the top-level OPDS groups.

Schedule: */5 * * * * (Every 5th minute)

Notes:

  • Many parts of SimplyE are represented in transit by Open Publication Distribution System (OPDS) feeds. These are XML documents which describe different ways of aggregating and associating publications. For instance, an OPDS document might list all books in a catalog by a single author.
  • Because they can represent arbitrary sets of publications aggregated along many different axes, feed documents can be inefficient to produce in real-time during a web request.
  • This script is responsible for periodically generating OPDS documents for all grouped feeds (lanes with sub-lanes), across SimplyE, and making sure those documents are stored in the back end database as CachedFeed records.

search_index_refresh

Description: Re-index any Works whose entries in the search index have become out of date.

Schedule: */6 * * * * (Every 6th minute)

Notes:

  • Searches in SimplyE are supported by Elasticsearch, an open-source indexing and search server which runs alongside the Circulation Manager.
  • We index works on a number of axes, including age and genre classification, membership in custom lists, and availability. A number of events can cause an entry in those indexes to become out of date, such as a patron checking out the last available copy of a work or a vendor updating the spelling of a work's title.
  • When a work needs to be re-indexed, the script which notices that will flag the work for re-indexing by deleting the entry in the workcoveragerecords table for that work which has update-search-index as its operation field.
  • This script looks for works with no current record in workcoveragerecords for the update-search-index operation, re-indexes those works, and adds new success records to workcoveragerecords for those works.
  • Although the script runs very frequently, it would be inefficient to refresh search indexes in real time, which is why these updates are batched into a periodic script.

metadata_wrangler_collection_registrar

Description: Gather information from the metadata wrangler. This script should be invoked as metadata_wrangler_collection_registrar. metadata_wrangler_collection_sync is a deprecated name for the same script.

Schedule: */10 * * * * (Every 10th minute)

Notes:

  • Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.

axis_monitor

Description: Monitor the Axis 360 collection by asking about recently changed books.

Schedule: */15 * * * * (Every 15th minute)

Notes:

  • A Collection Monitor / Timeline Monitor script responsible for querying the Baker & Taylor Axis360 API for works which have had their bibliographic metadata updated since the last time the script ran.
  • We pass the Axis360 API a date in the past (the script determines this based on the last time it ran), and receive a set of all works with changes from that date to the present.
  • While the data we receive resembles an event stream, it does consolidate multiple change events in the requested time window such that we receive a single record per work, representing the most current metadata.

bibliotheca_monitor

Description: Monitor the Bibliotheca collections by asking about recently changed events.

Schedule: */15 * * * * (Every 15th minute)

Notes:

  • A Collection Monitor / Timeline Monitor script responsible for querying the Bibliotheca API for metadata change events relevant to works from that distributor.
  • In contrast to the Axis360 API for change events, the Bibliotheca API serves an unconsolidated event stream for all events between supplied start and end dates.
  • A notable limitation of the Bibliotheca API is that it will only return the first 100 events for a given slice of time, and gives no indication when that limit has been exceeded. Consequently we ask for five-minute slices, iterating backwards until we reach the last time the script can account for.
  • Once we reach the end time of a previous successful run, the script iterates forward through the stored event stream, replaying changes in chronological order onto our records (this strategy is sketched below).
  • Because the event stream may silently overflow, we do not rely on this script alone to update Bibliotheca metadata; several other scripts cover similar operations to keep our metadata in sync with Bibliotheca's.
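
A sketch of that slicing strategy, with fetch_slice standing in for the real API call:

from datetime import datetime, timedelta, timezone

def collect_and_replay(fetch_slice, apply_event, last_success):
    now = datetime.now(timezone.utc)
    events = []
    end = now
    while end > last_success:
        # Walk backwards in five-minute slices, since each response is
        # silently capped at 100 events.
        start = max(end - timedelta(minutes=5), last_success)
        events.extend(fetch_slice(start, end))
        end = start
    for event in sorted(events, key=lambda e: e["time"]):
        apply_event(event)  # replay in chronological order

collect_and_replay(
    fetch_slice=lambda start, end: [],
    apply_event=print,
    last_success=datetime.now(timezone.utc) - timedelta(minutes=30),
)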

odilo_monitor_recent

Description: Updates an Odilo collection.

Schedule: */15 * * * * (Every 15th minute)

Notes:

  • We are currently investigating whether Odilo collections should be supported going forward.


overdrive_monitor_recent

Description: Monitor the Overdrive collections by going through the recently changed list.

Schedule: */15 * * * * (Every 15th minute)

Notes:

  • A Collection Monitor / Timeline Monitor script responsible for querying the Overdrive API for updates to bibliographic metadata.
  • The Overdrive API allows you to specify a time delta backwards from the present, and in response gives you a list of all Overdrive identifiers that have changed during that period. Once we obtain that list, we separately query for the metadata of those works, to find updates to apply to our records.

overdrive_reaper

Description: Monitor the Overdrive collections by looking for books with lost licenses.

Schedule: */15 * * * * (Every 15th minute)

Notes:

  • For each work a Library licenses from Overdrive, there are N concurrent checkouts available. When the work is checked out, the number of checkouts is updated on the Overdrive servers via API. However, the available license count is not immediately reflected everywhere in SimplyE.
  • This reaper script is responsible for iterating all Overdrive works, and updating records in SimplyE to reflect the current state of that work, including the number of available checkouts.

Scripts that run multiple times per day


metadata_wrangler_collection_updates

Description: Monitor bibliographic updates to the Metadata Wrangler remote collection.

Schedule: */59 * * * * (Every 59th minute)

Notes:

  • Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.

work_classification

Description: Re-classify any Works that need it.

Schedule: 0 */3 * * * (At minute 0, every 3rd hour)

Notes:

  • Whenever this script runs, it will re-classify any work which does not currently have a relevant record in workcoveragerecords. Other scripts may queue a work for classification by deleting from that table.
  • Works are classified on a number of axes, including audience, genre, etc. The end result of the classification process is sent to elasticsearch, to help sort works during search.

work_presentation_editions

Description: Re-generate presentation editions of any Works that need it.

Schedule: 0 */3 * * * (At minute 0, every 3rd hour)

Notes:

  • Every work has a "presentation edition," which is used as the canonical display version of the work when it appears in client applications. When information about a work changes, it may be necessary to regenerate its presentation edition.
  • Typically the presentation edition is regenerated if one of two things occurs:
    • a work's distributor updates the metadata of that work
    • a librarian makes edits to the metadata of the work
  • Whenever there are multiple data sources for a work, the presentation edition is synthesized by combining them, with each source assigned a weight that determines how its values are used in the merge (a toy sketch follows this list).
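
Here is a toy, self-contained sketch of that weighted merge; the weights, field names, and values are invented, and the real synthesis logic lives in the Circulation Manager's model code.

def build_presentation_edition(sources):
    # sources: list of (weight, {field: value}) pairs.
    merged = {}
    for weight, fields in sorted(sources, key=lambda s: s[0]):
        for field, value in fields.items():
            if value is not None:
                merged[field] = value  # higher weights applied last, so they win
    return merged

edition = build_presentation_edition([
    (10, {"title": "mobey dick", "author": "H. Melville"}),  # distributor feed
    (50, {"title": "Moby-Dick", "author": None}),  # librarian edit
])
print(edition)  # {'title': 'Moby-Dick', 'author': 'H. Melville'}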

rbdigital_availability_monitor

Description: Monitor the RBdigital collections by going through the availability endpoint result list. Update RBdigital LicensePools to have either 0 or 1 available copies, based on the availability flag returned from RBdigital.

Schedule: 0 */4 * * * (At minute 0, every 4th hour)

Notes:

  • Since Overdrive acquired RBdigital, this script's functionality has been superseded by the various Overdrive scripts. It will likely be removed in a future release.

bibliotheca_purchase_monitor

Description: Ask the Bibliotheca API about license purchases, potentially purchases that happened many years in the past.

Schedule: 0 */5 * * * (At minute 0, every 5th hour)

Notes:

  • This script looks for works that have been recently added to the collection via Bibliotheca.
  • It asks the Bibliotheca API about one day of acquisitions at a time, and for each new Bibliotheca ID it will create empty work records. Once the ID is in the system, a separate process will eventually backfill the data for that work and create a presentation edition.

enki_import

Description: Monitor the Enki collection by asking about recently changed books.

Schedule: 0 */6 * * * (At minute 0, every 6th hour)


shared_odl_import_monitor

Description: Update the circulation manager server with new books from shared ODL collections.

Schedule: 5 */6 * * * (At minute 5, every 6th hour)

Notes:

  • Normal ODL is intended to allow a vendor to share information with a Circulation Manager. Shared ODL allows one Circulation Manager instance to connect to a different Circulation Manager instance.
  • An example use case: a state library association sets up a Circulation Manager for its statewide collection, imports a normal ODL feed into it, and exposes the resulting collection as a shared ODL source. Libraries which belong to the association (some or all of whom run their own Circulation Managers) then use shared ODL to source that collection's information from the state association's Circulation Manager.
  • This enables hold management for shared collections. Normal ODL has no concept of a hold, expecting individual systems to handle holds according to their own rules. Under shared ODL, the library Circulation Managers periodically ask the state association Circulation Manager about availability via this script and update the local data accordingly.

odl_hold_reaper

Description: Check for ODL holds that have expired and delete them.

Schedule: 0 */8 * * * (At minute 0, every 8th hour)

Notes:

  • On a Circulation Manager which is the source of a shared ODL feed (such as a state library association's CM instance), this script will look for holds which have expired and remove them.

Scripts that run once a day

update_custom_list_size

Description: Update the cached sizes of all custom lists.

Schedule: 0 0 * * * (At minute 0 of hour 0)

Notes:

  • It can be expensive to compute the size of a custom list in real time, and list sizes do not change frequently. This script updates the cached count of items in every custom list.

search_index_clear

Description: Mark all Works as having out-of-date search index entries. This guarantees that the search index is periodically rebuilt from scratch, providing automatic recovery from bugs and major metadata changes.

Schedule: 10 0 * * * (At minute 10 of hour 0)

Notes:

  • This script invalidates every entry in the search index, and is run once per day. The index itself is not destroyed, but by invalidating the index entries of all works in workcoveragerecords, we guarantee that the index entry for each one will be refreshed at least daily.

cache_marc_files

Description: Refresh and store the MARC files for lanes.

Schedule: 0 1 * * * (At minute 0 of hour 1)

Notes:

  • This script generates incremental MARC records and uploads them to an S3 bucket, in case a library wants to take data from their Circulation Manager and upload it to their ILS.

database_reaper

Description: Remove miscellaneous expired things (Credentials, CachedFeeds, Loans, etc.) from the database.

Schedule: 0 2 * * * (At minute 0 of hour 2)

Notes:

  • Acts as an umbrella script for a number of similar reaper jobs.
  • Used principally when a model class includes a timestamp field, and has rules for how old that timestamp can be before the record is expired and should be deleted.
  • The reapers run by this are child classes of ReaperMonitor, and each child class defines the model/table, the name of the timestamp field, and the maximum age of a record.
  • There is a registry of reaper monitors and scrubber monitors, and all of the registered classes are run by this script; the sketch below illustrates the registration pattern.
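
A minimal sketch of that pattern, operating on plain dicts instead of database models (the class and field names are invented):

from datetime import datetime, timedelta, timezone

REGISTRY = []

class ReaperSketch:
    """Subclasses declare a table, a timestamp field, and a maximum age."""

    table = timestamp_field = max_age = None

    def __init_subclass__(cls):
        REGISTRY.append(cls)  # registration happens at class-definition time

    def reap(self, rows):
        cutoff = datetime.now(timezone.utc) - self.max_age
        return [r for r in rows if r[self.timestamp_field] >= cutoff]

class CredentialReaperSketch(ReaperSketch):
    table = "credentials"
    timestamp_field = "expires"
    max_age = timedelta(days=1)

for reaper_class in REGISTRY:  # the umbrella script just walks the registry
    print("running", reaper_class.__name__)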

overdrive_new_titles

Description: Look for new titles added to Overdrive collections which slipped through the cracks.

Schedule: 0 3 * * * (At minute 0 of hour 3)


update_nyt_best_seller_lists

Description: Bring in the entire history of all NYT best-seller lists.

Schedule: 30 3 * * * (At minute 30 of hour 3)


axis_reaper

Description: Monitor the Axis collection by looking for books that have been removed.

Schedule: 0 4 * * * (At minute 0 of hour 4)


opds_for_distributors_import_monitor

Description: Update the circulation manager server with new books from OPDS import collections that have authentication.

Schedule: 0 4 * * * (At minute 0 of hour 4)


overdrive_format_sweep

Description: Sweep through our Overdrive collections updating delivery mechanisms.

Schedule: 0 4 * * * (At minute 0 of hour 4)


bibliotheca_circulation_sweep

Description: Sweep through our Bibliotheca collections verifying circulation stats.

Schedule: 0 5 * * * (At minute 0 of hour 5)


opds_import_monitor

Description: Update the circulation manager server with new books from OPDS import collections.

Schedule: 0 5 * * * (At minute 0 of hour 5)


saml_monitor

Description: Refreshes SAML federated metadata. Note that the monitor looks for federations in the samlfederations table. Currently, there is no way to configure SAML federations in the admin interface.

Schedule: 0 5 * * * (At minute 0 of hour 5)


opds2_import_monitor

Description: Update the circulation manager server with new books from OPDS 2.0 import collections.

Schedule: 30 5 * * * (At minute 30 of hour 5)


odl_import_monitor

Description: Update the circulation manager server with new books from ODL collections.

Schedule: 0 6 * * * (At minute 0 of hour 6)


update_lane_size

Description: Update the cached sizes of all lanes.

Schedule: 0 10 * * * (At minute 0 of hour 10)


metadata_upload_coverage

Description: Upload information to the metadata wrangler.

Schedule: 30 21 * * * (At the 30th minute of hour 21)

Notes:

  • Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.

metadata_wrangler_collection_reaper

Description: Remove unlicensed items from the remote metadata wrangler Collection.

Schedule: 0 */22 * * * (At minute 0 of every 22nd hour)

Notes:

  • Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.

work_classify_unchecked_subjects

Description: (Re)calculate the presentation of works associated with unchecked subjects.

Schedule: 30 22 * * * (At minute 30 of hour 22)

Notes:

  • In the Subjects model, there is a boolean field named checked. This script looks for subjects whose checked value is False, and reclassifies all works that fall under that subject.
  • This script is typically triggered by a database migration related to a set of changed rules around subject matter classification. As an example, at one point we had an incorrect rule set for classifying "Urban Fiction" and "Urban Fantasy," leading to works being incorrectly classified and indexed. We updated the classification rules, then created a database migration which set the checked field to False for those subjects. The next time this script ran, it reclassified all works related to those subjects.

opds_entry_coverage

Description: Make sure all presentation-ready works have up-to-date OPDS entries.

Schedule: 40 22 * * * (At the 40th minute of hour 22)

Notes:

  • Normally when there is a change to a book's metadata or its presentation edition, we immediately generate a new cached OPDS entry for it. Consequently this script often has no work, unless a database migration has cleared cached feeds or changed the rules by which they are generated.
  • This script will run for works which do not have corresponding records in workcoveragerecords.
  • As an example of migrations which create work for this script, see 20180117-regenerate-opds-for-dc-issued.sql in the server_core repository.

rbdigital_collection_delta

Description: Make sure an RBDigital collection is up to date.

Schedule: 0 23 * * * (At minute 0 of hour 23)

Notes:

  • Since Overdrive acquired RBdigital, this script's functionality has been superseded by the various Overdrive scripts. It will likely be removed in a future release.

rbdigital_initial_import

Description: Perform the initial population of an RBdigital collection with works from the RBdigital content server.

Schedule: 0 23 * * * (At minute 0 of hour 23)

Notes:

  • Since Overdrive acquired RBdigital, this script's functionality has been superseded by the various Overdrive scripts. It will likely be removed in a future release.

marc_record_coverage

Description: Make sure all presentation-ready works have up-to-date MARC records.

Schedule: 40 23 * * * (At the 40th minute of hour 23)

Notes:

  • MARC records represent bibliographic information, and may be in binary format or XML. This script is responsible for generating a MARC record for any work which does not have a corresponding record in the workcoveragerecords table.
  • Since MARC records can contain library specific URLs, changing a library's configuration settings in the Circulation Manager can lead to MARC files needing to be regenerated.

Scripts that run less than once a day

feedbooks_import_monitor

Description: Update the circulation manager server with new books from Feedbooks collections.

Schedule: 0 0 */7 * * (At minute 0 of hour 0 on every 7th day of the month)


proquest_import_monitor

Description: Import ProQuest OPDS 2.0 feeds into Circulation Manager.

Schedule: 0 7 * * 2,5 (At minute 0 of hour 7 on Tuesdays and Fridays)


enki_reaper

Description: Monitor the Enki collection by looking for books with lost licenses.

Schedule: 0 0 1 * * (At minute 0 of hour 0 on the 1st day of the month)


opds_for_distributors_reaper_monitor

Description: Update the circulation manager server to remove books that have been removed from OPDS for distributors collections.

Schedule: 0 0 2 * * (At minute 0 of hour 0 on the 2nd day of the month)


metadata_wrangler_auxiliary_metadata

Description: Monitor metadata requests from the Metadata Wrangler remote collection.

Schedule: 0 3 * * 3 (At minute 0 of hour 3 on Wednesdays)

Notes:

  • Because the Metadata Wrangler is currently deprecated, scripts related to it perform no work.

novelist_update

Description: Get all ISBNs for all collections in a library and send to NoveList.

Schedule: 0 0 * * 0 (At minute 0 of hour 0 on Sundays)

Notes:

  • SimplyE may display, on a work's detail page, recommendations of other related works. Those recommendations come from NoveList, a paid service.
  • Once a week, we make a large request to the NoveList API for every ebook ISBN in the system, in order to register those ISBNs with NoveList. Then, when someone is actually using the app and requests a book detail page, we make a separate request for recommendations of related books in our collections.