Enrique Noriega edited this page Jan 27, 2021 · 3 revisions

Running multiple papers

Step 1: Prepare your input and output directories

Create a directory for the documents to be read by Reach:

mkdir -p path/to/my/input/directory

Move the papers you want Reach to process into this directory, and make sure each one is in one of our supported formats. Details on these formats, including instructions on how to retrieve papers formatted as .nxml from Open Access, can be found here.
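Before starting a long run, it can help to confirm that nothing unexpected is sitting in the input directory. The sketch below is one way to do that, assuming the accepted extensions are the ones this page names for papersDir (.nxml, .csv, .tsv, and .txt) and that only the top level of the directory matters. list_unsupported_inputs is a hypothetical helper name, not part of Reach.

```shell
# Hypothetical helper (not part of Reach): print every regular file directly
# inside the given directory whose extension is NOT .nxml, .csv, .tsv, or .txt.
# Assumption: only the top level of the directory is checked.
list_unsupported_inputs() {
  local dir="$1"
  find "$dir" -maxdepth 1 -type f \
    ! -name '*.nxml' ! -name '*.csv' ! -name '*.tsv' ! -name '*.txt' \
    | sort
}
```

Calling list_unsupported_inputs path/to/my/input/directory prints nothing when every file has one of the listed extensions. An extension check says nothing about whether the file contents are well formed.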

Step 2: Configure application.conf

NOTE: This section assumes you've already cloned the reach repository locally (git clone https://github.com/clulab/reach.git).

Before running things, there are a few properties that may need to be updated in the project's config file. You can find the application.conf file at reach/main/src/main/resources/application.conf

  • rootDir

    • This is the default top-level root directory path for input and output files and subdirectories. All other paths are based on this path but any or all can be changed individually.
  • papersDir

    • This is the path to the directory that stores the input .nxml, .csv, .tsv, and .txt files (i.e., the path/to/my/input/directory mentioned in Step 1).
    • This directory must exist before Reach is run.
  • outDir

    • This is the directory where the output files containing the extracted mentions (results) will be saved.
    • If this directory doesn't already exist, it will be created at runtime.
  • outputTypes

    • A list of output formats for the results (more than one may be specified).
    • fries will produce a series of JSON files, one for each paper.
  • threadLimit

    • Use this to specify the number of papers to attempt to process in parallel.
    • Note that as you increase parallelization, you will also need to allocate more memory (RAM) in the project's .sbtopts file.
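To see what the properties above are currently set to before editing them, a quick search of the config file is usually enough. The following is a minimal sketch, assuming each property sits on its own line as a top-level `name = value` (or `name: value`) entry and that the command is run from the directory containing the reach checkout. show_reach_settings is a hypothetical helper name, not part of Reach.

```shell
# Hypothetical helper (not part of Reach): print, with line numbers, the lines
# of application.conf that set the main properties described above.
# Assumption: each property is written as "name = value" or "name: value" at
# the start of a line; commented-out lines are not matched.
show_reach_settings() {
  local conf="${1:-reach/main/src/main/resources/application.conf}"
  grep -nE '^[[:space:]]*(rootDir|papersDir|outDir|outputTypes|threadLimit)[[:space:]]*[=:]' "$conf"
}
```

With no argument the helper reads the path given in this step; pass another path to inspect a copy of the file.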

Additional properties

  • logging.logfile

    • Specify the path where the log file should be written.
  • logging.loglevel

    • Specify the level for logging. Default: INFO level.
  • ignoreSections

    • A list of paper sections that should be ignored when processing the input papers. In order to be ignored, these strings must match the relevant fields in the nxml or tsv input files exactly.
  • restart.useRestart

    • Specify whether to log successfully processed input papers and skip logged papers on subsequent processing runs. See the Restart Capability section below for more information.
  • restart.logfile

    • Specify the path where the restart log file should be written. By default, the restart log file is written to the output directory, that is, the directory specified by the outDir configuration variable (above).

Step 3: Run ReachCLI

Once application.conf has been configured (Step 2), Reach can be run:

sbt 'runMain org.clulab.reach.RunReachCLI'

Restart Capability

If the restart capability is enabled by the restart.useRestart flag (true by default), ReachCLI will append the name of each successfully processed input file (one per line) to a log file (by default restart.log). The restart log file is located, by default, in the OUTPUT directory, as Reach might not have write permission on the input directory.

When ReachCLI starts up it looks for and reads the restart log file to find which input files it can SKIP. The restart log file can be empty or missing, in which case ReachCLI will process (or reprocess if restarted) all input files. Input files which fail to process are not written to the restart log file. You can manually edit this text file to control which files are skipped during the run.
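Because the restart log is plain text with one name per line, it is easy to check which inputs are still pending before restarting a run. The sketch below assumes the names in the log match the base names of the files in the input directory; if your log records full paths instead, the comparison would need adjusting. pending_inputs is a hypothetical helper name, not part of Reach.

```shell
# Hypothetical helper (not part of Reach): list the input files that are NOT
# recorded in the restart log, i.e. those a restarted run would still process.
# Assumption: the log holds base file names, one per line.
pending_inputs() {
  local papers_dir="$1" restart_log="$2"
  if [ -s "$restart_log" ]; then
    # -v: keep non-matching lines, -x: match whole lines only,
    # -F: treat log entries as fixed strings, -f: read them from the log file.
    # "|| true" keeps the exit status at 0 when nothing is pending.
    ls -1 "$papers_dir" | grep -vxFf "$restart_log" || true
  else
    # Empty or missing log: every input file is pending.
    ls -1 "$papers_dir"
  fi
}
```

For example, pending_inputs path/to/my/input/directory path/to/restart.log prints one pending file name per line, where both paths are placeholders for your own papersDir and restart log locations.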

Reach API and Internal Web Service

Reach now includes a small internal web service which supports an HTTP-based API (Application Programming Interface) and a web GUI to upload and process a single paper at a time.

NOTE: The internal web service is intended for limited, local, private use. It is not secure and should not be exposed outside of your local firewall.

NOTE: Do not confuse this internal web service with the separate Server process which Reach now requires to run.

To run the web service, you must first start the Server process (same as Step 3 above), then start the web service. Both steps are accomplished by the following:

sbt 'runMain org.clulab.reach.export.server.ApiServer'

Open a browser window to port 8080 on your localhost to view documentation about using the HTTP-based API: http://localhost:8080/

Please remember that the first submission can take several minutes, as Reach must load the necessary model files. Subsequent calls for processing will be much faster.

File Uploading GUI

The Reach internal web service also supports a 'File Uploader' page for uploading and processing a single input file. The File Uploader page is available at the URL:

http://localhost:8080/uploader

Several different output types are supported and are selected in the GUI dropdown box:

  1. "fries"
  • The default output; in FRIES consortium JSON format. Results are formatted as JSON with sentence, entity, and event mentions separated.
  2. "serial-json"
  • All document annotations and mentions are serialized to JSON.
  • WARNING: this format is voluminous and can produce tens of megabytes of output. Use it with caution.
  • see a small example output