Supported Input Formats

Input formats

Currently, Reach can read papers formatted as .nxml, .txt, or .tsv/.csv, as well as the serialized .ser files produced by its pre-processing step.

`.nxml` files

Nearly all of the papers in the Open Access subset of PubMed can be retrieved as .nxml.

As an example, we'll retrieve the .nxml files for PMC1234 and PMC1235 using a Python script.

wget https://gist.githubusercontent.com/myedibleenso/f233359445461a71ad37017393fe921f/raw/982275ad8d5070e8c0bc5c07edcfec1cd804c611/fetch_nxml.py

python fetch_nxml.py --pmcids PMC1234 PMC1235

`.tsv` and `.csv` files

The template for .tsv/.csv files can be retrieved with the following command:

wget https://gist.githubusercontent.com/myedibleenso/fb1f858a5664e12ff0448f4468b60842/raw/4eab1991eae4c89b1d5dffcb8c317bcd2f3cadd1/input-template.tsv

The first three columns of the .tsv/.csv files should be 1) the paper's name, 2) the name of the section, and 3) the text for that section.

NOTE: Include a header in each .tsv/.csv file. By default, the system will drop the first row of the file when reading, since it expects this to be the header.
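As a rough illustration of the layout described above, the following sketch writes a .tsv file with a header row followed by one row per section. The function name `write_reach_tsv` and the header labels are hypothetical (the text above only fixes the column order), and collapsing tabs and newlines inside the section text is an assumption made here to keep each section on a single row.

```python
import csv

# Hypothetical header labels; only the column order comes from the description above.
HEADER = ["paper", "section", "text"]


def write_reach_tsv(path, rows):
    """Write (paper name, section name, section text) rows to a .tsv file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        # The first row is expected to be a header and is dropped on reading.
        writer.writerow(HEADER)
        for paper, section, text in rows:
            # Assumption: collapse tabs and newlines so each section stays on one row.
            writer.writerow([paper, section, " ".join(text.split())])
    return path
```

For example, `write_reach_tsv("papers.tsv", [("PMC1234", "abstract", "Some abstract text.")])` would produce a two-line file: the header and a single section row.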

`.txt` files

You can simply dump the raw text that Reach should read into a .txt file. Note that Reach will not attempt extensive preprocessing of such files, so you are advised to perform your own cleanup of the file first (removing LaTeX, acknowledgements, references, etc.).
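The kind of cleanup suggested above might look like the following minimal sketch. It is not part of Reach: the function `clean_text` and its regular expressions are illustrative assumptions, and they only cover simple cases (a back-matter heading on its own line, and LaTeX commands without nested braces). Everything after the first matching heading is discarded, which may be too aggressive for some papers.

```python
import re

# Assumption: back matter starts at a heading that sits alone on its line.
BACK_MATTER = re.compile(
    r"^[ \t]*(?:acknowledge?ments?|references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Commands whose argument carries no readable text (citations, labels, cross-references).
DROP_WITH_ARG = re.compile(r"[ \t]*\\(?:cite|ref|label)\w*\*?\{[^{}]*\}")
# Any other command name, e.g. \textbf; its braced argument is kept as plain text.
COMMAND_NAME = re.compile(r"\\[a-zA-Z]+\*?")


def clean_text(raw):
    """Return raw text with back matter and simple LaTeX markup removed."""
    match = BACK_MATTER.search(raw)
    if match:
        # Drop the heading and everything after it.
        raw = raw[:match.start()]
    raw = DROP_WITH_ARG.sub("", raw)
    raw = COMMAND_NAME.sub("", raw)
    raw = raw.replace("{", "").replace("}", "")
    # Collapse runs of spaces and tabs left behind by the removals.
    raw = re.sub(r"[ \t]+", " ", raw)
    return raw.strip()
```

The cleaned string can then be written to a .txt file in the input directory.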

`.ser` files

Reach finishes processing the .ser files generated by org.clulab.reach.RunAnnotationsCLI and saves the resulting mentions in all of the specified output formats.

Pre-processing

You can elect to pre-process input files and store serialized files containing the dependency parses and the numerous tag sequences for every document in the input directory.

To generate the serialized files, run the class org.clulab.reach.RunAnnotationsCLI, which uses the same configuration fields as org.clulab.reach.RunReachCLI. A .ser file will be generated in the output directory for each input file. To finish processing the serialized files, move them into the input directory and run org.clulab.reach.RunReachCLI.
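The step of moving the serialized files between the two runs can be scripted. The sketch below is only an illustration: `stage_ser_files` is a hypothetical helper, and the two directory arguments stand for whatever input and output directories your Reach configuration specifies. Running the two CLI classes themselves is not shown.

```python
import shutil
from pathlib import Path


def stage_ser_files(output_dir, input_dir):
    """Move every .ser file from output_dir into input_dir and return the moved file names."""
    src, dst = Path(output_dir), Path(input_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for ser in sorted(src.glob("*.ser")):
        # Only .ser files are moved; other output files are left in place.
        shutil.move(str(ser), str(dst / ser.name))
        moved.append(ser.name)
    return moved
```

After calling this helper with your configured directories, org.clulab.reach.RunReachCLI can be run as usual to finish processing.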
