This program converts TSV files from the ConsensusPathDB (CPDB) into JSON-LD formatted JSON Lines.
.
┣ data/ # Default data path
┣ src/
┃ ┣ column_mapper/
┃ ┃ ┣ human.json # Definition file for mapping human TSV headers to IRIs
┃ ┃ ┣ mouse.json # Definition file for mapping mouse TSV headers to IRIs
┃ ┃ ┗ yeast.json # Definition file for mapping yeast TSV headers to IRIs
┃ ┣ cpdb2jsonld.py # Python program for CPDB conversion
┃ ┣ settings.py # Configuration file for the conversion program
┃ ┣ taxonomy.json # Definition file for mapping taxonomy names to taxonomy IDs
┃ ┣ urls.txt # List of URLs for CPDB TSV files
┃ ┗ context.jsonld # File defining the JSON-LD context
┣ Dockerfile # Definition file for the Docker image
┣ build_tsv2jsonld_cpdb.sh # Script to build the Docker image
┣ run_tsv2jsonld_cpdb.sh # Script to execute the conversion program
┣ pyproject.toml
┣ poetry.lock
┗ README.md
Execution within a Docker container is recommended.
- Python 3.11
- PyLD
- Typer
- Loguru
- Rich
Execute the script to build the Docker image.
bash build_tsv2jsonld_cpdb.sh
Once the build completes without errors, proceed to executing the program.
Execute the script to run the conversion program.
Two types of conversion scripts are available:
run_flow_tsv2jsonld_cpdb.sh
# Executes the entire flow from file download to conversion based on the URL definition filerun_tsv2jsonld_cpdb.sh
# Passes the path of the manually downloaded DB TSV file as an argument for conversion
bash run_flow_tsv2jsonld_cpdb.sh [options...]
Optional arguments.
--hide-progress
If specified, hides the progress bar that indicates the conversion progress.
--jsonld-output
If specified, after converting from TSV to .jsonl, this option generates .jsonld files based on the .jsonl file.
The .jsonld files are split and output in the <output_file_basename>_jsonld
folder created at the same directory level as the file specified for the .jsonl output.
The size of each .jsonld file is determined by the JSONLD_MAX_FILE_SIZE
in src/settings.py
.
bash run_tsv2jsonld_cpdb.sh <input_file> <output_file> [options...]
Example execution:
bash run_tsv2jsonld_cpdb.sh data/ConsensusPathDB_human_PPI output/cpdb_converted.jsonl
Specifies the path to the ConsensusPathDB TSV file to be input.
Specifies the path to the output JSON Lines file.
Optional arguments.
--taxonomy <taxonomy_name>
Specifies the taxonomy. If this option is not included, the taxonomy name contained in the file name will be used.
--hide-progress
If specified, hides the progress bar that indicates the conversion progress.
--jsonld-output
After converting to JSON Lines, this generates JSON-LD formatted JSON files from the output .jsonl file.
The .jsonld files are split and output in the <output_file_basename>_jsonld
folder created at the same directory level as the file specified for the .jsonl output.
The size of each .jsonld file is determined by the JSONLD_MAX_FILE_SIZE
in src/settings.py
.
Configure the conversion program using src/settings.py
, src/column_mapper/*.json
, src/context.jsonld
, src/taxonomy.json
, and src/urls.txt
.
Below is a list of settings in settings.py with their default values.
Setting Name | Default Value | Description |
---|---|---|
src_dir |
os.path.dirname(__file__) |
Path to the src directory |
DEBUG |
False |
Enable/disable debug mode |
URL_LIST_FILE_PATH |
os.path.join(src_dir, "urls.txt") |
Path to the file storing the URL list |
OUTPUT_DIR |
os.path.join(src_dir, "output/") |
Path to the directory where output files are saved |
COLUMN_MAPPER_DIR |
os.path.join(src_dir, "column_mapper/") |
Path to the directory storing column mappers |
HEADER_ROW_NUMBER |
2 |
Which line contains the header |
HEADER_ROW_PREFIX |
"# " |
Prefix for the header row |
CONTEXT_LOCAL_FILE_PATH |
os.path.join(src_dir, "context.jsonld") |
Path to the local context file |
CONTEXT_FILE_URI |
"http://example.com/context.jsonld" |
URI for the context file |
INFO_LOG_FILE_PATH |
os.path.join(src_dir, "info.log") |
Path to the log file |
ERROR_LOG_FILE_PATH |
os.path.join(src_dir, "error.log") |
Path to the error log file |
NODE_ID_COLUMN |
"uniprot_entry" |
Column name used as node ID |
NODE_ID_PREFIX |
"cpdb:" |
Prefix for the node ID |
NODE_TYPE |
"m2r:MacromolecularComplex" |
Type of the node |
DATA_SOURCE_PREFIX |
"http://identifiers.org/" |
Prefix for data sources |
REFERENCE_PREFIX |
"pmid:" |
Prefix for references |
PARTICIPANTS |
[ "uniprot_entry", "uniprot_id", ] |
List of participants |
TAXONOMY_FILE_PATH |
os.path.join(src_dir, "taxonomy.json") |
Path to the taxonomy definition file |
JSONLD_MAX_FILE_SIZE |
3 * 1024 * 1024 |
Maximum size of a JSON-LD file (in bytes) |
If the column headers in the ConsensusPathDB TSV file are changed, modifications to these definitions are necessary.
This is the definition file for the context used when converting to JSON-LD.
This definition file is for translating taxonomy names to taxonomy IDs.
This lists the URLs for the CPDB TSV files to be converted.