Improve the project structure and all codes in the subdirectories of relations directory.
yjcyxky committed Nov 12, 2024
1 parent 0d8f1b2 commit 39c1c7d
Showing 34 changed files with 305,939 additions and 85,656 deletions.
2 changes: 1 addition & 1 deletion graph_data/.gitignore
@@ -7,7 +7,7 @@
!/relations/README.md
/entities.tsv
/entities_full.tsv
-/obsolete_entities.tsv
+/entities_obsolete.tsv
/.entities.tmp.tsv
/entities.log
/relations.tsv
16 changes: 7 additions & 9 deletions graph_data/KG_README.md
@@ -68,12 +68,7 @@ NOTE: If you add a new entity type, you should change the merge_entities.py file

```bash
# Merge formatted entity files into one file, we will get three files: entities.tsv [after deduplication], entities_full.tsv [before deduplication], entities.log [the log file for deduplication]
-python graph_data/scripts/merge_entities.py to-single-file -i graph_data/formatted_entities -o graph_data/entities.tsv --deep-deduplication
-
-# Remove all obsolete entities
-cp graph_data/entities.tsv graph_data/.entities.tmp.tsv
-grep -v '\tobsolete' graph_data/.entities.tmp.tsv > graph_data/entities.tsv
-grep '\tobsolete' graph_data/.entities.tmp.tsv > graph_data/obsolete_entities.tsv
+python graph_data/scripts/merge_entities.py to-single-file -i graph_data/formatted_entities -o graph_data/entities.tsv --deep-deduplication --remove-obsolete
```
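
The removed `cp`/`grep` commands and the `--remove-obsolete` flag both serve to separate obsolete entities from `entities.tsv`. This diff does not show how `merge_entities.py` implements the flag, so the following is only a minimal Python sketch of the filtering that the `grep` commands describe, assuming the intent was to match a literal tab followed by `obsolete`. The function name `split_obsolete` and its parameters are hypothetical.

```python
from pathlib import Path


def split_obsolete(src: Path, kept: Path, obsolete: Path) -> tuple[int, int]:
    """Split a TSV file line by line; return (kept_count, obsolete_count).

    Assumption: a line is "obsolete" when it contains a tab immediately
    followed by the text "obsolete", which is what the grep pattern suggests.
    """
    n_kept = 0
    n_obsolete = 0
    # newline="" keeps the original line endings untouched.
    with src.open(encoding="utf-8", newline="") as fin, \
            kept.open("w", encoding="utf-8", newline="") as fkept, \
            obsolete.open("w", encoding="utf-8", newline="") as fobs:
        for line in fin:
            if "\tobsolete" in line:
                fobs.write(line)
                n_obsolete += 1
            else:
                # The header line normally lands here as well.
                fkept.write(line)
                n_kept += 1
    return n_kept, n_obsolete
```

As with the `cp` step in the shell version, `src` should be a copy of the input (for example `graph_data/.entities.tmp.tsv`) so that `kept` can safely be written to `graph_data/entities.tsv`.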

### Relations
@@ -83,11 +78,14 @@ grep '\tobsolete' graph_data/.entities.tmp.tsv > graph_data/obsolete_entities.ts
```bash
# Extract relations from a set of databases

-# Clean the formatted relations folder
+## Clean the formatted relations folder
rm -rf graph_data/formatted_relations

-# NOTE: You might need to prepare a relation_types.tsv file from the relation_types.xlsx file.
-graph-builder --database ctd --database drkg --database primekg --database hsdn -d ./graph_data/relations -o ./graph_data/formatted_relations -f ./graph_data/entities.tsv -n 20 --download --skip -l ./graph_data/log.txt --debug --relation-type-dict-fpath ./graph_data/relation_types.tsv
+## STEP1: The graph-builder tool automatically supports only the following databases: CTD, DRKG, PrimeKG, and HSDN. Other databases are included in the relations folder, and you may need to format them manually by running the main.ipynb file in each subfolder, such as `biosnap`, `cbcg`, `dgidb`, and `ttd`.
+
+## STEP2: Run the graph-builder tool to format the preset databases. You might need to prepare a relation_types.tsv file from the relation_types.xlsx file. If you don't want to format the relation types at this step, omit the --relation-type-dict-fpath option.
+# graph-builder --database ctd --database drkg --database primekg --database hsdn -d ./graph_data/relations -o ./graph_data/formatted_relations -f ./graph_data/entities.tsv -n 20 --download --skip -l ./graph_data/log.txt --debug --relation-type-dict-fpath ./graph_data/relation_types.tsv
+graph-builder --database ctd --database drkg --database primekg --database hsdn -d ./graph_data/relations -o ./graph_data/formatted_relations -f ./graph_data/entities.tsv -n 20 --download --skip -l ./graph_data/log.txt --debug
```
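
The two variants of the `graph-builder` command differ only in whether `--relation-type-dict-fpath` is passed. As a rough sketch, the command could be assembled in Python so that the option is appended only when a relation type file is supplied. The helper name `build_graph_builder_cmd` is hypothetical; every flag and path is copied from the commands above, and nothing here is part of the `graph-builder` package itself.

```python
import shlex

# Databases that the README says graph-builder handles automatically.
DATABASES = ["ctd", "drkg", "primekg", "hsdn"]


def build_graph_builder_cmd(relation_type_dict=None):
    """Return the graph-builder command as an argument list.

    relation_type_dict: optional path to relation_types.tsv; when it is
    None the --relation-type-dict-fpath option is left out entirely.
    """
    cmd = ["graph-builder"]
    for db in DATABASES:
        cmd += ["--database", db]
    cmd += [
        "-d", "./graph_data/relations",
        "-o", "./graph_data/formatted_relations",
        "-f", "./graph_data/entities.tsv",
        "-n", "20",
        "--download",
        "--skip",
        "-l", "./graph_data/log.txt",
        "--debug",
    ]
    if relation_type_dict is not None:
        cmd += ["--relation-type-dict-fpath", relation_type_dict]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_graph_builder_cmd()))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would run the tool, assuming `graph-builder` is installed and available on `PATH`.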

#### Merge all formatted relations into one file
4 changes: 3 additions & 1 deletion graph_data/extra/README.md
@@ -1,3 +1,5 @@
### Extra Directory

-It's used to expand the convertible ids of the graph. Such as adding the MedDRA IDs to related entities by using the MedDRA to UMLS mapping.
+All dependent files for the graph building process are stored in the `extra` folder.
+
+These include mapping files, such as the MedDRA to UMLS mapping, which is used to add MedDRA IDs to related entities and so expand the convertible IDs of the graph.
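
To illustrate what expanding the convertible IDs could look like, here is a minimal Python sketch that adds mapped MedDRA IDs to entities that already carry a UMLS ID. The data layout is an assumption: the README does not specify how the mapping file or the entity cross-references are stored, and the function name `expand_xrefs` and all IDs in the usage comment are made-up placeholders.

```python
from __future__ import annotations


def expand_xrefs(
    entity_xrefs: dict[str, set[str]],
    umls_to_meddra: dict[str, list[str]],
) -> dict[str, set[str]]:
    """Return a copy of entity_xrefs with mapped MedDRA IDs added.

    entity_xrefs: entity ID -> set of cross-reference IDs (hypothetical layout).
    umls_to_meddra: UMLS ID -> list of MedDRA IDs (hypothetical layout).
    """
    expanded: dict[str, set[str]] = {}
    for entity_id, xrefs in entity_xrefs.items():
        new_xrefs = set(xrefs)  # copy, so the input is not modified
        for xref in xrefs:
            new_xrefs.update(umls_to_meddra.get(xref, []))
        expanded[entity_id] = new_xrefs
    return expanded


# Example with placeholder IDs:
# expand_xrefs({"E1": {"UMLS:C0000001"}}, {"UMLS:C0000001": ["MedDRA:10000001"]})
```

Entities without a mapped UMLS ID are returned unchanged, so the step can be applied to the whole entity table.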
8 changes: 5 additions & 3 deletions graph_data/relations/README.md
@@ -14,22 +14,24 @@ graph_data/relations

We have already implemented a Python package, [graph-builder](https://github.com/open-prophetdb/graph-builder), to download, extract, format, and build the knowledge graph from several databases, such as [ctd](http://ctdbase.org/), [drkg](https://github.com/gnn4dr/DRKG), [hsdn](https://www.nature.com/articles/ncomms5212), and [primekg](https://github.com/mims-harvard/PrimeKG).

-If you want to build a knowledge graph for BioMedGPS project step by step by yourself, you can follow the instructions in the [KG_README.md](./graph_data/KG_README.md) file.
+If you want to build a knowledge graph for the BioMedGPS project step by step yourself, you can follow the `Relations` section in the [KG_README.md](./graph_data/KG_README.md) file.


## For Developers

### How to add a new database

-If you want to add a new database to the knowledge graph, you need to finish the following two steps:
+If you want to add a new database to the knowledge graph, you need to complete one of the following two steps:

1. Create a new folder in the `graph_data/relations` folder, and write a main.ipynb file that introduces the database and shows how to extract/convert the database to the BioMedGPS format.

> You need to write code in the main.ipynb file to download, extract, and convert the database to the BioMedGPS format. You can refer to the existing main.ipynb files in the `graph_data/relations` folder as examples.
>
-> The main.ipynb file should read the raw data from the database and write the processed data to the database folder in the BioMedGPS format. The output files should be named as `processed_xxx.tsv` or `invalid_xxx.tsv`, where `xxx` is the name of the database. These files will be added to the git repository.
+> The main.ipynb file should read the raw data from the database and write the formatted data to the database folder in the BioMedGPS format. The output files should be named `formatted_xxx.tsv` or `invalid_xxx.tsv`, where `xxx` is the name of the database. These files will be added to the git repository.
>
> Also, if possible, you can write descriptions about the database, the data source, the data license, the data usage, etc. in the main.ipynb file.
2. Add a new parser in the `graph-builder` package to map the entities and relation_types in the new database to the entities.tsv and relation_types.tsv files that are used in the BioMedGPS project.
+
+If you're going to add a custom database, we recommend following the first step. If it's a public database, you can follow the second step, or try the first step first and then add a new parser in the `graph-builder` package when the processing code is ready.
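
For the first step, the output naming rule (`formatted_xxx.tsv` and `invalid_xxx.tsv`) can be sketched in Python as below. This is only an illustration of the naming convention: the column layout of the BioMedGPS format is not given in this README, so the header is supplied by the caller, and the helper names `output_paths` and `write_split` as well as the `is_valid` check are hypothetical.

```python
import csv
from pathlib import Path


def output_paths(db_dir, db_name):
    """Return (formatted_path, invalid_path) for a database folder."""
    db_dir = Path(db_dir)
    return db_dir / f"formatted_{db_name}.tsv", db_dir / f"invalid_{db_name}.tsv"


def write_split(rows, is_valid, db_dir, db_name, header):
    """Write valid rows to formatted_<db>.tsv and the rest to invalid_<db>.tsv.

    rows: iterable of lists of strings, one list per relation.
    is_valid: caller-supplied function deciding whether a row is usable.
    header: column names; the real BioMedGPS columns are not defined here.
    """
    formatted_path, invalid_path = output_paths(db_dir, db_name)
    with formatted_path.open("w", encoding="utf-8", newline="") as f_ok, \
            invalid_path.open("w", encoding="utf-8", newline="") as f_bad:
        ok_writer = csv.writer(f_ok, delimiter="\t", lineterminator="\n")
        bad_writer = csv.writer(f_bad, delimiter="\t", lineterminator="\n")
        # Both files get the same header so they can be compared easily.
        ok_writer.writerow(header)
        bad_writer.writerow(header)
        for row in rows:
            (ok_writer if is_valid(row) else bad_writer).writerow(row)
    return formatted_path, invalid_path
```

In a main.ipynb file, `db_dir` would be the database's own folder under `graph_data/relations`, so that both output files end up next to the notebook and can be committed with it.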

5 changes: 0 additions & 5 deletions graph_data/relations/biosnap/README.md

This file was deleted.
