Improve the project structure and all codes in the subdirectories of relations directory.
open-prophetdb · Nov 12, 2024 · 39c1c7d
1 parent 0d8f1b2
commit 39c1c7d
Showing 34 changed files with 305,939 additions and 85,656 deletions.
diff --git a/graph_data/.gitignore b/graph_data/.gitignore
@@ -7,7 +7,7 @@
 !/relations/README.md
 /entities.tsv
 /entities_full.tsv
-/obsolete_entities.tsv
+/entities_obsolete.tsv
 /.entities.tmp.tsv
 /entities.log
 /relations.tsv

diff --git a/graph_data/KG_README.md b/graph_data/KG_README.md
@@ -68,12 +68,7 @@ NOTE: If you add a new entity type, you should change the merge_entities.py file
 
 ```bash
 # Merge formatted entity files into one file, we will get three files: entities.tsv [after deduplication], entities_full.tsv [before deduplication], entities.log [the log file for deduplication]
-python graph_data/scripts/merge_entities.py to-single-file -i graph_data/formatted_entities -o graph_data/entities.tsv --deep-deduplication
-
-# Remove all obsolete entities
-cp graph_data/entities.tsv graph_data/.entities.tmp.tsv
-grep -v '\tobsolete' graph_data/.entities.tmp.tsv > graph_data/entities.tsv
-grep '\tobsolete' graph_data/.entities.tmp.tsv > graph_data/obsolete_entities.tsv
+python graph_data/scripts/merge_entities.py to-single-file -i graph_data/formatted_entities -o graph_data/entities.tsv --deep-deduplication --remove-obsolete
 ```
 
 ### Relations
@@ -83,11 +78,14 @@ grep '\tobsolete' graph_data/.entities.tmp.tsv > graph_data/obsolete_entities.ts
 ```bash
 # Extract relations from a set of databases
 
-# Clean the formatted relations folder
+## Clean the formatted relations folder
 rm -rf graph_data/formatted_relations
 
-# NOTE: You might need to prepare a relation_types.tsv file from the relation_types.xlsx file.
-graph-builder --database ctd --database drkg --database primekg --database hsdn -d ./graph_data/relations -o ./graph_data/formatted_relations -f ./graph_data/entities.tsv -n 20 --download --skip -l ./graph_data/log.txt --debug --relation-type-dict-fpath ./graph_data/relation_types.tsv
+## STEP1: The graph-builder tool automatically supports only the following databases: CTD, DRKG, PrimeKG, and HSDN. Other databases are included in the relations folder; you may need to format them manually by running the main.ipynb file in each subfolder (e.g. `biosnap`, `cbcg`, `dgidb`, `ttd`).
+
+## STEP2: Run the graph-builder tool to format the preset databases. You might need to prepare a relation_types.tsv file from the relation_types.xlsx file. If you don't want to format the relation types at this step, omit the --relation-type-dict-fpath option.
+# graph-builder --database ctd --database drkg --database primekg --database hsdn -d ./graph_data/relations -o ./graph_data/formatted_relations -f ./graph_data/entities.tsv -n 20 --download --skip -l ./graph_data/log.txt --debug --relation-type-dict-fpath ./graph_data/relation_types.tsv
+graph-builder --database ctd --database drkg --database primekg --database hsdn -d ./graph_data/relations -o ./graph_data/formatted_relations -f ./graph_data/entities.tsv -n 20 --download --skip -l ./graph_data/log.txt --debug
 ```
 
 #### Merge all formatted relations into one file
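
The `--remove-obsolete` flag added in the KG_README.md hunk above replaces three manual `cp`/`grep` steps. How `merge_entities.py` implements that flag is not shown in this diff, so the following is only a minimal sketch of what the removed shell steps were doing: splitting rows into kept and obsolete sets. The function name `split_obsolete`, the sample rows, and the column names are hypothetical; the marker is assumed to be a literal tab followed by `obsolete`, which is what the removed `grep` pattern appears intended to match.

```python
from typing import Iterable, List, Tuple


def split_obsolete(
    lines: Iterable[str], marker: str = "\tobsolete"
) -> Tuple[List[str], List[str]]:
    """Split TSV lines into (kept, obsolete) by a substring marker.

    Hypothetical helper: it mirrors the removed `grep -v` / `grep` pair,
    not the actual logic inside merge_entities.py.
    """
    kept: List[str] = []
    obsolete: List[str] = []
    for line in lines:
        # A line counts as obsolete if any field starts with "obsolete".
        if marker in line:
            obsolete.append(line)
        else:
            kept.append(line)
    return kept, obsolete


# Illustrative rows only; the real entities.tsv columns may differ.
rows = [
    "id\tname\tlabel\n",
    "X:1\tfoo\tGene\n",
    "X:2\tobsolete bar\tGene\n",
]
kept, obsolete = split_obsolete(rows)
```

As with the original `grep -v`, the header row stays with the kept rows because it does not contain the marker, and the obsolete list carries no header.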

diff --git a/graph_data/extra/README.md b/graph_data/extra/README.md
@@ -1,3 +1,5 @@
 ### Extra Directory
 
-It's used to expand the convertible ids of the graph. Such as adding the MedDRA IDs to related entities by using the MedDRA to UMLS mapping.
+All files that the graph building process depends on are stored in the `extra` folder.
+
+For example, mapping files such as the MedDRA to UMLS mapping, which is used to add MedDRA IDs to related entities (this expands the convertible IDs of the graph).
diff --git a/graph_data/relations/README.md b/graph_data/relations/README.md
@@ -14,22 +14,24 @@ graph_data/relations
 
 We have already implemented a python package [graph-builder](https://github.com/open-prophetdb/graph-builder) to download, extract, format and build the knowledge graph from the several databases, such as [ctd](http://ctdbase.org/), [drkg](https://github.com/gnn4dr/DRKG), [hsdn](https://www.nature.com/articles/ncomms5212), [primekg](https://github.com/mims-harvard/PrimeKG) etc.
 
-If you want to build a knowledge graph for BioMedGPS project step by step by yourself, you can follow the instructions in the [KG_README.md](./graph_data/KG_README.md) file.
+If you want to build a knowledge graph for the BioMedGPS project step by step yourself, you can follow the `Relations` section in the [KG_README.md](./graph_data/KG_README.md) file.
 
 
 ## For Developers
 
 ### How to add a new database
 
-If you want to add a new database to the knowledge graph, you need to finish the following two steps:
+If you want to add a new database to the knowledge graph, you need to complete one of the following two steps:
 
 1. Create a new folder in the `graph_data/relations` folder, and write a main.ipynb file to introduce the database and show how to extract/convert the database to the BioMedGPS format.
 
     > You need to write codes to download, extract, convert the database to the BioMedGPS format in the main.ipynb file. You can refer to the existing main.ipynb files in the `graph_data/relations` folder as examples.
     > 
-    > The main.ipynb file should read the raw data from the database and write the processed data to the database folder in the BioMedGPS format. The output files should be named as `processed_xxx.tsv` or `invalid_xxx.tsv`, where `xxx` is the name of the database. These files will be added to the git repository.
+    > The main.ipynb file should read the raw data from the database and write the formatted data to the database folder in the BioMedGPS format. The output files should be named as `formatted_xxx.tsv` or `invalid_xxx.tsv`, where `xxx` is the name of the database. These files will be added to the git repository.
     > 
     > Also, if possible, you can write descriptions about the database, the data source, the data license, the data usage, etc. in the main.ipynb file.
 
 2. Add a new parser in the `graph-builder` package to map the entities and relation_types in the new database to the entities.tsv and relation_types.tsv files that are used in the BioMedGPS project.
 
+If you're going to add a custom database, we recommend following the first step. If it's a public database, you can follow the second step, or start with the first step and then add a new parser in the `graph-builder` package once the processing code is ready.
+
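
The renamed output convention in the hunk above (`formatted_xxx.tsv` and `invalid_xxx.tsv`) can be sketched as a small path helper for use inside a `main.ipynb`. This is a hedged illustration, not code from the repository: the function name `output_paths` is hypothetical, and it assumes that `xxx` is the same as the database folder name, which the README does not state explicitly.

```python
from pathlib import Path
from typing import Tuple


def output_paths(database_dir: str) -> Tuple[Path, Path]:
    """Return (formatted, invalid) output paths for a database folder.

    Hypothetical helper; assumes the `xxx` in formatted_xxx.tsv equals
    the name of the database folder under graph_data/relations.
    """
    db_dir = Path(database_dir)
    name = db_dir.name
    return db_dir / f"formatted_{name}.tsv", db_dir / f"invalid_{name}.tsv"


# Example with one of the subfolders named in KG_README.md.
formatted, invalid = output_paths("graph_data/relations/biosnap")
```

Deriving both names from one place keeps a notebook from drifting back to the older `processed_` prefix.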
diff --git a/graph_data/relations/biosnap/README.md b/graph_data/relations/biosnap/README.md