Developing BioRels
Thank you for your interest in contributing to and developing BioRels! We’re thrilled to have you on board. In this section, we’ll start by reviewing some essential prerequisites before moving on to high-level concepts related to job definitions. After that, we’ll dive deeper into the process of writing code. Let’s get started!
If you haven’t already, we strongly recommend that you review the following sections of this documentation:
- Configuration of the environment
- Directory structure
- CONFIG_GLOBAL
- CONFIG_JOB
- CONFIG_USER – User configuration file
We define a data source as any information generated and regularly provided by an external organization, whether private or public. Examples of data sources include UniProt and DrugBank. Typically, a data source relies on another source known as a parent data source. These dependencies can vary in importance, ranging from critical to non-critical.
Important
Critical dependency: If you ignore that parent, you will lose an important scientific concept.
Example: A UniProt record provides information about a protein in a given organism. The organism here is critical: if ignored, you lose essential scientific knowledge, namely which organism this protein is defined in.
Important
Non-critical dependency: If you ignore that parent, you will lose some related information that augments the data source.
Example: A UniProt record provides a list of external identifiers related to this protein record. If you ignore those identifiers, you don’t lose any scientific information that is necessary to define a protein record.
Before registering your new data source, you will need to ensure that all critical parents are already in BioRels. If not, you will need to register them and create the corresponding scripts.
Please open biorels.sing.txt in the $TG_DIR/BACKEND/CONTAINER directory. This file contains all the instructions to build the Singularity container, organized into different sections. To add the tools or packages that you need, please follow the sections below.
If you wish to add a Linux package or program that is available via yum, please go to the end of PART 1 – Linux packages. A commented line is available for you to add your own packages: just remove the # and add your packages.
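For instance, if your scripts needed the jq utility (a hypothetical package chosen purely for illustration), the uncommented line would end up looking something like the following; the exact phrasing of the template line in biorels.sing.txt may differ:

yum install -y jq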
Python is compiled during the container’s build to offer one of the latest Python versions. Python includes a package manager, pip, to install any additional packages. Please refer to PART 2 – Python Packages and add your packages there.
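As an illustration, assuming your processing scripts needed the requests package (a hypothetical choice), you would append a pip line such as the one below to PART 2, following whatever pattern that section already uses:

pip install requests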
If the package of interest is a native PHP library, please scroll down to PART 3 – PHP Packages and add the PHP library. Alternatively, if you need a non-native PHP library, such as one developed by the PHP community, you can use composer to add it. Composer is the PHP equivalent of pip for Python. To search for the different PHP packages, you can go to Packagist: https://packagist.org. Once you have the package name, please add it at the end of PART 3.
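For example, if you needed an HTTP client maintained by the PHP community, you could add a composer line like the one below at the end of PART 3 (guzzlehttp/guzzle is used here purely as an illustration of a package found on Packagist):

composer require guzzlehttp/guzzle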
The container’s build includes the gcc and g++ compilers. These compilers are used to compile and build bowtie, bowtie2, blast, EMBOSS, samtools and LillyMol. If you wish to add your own tools that require compilation, you will first have to add a line in the Download section to download the source code. Please refer to PART 4 – Compilation to add the download line using wget. Once done, add the compiling steps. A few guidelines are to be followed (a sketch is shown after this list):
- Create a directory in $TG_DIR/BACKEND/APPS/[APP_NAME]
- The source code should be unpacked into a directory under $TG_DIR/BACKEND/APPS/[APP_NAME]
- Remove the archive after unpacking to save space in the container
- Configure and compile within the APPS/[APP_NAME] directory
- Clean the object files after compiling
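Below is a minimal sketch of what the added steps could look like for a hypothetical tool called mytool 1.0 (the name, URL and build commands are placeholders; the wget line belongs in the Download section of PART 4 and the build steps follow, adapted to your tool’s actual build system):

wget https://example.org/mytool-1.0.tar.gz -P $TG_DIR/BACKEND/APPS/MYTOOL/
cd $TG_DIR/BACKEND/APPS/MYTOOL/
tar -xzf mytool-1.0.tar.gz && rm mytool-1.0.tar.gz    # unpack, then remove the archive to save space
cd mytool-1.0
./configure && make                                   # configure and compile within APPS/MYTOOL
find . -name "*.o" -delete                            # clean the object files after compiling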
For all compiled applications, it is strongly recommended to create a TOOL line in CONFIG_GLOBAL providing the path of the tool in the container. This avoids hard-coded paths in the scripts. Please refer to Tools (TOOL section) for the format.
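As a sketch only: a TOOL entry might look like the line below, assuming the same tab-separated keyword/name/value layout as the LINK entries described later on this page; please check the Tools (TOOL section) documentation for the exact format before adding it:

TOOL MYTOOL /path/to/mytool/inside/the/container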
To process a new data source, you’ll need to create one or more jobs (or scripts), each of which may have dependencies on other jobs within that data source. All work will be conducted in the $TG_DIR/BACKEND/SCRIPT directory. We understand that navigating numerous files and configuration changes in BioRels can be overwhelming. To simplify this process, we’ve developed an experimental script that generates these configurations for you based on predefined templates. While you can jump straight to the "Experimental Script Generator" section, we recommend reviewing the sections below first to familiarize yourself with the high-level concepts.
The first step is to create a new directory named after the data source, e.g. UNIPROT, CHEMBL, CHEBI. For our next steps, we will call it DATASOURCE. A few guidelines are proposed (see the example after this list):
- All uppercase
- With the same name as the data source. For Uniprot => UNIPROT
- Spaces replaced by _
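For example, for a hypothetical data source named MySource, applying the guidelines above gives:

mkdir $TG_DIR/BACKEND/SCRIPT/MYSOURCE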
The second step is creating the different jobs in that DATASOURCE directory. We recommend the architecture below. Please note that we don’t provide the extension here, since development can currently be done in both PHP and Python. Therefore, please choose the language of your choice and add the corresponding file extension.
- A ck_DATASOURCE_rel script: to be run daily to check whether DATASOURCE has released a new version.
- A dl_DATASOURCE script: to be run after ck_DATASOURCE_rel to download the files.
- A db_DATASOURCE script: to be run after dl_DATASOURCE to process the files.
- A prd_DATASOURCE script: to be run after db_DATASOURCE to clean up and move to production.
When the data source is very small, such as just one file being processed and pushed into one or a few tables, we’d recommend doing an all-in-one script – that we dubbed whole:
- A wh_DATASOURCE script: performs the whole job of checking for a release, downloading, processing and pushing to production.
A few additional scripts should be added depending on the situation:
- If you are processing small molecules, an additional db_DATASOURCE_cpd would be required prior to db_DATASOURCE.
- If the data source you are processing is very large and requires parallel jobs, the following scripts should be added:
  - pmj_DATASOURCE: to prepare the job batch (pmj stands for prepare master job)
  - process_DATASOURCE: to be called by each individual job to process the data
  - rmj_DATASOURCE: to submit the batch and monitor its execution (rmj stands for run master job)
  - db_DATASOURCE: processes the results of those batch jobs and pushes them to the database
Please look at the picture below to understand the different processing paths:
Each data source typically begins with a check script (ck_). For smaller data sources, a whole script (wh_) can manage all steps leading up to production. For more complex data sources, a download script allows you to obtain the necessary files without any dependencies that could hinder the download process.
Depending on the data type and complexity, several approaches may be taken. If the data source is simple or small, a database script (db_) can prepare, process, and push the data into the database. If the data includes molecular structures, it’s advisable to use a separate script (db_*_cpd) to process those compounds independently before handling the rest of the data.
For larger or more complex data sets, you can break the process into parallel jobs. In such cases, an optional preparation script (pp_) can evaluate the number of records to process or perform cleanup tasks before processing begins. Following this, a job management script (pmj_) will generate the necessary shell scripts for the parallel jobs, while the execution scripts (rmj_) will carry out the processing.
CONFIG_GLOBAL stores global variables, the paths to the different tools as well as their parameters, and web paths to FTP servers. Thus, if you need to add a new web path, you can add it as a link:
LINK FTP_ENSEMBL http://ftp.ensembl.org/pub/
Guidelines:
- A link is composed of 3 columns separated by one or multiple tabs
- The number of tabs between columns doesn’t matter – as long as there are 3 values.
- The naming convention is FTP_DATASOURCE
- https paths are recommended
CONFIG_JOB is the core of the automation. It defines when a job is triggered depending on many different factors. In this section we will review how to properly configure a job.
All scripts for a given data source must be grouped within the same block of lines – called a data source block. Each data source block is separated from the others by an empty line for clarity. The position of a data source block is important: if you choose to incorporate a new data source into BioRels, the end of CONFIG_JOB is not necessarily the right place for its block.
Tip
Put your data source block after the last critical parent data source block.
In the example above, we have 2 data source blocks: one for taxonomy, made of 1 script, and one for gene, composed of 6 scripts. GENE comes after TAXONOMY because any given gene is related to a specific organism.
Once you have defined where in the file you want to incorporate the data source block, you can create the job lines. Each line will define a job, with its ruleset on when to run or not run it. Below is the walkthrough of each column:
Column 1: Each line starting with SC provides the ruleset for a SCript.
Column 2: The second column represents the job identifier (JOB_ID). If you are adding a new data source, give the first job a round-number JOB_ID, usually 10 above the previous job in the file.
Column 3: Job name. Must be the same as the script name (minus the extension).
Column 4: List of job identifiers that are required to be successful prior to triggering this job – if enabled.
Column 5: List of job identifiers that would trigger this job, provided the required jobs and the required_updated jobs have been run successfully.
Column 6: List of job identifiers that are required to be successfully run at least once.
Important
For Columns 4 to 6: the behavior can differ between critical and non-critical dependencies. If a user chooses not to enable a critical dependency, its requirement as a dependency of your job will be ignored. Therefore, you must add a failsafe to your script logic to stop and fail the job if critical data isn’t there.
Column 7: Directory name. Must be the same as the DATASOURCE directory name.
Column 8: Triggering requirement, based on dependent jobs:
- C: All parent jobs must be updated (Complete)
- A: Any parent job being updated triggers the update
- D: All parent jobs that are NOT disabled must be updated
Column 9: Job type. D: Processing job / P: Moving to PRD job
Column 10: Update frequency. For jobs without parents:
- Time format (24h): HH:MM (00:10 is 10 past midnight)
- D[N]: every N days. D3: every 3 days
- W[N]: every N weeks. W2: every 2 weeks
For jobs with parents:
- P (when parent jobs are successfully completed)
Column 11: Run type: S (Script) or R (Runtime, i.e. batch).
Column 12: Concurrent jobs.
Jobs that alter the database in some way, whether via an insertion, a deletion or an update, are susceptible to conflict with other running jobs. Two situations can arise:
- A job is currently modifying a table X that another job depends upon.
- A job is currently modifying a table X that another job is modifying.
In these two situations, there is a risk that one job is modifying a specific record required by the other, leading to potential failures. To avoid this, Column 12 lists all the concurrent jobs, i.e. jobs that, if they are running, the job of interest will wait for. The guidelines for concurrent jobs are as follows:
- Parent jobs, i.e. those defined in Columns 4 to 6, do not need to be provided (Situation 1)
- All child jobs, i.e. jobs that have this data source as a critical dependency, must be considered concurrent (Situation 1, reversed)
- All jobs that would modify the same database tables (Situation 2). Example:
Important
Any job that you will have in CONFIG_JOB must be defined as a concurrent job of any parent job.
If you review the image above, db_gene (job id 12), which needs taxon information, is a concurrent job of wh_taxonomy. Indeed, we do not want to add new genes to a taxon that is being deleted by wh_taxonomy for example.
- Columns 4, 5 and 6: must be -1, i.e. no dependencies
- Column 8: set to A (Any). Since there’s no dependency, that’s equivalent to none
- Column 9: set to D (Processing job)
- Column 10: set to 00:10
- Column 11: set to S (for script)
- Column 12: -1. No concurrent jobs
The purpose of a ck_*_rel job is to check, on a daily basis, for a new release of the data source. It therefore shouldn’t require any dependencies. It is a script that does a processing job and should be triggered every day, here at 10 past midnight. Since it only checks, there is no risk of collision with another job.
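Putting these values together, a CONFIG_JOB line for a ck_ job could look like the sketch below. The JOB_ID of 200 and the DATASOURCE name are placeholders, columns are separated by tabs in the real file (spaces are used here for readability), and CONFIG_JOB may contain columns beyond the twelve described in this walkthrough, so use an existing data source block as the authoritative template:

SC   200   ck_DATASOURCE_rel   -1   -1   -1   DATASOURCE   A   D   00:10   S   -1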
- Column 4: must have ck_DATASOURCE_rel as a dependency
- Columns 5 and 6: must be -1, i.e. no dependencies
- Column 8: set to C (Complete). All dependencies should have been successfully run
- Column 9: set to D (Processing job)
- Column 10: set to P (Parent)
- Column 11: set to S (for script)
- Column 12: -1. No concurrent jobs
The purpose of a dl_ job is to download the new version of the data source. As such, it should only be triggered if ck_DATASOURCE_rel has been successfully run and found a new version (as defined by Column 8). Therefore, a dl_ job usually has one dependency, the ck_DATASOURCE_rel job (Column 4). Column 10 specifies not to run this job daily but to wait until parent dependencies are successful. Since it only downloads, there is no risk of collision with another job.
A db_ job aims at processing the newly downloaded files and pushing the data into the database. This implies that all dependent data must already be in the database prior to processing this data source. For each dependent data source, if it is:
- Updated frequently: the identifier of the prd_ script or the db_ script must be provided in Column 4
- Updated rarely, or at a lesser frequency than your data source:
  - the identifier of the ck_DATASOURCE_rel script must be provided in Column 4
  - the identifier of prd_DATASOURCE must be provided in Column 6
- Column 4: must have dl_DATASOURCE and any other critical or non-critical dependency
- Column 5: should be -1, i.e. no dependencies
- Column 6: for a rarely updated data source, prd_DATASOURCE of dependent data sources
- Column 8: set to C (Complete). All dependencies should have been successfully run
- Column 9: set to D (Processing job)
- Column 10: set to P (Parent)
- Column 11: set to S (for script)
- Column 12: review dependent jobs
A db_DATASOURCE must only be triggered when dl_DATASOURCE and all critical dependencies have been successful (Column 8 to Complete and Column 10 to Parent).
Important
Don’t forget to define this job as concurrent to its critical and non-critical dependencies
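As a sketch under the same caveats as before (placeholder JOB_IDs, tab-separated columns, possibly additional columns in the real file), a db_ line where 210 is the dl_DATASOURCE job and 230 is a hypothetical child job (or a job touching the same tables) listed as concurrent could look like:

SC   220   db_DATASOURCE   210   -1   -1   DATASOURCE   C   D   P   S   230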
Production scripts mainly exist to clean up the processing files, delete the former version and create an alias of the current version in PRD_DIR. They require the db_ jobs to be successful prior to being run.
- Column 4: must have db_DATASOURCE. No need for dl_DATASOURCE since it’s covered by db_DATASOURCE
- Column 5: should be -1, i.e. no dependencies
- Column 6: should be -1, no dependencies
- Column 8: set to C (Complete). All dependencies should have been successfully run
- Column 9: set to P (PRD job)
- Column 10: set to P (Parent)
- Column 11: set to S (for script)
- Column 12: only if a specific job uses an actual file in the DATASOURCE directory
wh_ jobs are used when the data source is small and everything can be handled in one script, i.e. the check, download, processing, push to the database and move to production. For each dependent data source, if it is:
- Updated frequently: the identifier of the prd_ script or the db_ script must be provided in Column 4
- Updated rarely, or at a lesser frequency than your data source:
  - the identifier of the ck_DATASOURCE_rel script must be provided in Column 4
  - the identifier of prd_DATASOURCE must be provided in Column 6
- Column 4: must have any critical or non-critical dependency(ies)
- Column 5: should be -1, i.e. no dependencies
- Column 6: for a rarely updated data source, prd_DATASOURCE of dependent data sources
- Column 8: set to C (Complete). All dependencies should have been successfully run
- Column 9: set to P (PRD job)
- Column 10: set to P (Parent)
- Column 11: set to S (for script)
- Column 12: only if a specific job uses an actual file in the DATASOURCE directory
Important
Don’t forget to define this job as concurrent to its critical and non-critical dependencies
Batch jobs are a special case to be used for long computing exercises where parallelization is necessary. A batch process is divided into 4 scripts:
- pmj_DATASOURCE: to prepare the job batch
- process_DATASOURCE: to be called by each individual job to process the data
- rmj_DATASOURCE: to submit the batch and monitor its execution
- db_DATASOURCE: processes the results of those batch jobs and pushes them to the database
pmj_ jobs are used to create the shell scripts that will be run in parallel. If they export data from the database to prepare those jobs, the jobs inserting that data into the database must be listed as dependencies.
- Column 4: must have any critical or non-critical dependency(ies)
- Column 5: should be -1, i.e. no dependencies
- Column 6: for a rarely updated data source, prd_DATASOURCE of dependent data sources
- Column 8: set to C (Complete). All dependencies should have been successfully run
- Column 9: set to D (Processing job)
- Column 10: set to P (Parent)
- Column 11: set to S (for script)
- Column 12: only if a specific job uses an actual file in the DATASOURCE directory
rmj_ jobs are used to run all the shell scripts that will be executed in parallel.
To avoid hundreds, sometimes thousands, of jobs running in parallel, all other rmj_ jobs should be listed as concurrent jobs. This prevents multiple rmj_ jobs from running at the same time (a sketch of the corresponding CONFIG_JOB lines follows the column list below).
- Column 4: must have the pmj_ job. Other critical dependencies must be handled by the pmj_ job
- Column 5: should ALWAYS be -1
- Column 6: for a rarely updated data source, prd_DATASOURCE of dependent data sources
- Column 8: set to C (Complete). All dependencies should have been successfully run
- Column 9: set to D (Processing job)
- Column 10: set to P (Parent)
- Column 11: set to R (Runtime)
- Column 12: list all other rmj_ jobs
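Under the same caveats as before (placeholder JOB_IDs and names, tab-separated columns, possibly additional columns in the real file), a pmj_/rmj_ pair could be sketched as follows, where 220 stands for a hypothetical upstream dependency and 250 for another data source’s rmj_ job listed as concurrent:

SC   230   pmj_DATASOURCE   220   -1   -1   DATASOURCE   C   D   P   S   -1
SC   240   rmj_DATASOURCE   230   -1   -1   DATASOURCE   C   D   P   R   250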
The process_ jobs are the scripts called by rmj_ job to execute code in parallel.
Once the job is configured in CONFIG_JOB, we need to add it to the CONFIG_USER file. As a reminder, the CONFIG_USER file is intended to be a user-defined configuration file allowing the user to specify, among other things, which scripts are enabled. In the JOBS section of the CONFIG_USER file, after the #[JOB] line and before the #[/JOB] line, you will need to add your script(s). The format for each job is made of 3 columns, separated by tabs. The number of tabs between 2 columns doesn’t matter, as long as 3 non-empty textual values are provided. The format is as follows:
JOB SCRIPT_NAME STATUS
Where SCRIPT_NAME is the name of the script/job as defined in column 3 of CONFIG_JOB. The STATUS must be either T (job enabled) or F (job disabled).
If you are developing scripts for a new data source, please create a section starting with # and followed by the name of the data source:
**# PMC**
JOB ABCD T
JOB ADCE T
**# CLINVAR**
JOB FFAT F
From there, we need to create a few shell wrappers. The first one is located in $TG_DIR/BACKEND/SCRIPT/SHELL/ and should be named after the script name with the shell extension. Below is an example with wh_taxonomy, where the shell script will be named wh_taxonomy.sh.
| Job name | Script name | Shell script |
|---|---|---|
| wh_taxonomy | wh_taxonomy.php | wh_taxonomy.sh |
The shell script is usually made of two to three lines, as in the example below with wh_taxonomy.sh:
1. #!/bin/sh
2. source $TG_DIR/BACKEND/SCRIPT/SHELL/setenv.sh
3. php $TG_DIR/BACKEND/SCRIPT/TAXONOMY/wh_taxonomy.php
Line 1 is the shell interpreter command.
Line 2 MUST always be present and is the main reason for this shell script: it sources the environment variables.
Line 3 calls the job’s script.
Note
Please note that in this shell script, $TG_DIR should already be defined.
Important
This shell wrapper allows you to call a script using any language, not just PHP.
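For instance, if the same job had been written in Python instead (a hypothetical wh_taxonomy.py, assuming the Python interpreter is available as python3 once setenv.sh has been sourced), the wrapper would simply swap the interpreter on line 3:

#!/bin/sh
source $TG_DIR/BACKEND/SCRIPT/SHELL/setenv.sh
python3 $TG_DIR/BACKEND/SCRIPT/TAXONOMY/wh_taxonomy.py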
The last script to generate is the container shell script. All container shell scripts (CS) are located in $TG_DIR/BACKEND/CONTAINER_SHELL. The goal of each CS wrapper is to allow a script to be executed from a container. In a very similar way, it is made of 3 lines: the shell interpreter command, the sourcing of the environment variables and the script execution. Please note, however, that we call biorels_exe to run the script within the container. This makes job submission easier.
#!/bin/sh
source $TG_DIR/BACKEND/SCRIPT/SHELL/setenv.sh
biorels_exe php $TG_DIR/BACKEND/SCRIPT/TAXONOMY/wh_taxonomy.php
To help you navigate the process of creating those scripts for a new data source, we have created a helper process located in $TG_DIR/BACKEND/DEVELOP. To execute it, simply call the following commands:
cd $TG_DIR/BACKEND/DEVELOP
biorels_php ./new_script_startup.php
The script will walk you through a series of questions to help assess the requirements for your data source. The first series of questions aims at assessing the type of scripts you will need (Figure 2).
Is your process small enough to be covered by 1 script? (Y/N)
- Answering Y will limit the number of scripts to 2: a ck_*_rel script and a wh_ script
- Answering N will automatically add dl_* and prd_* scripts as well as trigger 3 additional questions:
Does your process include processing compounds? (Y/N):
- Answering Y will add a db_*_cpd script.
Does your process require running parallel jobs? (Y/N):
- Answering Y will add pmj_*, process_* and rmj_* scripts
Do you need a preparation/cleanup script prior to the processing script? (Y/N):
- Answering Y will add a pp_* script
Next, generic questions about the data source requirements and programming language will be asked:
Do you prefer to code in PHP (N for Python)? (Y/N):
- Answering Y will generate PHP scripts
- Answering N will generate Python scripts
Is this a private data source? (Y/N):
- Answering Y will move the code to PRIVATE_SCRIPT
- Answering N will move the code to SCRIPT
Does it require a login/password? (Y/N):
- Answering Y will add to CONFIG_USER an additional line in the GLOBAL section for the user to provide a login/password
What would be a good root FTP/HTTPS Path for the location of the files:
Please provide the root FTP or HTTPS path for the location of the data files, starting with https:// or ftp://
What is the name of the data source (no space):
Please provide the name of the data source. A few guidelines:
- Only alphabetic characters are allowed – a mix of upper/lowercase is acceptable
- Anything else WILL result in undefined behavior.
Which data sources are critical dependencies of your data source?
This last question is absolutely critical. You will be asked to list all of the critical dependencies for your data source: in other terms, list all of the data sources that your data source is a child of. Please refer to the pictures in the publication to understand which data sources these might be. You do not need to go all the way to an L1 data source; the immediate layer of dependencies is enough.
All done! This will generate a new directory, named after your data source, in uppercase. In it, you will see a directory of the same name, which will contain the PHP or Python scripts. In addition, you will see a CONFIG_CHANGES file that explains in detail what you will need to change or add. All set! Just try it out now. Please bear in mind that this is experimental, so if it doesn’t work, just reach out to us!
Everything is set! You can now execute your script.