Cloud Function for detecting new files on the clinvar ftp site

clingen-data-model/clinvar-ftp-watcher

clinvar-ftp-watcher

Code for detecting new files on the ClinVar FTP site.

It can be configured to look for new dated files of the form 'filename_YYYY-MMDD.xml.gz'. By default, it looks for files named 'ClinVarVariationRelease_YYYY-MMDD.xml.gz'.
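As an illustration of the naming convention, the sketch below parses the date out of a dated file name. It is not taken from this repository: the function name `parse_release_date` and its `base` parameter are hypothetical, and the pattern assumes the date is always four digits, a hyphen, then four digits.

```python
import re
from datetime import date
from typing import Optional


def parse_release_date(file_name: str,
                       base: str = "ClinVarVariationRelease") -> Optional[date]:
    """Return the date encoded in '<base>_YYYY-MMDD.xml.gz', or None if no match."""
    # Hypothetical helper: the pattern is an assumption based on the README text.
    pattern = re.compile(
        rf"^{re.escape(base)}_(\d{{4}})-(\d{{2}})(\d{{2}})\.xml\.gz$"
    )
    match = pattern.match(file_name)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None
```

Anchoring the pattern at both ends means the '.gz.md5' checksum files and the '00-latest_weekly' link are not mistaken for dated releases.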

It works as follows (all values in this description are the default configured values):

  • read the last offset from the 'clinvar_vcv_ftp_watcher' topic, which contains records of dated files
  • retrieve the date of the last file recorded from that offset (there may be multiple dated files in a single recorded entry)
  • use that date to compare against all of the 'ClinVarVariationRelease_YYYY-MMDD.xml.gz' dated files in https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/clinvar_variation/weekly_release/
  • when there are more recent files, store them in a new message in the configured cloud topic
  • initiate the 'clinvar_vcv_ingest' Cloud Run job in Google Cloud
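The comparison step above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes the directory listing is an HTML page whose links are bare file names, and the names `LinkCollector` and `newer_releases` are hypothetical. Because the 'YYYY-MMDD' stamp is fixed width and zero padded, plain string comparison orders the stamps chronologically.

```python
import re
from html.parser import HTMLParser
from typing import List

# Assumed pattern for the default dated release files.
DATED_FILE = re.compile(r"^ClinVarVariationRelease_(\d{4}-\d{4})\.xml\.gz$")


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag in an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def newer_releases(listing_html: str, last_seen: str) -> List[str]:
    """Return dated release files whose YYYY-MMDD stamp is later than last_seen."""
    collector = LinkCollector()
    collector.feed(listing_html)
    newer = set()
    for href in collector.hrefs:
        match = DATED_FILE.match(href)
        # Fixed-width stamps compare correctly as plain strings.
        if match and match.group(1) > last_seen:
            newer.add(href)
    return sorted(newer)
```

In use, `listing_html` would be the body fetched from the weekly_release URL and `last_seen` the date stamp of the last file recorded in the topic; both are left to the caller here so the function stays free of network and Kafka dependencies.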

The 'weekly_release' directory contains the following files:

  • 'ClinVarVariationRelease_00-latest_weekly.xml.gz' - a symbolic link to a dated release file. At the beginning of each month, ClinVar moves all of the dated files for the previous month to the parent clinvar_variation directory. As new weekly dated files are added to the weekly_release directory throughout the month, this link is updated to point to the latest dated weekly release file.
  • md5 checksum files - for every .gz file in this directory there is a checksum file in the form .gz.md5
  • ClinVarVariationRelease_YYYY-MMDD.xml.gz - these are the dated files that this process reports on.
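The checksum files can be used to verify a downloaded release. The sketch below is an illustration only: it assumes each '.gz.md5' file follows the common `md5sum` layout, where the hex digest is the first whitespace-separated token, and the function name `md5_matches` is hypothetical.

```python
import hashlib


def md5_matches(data_path: str, md5_path: str, chunk_size: int = 1 << 20) -> bool:
    """Compare the MD5 digest of data_path with the digest stored in md5_path."""
    digest = hashlib.md5()
    with open(data_path, "rb") as handle:
        # Read in chunks so large release files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    with open(md5_path, "r", encoding="utf-8") as handle:
        tokens = handle.read().split()
    if not tokens:
        return False
    # Assumption: the first token of the checksum file is the hex digest.
    return digest.hexdigest() == tokens[0].lower()
```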

This code has been deployed as a Cloud Run job named 'clinvar-vcv-ftp-watcher', with an associated Cloud Scheduler trigger that runs it every hour.

A new build requires an edit to the cloud run job to update the container image.

Environment Variables:

"DX_JAAS_CONFIG" must be defined; this is the Kafka permission string

"CLINVAR_FTP_WATCHER_TOPIC" Kafka topic to read/write, defaults to 'clinvar-vcv-ftp-watcher' when not explicitly defined

"NCBI_CLINVAR_WEEKLY_FTP_DIR" defaults to "/pub/clinvar/xml/clinvar_variation/weekly_release" when not explicitly defined

"NCBI_CLINVAR_FTP_SITE" defaults to "https://ftp.ncbi.nlm.nih.gov" when not explicitly defined

"NCBI_CLINVAR_FILE_NAME_BASE" the base file name to look for, defaults to 'ClinVarVariationRelease' when not explicitly defined

"GCP_WORKFLOW_PROJECT_ID" the GCP Project ID where the workflow resides, defaults to "clingen-dev" when not explicitly defined

"GCP_WORKFLOW_LOCATION" the region where the workflow resides, such as "us-central1" when not explicitly defined

"GCP_WORKFLOW_NAME" the name of the workflow to invoke, such as "clinvar-ingest" when not explicitly defined
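To make the defaults concrete, the sketch below resolves these variables the way the descriptions above suggest. It is an illustration, not the repository's code: the function `load_settings` and its dictionary keys are hypothetical, and treating "us-central1" and "clinvar-ingest" as defaults is an assumption, since the text above only says "such as".

```python
import os
from typing import Dict, Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Resolve watcher settings from environment variables, applying defaults."""
    jaas = env.get("DX_JAAS_CONFIG")
    if not jaas:
        # The README states this variable must be defined.
        raise RuntimeError("DX_JAAS_CONFIG must be defined")
    site = env.get("NCBI_CLINVAR_FTP_SITE", "https://ftp.ncbi.nlm.nih.gov")
    weekly_dir = env.get(
        "NCBI_CLINVAR_WEEKLY_FTP_DIR",
        "/pub/clinvar/xml/clinvar_variation/weekly_release",
    )
    return {
        "jaas_config": jaas,
        "topic": env.get("CLINVAR_FTP_WATCHER_TOPIC", "clinvar-vcv-ftp-watcher"),
        "file_name_base": env.get("NCBI_CLINVAR_FILE_NAME_BASE",
                                  "ClinVarVariationRelease"),
        # Join site and directory into the listing URL, normalising slashes.
        "listing_url": site.rstrip("/") + "/" + weekly_dir.strip("/") + "/",
        "workflow_project": env.get("GCP_WORKFLOW_PROJECT_ID", "clingen-dev"),
        # Assumption: the "such as" values are treated as defaults here.
        "workflow_location": env.get("GCP_WORKFLOW_LOCATION", "us-central1"),
        "workflow_name": env.get("GCP_WORKFLOW_NAME", "clinvar-ingest"),
    }
```

Passing a plain dictionary as `env` makes the resolution easy to exercise without touching the real process environment.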

Command Line Arguments:

  • --kafka = do not write the release information to the Kafka topic.
  • --workflow = do not make calls to initiate the workflow processing of the ClinVar release.
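As described, each flag switches an action off when it is present. A minimal sketch of that behaviour with Python's `argparse` is shown below; the function name `parse_args` and the `prog` value are hypothetical, and the actual tool may parse its arguments differently.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two flags described above; each one disables an action."""
    parser = argparse.ArgumentParser(prog="clinvar-ftp-watcher")
    parser.add_argument(
        "--kafka",
        action="store_true",
        help="do not write the release information to the Kafka topic",
    )
    parser.add_argument(
        "--workflow",
        action="store_true",
        help="do not initiate the workflow processing of the ClinVar release",
    )
    return parser.parse_args(argv)
```

With this sketch, `parse_args(["--kafka"])` yields `kafka=True` and `workflow=False`, meaning the Kafka write would be skipped while the workflow call would still be made.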
