Skip to content

Periodically retrieve some data from various sources.

License

Notifications You must be signed in to change notification settings

jonas-hagen/databird

Repository files navigation

Build Status

databird

Periodically retrieve data from different sources.

The databird package only provides a framework to plan and run the tasks needed to keep a local data-file-store up do date with various remote sources. The remote sources can be anything (e.g. FTP Server, ECMWF, HTTP Api, SQL database, ...), as long as there is a databird-driver available for the specific source.

Usage

Databird is configured with configuration files and invoked by

$ databird retrieve -c /etc/databird/databird.conf

# or (as the above is the default)
$ databird retrieve

You can store the configuration files anywhere and for example run the above command periodically as cron job.

Also, some rq workers are required:

$ rq worker databird

This will start one worker. You should use a supervisor to start multiple workers.

Configuration

The following example configuration defines a repository, which is populated with daily GNSS data from ftp://cddis.nasa.gov/gnss/data/daily/.

The main configuration file (usually databird.conf) could look like that:

general:
  root: /data/repos # root path for data repositories
  num-workers: 16   # max number of async workers
  include: "databird.conf.d/*.conf"  # include config files

Generally you can configure anything in any file, as all configuration files are merged to one configuration tree. The include option is an exception, as it can only be declared in the top config file.

Then in databird.conf.d/cddis.conf you can configure a profile and a repository:

profiles:
  nasa_cddis:
    driver: standard.FtpDriver
    configuration:
      host: cddis.nasa.gov
      user: anonymous
      password: ""
      tls: False
       
repositories:
  nasa_gnss:
    description: Data from NASAs Archive of Space Geodesy Data
    profile: nasa_cddis
    period: 1 day
    delay: 2 days
    start: 2019-01-01
    targets:
      status: "{time:%Y}/cddis_gnss_{iso_date}.status"
    configuration:
      user: anonymous  # this could override 'user' from profile
      root: "/gnss/data/daily"
      patterns:
        status: "{time:%Y}/{time:%j}/{time:%y%j}.status"

When calling databird with this configuration the following is achieved:

  • A repository in the folder /data/repos/nasa_gnss/ is created
  • For every day, a file like 2019/nasa_gnss_2019-01-20.status is expected
  • If that file is missing, retrieve it from ftp://cddis.nasa.gov/gnss/data/daily/2019/020/19020.status
  • If there are many files missing, the data is retrieved asynchronously

This example used the standard.FTPDriver.

Monitoring

Use databird webmonitor [PORT] to start the web interface.

Since databird uses RQ for managing jobs, you also check the options at RQ/docs/monitoring.

Drivers

Anyone can write drivers (see below). Currently, the following drivers are available:

Included:

  • standard.FilesystemDriver: Retrieve data from the local filesystem
  • standard.CommandDriver: Run an arbitrary shell command
  • standard.FtpDriver: Retrieve data from an FTP server

Climate:

  • climate.EcmwfDriver: Retrieve data from the European Centre for Medium-Range Weather Forecasts (ECMWF) via their API
  • climate.C3SDriver: Retrieve data from the Copernicus Climate Change Service (C3S) via their API
  • climate.GesDiscDriver: Retrieve data from the NASA EarthData GES DISC service.

Development

  1. Create a Python environment and activate it
    $ python3 -m venv . && source bin/activate
  2. Install the development environment:
    (databird) $ pip install -r requirements-dev.txt

Writing a new driver

Drivers are published in a namespace package databird-drivers. Everyone can develop drivers and share them.

Install databird and run mr.bob to create a new driver package:

(databird) $ cd $HOME/projects
(databird) $ python -m mrbob.cli databird.blueprints:driver

After answering some questions, a new directory databird-driver-<chosen_name> is created. Lets asume <chosen_name> = foo, then your driver is usually implemented in databird/drivers/foo/foo.py in a class named FooDriver(). Until more documentation is available, you have to look at the code to figure out how to write a driver.

Other people will be able to use it with driver: foo.FooDriver.

Tell me if you wrote a new driver, so I can include it in the list.

About

Periodically retrieve some data from various sources.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages