Download EGA data and arrange for analysis

This is a Nextflow workflow that downloads the data for an EGA dataset and arranges it in a form suitable for analysis: raw FASTQ files plus a metadata table.

The workflow uses the pyega3 client to pull files directly from EGA.

Prerequisites

Nextflow installed (currently tested with Nextflow 24.04.3)
Formal access to the EGA datasets of interest
A cluster running the SLURM job scheduling system
An account on, and access to, the Seqera Platform (optional)

Setup

Directory structure

Create a directory for the dataset (with appropriate access restrictions - this is controlled access data)
Create 'metadata' and 'credentials' subdirectories
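
The layout above can be sketched as a small shell function. This is only an illustration: the function name `setup_dataset_dir` is made up for this example, and owner-only permissions (mode 700) are an assumption about what "appropriate access restrictions" means on your system. Adjust the mode if a group needs access.

```shell
# Create the dataset directory with 'metadata' and 'credentials' subdirectories.
# Assumption: owner-only access (700) is the restriction you want.
setup_dataset_dir() {
    dataset_dir="$1"
    mkdir -p "$dataset_dir/metadata" "$dataset_dir/credentials" || return 1
    # Lock down the top-level directory and the credentials folder.
    chmod 700 "$dataset_dir" "$dataset_dir/credentials" || return 1
}
```

For example, `setup_dataset_dir EGAD00011223344` creates the layout in the current directory, using the example dataset ID as a placeholder directory name.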

Set up authentication

Python client method

The Python client authenticates with your EGA user account. Place a file called 'ega.credentials' in the credentials folder. It will look like:

{
        "username": "me@foo.bar.uk",
        "password": "abc123"
}

See the EGA documentation for more info.
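
Because the credentials file holds a plain-text password, it may be worth creating it with restrictive permissions. The sketch below is one way to do that; the function name `write_ega_credentials` is hypothetical, and it assumes the username and password contain no double quotes or backslashes, since the values are written into the JSON verbatim.

```shell
# Write ega.credentials into the given directory, readable only by its owner.
# Arguments: credentials directory, EGA username. The password is read from stdin.
write_ega_credentials() {
    cred_file="$1/ega.credentials"
    username="$2"
    # Read the password from standard input so it is not typed as an argument.
    IFS= read -r password || return 1
    # Create the file and restrict it to the owner before writing the secret.
    : > "$cred_file" || return 1
    chmod 600 "$cred_file" || return 1
    printf '{\n    "username": "%s",\n    "password": "%s"\n}\n' \
        "$username" "$password" > "$cred_file" || return 1
    unset password
}
```

A call such as `write_ega_credentials credentials me@foo.bar.uk` then waits for the password on standard input.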

Obtain metadata

Download the metadata bundle from the EGA page for each dataset. You'll have the option to download a zipped file with metadata in TSV, CSV and JSON. Download the CSV version, unzip it and place the files under 'metadata':

metadata
    |- analyses.csv
    |- analysis_sample.csv
    |- ...

We now have all the information we need to download the raw data and process the metadata.
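
Before launching the pipeline, a quick check that the metadata files are in place can save a failed run. The sketch below only checks the two file names shown in the listing above; the bundle contains other files that are not listed here, and the function name `check_metadata_dir` is made up for this example.

```shell
# Report any of the expected metadata CSV files that are missing or empty.
# Only the two files named in this README are checked.
check_metadata_dir() {
    metadata_dir="$1"
    missing=0
    for name in analyses.csv analysis_sample.csv; do
        if [ ! -s "$metadata_dir/$name" ]; then
            echo "missing or empty: $metadata_dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_metadata_dir metadata` from the dataset directory returns a non-zero status if either file is absent.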

Run download pipeline

Clone this repository into the top-level directory created above.

Then run:

source envs.sh
nextflow run main.nf -c nextflow.config --EGA_DATASET_ID $EGA_DATASET_ID

... or

nextflow run main.nf -c nextflow.config --EGA_DATASET_ID $EGA_DATASET_ID -with-tower

to leverage the Seqera Platform capabilities (optional). You'll need to obtain a token and add it to envs.sh.
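
A pre-flight check on the environment can catch a missing variable before Nextflow starts. The sketch below makes two assumptions that are not confirmed by this README: that dataset IDs have the form `EGAD` followed by 11 digits (as in the example ID used in this document), and that the Seqera token is exported as `TOWER_ACCESS_TOKEN`, the variable Nextflow reads for `-with-tower`. The function name `check_launch_env` and its `--tower` argument are hypothetical.

```shell
# Verify the variables the launch commands rely on.
# Pass --tower to also require a Seqera Platform token.
check_launch_env() {
    # Assumption: dataset IDs look like EGAD followed by 11 digits.
    if ! printf '%s\n' "${EGA_DATASET_ID:-}" | grep -Eq '^EGAD[0-9]{11}$'; then
        echo "EGA_DATASET_ID is unset or not of the form EGAD + 11 digits" >&2
        return 1
    fi
    # The token is only needed when launching with -with-tower.
    if [ "${1:-}" = "--tower" ] && [ -z "${TOWER_ACCESS_TOKEN:-}" ]; then
        echo "TOWER_ACCESS_TOKEN is not set" >&2
        return 1
    fi
}
```

For example, `source envs.sh && check_launch_env --tower` could be run before the second command above.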

The result will be:

A metadata summary at a location like work/..../ega_metadata/EGAD00011223344.merged.csv
FASTQ files at data/.../ega_data/(EGAFxxxxx)/fastq
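
Since the exact output paths vary between runs, the results can be located by pattern rather than by full path. The sketch below is an illustration only: the function names are made up, and it assumes that every file under a `fastq` directory is a FASTQ file.

```shell
# Print the path of each merged metadata table under work/.
find_merged_metadata() {
    find "$1/work" -path '*/ega_metadata/*' -name '*.merged.csv'
}

# Count the files found under any 'fastq' directory below data/.
# Assumption: everything in those directories is a FASTQ file.
count_fastq_files() {
    find "$1/data" -path '*/fastq/*' ! -type d | wc -l | tr -d ' '
}
```

Both functions take the top-level directory as their argument, for example `find_merged_metadata .` and `count_fastq_files .`.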

Clean up

Nextflow leaves a few things lying around, so once the above has succeeded, remove them:

rm -rf .nextflow*

Legacy code

An earlier implementation converted BAM/CRAM files downloaded from EGA to FASTQ in an endedness-specific manner (i.e. paired-end data was detected and handled correctly). In addition to using the pyega3 client to pull files directly from EGA, that workflow also supported the Aspera dropbox method, which downloads files from a 'dropbox' provided to you by EGA staff. This method is now deprecated. See release v1.0.0 for more information on the earlier version.
