Cloud URI support in GenomicsDB
There are two levels of Cloud URI support in GenomicsDB. Using URIs to read in VCF files is covered here.
Cloud URIs can also be passed to all tools and GenomicsDB classes (com.intel.genomicsdb.importer.GenomicsDBImporter and com.intel.genomicsdb.reader.GenomicsDBFeatureReader) that create workspaces, load data into arrays, and query arrays in those workspaces via the loader and query JSON files. The following schemes are currently supported:
- HDFS e.g. hdfs://my-master:9000/my_workspace
- EMRFS e.g. s3://my-bucket/my_repository/my_workspace
- GCS e.g. gs://my-bucket/my_workspace
Examples:
- create_tiledb_workspace hdfs://my-master:9000/ws
- consolidate_tiledb_array gs://my-bucket/ws/gdb_partition_1
- Set workspace paths in the JSON files to point either to local filesystems (e.g., /my_home/my_workspace) or to cloud URIs (e.g., hdfs://my-master:9000/my_workspace).
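For illustration, a query JSON with its workspace on HDFS might look like the following. This is a minimal sketch: the field names and array name shown here are assumptions based on typical GenomicsDB JSON examples, so verify them against the documentation for your version.

```json
{
  "workspace": "hdfs://my-master:9000/my_workspace",
  "array": "gdb_partition_0"
}
```

Only the workspace path changes when moving between local and cloud storage; the rest of the loader/query configuration stays the same.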
TileDB/GenomicsDB relies on libhdfs to access HDFS; libhdfs starts a Java VM and uses the HDFS Java API to transfer data.
Check that libjvm is on the library path (${LD_LIBRARY_PATH}). Also, configure the Java classpath using `hadoop classpath --glob`.
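The two setup steps above can be sketched as shell commands. The libjvm directory below is an assumption that varies by JDK version and layout; adjust it to wherever libjvm.so lives in your installation.

```shell
# Make libjvm visible to libhdfs (directory is JDK-specific; adjust as needed)
export LD_LIBRARY_PATH="${JAVA_HOME}/jre/lib/amd64/server:${LD_LIBRARY_PATH}"

# Add the Hadoop jars to the Java classpath, expanding wildcards with --glob
export CLASSPATH="$(hadoop classpath --glob):${CLASSPATH}"
```

These exports must be in effect in the shell that launches the GenomicsDB tools, so they are typically placed in the job submission script or shell profile.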
Cloud object storage systems such as S3 and GCS are supported in TileDB/GenomicsDB using the HDFS API as well.
- EMRFS is an implementation of HDFS; GenomicsDB works out of the box as long as the buckets are accessible from HDFS.
- For GCS:
- If you are using a Cloud Dataproc cluster on Google Cloud, all the prerequisites are already installed.
- If you are setting up HDFS manually (whether on vanilla GCE VMs or on external systems), install the GCS Cloud Connector to work with gs:// URIs. Full instructions are provided here and here; the full set of configuration entries is listed here.
The following shows some of the common settings that are modified:
```xml
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>MY_KEY.json</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>MYPROJECT</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
```
For Google Cloud credentials, you can also set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the JSON file containing your key (instead of modifying the XML file).
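As a concrete example of the environment-variable alternative, the export below reuses the MY_KEY.json placeholder from the XML configuration; substitute the actual path to your service account key file.

```shell
# Point Application Default Credentials at the service account key file,
# instead of setting google.cloud.auth.service.account.json.keyfile in XML
export GOOGLE_APPLICATION_CREDENTIALS=MY_KEY.json
```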