Cloud URI support in GenomicsDB
There are two levels of Cloud URI support in GenomicsDB. Using URIs to read in VCF files is covered here.
Cloud URIs can also be passed to all tools and GenomicsDB classes (com.intel.genomicsdb.importer.GenomicsDBImporter and com.intel.genomicsdb.reader.GenomicsDBFeatureReader) that create workspaces, load data into arrays, and query arrays in those workspaces via the loader and query JSON files. The following schemes are currently supported:
- HDFS e.g. hdfs://my-master:9000/my_workspace
- EMRFS e.g. s3://my-bucket/my_repository/my_workspace
- GCS e.g. gs://my-bucket/my_workspace
Examples:
- create_tiledb_workspace hdfs://my-master:9000/ws
- consolidate_tiledb_array gs://my-bucket/ws/gdb_partition_1
- Set workspace paths in the JSON files to point either to local filesystems (e.g., /my_home/my_workspace) or to cloud URIs (e.g., hdfs://my-master:9000/my_workspace).
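For illustration, a query JSON with its workspace on HDFS might look like the following. This is a minimal sketch: the field names and array name shown here are assumptions based on typical GenomicsDB JSON examples, so verify them against the documentation for your version.

```json
{
  "workspace": "hdfs://my-master:9000/my_workspace",
  "array": "gdb_partition_0"
}
```

Only the workspace path changes when moving between local and cloud storage; the rest of the loader/query configuration stays the same.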
TileDB/GenomicsDB relies on libhdfs to access HDFS; libhdfs starts a Java VM and uses the HDFS Java API to transfer data.
Check that libjvm is on the library path (${LD_LIBRARY_PATH}). Also, configure the Java classpath using `hadoop classpath --glob`.
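The two setup steps above can be sketched as shell commands. The libjvm directory below is an assumption that varies by JDK version and layout; adjust it to wherever libjvm.so lives in your installation.

```shell
# Make libjvm visible to libhdfs (directory is JDK-specific; adjust as needed)
export LD_LIBRARY_PATH="${JAVA_HOME}/jre/lib/amd64/server:${LD_LIBRARY_PATH}"

# Add the Hadoop jars to the Java classpath, expanding wildcards with --glob
export CLASSPATH="$(hadoop classpath --glob):${CLASSPATH}"
```

These exports must be in effect in the shell that launches the GenomicsDB tools, so they are typically placed in the job submission script or shell profile.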
Cloud object storage systems such as S3 and GCS are supported in TileDB/GenomicsDB using the HDFS API as well.
- EMRFS is an implementation of HDFS; GenomicsDB works out of the box as long as the buckets are accessible from HDFS.
- For GCS:
- If you are using a Cloud Dataproc cluster on Google Cloud, all the prerequisites are already installed.
- If you are setting up HDFS manually (whether on vanilla GCE VMs or on external systems), install the GCS Cloud Connector to work with gs:// URIs. Full instructions are provided here and here; the full set of configuration entries is listed here.
The following shows some of the common settings that are modified:
```xml
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>MY_KEY.json</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>MYPROJECT</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
```
For Google Cloud credentials, you can also set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the JSON file containing your key (instead of modifying the XML file).
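As a concrete example of the environment-variable alternative, the export below reuses the MY_KEY.json placeholder from the XML configuration; substitute the actual path to your service account key file.

```shell
# Point Application Default Credentials at the service account key file,
# instead of setting google.cloud.auth.service.account.json.keyfile in XML
export GOOGLE_APPLICATION_CREDENTIALS=MY_KEY.json
```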