Refactor data pipeline output #24

nawatts · 2020-09-11T23:19:45Z

Currently, all outputs of the data pipeline are written to the output.staging_path specified in pipeline_config.ini.

exome-results-browsers/data_pipeline/pipeline_config.ini

Lines 45 to 47 in 86c8b62

    
    [output]
    # Path for intermediate Hail files.
    staging_path = gs://exome-results-browsers/data/200911

Thus, preserving older versions of the combined Hail table requires changing the staging path setting every time data is updated. This in turn means keeping multiple copies of the gene models and of each individual dataset's files, one per staging path.

Instead, gene models could be output separately, individual dataset Hail tables written to the staging path, and combined Hail tables written to timestamped paths. This way, updating one dataset would require running prepare_dataset only on that one dataset and then generating a new combined Hail table.
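
One possible path layout for this, as a rough sketch only. The function name `output_paths`, the directory names (`gene_models.ht`, `datasets/`, `combined/`), and the `YYMMDD` timestamp format (chosen to mirror the `200911` suffix in the current staging path) are all assumptions for illustration, not something that exists in the pipeline today.

```python
from datetime import datetime, timezone


def output_paths(staging_path, dataset_ids, now=None):
    """Return a hypothetical output layout under a single staging path.

    Gene models and per-dataset tables get stable paths, so they are
    written once and reused. Only the combined table path changes
    between runs, because it includes a timestamp.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%y%m%d")  # assumed format, e.g. 200911
    root = staging_path.rstrip("/")
    return {
        # Written separately from the datasets, shared by every run.
        "gene_models": f"{root}/gene_models.ht",
        # One stable path per dataset; rerun only the one that changed.
        "datasets": {d: f"{root}/datasets/{d}.ht" for d in dataset_ids},
        # Timestamped, so older combined tables are preserved.
        "combined": f"{root}/combined/{stamp}/results.ht",
    }


if __name__ == "__main__":
    # "dataset_a" and "dataset_b" are placeholder dataset names.
    paths = output_paths(
        "gs://exome-results-browsers/data",
        ["dataset_a", "dataset_b"],
    )
    for name, path in paths.items():
        print(name, path)
```

With a layout like this, the staging path setting would not need to change between data updates; only the combined table's timestamped directory would differ from run to run.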
