Download external data once and reuse it during the run #27
Comments
We are also having problems getting our hands on other artifacts. I've generalized this ticket so we can work out a joint strategy to get all data in place up front and ensure that we don't have failures after (in this case) about six hours. E.g.:
Also tagging @cmungall @dougli1sqrd
Noting that the ontology portion of this ticket would largely be covered by geneontology/go-ontology#16876 (@balhoff). Noting to @dougli1sqrd that the GAF/GPAD upstream part of this could be covered by the following steps:
This would help address recent issues we've had with our upstreams, where even though a download fails, the pipeline may continue running for many hours on a different parallel job, both increasing the time it takes to notice a problem and burying the error message somewhere in a bajillion log lines.
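A minimal sketch of that fail-fast, download-everything-first idea, assuming a plain Python pre-flight step; the URL list, directory name, and function names here are all hypothetical, not part of the actual pipeline:

```python
#!/usr/bin/env python3
"""Fetch every upstream artifact before any pipeline stage runs.

Sketch only: UPSTREAMS and the target directory are placeholders,
not the pipeline's real configuration.
"""
import pathlib
import sys
import urllib.request

# Hypothetical upstream list; the real one would come from pipeline config.
UPSTREAMS = [
    "http://purl.obolibrary.org/obo/go.owl",
    "http://purl.obolibrary.org/obo/cl/cl-basic.owl",
]

def fetch_all(urls, target_dir="downloads"):
    out = pathlib.Path(target_dir)
    out.mkdir(exist_ok=True)
    for url in urls:
        dest = out / url.rsplit("/", 1)[-1]
        try:
            urllib.request.urlretrieve(url, str(dest))
        except OSError as exc:
            # Fail loudly at minute one, not hour six.
            sys.exit(f"FATAL: could not fetch {url}: {exc}")
        print(f"fetched {url} -> {dest}")

if __name__ == "__main__":
    fetch_all(UPSTREAMS)
```

If any upstream is unreachable, the whole run dies immediately with a single clear error, rather than a parallel job burying it hours later.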
Further commentary on #27: if every tool we realistically use (Noctua, AmiGO, etc.) loaded a single realized ontology with the right pedigree information in it, we could just reference those locally and not have to worry about catalogs.
@kltm ontobio uses the …
@balhoff I wanted to follow up on #27 (comment) above, in reference to Line 88 in 3d3043f.
Are there plans to have these as a merged ontology, or should we work with @dougli1sqrd to make sure these are all available locally by the time we hit this point in the pipeline?
Talking to @balhoff: http://purl.obolibrary.org/obo/cl/cl-basic.owl is already in go-gaf, leaving us to decide whether/how to fold in:
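However that fold-in list shakes out, one option is to realize a single merged ontology up front with ROBOT's merge command; a rough sketch, where the input file names are illustrative rather than the agreed set:

```python
import subprocess

# Illustrative inputs: locally staged copies of the ontologies to fold in.
inputs = ["downloads/go.owl", "downloads/cl-basic.owl"]

cmd = ["robot", "merge"]
for path in inputs:
    cmd += ["--input", path]
cmd += ["--output", "downloads/merged.owl"]

# ROBOT exits nonzero on failure, so check=True stops the pipeline here
# rather than letting a partial merge leak into later stages.
subprocess.run(cmd, check=True)
```

A merged file like this is also what the earlier comment about every tool loading "a single realized ontology" would point at.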
GAF upstream sources are now being downloaded and used in the pipeline. Is there anything left in this ticket?
We still have ontologies all over the place. We might want to make another issue, but essentially we need to have and enforce catalogs (or similar) so that there is no leaking during a run. For example, I believe there is currently a place that tags the public NEO load, meaning that it can be a month behind.
Talking to @dougli1sqrd, it turns out that the "mixin" process in ontobio will still grab the remote file (PAINT in this case), possibly causing errors if the resource is down, as experienced on 2019-12-05. That said, we probably don't want to keep going down the path of "tricking" ontobio by laying things out on the filesystem, but rather move to a more "catalog-like" system where the downloader generates a mapping file for the run that is then consumed by ontobio.
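A sketch of what that could look like, assuming a simple JSON mapping file; none of this is ontobio's actual API, it just illustrates the shape of "downloader writes the mapping, consumers resolve through it":

```python
import json
import pathlib

MAPPING_FILE = "run-mapping.json"  # hypothetical per-run file

def record_download(url, local_path, mapping_file=MAPPING_FILE):
    """Called by the downloader after each successful fetch."""
    path = pathlib.Path(mapping_file)
    mapping = json.loads(path.read_text()) if path.exists() else {}
    mapping[url] = str(local_path)
    path.write_text(json.dumps(mapping, indent=2))

def resolve(url, mapping_file=MAPPING_FILE):
    """Called by consumers (e.g. an ontobio wrapper) instead of fetching
    the URL; refuses to fall back to the network, so nothing leaks."""
    mapping = json.loads(pathlib.Path(mapping_file).read_text())
    try:
        return mapping[url]
    except KeyError:
        raise RuntimeError(f"{url} was not downloaded up front") from None
```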
Just a reminder to ourselves that this is still occurring: #27 (comment)
Download ontologies and "annotation" upstreams once and reuse them during the run in all stages. This would be accomplished with some combination of catalogs and/or ROBOT.
This serves two important purposes: download failures surface up front instead of hours into a run, and every stage consumes the same consistent, locally staged data.
Ideally, once the initial data grabs are done up front, the pipeline stops talking to the outside world until it starts publishing.
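For the ontology half, the "catalogs" mentioned above could be OASIS XML catalogs of the kind the OWL API and Protégé read (catalog-v001.xml); a sketch that writes one from the same URL-to-file mapping, with an example entry from this thread:

```python
import pathlib
from xml.sax.saxutils import quoteattr

def write_catalog(mapping, path="catalog-v001.xml"):
    """Map ontology IRIs to locally staged copies so that no stage
    resolves an IRI over the network mid-run."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">',
    ]
    for iri, local in sorted(mapping.items()):
        lines.append(f"  <uri name={quoteattr(iri)} uri={quoteattr(local)}/>")
    lines.append("</catalog>")
    pathlib.Path(path).write_text("\n".join(lines) + "\n")

# Example entry using the CL IRI discussed above.
write_catalog({
    "http://purl.obolibrary.org/obo/cl/cl-basic.owl": "downloads/cl-basic.owl",
})
```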