Download external data once and reuse it during the run #27
Comments
We are also having problems getting our hands on other artifacts. I've generalized this ticket so we can work out a joint strategy to get all data in place up front and ensure that we don't have failures after (in this case) about six hours. E.g.:
Also tagging @cmungall @dougli1sqrd
Noting that the ontology portion of this ticket would largely be covered by geneontology/go-ontology#16876 (@balhoff). Noting to @dougli1sqrd that the GAF/GPAD upstream part of this could be covered by the following steps:
This would help address recent issues we've had with our upstreams, where even though a download fails, the pipeline may continue running for many hours on a different parallel job, both increasing the time it takes to notice a problem and burying the error message somewhere in a bajillion log lines.
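A minimal sketch of that fail-fast, download-everything-first idea, assuming a plain Python pre-flight step; the URL list, directory name, and function names here are all hypothetical, not part of the actual pipeline:

```python
#!/usr/bin/env python3
"""Fetch every upstream artifact before any pipeline stage runs.

Sketch only: UPSTREAMS and the target directory are placeholders,
not the pipeline's real configuration.
"""
import pathlib
import sys
import urllib.request

# Hypothetical upstream list; the real one would come from pipeline config.
UPSTREAMS = [
    "http://purl.obolibrary.org/obo/go.owl",
    "http://purl.obolibrary.org/obo/cl/cl-basic.owl",
]

def fetch_all(urls, target_dir="downloads"):
    out = pathlib.Path(target_dir)
    out.mkdir(exist_ok=True)
    for url in urls:
        dest = out / url.rsplit("/", 1)[-1]
        try:
            urllib.request.urlretrieve(url, str(dest))
        except OSError as exc:
            # Fail loudly at minute one, not hour six.
            sys.exit(f"FATAL: could not fetch {url}: {exc}")
        print(f"fetched {url} -> {dest}")

if __name__ == "__main__":
    fetch_all(UPSTREAMS)
```

If any upstream is unreachable, the whole run dies immediately with a single clear error, rather than a parallel job burying it hours later.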
Further commentary on #27: if every tool we realistically use (Noctua, AmiGO, etc.) loaded a single realized ontology with the right pedigree information in it, we could just reference those locally and not have to worry about catalogs.
@kltm ontobio uses the …
@balhoff I wanted to follow up on #27 (comment) above, in reference to Line 88 in 3d3043f.
Are there plans to have these as a merged ontology, or should we work with @dougli1sqrd to make sure these are all available locally by the time we hit this point in the pipeline?
Talking to @balhoff: http://purl.obolibrary.org/obo/cl/cl-basic.owl is already in go-gaf, leaving us to decide whether/how to fold in:
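However that fold-in list shakes out, one option is to realize a single merged ontology up front with ROBOT's merge command; a rough sketch, where the input file names are illustrative rather than the agreed set:

```python
import subprocess

# Illustrative inputs: locally staged copies of the ontologies to fold in.
inputs = ["downloads/go.owl", "downloads/cl-basic.owl"]

cmd = ["robot", "merge"]
for path in inputs:
    cmd += ["--input", path]
cmd += ["--output", "downloads/merged.owl"]

# ROBOT exits nonzero on failure, so check=True stops the pipeline here
# rather than letting a partial merge leak into later stages.
subprocess.run(cmd, check=True)
```

A merged file like this is also what the earlier comment about every tool loading "a single realized ontology" would point at.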
GAF upstream sources are now being downloaded and used in the pipeline. Is there anything left in this ticket?
We still have ontologies all over the place. We might want to make another issue, but essentially we need to have and enforce catalogs (or similar) so that there is no leaking during a run. For example, I believe there is currently a place that tags the public NEO load, meaning that it can be a month behind.
Talking to @dougli1sqrd, it turns out that the "mixin" process in ontobio will still grab the remote file (PAINT in this case), possibly causing errors if the resource is down, as experienced on 2019-12-05. That said, we probably don't want to keep going down the path of "tricking" ontobio by laying things out on the filesystem, but rather move to a more "catalog-like" system where the downloader generates a mapping file for the run that is then consumed by ontobio.
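A sketch of what that could look like, assuming a simple JSON mapping file; none of this is ontobio's actual API, it just illustrates the shape of "downloader writes the mapping, consumers resolve through it":

```python
import json
import pathlib

MAPPING_FILE = "run-mapping.json"  # hypothetical per-run file

def record_download(url, local_path, mapping_file=MAPPING_FILE):
    """Called by the downloader after each successful fetch."""
    path = pathlib.Path(mapping_file)
    mapping = json.loads(path.read_text()) if path.exists() else {}
    mapping[url] = str(local_path)
    path.write_text(json.dumps(mapping, indent=2))

def resolve(url, mapping_file=MAPPING_FILE):
    """Called by consumers (e.g. an ontobio wrapper) instead of fetching
    the URL; refuses to fall back to the network, so nothing leaks."""
    mapping = json.loads(pathlib.Path(mapping_file).read_text())
    try:
        return mapping[url]
    except KeyError:
        raise RuntimeError(f"{url} was not downloaded up front") from None
```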
Just a reminder to ourselves that this is still occurring: #27 (comment)
Download ontologies and "annotation" upstreams once and reuse them during the run in all stages. This would be accomplished with some combination of catalogs and/or ROBOT.
This serves two important purposes: download failures surface up front instead of hours into a run, and every stage consumes the same consistent, locally staged data.
Ideally, once the initial data grabs are done up front, the pipeline stops talking to the outside world until it starts publishing.
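For the ontology half, the "catalogs" mentioned above could be OASIS XML catalogs of the kind the OWL API and Protégé read (catalog-v001.xml); a sketch that writes one from the same URL-to-file mapping, with an example entry from this thread:

```python
import pathlib
from xml.sax.saxutils import quoteattr

def write_catalog(mapping, path="catalog-v001.xml"):
    """Map ontology IRIs to locally staged copies so that no stage
    resolves an IRI over the network mid-run."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">',
    ]
    for iri, local in sorted(mapping.items()):
        lines.append(f"  <uri name={quoteattr(iri)} uri={quoteattr(local)}/>")
    lines.append("</catalog>")
    pathlib.Path(path).write_text("\n".join(lines) + "\n")

# Example entry using the CL IRI discussed above.
write_catalog({
    "http://purl.obolibrary.org/obo/cl/cl-basic.owl": "downloads/cl-basic.owl",
})
```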