Implement an identifier mapping service #41
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##              main      #41      +/- ##
===========================================
- Coverage   100.00%   99.76%   -0.24%
===========================================
  Files            5        7       +2
  Lines          339      422      +83
  Branches        76       95      +19
===========================================
+ Hits           339      421      +82
- Partials         0        1       +1
```
Great work! I did want to note that the generic RDF store can be in the same endpoint as the mapping service. I can try to have a look at it this afternoon.
The first-pass implementation seemed to go well, but after I set this up to run as a service, I noticed the following problem. Given a stupid-simple triple store with two triples:

```
<https://www.ebi.ac.uk/chebi/searchId.do?chebiId=1> rdfs:label "dummy label"
<http://purl.obolibrary.org/obo/CHEBI_1> rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_2>
```

and the following query (which should return the dummy label of ChEBI:1 after mapping the URIs):

```sparql
SELECT ?label WHERE {
    ?child rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_2> .
    SERVICE <http://127.0.0.1:5000/sparql> {
        ?child owl:sameAs ?child_mapped .
    }
    ?child_mapped rdfs:label ?label .
}
```

I get a query to the service that looks like:

```sparql
SELECT REDUCED * WHERE {
    ?child owl:sameAs ?child_mapped .
}
VALUES (?child) {
    (<http://purl.obolibrary.org/obo/CHEBI_1>)
    (<http://purl.obolibrary.org/obo/CHEBI_2>)
}
```

I'm not sure if this is even valid SPARQL, since the `VALUES` block appears outside the `WHERE` clause.

How to reproduce: first, run the service, then run the following example code:

```python
from rdflib import RDFS, Graph, Literal, URIRef
from tabulate import tabulate


def main():
    graph = Graph()
    graph.add(
        (
            URIRef("https://www.ebi.ac.uk/chebi/searchId.do?chebiId=1"),
            RDFS.label,
            Literal("label 1"),
        )
    )
    graph.add(
        (
            URIRef("http://purl.obolibrary.org/obo/CHEBI_1"),
            RDFS.subClassOf,
            URIRef("http://purl.obolibrary.org/obo/CHEBI_2"),
        )
    )
    # Get labels of children of CHEBI_2
    res = graph.query(
        """
        SELECT ?label WHERE {
            ?child rdfs:subClassOf <http://purl.obolibrary.org/obo/CHEBI_2> .
            SERVICE <http://127.0.0.1:5000/sparql> {
                ?child owl:sameAs ?child_mapped .
            }
            ?child_mapped rdfs:label ?label .
        }
        """
    )
    print(tabulate(list(res)))


if __name__ == "__main__":
    main()
```
The query is valid, but broken:

```sparql
SELECT REDUCED * WHERE {
    ?child owl:sameAs ?child_mapped .
}
VALUES (?s) {
    (<http://purl.obolibrary.org/obo/CHEBI_1>)
    (<http://purl.obolibrary.org/obo/CHEBI_2>)
}
```

should be

```sparql
SELECT REDUCED * WHERE {
    ?child owl:sameAs ?child_mapped .
}
VALUES (?child) { # note the difference here
    (<http://purl.obolibrary.org/obo/CHEBI_1>)
    (<http://purl.obolibrary.org/obo/CHEBI_2>)
}
```

That looks like an error in the SPARQL engine sending the query. A `VALUES` clause outside the `WHERE` block is valid and is usually only done in federated queries, as in the broken query.
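To make the binding injection concrete, here is a small illustrative helper (not rdflib code; the function name and signature are made up) showing how a federating engine can serialize its known bindings into a final `VALUES` block for the `SERVICE` endpoint, keyed by the variable actually used in the service pattern:

```python
def build_service_query(service_pattern: str, var: str, bound_uris: list[str]) -> str:
    """Append the engine's known bindings for `var` as a final VALUES block."""
    rows = "\n".join(f"    (<{uri}>)" for uri in bound_uris)
    return (
        "SELECT REDUCED * WHERE {\n"
        f"    {service_pattern}\n"
        "}\n"
        f"VALUES (?{var}) {{\n"
        f"{rows}\n"
        "}"
    )


query = build_service_query(
    "?child owl:sameAs ?child_mapped .",
    "child",  # must match the variable in the pattern, not a stale name like ?s
    [
        "http://purl.obolibrary.org/obo/CHEBI_1",
        "http://purl.obolibrary.org/obo/CHEBI_2",
    ],
)
```

The bug discussed above amounts to emitting the `VALUES` header with the wrong variable name, so the inline data never joins with the service pattern.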
The `?s` was only a typo when I changed the variable names to give better context. The issue is with the `VALUES` on the outside: the `triples` function in rdflib doesn't get passed anything to let it know what the actual values are.
Apparently, the SPARQL sent to the service is generated by the following code: https://github.com/RDFLib/rdflib/blob/d2c9edc7a3db57e1b447c2c4ba06e1838b7ada0c/rdflib/plugins/sparql/evaluate.py#L372-L393. Still, the issue is if the [...]
The final `VALUES` clause is valid SPARQL. The question is why that does not give a join with the inline data being the feeding algebra element. It's been a while since I did any Python rdflib development, and I had not migrated my development environment. The error is certainly not in your code. At most, I need to add an optimizer-like step to make sure inline data has higher priority (i.e., is more likely to be the left side of a join). At least, that is what I think. Let me get a debugger on it and have a look. Looking at it: if the order of `p1` and `p2` is inverted at line 707 of rdflib's `algebra.py`, the code works fine. I think it is a valid optimization to, by default, join from the known bindings into the local store, and I think it is worth raising this on the rdflib mailing list/issue tracker.
I think RDFLib/rdflib#2125 might be related. I monkey-patched the rdflib code for the function I mentioned with the following (imports added for context):

```python
import re
from textwrap import dedent

from pyparsing import ParseException
from rdflib.plugins.sparql import parser
from rdflib.plugins.sparql.sparql import QueryContext


def _buildQueryStringForServiceCall(ctx: QueryContext, match: re.Match) -> str:
    service_query = match.group(2)
    try:
        parser.parseQuery(service_query)
    except ParseException:
        prefixes = "\n".join(
            f"PREFIX {prefix}: {ns.n3()}"
            for prefix, ns in ctx.prologue.namespace_manager.store.namespaces()
        )
        body = dedent(f"""\
            SELECT REDUCED * WHERE {{
                {_get_init(ctx)}
                {service_query.strip()}
            }}
        """)
        return f"{prefixes}\n{body}"
    else:
        return service_query + _get_init(ctx)


def _get_init(ctx: QueryContext) -> str:
    sol = ctx.solution()
    if not len(sol):
        return ""
    variables = " ".join(v.n3() for v in sol)
    variables_bound = " ".join(ctx.get(v).n3() for v in sol)
    return "VALUES (" + variables + ") {(" + variables_bound + ")}"
```

and now it creates SPARQL that is formatted correctly.
@cthoyt I asked on Stack Overflow. I suspect the best way is to make our own extension of the `SPARQLProcessor` that does this join reordering before evaluation.
This seems doable, but because the `SPARQLProcessor` delegates to some module-level functions, I am worried there will be no way to do this other than with lots of code duplication.
Yeah, let's give the Stack Overflow post a few days to see if we get a reply. RDFLib developers prefer to answer there, but if needed we can try the dev mailing list.
Hi @cthoyt and @JervenBolleman, I had a look at the implementation and might be able to give some insights on exposing the RDFLib graph as a SPARQL endpoint.

The main thing rdflib-endpoint does is take an RDFLib graph and define an API endpoint that handles everything a SPARQL endpoint is expected to handle when passing queries to the RDFLib graph: queries through GET and POST, content negotiation through `Accept` headers, etc. Afaik those things are required if you want your API endpoint to be considered a valid SPARQL endpoint by other SPARQL endpoints, so that those endpoints are able to send and resolve `SERVICE` queries against it.

Deploying an endpoint from your custom graph class would look like this:

```python
from rdflib_endpoint import SparqlEndpoint

curieG = CURIEServiceGraph()
app = SparqlEndpoint(
    graph=curieG,
    cors_enabled=True,
    # Metadata used for the SPARQL service description and Swagger UI:
    title="SPARQL endpoint to serve CURIE mappings",
    description="A SPARQL endpoint to serve CURIE mappings",
    version="0.1.0",
    public_url='https://your-endpoint-url/sparql',
)
```

Then run the app with [...]

I can add it with tests if you want @cthoyt, but since you already have everything set up, you might want to do it directly; let me know what you prefer.
@vemonet we've got a minimal version of this implemented in https://github.com/cthoyt/curies/blob/1c05478ff12764e5fb30d70799e9fd8984fa1ab4/src/curies/mapping_service.py#L178-L207 - I think the best way to go right now is to focus on making the SPARQL and the service get interpreted correctly; then, in a follow-up, we can get fancy and compliant. Thanks for the comments!
@cthoyt I had a thought about the normal bioregistry.io SPARQL endpoint being both this special mapping endpoint and the normal one. I added a pull request regarding that as material for inspiration :)
@JervenBolleman I didn't see any activity on your Stack Overflow post. Do you want to try messaging the RDFLib dev mailing list?
I tried their Matrix channel first; if that does not get a reply, the dev mailing list it is.
@JervenBolleman in the meantime, I have vendored the algebra code and made the modification you suggested. Interestingly, this works on py37 and locally, but the py311 test shows the subjects and objects are getting out of order.
See the alternative approach for optimizing the query after it's been parsed: RDFLib/rdflib#2257
@JervenBolleman in b243e33, I implemented the post-processor you suggested in RDFLib/rdflib#2257. It seems to work non-deterministically: sometimes it's fine, and other times it returns results with the subjects and objects switched. Do you think you know what might be going on?
@cthoyt not at first sight. I will try to have a look at it this week, but no promises. Well, I can reproduce the issue with subject and object switched, with it being OK in one run but not the other. I see; I expect this is due to the iteration order in the `_stmt` not being stable in the test code.
This is some really interesting insight! I assumed that the `ResultRow` objects were like tuples that corresponded to the order of the variables in the query, but maybe that's not the case. I updated the implementation in 32d1044; hopefully this leads to deterministic tests passing :) If so, I will call this PR finished.
This will allow for hacking in a custom SPARQL processor that, e.g., rewrites some nodes, as we demonstrated in biopragmatics/curies#41.
Closes #686

This adds the URI mapping service implemented in biopragmatics/curies#41. It will allow SPARQL queries to be written that call the Bioregistry as a service for generating URI mappings (e.g., between OBO PURLs, Identifiers.org URIs, and first-party URIs whose URI prefixes are stored in the Bioregistry). Here's a simplified example that doesn't require any triple store and can be directly executed with RDFLib:

```sparql
SELECT DISTINCT ?s ?o WHERE {
    VALUES ?s {
        <http://purl.obolibrary.org/obo/CHEBI_24867>
        <http://purl.obolibrary.org/obo/CHEBI_24868>
    }
    SERVICE <https://bioregistry.io/sparql> {
        ?s owl:sameAs ?o
    }
}
```

returns the following (some rows not shown; you should get the idea):

| subject | object |
|---------|--------|
| http://purl.obolibrary.org/obo/CHEBI_24867 | http://purl.obolibrary.org/obo/CHEBI_24867 |
| http://purl.obolibrary.org/obo/CHEBI_24867 | http://identifiers.org/chebi/24867 |
| http://purl.obolibrary.org/obo/CHEBI_24867 | https://www.ebi.ac.uk/chebi/searchId.do?chebiId=24867 |
| ... | ... |

This is built on top of [`curies.Converter.expand_pair_all`](https://curies.readthedocs.io/en/latest/api/curies.Converter.html#curies.Converter.expand_pair_all), which itself is populated by all of the URI format strings available in the Bioregistry. To see examples of the possible ChEBI URIs, see https://bioregistry.io/registry/chebi.
References biopragmatics/bioregistry#686.
This pull request implements the identifier mapping service described in *SPARQL-enabled identifier conversion with Identifiers.org*. The goal of such a service is to act as an interoperability layer in SPARQL queries that federate data from multiple places which potentially use different IRIs for the same things.

This can be demonstrated in the following SPARQL query for proteins in a specific model in the BioModels Database and their associated domains in UniProt:
The SPARQL endpoint running at the web address XXX takes in the bound values for `?biomodels_protein` one at a time and dynamically generates triples with `owl:sameAs` as the predicate, mapping them to the other equivalent IRIs (based on the definition of the converter) as the objects. This allows for gluing together multiple services that use different URIs for the same entities; in this example, there are two ways of referring to UniProt proteins.

**Implementation Notes**
By @JervenBolleman's suggestion in biopragmatics/bioregistry#686 (comment), this was done by overriding the `triples` method of RDFLib's graph data structure. Therefore, any arbitrary (extended) prefix map loaded in a `curies.Converter` can be used to run this service. The main use will be to deploy this as part of the Bioregistry, but it's nice that it can be reused for any arbitrary use case when implemented in this package (which is lower-level and more generic than the Bioregistry).

**Follow-up**
**Alternate/Past Ideas**

- Hack in the query parser
- Hack the RDF generator for services