-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BREAKING CHANGE: Don't use publicID
as the name for the default graph.
#2406
BREAKING CHANGE: Don't use publicID
as the name for the default graph.
#2406
Conversation
I marked this ready for review as I want to get feedback on the essential changes. I will probably clean up the docstrings a bit still. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is a supported use of publicID. The test succeeds with current main but fails with the change to Graph.parse
:
def test_parse_graph_with_publicid_as_new_cg_subgraph():
from test.data import tarek, hates, cheese
cg = ConjunctiveGraph()
assert len(list(cg.contexts())) == 0 # Empty contexts exclded from listing
# Add a statement to the default graph/context
cg.add((tarek, hates, cheese))
assert len(cg) == 1
assert len(list(cg.contexts())) == 1
assert list(cg.contexts())[0] == cg.default_context
# Parse ttl graph into subgraph identified with publicID
cg.parse(
data="<urn:example:tarek> <urn:example:likes> <urn:example:pizza> .",
publicID="urn:x-rdflib:context-a", # Now ignored
format="ttl",
)
# New subgraph should be created
assert len(list(cg.contexts())) == 2
# And the newly-created subgraph wth the publicID in the above turtle should exist
assert URIRef("urn:x-rdflib:context-a") in [c.identifier for c in list(cg.contexts())]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, the test you shared will work on main and fail on this PR branch, and this is because on main publicID
is used as the identifier for the default graph in addition to the base for relative URI resolution in source documents, while in this PR branch, it is only used as the base for relative URI resolution.
Do note, this PR together with every issue it addresses is labelled as breaking change, so with that as baseline, with the issues that this PR tries to address, what we need to answer is:
- How should parse work? This is a more critical question than how does parse work in main, given this is labelled as a breaking change, so that is clearly signalling something will be different from the main branch.
- What should we do about the issues this tries to address?
If parse
should work the way it works in main, we close the issues and move on as then they are not issues. If however the issues are valid, then parse
should not work the way it works in main
, we have to change how it works. And then the question is only what we should change it to, or even more pertinently, how do we make it better than it is.
I think the way it is working in main, i.e. using publicID
as the name of the graph that triples inside the default graph is loaded into, is wrong. This is also not what the documentation of publicID
suggests will happen.
Lines 1416 to 1418 in ad56044
- ``publicID``: the logical URI to use as the document base. If None | |
specified the document location is used (at least in the case where | |
there is a document location). |
Given that description, what I would expect is that the publicID
is what relative URIs will be resolved against. And that is also what it is used for, and the only thing it should be used for, I think. And I don't see why the base for URI relative URI resolution should be the same as the graph name that default triples are being loaded into, which should not be named to begin with because as per the spec, has no name:
https://www.w3.org/TR/rdf11-concepts/#section-dataset
Exactly one default graph, being an RDF graph. The default graph does not have a name and may be empty.
Another way to think about this is, what is worse:
- Not having the content of default graph in sources loaded into the default graph of Dataset/ConjunctiveGraph
- Having to explicitly load something into a sub-graph if you don't want it in the default graph
I think 1 is worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All that being said, the long and short of it is: How should parse work and how do we address the issues at hand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if you think the code should work the way it does in main for loading Turtle into a ConjunctiveGraph, how should it work when you change ttl
to trig
in your code?
I think if it is difficult to find consensus on what {Dataset,ConjunctiveGraph}.parse(...)
should do if the source does not support named graphs, the best option is to just refuse to parse sources that don't support named graphs into Datasets or ConjuctiveGraphs. So then the right behaviour for your example would be to raise an exception, as the operation cannot be defined in an unambiguous way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to understand alternative ways how the issues this tries to address can be addressed from a public API perspective. If there are more elegant ways, we should try it, but I do think it is critical that we address the issues in one way or another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some documentation and updated the description to make it clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the new changes here. publicID
is used to resolve relative IRIs (I actually never knew it was used as the name of the graph to load the source's default graph statements for ConjunctiveGraph/Dataset
).
So what is the behaviour now with this change when loading source data in ttl
format into a Dataset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually never knew it was used as the name of the graph to load the source's default graph statements for
ConjunctiveGraph/Dataset
.
Me neither until I started to investigate the ramifications of the change - but as aucampia has noted, it is documented.
It's probably worth bearing in mind that the implementation of ConjunctiveGraph
predates the W3's definition of Dataset. In RDFLib, a Graph
object always has an identifier so I guess it was both unavoidable and natural for the Graph
object that was bound to ConjunctiveGraph.default_context
to be able to have a user-definable identifier (somewhere in the code/issues someone notes the difficulty of distinguishing a BNode-identified default graph from a BNode-identified named graph).
And this is probably why ConjunctiveGraph
can behave a little unintuitively when passed an identifier:
cg = ConjunctiveGraph()
assert cg.default_context.identifier == cg.identifier # False
# AssertionError: assert BNode('Nd463bea483a649c6918c222b20ad8832') == BNode('N1feef746236f44cea47c83ed58f3e752')
cg = ConjunctiveGraph(identifier=URIRef('urn:example:context-0'))
assert cg.default_context.identifier == cg.identifier # True
Dataset
is a different beast and benefits from a concrete assertion:
“Exactly one default graph, being an RDF graph. The default graph does not have a name.”
Unlike ConjunctiveGraph
, Dataset.__init__()
does not take an identifier
param, a BNode is bound to the Dataset
object’s identifier
property but the (ahem, unnamed, sigh) default graph has a fixed, unchageable identifier
property bound to <urn:x-rdflib:default>
:
ds = Dataset()
assert isinstance(ds.identifier, BNode)
assert ds.default_context.identifier == URIRef('urn:x-rdflib:default')
(I s'pose it's technically user-modifiable by rebinding DATASET_DEFAULT_ID)
So what is the behaviour now with this change when loading source data in ttl format into a Dataset?
Current state-of-play is unchanged, source ttl is loaded into the Graph
bound to default_context
:
ds = Dataset()
ds.parse(location=TEST_DATA_DIR / "variants" / "diverse_triples.ttl", format="ttl")
assert len(ds) == 5
assert list(ds.contexts()) == [ds.default_context]
assert len(ds.default_context) == 5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if you think the code should work the way it does in main
I wouldn't aspire to hold an opinion. My intention was merely to bring the issue to the attention of reviewers in as clear and straightforward a manner as I could so that any potential impact on existing downstream code could be discussed and advice prepared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@edmondchuc I never answered you explicitly on this question
So what is the behaviour now with this change when loading source data in ttl format into a Dataset?
The new behaviour is that it goes into the default graph.
Lines 2214 to 2215 in 528465c
If the source is in a format that does not support named graphs it's triples | |
will be added to the default graph (i.e. `Dataset.default_context`). |
When parsing data into a `ConjunctiveGraph` or `Dataset`, the triples in the default graphs in the sources were loaded into a graph named `publicID`. This behaviour has been changed, and now the triples from the default graph in source RDF documents will be loaded into `ConjunctiveGraph.default_context` or `Dataset.default_context`. The `publicID` parameter to `ConjunctiveGraph.parse` and `Dataset.parse` constructors will now only be used as the base URI for relative URI resolution.
0cf29b0
to
db7313a
Compare
publicID
as the name for the default graph.
Moving a topic from the code review thread into the main Issue thread:
This is an issue I have been facing in my PySHACL application. In the past I used to use This gave arise to an issue when parsing ttl and ntriples files, where the Dataset has a default context graph (named For me personally the ideal behaviour would be:
I know that would introduce inconsistent behaviour when any of those conditions differ slightly. |
Yeah, that's wrong, the contents of statements without a context (i.e. triples) shouldn't be placed in an arbitrary new graph, it's a behaviour inherited from a longstanding bug in ConjunctiveGraph which aucampia has addressed with this PR.
and
What we have atm is that ds = Dataset()
ds.parse(location=TEST_DATA_DIR / "variants" / "diverse_triples.ttl", format="ttl")
assert len(ds.default_context) == 5
ds.parse(location=TEST_DATA_DIR / "variants" / "simple_triple.ttl", format="ttl")
assert len(ds.default_context) == 6 |
Fix a typo
What I suggested in the "Upgrading 6 to 7" document I added here is to do: dataset = Dataset()
dataset.get_context("example:graph_name").parse("http://example.com/source.ttl", format="turtle") I guess if you don't know whether the source supports named graphs or not, this is not that helpful.
If the default context is empty, would it not be fine to just parse it directly into the empty default graph instead of parsing it into a new graph, and setting the new graph as the default graph? Regarding the second point, I think we should try and make that a given - or at least we should not support cases where the default context has another name and try and prevent it, as per the spec the default graph is not named. I guess for now the special reserved name is okay, but I would say we should not allow changing the way the default graph is named as it is not clear in what sense that is still the default graph. Just to understand this better, what would happen if the default context graph is not empy? Would the triples from the source then be added to it? |
I will add some tests in separate PRs for:
|
Just as a final warning, with this change However, that being said, if all conditions you give in your steps hold (i.e. "If the default context graph is empty, has no triples in it"), the behaviour with this PR will have the same outcome. |
I'm going to merge this by this weekend if there is no further feedback, I may include the additional tests for unrelated functionality first, but I am uncertain if I will have time and there are other changes that are urgent. |
Adding tests for relative URIs will be easier with this, as I want to use the variants test infrastructure for it, and this has some changes there that eliminate the hacks related to the default graph not being the default graph. If there are problems discovered with the tests that I will add for relative URIs and parsing multiple documents or one document multiple times I will fix it. |
Thanks for sharing @gjhiggins - @pchampin as you raised the original issue it may be worth getting your input here. |
Sorry for the late response. I am satisfied with this change. Reusing |
`LOAD ... INTO GRAPH` stopped working correctly after the change to handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0 (<RDFLib#2406>). This is because `LOAD` evaluation relied on `publicID` to select the graph name. So after <RDFLib#2406> data would be loaded into the default graph even if a named graph is specified. This change adds tests for `LOAD ... INTO GRAPH` and fixes the load evaluation.
`LOAD ... INTO GRAPH` stopped working correctly after the change to handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0 (<RDFLib#2406>). This is because `LOAD` evaluation relied on `publicID` to select the graph name. So after <RDFLib#2406> data would be loaded into the default graph even if a named graph is specified. This change adds tests for `LOAD ... INTO GRAPH` and fixes the load evaluation.
`LOAD ... INTO GRAPH` stopped working correctly after the change to handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0 (<RDFLib#2406>). This is because `LOAD` evaluation relied on `publicID` to select the graph name. So after <RDFLib#2406> data would be loaded into the default graph even if a named graph is specified. This change adds tests for `LOAD ... INTO GRAPH` and fixes the load evaluation. A consequence of this change is also that relative IRI lookup for graphs loaded with `LOAD ... INTO GRAPH` is now relative to the source document URI instead of the base URI of the graph being loaded into, which is more correct.
`LOAD ... INTO GRAPH` stopped working correctly after the change to handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0 (<RDFLib#2406>). This is because `LOAD` evaluation relied on `publicID` to select the graph name. So after <RDFLib#2406> data would be loaded into the default graph even if a named graph is specified. This change adds tests for `LOAD ... INTO GRAPH` and fixes the load evaluation. A consequence of this change is also that relative IRI lookup for graphs loaded with `LOAD ... INTO GRAPH` is now relative to the source document URI instead of the base URI of the graph being loaded into, which is more correct.
`LOAD ... INTO GRAPH` stopped working correctly after the change to handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0 (<#2406>). This is because `LOAD` evaluation relied on `publicID` to select the graph name. So after <#2406> data would be loaded into the default graph even if a named graph is specified. This change adds tests for `LOAD ... INTO GRAPH` and fixes the load evaluation. A consequence of this change is also that relative IRI lookup for graphs loaded with `LOAD ... INTO GRAPH` is now relative to the source document URI instead of the base URI of the graph being loaded into, which is more correct.
Summary of changes
When parsing data into a
ConjunctiveGraph
orDataset
, the triples in thedefault graphs in the sources were loaded into a graph named
publicID
.This behaviour has been changed, and now the triples from the default graph in
source RDF documents will be loaded into
ConjunctiveGraph.default_context
orDataset.default_context
.The
publicID
parameter toConjunctiveGraph.parse
andDataset.parse
constructors will now only be used as the base URI for relative URI resolution.
Checklist
the same change.
./examples
.so maintainers can fix minor issues and keep your PR up to date.