proposal to solve issue #74 #78

Open: wants to merge 7 commits into base: main
76 changes: 75 additions & 1 deletion spec/docs/joinconditions.md
@@ -43,7 +43,7 @@ has exactly one value for each of the following two properties:
* a [child map]() (`rml:childMap`),
whose value is an [Expression Map]() (`rml:ExpressionMap`), which
MUST include references that exists in the [Logical Source]()
-of the [Parent Triples Map]() that contains the [Referencing Object Map]()
+of the [Triples Map]() that contains the [Referencing Object Map]()
Collaborator

This correction makes sense, but the whole sentence doesn't IMO.

I don't think we can say that references exist in a logical source. I also don't think it should be a MUST.

I think the sentence should be something along the lines of:

  • a child map property (rr:childMap) whose value is a child map (rml:ChildMap).

And then there should be explanation about what a rml:ChildMap is.
Whether the child map expression resolves is not really a concern for the spec IMO.

or it should have a constant value.

* a [parent map]() (`rml:parentMap`),
@@ -96,3 +96,77 @@ then the `rml:child` shortcut could be used.
rml:logicalSource <LS2> ;
rml:subjectMap <#SM2> .
```
## Join types

If the [Logical Source]() of the [Triples Map]() that contains the [Referencing Object Map]()
and the [Logical Source]() of the [Referencing Object Map]()'s [Parent Triples Map]() are not identical,
then the referencing object map must have at least one join condition.

A [Logical Source]() is considered as identical to another [Logical Source]()
when the set of objects at the end of the property paths starting with `rml:source` and starting with `rml:iterator` are identical.
Collaborator

Why property paths?

Suggested change
when the set of objects at the end of the property paths starting with `rml:source` and starting with `rml:iterator` are identical.
when
* the value of the source property (`rml:source`) of both logical sources is equal, and
* the value of the reference formulation property (`rml:referenceFormulation`) of both logical sources is equal, and
* the value of the iterator property (`rml:iterator`) of both logical sources is equal, or both logical sources do not specify the iterator property.
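
The three suggested conditions can be sketched in Python. This is only an illustration under assumptions: `LogicalSource` is a hypothetical flat record, not an RML API, and the `rml:source` value is compared as a plain string (an IRI or a literal) without looking into the source description.

```python
from typing import NamedTuple, Optional


class LogicalSource(NamedTuple):
    """Hypothetical flat view of a logical source, for illustration only."""
    source: str                     # value of rml:source, compared as a plain string
    reference_formulation: str      # value of rml:referenceFormulation
    iterator: Optional[str] = None  # value of rml:iterator, None when absent


def identical(a: LogicalSource, b: LogicalSource) -> bool:
    """Apply the three suggested conditions; two absent iterators count as equal."""
    return (
        a.source == b.source
        and a.reference_formulation == b.reference_formulation
        and a.iterator == b.iterator
    )


# Values taken from the first example in the diff.
ls1 = LogicalSource("S1", "rml:JSONPath", "$.jsonpath.expression")
ls2 = LogicalSource("S2", "rml:JSONPath", "$.jsonpath.expression")
ls3 = LogicalSource("S1", "rml:JSONPath", "$.jsonpath.expression2")
```

Under this plain value comparison, `identical(ls1, ls3)` is false because the iterators differ. `identical(ls1, ls2)` is also false, because `S1` and `S2` are different identifiers, although the example in the diff intends `<LS1>` and `<LS2>` to be identical.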

Collaborator Author

@pmaria I couldn't think of another way to specify that the descriptions of nested sources should also be equal (so nested source descriptions can have different identifiers but still have the same values for all nested properties, e.g. and are equal in the example, even if they have different identifiers).

Member

Maybe infinite nesting is not something we want to support in a first go. Couldn't we just extend Pano's suggestion with something like:

when
* the value of the reference formulation property (`rml:referenceFormulation`) of both logical sources is equal,
* the value of the iterator property (`rml:iterator`) of both logical sources is equal, or both logical sources do not specify the iterator property, and
* the source properties (`rml:source`) of both logical sources either both point to literal objects with equal values, or both point to resource objects for which all triples of those resources are equal.

Member

as discussed today:

when
* the value of the reference formulation property (`rml:referenceFormulation`) of both logical sources is equal,
* the value of the iterator property (`rml:iterator`) of both logical sources is equal, or both logical sources do not specify the iterator property, and
* the sub RDF graphs of the source property (`rml:source`) of both logical sources that only contain RML actionable properties of the source access descriptions are isomorphic.

This last point means that each source access description needs to list its actionable properties.
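
A minimal sketch of the last condition, in Python and under loud assumptions: source descriptions are represented as nested dictionaries (property to list of values), the `ACTIONABLE` set is hypothetical because no such list exists yet, and only tree-shaped descriptions are handled. It is a simplification, not a general RDF graph isomorphism check.

```python
from typing import Any, Dict, FrozenSet, List, Set

# Hypothetical set of actionable properties; the spec has not defined such a list.
ACTIONABLE: Set[str] = {"void:dataDump", "csvw:url"}


def canonical(description: Dict[str, List[Any]], actionable: Set[str]) -> FrozenSet:
    """Reduce a description to its actionable (property, value) pairs.

    Node identifiers are not part of the dictionary, so they are ignored.
    Nested descriptions are reduced recursively (tree-shaped input assumed).
    """
    pairs = set()
    for prop, values in description.items():
        if prop not in actionable:
            continue  # non-actionable properties do not take part in the comparison
        for value in values:
            if isinstance(value, dict):
                value = canonical(value, actionable)
            pairs.add((prop, value))
    return frozenset(pairs)


def same_source(a: Dict[str, List[Any]], b: Dict[str, List[Any]],
                actionable: Set[str] = ACTIONABLE) -> bool:
    """Compare two source descriptions on their actionable properties only."""
    return canonical(a, actionable) == canonical(b, actionable)


s1 = {"rdf:type": ["rml:Source", "void:Dataset"], "void:dataDump": ["file:///data/dump.nt"]}
s2 = {"rdfs:label": ["illustrative label"], "void:dataDump": ["file:///data/dump.nt"]}
s3 = {"void:dataDump": ["file:///data/other.nt"]}  # hypothetical second file
```

Here `same_source(s1, s2)` holds because both descriptions agree on the only actionable property present, while `same_source(s1, s3)` does not.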

Collaborator

My last understanding of our discussion was that we were not going to go for fully isomorphic sources, but for explicitly defining which parts of each source type should be isomorphic. This would be my preference.

In the examples below, `<LS1>` and `<LS2>` are identical, but `<LS1>` and `<LS3>` are not identical.
Collaborator

If I'm reading this right, engines must check if rml:source and rml:iterator are string/IRI identical.
If they are, the LS is considered the same, right?

```
<LS1>
a rml:LogicalSource;
rml:source <S1>;
rml:referenceFormulation rml:JSONPath;
rml:iterator "$.jsonpath.expression".
<S1>
a rml:Source, void:Dataset;
void:dataDump <file:///data/dump.nt>.

<LS2>
a rml:LogicalSource;
rml:source <S2>;
rml:referenceFormulation rml:JSONPath;
rml:iterator "$.jsonpath.expression".
<S2>
a rml:Source, void:Dataset;
void:dataDump <file:///data/dump.nt>.

<LS3>
a rml:LogicalSource;
rml:source <S1>;
rml:referenceFormulation rml:JSONPath;
rml:iterator "$.jsonpath.expression2".
```

```
<LS1>
a rml:LogicalSource;
rml:source [ a rml:Source, csvw:Table;
csvw:url "/absolute/path/to/data.csv";
];
rml:referenceFormulation rml:CSV.

<LS2>
a rml:LogicalSource;
rml:source <S2>;
rml:referenceFormulation rml:CSV.

<S2>
a rml:Source, csvw:Table;
csvw:url "/absolute/path/to/data.csv".

<LS3>
a rml:LogicalSource;
rml:source [ a rml:Source, csvw:Table;
csvw:url "/relative/path/to/data.csv";
];
rml:referenceFormulation rml:CSV.
```

If the [Referencing Object Map]() has no join condition
(which is only allowed when the [Logical Source]() of the [Triples Map]() that contains the [Referencing Object Map]()
and the [Logical Source]() of the [Referencing Object Map]()'s [Parent Triples Map]() are identical), a natural join is executed.
In reality this means that the [Logical Source]() is used in its original form when generating the related RDF triples.
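
The no-join-condition case can be sketched in Python as follows. The helper name, the callables standing in for subject maps, and the `ex:` IRIs are all hypothetical. The point is only that each logical iteration is used once, for both the child subject and the parent subject.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Iteration = Dict[str, str]
Triple = Tuple[str, str, str]


def triples_without_join(
    iterations: Iterable[Iteration],
    child_subject: Callable[[Iteration], str],
    predicate: str,
    parent_subject: Callable[[Iteration], str],
) -> List[Triple]:
    """Generate one triple per iteration; no second source is consulted."""
    return [
        (child_subject(it), predicate, parent_subject(it))
        for it in iterations
    ]


# Illustrative data: one shared logical source with two iterations.
rows = [{"id": "1", "dept": "10"}, {"id": "2", "dept": "20"}]
triples = triples_without_join(
    rows,
    child_subject=lambda it: f"ex:employee/{it['id']}",
    predicate="ex:worksIn",
    parent_subject=lambda it: f"ex:department/{it['dept']}",
)
```

Each row yields exactly one triple, so `triples` has the same length as `rows`.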

If the [Referencing Object Map]() has one or more join conditions, an inner join is executed.
The related RDF triples are generated using the [=n-ary Cartesian product=]
Collaborator

If we can, we should refer to a general description of generating triples, so that we don't have to repeat here that we use the n-ary Cartesian product.

Collaborator Author

@pmaria Can you please make a suggestion?

of the logical iteration of the [Logical Source]() of the [Triples Map]() that contains the [Referencing Object Map]()
and the logical iteration of the [Logical Source]() of the [Referencing Object Map]()'s [Parent Triples Map](), and
retaining only the combination of those logical iterations for which the values of the [Child Map]() and [Parent Map]() of each join condition are identical.
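
The inner join described in this paragraph can be sketched in Python. This is a minimal illustration, assuming that each child map and parent map yields a single value per iteration; the function name and the sample data are hypothetical.

```python
from itertools import product
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple

Iteration = Dict[str, Any]
JoinCondition = Tuple[Callable[[Iteration], Any], Callable[[Iteration], Any]]


def inner_join(
    child_iterations: Iterable[Iteration],
    parent_iterations: Iterable[Iteration],
    join_conditions: Sequence[JoinCondition],
) -> List[Tuple[Iteration, Iteration]]:
    """Build the Cartesian product and keep the pairs that satisfy every condition."""
    return [
        (child, parent)
        for child, parent in product(child_iterations, parent_iterations)
        if all(
            child_map(child) == parent_map(parent)
            for child_map, parent_map in join_conditions
        )
    ]


children = [{"name": "Ann", "dept": "10"}, {"name": "Bob", "dept": "30"}]
parents = [{"dept_id": "10"}, {"dept_id": "20"}]
# One join condition: the child's "dept" must equal the parent's "dept_id".
conditions = [(lambda c: c["dept"], lambda p: p["dept_id"])]
pairs = inner_join(children, parents, conditions)
```

Only the combination of Ann with department 10 is retained; Bob has no matching parent iteration and produces no pair.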

**NOTE**
If the [Referencing Object Map]() has no join condition and the [Logical Source]() of the [Triples Map]() that contains the [Referencing Object Map]()
and the [Logical Source]() of the [Referencing Object Map]()'s [Parent Triples Map]() are not identical, the mapping engine MUST report an error.
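
The rules of this section, including the error case in the note, can be summarised as a small decision function. This is a hedged Python sketch: `MappingError` and `select_join` are hypothetical names, and deciding whether two logical sources are identical is left to the caller.

```python
class MappingError(Exception):
    """Hypothetical error type for an invalid mapping."""


def select_join(sources_identical: bool, join_condition_count: int) -> str:
    """Return the join type implied by the proposed text, or raise on the error case."""
    if join_condition_count > 0:
        return "inner"    # one or more join conditions
    if sources_identical:
        return "natural"  # no join condition, identical logical sources
    raise MappingError(
        "referencing object map has no join condition, "
        "but the logical sources are not identical"
    )
```

For instance, `select_join(True, 0)` returns `"natural"`, while `select_join(False, 0)` raises `MappingError`.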
Collaborator

Doesn't this break R2RML support? AFAIK you can join without a condition which results in joining everything from LS1 with everything of LS2 (Cartesian product)

Collaborator

No, R2RML states:

If the child query and parent query of a referencing object map are not identical, then the referencing object map must have at least one join condition.

Collaborator

AFAIK most engines allow this, thus violating the spec?

Collaborator

Yes, I would say that is a violation of the spec. We could of course argue about the usefulness of such behavior.

Collaborator

Hmmm this is of course not enforced in the shapes, it is kinda hard to do that I think.

Collaborator Author

@elsdvlee elsdvlee Feb 1, 2024

@pmaria @DylanVanAssche
I am totally fine with the other solution: no join condition means Cartesian product. I just want a decision to be taken on this matter, that decision to be documented in the spec, and all of us to be aware that it is not in line with the R2RML spec (what is the consequence?).

If we decide to move away from the R2RML spec, I wonder why we still need the exception for 'same logical source'. It would be much clearer if no join condition means cartesian product for any join.

Since no such decision has been taken so far, I tried to write a PR in line with the R2RML spec.

Member

I would vote for the Cartesian product: given that most engines implement it as such, it feels like the more intuitive interpretation of 'no join condition', and I'm all for making things more intuitive! :) We can also see it as an extension of R2RML: the Cartesian product allows you to do "more" than throwing an error does.
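
For comparison, the Cartesian-product reading of "no join condition" can be sketched in Python. This illustrates the alternative discussed in this thread, not what the PR text currently specifies; the function name and the sample values are hypothetical.

```python
from itertools import product
from typing import Any, Iterable, List, Tuple


def cartesian_pairs(
    child_iterations: Iterable[Any],
    parent_iterations: Iterable[Any],
) -> List[Tuple[Any, Any]]:
    """Pair every child iteration with every parent iteration."""
    return list(product(child_iterations, parent_iterations))


pairs = cartesian_pairs(["c1", "c2"], ["p1", "p2", "p3"])
```

The number of pairs is the product of the two iteration counts (here 2 × 3 = 6), which is worth keeping in mind for large sources.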

Collaborator

FYI, in R2RML: if the queries of two different triples maps use different attributes for the generation of subject maps, then you must have a join condition (in other words, you must do a theta-join). When no join conditions are provided, the rows of the child query are used to populate both the child and the parent subject maps. In that sense, the no-join-conditions case simulates a natural join.

If the referencing object map has no join condition:
SELECT * FROM ({child-query}) AS tmp

This is sufficient, and the quote Pano mentioned is confusing. Testing the equivalence of two queries is "ignored" by the community. It was even the subject of a thread a while ago. Are `SELECT * FROM foo X` and `(SELECT * FROM foo Y)` and `(SELECT * FROM (SELECT * FROM foo))` identical or not?
