sparql `creator` query is too narrowly defined #48

iannesbitt · 2023-10-09T22:55:01Z

Edit 2023-10-11: updating in light of new info
Both OpenTopography and CanWIN have an issue where no creator is found in the SO doc, making the citations incorrect.

CanWIN example: DataONE representation, dataset landing, schema validation
Citation:

(2023). Dease Strait Mooring Chl-a and Light Data - 2017 and 2019. Canadian Watershed Information Network (CanWIN). [10.34992/jt0y-en67](https://doi.org/10.34992/jt0y-en67), version: sha256:bf5660b425f554166b3a97d1a827792ec62a526925e332a3f805018ce12d0843.

OpenTopography example: DataONE representation, dataset landing, schema validation
Citation:

(2022). Lidar Survey of Dangermond Preserve, CA. OpenTopography. OTSDEM.042022.32611.1, version: sha256:359b20070883f69c5cd5f5b77d7f91c95a6cf31cffaa12fcc05c515946d3c86e.

~~sonormal (and by extension mnlite) needs to be more adaptable when looking for dataset creators. I think this is set in sonormal.normalize._forceSODatasetLists.~~

iannesbitt · 2023-10-10T19:32:13Z

I'm having trouble figuring out where to find this problem. It might actually have something to do with how mnlite serves creator information to the requestor.

mbjones · 2023-10-10T19:55:35Z

What do you mean, to the requestor? mnlite as a member node serves its SO metadata documents in JSON-LD to the DataONE synchronization system, which grabs an exact copy of the SO document. That copy is transferred to and stored on the CN in its Metacat object store, and then an event is triggered to queue indexing for that document. The DataONE indexer then picks up that queue entry and attempts to extract the information from the SO JSON-LD document using a JSON-LD subprocessor. The subprocessor loads the SO document and a number of other vocabularies into an in-memory triple store and runs SPARQL queries on the content to extract values, which are then updated in SOLR.

For example, here's a link to the subprocessor configuration for extracting a single creator from SO to populate the SOLR author field, showing its SPARQL query. Similar queries are used to populate the other SOLR fields.

iannesbitt · 2023-10-10T22:20:35Z

I didn't understand the simplicity of how mnlite serves records, so that makes sense.

In that case, maybe it makes sense to look at the SPARQL query. I'm not that familiar with SPARQL but I understand some SQL. Here is a CanWIN creator field:

    "creator": {
      "@type": "Role",
      "creator": {
        "@type": "Person",
        "Affiliation": {
          "@type": "Organization",
          "name": "Centre for Earth Observation Science - University of Manitoba"
        },
        "Email": "yendamuk@myumanitoba.ca",
        "Identifier": {
          "@type": "PropertyValue",
          "propertyID": "https://registry.identifiers.org/registry/orcid",
          "url": "https://orcid.org/0009-0001-2454-4614",
          "value": "0009-0001-2454-4614"
        },
        "Name": "Yendamuri, Kiran\t"
      }
    },

And here is the SPARQL query:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
PREFIX SO:   <http://schema.org/>

SELECT (?name as ?author)
WHERE {
    ?dsId rdf:type SO:Dataset .
    ?dsId SO:creator ?list .
    ?list list:index (?pos ?member) .
    ?member SO:name ?name .
}
order by (?pos)
limit 1

mbjones · 2023-10-11T01:47:12Z

@iannesbitt the actual SPARQL query for the origin field, which is what is used for the creator list in citations, is here:

https://github.com/DataONEorg/dataone-indexer/blob/develop/src/main/resources/application-context-schema-org.xml#L304

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
PREFIX SO:   <http://schema.org/>

SELECT (?name as ?origin)
WHERE {
    ?dsId rdf:type SO:Dataset .
    ?dsId SO:creator ?list .
    ?list list:index (?pos ?member) .
    ?member SO:name ?name .
}
order by (?pos)

Note that it assumes list structure. Here's your example doc (slightly enhanced) that does not use a list but rather a creator Role embedded in creator:

{    
"@context": "https://schema.org", 
"schema:Dataset": {
  "@type": "schema:Dataset",
  "name": "Test dataset",
  "creator": {
      "@type": "Role",
      "creator": {
        "@type": "Person",
        "affiliation": {
          "@type": "Organization",
          "name": "Centre for Earth Observation Science - University of Manitoba"
        },
        "email": "yendamuk@myumanitoba.ca",
        "identifier": {
          "@type": "PropertyValue",
          "propertyID": "https://registry.identifiers.org/registry/orcid",
          "url": "https://orcid.org/0009-0001-2454-4614",
          "value": "0009-0001-2454-4614"
        },
        "name": "Yendamuri, Kiran"
      }
    }
  }
}

Here's a SPARQL query to retrieve both the name and email from that. Somehow we need to support these multiple encoding approaches:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX list: <http://jena.hpl.hp.com/ARQ/list#>
PREFIX SO:   <http://schema.org/>

SELECT ?name ?email
WHERE {
    ?dsId rdf:type SO:Dataset .
    ?dsId SO:creator ?role .
    ?role SO:creator ?person .
    ?person SO:name ?name .
    ?person SO:email ?email .
}

This produces the following results:

{
  "head": {
    "vars": [
      "name",
      "email"
    ]
  },
  "results": {
    "bindings": [
      {
        "name": {
          "value": "Yendamuri, Kiran",
          "type": "literal"
        },
        "email": {
          "value": "yendamuk@myumanitoba.ca",
          "type": "literal"
        }
      }
    ]
  },
  "metadata": {
    "httpRequests": 46
  }
}

In an ideal world we would also capture the ORCID and Affiliation too.
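
As a rough illustration of what capturing those extra fields could look like, here is a minimal Python sketch that walks the Role-wrapped creator from the example document above and pulls out the name, email, ORCID, and affiliation. It is not how the indexer works (the indexer runs SPARQL against a triple store); the helper name `describe_creator` and the returned key names are hypothetical, and the sketch assumes the lower-case schema.org keys used in the example.

```python
import json

# Trimmed copy of the example document above (Role-wrapped creator).
DOC = json.loads("""
{
  "@context": "https://schema.org",
  "schema:Dataset": {
    "@type": "schema:Dataset",
    "name": "Test dataset",
    "creator": {
      "@type": "Role",
      "creator": {
        "@type": "Person",
        "affiliation": {
          "@type": "Organization",
          "name": "Centre for Earth Observation Science - University of Manitoba"
        },
        "email": "yendamuk@myumanitoba.ca",
        "identifier": {
          "@type": "PropertyValue",
          "propertyID": "https://registry.identifiers.org/registry/orcid",
          "url": "https://orcid.org/0009-0001-2454-4614",
          "value": "0009-0001-2454-4614"
        },
        "name": "Yendamuri, Kiran"
      }
    }
  }
}
""")


def describe_creator(node):
    """Return name, email, ORCID and affiliation for one creator node.

    Hypothetical helper: unwraps a Role if present, then reads plain keys.
    Fields that are missing or not in the expected shape come back as None.
    """
    if node.get("@type") == "Role" and isinstance(node.get("creator"), dict):
        node = node["creator"]
    identifier = node.get("identifier")
    affiliation = node.get("affiliation")
    orcid = None
    # Only treat the identifier as an ORCID when its propertyID says so.
    if isinstance(identifier, dict) and "orcid" in str(identifier.get("propertyID", "")):
        orcid = identifier.get("value")
    return {
        "name": node.get("name"),
        "email": node.get("email"),
        "orcid": orcid,
        "affiliation": affiliation.get("name") if isinstance(affiliation, dict) else None,
    }


creator_info = describe_creator(DOC["schema:Dataset"]["creator"])
print(creator_info)
```

An equivalent SPARQL query would need extra triple patterns for `SO:identifier` and `SO:affiliation`, probably wrapped in OPTIONAL so that creators without them still match.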

iannesbitt · 2023-10-11T15:18:11Z

@mbjones should I open an issue or PR in the indexer to track this?

mbjones · 2023-10-11T17:30:09Z

yeah, or we could transfer this issue over to the indexer repo if it has what we need...

iannesbitt · 2023-10-11T18:36:08Z

Ok, should be good now. How should we test a change like this?

iannesbitt · 2023-10-11T20:48:10Z

@taojing2002, @mbjones, @artntek and I met for a while to discuss this problem today. We came up with a three-point proposal to try and broaden the configurations of creator that DataONE can parse correctly. @datadavev perhaps you can assess whether you think this is a worthwhile plan.

1. Find more JSON-LD documents with different configurations of creator to test against (for example, the CanWIN "@type": "Role" setup). The folder for JSON-LD test documents in the indexer is here.
2. Use sonormal to normalize multiple configurations of the creator field (i.e. allow for a broader range of SOSO creator definitions). Issue: "Normalize broader configurations of creator" (sonormal#3).
3. Modify our SPARQL query in dataone-indexer and cn-index-processor to accept more variations of the creator field (not just creator fields that have the @list structure, as is the case now).
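
To make the normalization idea in the second point concrete, here is a minimal Python sketch that flattens a few shapes of `creator` into one ordered list of agents. It is only an illustration under assumptions: the function names `normalize_creators` and `creator_names` are hypothetical and are not part of sonormal, and the sample documents are invented to mirror the list and Role configurations discussed in this issue.

```python
def _is_role(node):
    """True if a JSON-LD node declares a Role type (plain or prefixed)."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return any(str(t).endswith("Role") for t in types)


def normalize_creators(creator):
    """Flatten several shapes of a `creator` value into a list of agent dicts.

    Handles: a bare string, a single object, a JSON array, an {"@list": [...]}
    wrapper, and a Role object whose nested `creator` holds the real agent.
    """
    if creator is None:
        return []
    if isinstance(creator, str):
        return [{"name": creator}]
    if isinstance(creator, list):
        agents = []
        for item in creator:
            agents.extend(normalize_creators(item))
        return agents
    if isinstance(creator, dict):
        if "@list" in creator:
            return normalize_creators(creator["@list"])
        if _is_role(creator) and "creator" in creator:
            return normalize_creators(creator["creator"])
        return [creator]
    return []


def creator_names(dataset):
    """Return creator names in document order, with stray whitespace removed."""
    names = []
    for agent in normalize_creators(dataset.get("creator")):
        name = agent.get("name")
        if isinstance(name, str) and name.strip():
            names.append(name.strip())
    return names


# Invented sample datasets covering the configurations mentioned above.
as_list = {"creator": {"@list": [{"@type": "Person", "name": "A. First"},
                                 {"@type": "Person", "name": "B. Second"}]}}
as_role = {"creator": {"@type": "Role",
                       "creator": {"@type": "Person", "name": "Yendamuri, Kiran\t"}}}
as_single = {"creator": {"@type": "Organization", "name": "Example Org"}}

for sample in (as_list, as_role, as_single):
    print(creator_names(sample))
```

Whether this kind of flattening belongs in sonormal, in the indexer, or only in the SPARQL queries is the open question; the sketch just shows that the shapes seen so far reduce to the same list.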

iannesbitt · 2023-10-11T21:22:10Z

Note: the above comment was edited to include @taojing2002 who was mis-tagged in the original

iannesbitt · 2023-10-24T19:00:03Z

I noticed that there was system metadata for each test document, but I didn't see any documentation on how to create it. Is there a method for creating it automatically?

datadavev · 2023-10-24T20:09:21Z

The system metadata needs to accompany an indexer test document primarily to indicate the type of object being sent to the indexer. Other than that, I think the system metadata can contain any valid values.

datadavev · 2023-10-24T20:27:53Z

wrt the creator steps: the normalization of schema.org metadata on mnlite is used to compare against subsequent retrievals from the same URL to determine whether the content has changed. The schema.org content forwarded to the CN is, I believe (or at least should be), the original content that was extracted from the landing page, since we've generally been following a principle of not changing content from the sources.

I believe that means the indexer needs to do the ops described in step 2.

That said, it would certainly be much simpler to index if the json-ld content could be pre-processed to a common representation prior to passing on to the CNs. Perhaps such pre-processing should be part of the indexer?

iannesbitt self-assigned this Oct 9, 2023

iannesbitt added the bug Something isn't working label Oct 9, 2023

iannesbitt transferred this issue from DataONEorg/sonormal Oct 11, 2023

iannesbitt changed the title ~~creator is not populated correctly~~ sparql creator query is too narrowly defined Oct 11, 2023

iannesbitt mentioned this issue Oct 11, 2023

Normalize broader configurations of creator DataONEorg/sonormal#3

Open

iannesbitt added a commit that referenced this issue Oct 24, 2023

adding jsonld w/ alt config of "creator" (#48)

70f1d37

iannesbitt mentioned this issue Jul 11, 2024

SparQL query patches for SO creator/author lists and spatial coverage DataONEorg/dataone-cn-index#1

Open

iannesbitt added a commit that referenced this issue Aug 17, 2024

adding sparql queries to address #48

b12e74b

iannesbitt linked a pull request Aug 17, 2024 that will close this issue

Bugfix 48 creator query too narrow #124

Open

iannesbitt linked a pull request Aug 20, 2024 that will close this issue

Bugfix 48 creator query too narrow #124

Open

iannesbitt mentioned this issue Sep 17, 2024

Incorporate changes for creator SparQL query expansion DataONEorg/d1_cn_index_processor#35

Open
