Exclude others namespace from harvesting "oai_dc" metadata prefix #10837

Open: wants to merge 5 commits into base: develop

Conversation

jeromeroucou
Contributor

@jeromeroucou jeromeroucou commented Sep 12, 2024

What this PR does / why we need it:

This PR allows harvesting from repositories that expose metadata in additional namespaces within the "oai_dc" format.

Some repositories extend "oai_dc" with additional namespaces. For example, SEANOE exposes extra metadata using the dct namespace. Below is the result of https://www.seanoe.org/oai/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:seanoe.org:41307

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dct="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2024-09-12T12:09:21Z</responseDate>
    <request verb="GetRecord" metadataPrefix="oai_dc" identifier="oai:seanoe.org:41307">
        https://www.seanoe.org/oai/OAIHandler</request>
    <GetRecord>
        <record>
            <header>
                <identifier>oai:seanoe.org:41307</identifier>
                <datestamp>2021-05-12</datestamp>
                <setSpec>GROUP:EMSO</setSpec>
                <setSpec>ec_fundedresources</setSpec>
            </header>
            <metadata>
                <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                    xmlns:dc="http://purl.org/dc/elements/1.1/"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
                    <dc:title>Iridium GPS 1 data from the EMSO-Azores observatory, 2014-2015</dc:title>
                    <dc:creator>Legrand, Julien</dc:creator>
                    <dc:creator>Sarradin, Pierre-marie</dc:creator>
                    <dc:creator>Cannat, Mathilde</dc:creator>
                    <dc:subject>Mid-Atlantic Ridge</dc:subject>
                    <dc:subject>EMSO</dc:subject>
                    <dc:subject>Lucky Strike</dc:subject>
                    <dc:subject>Time-series</dc:subject>
                    <dc:subject>Environmental monitoring node</dc:subject>
                    <dc:subject>MoMAR</dc:subject>
                    <dc:subject>BOREL</dc:subject>
                    <dc:subject>GPS</dc:subject>
                    <dc:subject>Position</dc:subject>
                    <dc:description>This dataset contains the GPS positions of the EMSO-Azores
                        transmission buoy BOREL acquired between July 2014 and April 2015 using the
                        Iridium/GPS modem 1 (data acquired every 6 hours).</dc:description>
                    <dc:publisher>SEANOE</dc:publisher>
                    <dc:date>2015-10</dc:date>
                    <dc:type>dataset</dc:type>
                    <dc:identifier>DOI:10.17882/41307</dc:identifier>
                    <dc:identifier>https://doi.org/10.17882/41307</dc:identifier>
                    <dc:identifier>https://www.seanoe.org/data/00302/41307/</dc:identifier>
                    <dc:relation>info:eu-repo/grantAgreement/EC/FP7/312463/EU//FIXO3</dc:relation>
                    <dc:coverage>North 37.30134, South 37.2888, East -32.275618, West -32.27982</dc:coverage>
                    <dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>
                    <dcterms:spatial xsi:type="DCTERMS:Box">37.2888 -32.27982 37.30134 -32.275618</dcterms:spatial>
                    <dc:rights>CC-BY</dc:rights>
                </oai_dc:dc>
            </metadata>
        </record>
    </GetRecord>
</OAI-PMH>

Currently, this record can't be harvested because the following exception occurs:

Caused by: com.ctc.wstx.exc.WstxParsingException: Undeclared namespace prefix "dct"
  at [row,col {unknown-source}]: [5,555]
       at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:634)
       at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:504)
       at com.ctc.wstx.sr.InputElementStack.resolveAndValidateElement(InputElementStack.java:503)
       at com.ctc.wstx.sr.BasicStreamReader.handleStartElem(BasicStreamReader.java:3066)
       at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2928)
       at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1122)
       at edu.harvard.iq.dataverse.api.imports.ImportGenericServiceBean.processXMLElement(ImportGenericServiceBean.java:209)
       at edu.harvard.iq.dataverse.api.imports.ImportGenericServiceBean.processOAIDCxml(ImportGenericServiceBean.java:180)
       ... 100 more
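
The failure can be reproduced outside Dataverse with a minimal sketch. It assumes that the metadata fragment reaches the parser without the enclosing OAI-PMH root element, so the xmlns:dct declaration that lives on that root is no longer in scope. The class and method names are hypothetical, and the JDK's built-in StAX parser is used here instead of Woodstox, so the exception surfaces as a plain XMLStreamException (the parent type of WstxParsingException).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class UndeclaredPrefixDemo {

    // Assumed shape of the fragment once cut out of the OAI-PMH envelope:
    // xmlns:dct was declared on the <OAI-PMH> root and is missing here.
    static final String FRAGMENT =
        "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
      + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
      + "<dc:title>Iridium GPS 1 data</dc:title>"
      + "<dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>"
      + "</oai_dc:dc>";

    /** Returns true if a namespace-aware StAX parse of the input fails. */
    static boolean failsNamespaceAwareParse(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            // Walk every event; an unbound prefix is reported while scanning.
            while (reader.hasNext()) {
                reader.next();
            }
            return false;
        } catch (XMLStreamException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Same fragment, but with the dct prefix declared on the fragment root.
        String declared = FRAGMENT.replace(
            "<oai_dc:dc ", "<oai_dc:dc xmlns:dct=\"http://purl.org/dc/terms/\" ");
        System.out.println("undeclared dct fails: " + failsNamespaceAwareParse(FRAGMENT));
        System.out.println("declared dct fails:   " + failsNamespaceAwareParse(declared));
    }
}
```

Under that assumption, the full response shown above is well-formed, and only the extracted fragment is rejected.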

We propose ignoring every element that is not in the dc namespace, which avoids the WstxParsingException.
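
As a rough illustration of the idea (not the code in this PR, which works on the StAX reader in ImportGenericServiceBean), the sketch below parses a fragment with a namespace-unaware DOM parser and keeps only the elements whose name starts with the dc: prefix. The class name, method name, and prefix-based matching are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DcOnlyFilter {

    /** Returns "name=value" for each direct child whose name starts with "dc:". */
    static List<String> dcElements(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace-unaware: prefixes are treated as part of the element
            // name, so an undeclared prefix such as dct is not an error.
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            List<String> kept = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                // Skip text nodes and anything outside the dc: prefix
                // (dct:references, dcterms:spatial, ...).
                if (node.getNodeType() == Node.ELEMENT_NODE
                        && node.getNodeName().startsWith("dc:")) {
                    kept.add(node.getNodeName() + "=" + node.getTextContent().trim());
                }
            }
            return kept;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        String fragment =
            "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
          + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
          + "<dc:publisher>SEANOE</dc:publisher>"
          + "<dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>"
          + "<dc:rights>CC-BY</dc:rights>"
          + "</oai_dc:dc>";
        System.out.println(dcElements(fragment));
    }
}
```

Matching on the literal prefix is the simplest option but ties the filter to how the archive spells its prefixes; it would miss Dublin Core elements published under a different prefix.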

Which issue(s) this PR closes:

No related issue found

Special notes for your reviewer:

Not really, but I have a suggestion for extending the scope of this pull request in a follow-up PR (or issue): ForeignMetadataFormatMapping could be made more flexible and used for more namespaces than dcterms. With that, we could add a mapping for the dct namespace.
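
To illustrate why dct and dcterms could share one mapping: in the SEANOE response both prefixes are bound to the same namespace URI, http://purl.org/dc/terms/. The sketch below is a hypothetical helper, not part of ForeignMetadataFormatMapping or its API. It selects elements by namespace URI rather than by prefix, and it assumes the prefix declarations are still in scope when the XML is parsed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class DcTermsByUri {

    static final String DCTERMS_NS = "http://purl.org/dc/terms/";

    /** Returns the local names of all elements bound to the DC Terms namespace. */
    static List<String> dcTermsLocalNames(String xml) {
        // StAX factories are namespace-aware by default.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Compare the resolved namespace URI, not the prefix, so that
                // dct:* and dcterms:* are treated identically.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && DCTERMS_NS.equals(reader.getNamespaceURI())) {
                    names.add(reader.getLocalName());
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse record", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String record =
            "<record xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
          + " xmlns:dct=\"http://purl.org/dc/terms/\""
          + " xmlns:dcterms=\"http://purl.org/dc/terms/\">"
          + "<dc:title>Iridium GPS 1 data</dc:title>"
          + "<dct:references>https://www.seanoe.org/data/00302/41307/</dct:references>"
          + "<dcterms:spatial>37.2888 -32.27982 37.30134 -32.275618</dcterms:spatial>"
          + "</record>";
        System.out.println(dcTermsLocalNames(record));
    }
}
```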

Suggestions on how to test this:

Add a new harvesting client with the https://www.seanoe.org/oai/OAIHandler server and the GROUP:EMSO set.
Before this PR, all datasets end in error; with this PR, all datasets are imported.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No

Is there a release notes update needed for this change?:

A release note snippet has been added.

@coveralls

coveralls commented Sep 12, 2024

Coverage Status

coverage: 21.854% (-0.002%) from 21.856%
when pulling baeffdc on Recherche-Data-Gouv:harvest_exclude_invalid_tag
into b28812b on IQSS:develop.

@qqmyers
Member

qqmyers commented Sep 12, 2024

FYI: You might want to look at/review #10836 which I think is doing something similar but more extensive.

@luddaniel
Contributor

> FYI: You might want to look at/review #10836 which I think is doing something similar but more extensive.

@qqmyers I'm not sure there is a link.

#10837 applies earlier, at the dsDTO = importGenericService.processOAIDCxml(xmlToParse); call, where we can run into XML namespace constraints caused by FastGetRecord XML truncation, the dc: prefix requirement, and a possible XMLStreamException. Also, a generic OAI archive can send customized oai_dc content, as in the example above.

If I missed something, could you shed some light on it for me?

@qqmyers
Copy link
Member

qqmyers commented Sep 25, 2024

Sorry - I agree it's not related. I just saw the note about skipping entries that would fail and wanted to make sure you saw the other PR, but looking at your code I see you're addressing problems in even reading the XML input.

@jeromeroucou jeromeroucou marked this pull request as ready for review September 27, 2024 13:10
@pdurbin pdurbin added the Type: Feature a feature request label Oct 9, 2024
@jeromeroucou
Contributor Author

Hi @pdurbin! Is there a chance this small PR could be included in version 6.5? 🙏

@landreev landreev self-assigned this Nov 8, 2024
@landreev landreev added Feature: Harvesting GREI 2 Consistent Metadata labels Nov 8, 2024
@cmbz cmbz added GREI 3 Search and Browse and removed GREI 2 Consistent Metadata labels Nov 8, 2024
@landreev landreev added GREI 2 Consistent Metadata FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) GREI 3 Search and Browse and removed GREI 3 Search and Browse GREI 2 Consistent Metadata labels Nov 8, 2024
@pdurbin
Member

pdurbin commented Nov 8, 2024

@jeromeroucou we moved it to "ready for review". Thanks for the PR!

Labels
Feature: Harvesting FY25 Sprint 10 FY25 Sprint 10 (2024-11-06 - 2024-11-20) GREI 3 Search and Browse Type: Feature a feature request
Projects
Status: In Review 🔎
Status: 🙏 Wanted for next version
7 participants