
Fix handling of storageidentifiers in dataverse_json harvests #7736

Closed
landreev opened this issue Mar 29, 2021 · 9 comments · Fixed by #9467
Assignees: landreev
Labels: Feature: Harvesting · NIH OTA DC Grant · NIH OTA: 1.4.1 (Resolve OAI-PMH harvesting issues) · pm.epic.nih_harvesting · pm.GREI-d-1.4.1 (NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues) · pm.GREI-d-1.4.2 (NIH, yr1, aim4, task2: Create working group on packaging standards) · Size: 10
Milestone: 5.14

Comments

landreev (Contributor) commented Mar 29, 2021

Dataverse installations can harvest metadata from each other in our custom JSON format; our own proprietary export and import code is used in the process.
A side effect of this method is that the storageidentifier (the physical location of the datafile on the remote installation) ends up being imported verbatim into the harvesting Dataverse, instead of the URL of the download API on the remote end.
We now have many recently harvested files with strange storageidentifiers,
like this one:

1784af576bf-27316028aabe

(no driver prefix; must have been harvested from a pre-4.20 Dataverse)
or this:

s3://dataverse-prod-s3:178290c4c23-3a496b7152af

(a file in somebody's S3 storage bucket...)
These are of course completely useless to the harvesting installation. We want to handle them the same way we do when harvesting DDI between Dataverses, i.e., the DvObjects for these harvested files need to be created with the remote download API URL in the storageidentifier field (which would be https://dataverse.tdl.org/api/access/datafile/something-something for the last file above, for example).
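
A minimal sketch of what that rewrite amounts to (illustrative only, not the actual import code; the method name, and the assumption that the remote site URL and remote file id are available at import time, are mine):

```java
public class HarvestedStorageIdentifierSketch {

    // Build the access-API-style value we want in the storageidentifier field for a
    // harvested file, ignoring whatever physical identifier came over in the JSON
    // ("s3://...", "file://...", or a bare "1784af576bf-..." value).
    static String remoteAccessUrl(String remoteSiteUrl, String remoteFileId) {
        // strip a trailing slash so we don't end up with a double slash before "api"
        String base = remoteSiteUrl.endsWith("/")
                ? remoteSiteUrl.substring(0, remoteSiteUrl.length() - 1)
                : remoteSiteUrl;
        return base + "/api/access/datafile/" + remoteFileId;
    }

    public static void main(String[] args) {
        // e.g. the S3 example above would end up stored as something like:
        System.out.println(remoteAccessUrl("https://dataverse.tdl.org", "123"));
        // -> https://dataverse.tdl.org/api/access/datafile/123
    }
}
```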

Aside from these entries being useless as imported, this is not urgent, in that we don't currently use these remote locations for any practical purpose (which is why we haven't noticed until now). But it is still messy; I was completely weirded out when I saw ones like the first example above, which looked exactly like a local storageidentifier that had somehow been created without a driver prefix.

qqmyers (Member) commented Mar 29, 2021

Note that #7325 uses storageidentifiers of this type/form. The use there should probably be consistent with anything done for harvested datasets (at least not be incompatible).

landreev (Contributor, Author) commented:

> @qqmyers: Note that #7325 uses storageidentifiers of this type/form. The use there should probably be consistent with anything done for harvested datasets (at least not be incompatible).

Of which type? With the "http(s):" prefix?

qqmyers (Member) commented Mar 30, 2021

Reading that PR again, I think the final design defines a storage type of 'http', but the storageidentifiers would use the label for the store, as with other types, so they would be things like trsa:// rather than https://. So there is no direct conflict, but there is a chance for confusion through the current type name (it could be renamed to 'remote'). There is a potential for issues with the storage code overall, though: the code assumes a format of <storage driver label>:// when trying to identify the right StorageIO class to use, so http(s) would have to become reserved words.
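
To illustrate (a simplified sketch, not the actual DataAccess lookup code; the class and method names here are made up): the label is just whatever precedes the "://", so a bare https URL in the storageidentifier field would look like a store labeled "https".

```java
import java.util.Set;

public class StorageDriverPrefixSketch {

    // Labels that would have to be off limits for store configuration if harvested
    // files carried raw URLs in the storageidentifier field.
    static final Set<String> RESERVED_LABELS = Set.of("http", "https");

    // The "<storage driver label>" part of a "<label>://<rest>" identifier.
    static String driverLabel(String storageIdentifier) {
        int sep = storageIdentifier.indexOf("://");
        return sep > 0 ? storageIdentifier.substring(0, sep) : null;
    }

    public static void main(String[] args) {
        System.out.println(driverLabel("s3://dataverse-prod-s3:178290c4c23-3a496b7152af")); // s3
        String urlLabel = driverLabel("https://remote.edu/api/access/datafile/123");        // https
        System.out.println(RESERVED_LABELS.contains(urlLabel));                             // true
    }
}
```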

landreev (Contributor, Author) commented:

I agree, it could be confusing, at least for a human looking at the database entries. But I don't think it should lead to any real conflicts, even if you define a store of type "http" with the actual label "http" (like we have with "file:" and "s3:"), because we have other ways to unambiguously tell a harvested DvObject from a real one without having to rely on the storageidentifier.

That said, we could make it more explicit; maybe all the harvested ones should have some reserved prefix, e.g. harvested:https://remote.edu/api/access/datafile/123? Not storing that remote source URL in the storageidentifier field at all would potentially be even better; I just wasn't OK with the idea of introducing another database column just for that, so I decided not to do it for now.
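
For example, a reserved prefix along those lines could be recognized with something as simple as this (purely hypothetical, since no such "harvested:" prefix exists today):

```java
public class HarvestedPrefixSketch {

    // "harvested:" is hypothetical, just the idea floated above; it is not an
    // existing Dataverse convention or reserved word.
    static final String HARVESTED_PREFIX = "harvested:";

    static boolean isHarvested(String storageIdentifier) {
        return storageIdentifier != null && storageIdentifier.startsWith(HARVESTED_PREFIX);
    }

    static String remoteUrl(String storageIdentifier) {
        return isHarvested(storageIdentifier)
                ? storageIdentifier.substring(HARVESTED_PREFIX.length())
                : null;
    }

    public static void main(String[] args) {
        String id = "harvested:https://remote.edu/api/access/datafile/123";
        System.out.println(isHarvested(id)); // true
        System.out.println(remoteUrl(id));   // https://remote.edu/api/access/datafile/123
    }
}
```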

landreev (Contributor, Author) commented:

I am going to open an issue for reviewing, and potentially refactoring, this whole setup of "harvested files".
This is yet another case of a legacy feature that may be more trouble than it's worth.
The system is a leftover from the times when DVNs actually read files harvested from other DVNs and other remote sources (for example, ICPSR), served them to users, and even ran statistical calculations on them. But all of that was abandoned, for various reasons, and these harvested files are now only used for indexing. There are no pages associated with them (search results redirect the user to the original remote source).
In other words, there doesn't seem to be any good reason for these "files" to be full-blown DvObjects and DataFiles.

(This current issue is still a valid case for a shorter-term fix, though.)

landreev (Contributor, Author) commented:

I have also opened #8629 for potentially redesigning the whole scheme of how we handle "harvested files". But this issue should be straightforward enough that we should just go ahead and fix it.

@mreekie added the pm.epic.nih_harvesting and NIH OTA DC Grant labels May 9, 2022
@mreekie added and removed NIH OTA: 1.4.1 labels Oct 25, 2022
landreev (Contributor, Author) commented Jan 9, 2023

My guess is this is a 10.

mreekie commented Jan 9, 2023

Sizing:

  • Leonid puts this at a 10.
  • This is well defined. When we harvest in JSON format, if a dataset has any files, we just take the storage identifier from the remote site and put it in the database as is. This can be very confusing for an admin looking at the database, e.g. they see actual remote storage identifiers. We should do what we do for DDI harvesting and indicate that the file lives somewhere else.
  • About a day of work.

@mreekie added the Size: 10 label Jan 9, 2023
mreekie commented Jan 10, 2023

Priority Review with Stefano:

  • Moved from NIH Deliverables Backlog to Ordered Backlog

@mreekie added the pm.GREI-d-1.4.1 label Mar 20, 2023
@mreekie added the pm.GREI-d-1.4.2 label Mar 20, 2023
@landreev self-assigned this Mar 21, 2023
landreev added a commit that referenced this issue Mar 22, 2023
@pdurbin added this to the 5.14 milestone May 10, 2023