
Fix handling of storageidentifiers in dataverse_json harvests #7736

Closed
landreev opened this issue Mar 29, 2021 · 9 comments · Fixed by #9467
Assignees: landreev
Labels: Feature: Harvesting · NIH OTA DC Grant · NIH OTA: 1.4.1 (Resolve OAI-PMH harvesting issues) · pm.epic.nih_harvesting · pm.GREI-d-1.4.1 (NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues) · pm.GREI-d-1.4.2 (NIH, yr1, aim4, task2: Create working group on packaging standards) · Size: 10
Milestone: 5.14

Comments

landreev (Contributor) commented Mar 29, 2021

Dataverse installations can harvest metadata from each other in our custom JSON format; our own proprietary export and import code is used in the process.
A side effect of this method is that the storageidentifier (the physical location of the datafile on the remote installation) ends up being imported verbatim into the harvesting Dataverse, instead of the URL of the download API on the remote end.
We now have many recently harvested files with strange storageidentifiers,
like this one:

1784af576bf-27316028aabe

(no driver prefix; must have been harvested from a pre-4.20 Dataverse)
or this:

s3://dataverse-prod-s3:178290c4c23-3a496b7152af

(a file in somebody's S3 storage bucket...)
These are of course completely useless to the harvesting installation. We want to handle them the same way we do when harvesting DDI between Dataverses, i.e., the DvObjects for these harvested files need to be created with the remote download API URL in the storageidentifier field (which would be https://dataverse.tdl.org/api/access/datafile/something-something for the last file above, for example).
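
A minimal sketch of what that rewrite amounts to (illustrative only, not the actual import code; the method name, and the assumption that the remote site URL and remote file id are available at import time, are mine):

```java
public class HarvestedStorageIdentifierSketch {

    // Build the access-API-style value we want in the storageidentifier field for a
    // harvested file, ignoring whatever physical identifier came over in the JSON
    // ("s3://...", "file://...", or a bare "1784af576bf-..." value).
    static String remoteAccessUrl(String remoteSiteUrl, String remoteFileId) {
        // strip a trailing slash so we don't end up with a double slash before "api"
        String base = remoteSiteUrl.endsWith("/")
                ? remoteSiteUrl.substring(0, remoteSiteUrl.length() - 1)
                : remoteSiteUrl;
        return base + "/api/access/datafile/" + remoteFileId;
    }

    public static void main(String[] args) {
        // e.g. the S3 example above would end up stored as something like:
        System.out.println(remoteAccessUrl("https://dataverse.tdl.org", "123"));
        // -> https://dataverse.tdl.org/api/access/datafile/123
    }
}
```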

Aside from these entries being useless as imported, this is not urgent, in that we don't currently use these remote locations for any practical purpose (which is why we haven't noticed until now). But it is still messy; I was completely weirded out when I saw ones like the first example above, which looked exactly like a local storageidentifier that had somehow been created without a driver prefix.

qqmyers (Member) commented Mar 29, 2021

Note that #7325 uses storageidentifiers of this type/form. The use there should probably be consistent with anything done for harvested datasets (at least not be incompatible).

landreev (Contributor, Author) commented:

> @qqmyers: Note that #7325 uses storageidentifiers of this type/form. The use there should probably be consistent with anything done for harvested datasets (at least not be incompatible).

Of which type? With the "http(s):" prefix?

qqmyers (Member) commented Mar 30, 2021

Reading that PR again, I think the final design defines a storage type of 'http', but the storageidentifiers would use the label for the store, as with other types, so they would be things like trsa:// rather than https://. So there is no direct conflict, but there is a chance for confusion through the current type name (it could be renamed to 'remote'). There is a potential for issues with the storage code overall, though: the code assumes a format of <storage driver label>:// when trying to identify the right StorageIO class to use, so http(s) would have to become reserved words.
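
To illustrate (a simplified sketch, not the actual DataAccess lookup code; the class and method names here are made up): the label is just whatever precedes the "://", so a bare https URL in the storageidentifier field would look like a store labeled "https".

```java
import java.util.Set;

public class StorageDriverPrefixSketch {

    // Labels that would have to be off limits for store configuration if harvested
    // files carried raw URLs in the storageidentifier field.
    static final Set<String> RESERVED_LABELS = Set.of("http", "https");

    // The "<storage driver label>" part of a "<label>://<rest>" identifier.
    static String driverLabel(String storageIdentifier) {
        int sep = storageIdentifier.indexOf("://");
        return sep > 0 ? storageIdentifier.substring(0, sep) : null;
    }

    public static void main(String[] args) {
        System.out.println(driverLabel("s3://dataverse-prod-s3:178290c4c23-3a496b7152af")); // s3
        String urlLabel = driverLabel("https://remote.edu/api/access/datafile/123");        // https
        System.out.println(RESERVED_LABELS.contains(urlLabel));                             // true
    }
}
```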

landreev (Contributor, Author) commented:

I agree, it could be confusing, at least for a human looking at the database entries. But I don't think it should lead to any real conflicts, even if you define a store of type "http" with the actual label "http" (like we have with "file:" and "s3:"), because we have other ways to unambiguously tell a harvested DvObject from a real one without having to rely on the storageidentifier.

That said, we could make it more explicit; maybe all the harvested ones should have some reserved prefix, e.g. harvested:https://remote.edu/api/access/datafile/123? Not storing that remote source URL in the storageidentifier field at all would potentially be even better; I just wasn't OK with the idea of introducing another database column just for that, so I decided not to do it for now.
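
For example, a reserved prefix along those lines could be recognized with something as simple as this (purely hypothetical, since no such "harvested:" prefix exists today):

```java
public class HarvestedPrefixSketch {

    // "harvested:" is hypothetical, just the idea floated above; it is not an
    // existing Dataverse convention or reserved word.
    static final String HARVESTED_PREFIX = "harvested:";

    static boolean isHarvested(String storageIdentifier) {
        return storageIdentifier != null && storageIdentifier.startsWith(HARVESTED_PREFIX);
    }

    static String remoteUrl(String storageIdentifier) {
        return isHarvested(storageIdentifier)
                ? storageIdentifier.substring(HARVESTED_PREFIX.length())
                : null;
    }

    public static void main(String[] args) {
        String id = "harvested:https://remote.edu/api/access/datafile/123";
        System.out.println(isHarvested(id)); // true
        System.out.println(remoteUrl(id));   // https://remote.edu/api/access/datafile/123
    }
}
```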

landreev (Contributor, Author) commented:

I am going to open an issue for reviewing, and potentially refactoring, this whole setup of "harvested files".
This is yet another case of a legacy feature that may be more trouble than it's worth.
The system is a leftover from the times when DVNs actually read files harvested from other DVNs and other remote sources (for example, ICPSR), served them to users, and even ran statistical calculations on them. But all of that was abandoned, for various reasons, and these harvested files are now only used for indexing. There are no pages associated with them (search results redirect the user to the original remote source).
In other words, there doesn't seem to be any good reason for these "files" to be full-blown DvObjects and DataFiles.

(This current issue is still a valid case for a shorter-term fix, though.)

landreev (Contributor, Author) commented:

I have also opened #8629 for potentially redesigning the whole scheme of how we handle "harvested files". But this issue should be straightforward enough that we should just go ahead and fix it.

@mreekie added the pm.epic.nih_harvesting and NIH OTA DC Grant labels May 9, 2022
@mreekie added and removed NIH OTA: 1.4.1 labels Oct 25, 2022
landreev (Contributor, Author) commented Jan 9, 2023

My guess is this is a 10.

mreekie commented Jan 9, 2023

Sizing:

  • Leonid puts this at a 10.
  • This is well defined. When we harvest in JSON format, if a dataset has any files, we just take the storage identifier from the remote site and put it in the database as is. This can be very confusing for an admin looking at the database, e.g. they see actual remote storage identifiers. We should do what we do for DDI harvesting and indicate that the file lives somewhere else.
  • About a day of work.

@mreekie added the Size: 10 label Jan 9, 2023
mreekie commented Jan 10, 2023

Priority Review with Stefano:

  • Moved from NIH Deliverables Backlog to Ordered Backlog

@mreekie added the pm.GREI-d-1.4.1 label Mar 20, 2023
@mreekie added the pm.GREI-d-1.4.2 label Mar 20, 2023
@landreev self-assigned this Mar 21, 2023
landreev added a commit that referenced this issue Mar 22, 2023
@pdurbin added this to the 5.14 milestone May 10, 2023