Revisit/reimplement the concept of a "Harvested file". #8629

Closed
3 tasks
landreev opened this issue Apr 20, 2022 · 7 comments
Labels
Feature: Harvesting NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... pm.epic.nih_harvesting_framework pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues pm.GREI-d-1.4.2 NIH, yr1, aim4, task2: Create working group on packaging standards pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues Size: 10 A percentage of a sprint. 7 hours.

Comments

@landreev
Contributor

landreev commented Apr 20, 2022

Short version: "Harvested files" are currently stored as DvObject/DataFile/FileMetadata/etc. entities, just like "real" files. I don't think they should be handled that way.

(I feel like I have a memory of opening an issue for this, but looks like I never did - ?)

History: "Harvested files" are created locally when a harvesting client imports DDI or native JSON dataset metadata records with file entries from other Dataverses (the DC format has no mechanism for encoding files or any other kind of child object). The reason they become DataFiles/DvObjects is a legacy of the old implementation in DVN v2-3. Back then they were treated as actual files that users could download locally: the remote location (URL) was stored in place of the physical file name, and DVN would make an HTTP call to get and proxy the content, transparently to the user. We abandoned that scheme as overly complicated (the problem with authentication was never fully resolved, among other things). So in the current scheme these "files" are used only for indexing. We still attempt to store a link to the remote object (as the storageidentifier of the DvObject), but it is never used in practice. When search hits for harvested files are displayed, no attempt is made to redirect the user specifically to that file; clicking on the card always sends them to the remote location of the dataset to which the file belongs. This really doesn't justify maintaining the same DvObject hierarchy of entities as for "real" files, IMO.

The concept of a "remote file", something that transparently appears as a DataFile to the local user, with the byte content stored elsewhere/remotely, is now being revisited (#7324). Once we have that, we may consider, as an optional/configurable harvesting feature, being able to turn harvested files into these "remotely stored" files locally. But when harvesting file records solely for indexing, I believe we should instead introduce some "HarvestedFileMetadata" entity for storing them.
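A rough sketch of what such an index-only record might look like, purely to illustrate the shape of the idea. Dataverse itself is Java; Python is used here only for brevity. The field names and the `to_search_doc` helper are assumptions made for illustration and are not specified anywhere in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HarvestedFileMetadata:
    """Index-only record for a file listed in a harvested dataset (hypothetical)."""
    dataset_id: int                    # local id of the harvested dataset
    label: str                         # file name as given in the harvested record
    content_type: Optional[str] = None
    description: Optional[str] = None
    remote_url: Optional[str] = None   # kept for reference only, never dereferenced


def to_search_doc(meta: HarvestedFileMetadata, dataset_remote_url: str) -> dict:
    """Build a search document for the file (hypothetical helper)."""
    # The card links to the remote dataset, not to the file itself,
    # which matches the current behavior described above.
    return {
        "type": "file",
        "name": meta.label,
        "file_type": meta.content_type,
        "description": meta.description,
        "parent_dataset_id": meta.dataset_id,
        "url": dataset_remote_url,
    }
```

The point of the sketch is that nothing here needs the DvObject hierarchy: the record carries only what the index needs, plus a pointer back to its dataset.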

Definition of done:

  • discuss during a tech hour.
  • decide whether to move forward on this.
    • if we decide to implement this, create the corresponding issues that are associated with it.
@landreev
Contributor Author

landreev commented Jan 9, 2023

This is probably doable in one sprint's worth of time... But let's decide if we actually want to do this ("revisit" being the key word). And/or if maybe we want to address other, more urgently needed harvesting issues?

@mreekie

mreekie commented Jan 9, 2023

This will be a spike:

  • Discussion during the tech hours maybe.
  • we created a definition of done and added it to the end of the description.
  • assigned it a size of 10, since we have a good idea of what will be done in this step

@mreekie mreekie added Size: 80 A percentage of a sprint. 56 hours. Size: 10 A percentage of a sprint. 7 hours. and removed Size: 80 A percentage of a sprint. 56 hours. labels Jan 9, 2023
@mreekie mreekie added pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues pm.GREI-d-1.4.2 NIH, yr1, aim4, task2: Create working group on packaging standards labels Mar 20, 2023
@mreekie

mreekie commented Apr 19, 2023

Sizing:

  • Was discussed in tech hour.
  • Decided to do one small but important thing.
  • Leonid will update this spike and create a follow-up ticket from that discussion.
  • Once that is done, this issue can be closed.

@cmbz

cmbz commented Jun 1, 2023

@landreev I'm following up on @mreekie's 19 April note with some questions:

  • Has discussion occurred somewhere else I can link to?
  • Has a follow-up ticket already been created so this issue can be closed?

@cmbz cmbz added the pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues label Jun 2, 2023
@landreev
Contributor Author

landreev commented Jun 9, 2023

@cmbz
Sorry, meant to reply last week...

This was discussed during a tech hour. We concluded that it wasn't worth heavily redesigning the current setup, e.g. by introducing a new database object dedicated to representing a harvested file. But we decided to do one small/simple thing: move the column harvestingclient_id from the dataset table in the database to the common dvobject table. This by itself will simplify many operations, and will make it much easier to tell a harvested file from a "real" one in one step, etc.
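To make the "one step" point concrete, here is a minimal sketch using an in-memory SQLite database. The two tables are heavily simplified stand-ins for the real schema, and the backfill onto file rows is an assumption about how the column would be populated; the actual migration is left to the follow-up issue. The old column is left in place here rather than dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical stand-ins for the real tables.
    CREATE TABLE dvobject (
        id INTEGER PRIMARY KEY,
        dtype TEXT NOT NULL,            -- 'Dataset' or 'DataFile'
        owner_id INTEGER REFERENCES dvobject(id)
    );
    CREATE TABLE dataset (
        id INTEGER PRIMARY KEY REFERENCES dvobject(id),
        harvestingclient_id INTEGER     -- NULL for local datasets
    );
    INSERT INTO dvobject VALUES (1, 'Dataset', NULL), (2, 'DataFile', 1),
                                (3, 'Dataset', NULL), (4, 'DataFile', 3);
    INSERT INTO dataset VALUES (1, 42), (3, NULL);
""")


def is_harvested_before(file_id):
    # Current layout: hop from the file to its owning dataset first.
    row = conn.execute(
        "SELECT ds.harvestingclient_id FROM dvobject f "
        "JOIN dataset ds ON ds.id = f.owner_id WHERE f.id = ?",
        (file_id,),
    ).fetchone()
    return row is not None and row[0] is not None


before = {fid: is_harvested_before(fid) for fid in (2, 4)}

# Proposed layout: carry the column on dvobject and backfill it
# for datasets and (by assumption) for their files.
conn.executescript("""
    ALTER TABLE dvobject ADD COLUMN harvestingclient_id INTEGER;
    UPDATE dvobject SET harvestingclient_id =
        (SELECT ds.harvestingclient_id FROM dataset ds WHERE ds.id = dvobject.id)
      WHERE dtype = 'Dataset';
    UPDATE dvobject SET harvestingclient_id =
        (SELECT ds.harvestingclient_id FROM dataset ds WHERE ds.id = dvobject.owner_id)
      WHERE dtype = 'DataFile';
""")


def is_harvested_after(obj_id):
    # New layout: a single-table lookup, the same for any DvObject.
    row = conn.execute(
        "SELECT harvestingclient_id FROM dvobject WHERE id = ?", (obj_id,)
    ).fetchone()
    return row is not None and row[0] is not None
```

Both functions should agree for every file; the difference is that the second no longer needs to know which table the owning object lives in.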

So we can do one of two things: close this issue as a completed spike and open a quick dev. issue for implementing the change above, or change the title of this issue and use it for scheduling and implementing the change. The former is probably cleaner (?).

@cmbz

cmbz commented Jun 12, 2023

@landreev I like your first suggestion: "close this issue as a completed spike, and open a quick dev. issue for implementing the change above". Thank you! :)

@cmbz

cmbz commented Jun 30, 2023

Acting on @landreev's recommendation, I created the related issue #9686 and am closing this issue.
