Scalable Harvest does not replace file paths with the appropriate URL prefix #64

Closed
mdrum opened this issue Jun 9, 2022 · 5 comments · Fixed by nasa-pds-engineering-node/registry-harvest-service#23
Labels: B13.0, bug (Something isn't working), s.high (High severity)

Comments

@mdrum

mdrum commented Jun 9, 2022

🐛 Describe the bug

Using the Scalable Harvest suite of tools, the objects eventually uploaded to the registry do not honor the replacePrefix directives in the harvest config file. As a result, the file_ref fields in the registry contain local file paths instead of reachable URLs. I tested the exact same config file on standalone harvest, and the paths were updated correctly.
Example of a resulting document:

{
  "_index" : "registry",
  "_type" : "_doc",
  "_id" : "urn:nasa:pds:smallbodiesoccultations:data:occsatlist_tab::1.0",
  "_score" : 5.440146,
  "fields" : {
    "ops:Data_File_Info/ops:file_ref" : [
      "/dsk1/www/archive/pds4/non_mission/smallbodiesoccultations/data/occsatlist.tab"
    ],
    "ops:Label_File_Info/ops:file_ref" : [
      "/dsk1/www/archive/pds4/non_mission/smallbodiesoccultations/data/occsatlist.xml"
    ]
  }
}

📜 To Reproduce

  1. Install the appropriate services (specified below)
  2. Run ./registry-harvest-cli-1.0.0/bin/harvest-client harvest -j ./harvest-test.xml with archive bundles at the path specified (/dsk1/www/archive/pds4/non_mission/)
  3. Wait for the job to complete
  4. Query the registry for items from the package just harvested

🕵️ Expected behavior

Expect fields to look like this:

{
  "_index" : "registry",
  "_type" : "_doc",
  "_id" : "urn:nasa:pds:smallbodiesoccultations:data:occsatlist_tab::1.0",
  "_score" : 5.440146,
  "fields" : {
    "ops:Data_File_Info/ops:file_ref" : [
      "https://sbnarchive.psi.edu/pds4/pds4/non_mission/smallbodiesoccultations/data/occsatlist.tab"
    ],
    "ops:Label_File_Info/ops:file_ref" : [
      "https://sbnarchive.psi.edu/pds4/pds4/non_mission/smallbodiesoccultations/data/occsatlist.xml"
    ]
  }
}
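
The substitution expected from the replacePrefix directive can be sketched as follows. This is only an illustration in Python, not harvest's actual code: the function name `replace_prefix` is hypothetical, and the prefix/URL pair is an assumption inferred from the two example documents in this issue (the real values are in the attached config).

```python
def replace_prefix(path, rules):
    """Return path with the first matching prefix swapped for its replacement.

    rules is a list of (prefix, replacement) pairs. A path that matches
    no rule is returned unchanged.
    """
    for prefix, replacement in rules:
        if path.startswith(prefix):
            return replacement + path[len(prefix):]
    return path


# Assumed rule, inferred from the example documents in this issue.
RULES = [("/dsk1/www/archive/", "https://sbnarchive.psi.edu/pds4/")]

print(replace_prefix(
    "/dsk1/www/archive/pds4/non_mission/smallbodiesoccultations/data/occsatlist.tab",
    RULES,
))
```

With the assumed rule, the local path from the observed document maps to the URL shown in the expected document.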

📚 Version of Software Used

registry-crawler-service-1.0.0
registry-harvest-cli-1.0.0
registry-harvest-service-1.0.0
registry-manager-4.4.0

🩺 Test Data / Additional context

Attached is the config used for testing both standalone and scalable harvest:
harvest-text.xml.txt

@jordanpadams
Member

@tloubrieu-jpl ☝️

@mdrum
Author

mdrum commented Jun 13, 2022

When would an ETA be known for this fix? Can I get an order of magnitude (days, weeks, months)? I'm pretty sure this takes the registry out of commission for us, except for the smaller bundles, until a fix comes in.

@tloubrieu-jpl
Member

Hi @mdrum,

I have not looked at the bug yet, but I do not expect it to be hard to resolve.
We can target an analysis tomorrow and a solution by the end of the week.

We are also willing to provide you with a script to bulk update the data you have already loaded.

Thanks
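
For illustration only, a bulk-update script of the kind offered here might build an NDJSON body for the Elasticsearch `_bulk` API along these lines. This is a sketch under stated assumptions, not the script the maintainers have in mind: the index name and field names are taken from the example document in this issue, the prefix/URL pair is inferred from it, and `build_bulk_body` is a hypothetical helper. Fetching the affected documents and posting the body are left out.

```python
import json

# Assumed values: index and field names come from the example document in
# this issue; the prefix/URL pair is inferred from that example.
INDEX = "registry"
FILE_REF_FIELDS = (
    "ops:Data_File_Info/ops:file_ref",
    "ops:Label_File_Info/ops:file_ref",
)
OLD_PREFIX = "/dsk1/www/archive/"
NEW_PREFIX = "https://sbnarchive.psi.edu/pds4/"


def build_bulk_body(hits):
    """Build an NDJSON _bulk body that rewrites file_ref values in each hit."""
    lines = []
    for hit in hits:
        doc = {}
        for field in FILE_REF_FIELDS:
            values = hit.get("fields", {}).get(field, [])
            fixed = [
                NEW_PREFIX + v[len(OLD_PREFIX):] if v.startswith(OLD_PREFIX) else v
                for v in values
            ]
            if fixed != values:
                doc[field] = fixed
        if doc:  # skip documents that need no change
            lines.append(json.dumps({"update": {"_index": INDEX, "_id": hit["_id"]}}))
            lines.append(json.dumps({"doc": doc}))
    # The _bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

The sketch writes each field back as a list because that is how the values appear in the search response. Whether the stored documents hold lists or single values would need to be checked before running anything like this against a real registry.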

@mdrum
Author

mdrum commented Jun 13, 2022

That would be great, thanks @tloubrieu-jpl!

@tloubrieu-jpl
Member

tloubrieu-jpl commented Jun 16, 2022

Hi @mdrum,

I looked at this issue and, from what I am seeing (this is not my code), the job configuration is read differently in standalone harvest and scalable harvest, and the <fileRef replacePrefix... directive is not supported in scalable harvest.

This is a bug and we will add that feature, but since:

  • the fileInfo/fileRef configuration is not read yet in scalable harvest,
  • the original developer is not available,

that will take a bit longer than expected.

You can use standalone harvest in the meantime. Sorry about the frustration of having a brand-new tool that is missing a critical feature. I'll give you a new ETA by the end of the day.
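
To make the missing piece concrete, here is a minimal sketch of reading fileInfo/fileRef rules out of a harvest job configuration. It is written in Python for illustration and is not the services' actual code. The sample XML, the `harvest` root element, the `with` attribute name, and the helper `read_file_ref_rules` are all assumptions, since the attached config is not reproduced in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical config fragment: element layout, the "with" attribute,
# and both values are assumed for illustration.
SAMPLE_CONFIG = """
<harvest>
  <fileInfo>
    <fileRef replacePrefix="/dsk1/www/archive/" with="https://sbnarchive.psi.edu/pds4/" />
  </fileInfo>
</harvest>
"""


def read_file_ref_rules(config_xml):
    """Return (prefix, replacement) pairs from every fileInfo/fileRef element."""
    root = ET.fromstring(config_xml)
    rules = []
    for ref in root.findall("./fileInfo/fileRef"):
        prefix = ref.get("replacePrefix")
        replacement = ref.get("with")
        if prefix and replacement:  # ignore incomplete rules
            rules.append((prefix, replacement))
    return rules


print(read_file_ref_rules(SAMPLE_CONFIG))
```

A config with no fileInfo section yields an empty rule list, which corresponds to the behavior reported here: paths pass through unchanged.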
