
Harvest OREX dataset from SBN-PSI web #196

Closed · tloubrieu-jpl opened this issue May 18, 2023 · 22 comments

@tloubrieu-jpl (Member) commented May 18, 2023

💡 Description

Find the dataset at https://arcnav.psi.edu/urn:nasa:pds:orex.ovirs:data_calibrated

We should download all the products of this collection and harvest them into the EN production registry.

The references to the labels and data files should still point to the SBN PSI web site: https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/

@jordanpadams (Member)

@rchenatjpl can you help us load this data into our EN registry? We will eventually delete it, but we want it loaded for some demo purposes.

When running harvest, we want to load from our machine, but we should make sure the URLs point to the data on their servers at https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/ .

@rchenatjpl

Sure. To be clear, that collection seems to have 1.6 million files, and they're downloading very slowly. If anyone knows a better way than wget, please say so. And sorry, I've forgotten, so correct me if I'm wrong, but the way to point to the PSI web site is to change this in the config file:

  <fileInfo processDataFiles="true" storeLabels="true">
    <fileRef replacePrefix="/path/to/archive" with="https://url/to/archive/" />
  </fileInfo>
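
As for the download itself, this is roughly the wget invocation I mean (just a sketch: the --cut-dirs depth is my guess at how much of the URL path to strip locally):

    # mirror the collection; re-runs skip files that are already up to date
    # --cut-dirs=2 drops the leading pds4/orex/ path components (an assumption
    # about how we want the local layout to look)
    wget --mirror --no-parent --no-host-directories --cut-dirs=2 \
         --reject "index.html*" \
         https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/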

@jordanpadams (Member) commented May 18, 2023

@rchenatjpl yeah... it is going to be very slow, unfortunately. wget is all I know.

Per the config file, that is correct! I think it will be something like:

<fileRef replacePrefix="/path/to/data/pds4/test-data/registry/" with="https://sbnarchive.psi.edu/pds4/orex/" />
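
In context, the whole config would look roughly like this (a sketch from memory, so check the element names against the harvest documentation; the node name and local paths are placeholders):

  <!-- sketch: nodeName and paths are placeholders, verify against harvest docs -->
  <harvest nodeName="PDS_ENG">
    <load>
      <directories>
        <path>/path/to/data/pds4/test-data/registry/orex.ovirs/data_calibrated</path>
      </directories>
    </load>
    <fileInfo processDataFiles="true" storeLabels="true">
      <fileRef replacePrefix="/path/to/data/pds4/test-data/registry/" with="https://sbnarchive.psi.edu/pds4/orex/" />
    </fileInfo>
  </harvest>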

@jordanpadams (Member)

Once you register the data, ops:Data_File_Info/ops:file_ref should contain valid URLs to the SBN data.

@rchenatjpl

@jordanpadams I need more disk space. I believe I'm responsible for killing https://pds.nasa.gov earlier. I freed up a little by moving two directories to /tmp on pdscloud-prod1, but I think I'll need more. See
https://itsd-jira.jpl.nasa.gov/servicedesk/customer/portal/16/DSIO-3936

@rchenatjpl

This one collection is enormous. Should I harvest it in pieces? Does harvest check against the collection.csv?

@tloubrieu-jpl (Member, Author)

Hi @rchenatjpl @jordanpadams, we could use our scalable harvest service for that job. @rchenatjpl, let me know where that should be deployed and I will help you with it. It is a different version of harvest that is meant to work on larger sets of files.

@tloubrieu-jpl (Member, Author)

Actually @jordanpadams, we could ask @sjoshi-jpl to deploy scalable harvest on AWS ECS so we can scale it up and run parallel harvests. That could be a good demo for other nodes. The deployment might also be reused for nucleus/css.

@jordanpadams (Member)

@tloubrieu-jpl we should maybe chat about this offline. Architecturally, is scalable harvest really built for the cloud? The way the services are built, they don't seem to be designed for a serverless environment? I may be wrong. This may require some rethinking of how to deploy it.

Also, I actually think this would be great benchmark testing for the standalone harvest. Thoughts?

@tloubrieu-jpl (Member, Author)

@jordanpadams Whatever works will be good, since the priority is to have these data ingested, and you are right that using scalable harvest adds some unnecessary risk. We can discuss offline whether we should try it, but maybe not for this ticket.

I remember using standalone harvest on these data a year or two ago, and I wrote a Python script to split the input and parallelize harvest. But we can make a first attempt using standalone harvest as-is on the full collection and see what happens.
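
For reference, the splitting idea was roughly the following, shown here as a shell sketch rather than the original Python script, and assuming one pre-generated harvest config per subset of the input (the configs/chunk_*.xml layout is hypothetical):

    # run up to 4 standalone harvests concurrently, one per pre-generated config
    # (each config points at a different subset of the downloaded directories)
    ls configs/chunk_*.xml | xargs -n 1 -P 4 harvest -c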

@rchenatjpl

@jordanpadams @tloubrieu-jpl Holy cow, how do we feel about errors? I'm going to plow ahead regardless. I'm finding duplicate lines in the massive file https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/collection_inventory_ovirs_data_calibrated.csv

% grep 20181102t040122s658_ovr_spacel2 data_calibrated/collection_inventory_ovirs_data_calibrated.csv
P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2.fits::1.0
P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2.fits::1.0
P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2_calv2.fits::2.0
% wc data_calibrated/collection_inventory_ovirs_data_calibrated.csv
1597353 1597353 137482254 data_calibrated/collection_inventory_ovirs_data_calibrated.csv
% sort data_calibrated/collection_inventory_ovirs_data_calibrated.csv | uniq | wc
1169346 1169346 101731008

@tloubrieu-jpl (Member, Author)

Let's assume harvest does not care. You can try to harvest the collection as-is. But I guess we should tell SBN-PSI about it.

@rchenatjpl have you been able to download the full collection yet?

@tloubrieu-jpl (Member, Author)

Oh, but that means roughly 30% of the lines are duplicates. Am I reading your wc results correctly? For performance purposes we might save some time if we clean that file up before harvest runs on it.
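
Something like this would do it (a sketch; it assumes line order in the inventory does not matter, which should be true for a list of P,lidvid entries):

    # drop duplicate inventory lines before harvest reads the file
    sort -u collection_inventory_ovirs_data_calibrated.csv > collection_inventory_dedup.csv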

@rchenatjpl

@tloubrieu-jpl @jordanpadams To be sure I'm doing something reasonable: I'm downloading parts of the collection, harvesting, then deleting those files to make room for more parts. I am replacing the prefix of the path with PSI's web site while harvesting. I have not approved any yet. If this is the wrong approach, please let me know soon. Thanks

@tloubrieu-jpl (Member, Author)

@rchenatjpl that looks reasonable to me, but you would spare yourself some pain if you had more disk space. Where are you downloading the data? On pdscloud-prod?

@rchenatjpl commented May 31, 2023

Thanks, Thomas. I've been downloading onto the production machine. du -k so far says 453514192, which is 453 GB; that doesn't seem like much, but I think Andrew or someone said he increased the disk space for $DATA_HOME to 350 GB. I've killed the production machine twice, which is still affecting my other work, and I have more to ingest.

OMG, I'm looking at Carol's email now, and her total is 1206 GB. The numbers from her individual directories often don't match what I downloaded, sometimes off by 2x, sometimes by something else. Eh, I'll just keep doing what I'm doing.

@rchenatjpl

@tloubrieu-jpl @jordanpadams I may be done. I hope I harvested 1169346 labels. If being precise matters, is there a way to dump all the LIDs that start with urn:nasa:pds:orex.ovirs:data_calibrated:? I still wouldn't be able to give you an ironclad guarantee that the VIDs match.

@tloubrieu-jpl (Member, Author)

That is great @rchenatjpl! I was not able to find the collection itself yet, but I was able to see at least one of the observational products.
We will need to change the status of the collection from staged to archived as well. I will do more investigation tonight, hopefully, and let you know what remains to be done.

Thanks!

@tloubrieu-jpl (Member, Author) commented Jun 15, 2023

@rchenatjpl,

The number of products whose lid starts with urn:nasa:pds:orex.ovirs:data_calibrated: is 1170078, which sounds perfect.
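
In case it is useful, a count like that can be pulled straight from OpenSearch with something along these lines (a sketch: the "registry" index name and the basic-auth setup are assumptions on my part):

    # count products whose lid starts with the collection prefix
    # (index name "registry" is an assumption; credentials elided)
    curl -s -u "$ES_USER:$ES_PASS" \
      -H 'Content-Type: application/json' \
      'https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/registry/_count' \
      -d '{"query": {"prefix": {"lid": "urn:nasa:pds:orex.ovirs:data_calibrated:"}}}'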

I confirm that I don't see the collection itself (with lid=urn:nasa:pds:orex.ovirs:data_calibrated). It is not in the registry.

Could you add it? I guess when you loaded the products in parts, you missed it.

Once that is done, you will be able to switch the archive status for the full collection with a single registry-mgr command:

    ./registry-manager set-archive-status -status archived -lidvid {the lidvid of the collection} -es https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443 -auth ...

@rchenatjpl

I ingested collection* and then tried to change the archive_status. Maybe it worked?
[pds4@pdscloud-prod1 test]$ registry-manager set-archive-status -status archived -lidvid urn:nasa:pds:orex.ovirs:data_calibrated::11.0 -es https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443 -auth auth.txt
[INFO] Setting product status. LIDVID = urn:nasa:pds:orex.ovirs:data_calibrated::11.0, status = archived
[INFO] Setting status of primary references from collection inventory
[ERROR] 10,000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]
[pds4@pdscloud-prod1 test]$
[pds4@pdscloud-prod1 test]$

The collection LIDVID urn:nasa:pds:orex.ovirs:data_calibrated::11.0 shows ops:Tracking_Meta/ops:archive_status = "archived", as does one lower-level product, but I don't know whether all of them got changed to "archived".
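
If someone wants to double-check, a count of members not yet archived would settle it. Something like this (a sketch; the index name and the exact field mapping are guesses on my part):

    # count collection members that are NOT archived yet; a result of 0 means done
    # (index name "registry" and the field mapping are assumptions)
    curl -s -u "$ES_USER:$ES_PASS" \
      -H 'Content-Type: application/json' \
      'https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/registry/_count' \
      -d '{"query": {"bool": {"must": [{"prefix": {"lid": "urn:nasa:pds:orex.ovirs:data_calibrated:"}}], "must_not": [{"match": {"ops:Tracking_Meta/ops:archive_status": "archived"}}]}}}'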

@tloubrieu-jpl (Member, Author)

Thanks very much @rchenatjpl, we can see the collection and its members in the registry-api now. See https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated

@jordanpadams (Member)

@tloubrieu-jpl are we sure everything was loaded? That timeout on connection worries me...

Also, new requirement for registry-mgr fault tolerance :-)
