
Harvest OREX dataset from SBN-PSI web #196

Closed · tloubrieu-jpl opened this issue May 18, 2023 · 22 comments

@tloubrieu-jpl (Member) commented May 18, 2023

💡 Description

Find the dataset at https://arcnav.psi.edu/urn:nasa:pds:orex.ovirs:data_calibrated

We should download all the products of this collection and harvest them into the EN production registry.

The references to the labels and data files should still point to the SBN PSI web site: https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/

@jordanpadams (Member)

@rchenatjpl can you help us load this data into our EN registry? We will eventually delete it, but we want it loaded for some demo purposes.

When running harvest, we want to load from our machine, but we should make sure the URLs point to the data on their servers at https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/ .

@rchenatjpl

Sure. To be clear, that collection seems to have 1.6 million files, and they're downloading very slowly. If anyone knows a better way than wget, please say so. And sorry, I've forgotten, so correct me if I'm wrong, but the way to point to the PSI web site is to change this in the config file:

  <fileInfo processDataFiles="true" storeLabels="true">
    <fileRef replacePrefix="/path/to/archive" with="https://url/to/archive/" />
  </fileInfo>
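
As for the download itself, this is roughly the wget invocation I mean (just a sketch: the --cut-dirs depth is my guess at how much of the URL path to strip locally):

    # mirror the collection; re-runs skip files that are already up to date
    # --cut-dirs=2 drops the leading pds4/orex/ path components (an assumption
    # about how we want the local layout to look)
    wget --mirror --no-parent --no-host-directories --cut-dirs=2 \
         --reject "index.html*" \
         https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/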

@jordanpadams (Member) commented May 18, 2023

@rchenatjpl yeah... it is going to be very slow, unfortunately. wget is all I know.

Per the config file, that is correct! I think it will be something like:

<fileRef replacePrefix="/path/to/data/pds4/test-data/registry/" with="https://sbnarchive.psi.edu/pds4/orex/" />
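
In context, the whole config would look roughly like this (a sketch from memory, so check the element names against the harvest documentation; the node name and local paths are placeholders):

  <!-- sketch: nodeName and paths are placeholders, verify against harvest docs -->
  <harvest nodeName="PDS_ENG">
    <load>
      <directories>
        <path>/path/to/data/pds4/test-data/registry/orex.ovirs/data_calibrated</path>
      </directories>
    </load>
    <fileInfo processDataFiles="true" storeLabels="true">
      <fileRef replacePrefix="/path/to/data/pds4/test-data/registry/" with="https://sbnarchive.psi.edu/pds4/orex/" />
    </fileInfo>
  </harvest>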

@jordanpadams (Member)

Once you register the data, ops:Data_File_Info/ops:file_ref should contain valid URLs to the SBN data.

@rchenatjpl

@jordanpadams I need more disk space. I believe I'm responsible for killing https://pds.nasa.gov earlier. I freed up a little by moving two directories to /tmp on pdscloud-prod1, but I think I'll need more. See
https://itsd-jira.jpl.nasa.gov/servicedesk/customer/portal/16/DSIO-3936

@rchenatjpl

This one collection is enormous. Should I harvest it in pieces? Does harvest check against the collection.csv?

@tloubrieu-jpl (Member, Author)

Hi @rchenatjpl @jordanpadams, we could use our scalable harvest service for that job. @rchenatjpl, let me know where that should be deployed and I will help you with it. It is a different version of harvest that is meant to work on larger sets of files.

@tloubrieu-jpl (Member, Author)

Actually @jordanpadams, we could ask @sjoshi-jpl to deploy scalable harvest on AWS ECS so we can scale it up and run parallel harvests. That could be a good demo for other nodes. The deployment might also be reused for nucleus/css.

@jordanpadams (Member)

@tloubrieu-jpl we should maybe chat about this offline. Architecturally, is scalable harvest really built for the cloud? The way the services are built, they don't seem to be designed for a serverless environment? I may be wrong. This may require some rethinking of how to deploy it.

Also, I actually think this would be great benchmark testing for the standalone harvest. Thoughts?

@tloubrieu-jpl (Member, Author)

@jordanpadams Whatever works will be good, since the priority is to have these data ingested, and you are right that using scalable harvest adds some unnecessary risk. We can discuss offline whether we should try it, but maybe not for this ticket.

I remember using standalone harvest on these data a year or two ago, and I wrote a Python script to split the input and parallelize harvest. But we can make a first attempt using standalone harvest as-is on the full collection and see what happens.
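
For reference, the splitting idea was roughly the following, shown here as a shell sketch rather than the original Python script, and assuming one pre-generated harvest config per subset of the input (the configs/chunk_*.xml layout is hypothetical):

    # run up to 4 standalone harvests concurrently, one per pre-generated config
    # (each config points at a different subset of the downloaded directories)
    ls configs/chunk_*.xml | xargs -n 1 -P 4 harvest -c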

@rchenatjpl

@jordanpadams @tloubrieu-jpl Holy cow, how do we feel about errors? I'm going to plow ahead regardless. I'm finding duplicate lines in the massive file https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/collection_inventory_ovirs_data_calibrated.csv

% grep 20181102t040122s658_ovr_spacel2 data_calibrated/collection_inventory_ovirs_data_calibrated.csv
P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2.fits::1.0
P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2.fits::1.0
P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2_calv2.fits::2.0
% wc data_calibrated/collection_inventory_ovirs_data_calibrated.csv
1597353 1597353 137482254 data_calibrated/collection_inventory_ovirs_data_calibrated.csv
% sort data_calibrated/collection_inventory_ovirs_data_calibrated.csv | uniq | wc
1169346 1169346 101731008

@tloubrieu-jpl (Member, Author)

Let's assume harvest does not care. You can try to harvest the collection as-is. But I guess we should tell SBN-PSI about it.

@rchenatjpl have you been able to download the full collection yet?

@tloubrieu-jpl (Member, Author)

Oh, but that means roughly 30% of the lines are duplicates. Am I reading your wc results correctly? For performance purposes we might save some time if we clean that file up before harvest runs on it.
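
Something like this would do it (a sketch; it assumes line order in the inventory does not matter, which should be true for a list of P,lidvid entries):

    # drop duplicate inventory lines before harvest reads the file
    sort -u collection_inventory_ovirs_data_calibrated.csv > collection_inventory_dedup.csv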

@rchenatjpl

@tloubrieu-jpl @jordanpadams To be sure I'm doing something reasonable: I'm downloading parts of the collection, harvesting, then deleting those files to make room for more parts. I am replacing the prefix of the path with PSI's web site while harvesting. I have not approved any yet. If this is the wrong approach, please let me know soon. Thanks

@tloubrieu-jpl (Member, Author)

@rchenatjpl that looks reasonable to me, but you would spare yourself some pain if you had more disk space. Where are you downloading the data? On pdscloud-prod?

@rchenatjpl commented May 31, 2023

Thanks, Thomas. I've been downloading onto the production machine. du -k so far says 453514192, which is 453 GB; that doesn't seem like much, but I think Andrew or someone said he increased the disk space for $DATA_HOME to 350 GB. I've killed the production machine twice, which is still affecting my other work, and I have more to ingest.

OMG, I'm looking at Carol's email now, and her total is 1206 GB. The numbers from her individual directories often don't match what I downloaded, sometimes off by 2x, sometimes by something else. Eh, I'll just keep doing what I'm doing.

@rchenatjpl

@tloubrieu-jpl @jordanpadams I may be done. I hope I harvested 1169346 labels. If being precise matters, is there a way to dump all the LIDs that start with urn:nasa:pds:orex.ovirs:data_calibrated:? I still wouldn't be able to give you an ironclad guarantee that the VIDs match.

@tloubrieu-jpl (Member, Author)

That is great @rchenatjpl! I was not able to find the collection itself yet, but I was able to see at least one of the observational products.
We will need to change the status of the collection from staged to archived as well. I will do more investigation tonight, hopefully, and let you know what remains to be done.

Thanks!

@tloubrieu-jpl (Member, Author) commented Jun 15, 2023

@rchenatjpl,

The number of products whose lid starts with urn:nasa:pds:orex.ovirs:data_calibrated: is 1170078, which sounds perfect.
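
In case it is useful, a count like that can be pulled straight from OpenSearch with something along these lines (a sketch: the "registry" index name and the basic-auth setup are assumptions on my part):

    # count products whose lid starts with the collection prefix
    # (index name "registry" is an assumption; credentials elided)
    curl -s -u "$ES_USER:$ES_PASS" \
      -H 'Content-Type: application/json' \
      'https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/registry/_count' \
      -d '{"query": {"prefix": {"lid": "urn:nasa:pds:orex.ovirs:data_calibrated:"}}}'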

I confirm that I don't see the collection itself (with lid=urn:nasa:pds:orex.ovirs:data_calibrated). It is not in the registry.

Could you add it? I guess when you loaded the products in parts, you missed it.

Once that is done, you will be able to switch the archive status for the full collection with a single registry-mgr command:

    ./registry-manager set-archive-status -status archived -lidvid {the lidvid of the collection} -es https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443 -auth ...

@rchenatjpl

I ingested collection* and then tried to change the archive_status. Maybe it worked?
[pds4@pdscloud-prod1 test]$ registry-manager set-archive-status -status archived -lidvid urn:nasa:pds:orex.ovirs:data_calibrated::11.0 -es https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443 -auth auth.txt
[INFO] Setting product status. LIDVID = urn:nasa:pds:orex.ovirs:data_calibrated::11.0, status = archived
[INFO] Setting status of primary references from collection inventory
[ERROR] 10,000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]
[pds4@pdscloud-prod1 test]$
[pds4@pdscloud-prod1 test]$

The collection LIDVID urn:nasa:pds:orex.ovirs:data_calibrated::11.0 shows ops:Tracking_Meta/ops:archive_status = "archived", as does one lower-level product, but I don't know whether all of them got changed to "archived".
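
If someone wants to double-check, a count of members not yet archived would settle it. Something like this (a sketch; the index name and the exact field mapping are guesses on my part):

    # count collection members that are NOT archived yet; a result of 0 means done
    # (index name "registry" and the field mapping are assumptions)
    curl -s -u "$ES_USER:$ES_PASS" \
      -H 'Content-Type: application/json' \
      'https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/registry/_count' \
      -d '{"query": {"bool": {"must": [{"prefix": {"lid": "urn:nasa:pds:orex.ovirs:data_calibrated:"}}], "must_not": [{"match": {"ops:Tracking_Meta/ops:archive_status": "archived"}}]}}}'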

@tloubrieu-jpl (Member, Author)

Thanks very much @rchenatjpl, we can see the collection and its members in the registry-api now. See https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated

@jordanpadams (Member)

@tloubrieu-jpl are we sure everything was loaded? That timeout on connection worries me...

Also, new requirement for registry-mgr fault tolerance :-)
