Harvest OREX dataset from SBN-PSI web #196
Comments
@rchenatjpl can you help us load this data into our EN registry? we will eventually delete it, but we want to have this loaded for some demo purposes.
Would you be able to help us out here? When running harvest, we want to load from our machine, but we should make sure the URL points to their data on their servers: https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/
Sure. To be clear, that collection seems to have 1.6 million files, and they're downloading very slowly. If anyone knows a better way than wget, please say so. And sorry I've forgotten, and correct me if I'm wrong, but the way to point to the PSI web site is to change this in the config file.
@rchenatjpl yeah... it is going to be very slow unfortunately. wget is all I know. per the config file, that is correct! I think it will be something like:
once you try to register the data, the |
@jordanpadams I need more disk space. I believe I'm responsible for killing https://pds.nasa.gov earlier. I freed up a little by moving two directories to /tmp on pdscloud-prod1, but I think I'll need more. See |
This one collection is enormous. Should I harvest it in pieces? Does harvest check against the collection.csv?
Hi @rchenatjpl @jordanpadams, we could use our scalable harvest service for that job. @rchenatjpl let me know where it should be deployed and I will help you with that. It is a different version of harvest, meant to work on larger sets of files.
Actually @jordanpadams, we could ask @sjoshi-jpl to deploy the scalable harvest on AWS ECS so we can scale it up and run parallel harvests. That could be a good demo for other nodes. The deployment might also be reused for nucleus/css.
@tloubrieu-jpl we should maybe chat about this offline. Architecturally, is scalable harvest really built for the cloud? The way the services are built, they don't seem to be designed for a serverless environment. I may be wrong. This may require some rethinking of how to deploy it. Also, I actually think this would be a great benchmark test for the standalone harvest. Thoughts?
@jordanpadams Whatever works will be good, since the priority is to have these data ingested, and you are right that using scalable harvest adds unnecessary risk. We can discuss offline whether we should try that, but maybe not for this ticket. I remember using standalone harvest for these data 1 or 2 years ago, and I created a python script to split the input and parallelize harvest. But we can make a first attempt using standalone harvest as-is on the full collection and see what happens.
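For reference, the "split the input" idea mentioned above can be sketched roughly as follows. This is not the original script, only a minimal illustration of cutting a long list of label paths into fixed-size batches so that each batch could be handed to a separate harvest run. The function name `batches` and the sample file names are hypothetical, and how each batch is actually fed to harvest is left out on purpose.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(paths: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` paths (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(paths)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Made-up label names, for illustration only.
    labels = [f"label_{i:04d}.xml" for i in range(10)]
    for n, chunk in enumerate(batches(labels, 4)):
        print(n, len(chunk), chunk[0], chunk[-1])
```

Because the generator never materializes the whole list, it should also cope with an input of more than a million paths read lazily from a file.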
@jordanpadams @tloubrieu-jpl Holy cow, how do we feel about errors? I'm going to plow ahead regardless. I'm finding duplicate lines in the massive file https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/collection_inventory_ovirs_data_calibrated.csv
% grep 20181102t040122s658_ovr_spacel2 data_calibrated/collection_inventory_ovirs_data_calibrated.csv
Let's assume harvest does not care. You can try to harvest the collection as-is. But I guess we should tell SBN-PSI about that. @rchenatjpl have you been able to download the full collection yet?
Oh, but it looks like 30% is duplicated. Am I reading your wc results correctly? For performance purposes, we might gain some time if we clean that file up before harvest runs on it.
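A rough sketch of the cleanup suggested above, assuming the inventory is a plain text file with one entry per line and that exact duplicate lines can safely be dropped. The function name `duplicate_report` and the sample entries are hypothetical; the real inventory lines are not reproduced here.

```python
from typing import Iterable, List, Tuple


def duplicate_report(lines: Iterable[str]) -> Tuple[int, int, List[str]]:
    """Return (total entries, unique entries, unique entries in first-seen order)."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    # dict.fromkeys keeps the first occurrence of each line and preserves order.
    unique = list(dict.fromkeys(cleaned))
    return len(cleaned), len(unique), unique


if __name__ == "__main__":
    # Made-up inventory entries, for illustration only.
    sample = [
        "P,urn:example:collection:product_a::1.0",
        "P,urn:example:collection:product_b::1.0",
        "P,urn:example:collection:product_a::1.0",
    ]
    total, n_unique, unique = duplicate_report(sample)
    print(f"{total} lines, {n_unique} unique, {total - n_unique} duplicates")
```

To run it on the real file, the lines could be read with `open("collection_inventory_ovirs_data_calibrated.csv")` and the `unique` list written back out to a new file, leaving the original untouched.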
@tloubrieu-jpl @jordanpadams To be sure I'm doing something reasonable: I'm downloading parts of the collection, harvesting, then deleting those files to make room for more parts. I am replacing the prefix of the path with PSI's web site while harvesting. I have not approved any yet. If this is the wrong approach, please let me know soon. Thanks
@rchenatjpl that looks reasonable to me, but you would spare yourself some pain if you had more disk space. Where are you downloading the data? On pdscloud-prod?
Thanks, Thomas. I've been downloading onto the production machine. du -k so far says 453514192, which is 453GB, which doesn't seem that much, but I think Andrew or someone said he increased the disk space for $DATA_HOME to 350GB. I've killed the production machine twice, which is still affecting my other work. I also have more to ingest.
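The prefix replacement described above amounts to mapping each locally downloaded file back to its URL on the PSI server. A minimal sketch of that mapping, outside of harvest itself, might look like this. The function name `to_remote_url` and the local directory in the example are assumptions; only the base URL comes from this ticket. No URL quoting is applied, on the assumption that the file names contain no special characters.

```python
from pathlib import PurePosixPath

# Base URL taken from the ticket description.
REMOTE_BASE = "https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/"


def to_remote_url(local_path: str, local_root: str,
                  remote_base: str = REMOTE_BASE) -> str:
    """Swap the local download root for the remote base URL (hypothetical helper).

    Raises ValueError if local_path is not located under local_root.
    """
    rel = PurePosixPath(local_path).relative_to(PurePosixPath(local_root))
    return remote_base.rstrip("/") + "/" + rel.as_posix()


if __name__ == "__main__":
    # Hypothetical local layout, for illustration only.
    print(to_remote_url("/scratch/data_calibrated/sub/file.xml",
                        "/scratch/data_calibrated"))
```

Checking a handful of generated URLs against the server before a long harvest run is a cheap way to catch a wrong prefix early.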
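As a quick sanity check of the figure above, assuming `du -k` reports usage in 1024-byte units, the conversion can be worked out as follows. The function name `du_k_to_sizes` is only for illustration.

```python
from typing import Tuple


def du_k_to_sizes(kib_blocks: int) -> Tuple[int, float, float]:
    """Convert a `du -k` count (1024-byte units) to bytes, decimal GB and GiB."""
    n_bytes = kib_blocks * 1024
    return n_bytes, n_bytes / 1e9, n_bytes / 2**30


if __name__ == "__main__":
    n_bytes, gb, gib = du_k_to_sizes(453514192)
    print(f"{n_bytes} bytes = {gb:.1f} GB = {gib:.1f} GiB")
```

Under that assumption the total comes to about 464 GB (or about 432 GiB), so either way it is above the 350GB mentioned for $DATA_HOME.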
@tloubrieu-jpl @jordanpadams I may be done. I hope I harvested 1169346 labels. If being precise matters, is there a way to dump all the LIDs that start with urn:nasa:pds:orex.ovirs:data_calibrated:? I still wouldn't be able to give you an ironclad guarantee that the VIDs match.
That is great @rchenatjpl, I was not able to find the collection itself yet but was able to see at least one of the observational products. Thanks!
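If such a dump were available, the comparison asked about above could be sketched like this. It assumes two plain lists of LIDVID strings, one taken from the collection inventory and one dumped from the registry; how to obtain the registry dump is not covered here. The function names and the sample identifiers are hypothetical.

```python
from typing import Iterable, Set, Tuple

# LID prefix quoted in the comment above.
PREFIX = "urn:nasa:pds:orex.ovirs:data_calibrated:"


def split_lidvid(lidvid: str) -> Tuple[str, str]:
    """Split 'lid::vid' into (lid, vid); raise ValueError if the form is wrong."""
    lid, sep, vid = lidvid.partition("::")
    if not sep or not lid or not vid:
        raise ValueError(f"not a LIDVID: {lidvid!r}")
    return lid, vid


def compare_lidvids(expected: Iterable[str], registered: Iterable[str],
                    prefix: str = PREFIX) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (missing, extra, vid_mismatch) among LIDVIDs starting with prefix.

    vid_mismatch holds expected LIDVIDs whose LID is registered, but only
    under a different VID.
    """
    exp = {x for x in expected if x.startswith(prefix)}
    reg = {x for x in registered if x.startswith(prefix)}
    missing = exp - reg
    extra = reg - exp
    reg_lids = {split_lidvid(x)[0] for x in reg}
    vid_mismatch = {x for x in missing if split_lidvid(x)[0] in reg_lids}
    return missing, extra, vid_mismatch


if __name__ == "__main__":
    # Made-up product identifiers, for illustration only.
    inventory = [PREFIX + "prod_a::1.0", PREFIX + "prod_b::2.0", PREFIX + "prod_c::1.0"]
    registry = [PREFIX + "prod_a::1.0", PREFIX + "prod_b::1.0"]
    print(compare_lidvids(inventory, registry))
```

Separating `vid_mismatch` from products that are missing outright is what would address the concern about VIDs not matching.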
The number of products whose lid starts with
I confirm that I don't see the collection itself (with lid=urn:nasa:pds:orex.ovirs:data_calibrated). It is not in the registry. Could you add it? I guess that when you loaded the products in parts, you missed it. Once that is done, you will be able to switch the archive status for the full collection with a single
I ingested collection* then tried to change the archive_status. Maybe it worked? The collection LIDVID urn:nasa:pds:orex.ovirs:data_calibrated::11.0 shows ops:Tracking_Meta/ops:archive_status = "archived", as does one lower-level product, but I don't know if all got changed to "archived".
Thanks very much @rchenatjpl, we can see the collection and its members in the registry-api now. See https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated
@tloubrieu-jpl are we sure everything was loaded? That timeout on connection worries me... Also, new requirement for registry-mgr fault tolerance :-)
💡 Description
Find the dataset on https://arcnav.psi.edu/urn:nasa:pds:orex.ovirs:data_calibrated
We should download all the products of this collection and harvest them into the EN production registry.
The references to the labels and data files should still point to the SBN PSI web site: https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/