Fix S3 direct upload NPE and keep NetCDF/HDF5 geospatial metadata extraction #9611
Conversation
Note that the NcML aux file is not created when S3 direct upload is enabled.
By default, keep S3 direct upload fast. Don't download NetCDF or HDF5 files to try to pull geospatial metadata out of them when S3 direct upload is configured. If you really want this, add this setting and make it true.
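For anyone who wants the opt-in, here is a sketch of how the setting could be supplied on a Payara-based installation; the asadmin form is an assumption based on the setting name discussed in this PR, and the env var name is the one mentioned later in this thread.

```shell
# Hedged example, not copied from the docs: opt back in to NetCDF/HDF5
# geospatial extraction when S3 direct upload is enabled.

# As a JVM option (Payara):
./asadmin create-jvm-options "-Ddataverse.netcdf.geo-extract-s3-direct-upload=true"

# Or as an environment variable (e.g. in a Docker-based dev setup):
export DATAVERSE_NETCDF_GEO_EXTRACT_S3_DIRECT_UPLOAD=true
```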
Conflicts:
    doc/sphinx-guides/source/installation/config.rst
    src/main/java/edu/harvard/iq/dataverse/ingest/IngestServiceBean.java
Having some trouble getting this to work. I tried an env var, a JVM option, and a .properties file.
@pdurbin I was unable to get this working using a JVM option. I also tried an env var, DATAVERSE_NETCDF_GEO_EXTRACT_S3_DIRECT_UPLOAD, with no luck.
@@ -123,6 +123,10 @@ public enum JvmSettings {
        SCOPE_UI(PREFIX, "ui"),
        UI_ALLOW_REVIEW_INCOMPLETE(SCOPE_UI, "allow-review-for-incomplete"),
        UI_SHOW_VALIDITY_FILTER(SCOPE_UI, "show-validity-filter"),

        // NetCDF SETTINGS
        SCOPE_NETCDF(PREFIX, "netcdf"),
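The constant for the new setting itself is cut off above; presumably it follows the existing pattern, along the lines of this hypothetical sketch (the actual name in the PR may differ):

```java
// Hypothetical continuation of the truncated diff above, mirroring the SCOPE_UI
// entries; the real constant name in the PR may differ.
NETCDF_GEO_EXTRACT_S3_DIRECT_UPLOAD(SCOPE_NETCDF, "geo-extract-s3-direct-upload"),
```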
Can we please start having an "ingest" scope where we collect these? (NetCDF being a subscope of it)
Maybe? The timing is poor because I was away for two weeks and now you're away until August.
What are you suggesting, specifically?
Instead of...
dataverse.netcdf.geo-extract-s3-direct-upload
... you'd like to see...
dataverse.ingest.netcdf.geo-extract-s3-direct-upload
?
I'm ok with it if @landreev and/or @scolapasta are ok with it. And @sekmiller did the review. And @qqmyers does a lot of storage work. They might want to weigh in.
For the record, I looked for ingest stuff in our config page and we do already have these:
dataverse.files.<id>.ingestsizelimit
:TabularIngestSizeLimit
Personally, I've been referring to this feature as "metadata extraction" (from a file) rather than ingest. I'm not sure where we draw the line on what ingest is.
I'll plug having a glossary, like I always do! 😜
I removed the "waiting" label 'cuz I'm back, baby! 😜 First, I merged the latest from develop. Then I retested, this time on my dev1 server using JVM options rather than on my laptop using Docker and environment variables (both should work). Here's the file I'm using to test: https://github.com/IQSS/dataverse/blob/develop/src/test/resources/netcdf/ICOADS_R3.0.0_1662-10.nc The geospatial block must be enabled for the bounding box to be populated. By using If you remove that JVM option and keep the setup the same (s3 direct upload enabled), the bounding box should not be populated. For reference, here are some of my JVM options. I configured a new collection to use this "s3" store:
I'm putting this into "Ready for QA". I'm happy to be co-assigned to work through any problems.
@pdurbin Thanks for the examples. I misunderstood: I was thinking of the aux file that is generated post-upload for NetCDF files. Also, a NetCDF file may not contain bounding box info, so you need to test with one that does. I have confirmed that bounding box metadata is extracted for direct S3 upload. Note that the NetCDF aux file is not generated for direct S3 upload.
What this PR does / why we need it:
The fileMetadataExtractableFromNetcdf method added in PR #9541 breaks S3 direct upload with an NPE. This PR fixes it.
It turns out a number of features don't work when S3 direct upload is enabled. This PR enumerates them.
Finally, a config setting was added: dataverse.netcdf.geo-extract-s3-direct-upload.
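To make the shape of the fix concrete, here is a minimal hypothetical sketch of the kind of guard the description implies; the class, the opt-in flag, and the MIME-type strings are stand-ins for illustration, not the actual patch.

```java
import java.nio.file.Path;

// Hypothetical, simplified sketch of the guard implied by the PR description;
// names and types are stand-ins, not the actual Dataverse code.
class NetcdfExtractionGuard {

    // True only if dataverse.netcdf.geo-extract-s3-direct-upload is set to true.
    private final boolean geoExtractOnDirectUpload;

    NetcdfExtractionGuard(boolean geoExtractOnDirectUpload) {
        this.geoExtractOnDirectUpload = geoExtractOnDirectUpload;
    }

    boolean metadataExtractable(String contentType, Path tempLocation) {
        if (tempLocation == null) {
            // S3 direct upload: the file was never staged locally. Previously this
            // case led to an NPE; now extraction is attempted only if opted in.
            return geoExtractOnDirectUpload && isNetcdfOrHdf5(contentType);
        }
        // A local temp file exists, so extraction can proceed as before.
        return isNetcdfOrHdf5(contentType);
    }

    private boolean isNetcdfOrHdf5(String contentType) {
        return "application/netcdf".equals(contentType)
                || "application/x-hdf5".equals(contentType);
    }
}
```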
Which issue(s) this PR closes:
Special notes for your reviewer:
When the check if (driverType.equals("tmp")) in IngestServiceBean.saveAndAddFilesToDataset() executes, note that if S3 direct upload is enabled, driverType is "s3". However, if S3 direct upload is disabled (but S3 is still enabled), driverType is "tmp". (I found this confusing. I'd be happy to add a comment in the code about this.)

This means that methods like extractMetadataNcml() and the code that generates thumbnails are not executed when S3 direct upload is enabled. That behavior is not changed by this pull request, but it did lead me to set expectations better in the guides.
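As an illustration of the comment being offered, here is the explanation above restated as a hypothetical annotated snippet; it is not the actual IngestServiceBean code.

```java
// Hypothetical restatement of the driverType behavior described above; not the
// actual IngestServiceBean code.
class StorageDriverNote {
    static boolean localProcessingRuns(String driverType) {
        // driverType is "tmp" when the uploaded file was staged on local disk first,
        // which happens even for S3 stores as long as direct upload is disabled.
        // With S3 direct upload enabled, driverType is "s3", so this returns false and
        // steps like extractMetadataNcml() and thumbnail generation are skipped.
        return driverType.equals("tmp");
    }
}
```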
Suggestions on how to test this:
Primary bug:
To test FITS extraction:
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:
Yes, I updated the previous release note to mention the new setting.
Additional documentation:
I expanded our docs on what doesn't work during S3 direct upload, including NetCDF/HDF5 metadata extraction. I also documented the new setting.