Add validation for entered coordinate values in Geospatial metadata (bounding boxes, etc) #9547
Comments
Rather, validation and, maybe, some automatic sanitization. Quietly stripping whitespace seems like a safe enough bet, and it would have addressed this situation.
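A minimal sketch of the quiet sanitization suggested here. `CoordinateSanitizer` is a hypothetical helper, not a class in the Dataverse codebase; it simply removes whitespace from an entered value before any parsing or validation happens.

```java
/** Hypothetical helper (not part of Dataverse): quiet whitespace stripping for coordinate input. */
final class CoordinateSanitizer {

    private CoordinateSanitizer() {
    }

    /**
     * Removes all whitespace matched by the regex class \s (leading, trailing,
     * and embedded) from a raw coordinate string. Returns null for null input.
     */
    static String stripWhitespace(String raw) {
        if (raw == null) {
            return null;
        }
        return raw.replaceAll("\\s+", "");
    }
}
```

With this in place, a value such as "- 81.06667" becomes "-81.06667", which `Double.parseDouble` accepts. Anything beyond whitespace (commas, letters) is deliberately left for validation to reject rather than silently rewritten.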
A couple of things. As @jggautier pointed out, this issue is related: Also, I bumped into this while making the following pull request... ... because apparently some NetCDF files use a longitude "domain" of 0 to 360 instead of the -180 to 180 that Solr expects. In that PR, I'm converting the values if needed. It's always tricky to add validation after the fact because stuff tends to blow up for old data, but yes, we should try. We should also skip indexing if the data is invalid. That's what we do for bad dates (when dates aren't in the YYYY-MM-DD format or whatever).
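For illustration, the 0 to 360 conversion mentioned here could look roughly like the sketch below. This is not the code from the pull request; `LongitudeNormalizer` and its method name are hypothetical, and the sketch assumes that any value above 180 comes from the 0 to 360 domain.

```java
/** Hypothetical helper (not the code from the PR): maps 0..360 longitudes onto -180..180. */
final class LongitudeNormalizer {

    private LongitudeNormalizer() {
    }

    /**
     * Returns the longitude expressed in the -180..180 domain.
     * Values in (180, 360] are assumed to use the 0..360 domain and are shifted by 360;
     * values already within -180..180 are returned unchanged.
     */
    static double toSignedDegrees(double longitude) {
        if (Double.isNaN(longitude) || longitude < -180.0 || longitude > 360.0) {
            throw new IllegalArgumentException("Longitude out of range: " + longitude);
        }
        return longitude > 180.0 ? longitude - 360.0 : longitude;
    }
}
```

For example, 350.0 maps to -10.0 and 190.0 maps to -170.0. A single value of 180 or below is ambiguous between the two domains, so a real implementation would likely need to decide per file or per bounding box which domain is in use.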
Related to this: As pointed out on the community call last week, there is some confusion about what format coordinates should have in the geospatial metadata block. The guidance text uses a comma as the decimal separator, e.g., 180,0, but the correct format seems to be a period, e.g., 180.0. Starting with some version (I can't remember which one, other than it was after 5.6), Dataverse Solr started to validate the format and content of the coordinate fields in the geospatial metadata block. As for the format, a period, not a comma, has to be used. As for the content of bounding boxes, the value for West Longitude must be lower than the value for East Longitude, and the value for South Latitude must be lower than the value for North Latitude. This might seem obvious, but at DataverseNO, we got Solr validation errors for more than 900 datasets when we upgraded from v5.6 to v5.13. This indicates that adding coordinate information is not as trivial as it might seem. We have fixed these datasets and updated our deposit guidelines accordingly; see section GEOSPATIAL METADATA. As pointed out above, coordinate values should be validated at creation.
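The rules described in this comment (period as decimal separator, longitudes within -180 to 180, latitudes within -90 to 90, west not above east, south not above north) can be sketched as a small validator. `BoundingBoxValidator` is a hypothetical class, not existing Dataverse code. Allowing west to equal east (a degenerate box) is an assumption made here; the comment itself says "lower than".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical validator (not part of Dataverse) for the four bounding box sub-fields. */
final class BoundingBoxValidator {

    private BoundingBoxValidator() {
    }

    /** Returns a list of human-readable problems; an empty list means the box looks valid. */
    static List<String> validate(String west, String east, String north, String south) {
        List<String> errors = new ArrayList<>();
        Double w = parse("West Longitude", west, -180.0, 180.0, errors);
        Double e = parse("East Longitude", east, -180.0, 180.0, errors);
        Double n = parse("North Latitude", north, -90.0, 90.0, errors);
        Double s = parse("South Latitude", south, -90.0, 90.0, errors);
        // Ordering checks only run when both values parsed cleanly.
        if (w != null && e != null && w > e) {
            errors.add("West Longitude must not be greater than East Longitude");
        }
        if (s != null && n != null && s > n) {
            errors.add("South Latitude must not be greater than North Latitude");
        }
        return errors;
    }

    /** Parses one value; records an error and returns null if the format or range is wrong. */
    private static Double parse(String label, String value, double min, double max, List<String> errors) {
        // Plain decimal with a period as separator: rejects "180,0", "- 81.06667", "1e5", "NaN".
        if (value == null || !value.matches("-?\\d+(\\.\\d+)?")) {
            errors.add(label + " is not a decimal number with a period separator: " + value);
            return null;
        }
        double parsed = Double.parseDouble(value);
        if (parsed < min || parsed > max) {
            errors.add(label + " must be between " + min + " and " + max + ": " + value);
            return null;
        }
        return parsed;
    }
}
```

Returning a list of messages rather than throwing on the first problem lets a form report every bad sub-field at once; an index-time caller could instead just check whether the list is empty.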
I just checked and all of these confusing commas are definitely coming from geospatial.tsv:

$ cat scripts/api/data/metadatablocks/geospatial.tsv | grep ',0' | cut -f4 | sed G

Westernmost coordinate delimiting the geographic extent of the Dataset. A valid range of values, expressed in decimal degrees, is -180,0 <= West Bounding Longitude Value <= 180,0.

Easternmost coordinate delimiting the geographic extent of the Dataset. A valid range of values, expressed in decimal degrees, is -180,0 <= East Bounding Longitude Value <= 180,0.

Northernmost coordinate delimiting the geographic extent of the Dataset. A valid range of values, expressed in decimal degrees, is -90,0 <= North Bounding Latitude Value <= 90,0.

Southernmost coordinate delimiting the geographic extent of the Dataset. A valid range of values, expressed in decimal degrees, is -90,0 <= South Bounding Latitude Value <= 90,0.

If I squint I can sort of guess what was intended by these tooltips but the commas are not helping. We should reword them and drop the commas.

The newish errors are caused by this geospatial search pull request being merged as part of 5.13: In short, as of 5.13 we send the geospatial bounding box to Solr. Solr only wants numbers and these numbers must be in a specific range (hinted at above in the tooltips). One can see an error like this, for example:
2023/10/04
We have code like this…
… that we can hopefully refactor into a method we can use at index time as a sanity check before we feed Solr geospatial values it can't handle: https://github.com/IQSS/dataverse/blob/v6.0/src/main/java/edu/harvard/iq/dataverse/ingest/metadataextraction/impl/plugins/netcdf/NetcdfFileMetadataExtractor.java#L52-L53
2023/11/15: Tagging with 6.1 with hope! |
Yes, it will be absolutely great to fix this for 6.1. Just to fix the validation on the Dataset page/metadata edit form, as described in the opening comment, would be great. Would be even better if we could expand the scope just a tiny bit, and use the same validation method(s) inside the IndexServiceBean and fix the current behavior, where it silently fails to index any other metadata field in the dataset where a bad geobox is encountered; as we discussed previously. Will add a reproducer in a sec. |
@stevenwinship Why is it bad? Because the northernmost and southernmost corners of the box are swapped, i.e., the top coordinate is lower than the bottom.

To replicate:

1. On your Dataverse instance, log in as the admin user and go to Edit -> General Information on the main page. Make sure the Geospatial metadata block is enabled (it's not by default), like this:
2. Create a new empty dataset ("Add Data" -> "New Dataset") with the bare minimum of metadata (title/description/subject ...) and save it.
3. Go to the Metadata tab, then to "Add + Edit Metadata". At the bottom of the form, under "Geospatial Metadata", populate the "Geographic Bounding Box". You can copy and paste the exact numeric values from the real dataset above: Westernmost/Left: 31.659769 Easternmost/Right: 31.668014 ... but you can also enter any junk at all that doesn't even look like coordinates, and the form will accept it! That's the problem we are trying to solve.
4. Save the dataset.
5. Publish it. It should succeed, with no error messages.

But if you go back to the collection ("dataverse") page, the dataset will still be showing as an unpublished draft. That's because the application quietly failed to index it in Solr. In the server log there will be an error message along the lines of
... but this was not communicated to the user in any way. What we want is for the page to stop the user when they click "Save" on the metadata form, and explain to them why. Note: Yes, the entry form for these "geo bounding box" values is problematic/confusing by itself. We may be encouraging users to enter junk there. We want to improve that too, but it would be prudent to focus on the validation first.
Finally, there is a good place in the existing code where we are already going through the metadata in a dataset (in the "dataset version", to be precise), looking for geobox fields and checking the specific values of the sub-fields. Line 893 in IndexServiceBean.java:
is a loop where we go through the metadata fields in a datasetversion.
What the code there does is go through every geobox field (datasets can have multiple geoboxes; you can enter as many as you want on the metadata form on the page as well, note the plus sign next to it) and try to generate the min. and max. values of each sub-field, so that a single "master" box that encompasses all of the individual ones can be defined. This is for the purposes of indexing, so that users can search for any data with geospatial definitions within specific coordinates. The code there unfortunately fails to weed out values that are not valid, so they end up being passed to Solr, resulting in the failure described. In the context of the page save() method we need to similarly go through the geoboxes and throw an exception if one or more of them is invalid. It would make sense to use the same validation methods in both places, and anywhere else where we allow geo values to be entered or imported.
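The index-time behavior described here (compute one enclosing box, but weed out invalid boxes first) might be sketched as follows. `GeoBox` and `enclosing` are hypothetical names, not the actual IndexServiceBean code, and the sketch assumes the four sub-field values have already been parsed into doubles.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical value class (not the IndexServiceBean code): one parsed bounding box. */
final class GeoBox {
    final double west;
    final double east;
    final double north;
    final double south;

    GeoBox(double west, double east, double north, double south) {
        this.west = west;
        this.east = east;
        this.north = north;
        this.south = south;
    }

    /** True when all values are in range and correctly ordered; any NaN makes this false. */
    boolean isValid() {
        return west >= -180.0 && east <= 180.0 && west <= east
                && south >= -90.0 && north <= 90.0 && south <= north;
    }

    /** Returns the smallest box covering every valid input box, or empty if none is valid. */
    static Optional<GeoBox> enclosing(List<GeoBox> boxes) {
        GeoBox result = null;
        for (GeoBox box : boxes) {
            if (!box.isValid()) {
                continue; // skip bad boxes instead of passing them on to the search index
            }
            result = (result == null) ? box : new GeoBox(
                    Math.min(result.west, box.west),
                    Math.max(result.east, box.east),
                    Math.max(result.north, box.north),
                    Math.min(result.south, box.south));
        }
        return Optional.ofNullable(result);
    }
}
```

At index time, an empty result would mean no geospatial value is sent at all while the rest of the dataset is still indexed. On the save path, the same `isValid()` check could instead be used to reject the form, which keeps both places on one definition of "valid".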
@stevenwinship I noticed you picked this up. Great! Please note that I wrote some existing tests in this area that might be helpful for you to study or update or tear apart:
@landreev all the detail you added above is fantastic! Thanks for pushing to get this issue prioritized. Also, I mentioned this above, but to reinforce what I said in person, I did add some code that parses floats and looks at west vs east etc. I'm putting a snippet below. If we can delete this code and switch to new code, fine with me! 😄
…r the entry form + extra formatting for display. #9547
TL;DR: We allow users to enter junk on the geo metadata form. That ends poorly.

This came in as a support request: somebody published a dataset, but it was not showing up on the collection page. Meaning, indexing failed, so according to the search engine, it was still an unpublished draft.
On closer inspection, indexing was failing because it didn't like the geo bounding box coordinates:
(note that the above was from server.log; zero diagnostic info back from the index api)
The only problem there was the space character in "- 81.06667". For this user it was addressed by fixing the field value in the database and reindexing. But we clearly need to validate the values that are being entered, on the edit form or via the API.