Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation for entered coordinate values in Geospatial metadata (bounding boxes, etc) #9547

Closed
landreev opened this issue Apr 21, 2023 · 13 comments · Fixed by #10142
Closed
Assignees
Labels
pm.netcdf-hdf5.d All 3 aims are currently under this deliverable Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) Type: Bug a defect
Milestone

Comments

@landreev
Copy link
Contributor

TL;DR: We allow users to enter junk on the geo metadata form. That ends poorly.

This came in as a support request, somebody published a dataset, but it was not showing up on the collection page. Meaning, indexing failed, so according to the search engine, it was still an unpublished draft.
On closer inspection, indexing was failing because it didn't like the geo bounding box coordinates:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at ...
Error adding field 'geolocation'='ENVELOPE(- 81.06667,-75.19694,-1.363889,-5.015556)' 
msg=Unable to parse shape given formats "lat,lon", "x y" or as WKT because 
java.text.ParseException: java.lang.NumberFormatException: For input string: "-" input:

(note that the above was from server.log; zero diagnostic info back from the index api)

The only problem there was the space character in "- 81.06667". For this user it was addressed by fixing the field value in the database and reindexing. But we clearly need to validate the values that are being entered, on the edit form or via the api.

@landreev
Copy link
Contributor Author

Rather, validation, and, maybe, some automatic sanitation. Stripping white spaces quietly seems like a safe enough bet; and it would have addressed this situation.

@pdurbin
Copy link
Member

pdurbin commented Apr 24, 2023

A couple things. As @jggautier pointed out, this issue is related:

Also, I bumped into this while making the following pull request...

... because apparently some NetCDF files use a "domain" of 0 to 360 for longitude instead of -180 to 180 like Solr expects. In that PR, I'm converting it if needed.

It's always tricky to add validation after the fact because stuff tends to blow up for old data, but yes, we should try.

We should also skip indexing if the data is invalid. That's what we do for bad dates (when dates aren't in the YYYY-MM-DD format or whatever).

@philippconzett
Copy link
Contributor

Related to this:

As pointed out on the community call last week, there is some confusion about what format coordinates should have in the geospatial metadata block. The guidance text uses comma as the decimal separator, e.g., 180,0, but the correct format seems to be period, e.g., 180.0.

From some version (I can't remember which one, other than it was after 5.6) Dataverse Solr started to validate the format and content of the coordinate fields in the geospatial metadata block. As for the format, period not comma has to be used. As for the content for bounding boxes, the value for West Longitude must be lower than the value for East Longitude; the value for South Latitude must be lower than the value for North Latitude. This might seem obvious, but at DataverseNO, we got Solr validation errors for more than 900 datasets when we upgraded from v5.6 to v5.13. This indicates, that adding coordinate information is not as trivial as it might seem. We have fixed these datasets and updated our deposit guidelines accordingly; see section GEOSPATIAL METADATA.

As pointed out above, coordinate values should be validated at creation.

@pdurbin
Copy link
Member

pdurbin commented May 30, 2023

I just checked and all of these confusing commas are definitely coming from geospatial.tsv:

$ cat scripts/api/data/metadatablocks/geospatial.tsv | grep ',0' | cut -f4 | sed G

Westernmost coordinate delimiting the geographic extent of the Dataset. A valid range of values, expressed in decimal degrees, is -180,0 <= West Bounding Longitude Value <= 180,0.

Easternmost coordinate delimiting the geographic extent of the Dataset. A valid range of values, expressed in decimal degrees, is -180,0 <= East Bounding Longitude Value <= 180,0.

Northernmost coordinate delimiting the geographic extent of the Dataset. A valid range of values, expressed in decimal degrees, is -90,0 <= North Bounding Latitude Value <= 90,0.

Southernmost coordinate delimiting the geographic extent of the Dataset. A valid range of values, expressed in decimal degrees, is -90,0 <= South Bounding Latitude Value <= 90,0.

If I squint I can sort of guess what was intended by these tooltips but the commas are not helping. We should reword them and drop the commas.

The newish errors are caused by this geospatial search pull request being merged as part of 5.13:

In short, as of 5.13 we send the geospatial bounding box to Solr. Solr only wants numbers and these numbers must be in a specific range (hinted at above in the tooltips). One can see an error like this, for example:

Bad X value 343.68 is not in boundary Rect(minX=-180.0,maxX=180.0,minY=-90.0,maxY=90.0) input: ENVELOPE(343.68,353.78,49.62,41.8)

@scolapasta scolapasta moved this to Dataverse Team (Gustavo) in IQSS Dataverse Project May 31, 2023
@cmbz
Copy link

cmbz commented Oct 4, 2023

2023/10/04

@cmbz cmbz moved this from Dataverse Team (Gustavo) to SPRINT- NEEDS SIZING in IQSS Dataverse Project Oct 4, 2023
@cmbz cmbz added Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) pm.netcdf-hdf5.d All 3 aims are currently under this deliverable Type: Bug a defect labels Oct 4, 2023
@cmbz cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Oct 11, 2023
@pdurbin
Copy link
Member

pdurbin commented Oct 13, 2023

We have code like this…

parseGeospatial(getNetcdfFile(file));

getStandardLongitude(new WestAndEastLongitude(geoFields.get(WEST_LONGITUDE), geoFields.get(EAST_LONGITUDE)));

… that we can hopefully refactor into a method we can use at index time as a sanity check before we feed geospatial values to Solr it can’t handle: https://github.com/IQSS/dataverse/blob/v6.0/src/main/java/edu/harvard/iq/dataverse/ingest/metadataextraction/impl/plugins/netcdf/NetcdfFileMetadataExtractor.java#L52-L53

@cmbz cmbz added this to the 6.1 milestone Nov 15, 2023
@scolapasta scolapasta moved this from SPRINT READY to Clear of the Backlog in IQSS Dataverse Project Nov 15, 2023
@cmbz
Copy link

cmbz commented Nov 15, 2023

2023/11/15: Tagging with 6.1 with hope!

@stevenwinship stevenwinship self-assigned this Nov 16, 2023
@landreev
Copy link
Contributor Author

Yes, it will be absolutely great to fix this for 6.1. Just to fix the validation on the Dataset page/metadata edit form, as described in the opening comment, would be great.

Would be even better if we could expand the scope just a tiny bit, and use the same validation method(s) inside the IndexServiceBean and fix the current behavior, where it silently fails to index any other metadata field in the dataset where a bad geobox is encountered; as we discussed previously.
But, one thing at a time.

Will add a reproducer in a sec.

@landreev
Copy link
Contributor Author

landreev commented Nov 16, 2023

@stevenwinship
A reproducer:
Here's a bad geobox from a real life dataset in our production (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/PA6Q9B):
Screen Shot 2023-11-16 at 4 54 12 PM

Why is it bad? - Because the northernmost and southernmost corners of the box are swapped - i.e., the top coordinate is lower than the bottom.

To replicate:

On your Dataverse instance, log in as the admin user, go to Edit -> General Information on the main page. Make sure the Geospatial metadata block is enabled (it's not by default), like this:

Screen Shot 2023-11-16 at 4 57 03 PM

Then create a new empty dataset ("Add Data" -> "New Dataset"), with the bare minimum of metadata - title/description/subject ..., save it.

Go to the Metadata tab, then to "Add + Edit Metadata". At the bottom of the form, under "Geospatial Metadata", populate the "Geographic Bounding Box".

You can copy-and-paste the exact numeric values from the real dataset above:

Westernmost/Left: 31.659769 Easternmost/Right: 31.668014
Northernmost/Top: 34.541372 Southernmost/Bottom: 34.552483

... but then you can also enter any junk at all, that doesn't even look like coordinates, and the form will accept it! - and that's the problem we are trying to solve.

Save the dataset. Publish it. It should succeed, with no error messages. But if you go back to the collection ("dataverse") page, the dataset will still be showing as an unpublished draft. That's because the application failed quietly to index it in solr.

In the server log there will be an error message along the lines of

Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_4040893] Error adding field 'geolocation'='ENVELOPE(31.659769,31.668014,34.541372,34.552483)' msg=Unable to parse shape given formats "lat,lon", "x y" or as WKT because org.locationtech.spatial4j.exception.InvalidShapeException: maxY must be >= minY: 34.552483 to 34.541372 input: ENVELOPE(31.659769,31.668014,34.541372,34.552483)

... but, this was not communicated to the user in any way.

What we want to happen is the page should stop the user when they clicked "Save" on the metadata form, and explain to them why.

Note: Yes, the entry form for this "geo bounding box" values is problematic/confusing by itself. We may be encouraging users to enter junk there. We want to improve that too, but it should be prudent to focus on the validation first.

@landreev
Copy link
Contributor Author

Some screenshots of what the desired behavior of the form should look like:
If I enter something that's not a valid date in this field on the same form:
Screen Shot 2023-11-17 at 10 53 31 AM
and click "Save", this is what I get at the top of the form:
Screen Shot 2023-11-17 at 10 54 23 AM
and then this is shown next to the Date field:
Screen Shot 2023-11-17 at 10 54 42 AM

... we want something like this to happen when bad values are entered for the geobox as well.

@landreev
Copy link
Contributor Author

Actual validation rules that need to be applied:
Note that we have it explained on the form what the values should be; but the info is kinda hidden in the "help tooltips" that you see when you mouse over the little question marks:
Screen Shot 2023-11-17 at 11 07 24 AM

This "geobox" is what we call a "Compound Metadata Field" with 4 individual sub-fields, or "Primitive Fields" - the 2 longitude values + 2 latitude values spelling out the 4 corners of the box.

So, at the risk of spelling out the obvious, the rules are as follows:

  • each of the 4 individual fields must parse as a number with an arbitrary (?) number of decimal points - like 31.659769
  • each of the 2 longitudes must be in the (-180.0, 180.0) range;
  • each of the 2 lattitudes must be in the (-90.0, 90.0) range;
  • the westernmost end should be < the easternmost, etc.

One special case - we should keep allowing users to enter just one longitude and one latitude value. When indexing, this is treated as a "box" where X1=X2 and Y2=Y2.

@landreev
Copy link
Contributor Author

Finally, a good place in the existing code where we are going through the metadata in a dataset (in the "dataset version", to be precise), looking for geobox fields, and checking the specific values of the sub-fields:

The line 893 in IndexServiceAdmin.java:

for (DatasetField dsf : datasetVersion.getFlatDatasetFields()) {

is a loop where we go through the metadata fields in a datasetversion.
In the context of the code there, the version is the one we are trying to index. In the context of the Dataset page it will be the draft version the user is trying to save. But the basic idea will be the same.
Further down, at line 1024:

if(dsfType.getName().equals(DatasetFieldConstant.geographicBoundingBox)) {
there's some code that processes a geobox field when one is encountered.

What the code there does, it tries to go through every geobox field (datasets can have multiple geoboxes. you can enter as many as you want on the metadata form on the page as well - note the plus sign next to it), and tries to generate the min. and max. values of each subfields, so that the single "master" box that encompasses all of the individual ones can be defined. This is for the purposes of indexing - so that users can search for any data with geospatial definitions within specific coordinates. The code there unfortunately fails to weed out values that are not valid, so they end up being passed to solr, resulting in the failure described. In the context of the page save() method we need to similarly go through the geoboxes and throw an exception if one or more of them is invalid. Would make sense to use the same validation methods in both places, and anywhere else where we allow geo values to be entered or imported.

@pdurbin
Copy link
Member

pdurbin commented Nov 17, 2023

@stevenwinship I noticed you picked this up. Great! Please note that I wrote some existing tests in this area that might be helpful for you to study or update or tear apart:

@landreev all the detail you added above is fantastic! Thanks for pushing to get this issue prioritized.

Also, I mentioned this above, but to reinforce what I said in person, I did add some code that parses floats and looks at west vs east etc. I'm putting a snippet below. If we can delete this code and switch to new code, fine with me! 😄

    try {
        westAsFloat = Float.valueOf(westAndEastLongitude.getWestLongitude());
        eastAsFloat = Float.valueOf(westAndEastLongitude.getEastLongitude());
    } catch (NumberFormatException ex) {
        return null;
    }
    // "If one of them is > 180, the domain is 0:360"
    if (westAsFloat > 180 && eastAsFloat > 180) {
        Float westStandard = westAsFloat - 360;
        Float eastStandard = eastAsFloat - 360;
        WestAndEastLongitude updatedWeLong = new WestAndEastLongitude(westStandard.toString(), eastStandard.toString());
        return updatedWeLong;
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pm.netcdf-hdf5.d All 3 aims are currently under this deliverable Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) Type: Bug a defect
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

5 participants