Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NJsonDocUtils.writeField() converts single-element-valued collection to String #29

Closed
alexdunnjpl opened this issue Nov 23, 2022 · 12 comments · Fixed by #30
Closed

NJsonDocUtils.writeField() converts single-element-valued collection to String #29

alexdunnjpl opened this issue Nov 23, 2022 · 12 comments · Fixed by #30
Assignees

Comments

@alexdunnjpl
Copy link
Contributor

🐛 Describe the bug

When encountering a collection-valued field with size 1, writeField() explicitly writes a String instead of an Array with a single String element

This is resulting in undesirable behaviour in some cases.

This must be fixed, but given that someone wrote it intentionally I'm loathe to unilaterally change it in case there are some cases where it's desirable - from what I've seen this will result in numerous fields being "corrected", not just ref_lid_collection.

📜 To Reproduce

  1. Harvest any bundle containing a single collection (e.g. https://pdssbn.astro.umd.edu/holdings/pds4-epoxi_mri-v1.0/)
    , into a registry instance
  2. Query OpenSearch for that bundle (e.g. with curl -k -u admin:admin https://localhost:9200/registry/_doc/urn:nasa:pds:epoxi_mri::1.0?_source=ref_lidvid_collection,ref_lid_collection | json_pp)
  3. Observe the value of ref_lid_collection

🕵️ Expected behavior

Unclear. Collection-like fields should be written as Arrays, but if there are cases where conversion is desired/required, these must be accounted for and not affected by the fix.

📚 Version of Software Used

registry-common==1.4.0

@alexdunnjpl
Copy link
Contributor Author

@jordanpadams @tloubrieu-jpl gonna need either permission to fix this in all cases, or some context for when it's desirable to convert a single-element string array to a string, to prevent this from potentially breaking a whole bunch of stuff (i.e. if end-users are reading such fields programmatically)

@tloubrieu-jpl
Copy link
Member

tloubrieu-jpl commented Nov 23, 2022

Hi @alexdunnjpl , good catch but I have no idea yet why this case has been coded in the first place. As you said I suspect there is a good reason for that although that does not sound like good thing to have in the code (even less without a comment).

@alexdunnjpl
Copy link
Contributor Author

Conversion originally implemented in this commit

The commit message is unrelated/opaque.

@alexdunnjpl
Copy link
Contributor Author

Oh, looks like it's a reimplementation of existing functionality that was unintentionally deleted in the previous commit.

Original source is here and/or here

@jordanpadams
Copy link
Member

@alexdunnjpl unless @tloubrieu-jpl has an objection, let's march forward with a PR fix, and we can evaluate if that is something we do not want to do

@tloubrieu-jpl
Copy link
Member

I cannot be 100% sure, but it is slowly emerging from my memories that without this we would have all the properties values show as arrays even if there is a single value. See screenshot:
image

Instead of scalar values we would have arrays with length 1. I remember I asked that change to Eugene when I started the development of the API service.

But I agree this code does not look good and I would prefer if we have an option not to have arrays values in the first place for single values. That might be related to the XML parser which is used ? @alexdunnjpl could you have a look at that ?

@alexdunnjpl
Copy link
Contributor Author

alexdunnjpl commented Nov 29, 2022

@tloubrieu-jpl I'll dig in to make sure I'm not misremembering something, but it sounds like the core issue is that there's a generic "properties" flow which processes fields with either collection-like or string-like values (depending on the field), but currently has no way of differentiating between the two cases.

My gut feeling is that we should know prima facie what properties are collection-like, and should therefore be able to detect "collection-like strings" and "everything else" and process them accordingly? This could/should probably be done during xml parsing, I agree.

I'm not certain that's a valid assumption though, and I'd probably want help enumerating the collection-likes unless they're trivially-obvious to track down.

@alexdunnjpl
Copy link
Contributor Author

alexdunnjpl commented Nov 29, 2022

Taking a closer look, NJsonDocUtils.writeField() is signature-polymorphic, so the behaviour is due to the fact that writeField() is passed a Collection<String> in the first place. This supports the idea that the root/fix is in the xml parsing.

RegistryDocBuilder.createDataJson() already recognises a distinction between "references" (meta.intRefs) and "other fields" (meta.fields).

Unfortunately, all values for meta.fields are ArrayList and there are plenty that have multiple elements, so we can't lazily split it there.

@alexdunnjpl
Copy link
Contributor Author

The imprecisely-typed metadata construction occurs in harvest, here.

At a glance, it looks pretty fraught to retroactively introduce any concept of typing here. It seems like the idea that metadata field values will be extracted into collections is pretty baked in. If that's the case, it may be least-worst to fall back to Eugene's original approach (i.e. the JSON writer is responsible for "typing" the fields) and extract the "convert to a single value" conditional to a function that encapsulates the business logic for making the differentiation - instead of if size == 1, something like if size == 1 and fieldName not in collectionTypedFieldNames

@tloubrieu-jpl @jordanpadams thoughts?

@tloubrieu-jpl
Copy link
Member

@jordanpadams @jshughes could we use the PDS information model (schema, schematron) to tell which element should be treated as an array, which one should not ?

@alexdunnjpl
Copy link
Contributor Author

Per stand-up meeting, solution is to remove scalar conversion - all fields will be written as arrays

@tloubrieu-jpl
Copy link
Member

Since the API is showing these values as arrays anyway (see https://pds.nasa.gov/api/search/1.0/products) we will remove the condition in registry-common to always save collections as arrays in OpenSearch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
@jordanpadams @tloubrieu-jpl @alexdunnjpl and others