Add or enhance a function to extract JSON-Records from an JSON-API #382

TobiasNx · 2021-08-24T08:51:49Z

While we are able to extract JSON records that are arrayed at the top level of a JSON file, we are not able to extract JSON records from a JSON API that has the records in an array in a (sub-)field. At the moment we can't extract or split the records. The JSON file received via the JSON API is extracted as one record:

"https://imoox.at/mooc/local/moochubs/classes/webservice.php"
| open-http(accept="application/json")
| as-lines //as-records has the same result
| decode-json
...

Example for file:
https://imoox.at/mooc/local/moochubs/classes/webservice.php

In the field "data" there are the JSON records as objects. These objects should each be retrieved as single records.

Functional Review: @TobiasNx
Code Review: @dr0i

blackwinter · 2021-08-24T09:20:06Z

I guess something like a RecordPathFilter would actually be nice to have. More or less a generic version of org.metafacture.xml.XmlElementSplitter. Maybe based on XPath or JSONPath?

...
| decode-json
| filter-records-by-path("$.data")
...

Instead of using the entire JSON as a single record, provide a JSON path to query the JSON for the records to process, e.g. `$.data` to process every entry in a `data` array as a record.
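
A minimal sketch of that idea in Python, using only the standard library. The helper `extract_records` and the sample response are hypothetical, and only plain dotted paths such as `$.data` are handled here, not full JSONPath:

```python
import json

def extract_records(document: str, record_path: str):
    """Yield each value selected by a simplified path as its own record.

    Only `$`, `$.a` and `$.a.b` style paths are handled; filters,
    wildcards and slices of real JSONPath are out of scope.
    """
    node = json.loads(document)
    if not record_path.startswith("$"):
        raise ValueError("path must start with '$'")
    # Walk the dotted field names that follow the root marker.
    for key in filter(None, record_path[1:].split(".")):
        node = node[key]
    # An array yields one record per entry; anything else is one record.
    if isinstance(node, list):
        yield from node
    else:
        yield node

# Hypothetical response shaped like the one described in this issue.
response = '{"links": {"self": "x"}, "data": [{"id": "1"}, {"id": "2"}]}'

whole = list(extract_records(response, "$"))       # entire JSON as one record
split = list(extract_records(response, "$.data"))  # one record per entry
print(len(whole), len(split))  # 1 2
```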

fsteeg · 2021-08-25T13:37:15Z

Implemented with a JSON path as suggested by @blackwinter, but as an option in JsonDecoder, because we have the full JSON to query the JSON path against at that point: db8f5a1

TobiasNx · 2021-08-26T14:07:21Z

Hey, it works fine for that. +1 Cool that this worked out so fast.

In my opinion the function code needs some documentation about the options/attributes that can be selected.
For a new user this is not very instructive.

fsteeg · 2021-08-26T14:46:20Z

Assigned and requested review from @dr0i in #384, unassigning myself here.

dr0i · 2021-08-26T16:04:58Z

> In my opinion the function code needs some documentation about the options/attributes that can be selected.
> For a new user this is not very instructive.

Updated flux-commands.md (pending PR https://github.com/metafacture/metafacture-documentation/pull/14/files). @TobiasNx: it may also be a good idea to document that option, especially the "splitting" capability, elsewhere in the documentation, maybe in the cookbook.

blackwinter · 2021-08-26T18:58:12Z

Aside from the unfortunate fact that the proposed implementation leads to parsing the JSON document twice, it also means loading all extracted records into memory simultaneously. This is a potentially serious limitation. I assume we'd have to implement the filtering mechanism ourselves (in terms of our incremental parsing) if we wanted to avoid those downsides. In which case JSON Pointer might be the simpler specification to implement while still satisfying the current use case.
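
To illustrate why JSON Pointer (RFC 6901) could be the simpler specification, here is a rough Python sketch of a complete resolver. It works on an already parsed document, so it only shows how small the syntax is and does not address the incremental-parsing concern; `resolve_pointer` and the sample document are hypothetical:

```python
import json

def resolve_pointer(document, pointer: str):
    """Resolve an RFC 6901 JSON Pointer against a parsed document."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    node = document
    for token in pointer[1:].split("/"):
        # Unescape in this order: '~1' becomes '/', then '~0' becomes '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            if not token.isdigit():
                # The '-' token (past-the-end element) is not handled here.
                raise ValueError("array index expected: " + token)
            node = node[int(token)]
        else:
            node = node[token]
    return node

doc = json.loads('{"data": [{"id": "1"}, {"id": "2"}], "a/b": 3}')
records = resolve_pointer(doc, "/data")
print(len(records))                        # 2
print(resolve_pointer(doc, "/data/1/id"))  # 2
print(resolve_pointer(doc, "/a~1b"))       # 3
```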

Finally, I'd still prefer a generic stream filter rather than extending each individual format decoder ad hoc whenever the need arises...

fsteeg · 2021-08-27T12:33:44Z

> the proposed implementation leads to parsing the JSON document twice

Hm, does it? If the recordPath is set (if it isn't set, it retains the old behavior), the full record is parsed once, to apply the JSON path, and then each field value treated as a record is parsed once. Oh, is that what you mean, that technically each subrecord is parsed twice (once as part of the full document, once as a record on its own)?
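
The parse counts described here can be sketched in Python as follows. This is not the JsonDecoder code: `decode`, `parse` and the `record_field` argument (standing in for a record path) are hypothetical, and the step that turns each match back into a string is an assumption:

```python
import json

parse_calls = 0

def parse(text: str):
    """json.loads wrapper that counts how often parsing happens."""
    global parse_calls
    parse_calls += 1
    return json.loads(text)

def decode(document: str, record_field=None):
    """Return decoded records; record_field stands in for a record path."""
    if record_field is None:
        return [parse(document)]  # old behaviour: one parse, one record
    full = parse(document)        # first parse: the full document
    # Each match is serialised back to a string and then parsed as a
    # record of its own, so every sub-record is parsed a second time.
    # Note that all matches are held in this list at the same time.
    matches = [json.dumps(entry) for entry in full[record_field]]
    return [parse(match) for match in matches]

records = decode('{"data": [{"id": "1"}, {"id": "2"}, {"id": "3"}]}', "data")
print(len(records), parse_calls)  # 3 4
```

One parse for the full document plus one per extracted record gives four parse calls for three records.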

> it also means loading all extracted records into memory simultaneously

Right, but if I'm not mistaken, all that content has already been loaded into memory as a string when passed to JsonDecoder. Not as a JSON object though, so it will consume more memory. But the difference is not on the order of all records in memory vs. only one record in memory.

I think there is great benefit in providing (optional) full JSON path support in our JSON decoder. It provides a very flexible mechanism to query any JSON API for records. And the performance cost seems reasonable to me.

However, I also like the idea of a generic stream filter, since it would unify different current approaches (the mentioned XML splitting, maybe also extract-element from the metafacture-html module, and this use case here). What I don't understand though, is how we would use a JSON path (or pointer, which would work for our use case here as well) at that point, where we have events, not JSON? Am I missing something here? Are you thinking of basically implementing a JSON pointer syntax for our event stream, without actual JSON involved in the process? But wouldn't it make more sense to use our own flattened event name syntax (like data[].*) then?

Since we need this functionality in OERSI, I'll merge the approved PR #384 for now. We should reconsider if we want to stick with this for our next actual release or if we have a better solution for this use case by then. So feel free to reopen this or open a new issue at any time.

blackwinter · 2021-08-27T13:00:40Z

> that technically each subrecord is parsed twice (once as part of the full document, once as a record on its own)?

Yes, that's what I meant.

> all that content has already been loaded into memory as a string when passed to JsonDecoder.

Right, I didn't consider this. There are additional data structures/objects with your approach so both memory consumption and GC pressure increase, but not in the way I initially assumed.

> Are you thinking of basically implementing a JSON pointer syntax for our event stream, without actual JSON involved in the process?

Exactly, implementing some path/filter syntax in terms of our stream events (similar, though more involved, to what XmlElementSplitter does).

> But wouldn't it make more sense to use our own flattened event name syntax (like data[].*) then?

Indeed, I thought of that after posting my comment. We're already using (something like) this with idKey in JsonToElasticsearchBulk. It might make sense to reuse existing, even if less powerful, syntax instead of implementing a new one. (Would also be compatible with XmlElementSplitter, if I'm not mistaken.)
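
A rough Python sketch of what splitting in terms of stream events could look like. Everything here is hypothetical and not Metafacture's API: events are modelled as plain tuples, and the sketch assumes the `data` array arrives as an entity named `data[]` with one numbered child entity per element, so that `data[].*` selects each element:

```python
def path_matches(pattern: str, path: list) -> bool:
    """Compare dot-separated pattern segments; '*' matches any one name."""
    segments = pattern.split(".")
    return len(segments) == len(path) and all(
        p == "*" or p == s for p, s in zip(segments, path))

def split_records(events, pattern):
    """Yield one list of events per entity whose path matches `pattern`."""
    path = []        # names of the currently open entities
    current = None   # events of the record being collected, if any
    depth = 0        # nesting depth inside the matched entity
    for event in events:
        kind = event[0]
        if kind == "start_entity":
            path.append(event[1])
            if current is None and path_matches(pattern, path):
                current, depth = [], 0  # matched entity becomes the record
            elif current is not None:
                depth += 1
                current.append(event)
        elif kind == "end_entity":
            path.pop()
            if current is not None:
                if depth == 0:
                    yield current       # matched entity closed: emit record
                    current = None
                else:
                    depth -= 1
                    current.append(event)
        elif kind == "literal" and current is not None:
            current.append(event)

events = [
    ("start_record", "1"),
    ("literal", "total", "2"),
    ("start_entity", "data[]"),
    ("start_entity", "1"),
    ("literal", "id", "a"),
    ("end_entity",),
    ("start_entity", "2"),
    ("literal", "id", "b"),
    ("start_entity", "meta"),
    ("literal", "lang", "de"),
    ("end_entity",),
    ("end_entity",),
    ("end_entity",),
    ("end_record",),
]
records = list(split_records(events, "data[].*"))
print(len(records))  # 2
```

Because the splitter is a generator, only the events of the record currently being collected are held in memory.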

blackwinter · 2021-08-27T13:04:12Z

> But wouldn't it make more sense to use our own flattened event name syntax (like data[].*) then?

FTR, that would be EntityPathTracker.getCurrentPath()/.getCurrentPathWith(), right?

fsteeg · 2021-08-27T13:10:11Z

> FTR, that would be EntityPathTracker.getCurrentPath()/.getCurrentPathWith(), right?

Yes, that's what I was thinking of.

fsteeg · 2021-08-27T13:22:58Z

Created a new issue to follow up on the generic approach: #385.

Split event stream into records based on entity path. Related to #382 and `org.metafacture.xml.XmlElementSplitter`. Resolves #385.

blackwinter · 2021-09-01T16:02:32Z

> We should reconsider if we want to stick with this for our next actual release or if we have a better solution for this use case by then.

Should we revert this now that #385 is resolved? (Assuming it actually satisfies the use case.)

JSONPath is more powerful, though, so it might still be preferable when decoding JSON.

If we decide to keep, I'd like to get rid of the List<String> in JsonDecoder.process().

fsteeg · 2021-09-03T12:08:13Z

> Should we revert this now that #385 is resolved? (Assuming it actually satisfies the use case.)
> JSONPath is more powerful, though, so it might still be preferable when decoding JSON.

So @TobiasNx tried it for our OERSI use case and it seems like #385 works here as well. At the same time, I'm using the JsonPath support for an experimental workflow to process data coming from an API returning a JSON array (which plain JsonDecoder currently does not support). I added a test case for that in 7b47a1c. While it might make sense to add array support to JsonDecoder, I think this shows how versatile JsonPath support is here. So I vote for keeping this.

> If we decide to keep, I'd like to get rid of the List<String> in JsonDecoder.process().

I pushed b4e056b to avoid wrapping the JSON in a list when not using JsonPath support. Is that what you meant? I'll open a PR for both these changes, so we can discuss any details there.

dr0i · 2021-09-03T13:01:08Z

Another aspect is, as you pointed out in a discussion today, that the json-path introduces another dependency to core. We have the plugin concept in metafacture to avoid these bloating dependencies, but that concept seems to be rarely used. Also I don't know what that would mean here - duplicating the JsonDecoder to https://github.com/metafacture/metafacture-json-plugin? Or moving it there, introducing an API break? I don't know.

blackwinter · 2021-09-03T13:34:21Z

> At the same time, I'm using the JsonPath support for an experimental workflow to process data coming from an API returning a JSON array (which plain JsonDecoder currently does not support).

StringMatcher could potentially be used for preprocessing. But I see your point.

> I added a test case for that in 7b47a1c.

It might be worthwhile to include the same test without the recordPath to illustrate the default behaviour.

> So I vote for keeping this.

OK.

> I pushed b4e056b to avoid wrapping the JSON in a list when not using JsonPath support. Is that what you meant?

Almost ;) Why does matches() have to return a list? The stream would suffice, wouldn't it?
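
In Python terms, the difference between returning a list and returning a stream could be sketched with a generator. This only borrows the name `matches` from the discussion above; the signature and the `record_field` argument are hypothetical, and the full document is still parsed up front:

```python
import json
from typing import Iterator, Optional

def matches(document: str, record_field: Optional[str] = None) -> Iterator[str]:
    """Lazily yield the JSON text of each record (hypothetical helper)."""
    if record_field is None:
        yield document  # no path: pass the input through, no wrapper list
        return
    for entry in json.loads(document)[record_field]:
        yield json.dumps(entry)  # produced one at a time, on demand

stream = matches('{"data": [{"id": 1}, {"id": 2}]}', "data")
first = next(stream)  # only the first record has been serialised so far
print(first)          # {"id": 1}
```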

fsteeg · 2021-09-03T13:36:32Z

> Another aspect is, as you pointed out in a discussion today, that the json-path introduces another dependency to core. [...]

I don't think adding a dependency to metafacture-core is a problem per se, in particular since it has been modularized, and we only add the dependency to metafacture-json. What I meant was that if we had no use case at all, adding a feature that also introduces a dependency would be no good. But with the two different use cases we saw for using a JsonPath here, I think it's a useful addition, and worth adding a dependency.

fsteeg · 2021-09-03T13:49:06Z

Addressed comments by @blackwinter in 88d941f and 7b978b6.

TobiasNx added Enhancement Flux labels Aug 24, 2021

acka47 assigned fsteeg Aug 24, 2021

fsteeg mentioned this issue Aug 25, 2021

Add recordPath option to JsonDecoder #384

Merged

fsteeg assigned TobiasNx and unassigned fsteeg Aug 25, 2021

TobiasNx assigned fsteeg and unassigned TobiasNx Aug 26, 2021

fsteeg removed their assignment Aug 26, 2021

dr0i added a commit that referenced this issue Aug 26, 2021

Add flux annotations

8761d98

See #382.

dr0i added a commit to metafacture/metafacture-documentation that referenced this issue Aug 26, 2021

Update decode-json

35a8058

See metafacture/metafacture-core#382.

dr0i mentioned this issue Aug 26, 2021

Update decode-json metafacture/metafacture-documentation#14

Merged

dr0i assigned TobiasNx Aug 26, 2021

fsteeg added a commit that referenced this issue Aug 27, 2021

Tweak description annotation for Flux usage (#382)

8fe67c1

fsteeg mentioned this issue Aug 27, 2021

Split up event stream into records #385

Closed

fsteeg closed this as completed in #384 Aug 27, 2021

blackwinter mentioned this issue Aug 31, 2021

Implement record path filter. #386

Merged

blackwinter added a commit that referenced this issue Aug 31, 2021

Implement record path filter.

ca9a107

Split event stream into records based on entity path. Related to #382 and `org.metafacture.xml.XmlElementSplitter`. Resolves #385.

fsteeg added a commit that referenced this issue Sep 3, 2021

Add test case for recordPath with root JSON array (#382)

7b47a1c

fsteeg added a commit that referenced this issue Sep 3, 2021

Avoid wrapping single record in list (#382)

b4e056b

fsteeg mentioned this issue Sep 3, 2021

Tweak recordPath code, add test case #387

Merged

fsteeg linked a pull request Sep 3, 2021 that will close this issue

Tweak recordPath code, add test case #387

Merged

fsteeg reopened this Sep 3, 2021

fsteeg added a commit that referenced this issue Sep 3, 2021

Avoid unnecessary List creation (#382)

7b978b6

fsteeg added a commit that referenced this issue Sep 3, 2021

Add test case for unsupported root array without recordPath (#382)

88d941f

fsteeg added a commit that referenced this issue Sep 3, 2021

Remove unused import (#382)

5ffe4b5

fsteeg closed this as completed in #387 Sep 3, 2021

dr0i mentioned this issue Nov 2, 2021

Next release 5.3.0 #397

Closed

TobiasNx mentioned this issue Nov 22, 2021

OAI-PMH Opener plain metadata? #424

Closed

blackwinter pushed a commit that referenced this issue Dec 13, 2024

Merge #382 from branch '371-classpathOfRunnerTooLongOnWindows' of git…

6284eab

…hub.com:metafacture/metafacture-fix
