
Add or enhance a function to extract JSON records from a JSON API #382

Closed · TobiasNx opened this issue Aug 24, 2021 · 17 comments · Fixed by #384 or #387
TobiasNx (Contributor) commented Aug 24, 2021

While we are able to extract JSON records that are arrayed at the top level of a JSON file, we are not able to extract JSON records from a JSON API that nests the records in an array inside a (sub-)field. At the moment we can't extract or split the records; the JSON file received via the JSON API is decoded as one record:

"https://imoox.at/mooc/local/moochubs/classes/webservice.php"
| open-http(accept="application/json")
| as-lines //as-records has the same result
| decode-json
...

Example for file:
https://imoox.at/mooc/local/moochubs/classes/webservice.php

The field "data" contains the JSON records as objects. Each of these objects should be retrieved as a single record.
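To make the desired behaviour concrete, here is a minimal sketch in Python, using a hypothetical excerpt of the response shape described above (field names other than "data" are made up):

```python
import json

# Hypothetical excerpt of the API response: the actual records sit
# in an array under the top-level "data" field.
response = """
{
  "links": {"self": "https://imoox.at/mooc/local/moochubs/classes/webservice.php"},
  "data": [
    {"id": "course-1", "type": "courses"},
    {"id": "course-2", "type": "courses"}
  ]
}
"""

doc = json.loads(response)

# Splitting on the "data" field yields one object per record; each of
# these should be passed downstream as a single record.
records = doc["data"]
for record in records:
    print(record["id"])
```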

Functional Review: @TobiasNx
Code Review: @dr0i

blackwinter (Member) commented:
I guess something like a RecordPathFilter would actually be nice to have; more or less a generic version of org.metafacture.xml.XmlElementSplitter. Maybe based on XPath or JSONPath?

...
| decode-json
| filter-records-by-path("$.data")
...

fsteeg added a commit that referenced this issue Aug 25, 2021
Instead of using the entire JSON as a single record, provide a
JSON path to query the JSON for the records to process, e.g.
`$.data` to process every entry in a `data` array as a record.
fsteeg (Member) commented Aug 25, 2021

Implemented with a JSON path as suggested by @blackwinter, but as an option in JsonDecoder, because we have the full JSON to query the JSON path against at that point: db8f5a1
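The approach can be sketched roughly as follows (a simplified Python stand-in for illustration only; the actual JsonDecoder uses the Jayway JsonPath library, and `split_records` is a hypothetical helper name, not Metafacture API):

```python
import json

def split_records(raw, record_path):
    """Simplified stand-in for JsonDecoder's recordPath option: parse the
    whole document, select the array at a dotted path (e.g. "$.data"),
    and re-serialize each element so it can later be decoded as a record
    of its own."""
    doc = json.loads(raw)  # first parse: the whole document
    node = doc
    for key in record_path.lstrip("$.").split("."):
        node = node[key]
    # Each element is serialized again and later parsed as a single
    # record, which is why every sub-record is effectively parsed twice.
    return [json.dumps(element) for element in node]

raw = '{"data": [{"id": 1}, {"id": 2}]}'
records = split_records(raw, "$.data")
```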

@fsteeg fsteeg assigned TobiasNx and unassigned fsteeg Aug 25, 2021
TobiasNx (Contributor, Author) commented:
Hi, it works fine for that 👍. Cool that this worked out so fast.

In my opinion the function code needs some documentation about the options/attributes that can be selected.
For a new user this is not very instructive.

@TobiasNx TobiasNx assigned fsteeg and unassigned TobiasNx Aug 26, 2021
fsteeg (Member) commented Aug 26, 2021

Assigned and requested review from @dr0i in #384, unassigning myself here.

@fsteeg fsteeg removed their assignment Aug 26, 2021
dr0i added a commit that referenced this issue Aug 26, 2021
dr0i added a commit to metafacture/metafacture-documentation that referenced this issue Aug 26, 2021
dr0i (Member) commented Aug 26, 2021

> In my opinion the function code needs some documentation about the options/attributes that can be selected.
> For a new user this is not very instructive.

Updated the flux-commands.md (pending PR https://github.com/metafacture/metafacture-documentation/pull/14/files). @TobiasNx: it may also be a good idea to document that option, especially the "splitting" capability, in other places of the documentation, maybe in the cookbook.

blackwinter (Member) commented:
Aside from the unfortunate fact that the proposed implementation leads to parsing the JSON document twice, it also means loading all extracted records into memory simultaneously. This is a potentially serious limitation. I assume we'd have to implement the filtering mechanism ourselves (in terms of our incremental parsing) if we wanted to avoid those downsides. In which case JSON Pointer might be the simpler specification to implement while still satisfying the current use case.

Finally, I'd still prefer a generic stream filter rather than extending each individual format decoder ad hoc whenever the need arises...
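For comparison, JSON Pointer (RFC 6901) resolution really is just token-by-token traversal with two escape sequences, which is why it would be the simpler specification to implement incrementally. A minimal stdlib-only sketch (`resolve_pointer` is a hypothetical name, not Metafacture API):

```python
import json

def resolve_pointer(doc, pointer):
    """Minimal JSON Pointer (RFC 6901) lookup: walk the document one
    reference token at a time. Tokens are unescaped in RFC order:
    "~1" -> "/" first, then "~0" -> "~"."""
    if pointer == "":
        return doc
    node = doc
    for token in pointer.lstrip("/").split("/"):
        token = token.replace("~1", "/").replace("~0", "~")
        # Array indices are numeric tokens; object members are plain keys.
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node

doc = json.loads('{"data": [{"id": "a"}, {"id": "b"}]}')
records = resolve_pointer(doc, "/data")
```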

fsteeg (Member) commented Aug 27, 2021

> the proposed implementation leads to parsing the JSON document twice

Hm, does it? If the recordPath is set (if it isn't set, the old behavior is retained), the full record is parsed once to apply the JSON path, and then each field value treated as a record is parsed once. Oh, is that what you mean, that technically each subrecord is parsed twice (once as part of the full document, once as a record on its own)?

> it also means loading all extracted records into memory simultaneously

Right, but if I'm not mistaken, all that content has already been loaded into memory as a string when passed to JsonDecoder. Not as a JSON object, though, so it will consume more memory, but not on the order of all records in memory vs. only one record in memory.

I think there is great benefit in providing (optional) full JSON path support in our JSON decoder. It provides a very flexible mechanism to query any JSON API for records. And the performance cost seems reasonable to me.

However, I also like the idea of a generic stream filter, since it would unify different current approaches (the mentioned XML splitting, maybe also extract-element from the metafacture-html module, and this use case here). What I don't understand, though, is how we would use a JSON path (or pointer, which would work for our use case here as well) at that point, where we have events, not JSON? Am I missing something here? Are you thinking of basically implementing a JSON pointer syntax for our event stream, without actual JSON involved in the process? But wouldn't it make more sense to use our own flattened event name syntax (like data[].*) then?

Since we need this functionality in OERSI, I'll merge the approved PR #384 for now. We should reconsider if we want to stick with this for our next actual release or if we have a better solution for this use case by then. So feel free to reopen this or open a new issue at any time.

blackwinter (Member) commented:
> that technically each subrecord is parsed twice (once as part of the full document, once as a record on its own)?

Yes, that's what I meant.

> all that content has already been loaded into memory as a string when passed to JsonDecoder.

Right, I didn't consider this. There are additional data structures/objects with your approach so both memory consumption and GC pressure increase, but not in the way I initially assumed.

> Are you thinking of basically implementing a JSON pointer syntax for our event stream, without actual JSON involved in the process?

Exactly: implementing some path/filter syntax in terms of our stream events (similar to, though more involved than, what XmlElementSplitter does).

> But wouldn't it make more sense to use our own flattened event name syntax (like data[].*) then?

Indeed, I thought of that after posting my comment. We're already using (something like) this with idKey in JsonToElasticsearchBulk. It might make sense to reuse existing, even if less powerful, syntax instead of implementing a new one. (Would also be compatible with XmlElementSplitter, if I'm not mistaken.)
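A rough sketch of what matching such a flattened path syntax could look like (hypothetical semantics: `[]` matches any array index and `*` matches any single segment; the actual Metafacture syntax, e.g. as used by idKey or EntityPathTracker, may differ):

```python
def matches(pattern, path):
    """Check a flattened event path like "data[0].id" against a pattern
    like "data[].*". Purely illustrative, not Metafacture code."""
    pattern_parts = pattern.split(".")
    path_parts = path.split(".")
    if len(pattern_parts) != len(path_parts):
        return False
    for want, got in zip(pattern_parts, path_parts):
        if want == "*":
            continue  # wildcard: any single segment
        if want.endswith("[]"):
            # Strip the concrete index, e.g. "data[2]" -> "data[]".
            got = got.split("[")[0] + "[]"
        if want != got:
            return False
    return True
```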

blackwinter (Member) commented:
> But wouldn't it make more sense to use our own flattened event name syntax (like data[].*) then?

FTR, that would be EntityPathTracker.getCurrentPath()/.getCurrentPathWith(), right?

fsteeg (Member) commented Aug 27, 2021

> FTR, that would be EntityPathTracker.getCurrentPath()/.getCurrentPathWith(), right?

Yes, that's what I was thinking of.

fsteeg (Member) commented Aug 27, 2021

Created a new issue to follow up on the generic approach: #385.

blackwinter added a commit that referenced this issue Aug 31, 2021
Split event stream into records based on entity path.

Related to #382 and `org.metafacture.xml.XmlElementSplitter`.

Resolves #385.
blackwinter (Member) commented:
> We should reconsider if we want to stick with this for our next actual release or if we have a better solution for this use case by then.

Should we revert this now that #385 is resolved? (Assuming it actually satisfies the use case.)

JSONPath is more powerful, though, so it might still be preferable when decoding JSON.

If we decide to keep it, I'd like to get rid of the List<String> in JsonDecoder.process().

fsteeg (Member) commented Sep 3, 2021

> Should we revert this now that #385 is resolved? (Assuming it actually satisfies the use case.)
> JSONPath is more powerful, though, so it might still be preferable when decoding JSON.

So @TobiasNx tried it for our OERSI use case and it seems like #385 works here as well. At the same time, I'm using the JsonPath support for an experimental workflow to process data coming from an API returning a JSON array (which plain JsonDecoder currently does not support). I added a test case for that in 7b47a1c. While it might make sense to add array support to JsonDecoder, I think this shows how versatile JsonPath support is here. So I vote for keeping this.

> If we decide to keep it, I'd like to get rid of the List<String> in JsonDecoder.process().

I pushed b4e056b to avoid wrapping the JSON in a list when not using JsonPath support. Is that what you meant? I'll open a PR for both these changes, so we can discuss any details there.

@fsteeg fsteeg linked a pull request Sep 3, 2021 that will close this issue
@fsteeg fsteeg reopened this Sep 3, 2021
dr0i (Member) commented Sep 3, 2021

Another aspect is, as you pointed out in a discussion today, that json-path introduces another dependency into core. We have the plugin concept in Metafacture to avoid such bloating dependencies, but that concept seems to be rarely used. I also don't know what that would mean here: duplicating the JsonDecoder to https://github.com/metafacture/metafacture-json-plugin? Or moving it there, introducing an API break? I don't know.

blackwinter (Member) commented:
> At the same time, I'm using the JsonPath support for an experimental workflow to process data coming from an API returning a JSON array (which plain JsonDecoder currently does not support).

StringMatcher could potentially be used for preprocessing. But I see your point.

> I added a test case for that in 7b47a1c.

It might be worthwhile to include the same test without the recordPath to illustrate the default behaviour.

> So I vote for keeping this.

OK.

> I pushed b4e056b to avoid wrapping the JSON in a list when not using JsonPath support. Is that what you meant?

Almost ;) Why does matches() have to return a list? The stream would suffice, wouldn't it?

fsteeg (Member) commented Sep 3, 2021

> Another aspect is, as you pointed out in a discussion today, that the json-path introduces another dependency to core. [...]

I don't think adding a dependency to metafacture-core is a problem per se, in particular since it has been modularized, and we only add the dependency to metafacture-json. What I meant was that if we had no use case at all, adding a feature that also introduces a dependency would be no good. But with the two different use cases we saw for using a JsonPath here, I think it's a useful addition, and worth adding a dependency.

fsteeg (Member) commented Sep 3, 2021

Addressed comments by @blackwinter in 88d941f and 7b978b6.

fsteeg added a commit that referenced this issue Sep 3, 2021
@dr0i dr0i mentioned this issue Nov 2, 2021
blackwinter added a commit that referenced this issue Dec 13, 2024
Gradle would produce the following error on Windows (while Linux is not affected):

"Cannot access input property 'classpath' of task ':metafix-runner:startScripts'. Accessing unreadable inputs or outputs is not supported. Declare the task as untracked by using Task.doNotTrackState(). For more information, please refer to https://docs.gradle.org/8.10.2/userguide/incremental_build.html#sec:disable-state-tracking in the Gradle documentation."
blackwinter pushed a commit that referenced this issue Dec 13, 2024