
"shell out" or filter though bash command? #147

Open
ghost opened this issue Jun 8, 2013 · 41 comments

Comments

@ghost

ghost commented Jun 8, 2013

Given this file

{
  "releases": [
    {
      "date": "1998-05-12"
    },
    {
      "date": "1997-05-12"
    },
    {
      "date": "1999-05-12"
    }
  ]
}

I would like to run the date values through the date command, example

$ date +%s -d 1998-05-12
894949200

so that the final result is

{
  "releases": [
    {
      "date": 894949200
    },
    {
      "date": 863413200
    },
    {
      "date": 926485200
    }
  ]
}

Is something like this possible?

@stedolan
Contributor

stedolan commented Jun 8, 2013

Not currently possible, was definitely thinking of adding something along these lines.

@nicowilliams
Contributor

It'd be cool to be able to add functions written in C. Since builtins written in C have trivial function prototypes (they return a jv and take a fixed number [between 1 and 5] of arguments), it should be quite simple to use dlopen()/dlsym() (or the win32 LoadLibrary() equivalents). The hard part for object-code plugins is that they need to use the jv_* functions from libjq, which would require passing plugin functions a pointer to a table of jv_* functions.

Ideologically, is this ok for the jq language? Few builtins in jq have side-effects. Are side-effects required to be backtrackable? No, I think they're not (earlier I thought they were).

@stedolan
Contributor

I definitely want this to happen.

I think it might be possible to use gcc -Wl,--export-dynamic or similar to allow dlopened libraries to load symbols from the main jq executable - that way, plugins that used jv_foo wouldn't have to have a definition handy and wouldn't need a table of function pointers.

I envisage having an import foo statement in jq code which would search the user's jq path for either libjq-foo.so or foo.jq, and use either dlopen or jq_parse_library to get at the contents. Does that sound sane?

@nicowilliams
Contributor

Well... it's complicated. If there were two versions of libjq in the process then the plugins for one would get the jv_* from the wrong version of libjq and all hell breaks loose quickly. If there were a portable way to get a dl handle for the calling libjq then it could pass that to the plugins, but alas, that's not portable. SQLite3 handles this about as portably as can be done, roughly like this:

  • it generates a header file that defines a struct table of pointers to the library's exported objects
  • it also generates macros by the same names as the exported symbols which then expect to be invoked in a lexical context where there's a variable of a given name that points to the library's exported object pointers struct
  • it also generates code to setup that struct
  • plugins include that header, and their entry points take a pointer to the calling library's exported-object-pointers struct

I highly recommend this approach.

@nicowilliams
Contributor

But yes, modulo DLL hell prevention measures it's sane.

@stedolan
Contributor

Urrrrrrrgh. That really seems like exactly the thing the dynamic linker is supposed to do.

@stedolan
Contributor

I accept that dynamic linkers are generally broken and that we may have to do as you say to avoid DLL hell. I'm just pining for a sane linker.

@stedolan
Contributor

I may be missing something fundamental here. In what situation are there two incompatible versions of libjq in the same process where we're not already screwed?

@nicowilliams
Contributor

ikr.

But the Unix RTLDs weren't that smart initially. Some of them are quite good now (Solaris' in particular), but the improvements haven't spread universally. Options like RTLD_GROUP and RTLD_FIRST and so on, really need to become universal. Also, I wish the GNU linker crowd would adopt Solaris' direct binding (-B direct)...

@nicowilliams
Contributor

I may be missing something fundamental here. In what situation are there two incompatible versions of libjq in the same process where we're not already screwed?

Hopefully never. But in real life this happens all the time in apps
that use, e.g., OpenSSL and which also use, e.g., libpam. Or if you
have multiple nss (name service switch) modules that use different
versions of OpenSSL, libldap, libsasl2, libgss, ..., when nscd is not
running.

I can definitely see libjq being used in all sorts of networking libraries.

@nicowilliams
Contributor

I've wanted to write an open source generic plugin system, one that could use advanced RTLDs or fallback on the SQLite3 scheme without the developer of the plugin interface (or plugin) having to know about the details.

@stedolan
Contributor

I am also envious of Solaris' seemingly working dynamic linker. I comfort myself in the secure knowledge that no dynamic linker actually works properly and that I'm just not familiar enough with Solaris' to know the manner in which it breaks :)

I am scared by the thought of someone using libjq in a low-level network library. I'm reasonably happy with the jq language, but less so with the API. API/ABI breaks will likely be frequent over the next while.

The generic plugin system would be nice. It makes me sad that it would involve so much work.

@nicowilliams
Contributor

Oh, I forgot to mention that the SQLite3 struct thing includes an ABI version number first, so it's easy to fail safe.

Regarding ABI breaks and apps that use libjq: that's what the shared object versioning is for.

Re: source backwards-incompatible changes: those are easy to discover (the compiler errors out).

It'll all work out.

(I've used the Solaris RTLD extensively. It's really quite good. There's some really good docs on it and the link-editor, and then there's some great blog entries by the Solaris engineering linker aliens, as we call them.)

@teknomath

I'm looking for the equivalent of an old-style ETL tool in the JSON world. All the regular ETL tools (Pentaho, Talend, Orange, Knime, etc.) are painful to use with JSON, and overkill for what I need. I just need to do some format translations on JSON values, like constructing date strings from separate fields, or converting "monetary shorthand" into numbers, or breaking up fully-qualified stock tickers into exchange and ticker. Simple stuff. Except that I don't want to extend jq or write a C library or do much heavy coding to accomplish it -- I don't consider myself a developer but I do use scripting languages to prepare and analyze data.

Jq has the potential to become the ETL tool for JSON, if it can get this feature right. In the best of all possible worlds, I would be able to write transformations in my favorite scripting language, point jq to my "library" of transformations, and then just cobble together jq command lines like this:

jq --transforms myscript.js 'def chgvalue(f): _change_it(f); map(chgvalue(.[0].field_that_must_be_changed))'

where the _change_it() function was found in the myscript.js file. Transformation functions should be able to take in a keypair list and return a keypair list that may have additional keypairs, in whatever data-container paradigm the scripting language supports. Transforms written in C/C++ are faster, of course, and yes, you'd have to include a script-running engine (but since it's JSON I figure you can probably do JavaScript already?), but I really think this elevates jq immensely. As JSON becomes more and more ubiquitous, transformation tools for non-developers are going to become important.

@nicowilliams
Contributor

What's "ETL"?

@teknomath

Extract-Transform-Load -- it's a class of software found commonly in enterprises. The best example is probably Informatica's PowerCenter platform, but there are open source alternatives (Knime, Orange) and freemium alternatives (Pentaho, Talend, Rapid-Miner). Also, scripting languages like R and Python get used heavily for ETL, but the value of these bigger platforms is that they provide a lot of enterprise-specific features that a simple script approach lacks, like high availability, failover, auditing, compliance verification, data provenance, and managed workflow.

The basic gist behind ETL is that you have data in source A and you want to get it into sink B, but A and B have different formats and/or different expectations of what shape the data should be, so you need to extract it from A, transform the data, and load it into B. ETL, as a method, is required when the systems that produce A and consume B cannot be changed, for whatever reason -- it's for when you have the "square peg and round hole" problem and you need to solve it by changing the peg, not the hole. Simple data reformatting is the low, low end of the ETL spectrum of features...the real serious stuff addresses problems when you have 1:n or n:m data reshaping issues, or pivoting or classification. Yeah, you can do all this in R or Python or C++...but an ETL platform is going to make your life a lot easier.

ETL becomes important for the JSON universe as soon as you want to send data from a modern web-based data source (that produces JSON) into a legacy application that knows nothing about JSON. Yes, you could update the legacy app to read JSON...but often that is a Hard Problem. It's easier to just transform the JSON data into whatever form the legacy system expects.

Another thing that comes up is that a lot of the more sophisticated transformations can actually be done better on the JSON side of the story, rather than inside the ETL tool -- so I might really want a "TEL" or "ELT" process. Moving the "T" part of the story outside of the data movement and trivial reformatting and reshaping tasks is an ongoing debate in the ETL world. [...actually, that's exactly what I'm up to: I'm using Elasticsearch to do some categorization and similarity testing and I need to get my data back into my legacy system -- it comes out of Elasticsearch as JSON and it needs to go back into my system as a CSV file].

@nicowilliams
Contributor

See the handles branch of my github clone of jq. This is coming.

@nicowilliams
Contributor

@svnpenn Thanks :)

@nicowilliams
Contributor

@teknomath I'm already using jq as an ETL, much as I've used XSLT in that fashion before (only jq makes me much happier than XSLT). You might want to try out the features in https://github.com/nicowilliams/jq/tree/handles . I'm working towards adding a proper library system, including dlopen()ing C extensions -- I think that will help make jq incredibly powerful.

@zmwangx
Contributor

zmwangx commented Jul 8, 2014

Hi, I just want to know the current status of this issue (enhancement)? I have little to say about C plugins, but since most standard Unix programs (sed, awk, etc.) are filters, and jq operates on filters, they should definitely coexist well.

Meanwhile, for tasks such as the one mentioned at the beginning of this issue, I'm using jq to extract relevant values, passing through relevant filters, and then using jq to assemble back, which is a huge pain:

jq -r '.releases[].date' | parallel -k date -d {} +%s | jq -R '{date: tonumber}' | jq -s '{releases: .}'

(And this gets even more complicated when other key/value pairs are present in the array and need to be preserved.)

@jrdriscoll

Off and on I end up doing work with big piles of unstructured data that I need to make sense of, and I have traditionally used giant shell pipelines to sort out the needles in the haystacks. Lately I have been doing more and more with jq, primarily because it is much less error-prone due to its clean (but occasionally surprising) semantics. This feature request is one of the most frequent reasons I have to "drop out" of jq to process some data.

In the traditional unix-pipeline-awk world, you would use awk's system() call with a string shell command constructed from the record you are processing (e.g. system("date -d @" $2)). Based on my (possibly idiosyncratic) usage of jq, I think this would fit reasonably well into the jq world:

jq ' .date = system("date -d @" + (.epoch | tostring)) '

Note that the json string argument will need to be converted to raw, and in the example we will probably want the output to be converted to a json string. But the result could have been json. I expect the user will want some control over how the input/output conversions are handled.

If date returned more than one result (although I don't think it will) this also seems to work fine.

So system() will work fine for date, which processes a single date at a time, but what if you want to NFKC-normalize some strings in a few million json records? In awk, you would solve this using a "coprocess" to which you can both write and read. The syntax is awkward, and derived from the Korn shell's |& operator.

awk '
  BEGIN {
    uconv = "uconv -b 1 -f utf-8 -t utf-8 -x \"::nfkc;\"";
    PROCINFO[uconv, "pty"] = 1;   # gawk-specific: run the coprocess on a pty
  }
  {
    print $2 |& uconv;
    uconv |& getline normalized;
    ...
  }
  END {
    close(uconv);
  }
'

In these situations, deadlock is a possibility and that is why the example asks awk to use a pty for the coprocess. I left out the usual stdbuf incantations that try to force the coprocess to work unbuffered.

The low-level reading and writing is quite flexible, but not so clean. If you constrain it to be a filter that takes one input and produces one output, it looks more like a jq filter and seems to fit in nicely, although as a filter it will not produce an interesting json output unless the coprocess does.

@jrdriscoll

I have problems like that of the OP all the time where I need to get data out of one system, transform it, and then somehow join it back in with the original data. With the newish input it is much easier to do this sort of thing with jq. Here is a reasonably clean way to do what the OP wants (admittedly, my tolerance for "reasonable" and "clean" in these matters may not be representative):

#!/bin/bash

# need objects on single line for subsequent paste to work
# paste is effectively joining on implicit key = line_number

cat releases.json \
 | jq -c '.' >r-c.json 

# convert contained dates to pipe separated string in jq
# convert to space separated epochs in awk
# paste json objects, corresponding epochs onto single line
# read the epochs after each object with input
# note that the test for type not string also removes nulls

cat r-c.json \
  | jq -r ' 
    if (.|type)=="object" and (.releases|type)=="array"
      then [ .releases[].date? ] 
           | map(if (.|type)!="string" then empty else . end) 
           | reduce .[] as $d (""; . + $d + "|") 
      else "" end
  ' \
  | gawk -F\| ' { 
    for (i=1; i<NF; i++) {
      "date +%s -d " $(i) | getline d;
      printf d " "; 
    } 
    printf "\n";
  } ' \
  | paste -d\  r-c.json - \
  | jq ' 
    if (.|type)=="object" and (.releases|type)=="array"
      then .releases = ( .releases 
        | map(if (.|type)=="object" and (.date|type)=="string" then .date=input else . end)
      ) else . end
  '

@nicowilliams
Contributor

Yes, I resort to these kinds of tricks too. A shell-out should probably be a high priority. I may even work on it this coming weekend, we'll see.

@jrdriscoll

Well, while I do think that a "shell-out" feature would help many users
of jq, my point was actually that "input" solved essentially ALL my
problems (although not in the most terse or elegant way). And it is
true you need to be a relatively sophisticated user to do so.


@nicowilliams
Contributor

Yes, input and inputs solved and/or helped work around a number of problems. I'm quite happy about how input and inputs turned out.

@ghost
Author

ghost commented Oct 26, 2015

@nicowilliams I should mention I do not think this feature is required anymore in this specific case. I have not had time to check, but I believe the new jq date builtins can be used to fix my issue here. However, others might still be interested in a generic jq system command.
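For the record, the date builtins that landed in jq 1.5 do cover the original example (a sketch assuming jq >= 1.5; note that mktime treats the broken-down time as UTC, so the numbers differ from the local-time date +%s values at the top of the thread):

```shell
# Convert each release date from "YYYY-MM-DD" to seconds since the epoch (UTC).
echo '{"releases":[{"date":"1998-05-12"},{"date":"1997-05-12"},{"date":"1999-05-12"}]}' \
  | jq '.releases[].date |= (strptime("%Y-%m-%d") | mktime)'
# "1998-05-12" becomes 894931200 (midnight UTC), and so on
```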

@nicowilliams
Contributor

@svnpenn Right, for datetime-related tasks a shell-out is not needed. You'll note that we added the sorts of things we needed that were relatively easy to add :)

A shell-out wouldn't be so hard to code, but first we needed to work out a privilege management model that would work for that and I/O in general.

I see two shell-out forms: CMD | popen and CMD | popen(inputs_for_cmd). The former would read from the command (it would map to popen() with "r"), and the latter would write to the command (it would map to popen() with "w"). EDIT: The former requires relatively little new infrastructure in jq (C-coded generators); the latter requires dealing with file handles as well.
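Neither form exists in jq yet; as a rough shell analogue of the two directions (assuming GNU date for -d), the difference is which side of the pipe jq sits on:

```shell
# "CMD | popen" direction: jq would read the command's output.
# Today that means generating the data outside jq and reading it back in:
date -u -d 1998-05-12 +%s | jq -R 'tonumber'

# "popen(inputs_for_cmd)" direction: jq's outputs would feed CMD's stdin:
echo '{"releases":[{"date":"1998-05-12"}]}' \
  | jq -r '.releases[].date' \
  | while read -r d; do date -u -d "$d" +%s; done
```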

EDIT: Fix typo.

@nicowilliams
Contributor

See #1005.

@mterron

mterron commented Jun 24, 2019

I'm converting base64-encoded binary SHA256 hashes back into their hexadecimal representation. @base64d breaks when the decoded value is not valid UTF-8.

I want to be able to "shell out" to execute base64 -d | xxd -p -c32 | tr -cd '[:alnum:]\n' on a field. The output of that pipeline is a hexadecimal-encoded hash that is a valid json value.

At the moment I have an awk hack to do it.

@ghost
Author

ghost commented Jun 24, 2019

@mterron have you considered using a proper programming language? I know it
might be daunting, but I might be able to help if you have some sample data.
here are some links:

Python

Ruby

@mterron

mterron commented Jun 25, 2019

I have, but jq does 99.9% of what I want, so why bother? I hacked that awk thing together in 10 minutes; it'd take me 10 times as long to do it in Python or Ruby, and it would also add a huge dependency framework to my pipeline that I'd rather not have.

Example json input (after a lot of manipulation with jq):

{
  "component": "1000hz-bootstrap-validator",
  "version": "0.10.2",
  "hashes": [
    {
      "file": "validator.js",
      "base64": "sha256-eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto="
    },
    {
      "file": "validator.min.js",
      "base64": "sha256-mbv8R/8RTicMdfYPxNwD4QVAvNPO6Ht+ZDW9EK0gNHM="
    }
  ]
}

and json output:

{
  "component": "1000hz-bootstrap-validator",
  "version": "0.10.2",
  "hashes": [
    {
      "file": "validator.js",
      "base64": "sha256-eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto=",
      "sha256": "7979b272bd4483fdaf6c5c55bcce9abc15a894197c07a5644d1579ffd07d92da"
    },
    {
      "file": "validator.min.js",
      "base64": "sha256-mbv8R/8RTicMdfYPxNwD4QVAvNPO6Ht+ZDW9EK0gNHM=",
      "sha256": "99bbfc47ff114e270c75f60fc4dc03e10540bcd3cee87b7e6435bd10ad203473"
    }
  ]
}
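Until a shell-out lands, the extract/transform/rejoin pattern shown earlier in the thread handles this case too. A sketch assuming GNU coreutils (od instead of xxd, for systems where xxd is absent) and jq >= 1.5 for sub():

```shell
# The example input, trimmed to the relevant fields.
cat > input.json <<'EOF'
{"hashes":[{"file":"validator.js","base64":"sha256-eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto="},{"file":"validator.min.js","base64":"sha256-mbv8R/8RTicMdfYPxNwD4QVAvNPO6Ht+ZDW9EK0gNHM="}]}
EOF

# 1. Extract the base64 fields and convert them to hex outside jq.
hexes=$(jq -r '.hashes[].base64 | sub("^sha256-"; "")' input.json \
  | while read -r b64; do
      printf '%s' "$b64" | base64 -d | od -An -tx1 | tr -d ' \n'; echo
    done)

# 2. Rejoin by array index: the Nth hex line goes onto the Nth hash object.
jq --arg hex "$hexes" '
  ($hex | split("\n")) as $h
  | .hashes |= [to_entries[] | .value + {sha256: $h[.key]}]
' input.json
```

The join key here is the implicit array index, the same trick the paste-based script above uses with line numbers.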

@ghost
Author

ghost commented Jun 25, 2019

$ echo sha256-eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto= | base64 -d
□□base64: invalid input

@mterron

mterron commented Jun 25, 2019

You need to process the string first; sub() is nice.

$ echo "eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto=" | base64 -d | xxd -p -c32
7979b272bd4483fdaf6c5c55bcce9abc15a894197c07a5644d1579ffd07d92da

@ghost
Author

ghost commented Jun 25, 2019

Sorry, I see you are moving goalposts. I have dealt with that before and I have a no-tolerance policy. Good luck.

@mterron

mterron commented Jun 25, 2019

Meaning? I just provided a real-life use case for the shell-out feature. What I want to do can't be done with jq without a shell-out feature.

It is an escape hatch.

@ghost
Author

ghost commented Jun 25, 2019

@mterron

mterron commented Jun 25, 2019

I'm not sure what you are trying to say tbh.

@wtlangford
Contributor

This conversation is getting a little unnecessarily heated, so let's please remain polite and civil, shall we?

On topic, though: @mterron The ability to shell out is something we're working on, and there are some branches floating around with the capability. They're a little buggy at the moment, and may not actually have direct support for shelling out yet (I'd have to check), but they contain the necessary groundwork for us to support it.

@ghost
Author

ghost commented Jun 25, 2019

if you want something constructive, look at #1005

although I have to admit I am frustrated, as that pull has already been linked repeatedly in this thread.

@nicowilliams
Contributor

@cup please stop.

@ghost
Author

ghost commented Jun 25, 2019

@nicowilliams what is your problem? I linked to a pull that actually accomplishes (from personal testing) what he is asking for.

@jqlang jqlang locked as too heated and limited conversation to collaborators Jun 25, 2019
@itchyny itchyny removed this from the 1.7 release milestone Jun 25, 2023