"shell out" or filter though bash command? #147
Not currently possible; I was definitely thinking of adding something along these lines.
It'd be cool to be able to add functions written in C. Since builtins written in C have trivial function prototypes (they return a `jv` and take a fixed number, between 1 and 5, of arguments), it should be quite simple to use `dlopen()`/`dlsym()` (or Win32's `LoadLibrary()` equivalents). The hard part for object-code plugins is their need to use the `jv_*` functions from libjq, which would require passing plugin functions a pointer to a table of `jv_*` functions. Ideologically, is this OK for the jq language? Few builtins in jq have side-effects. Are side-effects required to be backtrackable? No, I think they're not (earlier I thought they were).
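A minimal host-side sketch of the `dlopen()`/`dlsym()` idea, assuming a hypothetical plugin that exports one function with the trivial one-`jv`-argument prototype described above (all names and error handling here are illustrative only):

```c
/* Host-side sketch (illustrative names): load a plugin shared object and
 * call one exported function with the trivial one-jv-argument prototype. */
#include <dlfcn.h>
#include "jv.h"

typedef jv (*plugin_fn)(jv);         /* assumed plugin prototype */

static jv call_plugin(const char *sofile, const char *funcname, jv input) {
  void *h = dlopen(sofile, RTLD_NOW | RTLD_LOCAL);
  if (h == NULL) {
    jv_free(input);
    return jv_invalid_with_msg(jv_string(dlerror()));
  }
  plugin_fn f = (plugin_fn)dlsym(h, funcname);
  if (f == NULL) {
    dlclose(h);
    jv_free(input);
    return jv_invalid_with_msg(jv_string("plugin function not found"));
  }
  return f(input);                   /* plugin consumes input, returns a jv */
}
```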
I definitely want this to happen. I think it might be possible to use […]; I envisage having an […].
Well... it's complicated. If there were two versions of libjq in the process, then the plugins for one would get the `jv_*` functions from the wrong version of libjq, and all hell would break loose quickly. If there were a portable way to get a dl handle for the calling libjq then it could pass that to the plugins, but alas, that's not portable. SQLite3 handles this about as portably as can be done, roughly like this:
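Roughly, in hypothetical jq-flavored terms (SQLite3's real analogue is its `sqlite3_api_routines` struct): the host hands each plugin a versioned table of `jv_*` function pointers instead of relying on the dynamic linker:

```c
/* Hypothetical API table, modeled on SQLite3's sqlite3_api_routines. */
#include "jv.h"

struct jq_plugin_api {
  int abi_version;                /* first member, so plugins can fail safe */
  jv   (*string)(const char *);   /* -> jv_string                           */
  jv   (*number)(double);         /* -> jv_number                           */
  void (*free)(jv);               /* -> jv_free                             */
  /* ... one pointer per jv_* function plugins may call ...                 */
};

/* Plugin side: every jv_* call goes through the table the host passed in,
 * so the plugin always talks to the libjq instance that loaded it. */
static const struct jq_plugin_api *api;

int jq_plugin_init(const struct jq_plugin_api *host_api) {
  if (host_api->abi_version != 1)
    return -1;                    /* ABI mismatch: refuse to load */
  api = host_api;
  return 0;
}
```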
I highly recommend this approach.

But yes, modulo DLL-hell prevention measures it's sane.

Urrrrrrrgh. That really seems like exactly the thing the dynamic linker is supposed to do.

I accept that dynamic linkers are generally broken and that we may have to do as you say to avoid DLL hell. I'm just pining for a sane linker.

I may be missing something fundamental here. In what situation are there two incompatible versions of libjq in the same process where we're not already screwed?

ikr. But the Unix RTLDs weren't that smart initially. Some of them are quite good now (Solaris' in particular), but the improvements haven't spread universally. Options like RTLD_GROUP, RTLD_FIRST, and so on really need to become universal. Also, I wish the GNU linker crowd would adopt Solaris' direct binding (-B direct)...
Hopefully never. But in real life this happens all the time in apps. I can definitely see libjq being used in all sorts of networking libraries.
I've wanted to write an open source generic plugin system, one that could use advanced RTLDs or fall back on the SQLite3 scheme without the developer of the plugin interface (or plugin) having to know about the details.

I am also envious of Solaris' seemingly working dynamic linker. I comfort myself in the secure knowledge that no dynamic linker actually works properly and I'm just not familiar enough with Solaris' to know the manner in which it breaks :) I am scared by the thought of someone using libjq in a low-level network library. I'm reasonably happy with the jq language, but less so with the API; API/ABI breaks will likely be frequent over the next while. The generic plugin system would be nice. It makes me sad that it would involve so much work.

Oh, I forgot to mention that the SQLite3 struct thing includes an ABI version number first, so it's easy to fail safe. Regarding ABI breaks and apps that use libjq: that's what shared object versioning is for. Re: source backwards-incompatible changes: those are easy to discover (the compiler errors out). It'll all work out. (I've used the Solaris RTLD extensively. It's really quite good. There are some really good docs on it and the link-editor, and then there are some great blog entries by the Solaris engineering linker aliens, as we call them.)
I'm looking for the equivalent of an old-style ETL tool in the JSON world. All the regular ETL tools (Pentaho, Talend, Orange, Knime, etc.) are painful to use with JSON, and overkill for what I need. I just need to do some format translations on JSON values, like constructing date strings from separate fields, or converting "monetary shorthand" into numbers, or breaking up fully-qualified stock tickers into exchange and ticker. Simple stuff. Except that I don't want to extend jq or write a C library or do much heavy coding to accomplish it -- I don't consider myself a developer, but I do use scripting languages to prepare and analyze data.

jq has the potential to become the ETL tool for JSON, if it can get this feature right. In the best of all possible worlds, I would be able to write transformations in my favorite scripting language, point jq to my "library" of transformations, and then just cobble together jq command lines like this:
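Perhaps something along these lines, where the `--script` option, the field name, and the file name are purely illustrative, not real jq syntax:

```sh
# Hypothetical invocation: "--script" (imagined, not a real jq option)
# points jq at an external library of transformations.
jq --script myscript.js '.price = _change_it(.price)' data.json
```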
where the _change_it() function was found in the myscript.js file. Transformation functions should be able to take in a keypair list and return a keypair list that may have additional keypairs, in whatever data-container paradigm the scripting language supports. Transforms written in C/C++ are faster, of course, and yes, you'd have to include a script-running engine (but since it's JSON I figure you can probably do JavaScript already?), but I really think this elevates jq immensely. As JSON becomes more and more ubiquitous, transformation tools for non-developers are going to become important.

What's "ETL"?

Extract-Transform-Load -- it's a class of software found commonly in enterprises. The best example is probably Informatica's PowerCenter platform, but there are open source alternatives (Knime, Orange) and freemium alternatives (Pentaho, Talend, RapidMiner). Also, scripting languages like R and Python get used heavily for ETL, but the value of these bigger platforms is that they provide a lot of enterprise-specific features that a simple script approach lacks, like high availability, failover, auditing, compliance verification, data provenance, and managed workflow.

The basic gist behind ETL is that you have data in source A and you want to get it into sink B, but A and B have different formats and/or different expectations of what shape the data should be, so you need to extract it from A, transform the data, and load it into B. ETL, as a method, is required when the systems that produce A and consume B cannot be changed, for whatever reason -- it's for when you have the "square peg and round hole" problem and you need to solve it by changing the peg, not the hole. Simple data reformatting is the low, low end of the ETL spectrum of features... the real serious stuff addresses problems where you have 1:n or n:m data-reshaping issues, or pivoting or classification. Yeah, you can do all this in R or Python or C++... but an ETL platform is going to make your life a lot easier.

ETL becomes important for the JSON universe as soon as you want to send data from a modern web-based data source (that produces JSON) into a legacy application that knows nothing about JSON. Yes, you could update the legacy app to read JSON... but often that is a Hard Problem. It's easier to just transform the JSON data into whatever form the legacy system expects. Another thing that comes up is that a lot of the more sophisticated transformations can actually be done better on the JSON side of the story, rather than inside the ETL tool -- so I might really want a "TEL" or "ELT" process. Moving the "T" part of the story outside of the data movement and trivial reformatting and reshaping tasks is an ongoing debate in the ETL world. [...actually, that's exactly what I'm up to: I'm using Elasticsearch to do some categorization and similarity testing and I need to get my data back into my legacy system -- it comes out of Elasticsearch as JSON and it needs to go back into my system as a CSV file.]
See the […]

@svnpenn Thanks :)

@teknomath I'm already using jq as an ETL tool, much as I've used XSLT in that fashion before (only jq makes me much happier than XSLT did). You might want to try out the features in https://github.com/nicowilliams/jq/tree/handles . I'm working towards adding a proper library system, including dlopen()ing C extensions -- I think that will help make jq incredibly powerful.
Hi, I just want to know the current status of this issue (enhancement). I have little to say about C plugins, but since most standard Unix programs behave as stdin-to-stdout filters, a simple pipe-based shell-out would cover most needs.

Meanwhile, for tasks such as the one mentioned at the beginning of this issue, I'm using:

```sh
jq -r '.releases[].date' | parallel -k date -d {} +%s | jq -R '{"date": .}' | jq -s '{"release": .}'
```

(And this gets even more complicated when other key/value pairs are present in the array and need to be preserved.)
Off and on I end up doing work with big piles of unstructured data that I need to make sense of, and I traditionally have used giant shell pipelines to sort out the needles in the haystacks. Lately I have been doing more and more with jq, primarily because it is much less error-prone due to its clean (but occasionally surprising) semantics. This feature request is one of the most frequent reasons I have to "drop out" of jq to process some data. In the traditional unix-pipeline-awk world, you would use awk's coprocess facility, along these lines:
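A sketch of that kind of gawk coprocess, reconstructed from the description in the follow-up below (a pty to dodge block buffering, GNU `date` as the filter; details assumed):

```awk
# gawk sketch (reconstruction; details assumed): feed each input line to a
# long-running 'date' coprocess over a pty and read back the converted value.
BEGIN {
    cmd = "date -f - +%s"        # GNU date: read date strings from stdin
    PROCINFO[cmd, "pty"] = 1     # use a pty so date's output isn't block-buffered
}
{
    print $0 |& cmd              # write one date string to the coprocess
    cmd |& getline epoch         # read back the epoch timestamp
    print epoch
}
END { close(cmd) }
```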
Note that the JSON string argument will need to be converted to raw, and in the example we will probably want the output to be converted back to a JSON string -- but the result could have been JSON. I expect the user will want some control over how the input/output conversions are handled. If date returned more than one result (although I don't think it will), this also seems to work fine. So […]
In these situations, deadlock is a possibility, and that is why the example asks awk to use a pty for the coprocess. I left out the usual […]. The low-level reading and writing is quite flexible, but not so clean. If you constrain it to be a filter that takes one input and produces one output, it looks more like a jq filter and seems to fit in nicely, although as a filter it will not produce an interesting JSON output unless the coprocess does.
I have problems like that of the OP all the time, where I need to get data out of one system, transform it, and then somehow join it back in with the original data. With the newish […]
Yes, I resort to these kinds of tricks too. A shell-out should probably be a high priority. I may even work on it this coming weekend; we'll see.
Well, while I do think that a "shell-out" feature would help many users […]
Yes, […]
@nicowilliams I should mention I do not think this feature is required anymore in this specific case. I have not had time to check, but I believe the new JQ date commands can be used to fix my issue here. However others might still be interested in generic JQ |
@svnpenn Right, for datetime-related tasks a shell-out is not needed. You'll note that we added the sorts of things we needed that were relatively easy to add :) A shell-out wouldn't be so hard to code, but first we needed to work out a privilege management model that would work for that and I/O in general. I see two shell-out forms: EDIT: Fix typo. |
See #1005.
I'm converting base64-encoded binary SHA256 hashes back into hexadecimal representation. I want to be able to "shell-out" to execute […]. At the moment I have an […].
@mterron have you considered using a proper programming language? I know it […]

Python: […]

Ruby: […]
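The snippets themselves were lost; a sketch of what the Python version presumably showed (the digest is one from the example data in the next comment):

```python
import base64

# Decode the base64 digest and print it as lowercase hex.
b64 = "eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto="
print(base64.b64decode(b64).hex())
# -> 7979b272bd4483fdaf6c5c55bcce9abc15a894197c07a5644d1579ffd07d92da
```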
I have, but […]. Example JSON input (after a lot of manipulation with […]):

```json
{
  "component": "1000hz-bootstrap-validator",
  "version": "0.10.2",
  "hashes": [
    {
      "file": "validator.js",
      "base64": "sha256-eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto="
    },
    {
      "file": "validator.min.js",
      "base64": "sha256-mbv8R/8RTicMdfYPxNwD4QVAvNPO6Ht+ZDW9EK0gNHM="
    }
  ]
}
```

and JSON output:

```json
{
  "component": "1000hz-bootstrap-validator",
  "version": "0.10.2",
  "hashes": [
    {
      "file": "validator.js",
      "base64": "sha256-eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto=",
      "sha256": "7979b272bd4483fdaf6c5c55bcce9abc15a894197c07a5644d1579ffd07d92da"
    },
    {
      "file": "validator.min.js",
      "base64": "sha256-mbv8R/8RTicMdfYPxNwD4QVAvNPO6Ht+ZDW9EK0gNHM=",
      "sha256": "99bbfc47ff114e270c75f60fc4dc03e10540bcd3cee87b7e6435bd10ad203473"
    }
  ]
}
```
You need to process the string first:

```sh
$ echo "eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto=" | base64 -d | xxd -p -c32
7979b272bd4483fdaf6c5c55bcce9abc15a894197c07a5644d1579ffd07d92da
```
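Without a shell-out in jq itself, the whole round trip can be scripted today, along these lines (a sketch only; `input.json` stands in for the real file):

```sh
# Sketch: pull the base64 digests out with jq, convert each to hex in the
# shell, then merge the hex values back in with a second jq pass.
hex=$(jq -r '.hashes[].base64 | sub("^sha256-"; "")' input.json |
      while read -r b64; do
        printf '%s\n' "$b64" | base64 -d | xxd -p -c32
      done |
      jq -R . | jq -s .)
jq --argjson hex "$hex" \
   '.hashes |= [to_entries[] | .value + {sha256: $hex[.key]}]' input.json
```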
Sorry, I see you are moving goalposts. I have dealt with that before and I have a no-tolerance policy. Good luck.

Meaning? I just provided a real-life use case for the shell-out feature. What I want to do can't be done with […]. It is an escape hatch.

I'm not sure what you are trying to say, tbh.
This conversation is getting a little unnecessarily heated, so let's please remain polite and civil, shall we?

On topic, though: @mterron The ability to shell out is something we're working on, and there are some branches floating around with the capability. They're a little buggy at the moment, and may not actually have direct support for shelling out yet (I'd have to check), but they contain the necessary groundwork for us to support it.
If you want something constructive, look at #1005 -- although I have to admit I am frustrated, as that pull has been linked repeatedly already in this thread.

@cup please stop.

@nicowilliams what is your problem? I linked to a pull that actually accomplishes (from personal testing) what he is asking for.
Given this file (see the reconstructed sketch below), I would like to run the `date` values through the `date` command so that the final result has the converted dates in place. Is something like this possible?
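The question's code blocks did not survive; judging from the `jq -r '.releases[].date' | parallel -k date -d {} +%s` pipeline quoted earlier in the thread, the input shape and intent were presumably along these lines:

```sh
# Assumed input shape, inferred from ".releases[].date" used elsewhere
# in this thread; the actual file and date values are unknown.
cat > releases.json <<'EOF'
{"releases": [{"date": "2013-06-01"}, {"date": "2013-07-01"}]}
EOF

# The goal, roughly: run each date value through date(1), e.g. to epoch seconds.
jq -r '.releases[].date' releases.json | while read -r d; do
  date -d "$d" +%s
done
```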