Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save literal value of the parsed number to preserve it for the output #1743

Closed
wants to merge 22 commits into from

Conversation

leonid-s-usov
Copy link
Contributor

This is a POC implementation of a concept which may resolve numerous cases of truncated integer identifiers due to the double limitation reported by #1652

The logic is pretty simple: for every parsed number we store the actual read literal which was used to generate the double representation. Then, we use this literal for comparisons and printing.
Any math operation on the literal number will use the truncated double precision representation and generate a new jv_number with no literal associated.

The implementation is trying to make minimal impact on the performance by falling back to the original number representation whenever literal information is not available.

This implementation has a side effect of non equivalence of 10 and 10.0, which could be taken as a neat feature, or as a bug. Anyhow, it will definitely be considered a non backward compatible change :/. This is clearly seen in the need to update a test on line 539.

However, this particular aspect can be adjusted, as the equivalence operation may be considered "math" in terms that the double representation of the number will be taken.
This will make it backward compatible, but I believe that this literal equivalence must be preserved. If we take the use case of some external database ID then it would be great if jq could also match those ids, not only preserve in the output unchanged.

A fun fact about this branch is that the tests to verify the new functionality depend on the functionality itself. As it turned out, the test suite is utilising the same parsing routines to compare the actual output to the expected output, so the truncation test cases actually succeed on the master branch where the testing code is unable to see the difference between those large numbers.

Leonid S. Usov added 2 commits October 19, 2018 21:57
Any math operations on the numbers will truncate them to decimal precision.
@coveralls
Copy link

coveralls commented Oct 19, 2018

Coverage Status

Coverage increased (+0.001%) to 84.421% when pulling eda80f2 on leonid-s-usov:literal_number into 0c845aa on stedolan:master.

@leonid-s-usov
Copy link
Contributor Author

I've just discovered that mantests on my machine PASS because they get aborted

rake aborted!
LoadError: cannot load such file -- ronn
/Users/leonid/projects/jq/docs/Rakefile.manual:1:in `<top (required)>'
(See full trace by running task with --trace)
0 of 0 tests passed (0 malformed)
PASS tests/mantest (exit status: 0)

Not yet sure how to fix this but will try

@leonid-s-usov
Copy link
Contributor Author

@wtlangford @nicowilliams what's the reason behind saving on a jv_free call on numbers? I can see it all over the code in builtin.c. I'm pretty sure the call gets inlined by the compiler and can be a no-op for the NUMBER case easily.

I'm thinking maybe we should utilise the jv structure unused bits (kind, _pad) better and introduce a generic way of marking structures as requiring dedicated free method, and maybe even encoding the allocated size? This way we can definitely make a jv_free call almost "free" for non-allocated jv objects.

@wtlangford
Copy link
Contributor

what's the reason behind saving on a jv_free call on numbers? I can see it all over the code in builtin.c.

I don't think there was a strong reason, just more that since we knew it wasn't necessary, it didn't always get done.

I'm thinking maybe we should utilise the jv structure unused bits (kind, _pad) better and introduce a generic way of marking structures as requiring dedicated free method, and maybe even encoding the allocated size? This way we can definitely make a jv_free call almost "free" for non-allocated jv objects.

The call to jv_free is already pretty much free for numbers (though jv_free itself could likely use an optimization to prevent repeated calls to jv_get_kind), so I'm not too concerned there. And while we do have those extra bits in the jv structure, we'll want to be careful and sparing when/where we use them, because we do not want to increase sizeof(jv)

@leonid-s-usov
Copy link
Contributor Author

Why do jv_number_value and jv_string_value NOT consume the passed jv object?

@wtlangford
Copy link
Contributor

Why do jv_number_value and jv_string_value NOT consume the passed jv object?

Two reasons, in no particular order:

  1. Because we call these all the time on jvs we aren't actually done with, and so we can avoid an extra jv_copy, jv_free.

  2. Because this would be unpleasant:

jv s = jv_string("some string");
char *p = jv_string_value(s);
// something with p

If jv_string_value were to free the underlying storage here, we'd be sad, and forced to do this:

jv s = jv_string("some string");
char *p = jv_string_value(jv_copy(s));
// something with p
jv_free(s); // once we're done

and that's not better.

@leonid-s-usov
Copy link
Contributor Author

Because this would be unpleasant:

jv s = jv_string("some string");
char *p = jv_string_value(s);
// something with p

If jv_string_value were to free the underlying storage here, we'd be sad, and forced to do this:

jv s = jv_string("some string");
char *p = jv_string_value(jv_copy(s));
// something with p
jv_free(s); // once we're done

and that's not better.

Well, the jv_free(s) would be needed in both cases, unless you'd pass it on to some consumer. So the only real difference would be the call to jv_copy. jv_invalid_has_msg, for example, does consume the invalid.
We only save on a jv_copy call. jv_free will be run under the hood inside the consumer.

In any case, I suspected that there would be some reason, in that case let's maybe introduce some naming convention to distinguish between consuming and non-consuming methods? Because
as someone new to the code base I have quickly grasped and totally accepted the general "always consuming" semantics, but the exception methods are making it hard to fully trust and I keep jumping to every method definition to see if it actually consumes the parameter.

So, just maybe something like jv_peek_* to denote that the method is only going to check some things on the object but not consume it? I don't think this is the best naming approach but just to throw in some example.

@leonid-s-usov
Copy link
Contributor Author

leonid-s-usov commented Oct 21, 2018

@pkoppstein I have moved to this PR cause the topic becomes verbose, and also it kind of makes more sense to discuss this particular aspect here than in the issue #1741

Here I would like to refer to that strange behaviour you've pointed me to around how my current branch deals with ./jq -n '1.0 == 1'

I have found the reason for the behaviour, and it's not trivial. This is happening due to constant_fold optimisation at the parser stage. The issue is that that optimisation doesn't call any of the built-ins, which would take care of calling jv_equal under the hood; instead, these work on the doubles directly, which bypasses any logic I have implemented.

I am thinking that the solution would be to actually call the builtin implementations of the mentioned functions instead of making the calculations directly.
WDYT?


UPDATE

I have fixed the issue by using the jv_equal result for EQ, NEQ, LESSEQ and GREATEREQ operations, see the commit.
That actually fixed another number leak since the block_const results weren't ever free'd. Kind of opens another pool 1k lines to look for unfreed numbers...

Leonid S. Usov added 2 commits October 21, 2018 20:03
…ization

This is just a tip of an iceberg when talking about overloaded number arithmetic. For now that fixes the bug,
however should we introduce more sophisticated overload of semantic number treatment, other operations
will have to be changed.
@leonid-s-usov
Copy link
Contributor Author

@pkoppstein I believe I've gotten to some more or less final form of the PR.

TODOs

  • Settle on the filter semantics. You propose toliteral/0 while I feel like literal/1. Need to sleep over it.
  • clean up the leaks due to JV_KIND_NUMBER not being jv_freed all over the place. I may need some help here. Is there a way to run valgrind on macOs? maybe via docker?
  • Implement the proposed refactoring to distinguish between consuming and non-consuming jv_ methods (?)

@pkoppstein
Copy link
Contributor

Using jq-1.6rc1-25-g56f6124-dirty I was surprised to see:

jq -n 123456789123456789123456789
123456789123456790000000000

I trust that is a bug :-)

@pkoppstein
Copy link
Contributor

@leonid-s-usov - Here's a test case which highlights a discrepancy between the current implementation of jq (jq-1.6rc1-25-g56f6124-dirty) and the principles I'm advocating:

./jq -n '123456789123456789123456789 == 123456789123456790000000000'
true

By the principle:

 If m and n are JSON numeric literals, then `m==n` iff Sem(m) == Sem(n)

the result should of course be false.

@leonid-s-usov
Copy link
Contributor Author

oh yeah, Microsoft is now killing GitHub. Skype is already dead.

Anyhow, I seem to have your posts in my inbox, I can send those to you.

Using jq-1.6rc1-25-g56f6124-dirty I was surprised to see:

 jq -n 123456789123456789123456789
 123456789123456790000000000

I trust that is a bug :-)

Surprised? That is where it has started! That's the essence of #1652 and #1741.
And this PR fixes it.

./jq -n '123456789123456789123456789 == 123456789123456790000000000'
true

The equality above is currently kept compatible (i.e. faulty). Fixing it while maintaining full backward compatibility is the tricky part touching your tocanonical approach.

I would like to believe we have a chance for iterating in order to get there by having a first step which fixes an apparent bug of changed number representations with no reason. This step also provides a work around for cases when literal comparison is required:

./jq -n 'literal(123456789123456789123456789) == "123456789123456790000000000"'
false

The above could read (123456789123456789123456789 | toliteral) == "123456789123456790000000000"

tostring and tojson would work as well. This is described in the updated docs

@leonid-s-usov
Copy link
Contributor Author

@pkoppstein @wtlangford
I have a suggestion regarding the naming convention of the jv_* functions concerning the memory management.

When calling a function which doesn't consume the jv object, it can be thought as a semantic equivalent to a properly consuming variant of it called with an explicit jv_copy of the original argument.

so, the non-conforming jv_string_value can be actually phrased as a conforming jv_string_get_value called as

jv_string_get_value(jv_copy(str));

I would like to suggest that all functions starting with jv_ and performing some non-memory related action with the object must consume their arguments. However, if the function has a suffix *_copy* in its name, it must promise to copy the arguments on the caller's behalf.
This means that semantically it should be always equivalent to calling the non *_copy* version of the method (if such exists) with copied arguments, but calling the _copy version opens way for internal optimisation by totally omitting the copy / free cycle.

So, the only fishy case will be the jv_copy itself, but taking into an account that this method doesn't perform any non-memory related action, one may conclude that the method is not a consumer by definition, and as such it doesn't have to free the argument. However the copying is still performed as demanded by the suffix rule

There is another guideline for the actual placement of the _copy suffix. I would suggest that this suffix is placed after the type suffix, so that one trying to perform an action on a known jv_type could see all available completions, whether copying or not, given the jv_typename prefix.

Following this suggestion, the refactoring will change the following method names (non-exhaustive example):

jv_get_kind()            -> jv_copy_get_kind()
jv_is_valid()            -> jv_copy_is_valid()

jv_number_value()        -> jv_number_copy_get_value()
jv_number_is_integer()   -> jv_number_copy_is_integer()
jv_string_value()        -> jv_string_copy_get_value()

@wtlangford
Copy link
Contributor

This is happening due to constant_fold optimisation at the parser stage. The issue is that that optimisation doesn't call any of the built-ins, which would take care of calling jv_equal under the hood; instead, these work on the doubles directly, which bypasses any logic I have implemented.

Yep. A note: I'll be very interested to benchmark the compilation stage prior to merging the PR. Compilation is known to be very slow right now (O(m×n)), so I'll be interested to know if using jv_equal has a measurable performance cost.

I have fixed the issue by using the jv_equal result for EQ, NEQ, LESSEQ and GREATEREQ operations, see the commit.
That actually fixed another number leak since the block_const results weren't ever free'd. Kind of opens another pool 1k lines to look for unfreed numbers...

And now you see why we'd kicked this can so far for so long. 😉

  • clean up the leaks due to JV_KIND_NUMBER not being jv_freed all over the place. I may need some help here. Is there a way to run valgrind on macOs? maybe via docker?

I just run a small centos vm for this purpose. I do my dev locally, and run the tests in the vm when I'm ready.

The equality above is currently kept compatible (i.e. faulty). Fixing it while maintaining full backward compatibility is the tricky part touching your tocanonical approach.

I like this for now. I want to keep this PR smaller in scope/implication, and may advocate for removing toliteral/tocanonical per my stance in #1741 that this is an underlying implementation change.

I have a suggestion regarding the naming convention of the jv_* functions concerning the memory management.

I'd rather we didn't do this here (and possibly at all, though that's up for discussion) for a number of reasons.

  1. This PR is starting to grow in scope.
  2. We expose a lot (most? all? I'd have to check) of those jv_* functions as part of our public C API. Changing their names is a backwards-incompatible change, and I'm a bit hesitant to do so. (The public C API is contained in the following two headers: src/jq.h and src/jv.h)

@nicowilliams
Copy link
Contributor

@pkoppstein does not speak for the project or the maintainers. @pkoppstein only speaks for @pkoppstein.

Today jq uses IEEE754. That has impact on semantics that @pkoppstein and others do not like. @stedolan, myself, and @wtlangford have never much minded. This is a limitation that we accept.

There are only two things we might do about this:

  • defer conversion to IEEE754 until arithmetic or libm functions need it / or keep the original string until then (this is something that @stedolan personally opposed when I've discussed it with him, but it's been a long time, and I think it's something we can accept today)

and/or

  • add a arbitrary precision floating-point implementation as a build option, though it might sometimes need to downgrade to IEEE754 for backwards compatibility (e.g., if we can't get an arbitrary precision implementation of every libm function that we support today)

Keep in mind: jq may not have any dependencies on GPL'ed libraries.

@nicowilliams
Copy link
Contributor

I think too that any attempt at preserving untouched numbers will have negative performance effect in several ways:

  • additional memory pressure
  • additional jv_free()ing of numbers

We do need to start caring more about performance. We've stopped accepting new builtins because we need to fix linking performance first, and @wtlangford and I have run out of energy for that for a while.

@leonid-s-usov
Copy link
Contributor Author

leonid-s-usov commented Oct 22, 2018

Clear. In any case the current implementation has the IEEE double in place and simply keeps the original string around.

It should be noted that the string is only associated with a single instance of the parsed number literal, and gets passed around until the number is consumed by some filter or operator, or until it's dumped.

Memory wise... I can totally see how storing the original number string for every parsed number can increase the memory needed, but that is for the ability to return it back unchanged. And, is it such a high price?

I mean, loading a million chars into memory is not even a consideration in today's computing, so how big should that potential number array be? And wouldn't it be impractical to work with this data in memory for other reasons but the additional few megabytes for original number literal data?

@leonid-s-usov
Copy link
Contributor Author

leonid-s-usov commented Oct 22, 2018 via email

@leonid-s-usov
Copy link
Contributor Author

leonid-s-usov commented Oct 22, 2018 via email

@wtlangford
Copy link
Contributor

That’s bad. But listen, if anyone is using this public API then how come they know whether a given method consumes jv or not, without having access to the source code?

I didn't say it was great as-is, either. 😉 Just that breaking that API is a concern and should be handled with care.

Anyhow, once we introduce more sophisticated comparison mechanisms we’d have to get back there and plug in all the relevant jv_ methods instead of the adhoc arithmetic, not just equality check.

I'm not sure what you mean here?

@leonid-s-usov
Copy link
Contributor Author

OK. Re. API, I am preparing a separate PR with those changes. You will have to consider the pro's and con's, assess the number of API users, and potentially schedule this change for the next breaking version.
Just last point on the API before we forget about it in the scope of this PR, there may be some acceptable techniques to expose the API in a backward compatible manner, while implementing the change internally. I have tried it and I believe this will improve code readability and lessen memory errors in the jq core itself.

Anyhow, once we introduce more sophisticated comparison mechanisms we’d have to get back there and plug in all the relevant jv_ methods instead of the adhoc arithmetic, not just equality check.

I'm not sure what you mean here?

This was a response to your comment about "benchmark[ing] the compilation stage prior to merging the PR". I just said that in this PR I have reverted the code change around that equality operator, since it was having nasty side effects breaking backward compatibility.
On the other hand, if we are to go for it and actually implement the arbitrary large number arithmetic, then we'd need to replace all constant folding operators, not just the equality which was the case with my code.
The bottom line is that it seems like benchmarking compilation is not relevant here (at the current state)

Thanks!

@leonid-s-usov
Copy link
Contributor Author

leonid-s-usov commented Oct 23, 2018

Guys, I need some help please.

I am trying to hunt down a leak and I suspect that it's happening somewhere deep in the manually generated bytecode for things like foreach or range.

I have spotted this particular test case which generates a number leak (i.e. number isn't getting deallocated)

This command argues that a number is leaked
./jq -n 'limit(1;range(1))'

While this one does not leak
./jq -n 'limit(2;range(1))'

And this one also does not leak
./jq -n 'limit(-1;range(1))'

What I was able to deduce out of the above is that most probably a number is leaked due to a premature exit from a foreach loop by the means of a break $out in the implementation of the limit filter

def limit($n; exp):
  if $n < 0 then exp
  else label $out | foreach exp as $item ($n; .-1; $item, if . <= 0 then break $out else empty end)
  end;

Now as much as I tried to get into the bytecode details I can't spot the problematic place.

More than that, I believe it could be that the issue is not with the foreach specifically, but maybe in general with this kind of jumps out of scope where it fails to correctly release the resources.

Of course, I have scanned the code multiple times for some obvious places where a jv is not freed after testing it for number, and I had found such places before, and had fixed those.

So actually my guess now is that it's an unconditional leak which revealed itself only now because the resource leaked is always a number. Which, in turn, means that it could be the input parameters for things like foreach and range, which only accept numbers as their type.

Any ideas?

@leonid-s-usov
Copy link
Contributor Author

Found it!

image

Leonid S. Usov added 8 commits October 23, 2018 11:44
@execute.c:1161
int jq_compile_args(jq_state *jq, const char* str, jv args) {
  jv_nomem_handler(jq->nomem_handler, jq->nomem_handler_data);

The call to jv_nomem_handler was using uninitialised fields of the jq structure
This allows usage of scripting to find offending tests, something like this (pseudocode):

skip = 0
while (true) do
    valgrind ./jq --run-tests my.tests --take 1 --skip $skip
    check_exit_code_and_break_conditionally($?, $skip)
    skip=($skip+1)
done

The command will return status 2 if skip parameter was greater
than the number of tests in the file. This allows for additional breaking condition
@leonid-s-usov
Copy link
Contributor Author

@wtlangford @nicowilliams @pkoppstein

As you can see I have successfully removed all apparent leaks. Cleaned up a bit further than just the new number functionality ;)

Actually in order to do so I had to make two additional things

  1. I have actually run valgrind on macOs. If built from the latest master commit it runs all fine, but complains about some potential leak in libdispatch or something like that. For that I have committed a local.supp file with suppress of that message.
  2. I have added a small tweak to the --run-tests mode, so that it's now possible to specify a --skip and --take parameters. With their help I could quickly build a simple script to run valgrind tool over particular tests rather than on all of them, which has enabled me trace down and eliminate all leaks.

Now the ball is on your side, and I can get back to sleeping, working, eating and all those other things people do.

Cheers ;)

PS. I think you should seriously consider that API naming convention I have suggested above. I believe that most of the issues in the code I had to fix (apart from those where numbers weren't treated as the other kids, cause they're different) were partly due to the existing ambiguity of the method naming

@leonid-s-usov
Copy link
Contributor Author

As a side note, I think that we could have a MUCH less time consuming memory checker by simply keeping track of jv objects which have a non-zero refcount by the time the program finishes.

Could be tricky to get the stack trace for those, of course, but on the other hand it could be better integrated into every test report and save tons of time for finding the bad place.

Also, as you probably know, stack trace from valgrind is often not helpful at all: due to the long life of our objects most of the leaks are originating at the parse stage, or even if not - still far from the actual problematic place.

Additionally, such internal jv-aware memory tracker could do what valgrind cant: dump the contents of the leaked object. And that, gentlemen, is sometimes a jackpot.

Last but not the least, we could enhance the internal memory tracker with the origin of the allocation by utilising macro overrides of the jv object initialisers.

@leonid-s-usov
Copy link
Contributor Author

Does anyone know why the Apple builds fail on travis?
I mean, it's obviously due to failed brew installation of ruby, but why would it start failing suddenly?

@wtlangford
Copy link
Contributor

wtlangford commented Oct 26, 2018 via email

@wader
Copy link
Member

wader commented Nov 30, 2023

@leonid-s-usov What's left of this PR after #1752 was merged? i see toliteral and some docs and jv improvements.

@leonid-s-usov
Copy link
Contributor Author

@wader all of this was superseded by the decimals PR. The appropriate improvements and fixes have been applied there, and toliteral isn't needed anymore. I'm closing this PR, thanks for reviewing it.

@wader
Copy link
Member

wader commented Nov 30, 2023

@leonid-s-usov Great! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants