-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fuzzing test for JSON reader #5001
Conversation
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
build |
'spark.rapids.sql.format.json.read.enabled': 'true'} | ||
|
||
@approximate_float | ||
@pytest.mark.xfail(reason = "fuzz test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be at least a second test that is always expected to succeed?
Naively I am also not sure why this test would be expected to fail. I guess the other point to make, would be to add a github issue link on what is failing. Perhaps that's well known to others, but it isn't obvious from this change to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aim of this fuzzy test is to help find more corner cases in JSON reading. Link: #4821
Initially, my thought was that it would spoil our CI tests if a corner case was found. So I marked it as xfail
. However, after thinking carefully about your feedback, I remove the xfail
. Maybe we should let the CI fail when we find a new corner case, although the possibility is low.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: remzi <13716567376yh@gmail.com>
build |
_name_gen = StringGen(pattern= "[a-z]{1,30}",nullable= False) | ||
_name_gen.start(random) | ||
_string_gen = StringGen(pattern= "[a-z]{1,30}",nullable= False) | ||
_string_gen.start(random) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these tests deterministic, using a fixed seed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: remzi <13716567376yh@gmail.com>
build |
1 similar comment
build |
Signed-off-by: remzi <13716567376yh@gmail.com>
build |
1 similar comment
build |
please check the failed log before re-trigger,
exclude related json filed at https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/pom.xml#L1154-L1178 to pass the license test |
Signed-off-by: remzi <13716567376yh@gmail.com>
build |
Signed-off-by: remzi <13716567376yh@gmail.com>
build |
|
||
@approximate_float | ||
@allow_non_gpu('FileSourceScanExec') | ||
@pytest.mark.xfail(reason="fuzz test may randomly fail") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear fuzz testing is intended to find issues and should not be running by default as a part of integration tests. It should be something that we can enable and run for a configurable number of iterations to find problems. Once we find the problem we need to have a way to debug/reproduce this issue. Then we can file a follow on issue/bug to fix it.
I would much rather have us skip this test by default unless a flag is set to enable fuzz testing, preferably with a number of iterations. With xfail we want to get to the point where if a test passes when we expect it to fail then it will be marked as a failure, xfail_strict=true
. Because this is totally random I really don't want to mark it as xfail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Fuzz tests are skipped by defualt.
We can enable fuzz tests and store the tested data by running
> bash run_pyspark_from_build.sh -k test_json_read_fuzz --fuzz_test --debug_tmp_path
@allow_non_gpu('FileSourceScanExec') | ||
@pytest.mark.xfail(reason="fuzz test may randomly fail") | ||
def test_json_read_fuzz(spark_tmp_path): | ||
depth = random.randint(1, 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we reproduce/debug an error when it happens? Everything is random and based off of a regular random. Typically in python we would want to create an instance of Random with a seed, and then pass it everywhere so the results can be reproduced. We do this in data_gen already.
rand = random.Random(seed) |
I am fine if we randomly generate a seed based off of a timestamp or something like that. But at a minimum we need a way to log what that seed was so if/when it does fail we can debug things. With xdist this is not super simple, so I would suggest that we catch all Exceptions and then wrap them with something new that includes the seed that triggered the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the flag --debug_tmp_path
can save the test data
else: | ||
yield chr(random.randint(0x61, 0x66)) | ||
|
||
def gen_number(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark also supports special cases such as +Infinity
and NaN
for floating-point numbers, even though these are not valid in the JSON spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @andygrove . Really nice reminder! I will add these and other types supported by Spark in following PRs.
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
build |
integration_tests/README.md
Outdated
### Enabling fuzz tests | ||
|
||
Fuzz tests are intended to find more corner cases in testing. We disable them by default because they might randomly fail. | ||
The tests can be enabled by appending th option `--fuzz_test` to the command. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You have "th" instead of "the"
The tests can be enabled by appending the option --fuzz_test
to the command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
|
||
* `--fuzz_test` (enable the fuzz tests when provided, and remove this option if you want to disable the tests) | ||
|
||
To reproduce an error appearing in the fuzz tests, you also need to add the flag `--debug_tmp_path` to save the test data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we combine these two together? So if you pass in --fuzz_test_debug_path
or something like that the tests are enabled? To be clear I am fine with keeping the debug_tmp_path as it is. If this is really hard you don't need to do it.
|
||
# A JSON generator built based on the context free grammar from https://www.json.org/json-en.html | ||
|
||
from cgi import test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? We are not doing CGI are we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the IDE added this line automatically. Have removed!
fix spelling mistake Signed-off-by: remzi <13716567376yh@gmail.com>
Co-authored-by: Niranjan Artal <50492963+nartal1@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but would be good to get review from others who had reviewed earlier.
build |
Does this close issue #4138 ? |
Yes. Have updated. |
Hi @revans2. Do you think we could merge this PR or not? |
@HaoYang670 I am very busy with other things. Please don't let me block this from going in. If others have approved it then we can merge it in and if there are issues we can iterate on them as we find them. |
Add fuzz tests for JSON reading.
close #4138
This PR contains 3 parts:
Data Type supported so far:
Plan to support in the future: