Add fuzzing test for JSON reader #5001

HaoYang670 · 2022-03-22T03:38:52Z

Add fuzz tests for JSON reading.
close #4138

This PR contains 3 parts:

Schema generator: randomly generates a JSON schema described using PySpark's DataType
JSON generator: randomly generates a JSON string based on the given schema
test_json_read_fuzz: tests the behavior of the CPU and GPU when reading a random JSON file (marked xfail, so it will not impact the CI job)

Data types supported so far:

primitive types: int, long, float, double, string
nested types: struct and array

Plan to support in the future:

primitive types: date, timestamp
nested types: map
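
The first two parts can be pictured with a small stand-alone sketch. This is not the code from the PR: the schema here is a plain Python structure standing in for PySpark's DataType, and the names `gen_type`, `gen_struct`, `gen_value` and `gen_json_lines` are hypothetical. It only covers the types listed as supported so far.

```python
import json
import random

# Stand-in for PySpark's DataType (an assumption made to keep the sketch stdlib-only):
#   primitive -> one of the names below
#   array     -> ("array", element_type)
#   struct    -> ("struct", [(field_name, field_type), ...])
PRIMITIVES = ["int", "long", "float", "double", "string"]


def gen_type(rand, depth):
    """Randomly pick a primitive, array or struct type, nesting at most `depth` levels."""
    if depth <= 0:
        return rand.choice(PRIMITIVES)
    kind = rand.choice(["primitive", "array", "struct"])
    if kind == "primitive":
        return rand.choice(PRIMITIVES)
    if kind == "array":
        return ("array", gen_type(rand, depth - 1))
    return gen_struct(rand, depth - 1)


def gen_struct(rand, depth):
    """Randomly generate a struct type; the top-level schema is always a struct."""
    fields = [("c%d" % i, gen_type(rand, depth)) for i in range(rand.randint(1, 3))]
    return ("struct", fields)


def gen_value(rand, schema):
    """Generate one random Python value that conforms to `schema`."""
    if schema == "int":
        return rand.randint(-2**31, 2**31 - 1)
    if schema == "long":
        return rand.randint(-2**63, 2**63 - 1)
    if schema in ("float", "double"):
        return rand.uniform(-1e6, 1e6)
    if schema == "string":
        length = rand.randint(1, 30)
        return "".join(rand.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length))
    kind, inner = schema
    if kind == "array":
        return [gen_value(rand, inner) for _ in range(rand.randint(0, 3))]
    return {name: gen_value(rand, typ) for name, typ in inner}


def gen_json_lines(rand, schema, rows):
    """Serialize `rows` random records, one JSON object per line."""
    return "\n".join(json.dumps(gen_value(rand, schema)) for _ in range(rows))
```

All randomness goes through the `rand` argument, so calling `gen_struct` and `gen_json_lines` with `random.Random(42)` twice yields the same schema and the same data.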

Signed-off-by: remzi <13716567376yh@gmail.com>

HaoYang670 · 2022-03-22T05:49:36Z

build

abellina · 2022-03-22T14:18:01Z

integration_tests/src/main/python/json_fuzzing_test.py

+    'spark.rapids.sql.format.json.read.enabled': 'true'}
+
+@approximate_float
+@pytest.mark.xfail(reason = "fuzz test")


should there be at least a second test that is always expected to succeed?

Naively I am also not sure why this test would be expected to fail. I guess the other point to make, would be to add a github issue link on what is failing. Perhaps that's well known to others, but it isn't obvious from this change to me.

The aim of this fuzz test is to help find more corner cases in JSON reading. Link: #4821

Initially, my thought was that it would spoil our CI tests if a corner case was found, so I marked it as xfail. However, after thinking carefully about your feedback, I removed the xfail. Maybe we should let the CI fail when we find a new corner case, although the probability is low.

Signed-off-by: remzi <13716567376yh@gmail.com>

HaoYang670 · 2022-03-23T06:15:07Z

build

integration_tests/src/main/python/json_fuzzing_test.py

andygrove · 2022-03-24T02:02:04Z

integration_tests/src/main/python/json_fuzzing_test.py

+_name_gen = StringGen(pattern= "[a-z]{1,30}",nullable= False)
+_name_gen.start(random)
+_string_gen = StringGen(pattern= "[a-z]{1,30}",nullable= False)
+_string_gen.start(random)


Are these tests deterministic, using a fixed seed?

Signed-off-by: remzi <13716567376yh@gmail.com>

HaoYang670 · 2022-03-24T11:58:30Z

build

HaoYang670 · 2022-03-24T11:58:42Z

build

Signed-off-by: remzi <13716567376yh@gmail.com>

HaoYang670 · 2022-03-25T02:18:58Z

build

HaoYang670 · 2022-03-25T06:32:21Z

build

pxLi · 2022-03-25T06:51:37Z

Please check the failed log before re-triggering:

Error: 3-25T06:40:13.649Z] [ERROR] Failed to execute goal org.apache.rat:apache-rat-plugin:0.13:check (default) on project rapids-4-spark-parent: Too many files with unapproved license: 1 See RAT report in: /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-4260/target/spark24X/rat.txt -> [Help 1]

Exclude the related JSON files at https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/pom.xml#L1154-L1178 to pass the license check.

Signed-off-by: remzi <13716567376yh@gmail.com>

HaoYang670 · 2022-03-25T06:58:38Z

build

Signed-off-by: remzi <13716567376yh@gmail.com>

HaoYang670 · 2022-03-28T12:32:33Z

build

revans2 · 2022-03-28T13:49:23Z

integration_tests/src/main/python/json_fuzzing_test.py

+
+@approximate_float
+@allow_non_gpu('FileSourceScanExec')
+@pytest.mark.xfail(reason="fuzz test may randomly fail")


To be clear fuzz testing is intended to find issues and should not be running by default as a part of integration tests. It should be something that we can enable and run for a configurable number of iterations to find problems. Once we find the problem we need to have a way to debug/reproduce this issue. Then we can file a follow on issue/bug to fix it.

I would much rather have us skip this test by default unless a flag is set to enable fuzz testing, preferably with a number of iterations. With xfail we want to get to the point where if a test passes when we expect it to fail then it will be marked as a failure, xfail_strict=true. Because this is totally random I really don't want to mark it as xfail.

Done. Fuzz tests are skipped by default.
We can enable the fuzz tests and store the test data by running

> bash run_pyspark_from_build.sh -k test_json_read_fuzz --fuzz_test --debug_tmp_path
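
One common way to get this skip-by-default behavior in pytest is a `conftest.py` hook. The following is a minimal sketch, not the code from this PR: only the `--fuzz_test` flag name comes from the thread, while the `fuzz_test` marker name and the wording of the messages are assumptions.

```python
# conftest.py (sketch): skip tests marked `fuzz_test` unless --fuzz_test is passed.
import pytest


def pytest_addoption(parser):
    # Register the command-line flag; it is off unless given explicitly.
    parser.addoption("--fuzz_test", action="store_true", default=False,
                     help="enable the fuzz tests")


def pytest_configure(config):
    # Declare the (hypothetical) marker so pytest does not warn about it.
    config.addinivalue_line(
        "markers", "fuzz_test: randomized test, skipped unless --fuzz_test is given")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--fuzz_test"):
        return  # fuzz tests were requested, so leave every item untouched
    skip_fuzz = pytest.mark.skip(reason="fuzz tests are disabled; pass --fuzz_test to enable")
    for item in items:
        if "fuzz_test" in item.keywords:
            item.add_marker(skip_fuzz)
```

With this in place, a test decorated with `@pytest.mark.fuzz_test` is reported as skipped in a normal run rather than as xfail, which avoids the `xfail_strict` concern raised above.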

revans2 · 2022-03-28T13:53:49Z

integration_tests/src/main/python/json_fuzzing_test.py

+@allow_non_gpu('FileSourceScanExec')
+@pytest.mark.xfail(reason="fuzz test may randomly fail")
+def test_json_read_fuzz(spark_tmp_path):
+    depth = random.randint(1, 5)


How do we reproduce/debug an error when it happens? Everything is random and based off of a regular random. Typically in python we would want to create an instance of Random with a seed, and then pass it everywhere so the results can be reproduced. We do this in data_gen already.

spark-rapids/integration_tests/src/main/python/data_gen.py

Line 671 in 0063053

rand = random.Random(seed)

I am fine if we randomly generate a seed based off of a timestamp or something like that. But at a minimum we need a way to log what that seed was so if/when it does fail we can debug things. With xdist this is not super simple, so I would suggest that we catch all Exceptions and then wrap them with something new that includes the seed that triggered the problem.
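
The suggestion above can be sketched as follows. This is a hypothetical helper, not code from the PR or from `data_gen.py`: the names `FuzzFailure` and `run_fuzz_case` are invented for illustration, and the seed is derived from the clock only when none is supplied.

```python
import random
import time


class FuzzFailure(Exception):
    """Wraps any failure together with the seed that produced it."""


def run_fuzz_case(body, seed=None):
    """Run `body(rand)` with a seeded Random and report the seed on failure."""
    if seed is None:
        # No seed given: derive one from the current time in milliseconds.
        seed = int(time.time() * 1000)
    rand = random.Random(seed)
    try:
        return body(rand)
    except Exception as e:
        # Re-raise with the seed in the message so the run can be reproduced,
        # even when xdist workers make the ordinary logs hard to correlate.
        raise FuzzFailure("fuzz case failed with seed=%d: %r" % (seed, e)) from e
```

The test body then draws everything from the `rand` it receives (for example `depth = rand.randint(1, 5)`) instead of the module-level `random`, and a failing run can be replayed by passing the reported seed back in.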

Adding the flag --debug_tmp_path saves the test data.

andygrove · 2022-03-29T14:46:51Z

integration_tests/src/main/python/json_fuzzing_test.py

+    else:
+        yield chr(random.randint(0x61, 0x66))
+
+def gen_number():


Spark also supports special cases such as +Infinity and NaN for floating-point numbers, even though these are not valid in the JSON spec.

Thank you @andygrove. Really nice reminder! I will add these and other types supported by Spark in follow-up PRs.
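
As a rough illustration of what such a follow-up could look like, here is a sketch of a number generator that occasionally emits non-standard literals. It is not the PR's `gen_number`: the signature, the `special_ratio` parameter and the exact list of literals are assumptions (only `+Infinity` and `NaN` are named in the comment above). The helper `is_strict_json` uses Python's `json` module to show that these literals fall outside the JSON spec.

```python
import json
import random

# Non-standard literals; "+Infinity" and "NaN" come from the review comment,
# the other two are assumed to be of interest as well.
SPECIAL_FLOATS = ["NaN", "Infinity", "-Infinity", "+Infinity"]


def gen_number(rand, special_ratio=0.1):
    """Return a floating-point literal as text, sometimes a non-standard one."""
    if rand.random() < special_ratio:
        return rand.choice(SPECIAL_FLOATS)
    return repr(rand.uniform(-1e6, 1e6))


def is_strict_json(text):
    """True if `text` parses as JSON without relying on NaN/Infinity extensions."""
    def reject(name):
        # Python's parser accepts NaN, Infinity and -Infinity by default;
        # raising here turns that leniency off.
        raise ValueError("non-standard constant: " + name)
    try:
        json.loads(text, parse_constant=reject)
        return True
    except ValueError:
        return False
```

Keeping the special cases behind a ratio lets a fuzz run mix ordinary numbers with the corner cases instead of producing only one kind.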

Signed-off-by: remzi <13716567376yh@gmail.com>

HaoYang670 · 2022-03-30T08:52:57Z

build

revans2 · 2022-03-30T15:01:13Z

integration_tests/README.md

+### Enabling fuzz tests
+
+Fuzz tests are intended to find more corner cases in testing. We disable them by default because they might randomly fail. 
+The tests can be enabled by appending th option `--fuzz_test` to the command.


nit: You have "th" instead of "the"

The tests can be enabled by appending the option --fuzz_test to the command.

revans2 · 2022-03-30T15:02:13Z

integration_tests/README.md

+
+   * `--fuzz_test` (enable the fuzz tests when provided, and remove this option if you want to disable the tests)
+
+To reproduce an error appearing in the fuzz tests, you also need to add the flag `--debug_tmp_path` to save the test data.


Could we combine these two together? So if you pass in --fuzz_test_debug_path or something like that the tests are enabled? To be clear I am fine with keeping the debug_tmp_path as it is. If this is really hard you don't need to do it.

revans2 · 2022-03-30T15:06:20Z

integration_tests/src/main/python/json_fuzz_test.py

+
+# A JSON generator built based on the context free grammar from https://www.json.org/json-en.html
+
+from cgi import test


Why do we need this? We are not doing CGI are we?

Sorry, the IDE added this line automatically. I have removed it!

fix spelling mistake

Signed-off-by: remzi <13716567376yh@gmail.com>

integration_tests/src/main/python/json_fuzz_test.py

Co-authored-by: Niranjan Artal <50492963+nartal1@users.noreply.github.com>

nartal1

LGTM but would be good to get review from others who had reviewed earlier.

HaoYang670 · 2022-04-07T12:27:15Z

build

sameerz · 2022-04-11T01:46:06Z

Does this close issue #4138 ?

HaoYang670 · 2022-04-11T01:49:49Z

> Does this close issue #4138 ?

Yes. Have updated.

HaoYang670 · 2022-04-18T11:57:25Z

Hi @revans2. Do you think we could merge this PR or not?

revans2 · 2022-04-18T13:32:51Z

@HaoYang670 I am very busy with other things. Please don't let me block this from going in. If others have approved it then we can merge it in and if there are issues we can iterate on them as we find them.

HaoYang670 added 13 commits March 8, 2022 21:14

prototype

bee87a3

Signed-off-by: remzi <13716567376yh@gmail.com>

add hex, digit, escape

b00b262

Signed-off-by: remzi <13716567376yh@gmail.com>

add more comments

428df4d

Signed-off-by: remzi <13716567376yh@gmail.com>

add schema gen

6d9d395

Signed-off-by: remzi <13716567376yh@gmail.com>

rename functions

960c37f

Signed-off-by: remzi <13716567376yh@gmail.com>

rename file, support multi lines

2138a49

Signed-off-by: remzi <13716567376yh@gmail.com>

temp save

63ac59e

Signed-off-by: remzi <13716567376yh@gmail.com>

fuly support json grammar

9b50e80

Signed-off-by: remzi <13716567376yh@gmail.com>

add a random test

524a44a

Signed-off-by: remzi <13716567376yh@gmail.com>

Merge remote-tracking branch 'upstream/branch-22.04' into json_fuzz

937fa2d

fix some bug

cf9c5f8

Signed-off-by: remzi <13716567376yh@gmail.com>

rename

98fd8af

Signed-off-by: remzi <13716567376yh@gmail.com>

add copyright

380327b

Signed-off-by: remzi <13716567376yh@gmail.com>

abellina reviewed Mar 22, 2022

sameerz added the test Only impacts tests label Mar 22, 2022

sameerz added this to the Mar 21 - Apr 1 milestone Mar 22, 2022

remove xfail

93a32b4

Signed-off-by: remzi <13716567376yh@gmail.com>

andygrove reviewed Mar 24, 2022

integration_tests/src/main/python/json_fuzzing_test.py (outdated, resolved)

andygrove reviewed Mar 24, 2022

wider range of chars

9b7f551

Signed-off-by: remzi <13716567376yh@gmail.com>

avoid unexpected escape chars

899aee1

Signed-off-by: remzi <13716567376yh@gmail.com>

delete the test data file

aecde63

Signed-off-by: remzi <13716567376yh@gmail.com>

bring back xfail, because fuzz test might randomly fail

7c70061

Signed-off-by: remzi <13716567376yh@gmail.com>

revans2 reviewed Mar 28, 2022

Merge remote-tracking branch 'upstream/branch-22.06' into json_fuzz

c31aeb8

andygrove reviewed Mar 29, 2022

HaoYang670 added 3 commits March 30, 2022 14:21

disable fuzz tests by default

d64651f

Signed-off-by: remzi <13716567376yh@gmail.com>

upadte readme

b3cb4c9

Signed-off-by: remzi <13716567376yh@gmail.com>

store schema for debugging

b14d4d2

Signed-off-by: remzi <13716567376yh@gmail.com>

revans2 reviewed Mar 30, 2022

remove useless package

426175e

fix spelling mistake

Signed-off-by: remzi <13716567376yh@gmail.com>

sameerz modified the milestones: Mar 21 - Apr 1, Apr 4 - Apr 15 Apr 4, 2022

nartal1 reviewed Apr 5, 2022

integration_tests/src/main/python/json_fuzz_test.py (outdated, resolved)

Update integration_tests/src/main/python/json_fuzz_test.py

ea82492

Co-authored-by: Niranjan Artal <50492963+nartal1@users.noreply.github.com>

nartal1 approved these changes Apr 6, 2022

sameerz modified the milestones: Apr 4 - Apr 15, Apr 18 - Apr 29 Apr 18, 2022

andygrove approved these changes Apr 18, 2022

andygrove merged commit 79e9793 into NVIDIA:branch-22.06 Apr 18, 2022

HaoYang670 deleted the json_fuzz branch April 19, 2022 01:10

