get-json-object: main flow #1868

res-life · 2024-03-13T11:02:54Z

Implementation of get-json-object main flow
integration test with JSON generator

Signed-off-by: Chong Gao res_life@163.com

revans2

This generally looks like it follows the spark code, but it is kind of hard to tell until we actually have something working.

revans2 · 2024-03-13T19:39:43Z

src/main/cpp/src/get_json_object.cu

+      int len = parser.write_unescaped_text(output + output_len);
+      output_len += len;
+    } else {
+      int len = parser.compute_unescaped_text(output + output_len);


Why exactly does the compute_unescaped_text take an output at all?

Done.

Now:
compute_unescaped_len()

revans2 · 2024-03-13T19:44:52Z

src/main/cpp/src/get_json_object.cu

+  int path_commands_size,
+  char* out_buf,
+  size_t out_buf_size,
+  json_parser_options options)  // TODO make this a reference? use a global singleton options?


Or just don't have a json_parser_options at all. We don't have a way to change them, so why not just hard code it?

OK, I'll hard code it.

res-life · 2024-03-15T11:37:34Z

@ttnghia Help review.

Ready to do integration test with JSON generator.
Will fix more bugs when doing integration test.

[done]: Add more integration cases

res-life · 2024-03-18T06:11:04Z

Added more UT cases from link

revans2 · 2024-03-18T14:05:09Z

src/main/cpp/src/get_json_object.cu

+ * "success but no data" return cases.  For example, if you are reading the
+ * values of an array you might call a parse function in a while loop. You
+ * would want to continue doing this until you either encounter an error
+ * (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)


But there is no parse_result::EMPTY in the enum?

Same question.

Removed parse_result
We do not need parse_result::EMPTY.
If evaluate_path success and result is empty, sets output_size as 0 and evaluate_path returns true.

ttnghia · 2024-03-18T17:04:11Z

src/main/cpp/src/get_json_object.cu

+ * @brief Result of calling a parse function.
+ *
+ * The primary use of this is to distinguish between "success" and
+ * "success but no data" return cases.  For example, if you are reading the


Suggested change

* "success but no data" return cases. For example, if you are reading the

* "success but no data" return cases. For example, if you are reading the

Removed parse_result

ttnghia · 2024-03-18T17:04:34Z

src/main/cpp/src/get_json_object.cu

+ * "success but no data" return cases.  For example, if you are reading the
+ * values of an array you might call a parse function in a while loop. You
+ * would want to continue doing this until you either encounter an error
+ * (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)


Suggested change

* (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)

* (`parse_result::ERROR`) or you get nothing back (`parse_result::EMPTY`).

Removed parse_result

ttnghia · 2024-03-18T17:06:43Z

src/main/cpp/src/get_json_object.cu

-  EMPTY,          // success, but no data
-};
+template <int max_json_nesting_depth>
+CUDF_HOST_DEVICE bool evaluate_path(json_parser<max_json_nesting_depth>& p,


Nit: Typically we should not use host+device qualifier too much. Only use it for short computation functions. If it does not need to be on device then just declare as a normal function. On the other hand, wee can just declare it as __device__ if it doesn't run on host.

Refer to this comment

ttnghia · 2024-03-18T17:15:28Z

src/main/cpp/src/get_json_object.hpp

+enum class write_style { raw_style, quoted_style, flatten_style };
+
+/**
+ * path instruction type
+ */
+enum class path_instruction_type { subscript, wildcard, key, index, named };


According to our current style, enum values are ALL_CAP.

Suggested change

enum class write_style { raw_style, quoted_style, flatten_style };

/**

* path instruction type

*/

enum class path_instruction_type { subscript, wildcard, key, index, named };

enum class write_style { RAW_STYLE, ... };

/**

* path instruction type

*/

enum class path_instruction_type { subscript, wildcard, key, index, named };

I'll update the enum to upper case after integration test is done.

ttnghia · 2024-03-18T17:18:29Z

src/main/cpp/src/json_parser.hpp


 namespace spark_rapids_jni {

 /**
 * write style when writing out JSON string
 */
 enum class write_style {
-  // e.g.: '\\r' is a string with 2 chars '\' '\r', writes 1 char '\r'
+  // e.g.: '\\r' is a string with 2 chars '\' 'r', writes 1 char '\r'


Nit: Sorry, again, following style guide, enum values are ALL_CAP.

I'll update the enum to upper case after integration test is done.

ttnghia · 2024-03-18T17:21:44Z

src/main/cpp/src/json_parser.hpp

-      copy_destination,  // copy destination while parsing, nullptr means do not copy
-      w_style);
+    return try_parse_quoted_string(str_pos,
+                                   '\'',


I'm not sure where this function will run (host or device?), as a lot of functions here are marked as CUDF_HOST_DEVICE. Generally we should not do that, only __device__ or without marking (host functions).

Anyway, this may crash, since it is passing a char* pointer that may point to host memory but that pointer may be passed into device function. If that is a host function then it's fine. If it is a device function, pass a string_scalar instead.

I'm not sure where this function will run (host or device?)

Refer to this comment

ttnghia · 2024-03-18T17:23:13Z

src/main/cpp/src/json_parser.hpp

-      copy_destination,  // copy destination while parsing, nullptr means do not copy
-      w_style);
+    return try_parse_quoted_string(str_pos,
+                                   '\"',


Similarly, do not pass a string literal. Pass either std::string or string_scalar.

The \" char will be on GPU stack, IMO, it's safe to put it in the parameter.
I'll test this PR from end to end after this PR is done

ttnghia · 2024-03-18T17:29:46Z

I just did another review iteration. What's most confusing is CUDA_HOST_DEVICE is used everywhere. If possible, please just use __device__ or none if the function does not run on GPU. That will reduce confusion when we want to see where the function is running to check the relevant context.

Will recirculate another more careful review round today.

ttnghia · 2024-03-18T23:28:15Z

src/main/cpp/src/get_json_object.cu

+ * @param j_parser The incoming json string and associated parser
+ * @param path_ptr The command buffer to be applied to the string.
+ * @param path_size Command buffer size
+ * @param output Buffer used to store the results of the query
 * @returns A result code indicating success/fail/empty.
 */
 template <int max_json_nesting_depth = curr_max_json_nesting_depth>
 __device__ parse_result parse_json_path(json_parser<max_json_nesting_depth>& j_parser,


I see that this function is called only once? If so, we can just eliminate it completely and put the code at the caller.

Updated, added annotation: inline

ttnghia · 2024-03-18T23:37:48Z

src/main/cpp/src/get_json_object.hpp

 #include <memory>

 namespace spark_rapids_jni {

+namespace detail {


So everything in this file and json_parser.hpp are running on host CPU?

In production, json_parser.hpp runs on GPU.
But for the unit test cases in json_parser_tests.cpp, they are running on CPU.

json_parser json_start_pos scans the JSON string view on GPU and writes output to cuDF string column.

json_parser is straightforward, it scans char array(parameter is char *) and writes to char buffer (parameter is char *). So it's convenient to use CUDF_HOST_DEVICE to test the logic.
Sorry for introduced extra HOST annotation.

Although the code can run on both CPU and GPU, I suspect that there may be some parts that cannot run exactly the same way on both places. We should better to implement it only as GPU code, and run unit tests on the GPU data.

OK, will update

res-life · 2024-03-19T10:02:36Z

Note: the integration test (between JSON generator, JSON parser, Main flow) is done.
The end to end integration test is not done.

res-life · 2024-03-20T04:04:27Z

@ttnghia Merge this to feature branch to unblock the End to End testing.
We can call another review when merging to Branch-24.04.

Signed-off-by: Chong Gao <res_life@163.com>

* get-json-object: Add JSON parser and parser utility (#1836) * Add Json Parser; Add Json Parser utility; Define internal interfaces; Copy get-json-obj CUDA code from cuDF; Signed-off-by: Chong Gao <res_life@163.com> * Code format --------- Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com> * get-json-object: match current field name (#1857) Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com> * get-json-object: add utility write_escaped_text for JSON generator (#1863) Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com> * Add JNI for GetJsonObject (#1862) * Add JNI for GetJsonObject Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * clean up Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Parse json path in plugin Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Apply suggestions from code review Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com> * Use table_view Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Update java Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Apply suggestions from code review Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com> * clean up Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * use matched enum for type Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * clean up Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * upmerge Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * format Signed-off-by: Haoyang Li <haoyangl@nvidia.com> --------- Signed-off-by: Haoyang Li <haoyangl@nvidia.com> Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com> * get-json-object: main flow (#1868) Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com> * Optimize memory usage in match_current_field_name (#1889) * Optimize match_current_field_name using less memory Signed-off-by: Chong Gao <res_life@163.com> * Convert a function to device code * Add a JNI test case * Add JNI test case * Change nesting depth to 4 * Change nesting depth to 8 to fix test Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * remove clang format change Signed-off-by: Haoyang Li <haoyangl@nvidia.com> --------- Signed-off-by: Chong Gao <res_life@163.com> Signed-off-by: Haoyang Li <haoyangl@nvidia.com> Co-authored-by: Chong Gao <res_life@163.com> * get-json-object: Recursive to iterative (#1890) * Change recursive to iterative Signed-off-by: Chong Gao <res_life@163.com> --------- Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com> * Fix bug * Format * Use uppercase for path_instruction_type Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Add test cases from Baidu * Fix escape char error; add test case * getJsonObject number normalization (#1897) * Support number normalization Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * delete cpp test and add a java test case Signed-off-by: Haoyang Li <haoyangl@nvidia.com> --------- Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Add test case * Fix a escape/unescape size bug Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Fix bug: handle leading zeros for number; Refactor * Apply suggestions from code review Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com> * Address comments Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * fix java test Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Add test cases; Fix a bug * follow up escape/unescape bug fix Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Minor refactor * Add a case; Fix bug --------- Signed-off-by: Chong Gao <res_life@163.com> Signed-off-by: Haoyang Li <haoyangl@nvidia.com> Co-authored-by: Chong Gao <res_life@163.com> Co-authored-by: Haoyang Li <haoyangl@nvidia.com> Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

res-life requested review from ttnghia, SurajAralihalli and thirtiseven March 13, 2024 11:05

res-life changed the title ~~get-json-object: main flow~~ [WIP] get-json-object: main flow Mar 13, 2024

res-life requested a review from revans2 March 13, 2024 11:05

res-life self-assigned this Mar 13, 2024

res-life mentioned this pull request Mar 13, 2024

[FEA] GetJsonObject: Implement get-json-object main function #1832

Closed

revans2 reviewed Mar 13, 2024

View reviewed changes

res-life force-pushed the get-json-object-main-flow branch from 91c7b00 to dddd5de Compare March 14, 2024 11:44

SurajAralihalli mentioned this pull request Mar 14, 2024

[WIP] Implement JSON Generator to write Output from Parser. #1865

Closed

res-life force-pushed the get-json-object-main-flow branch 2 times, most recently from 0d8ae62 to 6d8584c Compare March 15, 2024 11:30

res-life marked this pull request as ready for review March 15, 2024 11:35

SurajAralihalli mentioned this pull request Mar 15, 2024

[FEA] GetJsonObject: Implement JSON generator to print JSON items #1831

Closed

4 tasks

res-life changed the title ~~[WIP] get-json-object: main flow~~ get-json-object: main flow Mar 18, 2024

res-life mentioned this pull request Mar 18, 2024

[BUG] GetJsonObject does not validate the input is JSON in the same way as Spark NVIDIA/spark-rapids#10194

Closed

revans2 reviewed Mar 18, 2024

View reviewed changes

ttnghia reviewed Mar 18, 2024

View reviewed changes

res-life mentioned this pull request Mar 19, 2024

[BUG] GetJsonObject does not process escape sequences in returned strings or queries NVIDIA/spark-rapids#10196

Closed

This was referenced Mar 20, 2024

[BUG] GpuGetJsonObject does not expand escaped characters NVIDIA/spark-rapids#9033

Closed

[FEA] Fix GetJsonObject NVIDIA/spark-rapids#10254

Open

res-life force-pushed the get-json-object-main-flow branch from 7f787c1 to fcd9d7c Compare March 20, 2024 03:36

res-life force-pushed the get-json-object-main-flow branch from fcd9d7c to 7a7efa5 Compare March 20, 2024 04:07

get-json-object: main flow

5de7a3e

Signed-off-by: Chong Gao <res_life@163.com>

res-life force-pushed the get-json-object-main-flow branch from 7a7efa5 to 5de7a3e Compare March 20, 2024 04:09

res-life merged commit 304bf71 into NVIDIA:get-json-object-feature Mar 20, 2024
2 checks passed

res-life deleted the get-json-object-main-flow branch March 20, 2024 05:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get-json-object: main flow #1868

get-json-object: main flow #1868

res-life commented Mar 13, 2024 •

edited

Loading

revans2 left a comment

revans2 Mar 13, 2024

res-life Mar 15, 2024

revans2 Mar 13, 2024

res-life Mar 15, 2024

res-life commented Mar 15, 2024 •

edited

Loading

res-life commented Mar 18, 2024

revans2 Mar 18, 2024

ttnghia Mar 18, 2024

res-life Mar 19, 2024

ttnghia Mar 18, 2024

res-life Mar 19, 2024

ttnghia Mar 18, 2024 •

edited

Loading

res-life Mar 19, 2024

ttnghia Mar 18, 2024 •

edited

Loading

res-life Mar 19, 2024

ttnghia Mar 18, 2024

res-life Mar 19, 2024

ttnghia Mar 18, 2024

res-life Mar 19, 2024

ttnghia Mar 18, 2024 •

edited

Loading

res-life Mar 19, 2024

ttnghia Mar 18, 2024

res-life Mar 19, 2024

ttnghia commented Mar 18, 2024 •

edited

Loading

ttnghia Mar 18, 2024

res-life Mar 19, 2024

ttnghia Mar 18, 2024

res-life Mar 19, 2024

ttnghia Mar 20, 2024

res-life Mar 22, 2024

res-life commented Mar 19, 2024

res-life commented Mar 20, 2024

	* "success but no data" return cases. For example, if you are reading the
	* "success but no data" return cases. For example, if you are reading the

	* (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)
	* (`parse_result::ERROR`) or you get nothing back (`parse_result::EMPTY`).

get-json-object: main flow #1868

get-json-object: main flow #1868

Conversation

res-life commented Mar 13, 2024 • edited Loading

revans2 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

res-life commented Mar 15, 2024 • edited Loading

res-life commented Mar 18, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ttnghia Mar 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ttnghia Mar 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ttnghia Mar 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ttnghia commented Mar 18, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

res-life commented Mar 19, 2024

res-life commented Mar 20, 2024

res-life commented Mar 13, 2024 •

edited

Loading

res-life commented Mar 15, 2024 •

edited

Loading

ttnghia Mar 18, 2024 •

edited

Loading

ttnghia Mar 18, 2024 •

edited

Loading

ttnghia Mar 18, 2024 •

edited

Loading

ttnghia commented Mar 18, 2024 •

edited

Loading