get-json-object: main flow #1868
This generally looks like it follows the spark code, but it is kind of hard to tell until we actually have something working.
src/main/cpp/src/get_json_object.cu
Outdated
  int len = parser.write_unescaped_text(output + output_len);
  output_len += len;
} else {
  int len = parser.compute_unescaped_text(output + output_len);
Why exactly does compute_unescaped_text take an output at all?
Done. It is now compute_unescaped_len().
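The point of the exchange above is that a length-only pass has nothing to write, so it should not take an output pointer. A minimal sketch of that two-pass pattern follows. The free-function signatures, the helper unescape_char, and the wrapper unescape are assumptions made for illustration (in the PR these are parser methods whose exact signatures are not shown), and only simple one-character escapes are handled.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps the character after a backslash to the byte it
// stands for. \uXXXX escapes are out of scope for this sketch.
inline char unescape_char(char c)
{
  switch (c) {
    case 'n': return '\n';
    case 'r': return '\r';
    case 't': return '\t';
    case 'b': return '\b';
    case 'f': return '\f';
    default: return c;  // covers \" \\ \/ and \'
  }
}

// Pass 1: only counts output bytes, so no output pointer is needed.
inline int compute_unescaped_len(const char* in, int in_len)
{
  int len = 0;
  for (int i = 0; i < in_len; ++i) {
    if (in[i] == '\\' && i + 1 < in_len) { ++i; }  // escape pair yields one byte
    ++len;
  }
  return len;
}

// Pass 2: writes the unescaped bytes and returns how many were written.
inline int write_unescaped_text(const char* in, int in_len, char* out)
{
  int len = 0;
  for (int i = 0; i < in_len; ++i) {
    char c = in[i];
    if (c == '\\' && i + 1 < in_len) { c = unescape_char(in[++i]); }
    out[len++] = c;
  }
  return len;
}

// Hypothetical wrapper: size the buffer with pass 1, fill it with pass 2.
inline std::string unescape(const std::string& s)
{
  int const n = compute_unescaped_len(s.data(), static_cast<int>(s.size()));
  std::string out(n, '\0');
  int const written = write_unescaped_text(s.data(), static_cast<int>(s.size()), &out[0]);
  assert(written == n);
  return out;
}
```

Both passes walk the input with the same rules, which is what keeps the computed size and the written size in agreement.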
src/main/cpp/src/get_json_object.cu
Outdated
  int path_commands_size,
  char* out_buf,
  size_t out_buf_size,
  json_parser_options options)  // TODO make this a reference? use a global singleton options?
Or just don't have a json_parser_options at all. We don't have a way to change them, so why not just hard code it?
OK, I'll hard code it.
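As a rough illustration of what hard coding could look like, the sketch below replaces a by-value options struct with compile-time constants. The option names, their values, and the helper functions are all hypothetical; the thread does not list the fields of json_parser_options.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical options, fixed at compile time instead of being passed around.
constexpr bool allow_single_quotes = true;
constexpr int max_nesting_depth    = 64;

// A character opens a string if it is a double quote, or a single quote when
// single-quoted strings are allowed.
inline bool is_string_delimiter(char c)
{
  return c == '"' || (allow_single_quotes && c == '\'');
}

// Nesting deeper than the fixed limit is rejected.
inline bool depth_ok(int depth) { return depth <= max_nesting_depth; }

}  // namespace sketch
```

With constants like these there is no parameter to copy per call and no question of reference versus singleton.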
91c7b00 to dddd5de
0d8ae62 to 6d8584c
@ttnghia Please help review. This is ready for integration testing with the JSON generator. [done]: Add more integration cases
Added more UT cases from link
src/main/cpp/src/get_json_object.cu
Outdated
 * "success but no data" return cases. For example, if you are reading the
 * values of an array you might call a parse function in a while loop. You
 * would want to continue doing this until you either encounter an error
 * (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)
But there is no parse_result::EMPTY in the enum?
Same question.
Removed parse_result. We do not need parse_result::EMPTY. If evaluate_path succeeds and the result is empty, it sets output_size to 0 and returns true.
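The convention described here (a bool for success, plus a size that may be 0) can distinguish an error from "success but no data" without a three-valued enum. Below is a minimal sketch of that contract. The function extract_string_body is a hypothetical stand-in, not evaluate_path itself: it only copies the body of a double-quoted string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the bool + output_size contract.
//   returns false                      -> error
//   returns true, *output_size == 0    -> success but no data
//   returns true, *output_size  > 0    -> success with data
inline bool extract_string_body(const char* in,
                                std::size_t in_len,
                                char* out_buf,
                                std::size_t out_buf_size,
                                std::size_t* output_size)
{
  *output_size = 0;
  // Error: input is not a double-quoted string.
  if (in_len < 2 || in[0] != '"' || in[in_len - 1] != '"') { return false; }
  std::size_t const body_len = in_len - 2;
  // Error: the result does not fit into the output buffer.
  if (body_len > out_buf_size) { return false; }
  std::memcpy(out_buf, in + 1, body_len);
  *output_size = body_len;
  return true;
}
```

A caller checks the return value first and only then looks at output_size, so an empty result never has to be encoded as a separate status.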
src/main/cpp/src/get_json_object.cu
Outdated
 * @brief Result of calling a parse function.
 *
 * The primary use of this is to distinguish between "success" and
 * "success but no data" return cases. For example, if you are reading the
Suggested change:
-  * "success but no data" return cases. For example, if you are reading the
+  * "success but no data" return cases. For example, if you are reading the
Removed parse_result
src/main/cpp/src/get_json_object.cu
Outdated
 * "success but no data" return cases. For example, if you are reading the
 * values of an array you might call a parse function in a while loop. You
 * would want to continue doing this until you either encounter an error
 * (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)
Suggested change:
-  * (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)
+  * (`parse_result::ERROR`) or you get nothing back (`parse_result::EMPTY`).
Removed parse_result
src/main/cpp/src/get_json_object.cu
Outdated
  EMPTY,  // success, but no data
};
template <int max_json_nesting_depth>
CUDF_HOST_DEVICE bool evaluate_path(json_parser<max_json_nesting_depth>& p,
Nit: Typically we should not use the host+device qualifier too much; use it only for short computation functions. If the function does not need to run on the device, just declare it as a normal function. On the other hand, we can just declare it as __device__ if it doesn't run on the host.
Refer to this comment
src/main/cpp/src/get_json_object.hpp
Outdated
enum class write_style { raw_style, quoted_style, flatten_style };

/**
 * path instruction type
 */
enum class path_instruction_type { subscript, wildcard, key, index, named };
According to our current style, enum values are ALL_CAP.
Suggested change:
- enum class write_style { raw_style, quoted_style, flatten_style };
- /**
-  * path instruction type
-  */
- enum class path_instruction_type { subscript, wildcard, key, index, named };
+ enum class write_style { RAW_STYLE, ... };
+ /**
+  * path instruction type
+  */
+ enum class path_instruction_type { subscript, wildcard, key, index, named };
I'll update the enum to upper case after integration test is done.
namespace spark_rapids_jni {

/**
 * write style when writing out JSON string
 */
enum class write_style {
  // e.g.: '\\r' is a string with 2 chars '\' '\r', writes 1 char '\r'
  // e.g.: '\\r' is a string with 2 chars '\' 'r', writes 1 char '\r'
Nit: Sorry, again, following the style guide, enum values are ALL_CAP.
I'll update the enum to upper case after integration test is done.
                                 copy_destination,  // copy destination while parsing, nullptr means do not copy
                                 w_style);
  return try_parse_quoted_string(str_pos,
                                 '\'',
I'm not sure where this function will run (host or device?), as a lot of functions here are marked CUDF_HOST_DEVICE. Generally we should not do that; use only __device__ or no qualifier (host functions).
Anyway, this may crash, since it passes a char* pointer that may point to host memory into what may be a device function. If this is a host function, that's fine. If it is a device function, pass a string_scalar instead.
I'm not sure where this function will run (host or device?)
Refer to this comment
                                 copy_destination,  // copy destination while parsing, nullptr means do not copy
                                 w_style);
  return try_parse_quoted_string(str_pos,
                                 '\"',
Similarly, do not pass a string literal. Pass either std::string or string_scalar.
The \" char is passed by value, so it will be on the GPU stack; IMO it's safe to pass it as a parameter.
I'll test this end to end after this PR is done.
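To make the distinction concrete: the argument in the excerpts is a single char ('\'' or '\"'), not a pointer to a string literal, so the callee receives its own copy of the value. The sketch below shows a scanner that takes the quote character by value. Both function names and signatures are hypothetical and are not the PR's try_parse_quoted_string.

```cpp
#include <cassert>

// Hypothetical sketch: scans a quoted string starting at s[pos] and returns
// the index one past the closing quote, or -1 if the string is malformed.
// quote_char is a plain char passed by value, so no pointer to host memory
// is involved.
inline int skip_quoted_string(const char* s, int len, int pos, char quote_char)
{
  if (pos >= len || s[pos] != quote_char) { return -1; }
  for (int i = pos + 1; i < len; ++i) {
    if (s[i] == '\\') {
      ++i;  // skip the escaped character as well
      continue;
    }
    if (s[i] == quote_char) { return i + 1; }
  }
  return -1;  // unterminated string
}

// Dispatches on the opening character, accepting either quote style.
inline int skip_any_string(const char* s, int len, int pos)
{
  if (pos < len && s[pos] == '\'') { return skip_quoted_string(s, len, pos, '\''); }
  return skip_quoted_string(s, len, pos, '"');
}
```

The reviewer's concern would apply if the parameter were a const char* pointing at host memory; with a by-value char that particular failure mode does not arise.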
I just did another review iteration. What's most confusing is Will recirculate another more careful review round today. |
src/main/cpp/src/get_json_object.cu
Outdated
 * @param j_parser The incoming json string and associated parser
 * @param path_ptr The command buffer to be applied to the string.
 * @param path_size Command buffer size
 * @param output Buffer used to store the results of the query
 * @returns A result code indicating success/fail/empty.
 */
template <int max_json_nesting_depth = curr_max_json_nesting_depth>
__device__ parse_result parse_json_path(json_parser<max_json_nesting_depth>& j_parser,
I see that this function is called only once? If so, we can just eliminate it completely and put the code at the caller.
Updated, added annotation: inline
#include <memory>

namespace spark_rapids_jni {

namespace detail {
So everything in this file and json_parser.hpp is running on the host CPU?
In production, json_parser.hpp runs on the GPU, but the unit test cases in json_parser_tests.cpp run on the CPU.
json_parser
json_start_pos
scans the JSON string view on GPU and writes output to cuDF string column.
json_parser
is straightforward, it scans char array(parameter is char *
) and writes to char buffer (parameter is char *
). So it's convenient to use CUDF_HOST_DEVICE
to test the logic.
Sorry for introduced extra HOST
annotation.
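For context on why the host annotation made CPU testing convenient, here is a minimal sketch. SKETCH_HOST_DEVICE is a hypothetical macro written for this example; it assumes, as I understand CUDF_HOST_DEVICE to work, that the qualifiers are only emitted when compiling with nvcc. The two functions are illustrative and not taken from json_parser.hpp.

```cpp
#include <cassert>

// Hypothetical macro: expands to the CUDA qualifiers under nvcc and to
// nothing under a plain host compiler.
#ifdef __CUDACC__
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Pure pointer-and-length logic like this compiles for both targets.
SKETCH_HOST_DEVICE inline bool is_json_whitespace(char c)
{
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Returns the first position at or after pos that is not JSON whitespace.
SKETCH_HOST_DEVICE inline int skip_whitespace(const char* s, int len, int pos)
{
  while (pos < len && is_json_whitespace(s[pos])) { ++pos; }
  return pos;
}
```

A host-only test can call skip_whitespace directly, which is the convenience described above. It does not address the follow-up concern that host and device execution may differ, which is an argument for running the tests on GPU data.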
Although the code can run on both CPU and GPU, I suspect that some parts may not run exactly the same way in both places. We had better implement it only as GPU code and run the unit tests on GPU data.
OK, will update
Note: the integration test (between JSON generator, JSON parser, Main flow) is done. |
7f787c1 to fcd9d7c
@ttnghia Merge this to feature branch to unblock the End to End testing. |
fcd9d7c to 7a7efa5
Signed-off-by: Chong Gao <res_life@163.com>
7a7efa5 to 5de7a3e
* get-json-object: Add JSON parser and parser utility (#1836)
* Add Json Parser; Add Json Parser utility; Define internal interfaces; Copy get-json-obj CUDA code from cuDF;
Signed-off-by: Chong Gao <res_life@163.com>
* Code format
---------
Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>
* get-json-object: match current field name (#1857)
Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>
* get-json-object: add utility write_escaped_text for JSON generator (#1863)
Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>
* Add JNI for GetJsonObject (#1862)
* Add JNI for GetJsonObject
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* clean up
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* Parse json path in plugin
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* Apply suggestions from code review
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
* Use table_view
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* Update java
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* Apply suggestions from code review
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
* clean up
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* use matched enum for type
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* clean up
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* upmerge
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* format
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
---------
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
* get-json-object: main flow (#1868)
Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>
* Optimize memory usage in match_current_field_name (#1889)
* Optimize match_current_field_name using less memory
Signed-off-by: Chong Gao <res_life@163.com>
* Convert a function to device code
* Add a JNI test case
* Add JNI test case
* Change nesting depth to 4
* Change nesting depth to 8 to fix test
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* remove clang format change
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
---------
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Co-authored-by: Chong Gao <res_life@163.com>
* get-json-object: Recursive to iterative (#1890)
* Change recursive to iterative
Signed-off-by: Chong Gao <res_life@163.com>
---------
Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>
* Fix bug
* Format
* Use uppercase for path_instruction_type
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* Add test cases from Baidu
* Fix escape char error; add test case
* getJsonObject number normalization (#1897)
* Support number normalization
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* delete cpp test and add a java test case
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
---------
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* Add test case
* Fix a escape/unescape size bug
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* Fix bug: handle leading zeros for number; Refactor
* Apply suggestions from code review
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
* Address comments
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* fix java test
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* Add test cases; Fix a bug
* follow up escape/unescape bug fix
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
* Minor refactor
* Add a case; Fix bug
---------
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Co-authored-by: Chong Gao <res_life@163.com>
Co-authored-by: Haoyang Li <haoyangl@nvidia.com>
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
closes #1832
Signed-off-by: Chong Gao res_life@163.com