Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get-json-object: main flow #1868

Merged

Conversation

res-life
Copy link
Collaborator

@res-life res-life commented Mar 13, 2024

closes #1832

  • Implementation of get-json-object main flow
  • integration test with JSON generator

Signed-off-by: Chong Gao res_life@163.com

@res-life res-life changed the title get-json-object: main flow [WIP] get-json-object: main flow Mar 13, 2024
@res-life res-life requested a review from revans2 March 13, 2024 11:05
@res-life res-life self-assigned this Mar 13, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks like it follows the spark code, but it is kind of hard to tell until we actually have something working.

int len = parser.write_unescaped_text(output + output_len);
output_len += len;
} else {
int len = parser.compute_unescaped_text(output + output_len);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly does the compute_unescaped_text take an output at all?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Now:
compute_unescaped_len()

int path_commands_size,
char* out_buf,
size_t out_buf_size,
json_parser_options options) // TODO make this a reference? use a global singleton options?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just don't have a json_parser_options at all. We don't have a way to change them, so why not just hard code it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll hard code it.

@res-life res-life force-pushed the get-json-object-main-flow branch from 91c7b00 to dddd5de Compare March 14, 2024 11:44
@res-life res-life force-pushed the get-json-object-main-flow branch 2 times, most recently from 0d8ae62 to 6d8584c Compare March 15, 2024 11:30
@res-life res-life marked this pull request as ready for review March 15, 2024 11:35
@res-life
Copy link
Collaborator Author

res-life commented Mar 15, 2024

@ttnghia Help review.

Ready to do integration test with JSON generator.
Will fix more bugs when doing integration test.

[done]: Add more integration cases

@res-life res-life changed the title [WIP] get-json-object: main flow get-json-object: main flow Mar 18, 2024
@res-life
Copy link
Collaborator Author

Added more UT cases from link

* "success but no data" return cases. For example, if you are reading the
* values of an array you might call a parse function in a while loop. You
* would want to continue doing this until you either encounter an error
* (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there is no parse_result::EMPTY in the enum?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed parse_result
We do not need parse_result::EMPTY.
If evaluate_path success and result is empty, sets output_size as 0 and evaluate_path returns true.

* @brief Result of calling a parse function.
*
* The primary use of this is to distinguish between "success" and
* "success but no data" return cases. For example, if you are reading the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* "success but no data" return cases. For example, if you are reading the
* "success but no data" return cases. For example, if you are reading the

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed parse_result

* "success but no data" return cases. For example, if you are reading the
* values of an array you might call a parse function in a while loop. You
* would want to continue doing this until you either encounter an error
* (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)
Copy link
Collaborator

@ttnghia ttnghia Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* (parse_result::ERROR) or you get nothing back (parse_result::EMPTY)
* (`parse_result::ERROR`) or you get nothing back (`parse_result::EMPTY`).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed parse_result

EMPTY, // success, but no data
};
template <int max_json_nesting_depth>
CUDF_HOST_DEVICE bool evaluate_path(json_parser<max_json_nesting_depth>& p,
Copy link
Collaborator

@ttnghia ttnghia Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Typically we should not use host+device qualifier too much. Only use it for short computation functions. If it does not need to be on device then just declare as a normal function. On the other hand, wee can just declare it as __device__ if it doesn't run on host.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to this comment

Comment on lines 36 to 41
enum class write_style { raw_style, quoted_style, flatten_style };

/**
* path instruction type
*/
enum class path_instruction_type { subscript, wildcard, key, index, named };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to our current style, enum values are ALL_CAP.

Suggested change
enum class write_style { raw_style, quoted_style, flatten_style };
/**
* path instruction type
*/
enum class path_instruction_type { subscript, wildcard, key, index, named };
enum class write_style { RAW_STYLE, ... };
/**
* path instruction type
*/
enum class path_instruction_type { subscript, wildcard, key, index, named };

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the enum to upper case after integration test is done.


namespace spark_rapids_jni {

/**
* write style when writing out JSON string
*/
enum class write_style {
// e.g.: '\\r' is a string with 2 chars '\' '\r', writes 1 char '\r'
// e.g.: '\\r' is a string with 2 chars '\' 'r', writes 1 char '\r'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Sorry, again, following style guide, enum values are ALL_CAP.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the enum to upper case after integration test is done.

copy_destination, // copy destination while parsing, nullptr means do not copy
w_style);
return try_parse_quoted_string(str_pos,
'\'',
Copy link
Collaborator

@ttnghia ttnghia Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where this function will run (host or device?), as a lot of functions here are marked as CUDF_HOST_DEVICE. Generally we should not do that, only __device__ or without marking (host functions).

Anyway, this may crash, since it is passing a char* pointer that may point to host memory but that pointer may be passed into device function. If that is a host function then it's fine. If it is a device function, pass a string_scalar instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where this function will run (host or device?)

Refer to this comment

copy_destination, // copy destination while parsing, nullptr means do not copy
w_style);
return try_parse_quoted_string(str_pos,
'\"',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, do not pass a string literal. Pass either std::string or string_scalar.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The \" char will be on GPU stack, IMO, it's safe to put it in the parameter.
I'll test this PR from end to end after this PR is done

@ttnghia
Copy link
Collaborator

ttnghia commented Mar 18, 2024

I just did another review iteration. What's most confusing is CUDA_HOST_DEVICE is used everywhere. If possible, please just use __device__ or none if the function does not run on GPU. That will reduce confusion when we want to see where the function is running to check the relevant context.

Will recirculate another more careful review round today.

* @param j_parser The incoming json string and associated parser
* @param path_ptr The command buffer to be applied to the string.
* @param path_size Command buffer size
* @param output Buffer used to store the results of the query
* @returns A result code indicating success/fail/empty.
*/
template <int max_json_nesting_depth = curr_max_json_nesting_depth>
__device__ parse_result parse_json_path(json_parser<max_json_nesting_depth>& j_parser,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this function is called only once? If so, we can just eliminate it completely and put the code at the caller.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, added annotation: inline

#include <memory>

namespace spark_rapids_jni {

namespace detail {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So everything in this file and json_parser.hpp are running on host CPU?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In production, json_parser.hpp runs on GPU.
But for the unit test cases in json_parser_tests.cpp, they are running on CPU.

json_parser json_start_pos scans the JSON string view on GPU and writes output to cuDF string column.

json_parser is straightforward, it scans char array(parameter is char *) and writes to char buffer (parameter is char *). So it's convenient to use CUDF_HOST_DEVICE to test the logic.
Sorry for introduced extra HOST annotation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the code can run on both CPU and GPU, I suspect that there may be some parts that cannot run exactly the same way on both places. We should better to implement it only as GPU code, and run unit tests on the GPU data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will update

@res-life
Copy link
Collaborator Author

Note: the integration test (between JSON generator, JSON parser, Main flow) is done.
The end to end integration test is not done.

@res-life
Copy link
Collaborator Author

@ttnghia Merge this to feature branch to unblock the End to End testing.
We can call another review when merging to Branch-24.04.

@res-life res-life force-pushed the get-json-object-main-flow branch from fcd9d7c to 7a7efa5 Compare March 20, 2024 04:07
Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life force-pushed the get-json-object-main-flow branch from 7a7efa5 to 5de7a3e Compare March 20, 2024 04:09
@res-life res-life merged commit 304bf71 into NVIDIA:get-json-object-feature Mar 20, 2024
2 checks passed
@res-life res-life deleted the get-json-object-main-flow branch March 20, 2024 05:18
thirtiseven added a commit that referenced this pull request Mar 27, 2024
* get-json-object:  Add JSON parser and parser utility (#1836)

* Add Json Parser;
Add Json Parser utility;
Define internal interfaces;
Copy get-json-obj CUDA code from cuDF;

Signed-off-by: Chong Gao <res_life@163.com>

* Code format

---------

Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>

* get-json-object: match current field name (#1857)

Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>

* get-json-object: add utility write_escaped_text for JSON generator (#1863)

Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>

* Add JNI for GetJsonObject (#1862)

* Add JNI for GetJsonObject

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* clean up

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Parse json path in plugin

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Apply suggestions from code review

Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

* Use table_view

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Update java

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Apply suggestions from code review

Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

* clean up

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* use matched enum for type

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* clean up

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* upmerge

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* format

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

---------

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

* get-json-object: main flow (#1868)

Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>

* Optimize memory usage in match_current_field_name (#1889)

* Optimize match_current_field_name using less memory

Signed-off-by: Chong Gao <res_life@163.com>

* Convert a function to device code

* Add a JNI test case

* Add JNI test case

* Change nesting depth to 4

* Change nesting depth to 8 to fix test

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* remove clang format change

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

---------

Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Co-authored-by: Chong Gao <res_life@163.com>

* get-json-object: Recursive to iterative (#1890)

* Change recursive to iterative

Signed-off-by: Chong Gao <res_life@163.com>

---------

Signed-off-by: Chong Gao <res_life@163.com>
Co-authored-by: Chong Gao <res_life@163.com>

* Fix bug

* Format

* Use uppercase for path_instruction_type

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Add test cases from Baidu

* Fix escape char error; add test case

* getJsonObject number normalization (#1897)

* Support number normalization

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* delete cpp test and add a java test case

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

---------

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Add test case

* Fix a escape/unescape size bug

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Fix bug: handle leading zeros for number; Refactor

* Apply suggestions from code review

Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>

* Address comments

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* fix java test

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Add test cases; Fix a bug

* follow up escape/unescape bug fix

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Minor refactor

* Add a case; Fix bug

---------

Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Co-authored-by: Chong Gao <res_life@163.com>
Co-authored-by: Haoyang Li <haoyangl@nvidia.com>
Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants