-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New implementation of getJsonObject #1893
Conversation
* Add Json Parser; Add Json Parser utility; Define internal interfaces; Copy get-json-obj CUDA code from cuDF; Signed-off-by: Chong Gao <res_life@163.com> * Code format --------- Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com>
…1863) Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com>
* Add JNI for GetJsonObject Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * clean up Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Parse json path in plugin Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Apply suggestions from code review Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com> * Use table_view Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Update java Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * Apply suggestions from code review Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com> * clean up Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * use matched enum for type Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * clean up Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * upmerge Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * format Signed-off-by: Haoyang Li <haoyangl@nvidia.com> --------- Signed-off-by: Haoyang Li <haoyangl@nvidia.com> Co-authored-by: Nghia Truong <7416935+ttnghia@users.noreply.github.com>
Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com>
* Optimize match_current_field_name using less memory Signed-off-by: Chong Gao <res_life@163.com> * Convert a function to device code * Add a JNI test case * Add JNI test case * Change nesting depth to 4 * Change nesting depth to 8 to fix test Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * remove clang format change Signed-off-by: Haoyang Li <haoyangl@nvidia.com> --------- Signed-off-by: Chong Gao <res_life@163.com> Signed-off-by: Haoyang Li <haoyangl@nvidia.com> Co-authored-by: Chong Gao <res_life@163.com>
* Change recursive to iterative Signed-off-by: Chong Gao <res_life@163.com> --------- Signed-off-by: Chong Gao <res_life@163.com> Co-authored-by: Chong Gao <res_life@163.com>
build |
@ttnghia Help review |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
build |
build |
src/main/cpp/src/JSONUtilsJni.cpp
Outdated
|
||
extern "C" { | ||
JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_JSONUtils_getJsonObject( | ||
JNIEnv* env, jclass, jlong input_column, jint size, jobjectArray path_instructions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why do we need size if the path_instructions is an array that stores the size in it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
for (int i = 0; i < size; i++) { | ||
jobject instruction = env->GetObjectArrayElement(path_instructions, i); | ||
JNI_NULL_CHECK(env, instruction, "path_instruction is null", 0); | ||
jclass instruction_class = env->GetObjectClass(instruction); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we have a follow on where we change this to be 3 arrays? one that is an int array for the type, another that is a string array, and the last one would be a long array? This would reduce the number of reflection calls that would need to be done. This is very minor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humnn, probably I was the one that asked for using array of structs instead of individual arrays. But yeah for data transfer from Java to C++ then individual arrays seem to be better. I'm sorry @res-life.
Edit: I have a better idea. We want to have array of structs because that is simpler to iterate through the instructions. With individual arrays, we still can use thrust::zip_iterator
to zip together multiple arrays with just one iterator and iterate through them easily. Of course this is not required to be in this PR but future work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thrust::zip_iterator
looks good, will do in a follow-on
@Test | ||
void getJsonObjectTest() { | ||
JSONUtils.PathInstructionJni[] query = new JSONUtils.PathInstructionJni[2]; | ||
query[0] = new JSONUtils.PathInstructionJni(JSONUtils.PathInstructionType.KEY, "", -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we support arrays yet? I don't see any tests that include an array index that is not -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a case to query $[1][2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
public enum PathInstructionType { | ||
SUBSCRIPT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would prefer to have the enum to int mapping be specific instead of using the ordinal. It is minor, but helps to prevent issues where one is updated and the other is missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do in a follow-on
NAMED | ||
} | ||
|
||
public static class PathInstructionJni { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have some javadocs on this, and the constructor. It is not always clear what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do in a follow-on
} | ||
} | ||
|
||
public static ColumnVector getJsonObject(ColumnVector input, int size, PathInstructionJni[] path_instructions) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again not sure why size is needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flush my first round of review for quick response while continuing the review process. Overall, I don't find any big concerns since we are currently prioritizing correctness over clean code. But definitely we need to have follow up work with a lot of things to do.
src/main/cpp/src/get_json_object.cu
Outdated
namespace spark_rapids_jni { | ||
|
||
namespace detail { | ||
// namespace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: left over for removal. Or it should be uncommented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, removed.
src/main/cpp/src/get_json_object.cu
Outdated
__device__ bool evaluate_path(json_parser<>& p, | ||
json_generator<>& g, | ||
write_style style, | ||
path_instruction const* path_ptr, | ||
int path_size) | ||
{ | ||
return path_evaluator::evaluate_path(p, g, style, path_ptr, path_size); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this is just calling to another function. Why do we ever need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, removed.
src/main/cpp/src/get_json_object.cu
Outdated
* @param commands The command buffer to be applied to the string. Always ends | ||
* with a path_operator_type::END | ||
* @param out_buf Buffer user to store the results of the query (nullptr in the | ||
* size computation step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is better for readability.
* @param commands The command buffer to be applied to the string. Always ends | |
* with a path_operator_type::END | |
* @param out_buf Buffer user to store the results of the query (nullptr in the | |
* size computation step) | |
* @param commands The command buffer to be applied to the string. Always ends | |
* with a path_operator_type::END | |
* @param out_buf Buffer user to store the results of the query (nullptr in the | |
* size computation step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/main/cpp/src/get_json_object.cu
Outdated
* This kernel operates in a 2-pass way. On the first pass, it computes | ||
* output sizes. On the second pass it fills in the provided output buffers | ||
* (chars and validity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* This kernel operates in a 2-pass way. On the first pass, it computes | |
* output sizes. On the second pass it fills in the provided output buffers | |
* (chars and validity) | |
* This kernel operates in a 2-pass way. On the first pass it computes the | |
* output sizes. On the second pass, it fills in the provided output buffers | |
* (chars and validity). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/main/cpp/src/get_json_object.hpp
Outdated
CUDF_HOST_DEVICE context& operator=(const context& other) | ||
{ | ||
token = other.token; | ||
case_path = other.case_path; | ||
g = other.g; | ||
style = other.style; | ||
path_ptr = other.path_ptr; | ||
path_size = other.path_size; | ||
task_is_done = other.task_is_done; | ||
dirty = other.dirty; | ||
is_first_enter = other.is_first_enter; | ||
child_g = other.child_g; | ||
|
||
return *this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to have nothing special, so just leave the compiler do it.
CUDF_HOST_DEVICE context& operator=(const context& other) | |
{ | |
token = other.token; | |
case_path = other.case_path; | |
g = other.g; | |
style = other.style; | |
path_ptr = other.path_ptr; | |
path_size = other.path_size; | |
task_is_done = other.task_is_done; | |
dirty = other.dirty; | |
is_first_enter = other.is_first_enter; | |
child_g = other.child_g; | |
return *this; | |
CUDF_HOST_DEVICE context& operator=(context const&) = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/main/cpp/src/get_json_object.hpp
Outdated
* input json string is invalid. | ||
*/ | ||
std::unique_ptr<cudf::column> get_json_object( | ||
cudf::strings_column_view const& col, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cudf::strings_column_view const& col, | |
cudf::strings_column_view const& input, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
template <bool allow_single_quotes = curr_allow_single_quotes, | ||
bool allow_unescaped_control_chars = curr_allow_unescaped_control_chars, | ||
int max_json_nesting_depth = curr_max_json_nesting_depth, | ||
int max_string_utf8_bytes = curr_max_string_utf8_bytes, | ||
int max_num_len = curr_max_num_len, | ||
bool allow_tailing_sub_string = curr_allow_tailing_sub_string> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Oh no, too many template parameters. They will make compilation very slow. We should better use run time parameters, except only the parameters that need to be known at compile time. In future work we have to clean up this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only have one template combination.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do in a follow-on.
/** | ||
* write style when writing out JSON string | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment for the docs in this file is similar to the previous one: Don't use non-standard oxygen. I'm fine to leave them for now but we have to cleanup in the follow up work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do in a follow-on.
src/main/cpp/src/json_parser.hpp
Outdated
/** | ||
* is current position EOF | ||
*/ | ||
CUDF_HOST_DEVICE inline bool eof(char const* pos) { return pos >= json_end_pos; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also CUDF_HOST_DEVICE
s need to be only __device__
if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Yes, follow up things:
|
build |
|
||
String JSON = "[100.0,200.000,351.980]"; | ||
String expectedStr = "[100.0,200.000,351.980]"; | ||
try ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it is a temp test data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
* Support number normalization Signed-off-by: Haoyang Li <haoyangl@nvidia.com> * delete cpp test and add a java test case Signed-off-by: Haoyang Li <haoyangl@nvidia.com> --------- Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
/** | ||
* write JSON style | ||
*/ | ||
enum class write_style { raw_style, quoted_style, flatten_style }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Enum class values should be ALL_CAP
. In addition, they don't have to all have the suffix _style
.
enum class write_style { raw_style, quoted_style, flatten_style }; | |
enum class write_style { RAW, QUOTED, FLATTEN }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, why path_instruction_type
is outside while write_style
is inside namespace detail
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/main/cpp/src/JSONUtilsJni.cpp
refers to it.
case 't': | ||
if (nullptr != copy_dest && write_style::unescaped == w_style) { *copy_dest++ = '\t'; } | ||
if (copy_dest != nullptr && write_style::escaped == w_style) { | ||
if (copy_dest != nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check again: copy_dest != nullptr, it is redudant.
Go through all copy_dest
and w_style
related paths.
And add cases.
Could we consider changing the title of this PR? Perhaps "New implementation of getJsonObject", or something more descriptive? |
bool scientificNotation = (exp < -3) || (exp >= 7); | ||
|
||
// Values in the interval [1E-3, 1E7) are special. | ||
if (scientificNotation) { | ||
// Print in the format x.xxxxxE-yy. | ||
for (uint32_t i = 0; i < olength - 1; ++i) { | ||
uint32_t const c = output % 10; | ||
for (int i = 0; i < olength - 1; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary to change this, but we could have gotten away with the following:
for (int i = 0; i < olength - 1; ++i) { | |
for (auto i = 0; i < olength - 1; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, let's use int32_t
consistently. No need to bring int
into it also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do in a follow-on.
Done. |
// no NaN in json | ||
if (exponent) { | ||
if (sign) { | ||
memcpy(result, "\"-Infinity\"", 11); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Feel free to ignore:
I had to count the characters to make sure 11
is right. :]
One trick is to use sizeof("\"-Infinity\"") - 1
. That is constexpr
and should yield 11
.
I suspect strlen("\"-Infinity\"")
is more readable, but I'm not sure it's constexpr
in __device__
. (I know it is in __host__
C++.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do in a follow-on.
@@ -1424,4 +1475,15 @@ __device__ inline int format_float(double value, int digits, bool is_float, char | |||
} | |||
} | |||
|
|||
//===== json_parser utility ===== | |||
|
|||
__device__ inline int double_normalization(double value, char* output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's consider renaming this to normalize_double()
.
Also, consider a function doc line indicating the protocol of returning the number of characters if output
is nullptr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do in a follow-on.
src/main/cpp/src/get_json_object.hpp
Outdated
: output(nullptr), output_len(0), hide_outer_array_tokens(_hide_outer_array_tokens) | ||
{ | ||
} | ||
__device__ json_generator<>& operator=(const json_generator<>& other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__device__ json_generator<>& operator=(const json_generator<>& other) | |
__device__ json_generator<>& operator=(json_generator<> const& other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/** | ||
* write JSON style | ||
*/ | ||
enum class write_style { raw_style, quoted_style, flatten_style }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, why path_instruction_type
is outside while write_style
is inside namespace detail
?
/** | ||
* path evaluator which can run on both CPU and GPU | ||
*/ | ||
struct path_evaluator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Device code and/or class with device code should be placed in .cu
or .cuh
files. .hpp
files should never contain device code.
constexpr int max_path_depth = 32; | ||
|
||
// stack | ||
context stack[max_path_depth]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this declaration but feel really nervous about it. Typically, a GPU thread has very limited resources (memory/registers etc). Reserving a large array in a thread will stress on the resources, which results in fewer number of threads per block and reduces performance. Probably the better way is to declare this as a shared memory array instead. I'm not 100% sure until we can run profiling on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, the code below also has a lot of if/else branches without any thread sync. Thus, thread divergence may occur which causes further slowdown. I have little idea about what can we improve for that huge while
loop 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed a following on perf issue:
#1907
enum class write_style { | ||
// e.g.: '\\r' is a string with 2 chars '\' 'r', writes 1 char '\r' | ||
unescaped, | ||
|
||
// * e.g.: '"' is a string with 1 char '"', writes out 4 chars '"' '\' '\"' | ||
// '"' | ||
escaped | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enum class write_style { | |
// e.g.: '\\r' is a string with 2 chars '\' 'r', writes 1 char '\r' | |
unescaped, | |
// * e.g.: '"' is a string with 1 char '"', writes out 4 chars '"' '\' '\"' | |
// '"' | |
escaped | |
}; | |
enum class write_style { | |
// e.g.: '\\r' is a string with 2 chars '\' 'r', writes 1 char '\r' | |
UNESCAPED, | |
// * e.g.: '"' is a string with 1 char '"', writes out 4 chars '"' '\' '\"' | |
// '"' | |
ESCAPED | |
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in a follow-on
class json_parser { | ||
public: | ||
__device__ inline json_parser(char const* const _json_start_pos, cudf::size_type const _json_len) | ||
: json_start_pos(_json_start_pos), | ||
json_end_pos(_json_start_pos + _json_len), | ||
curr_pos(_json_start_pos) | ||
{ | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, device code should be in .cuh
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in a follow-on
|
||
__device__ inline int double_normalization(double value, char* output) | ||
{ | ||
if (output == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't fully grasped the control flow, so this might be a silly observation:
We might be inadvertently checking output == nullptr
multiple times, both from the call-site and here. Since this is a __device__
function, this might amount to a lot of calls.
In a follow-up, there might be value in separating those call chains. Computing the sizes can be done separately from actually writing the normalized value. Just a thought for future consideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in a follow-on
auto const& [type, name, index] = inst; | ||
switch (type) { | ||
case path_instruction_type::SUBSCRIPT: | ||
path_commands.emplace_back(path_instruction{path_instruction_type::SUBSCRIPT}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: If we're using emplace_back
, we might be able to get away with:
path_commands.emplace_back(path_instruction{path_instruction_type::SUBSCRIPT}); | |
path_commands.emplace_back(path_instruction_type::SUBSCRIPT); |
build |
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Done for the JNI case: |
build |
Now the
@ttnghia I think it's safe to merge this PR now. |
build |
Sorry, one more fix. |
Is there any chance to run |
Tests can pass with sanitizer on. Merging it… |
closes #1891
Add a new version of Get json object to replace old version
Signed-off-by: Haoyang Li haoyangl@nvidia.com
Signed-off-by: Chong Gao res_life@163.com