-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add @Utf16String annotation to map java.lang.String to unsigned short* #442
Conversation
|
You are right, I replaced it with GetStringRegion. |
GetStringRegion() is probably not going to be any faster than CallObjectMethod(), but I may be wrong. What is the difference in speed?
I don't see where StringAdapter assumes that wchar_t is 16-bit. Where does that happen? C++11 shouldn't be used if it's not necessary, yes, but of course there isn't any problem when the classes we want to map are only available in C++11 anyway. |
Difference in speed comes from not having intermediary CharPointer. I did some tests on OpenJDK 11 on Linux, and passing Of course, these were very synthetic tests, but difference is there.
|
Interesting, I did a bit of micro benchmarking with a small ASCII string and GetStringRegion() seems to be as fast as GetByteArrayElements(), so it does seem like it's storing strings as UTF-16 internally, even for Java 11 even though it shouldn't be: We can cast the data from StringAdapter to
If what you'd like to use are char16_t and char32_t, that's only available from C++11, yes. We can probably add those to StringAdapter with some |
Let's try to think of friendlier names for those annotations. Under C11 and C++11, Qt, probably the most popular platform in C++, has methods like |
Actually, to be consistent with |
OpenJDK typedefs
You mean, using
I think you meant |
Well, "modified" is not in the names of the corresponding JNI functions, so I don't think we need to be too worried about that. You want to have it like @AsModifiedUtf8? Why not I suppose. I don't think anyone needs modified UTF-8, but it's faster than going through Java code. That would be pretty much the only reason why one may want to use that I believe.
Yes, |
@@ -937,23 +938,44 @@ boolean classes(boolean handleExceptions, boolean defineAdapters, boolean conver | |||
out.println("}"); | |||
out.println(); | |||
if (handleExceptions || convertStrings) { | |||
out.println("static JavaCPP_noinline jstring JavaCPP_createString(JNIEnv* env, const char* ptr) {"); | |||
out.println("#include <string>"); | |||
out.println("static JavaCPP_noinline jstring JavaCPP_createStringUTF8(JNIEnv* env, const char* ptr, size_t length) {"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Methods like String(byte[])
and String.getBytes()
actually work for the native charset of the underlying OS, not necessarily UTF-8. We could add custom methods that somehow work with a user-specified encoding, but the default behavior of those methods should stay the same, and they shouldn't be named "UTF8".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I missed that.
I actually started to implement a more generic solution that would allow to pass any Charset to createString/getBytes (so that String could be created from byte array in any encoding), when I thought: how do we determine that byte array returned from getBytes() can be safely null-terminated?. I mean, when we are using default charset, there is no guarantee that returned array won't have null byte somewhere in the middle. Do we just shift responsibility to user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The null character gets added here manually after GetByteArrayRegion(), just like you did with GetStringRegion():
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/tools/Generator.java#L972
As for passing a Charset, I'm thinking something defined globally at compile time as per issue #70 (comment) would be best. I'm not sure how we'd go about passing a Java object rather than just a string like that though. The latter entails potential exceptions at initialization time, but I guess that wouldn't be so bad. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about passing a charset name to getBytes(java.lang.String)
/String(byte[], java.lang.String)
overloads. Generator would extract charset name from annotation, hardcode it in C++ code which will create jstring from it using NewStringUTF (we can use modified utf-8 here I believe) and pass it to to getString()/String(). It has an advantage over define in that you could use different charsets for different functions.
The downside is that these NewStringUTF() calls will introduce additional overhead (JMH said that it would be almost 50% slower than using String(byte[])
) unless you find some way to cache them.
Setting encoding via define would of course also require passing jstring with charset name (unless you meant some other method?) but then you can easily cache it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking something like this too, but the issue with passing a charset name like that for every call is that it may through an exception at runtime, any time. It's better to resolve it at load time into a Charset object.
…(array of UTF-16 code units)
BTW, to obtain even higher performance, we should also add support for these functions via |
Currently, if your native string type is UTF-16 encoded, and you want to map it to java.lang.String, you need to expose it as CharPointer and call its getString() method. It works, but it involves creation of intermediary Java object (CharPointer), C++ -> Java method call to inititalize CharPointer, and additional array copy.
This patch allows to do it much faster by directly using NewString() JNI function to directly create String from
const jchar*
(akaconst unsigned short*
).