Add @Utf16String annotation to map java.lang.String to unsigned short* #442

equeim · 2020-12-16T11:35:49Z

Currently, if your native string type is UTF-16 encoded, and you want to map it to java.lang.String, you need to expose it as CharPointer and call its getString() method. It works, but it involves creation of intermediary Java object (CharPointer), C++ -> Java method call to inititalize CharPointer, and additional array copy.

This patch allows to do it much faster by directly using NewString() JNI function to directly create String from const jchar* (aka const unsigned short*).

saudet · 2020-12-16T11:45:14Z

NewString() is fine I suppose, but GetStringChars() does not return null terminated strings, so we can't use it in the general case.

equeim · 2020-12-19T13:49:17Z

You are right, I replaced it with GetStringRegion.
By the way, this doesn't work with StringAdapter("char16_t") when passing string to C++ on UNIX platforms yet, because of bug in StringAdapter (it assumes that unsigned short == wchar_t). I plan to rewrite StringAdapter in more generic way to fix this (also, am I correct that C++11 can't be used in generated C++ code?)

saudet · 2020-12-19T14:05:39Z

GetStringRegion() is probably not going to be any faster than CallObjectMethod(), but I may be wrong. What is the difference in speed?

By the way, this doesn't work with StringAdapter("char16_t") when passing string to C++ on UNIX platforms yet, because of bug in StringAdapter (it assumes that unsigned short == wchar_t). I plan to rewrite StringAdapter in more generic way to fix this (also, am I correct that C++11 can't be used in generated C++ code?)

I don't see where StringAdapter assumes that wchar_t is 16-bit. Where does that happen?

C++11 shouldn't be used if it's not necessary, yes, but of course there isn't any problem when the classes we want to map are only available in C++11 anyway.

equeim · 2020-12-19T16:02:49Z

Difference in speed comes from not having intermediary CharPointer.
In my Java code, I don't work with CharPointers, I work with Strings. When I pass them to C++, I convert them to CharPointer at the place of calling of native function, and when I pass CharPoiner to Java I call getString() on it immediately because I need java.lang.String, not CharPointer. These conversions add additional overhead.

I did some tests on OpenJDK 11 on Linux, and passing unsigned short* to Java in form of @Utf16String String is about 4x faster than passing it as CharPointer and calling getString() on it.
Passing @Utf16String String to C++ is about 12x faster than creating a new CharPointer from String and passing it to C++.

Of course, these were very synthetic tests, but difference is there.

I don't see where StringAdapter assumes that wchar_t is 16-bit. Where does that happen?

StringAdapter(const unsigned short* ptr, typename std::basic_string<T>::size_type size, void* owner) : ptr((T*)ptr), size(size), owner(owner),
        str2(ptr ? (T*)ptr : L"", ptr ? (size > 0 ? size : wcslen((wchar_t*)ptr)) : 0), str(str2) { }

operator     unsigned   short*() { return (unsigned short*)(operator wchar_t*)(); }

saudet · 2020-12-19T22:04:01Z

Interesting, I did a bit of micro benchmarking with a small ASCII string and GetStringRegion() seems to be as fast as GetByteArrayElements(), so it does seem like it's storing strings as UTF-16 internally, even for Java 11 even though it shouldn't be:
https://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16
Maybe the trick that they are using allows for fast copy anyway? Very interesting.

We can cast the data from StringAdapter to int as well, so that should cover pretty much all implementations out there:

            out.println("    operator     unsigned   short*() { return (unsigned short*)(operator wchar_t*)(); }");
            out.println("    operator       signed     int*() { return (  signed   int*)(operator wchar_t*)(); }");

If what you'd like to use are char16_t and char32_t, that's only available from C++11, yes. We can probably add those to StringAdapter with some if __cplusplus >= 201103L || _MSC_VER >= 1900 though. Have you tried?

saudet · 2021-01-22T02:29:12Z

Let's try to think of friendlier names for those annotations. Under C11 and C++11, jchar is better known as char16_t, so we may want to have something that casts directly to that instead of jchar, something like @AsChar16T, maybe? However, on second thought, char16_t is a typedef of uint_least16_t that is not guaranteed to be a 16-bit type, which is weird, so it wouldn't make sense to force the association with that I suppose. We may also want to get "modified UTF-8" for performance reasons, see thread about that in issue #70, and there's a char8_t for that in C++20 apparently, but it's not something widely used, yet, so char is sufficient for that for now, but if we name the annotation for that @AsChar8T, it would be majorly confusing. There's also char32_t in C11 and C++11, but there's no efficient method to retrieve code points with JNI, so I guess we don't need to think about that, for now.

Qt, probably the most popular platform in C++, has methods like toUtf8(), utf16(), etc, see https://doc.qt.io/qt-6/qstring.html, so maybe annotations like @AsUtf8 and @AsUtf16 would be clearest after all. We could have those casted to const char* and const char16_t*, respectively, by default. I think it would make sense to assume that char16_t is 16-bit on almost all platforms, similarly to how we're assuming that int is 32-bit on all platforms that Java runs on. That way we'd also be able to use more easily the new annotation from pull #448 this way: @StdU16String @AsUtf16 String myString; and similarly for "modified UTF-8" from JNI: @StdString @AsUtf8 String myString. What do you think?

saudet · 2021-01-22T03:13:56Z

Actually, to be consistent with CharPointer, in the case of @AsUtf16, we should cast to unsigned short*. We could compensate for that by making the @StdU16String annotation map to @Adapter("BasicStringAdapter<char16_t, unsigned short>"), which should work fine, and wouldn't force users into C++11.

equeim · 2021-01-27T22:21:55Z

Under C11 and C++11, jchar is better known as char16_t

OpenJDK typedefs jchar as unsigned short on both Linux and Windows, which is distinct type from char16_t. Android NDK typedefs it to uint16_t which itself is usually typedef to unsigned short (except on platforms where unsigned short is not 16 bit, but I don't think Android supports them). Also, char16_t is required to have the same properties as uint_least16_t, but is always a distinct type.
But I agree that @AsUtf16 looks better.

and similarly for "modified UTF-8" from JNI: @stdstring @AsUtf8 String myString

You mean, using @AsUtf8 to convert to/from Modifed UTF-8 instead of normal UTF-8? Seems counterintuitive to me, it definitely should be in the name. Also, is there anyone who actually needs to pass Modified UTF-8 strings to/from native code (instead of UTF-8)? I though that it is mostly used by JVM internally.

We could compensate for that by making the @StdU16String annotation map to @Adapter("BasicStringAdapter<char16_t, unsigned short>"), which should work fine, and wouldn't force users into C++11.

I think you meant BasicStringAdapter<unsigned short, char16_t>, since it's defined as template<typename P, typename T = P> (i.e. first parameter is pointer that adapter converts to/from, and second one is template parameter of basic_string). I'm not sure if we want to hardcode @StdU16String with BasicStringAdapter<unsigned short, char16_t> though, since P depends on pointer type. For example, for String with @AsUtf16 and CharPointer it will be unsigned short but for ShortPointer it will be short. But this is discussion for #448.

saudet · 2021-01-28T00:32:00Z

You mean, using @AsUtf8 to convert to/from Modifed UTF-8 instead of normal UTF-8? Seems counterintuitive to me, it definitely should be in the name. Also, is there anyone who actually needs to pass Modified UTF-8 strings to/from native code (instead of UTF-8)? I though that it is mostly used by JVM internally.

Well, "modified" is not in the names of the corresponding JNI functions, so I don't think we need to be too worried about that. You want to have it like @AsModifiedUtf8? Why not I suppose. I don't think anyone needs modified UTF-8, but it's faster than going through Java code. That would be pretty much the only reason why one may want to use that I believe.

I think you meant BasicStringAdapter<unsigned short, char16_t>, since it's defined as template<typename P, typename T = P> (i.e. first parameter is pointer that adapter converts to/from, and second one is template parameter of basic_string). I'm not sure if we want to hardcode @StdU16String with BasicStringAdapter<unsigned short, char16_t> though, since P depends on pointer type. For example, for String with @AsUtf16 and CharPointer it will be unsigned short but for ShortPointer it will be short. But this is discussion for #448.

Yes, BasicStringAdapter<unsigned short, char16_t> is what I meant. It's not hard coded, it's always possible to deal with it using mainly @Cast, for example, native void process(@StdU16String @Cast("unsigned short*") ShortPointer myUtf16Data). It's also possible to do native void process(@Adapter("BasicStringAdapter<signed short, char16_t>") ShortPointer myUtf16Data), among other variants in the same theme.

saudet · 2021-02-02T01:15:19Z

src/main/java/org/bytedeco/javacpp/tools/Generator.java

@@ -937,23 +938,44 @@ boolean classes(boolean handleExceptions, boolean defineAdapters, boolean conver
        out.println("}");
        out.println();
        if (handleExceptions || convertStrings) {
-            out.println("static JavaCPP_noinline jstring JavaCPP_createString(JNIEnv* env, const char* ptr) {");
+            out.println("#include <string>");
+            out.println("static JavaCPP_noinline jstring JavaCPP_createStringUTF8(JNIEnv* env, const char* ptr, size_t length) {");


Methods like String(byte[]) and String.getBytes() actually work for the native charset of the underlying OS, not necessarily UTF-8. We could add custom methods that somehow work with a user-specified encoding, but the default behavior of those methods should stay the same, and they shouldn't be named "UTF8".

Yeah, I missed that.
I actually started to implement a more generic solution that would allow to pass any Charset to createString/getBytes (so that String could be created from byte array in any encoding), when I thought: how do we determine that byte array returned from getBytes() can be safely null-terminated?. I mean, when we are using default charset, there is no guarantee that returned array won't have null byte somewhere in the middle. Do we just shift responsibility to user?

The null character gets added here manually after GetByteArrayRegion(), just like you did with GetStringRegion():
https://github.com/bytedeco/javacpp/blob/master/src/main/java/org/bytedeco/javacpp/tools/Generator.java#L972
As for passing a Charset, I'm thinking something defined globally at compile time as per issue #70 (comment) would be best. I'm not sure how we'd go about passing a Java object rather than just a string like that though. The latter entails potential exceptions at initialization time, but I guess that wouldn't be so bad. What do you think?

I was thinking about passing a charset name to getBytes(java.lang.String)/String(byte[], java.lang.String) overloads. Generator would extract charset name from annotation, hardcode it in C++ code which will create jstring from it using NewStringUTF (we can use modified utf-8 here I believe) and pass it to to getString()/String(). It has an advantage over define in that you could use different charsets for different functions.

The downside is that these NewStringUTF() calls will introduce additional overhead (JMH said that it would be almost 50% slower than using String(byte[])) unless you find some way to cache them.

Setting encoding via define would of course also require passing jstring with charset name (unless you meant some other method?) but then you can easily cache it.

I'm thinking something like this too, but the issue with passing a charset name like that for every call is that it may through an exception at runtime, any time. It's better to resolve it at load time into a Charset object.

…(array of UTF-16 code units)

saudet · 2021-03-09T02:00:18Z

BTW, to obtain even higher performance, we should also add support for these functions via @CriticalRegion:
https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringCritical_ReleaseStringCritical

equeim force-pushed the utf16 branch from 414d598 to ff4d6f1 Compare December 18, 2020 22:13

saudet mentioned this pull request Jan 22, 2021

NullPointerException in Parser and poor Unicode support #70

Closed

equeim force-pushed the utf16 branch from ff4d6f1 to fa54c45 Compare February 1, 2021 22:21

saudet requested changes Feb 2, 2021

View reviewed changes

Add @AsUtf16 annotation to map java.lang.String to unsigned short* …

ecdfa32

…(array of UTF-16 code units)

equeim force-pushed the utf16 branch from fa54c45 to ecdfa32 Compare February 10, 2021 00:19

Update CHANGELOG.md and fix a few nits

42dbba9

saudet approved these changes Feb 12, 2021

View reviewed changes

saudet merged commit 12a8a2f into bytedeco:master Feb 12, 2021

saudet mentioned this pull request May 28, 2021

@AsUtf16 annotation is lost for setter methods #482

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add @Utf16String annotation to map java.lang.String to unsigned short* #442

Add @Utf16String annotation to map java.lang.String to unsigned short* #442

equeim commented Dec 16, 2020

saudet commented Dec 16, 2020

equeim commented Dec 19, 2020

saudet commented Dec 19, 2020

equeim commented Dec 19, 2020

saudet commented Dec 19, 2020

saudet commented Jan 22, 2021

saudet commented Jan 22, 2021

equeim commented Jan 27, 2021

saudet commented Jan 28, 2021 •

edited

Loading

saudet Feb 2, 2021

equeim Feb 9, 2021

saudet Feb 9, 2021 •

edited

Loading

equeim Feb 9, 2021

saudet Feb 10, 2021

saudet commented Mar 9, 2021

Add @Utf16String annotation to map java.lang.String to unsigned short* #442

Add @Utf16String annotation to map java.lang.String to unsigned short* #442

Conversation

equeim commented Dec 16, 2020

saudet commented Dec 16, 2020

equeim commented Dec 19, 2020

saudet commented Dec 19, 2020

equeim commented Dec 19, 2020

saudet commented Dec 19, 2020

saudet commented Jan 22, 2021

saudet commented Jan 22, 2021

equeim commented Jan 27, 2021

saudet commented Jan 28, 2021 • edited Loading

saudet Feb 2, 2021

Choose a reason for hiding this comment

equeim Feb 9, 2021

Choose a reason for hiding this comment

saudet Feb 9, 2021 • edited Loading

Choose a reason for hiding this comment

equeim Feb 9, 2021

Choose a reason for hiding this comment

saudet Feb 10, 2021

Choose a reason for hiding this comment

saudet commented Mar 9, 2021

saudet commented Jan 28, 2021 •

edited

Loading

saudet Feb 9, 2021 •

edited

Loading