Add GDExtension function to construct StringName directly from `char*` #78580

Bromeon · 2023-06-22T19:16:14Z

So far, an indirection via String was necessary, causing at least 2 allocations and copies (String; String inside StringName). Since StringNames often refer to string literals, this allows them to be directly constructed from C strings.

It also provides the is_static flag: when the source has static storage duration, no copy/allocation will be needed. However, the extension developer needs to uphold this lifetime guarantee.

Some things need to be discussed:

Does this actually support UTF-8? I've only seen ASCII StringNames in the wild...
I adopted the name from the String constructor function, feel free to suggest a better one (especially if 1 doesn't hold).
Are the memory assumptions correct?
- If a static StringName is created, its destructor should not be invoked?
- Non-static ones are reference-counted as usual?

dsnopek · 2023-06-23T13:31:24Z

Thanks! The overall idea here looks great :-)

Does this actually support UTF-8? I've only seen ASCII StringNames in the wild...

That's a really good question. Looking through the code, it seems as though StringNames created from char * may be interpreted as latin-1? It definitely doesn't seem like utf-8, though - String::hash() is used to find the right data in the table, and it's basically just casting the chars to uint32_t (as opposed to trying to interpret the multi-byte sequences from utf-8).

I adopted the name from the String constructor function, feel free to suggest a better one (especially if 1 doesn't hold).

Maybe gdextension_string_name_new_from_chars()? Or gdextension_string_name_new_from_latin1_chars() if it's really latin-1 and not just plain ASCII.

If a static StringName is created, its destructor should not be invoked?

I'm really not sure about this one. I think it may still be OK to invoke the destructor? I'm actually quite confused about how StringName::_Data::static_count is being used. This needs more investigation or an expert opinion.

dsnopek · 2023-06-23T13:32:08Z

core/extension/gdextension_interface.h

+
+/**
+ * @name string_name_new_with_utf8_chars
+ * @since 4.1


This is probably actually Godot 4.2 material -- 4.1 is about to go into RC next week!

Bromeon · 2023-07-30T14:06:13Z

It looks like the encoding depends on whether the StringName is constructed statically or not. string_name_new_with_chars is now named more generally to accomodate for both encodings.

I mentioned the semantics in the docs, but would appreciate a 2nd opinion:

If is_static is true, then:

The StringName will reuse the p_contents buffer instead of copying it.
You must guarantee that the buffer remains valid for the duration of the application (e.g. string literal).

The encoding is limited to Latin-1 instead of UTF-8.

You must not call a destructor for the StringName.

is_static is purely an optimization and can easily introduce undefined behavior if used wrong. In case of doubt, set it to false.

Since the semantics vary wildly (encoding, lifetime, destruction), would it make sense to split this into two functions?

string_name_new_with_utf8_chars
string_name_new_static_with_latin1_chars

(the "with_xy_chars" naming follows the string_new_* convention)

Bromeon · 2023-07-30T14:13:43Z

@dsnopek

If a static StringName is created, its destructor should not be invoked?

I'm really not sure about this one. I think it may still be OK to invoke the destructor? I'm actually quite confused about how StringName::_Data::static_count is being used. This needs more investigation or an expert opinion.

Regarding that, last time I ran into problems when destroying static StringNames, and IIRC there weren't any memory leaks reported by LSan when not destroying it. Would need to check again, was more than a month ago...

@AThousandShips

Note that a StringName constructed from a utf8 string that contains non-ascii characters will not match equality with one created from utf32, unless the utf8 one converts the buffer to utf32 before creating

From UTF-32, so you mean converted from a String? Because the char* source here is either UTF-8 or Latin-1 (or even ASCII).

But that's an interesting limitation... Why is the buffer not normalized to one format?
Or are you talking about the static case (but then it would be Latin-1/ASCII)?

AThousandShips · 2023-07-30T14:15:04Z

Meant char* with utf8, if it is utf8 it would have to be converted to utf32 to be matching with the hash, or done for the hash calculation

It would match with Latin-1/ASCII, but with anything in utf8 outside the ASCII range the hash won't match unless the hash function is altered

If it creates the StringName by creating a String then everything is good

Bromeon · 2023-07-30T14:16:49Z

But doesn't Godot use that internally with SNAME all the time?

Or that just happens to work because SNAME is used everywhere and not actually converted from String? But that is also hard to imagine, as user provided input like class names comes from String...

AThousandShips · 2023-07-30T14:17:15Z

If you use SNAME it is not guaranteed to work if it contains non-ASCII

AThousandShips · 2023-07-30T14:17:52Z

Reason being:

godot/core/string/string_name.cpp

Lines 212 to 223 in 75f9c97

    
           StringName::StringName(const char *p_name, bool p_static) { 
        
           	_data = nullptr; 
        
           	ERR_FAIL_COND(!configured); 
        
           	if (!p_name || p_name[0] == 0) { 
        
           		return; //empty, ignore 
        
           	} 
        
           	MutexLock lock(mutex); 
        
           	uint32_t hash = String::hash(p_name);

godot/core/string/ustring.cpp

Lines 2765 to 2775 in 75f9c97

    
           uint32_t String::hash(const char *p_cstr) { 
        
           	uint32_t hashv = 5381; 
        
           	uint32_t c = *p_cstr++; 
        
           	while (c) { 
        
           		hashv = ((hashv << 5) + hashv) + c; /* hash * 33 + c */ 
        
           		c = *p_cstr++; 
        
           	} 
        
           	return hashv; 
        
           }

vs:

godot/core/string/ustring.cpp

Lines 2816 to 2826 in 75f9c97

    
           uint32_t String::hash(const char32_t *p_cstr) { 
        
           	uint32_t hashv = 5381; 
        
           	uint32_t c = *p_cstr++; 
        
           	while (c) { 
        
           		hashv = ((hashv << 5) + hashv) + c; /* hash * 33 + c */ 
        
           		c = *p_cstr++; 
        
           	} 
        
           	return hashv; 
        
           }

AThousandShips · 2023-07-30T14:20:40Z

Causing for example "£" to be hashed as if it was "0xC2 0xA3" instead of "0xA3"

Bromeon · 2023-07-30T14:20:58Z

Note that a StringName constructed from a utf8 string that contains non-ascii characters will not match equality with one created from utf32, unless the utf8 one converts the buffer to utf32 before creating

So getting back to this, is it even a good idea to expose the UTF-8 based construction to GDExtension? You can be sure that users will run into this trap sooner or later...

Should we maybe just start by exposing the static ASCII constructor?
(I'm btw still not sure if ASCII or Latin-1...)

Or allow UTF-8 construction and convert to UTF-32 in the presence of non-ASCII chars (and leaving it otherwise)? Might still be fast enough for the average case where StringName is pure ASCII.

AThousandShips · 2023-07-30T14:22:24Z

I'd say to err on the side of caution and do conversion, arguably unconditionally, unless the performance benefits of having the shorter encoding is worth it

So either:

Clarify that it only supports Latin-1 (encoded as such)
Check for encoding and convert conditionally
Convert unconditionally (i.e. just use a String conversion in the creation step):
memnew_placement(r_dest, StringName(String::utf8(p_contents), static_cast<bool>(is_static)));

The difference here for ASCII/Latin-1 is that if it is encoded in Latin-1 it'll work, but if it's Latin-1 encoded as utf8 it'll break

While it won't cause problems with built-in StringNames as they are all reliably in the valid range, it will clash with ones created from GDScript if they contain non-ASCII characters

Bromeon · 2023-07-30T14:42:39Z

Going always through StringName (3) would however defeat the point of this addition, as that's already possible with the existing API. The main idea behind this PR is to allow faster StringName construction, especially from literals.

Option (1) could still provide an is_static flag and could look as follows:

string_name_new_with_latin1_chars(dest, c_string, is_static);

It would then be the caller's responsibility to make sure that c_string is Latin-1.
This would be symmetric with the existing constructor for String, namely string_new_with_latin1_chars.

Later (or now?), we could then provide a constructor from UTF-8 that would unconditionally convert, just like the String constructor.

AThousandShips · 2023-07-30T14:44:06Z

Sounds good!

dsnopek · 2023-07-31T22:12:45Z

So either:

Clarify that it only supports Latin-1 (encoded as such)

Check for encoding and convert conditionally

Convert unconditionally (i.e. just use a String conversion in the creation step):
memnew_placement(r_dest, StringName(String::utf8(p_contents), static_cast<bool>(is_static)));

I'd personally vote for option nr 1

StringNames are meant to be identifiers, not proper strings. So, in my opinion, sticking with ASCII is fine, it just happens that Godot treats C-strings as Latin-1 (as far as I can tell, at least) which is a mostly harmless superset of ASCII.

Later (or now?), we could then provide a constructor from UTF-8 that would unconditionally convert, just like the String constructor.

I'd vote for later. Given that it's already possible to convert a String to StringName, I don't think we're blocking anyone. We could wait to see if anyone actually asks for this.

(UPDATE: Although, if you really want to address it now, I'd be fine with that too :-))

Bromeon · 2023-08-01T06:05:31Z

So, in my opinion, sticking with ASCII is fine, it just happens that Godot treats C-strings as Latin-1 (as far as I can tell, at least) which is a mostly harmless superset of ASCII.

I think this is where things break: while StringNames constructed from ASCII char* compare fine with the StringNames constructed through String, this doesn't hold for Latin-1 char*. See also AThousandShips' comment here and the 2 messages preceding it.

Maybe the hash functions should be aligned? If yes, that would become blocking.

Or we only support ASCII and validate it. But in these constructor functions, there is currently no way to report errors. We could add a GDExtensionBool return type, but it would be inconsistent with other constructors like string_new_*.

I'd vote for later. Given that it's already possible to convert a String to StringName, I don't think we're blocking anyone. We could wait to see if anyone actually asks for this.

(UPDATE: Although, if you really want to address it now, I'd be fine with that too :-))

I already implemented it in the draft 🙂 still needs some testing though.

dsnopek · 2023-08-07T19:29:41Z

I think this is where things break: while StringNames constructed from ASCII char* compare fine with the StringNames constructed through String, this doesn't hold for Latin-1 char*. See also AThousandShips' comment here and the 2 messages preceding it.

Hm, looking at the code, I guess I still don't see why the hash is different for Latin-1 vs UTF-32.

The example of "£" is 0xA3 in Latin-1 and 0x000000A3 in UTF-32. I had thought this worked for all the Latin-1 characters? But I'm not an expert on this, so I could be wrong.

Looking at the hash function (for Latin-1 data):

uint32_t String::hash(const char *p_cstr) { 
 	uint32_t hashv = 5381; 
 	uint32_t c = *p_cstr++; 
  
 	while (c) { 
 		hashv = ((hashv << 5) + hashv) + c; /* hash * 33 + c */ 
 		c = *p_cstr++; 
 	} 
  
 	return hashv; 
 }

It's casting each character to a uint32_t before operating on it, which I think effectively converts the Latin-1 character to UTF-32 (assuming I'm correct above, about Latin-1 being the same as the first set of unicode characters).

Of course, if you passed UTF-8 data into String::hash(const char *p_cstr) then you'd get a very different hash, but I think this is meant to take Latin-1 data.

Or we only support ASCII and validate it. But in these constructor functions, there is currently no way to report errors. We could add a GDExtensionBool return type, but it would be inconsistent with other constructors like string_new_*.

However, if we operate under the assumption that I'm wrong about the hashing above - which very well may be the case :-) - and we can only correctly handle ASCII, I think it'd be fine documenting that it only takes ASCII and naming it accordingly (ie. string_name_new_with_ascii_chars()), I don't think we necessarily need to validate it.

The binding will have more knowledge about the data coming from the developer, and may already have validated that the data is ASCII (perhaps the language even only supports ASCII?) and adding extra validation in the GDExtension interface is unnecessary. There's already definitely parts of the GDExtension interface that expect you to pass correct data in with no validation, so this wouldn't be anything new.

AThousandShips · 2023-08-07T19:35:05Z

To clarify what I was saying, Latin-1, as in the encoding, should work correctly, but I don't know what encodings are used in source as C++ largely uses utf8 AFAIK, so that's my only point of concern, but assuming it is encoded as Latin-1 it should be fully safe

dsnopek · 2023-08-07T19:53:42Z

Ah, ok, thanks for clarifying!

The encoding used by the end-developer is going to depend on their specific programming language, so, in my opinion, it should be up to the language binding to ensure that it's passing data with the right encoding to GDExtension.

dsnopek · 2023-08-11T14:47:45Z

Discussed at the GDExtension meeting, and we think this API looks great! The only remaining question is if there is still an encoding issue. If not, this is probably good to move forward. If so, limiting the API to UTF-8 and ASCII (not Latin-1) would also be fine.

Bromeon · 2023-09-24T13:32:53Z

Found the reason of the discrepancy between the hashes:

StringName constructed from char* (Latin-1)
- uses the uint32_t String::hash(const char *p_cstr) method
- based on char, which has range [-128, 127] on Windows
- '±' has value -79
- assignment to uint32_t leads to value 4294967217
StringName constructed from existing String (UTF-8)
- uses the uint32_t String::hash() const method
- based on char32_t
- U'±' has value 177 (== 256 - 79)

How should we proceed?

Make sure the hashes are computed the same for all strings.
- Would need to be careful that the one which is used for method registrations doesn't change, otherwise all method hashes are invalidated.
- Any other breaking potential?
Make sure that StringName at least uses a cosistent hash function for itself.

I'll try with interpreting char* as uint8_t*...

core/extension/gdextension_interface.h

core/extension/gdextension_interface.cpp

Bromeon · 2023-09-24T14:20:02Z

Thanks for the review, will change parameter names.

Added a 2nd commit that would modify the char* hash and makes the two compare equal in my Rust tests, but we need to see if it doesn't break anything else.

We also need to see how we would do it for wchar_t*, as it essentially suffers from the same problem. It's a bit more difficult, because unlike char, it's size can vary, so I'm not sure if truncating to uint16_t would always give deterministic results.

So far, an indirection via String was necessary, causing at least 2 allocations and copies (String; String inside StringName). Since StringNames often refer to string literals, this allows them to be directly constructed from C strings. There are two formats: Latin-1 and UTF-8. The Latin-1 constructor also provides the `p_is_static` flag: when the source has static storage duration, no copy/allocation will be needed. However, the extension developer needs to uphold this lifetime guarantee.

Bromeon · 2023-09-24T16:33:04Z

Added a UTF-8 constructor that takes length, in addition to the one relying on null-termination. Should greatly enhance interoperability with other string types. (I can also remove the one expecting null-terminated char*, in case that's preferred).

Furthermore, I noticed the docs are imprecise -- the p_size argument sometimes means "number of bytes" (Latin-1, UTF-8) and sometimes "number of characters" (Wide, UTF-16, UTF-32). I renamed the latters to p_char_count and updated parameter docs accordingly. Does that make sense?

What's left is mostly the question whether the hash for wchar_t should also be fixed, and how (as it might be 2 or 4 bytes wide).

dsnopek · 2023-09-24T16:58:20Z

Found the reason of the discrepancy between the hashes

Aha, great catch! I wouldn't have thought to consider the signedness of character type.

Would need to be careful that the one which is used for method registrations doesn't change, otherwise all method hashes are invalidated.

Yes, this is important, I don't want to have to map more hashes :-)

Furthermore, I noticed the docs are imprecise -- the p_size argument sometimes means "number of bytes" (Latin-1, UTF-8) and sometimes "number of characters" (Wide, UTF-16, UTF-32). I renamed the latters to p_char_count and updated parameter docs accordingly. Does that make sense?

Sounds good to me!

What's left is mostly the question whether the hash for wchar_t should also be fixed, and how (as it might be 2 or 4 bytes wide).

Could we do something with constexpr? For example:

    uint32_t val;
    if constexpr (sizeof(wchar_t) == 2) {
        val = static_cast<uint16_t>(input_char);
    } else {
        val = static_cast<uint32_t>(input_char);
    }

Bromeon · 2023-09-24T17:35:11Z

Could we do something with constexpr?

Great idea! I actually ended up using std::conditional_t since the conversion was used in several places, and this adds much less code.

Yes, this is important, I don't want to have to map more hashes :-)

CI would report incompatibility if that happened, correct?

dsnopek · 2023-09-24T18:03:03Z

Great idea! I actually ended up using std::conditional_t since the conversion was used in several places, and this adds much less code.

This code looks great to me!

However, we'll have to see how the core and production teams feel about using std::conditional_t, since it isn't used anywhere in Godot yet, and we try to limit how much of the standard library and newer C++ features we use. (We do already use constexpr in a few places, so that could be a fallback, if necessary - but your version is much shorter and easier to follow, so I'd personally prefer yours.)

CI would report incompatibility if that happened, correct?

Yes, it would, and the tests have passed, so I think we're good :-)

Since char/wchar_t can be either signed or unsigned, its conversion to uint32_t leads to different values depending on platform. In particular, the same string represented as char* (Latin-1; StringName direct construction) or uint32_t (UTF-8; constructed via String) previously resulted in different hashes.

… vs. character count)

Bromeon · 2023-09-24T18:59:53Z

However, we'll have to see how the core and production teams feel about using std::conditional_t, since it isn't used anywhere in Godot yet, and we try to limit how much of the standard library and newer C++ features we use.

I see! I changed it to std::conditional<...>::type, which is already used in several places and should thus be less controversial.

akien-mga · 2023-09-24T21:45:03Z

Thanks!

Bromeon added enhancement topic:gdextension labels Jun 22, 2023

dsnopek added this to the 4.x milestone Jun 22, 2023

dsnopek reviewed Jun 23, 2023

View reviewed changes

Bromeon force-pushed the feature/gdextension-stringname-ctor branch 2 times, most recently from 10c9fad to 131fda4 Compare July 30, 2023 13:47

Bromeon mentioned this pull request Jul 30, 2023

GDExtension: PtrToArg::convert() creates unnecessary copies #80074

Closed

Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 131fda4 to 2dcb5f8 Compare July 30, 2023 21:08

Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 2dcb5f8 to 83ff151 Compare August 7, 2023 20:05

Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 83ff151 to 13fb823 Compare September 24, 2023 14:16

AThousandShips reviewed Sep 24, 2023

View reviewed changes

core/extension/gdextension_interface.h Outdated Show resolved Hide resolved

core/extension/gdextension_interface.cpp Outdated Show resolved Hide resolved

Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 13fb823 to 1fe778b Compare September 24, 2023 14:22

Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 1fe778b to f284d55 Compare September 24, 2023 16:27

Bromeon marked this pull request as ready for review September 24, 2023 16:40

Bromeon requested review from a team as code owners September 24, 2023 16:40

Bromeon force-pushed the feature/gdextension-stringname-ctor branch from f284d55 to 2608410 Compare September 24, 2023 17:32

dsnopek approved these changes Sep 24, 2023

View reviewed changes

Bromeon added 2 commits September 24, 2023 20:57

More precise documentation for GDExtension String constructors (bytes…

c770937

… vs. character count)

Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 2608410 to c770937 Compare September 24, 2023 18:59

akien-mga modified the milestones: 4.x, 4.2 Sep 24, 2023

akien-mga merged commit 40b48b1 into godotengine:master Sep 24, 2023
15 checks passed

Bromeon deleted the feature/gdextension-stringname-ctor branch September 25, 2023 08:11

akien-mga changed the title Add GDExtension function to construct StringName directly from char* Add GDExtension function to construct StringName directly from char * Oct 3, 2023

akien-mga changed the title ~~Add GDExtension function to construct StringName directly from char *~~ Add GDExtension function to construct StringName directly from char* Oct 3, 2023

dsnopek mentioned this pull request Oct 12, 2023

Use the new string_name_new_with_latin1_chars function to improve StringName construction performance godotengine/godot-cpp#1263

Merged

Bromeon mentioned this pull request Dec 8, 2023

Can StringName literals be optimized? godot-rust/gdext#531

Closed

mihe mentioned this pull request Feb 5, 2024

Error regarding non-existent interface function when loading in Godot 4.2-dev2 godot-jolt/godot-jolt#770

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add GDExtension function to construct StringName directly from `char*` #78580

Add GDExtension function to construct StringName directly from `char*` #78580

Bromeon commented Jun 22, 2023

dsnopek commented Jun 23, 2023 •

edited

Loading

dsnopek Jun 23, 2023

Bromeon commented Jul 30, 2023

Bromeon commented Jul 30, 2023 •

edited

Loading

AThousandShips commented Jul 30, 2023 •

edited

Loading

Bromeon commented Jul 30, 2023 •

edited

Loading

AThousandShips commented Jul 30, 2023

AThousandShips commented Jul 30, 2023 •

edited

Loading

AThousandShips commented Jul 30, 2023

Bromeon commented Jul 30, 2023 •

edited

Loading

AThousandShips commented Jul 30, 2023 •

edited

Loading

Bromeon commented Jul 30, 2023

AThousandShips commented Jul 30, 2023

dsnopek commented Jul 31, 2023 •

edited

Loading

Bromeon commented Aug 1, 2023 •

edited

Loading

dsnopek commented Aug 7, 2023 •

edited

Loading

AThousandShips commented Aug 7, 2023 •

edited

Loading

dsnopek commented Aug 7, 2023

dsnopek commented Aug 11, 2023

Bromeon commented Sep 24, 2023 •

edited

Loading

Bromeon commented Sep 24, 2023

Bromeon commented Sep 24, 2023 •

edited

Loading

dsnopek commented Sep 24, 2023

Bromeon commented Sep 24, 2023

dsnopek commented Sep 24, 2023

Bromeon commented Sep 24, 2023

akien-mga commented Sep 24, 2023

Add GDExtension function to construct StringName directly from char* #78580

Add GDExtension function to construct StringName directly from char* #78580

Conversation

Bromeon commented Jun 22, 2023

dsnopek commented Jun 23, 2023 • edited Loading

dsnopek Jun 23, 2023

Choose a reason for hiding this comment

Bromeon commented Jul 30, 2023

Bromeon commented Jul 30, 2023 • edited Loading

AThousandShips commented Jul 30, 2023 • edited Loading

Bromeon commented Jul 30, 2023 • edited Loading

AThousandShips commented Jul 30, 2023

AThousandShips commented Jul 30, 2023 • edited Loading

AThousandShips commented Jul 30, 2023

Bromeon commented Jul 30, 2023 • edited Loading

AThousandShips commented Jul 30, 2023 • edited Loading

Bromeon commented Jul 30, 2023

AThousandShips commented Jul 30, 2023

dsnopek commented Jul 31, 2023 • edited Loading

Bromeon commented Aug 1, 2023 • edited Loading

dsnopek commented Aug 7, 2023 • edited Loading

AThousandShips commented Aug 7, 2023 • edited Loading

dsnopek commented Aug 7, 2023

dsnopek commented Aug 11, 2023

Bromeon commented Sep 24, 2023 • edited Loading

Bromeon commented Sep 24, 2023

Bromeon commented Sep 24, 2023 • edited Loading

dsnopek commented Sep 24, 2023

Bromeon commented Sep 24, 2023

dsnopek commented Sep 24, 2023

Bromeon commented Sep 24, 2023

akien-mga commented Sep 24, 2023

Add GDExtension function to construct StringName directly from `char*` #78580

Add GDExtension function to construct StringName directly from `char*` #78580

dsnopek commented Jun 23, 2023 •

edited

Loading

Bromeon commented Jul 30, 2023 •

edited

Loading

AThousandShips commented Jul 30, 2023 •

edited

Loading

Bromeon commented Jul 30, 2023 •

edited

Loading

AThousandShips commented Jul 30, 2023 •

edited

Loading

Bromeon commented Jul 30, 2023 •

edited

Loading

AThousandShips commented Jul 30, 2023 •

edited

Loading

dsnopek commented Jul 31, 2023 •

edited

Loading

Bromeon commented Aug 1, 2023 •

edited

Loading

dsnopek commented Aug 7, 2023 •

edited

Loading

AThousandShips commented Aug 7, 2023 •

edited

Loading

Bromeon commented Sep 24, 2023 •

edited

Loading

Bromeon commented Sep 24, 2023 •

edited

Loading