Concrete strings #16
Comments
It feels like this is going to be a tough sell if it has a noticeable performance effect on how strings are typically used by CPython. OTOH the unicode object already has many representations under the hood. Maybe it would be possible to add another? It would have to wrap and own the Java string object. (Something would have to wrap and own it, we can't just cast a Java object pointer to |
The only requirement here is that something must be called to prepare the string before it is used, and that the memory for the string may be stored elsewhere. For Java or C#, the string-ready call would check whether the object has already been transferred, in which case it returns immediately; otherwise it calls some routine to make the memory available using one of the existing Python protocols. Java uses a funny encoding that is neither UTF-8 nor UTF-16 but something in between. When the string is destroyed, it would then need to release the memory. For non-abstract strings, it would just check the slot, find there is no abstract string slot, and proceed. So the cost would be one slot check per usage.

Thus the proposal is a PyStringReady() slot plus a slot for destroying the string. The memory space for the string could hold a pointer to the external memory. Bindings would use lazy transfer to move their string into Python when ready is called, and check whether the string was ever readied, in which case they would release the external memory. Ideally this all happens behind the scenes, so that there are never any changes on the user's side. Calls that access the string data (i.e. PyUnicode_1BYTE_DATA, PyUnicode_2BYTE_DATA, PyUnicode_4BYTE_DATA) and those that report on it (PyUnicode_KIND) would call the ready slot, which causes all the fields in the string to be filled out.
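To make the ready/destroy idea above easier to picture, here is a minimal pure-Python sketch. It is only a toy model, not CPython's implementation: the class `ForeignString` and the `fetch` and `release` callbacks are hypothetical names standing in for the proposed ready and destroy slots, and `close()` simply calls the release callback once, which is one possible reading of the destroy step.

```python
class ForeignString:
    """Toy model of a string whose data stays in a foreign runtime until needed."""

    def __init__(self, fetch, release):
        self._fetch = fetch      # hypothetical: copies the text out of the foreign runtime
        self._release = release  # hypothetical: drops the foreign reference
        self._data = None        # filled in lazily by _ready()
        self._closed = False

    def _ready(self):
        # Fast path: data already transferred, so only one check is paid.
        if self._data is None:
            self._data = self._fetch()
        return self._data

    def data(self):
        # Data access forces the transfer, like the PyUnicode_*BYTE_DATA calls.
        return self._ready()

    def kind(self):
        # Reporting call: bytes per code point (1, 2 or 4), like PyUnicode_KIND.
        top = max(map(ord, self._ready()), default=0)
        return 1 if top < 0x100 else 2 if top < 0x10000 else 4

    def close(self):
        # Destroy step: release the foreign side exactly once.
        if not self._closed:
            self._closed = True
            self._release()


log = []
s = ForeignString(lambda: log.append("fetch") or "héllo",
                  lambda: log.append("release"))
s.kind()
s.data()
s.close()
```

In this sketch the foreign data is fetched only on the first call that needs it, and every later call pays just the `is None` check, which mirrors the "one slot check per usage" cost described above.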
I'm not sure I follow all that, but fortunately, this issue tracker is not for solutions but for problems, and the problem seems clear enough. |
I think this is very solvable, thanks to Inada-san's work in improving the PyUnicode API. FWIW, this would enable adding a performant (but somewhat tricky to use) API for zero-copy strings, e.g. from mapped files or from/to languages like Rust.
FWIW: I also ran into this with PyObjC, which contains a subtype of PyUnicode_Type just to be able to use Objective-C strings transparently with extension functions that expect string arguments. That subtype is inherently fragile because it relies on implementation details of PyUnicode_Type. Luckily that implementation hasn't seen a lot of changes so far, other than the migration to the current representation earlier in Python 3's development. A problem with integration could be the representation of foreign strings, e.g. Java and Objective-C strings are logically UCS2 while Python's string is UCS4. That can probably be solved by using UTF-8 in a hypothetical string protocol.
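As a rough illustration of the representation gap mentioned above, the following Python sketch takes 16-bit code units, as a Java or Objective-C string would hold them, and converts them to UTF-8. The helper name `utf16_units_to_utf8` is made up for this example and is not part of any existing or proposed protocol.

```python
def utf16_units_to_utf8(units):
    """Convert a sequence of 16-bit code units to UTF-8 bytes."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    # Decoding pairs up surrogates into single code points.
    text = raw.decode("utf-16-le")
    return text.encode("utf-8")


# "hi" followed by U+1F600, which needs a surrogate pair in 16-bit units.
units = [0x0068, 0x0069, 0xD83D, 0xDE00]
utf8 = utf16_units_to_utf8(units)
text = utf8.decode("utf-8")
print(len(units), len(text), len(utf8))
```

The four 16-bit units become three Python code points and six UTF-8 bytes, so lengths and indices differ between the two sides. Note that the strict decode used here raises `UnicodeDecodeError` on a lone surrogate, which a foreign string may legally contain; a real protocol would have to decide how to handle that case.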
From jpype-project/jpype#1071 (comment) :