Concrete strings #16

Open
encukou opened this issue May 10, 2023 · 5 comments

Comments

@encukou
Contributor

encukou commented May 10, 2023

From jpype-project/jpype#1071 (comment) :

In theory Java and Python have compatible definitions of strings. Both are immutable and thus one should just be able to wrap a Java string as a Python string and be done. However, strings are not a protocol but rather concrete objects, thus I can't just implement an interface for the C API of Python string and make them compatible. This failing forces either immediate conversion (which itself is problematic if a string is very large and the user does not intend to work with it in Python extensively), or use of a conversion. This problem does not just affect Java wrappers, but Qt wrappers and many other language bindings where immutable strings are available.

@gvanrossum

It feels like this is going to be a tough sell if it has a noticeable performance effect on how strings are typically used by CPython. OTOH the unicode object already has many representations under the hood. Maybe it would be possible to add another? It would have to wrap and own the Java string object. (Something would have to wrap and own it, we can't just cast a Java object pointer to PyObject *.)
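
The "many representations" mentioned above can be observed from Python itself. The following is a minimal sketch, assuming CPython 3.3 or later (the flexible string representation from PEP 393). `bytes_per_char` is a hypothetical helper written for this illustration, and the sizes it measures are a CPython implementation detail, not a language guarantee.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate storage per character by growing a string by one character.

    Assumption: CPython's compact string layout, where the character data
    is stored inline after a fixed-size header.
    """
    return sys.getsizeof(ch * 1001) - sys.getsizeof(ch * 1000)


# One sample character for each internal width CPython may choose.
samples = {
    "ASCII": "a",
    "Latin-1": "\xe9",
    "BMP (2-byte kind)": "\u0416",
    "non-BMP (4-byte kind)": "\U0001F600",
}

for label, ch in samples.items():
    print(f"{label}: {bytes_per_char(ch)} byte(s) per character")
```

A wrapped foreign string would presumably be one more case next to these, which is why any extra check on the common path matters for performance.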

@Thrameos

The only requirement here is that, when the string is used, something needs to be called to prepare it first, and that the memory for the string may be stored elsewhere. For Java or C#, the ready call would check whether the object has already been transferred, in which case it returns immediately; otherwise it calls some routine to make the memory available using one of the existing Python protocols. Java uses a funny encoding that is neither UTF-8 nor UTF-16 but something in between. When the string is destroyed, it would then need to release the memory.

For non-abstract strings it would just check the slot, find that there is no abstract string slot, and proceed. So the cost would be one slot check per usage.

Thus the proposal would be a PyStringReady() slot and a slot for destroying the string. The memory space for the string could hold a pointer to the external memory. Bindings would use lazy transfer to move their string into Python when ready is called, and check to see whether the string was ever readied, in which case it would release the external memory.

Ideally this all happens behind the scenes such that there are never any changes on the user's side. Calls that access the string data (i.e. PyUnicode_1BYTE_DATA, PyUnicode_2BYTE_DATA, PyUnicode_4BYTE_DATA) and those that report on it (PyUnicode_KIND) call the ready slot, which causes all the fields in the string to be filled out.
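
The ready/destroy idea described above can be modeled in pure Python. This is only a toy sketch of the control flow, not the proposed C API: `LazyForeignString`, `fetch`, `release`, and `close` are hypothetical names invented for the illustration, and a real implementation would presumably live in slots on the string type.

```python
from typing import Callable, Optional


class LazyForeignString:
    """Toy model of a string whose data stays in a foreign runtime until used."""

    def __init__(self, fetch: Callable[[], str], release: Callable[[], None]):
        self._fetch = fetch      # hypothetical: converts the foreign data to str
        self._release = release  # hypothetical: frees the foreign reference
        self._data: Optional[str] = None
        self._closed = False

    def _ready(self) -> str:
        """Stand-in for the 'ready' slot: transfer the data at most once."""
        if self._data is None:
            self._data = self._fetch()
        return self._data

    # Every accessor goes through _ready(), mirroring how the data and
    # kind macros would call the slot before touching the fields.
    def __len__(self) -> int:
        return len(self._ready())

    def __str__(self) -> str:
        return self._ready()

    def close(self) -> None:
        """Stand-in for the 'destroy' slot: release foreign memory once."""
        if not self._closed:
            self._closed = True
            self._release()


# Usage: count how often the (fake) foreign runtime is touched.
calls = {"fetch": 0, "release": 0}


def fake_fetch() -> str:
    calls["fetch"] += 1
    return "hello from Java"


def fake_release() -> None:
    calls["release"] += 1


s = LazyForeignString(fake_fetch, fake_release)
print(calls["fetch"])  # nothing has been transferred yet
print(len(s), str(s))  # the first access triggers the transfer
s.close()
print(calls)
```

In this model, an ordinary string corresponds to the case where the data is already present, so the only added cost is the check at the top of `_ready()`.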

@gvanrossum

I'm not sure I follow all that, but fortunately, this issue tracker is not for solutions but for problems, and the problem seems clear enough.

@encukou
Contributor Author

encukou commented May 17, 2023

I think this is very solvable, thanks to Inada-san's work in improving the PyUnicode API.
(And I'd enjoy solving it, but can't fit it in my priorities.)

FWIW, this would enable adding a performant (but somewhat tricky to use) API for zero-copy strings, e.g. from mapped files or from/to languages like Rust.

@ronaldoussoren

FWIW: I also ran into this with PyObjC, which contains a subtype of PyUnicode_Type just to be able to use Objective-C strings transparently with extension functions that expect string arguments.

That subtype is inherently fragile because its implementation uses implementation details of PyUnicode_Type. Luckily that implementation hasn't seen a lot of changes so far, other than the migration to the current representation earlier in Python 3's development.

A problem with integration could be the representation of foreign strings, e.g. Java and Objective-C strings logically are UCS2 while Python's string is UCS4. That can probably be solved by using UTF-8 in a hypothetical string protocol.
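
The representation mismatch can be illustrated with a short sketch. It assumes only that Java and Objective-C strings count UTF-16 code units while Python's `str` counts code points; `utf16_code_units` is a hypothetical helper for the illustration.

```python
def utf16_code_units(s: str) -> int:
    """Length a UTF-16 based string type would report for the same text."""
    return len(s.encode("utf-16-le")) // 2


# ASCII, Latin-1, a space, and one character outside the BMP.
text = "G\u00e9 \U0001F600"

code_points = len(text)             # what Python reports
utf16_units = utf16_code_units(text)  # what a UTF-16 based runtime reports
utf8_bytes = text.encode("utf-8")   # a neutral interchange form

print(code_points, utf16_units, len(utf8_bytes))

# UTF-8 round-trips losslessly, which is what makes it a plausible
# common format for a string protocol.
assert utf8_bytes.decode("utf-8") == text
```

Indexes and lengths differ between the two views as soon as a character outside the BMP appears, so a shared protocol would need to fix one encoding rather than expose either side's native indexing.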

@iritkatriel iritkatriel removed the v label Oct 23, 2023