-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-89653: PEP 670: Convert PyUnicode_KIND() macro to function #92705
Conversation
@serhiy-storchaka @methane @erlend-aasland: I propose to define the My first motivation is that the only existing function using a kind is The Previously, the exact C type of the PyUnicode_KIND() result was never well defined. It was documented as I also prepared the PR #92704 which defines all "kind" variables as I already merged the commit d0c9353 which prepared the code and the API. |
This PR only impacts C extensions which opt-in for the limited C API version 3.12 and newer. Other C extensions are not affected (no change). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
In the limited C API version 3.12, PyUnicode_KIND() is now implemented as a static inline function. Keep the macro for the regular C API and for the limited C API version 3.11 and older to prevent introducing new compiler warnings. Update _decimal.c and stringlib/eq.h for PyUnicode_KIND().
PR rebased to fix merge conflicts. |
Thanks for your reviews @methane and @erlend-aasland, I really wasn't sure about the PyUnicode_KIND() return type. Next: deprecate the whole API which expose kind+data, it's too low-level, too specific to PEP 393. I seriously consider suggesting higher level APIs. It would help to experiment using UTF-8 internally in CPython, as PyPy does (with an index to keep O(1) complexity for |
I predict a serious performance hit in modules which parse strings (like |
The API would be part of the internal C API and so can be used by stdlib extensions, even if they are built as shared libraries. Yeah, the performance overhead of 3rd party C extensions should be measured to consider the idea. |
I have 2 motivations to hide the (kind, data) low-level API:
The _PyUnicodeWriter private API has a _PyUnicodeWriter_Finish() function which builds the final immutable string. It function calls realloc() if the buffer was overallocated. I would prefer to expose such "writer" API for 3rd party C extensions. |
Perf hit would be measurable, but not end of the world. In 2018, I port MarkupSafe to PEP 393. I wrote logic in macro and 3 functions for each kind. In 2020, I port Genshi to Python 3.3+, but used UTF-8 instead of PEP 393 API. I think optimizing for PEP 393 won't pay for most cases. |
For I would not want to lose such advantage. |
The _PyUnicodeWriter API should be as efficient as using the (kind, data) API. Overallocation can be enabled or disabled, and it's possible to request max character when the writer is initialized to avoid expensive conversions from UCS1 to UCS2 and then to UCS4. But it's usually hard to know in advance to know what will be the final maximum character. _PyUnicodeWriter also allows to change the buffer format at each write from UCS1, to UCS2 and then UCS4. In the early days of Python 3.3, most code only used UCS4 buffers (simple C array of Py_UCS4 characters allocated on the heap) and then downcasted to UCS1 or UCS2 when the final string was created. I'm not proposing to expose directly the _PyUnicodeWriter API. I'm thinking about proposing a public C API with the same idea: "append only" API and a "finish" function which returns a Python str object. |
Would it be acceptable for the regex module to use the internal C API to access directly to the Python str internals: the (kind, data) API? Anyway, this idea is in the PEP area: use cases should be written down, and the performance of each use case should be measured. If possible, I would prefer to have no impact or minor overhead on CPython, but the main goal is to get "native speed" for other Python implementations, especially PyPy which uses UTF-8 internally. |
Can we move this discussion to Discourse or a separate issue? |
The future of the (kind, data) API is discussed at: https://discuss.python.org/t/un-deprecate-pyunicode-ready-for-future-unicode-improvement/15717 |
In the limited C API version 3.12, PyUnicode_KIND() is now
implemented as a static inline function. Keep the macro for the
regular C API and for the limited C API version 3.11 and older to
prevent introducing new compiler warnings.
Update _decimal.c and stringlib/eq.h for PyUnicode_KIND().