
gh-89653: PEP 670: Convert PyUnicode_KIND() macro to function #92705

Merged: 1 commit into python:main, May 13, 2022

Conversation

vstinner
Member

In the limited C API version 3.12, PyUnicode_KIND() is now
implemented as a static inline function. Keep the macro for the
regular C API and for the limited C API version 3.11 and older to
prevent introducing new compiler warnings.

Update _decimal.c and stringlib/eq.h for PyUnicode_KIND().

@vstinner
Member Author

@serhiy-storchaka @methane @erlend-aasland: I propose to define the PyUnicode_KIND() result type as int. Are you ok with that?

My first motivation is that the only existing function taking a kind is PyUnicode_FromKindAndData(), and its kind parameter type is int.

The int type is very common in C and easy to use; IMO unsigned int is less convenient. Enums must be avoided in the C API: they are not portable and cause issues in programming languages other than C, so enum PyUnicode_Kind cannot be used.

Previously, the exact C type of the PyUnicode_KIND() result was never well defined. It was documented as unsigned int, but int (signed) and enum PyUnicode_Kind were also accepted and didn't emit any compiler warning (even with -Wconversion).

I also prepared the PR #92704 which defines all "kind" variables as int.

I already merged the commit d0c9353 which prepared the code and the API.

@vstinner
Member Author

This PR only impacts C extensions which opt-in for the limited C API version 3.12 and newer. Other C extensions are not affected (no change).

Contributor

@erlend-aasland left a comment


Looks good to me

@vstinner
Member Author

PR rebased to fix merge conflicts.

@vstinner vstinner merged commit db388df into python:main May 13, 2022
@vstinner vstinner deleted the kind_func branch May 13, 2022 09:50
@vstinner
Member Author

Thanks for your reviews @methane and @erlend-aasland, I really wasn't sure about the PyUnicode_KIND() return type.

Next: deprecate the whole API which exposes kind+data; it's too low-level and too specific to PEP 393. I'm seriously considering suggesting higher-level APIs. That would make it possible to experiment with using UTF-8 internally in CPython, as PyPy does (with an index to keep O(1) complexity for str[index] character access).

@serhiy-storchaka
Member

I predict a serious performance hit in modules which parse strings (like re or json, or third-party alternatives) if we get rid of this API.

@vstinner
Member Author

I predict a serious performance hit in modules which parse strings (like re or json, or third-party alternatives) if we get rid of this API.

The API would be part of the internal C API and so can be used by stdlib extensions, even if they are built as shared libraries.

Yeah, the performance overhead for 3rd party C extensions should be measured before considering the idea.

@vstinner
Member Author

I have 2 motivations to hide the (kind, data) low-level API:

  • It prevents CPython from using something else, like UTF-8
  • Other Python implementations must emulate it in their implementation of the Python C API: that's annoying since the API has no function to say "I'm done with modifying the mutable string: convert it to a final immutable object". The whole API is based on the assumption that mutable strings can be modified freely because a string being built cannot be accessed by anyone else.

The _PyUnicodeWriter private API has a _PyUnicodeWriter_Finish() function which builds the final immutable string. This function calls realloc() if the buffer was overallocated.

I would prefer to expose such a "writer" API to 3rd party C extensions.

@methane
Member

methane commented May 13, 2022

The perf hit would be measurable, but not the end of the world.

In 2018, I ported MarkupSafe to PEP 393. I wrote the logic as a macro plus 3 functions, one for each kind.
It is very difficult to maintain.
https://github.com/pallets/markupsafe/pull/64/files

In 2020, I ported Genshi to Python 3.3+, but used UTF-8 instead of the PEP 393 API.
It looks like regular string-processing C code.
https://github.com/edgewall/genshi/pull/26/files

I think optimizing for PEP 393 won't pay off in most cases.

@serhiy-storchaka
Member

For re, the difference was enormous: up to 4 times for non-ASCII data in microbenchmarks, and up to 25% in standard pure-ASCII benchmarks. See #62885. This may be a reason regex is ~25% slower than re in many benchmarks.

I would not want to lose such advantage.

@vstinner
Member Author

The _PyUnicodeWriter API should be as efficient as using the (kind, data) API. Overallocation can be enabled or disabled, and it's possible to request the maximum character when the writer is initialized, to avoid expensive conversions from UCS1 to UCS2 and then to UCS4. But it's usually hard to know in advance what the final maximum character will be. _PyUnicodeWriter also allows changing the buffer format at each write, from UCS1 to UCS2 and then to UCS4.

In the early days of Python 3.3, most code only used UCS4 buffers (a simple C array of Py_UCS4 characters allocated on the heap) and then downcast to UCS1 or UCS2 when the final string was created.

I'm not proposing to expose the _PyUnicodeWriter API directly. I'm thinking about proposing a public C API with the same idea: an "append only" API and a "finish" function which returns a Python str object.

@vstinner
Member Author

For re, the difference was enormous, up to 4 times for non-ASCII data in microbenchmarks and up to 25% in standard pure ASCII benchmarks. See #62885. This may be a reason of regex been ~25% slower than re in many benchmarks.

Would it be acceptable for the regex module to use the internal C API to directly access the Python str internals, i.e. the (kind, data) API?

Anyway, this idea is in the PEP area: use cases should be written down, and the performance of each use case should be measured.

If possible, I would prefer to have no impact or only minor overhead on CPython, but the main goal is to get "native speed" for other Python implementations, especially PyPy, which uses UTF-8 internally.

@erlend-aasland
Contributor

Can we move this discussion to Discourse or a separate issue?

@vstinner
Member Author

The future of the (kind, data) API is discussed at: https://discuss.python.org/t/un-deprecate-pyunicode-ready-for-future-unicode-improvement/15717
