
gh-89653: PEP 670: Convert PyUnicode_KIND() macro to function #92705

Merged: 1 commit into python:main, May 13, 2022

Conversation

vstinner
Member

In the limited C API version 3.12, PyUnicode_KIND() is now
implemented as a static inline function. Keep the macro for the
regular C API and for the limited C API version 3.11 and older to
prevent introducing new compiler warnings.

Update _decimal.c and stringlib/eq.h for PyUnicode_KIND().

@vstinner
Member Author

@serhiy-storchaka @methane @erlend-aasland: I propose to define the PyUnicode_KIND() result type as int. Are you ok with that?

My first motivation is that the only existing function taking a kind is PyUnicode_FromKindAndData(), and its kind parameter type is int.

The int type is very common in C and easy to use; IMO unsigned int is less convenient. Enums must be avoided in the C API: they are not portable and cause issues in programming languages other than C, so enum PyUnicode_Kind cannot be used.

Previously, the exact C type of the PyUnicode_KIND() result was never well defined. It was documented as unsigned int, but int (signed) and enum PyUnicode_Kind were also accepted and didn't emit any compiler warning (even with -Wconversion).

I also prepared the PR #92704 which defines all "kind" variables as int.

I already merged the commit d0c9353 which prepared the code and the API.

@vstinner
Member Author

This PR only impacts C extensions which opt-in for the limited C API version 3.12 and newer. Other C extensions are not affected (no change).

Contributor

@erlend-aasland left a comment


Looks good to me

@vstinner
Member Author

PR rebased to fix merge conflicts.

@vstinner vstinner merged commit db388df into python:main May 13, 2022
@vstinner vstinner deleted the kind_func branch May 13, 2022 09:50
@vstinner
Member Author

Thanks for your reviews @methane and @erlend-aasland, I really wasn't sure about the PyUnicode_KIND() return type.

Next: deprecate the whole API which exposes kind+data; it's too low-level and too specific to PEP 393. I'm seriously considering suggesting higher-level APIs. That would make it possible to experiment with using UTF-8 internally in CPython, as PyPy does (with an index to keep O(1) complexity for str[index] character access).

@serhiy-storchaka
Member

I predict a serious performance hit in modules which parse strings (like re or json, or third-party alternatives) if we get rid of this API.

@vstinner
Member Author

I predict a serious performance hit in modules which parse strings (like re or json, or third-party alternatives) if we get rid of this API.

The API would be part of the internal C API and so can be used by stdlib extensions, even if they are built as shared libraries.

Yeah, the performance overhead for 3rd party C extensions should be measured before considering the idea.

@vstinner
Member Author

I have 2 motivations to hide the (kind, data) low-level API:

  • It prevents CPython from using something else, like UTF-8
  • Other Python implementations must emulate it in their implementation of the Python C API: that's annoying since the API has no function to say "I'm done with modifying the mutable string: convert it to a final immutable object". The whole API is based on the assumption that mutable strings can be modified freely because a string being built cannot be accessed by anyone else.

The _PyUnicodeWriter private API has a _PyUnicodeWriter_Finish() function which builds the final immutable string. This function calls realloc() if the buffer was overallocated.

I would prefer to expose such a "writer" API to 3rd party C extensions.

@methane
Member

methane commented May 13, 2022

The perf hit would be measurable, but not the end of the world.

In 2018, I ported MarkupSafe to PEP 393. I wrote the logic as a macro plus 3 functions, one for each kind.
It is very difficult to maintain.
https://github.com/pallets/markupsafe/pull/64/files

In 2020, I ported Genshi to Python 3.3+, but used UTF-8 instead of the PEP 393 API.
It looks like regular string-processing C code.
https://github.com/edgewall/genshi/pull/26/files

I think optimizing for PEP 393 won't pay off in most cases.

@serhiy-storchaka
Member

For re, the difference was enormous: up to 4 times for non-ASCII data in microbenchmarks, and up to 25% in standard pure-ASCII benchmarks. See #62885. This may be a reason regex is ~25% slower than re in many benchmarks.

I would not want to lose such advantage.

@vstinner
Member Author

The _PyUnicodeWriter API should be as efficient as using the (kind, data) API. Overallocation can be enabled or disabled, and it's possible to request the maximum character when the writer is initialized, to avoid expensive conversions from UCS1 to UCS2 and then to UCS4. But it's usually hard to know in advance what the final maximum character will be. _PyUnicodeWriter also allows changing the buffer format at each write, from UCS1 to UCS2 and then to UCS4.

In the early days of Python 3.3, most code only used UCS4 buffers (a simple C array of Py_UCS4 characters allocated on the heap) and then downcast to UCS1 or UCS2 when the final string was created.

I'm not proposing to expose the _PyUnicodeWriter API directly. I'm thinking about proposing a public C API with the same idea: an "append only" API and a "finish" function which returns a Python str object.

@vstinner
Member Author

For re, the difference was enormous, up to 4 times for non-ASCII data in microbenchmarks and up to 25% in standard pure ASCII benchmarks. See #62885. This may be a reason of regex been ~25% slower than re in many benchmarks.

Would it be acceptable for the regex module to use the internal C API to directly access the Python str internals, i.e. the (kind, data) API?

Anyway, this idea is in the PEP area: use cases should be written down, and the performance of each use case should be measured.

If possible, I would prefer to have no impact or only minor overhead on CPython, but the main goal is to get "native speed" for other Python implementations, especially PyPy, which uses UTF-8 internally.

@erlend-aasland
Contributor

Can we move this discussion to Discourse or a separate issue?

@vstinner
Member Author

The future of the (kind, data) API is discussed at: https://discuss.python.org/t/un-deprecate-pyunicode-ready-for-future-unicode-improvement/15717
