bpo-45490: Convert unicodeobject.h macros to static inline functions #31221

vstinner · 2022-02-08T19:51:40Z

Convert unicodeobject.h macros to static inline functions.
Reorder functions to declare functions before their first usage.
Add "kind" variable to PyUnicode_READ_CHAR() and
PyUnicode_MAX_CHAR_VALUE() functions to only call PyUnicode_KIND()
once.
PyUnicode_KIND() now returns an "enum PyUnicode_Kind".
Simplify PyUnicode_GET_SIZE().
Add assertions to PyUnicode_WRITE() on the max value.
Add cast macros:
- _PyASCIIObject_CAST()
- _PyCompactUnicodeObject_CAST()
- _PyUnicodeObject_CAST()
The following functions are now declared as deprecated using
Py_DEPRECATED(3.3):
- PyUnicode_GET_SIZE()
- PyUnicode_GET_DATA_SIZE()
- PyUnicode_AS_UNICODE()
- PyUnicode_AS_DATA()
The implementations of these functions disable deprecation
warnings in their bodies.
PyUnicode_READ_CHAR() now uses PyUnicode_1BYTE_DATA(),
PyUnicode_2BYTE_DATA() and PyUnicode_4BYTE_DATA().
Replace "const PyObject*" with "PyObject*" in _decimal.c
and pystrhex.c: PyUnicode_READY() can modify the object.
Replace "const void *data" with "void *data" in some unicodedata.c
and unicodeobject.c functions which use PyUnicode_WRITE(): data is
used to modify the string.

https://bugs.python.org/issue45490

vstinner · 2022-02-08T19:53:09Z

@erlend-aasland: All in one PR to convert (almost) all macros of Include/cpython/unicodeobject.h.

I created a single PR to show what can be done with PEP 670, but IMO it will be better to split this large PR into smaller PRs to ease review, and apply (minor) API changes / cleanup in following PRs (not do everything at once).

erlend-aasland · 2022-02-09T09:48:18Z

> I created a single PR to show what can be done with PEP 670, but IMO it will be better to split this large PR into smaller PRs to ease review, and apply (minor) API changes / cleanup in following PRs (not do everything at once).

Sounds good.

erlend-aasland · 2022-02-09T10:09:46Z

IMO, this is a great improvement when it comes to readability/maintainability.

erlend-aasland · 2022-02-09T10:10:31Z

Modules/_decimal/_decimal.c

@@ -1895,7 +1895,7 @@ is_space(enum PyUnicode_Kind kind, const void *data, Py_ssize_t pos)
   Return NULL if malloc fails and an empty string if invalid characters
   are found. */
 static char *
-numeric_as_ascii(const PyObject *u, int strip_ws, int ignore_underscores)


Do we really need to remove const? Ditto for the rest of the PR.

vstinner · 2022-02-09

If the "u" string is not ready, PyUnicode_READY() will modify it. It's not a read-only operation.

In Python 3.12, PyUnicode_WCHAR_KIND will be removed: https://www.python.org/dev/peps/pep-0623/

In the meantime, I prefer to stop lying: we do modify the object :-)
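The point about const can be illustrated with a minimal standalone sketch. It does not use the real CPython headers: `lazy_str`, `lazy_str_make_ready()` and `lazy_str_length()` are hypothetical stand-ins for a string object, a "make ready" step and a caller that depends on it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a string object with a lazily built
   representation; this is not the real PyUnicodeObject layout. */
typedef struct {
    bool ready;
    int length;
} lazy_str;

/* Analogue of a "make ready" step: mutates the object on first use. */
static int lazy_str_make_ready(lazy_str *s)
{
    if (!s->ready) {
        s->ready = true;   /* the object is modified here */
    }
    return 0;
}

/* Takes a non-const pointer because it may trigger the mutation above.
   Declaring the parameter as "const lazy_str *" would require casting
   the qualifier away before calling lazy_str_make_ready(). */
static int lazy_str_length(lazy_str *s)
{
    if (lazy_str_make_ready(s) < 0) {
        return -1;
    }
    return s->length;
}
```

With a const parameter, the signature would promise callers something the body does not deliver, which is the situation the reply describes.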

gpshead

not a whole review, just dropping some notes.

gpshead · 2022-02-09T19:30:41Z

Include/cpython/unicodeobject.h

+    assert(PyUnicode_Check(op));
+    assert(PyUnicode_IS_READY(op));
+    return _PyASCIIObject_CAST(op)->length;
+}

 /* In the access macros below, "kind" may be evaluated more than once.


Presumably update comments like these to just mention that they used to be macros with these caveats in previous Python versions.

vstinner · 2022-02-09

Oh thanks, I didn't look at the comments at all. I laser-focused on the macro code and made sure that I didn't change the code :-)
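The caveat in the quoted comment ("may be evaluated more than once") is a general property of function-like macros. A minimal sketch, assuming hypothetical names (`READ_MACRO`, `read_func`, `next_index`) that are not part of the CPython API:

```c
#include <assert.h>

/* Hypothetical access macro: "index" appears twice in the expansion,
   so an argument with side effects is evaluated twice. */
#define READ_MACRO(data, index) \
    ((index) < 0 ? 0 : (data)[(index)])

/* The same logic as a static inline function: each argument is
   evaluated exactly once, at the call site. */
static inline int read_func(const int *data, int index)
{
    return index < 0 ? 0 : data[index];
}

/* Helper with a visible side effect: counts how often it is called. */
static int calls = 0;

static int next_index(void)
{
    calls++;
    return 1;
}

/* Number of next_index() calls made by one use of the macro. */
static int count_calls_macro(const int *data)
{
    calls = 0;
    (void)READ_MACRO(data, next_index());
    return calls;
}

/* Number of next_index() calls made by one use of the function. */
static int count_calls_func(const int *data)
{
    calls = 0;
    (void)read_func(data, next_index());
    return calls;
}
```

Once an accessor is a function, the "evaluated more than once" warning only describes older versions where it was still a macro.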

gpshead · 2022-02-09T19:32:47Z

Include/cpython/unicodeobject.h

+    if (kind == PyUnicode_1BYTE_KIND) {
+        return PyUnicode_1BYTE_DATA(unicode)[index];
+    }
+    else if (PyUnicode_KIND(unicode) == PyUnicode_2BYTE_KIND) {


as a function this no longer needs to be called twice. (and the comment above becomes less true)
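A sketch of the suggested shape, where the kind is looked up once and reused for every branch. The types and names here (`str_obj`, `str_get_kind()`, `str_read_char()`) are hypothetical stand-ins, not the real definitions from unicodeobject.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real definitions live in CPython's headers. */
enum str_kind { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

typedef struct {
    enum str_kind kind;
    const void *data;
    size_t length;
} str_obj;

static inline enum str_kind str_get_kind(const str_obj *s)
{
    return s->kind;
}

static inline uint32_t str_read_char(const str_obj *s, size_t index)
{
    assert(index < s->length);
    enum str_kind kind = str_get_kind(s);   /* looked up once */
    if (kind == KIND_1BYTE) {
        return ((const uint8_t *)s->data)[index];
    }
    if (kind == KIND_2BYTE) {
        return ((const uint16_t *)s->data)[index];
    }
    assert(kind == KIND_4BYTE);
    return ((const uint32_t *)s->data)[index];
}
```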

gpshead · 2022-02-09T20:30:22Z

Include/cpython/unicodeobject.h

@@ -280,26 +341,24 @@ PyAPI_FUNC(int) _PyUnicode_CheckConsistency(
 #define SSTATE_INTERNED_IMMORTAL 2

 /* Use only if you know it's a string */
-#define PyUnicode_CHECK_INTERNED(op) \
-    (((PyASCIIObject *)(op))->state.interned)
+static inline int PyUnicode_CHECK_INTERNED(PyObject *op) {


To keep the cast from going away, consider doing what Py_INCREF did and add indirection through a macro for the cast:

#define PyUnicode_CHECK_INTERNED(op) \
    _PyUnicode_CHECK_INTERNED(_PyASCIIObject_CAST(op))

vstinner · 2022-02-09

The SC asked not to add such a macro :-) You're right that without such a macro, there is a risk of introducing new compiler warnings.

If possible I would prefer to keep PyObject* for functions in unicodeobject.c, since it's the type used for arguments in existing functions and the type returned by functions creating strings like PyUnicode_New(), PyUnicode_FromString(), etc.

Maybe for this specific header file, we can avoid casts.

For me, PyASCIIObject is an implementation detail which should be hidden. If possible, it should even be moved to the internal C API, but that's a way larger topic which may require a PEP ;-)
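For reference, the indirection pattern under discussion can be sketched in standalone C. All names below (`base_obj`, `ascii_obj`, `ASCII_OBJ_CAST`, `check_interned_impl`, `check_interned`) are hypothetical; the sketch only illustrates the technique and is not what this PR implements.

```c
#include <assert.h>

/* Hypothetical stand-ins for a generic object header and a concrete type
   whose first member is that header. */
typedef struct { int refcnt; } base_obj;
typedef struct { base_obj base; int interned; } ascii_obj;

#define ASCII_OBJ_CAST(op) ((ascii_obj *)(op))

/* The typed static inline function does the actual work. */
static inline int check_interned_impl(ascii_obj *op)
{
    return op->interned;
}

/* A thin macro keeps casting the argument, so callers that pass a
   "base_obj *" (or any other pointer type) compile without new warnings,
   as they did when the whole accessor was a macro. */
#define check_interned(op) check_interned_impl(ASCII_OBJ_CAST(op))
```

The trade-off is the one raised in the thread: the macro preserves source compatibility for callers with mismatched pointer types, at the cost of keeping a macro layer that hides type errors.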

vstinner · 2022-02-09T22:05:35Z

The change making PyUnicode_KIND() return an "enum PyUnicode_Kind" adds new warnings:

comparison of integer expressions of different signedness: ‘enum PyUnicode_Kind’ and ‘int’ [-Wsign-compare]

Changing PyUnicode_KIND() should be done in a separate PR. I'm not sure what the best return type is. I would prefer not to add new compiler warnings!
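A small sketch of where such a warning comes from, assuming GCC's usual behaviour of choosing an unsigned underlying type for an enum without negative enumerators. The names (`demo_kind`, `kind_as_enum`, `kind_as_int`, `same_kind`) are hypothetical:

```c
#include <assert.h>

/* Hypothetical enum with no negative enumerators. GCC normally makes
   such an enum compatible with unsigned int. */
enum demo_kind { DEMO_1BYTE = 1, DEMO_2BYTE = 2, DEMO_4BYTE = 4 };

/* Accessor returning the enum type: comparing its result directly with
   an int may trigger -Wsign-compare in existing callers. */
static inline enum demo_kind kind_as_enum(int raw)
{
    return (enum demo_kind)raw;
}

/* Accessor that keeps returning int: existing int comparisons stay
   warning-free. */
static inline int kind_as_int(int raw)
{
    return raw;
}

static int same_kind(int raw, int expected)
{
    /* Writing "kind_as_enum(raw) == expected" is the pattern that can
       warn; the explicit cast makes the conversion visible instead. */
    return (int)kind_as_enum(raw) == expected;
}
```

Keeping the int return type avoids touching callers at all, which is consistent with the "don't change return type" clarification later made to PEP 670.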

vstinner · 2022-02-09T22:25:08Z

Macros not casting their arguments:

PyUnicode_1BYTE_DATA()
PyUnicode_2BYTE_DATA()
PyUnicode_4BYTE_DATA()
PyUnicode_AS_DATA()
PyUnicode_DATA()
PyUnicode_GET_DATA_SIZE()
PyUnicode_MAX_CHAR_VALUE()
PyUnicode_READ()
PyUnicode_READ_CHAR()
PyUnicode_WRITE()
_PyUnicodeWriter_Prepare()
_PyUnicodeWriter_PrepareKind()

Macro casting its argument to PyObject*:

PyUnicode_READY()

Macros using a cast in their implementation:

Cast to PyASCIIObject* (and sometimes to other types):
- PyUnicode_AS_UNICODE()
- PyUnicode_CHECK_INTERNED()
- PyUnicode_GET_LENGTH()
- PyUnicode_GET_SIZE()
- PyUnicode_IS_ASCII()
- PyUnicode_IS_COMPACT()
- PyUnicode_IS_COMPACT_ASCII()
- PyUnicode_IS_READY()
- PyUnicode_KIND()
- _PyUnicode_COMPACT_DATA()
Cast to PyUnicodeObject*:
- _PyUnicode_NONCOMPACT_DATA()

The majority of macros use PyObject* for their Python str object parameter.

PyUnicode_READ() and PyUnicode_WRITE() expect (kind, data, index) and (kind, data, index, value) arguments: no Python object.
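To make the (kind, data, index, value) calling convention concrete, here is a minimal standalone sketch with assertions on the maximum value per kind, as the PR description mentions for PyUnicode_WRITE(). The names `demo_kind`, `demo_write()` and `demo_read()` are hypothetical and the code does not depend on the CPython headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the character width of a buffer. */
enum demo_kind { DEMO_1BYTE = 1, DEMO_2BYTE = 2, DEMO_4BYTE = 4 };

/* "data" is a non-const pointer because the buffer is modified. */
static inline void demo_write(enum demo_kind kind, void *data,
                              size_t index, uint32_t value)
{
    if (kind == DEMO_1BYTE) {
        assert(value <= 0xffu);        /* must fit in one byte */
        ((uint8_t *)data)[index] = (uint8_t)value;
    }
    else if (kind == DEMO_2BYTE) {
        assert(value <= 0xffffu);      /* must fit in two bytes */
        ((uint16_t *)data)[index] = (uint16_t)value;
    }
    else {
        assert(kind == DEMO_4BYTE);
        assert(value <= 0x10ffffu);    /* highest Unicode code point */
        ((uint32_t *)data)[index] = value;
    }
}

/* Reading only needs a const pointer. */
static inline uint32_t demo_read(enum demo_kind kind, const void *data,
                                 size_t index)
{
    if (kind == DEMO_1BYTE) {
        return ((const uint8_t *)data)[index];
    }
    if (kind == DEMO_2BYTE) {
        return ((const uint16_t *)data)[index];
    }
    assert(kind == DEMO_4BYTE);
    return ((const uint32_t *)data)[index];
}
```

Neither function takes a string object, which matches the observation above that these two accessors work on a raw (kind, data) pair.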

vstinner · 2022-02-23T23:31:32Z

This PR was an example. I updated PEP 670 based on the discussion on this PR. If PEP 670 is accepted, I will rewrite this PR with smaller changes to ease the review.

vstinner added the skip news label Feb 8, 2022

the-knights-who-say-ni added the CLA signed label Feb 8, 2022

bedevere-bot added the awaiting core review label Feb 8, 2022

erlend-aasland reviewed Feb 9, 2022

gpshead reviewed Feb 9, 2022

vstinner mentioned this pull request Feb 21, 2022

PEP 670: clarify cast; don't change return type python/peps#2349

Merged

vstinner closed this Feb 23, 2022

vstinner deleted the unicode_static_inline branch February 23, 2022 23:30

This was referenced Apr 19, 2022

[C API] PEP 670: Convert macros to functions in the Python C API #89653

Closed

gh-89653: PEP 670: Convert unicodeobject.h macros to functions #91696

Closed
