Prepare for removing the legacy Unicode C API #80527

serhiy-storchaka · 2019-03-18T14:23:02Z

BPO	36346
Nosy	@malemburg, @ronaldoussoren, @pitrou, @scoder, @vstinner, @ezio-melotti, @methane, @serhiy-storchaka, @willingc, @corona10, @miss-islington, @shihai1991, @iritkatriel
PRs
bpo-36346: Prepare for removing the legacy Unicode C API. #12409
bpo-36346: array: Don't use deprecated APIs #19653
bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs #20878
bpo-36346: Document removal schedule of deprecate APIs #20879
bpo-36346: Emit DeprecationWarning for PyArg_Parse() with 'u' or 'Z'. #20927
[3.9] bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878) #20932
bpo-36346: Raise DeprecationWarning when creating legacy Unicode #20933
bpo-36346: Make unicodeobject.h C89 compatible #20934
[3.9] bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878) #20941
bpo-36346: Prepare for removing the legacy Unicode C API (AC only). #21223
bpo-36346: Undeprecate private function _PyUnicode_AsUnicode(). #21336
bpo-36346: Do not use legacy Unicode C API in ctypes. #21429
bpo-36346: Make using the legacy Unicode C API optional #21437
bpo-36346: Doc: Update removal schedule of legacy Unicode #21479
[3.9] bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479) #21738
[3.8] bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479) #21739
[3.9] bpo-36346: Document removal schedule of deprecate APIs (GH-20879) #24625
[3.8] bpo-36346: Document removal schedule of deprecate APIs (GH-20879) #24626
Dependencies	bpo-36387: Refactor getenvironment() in _winapi.c

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2022-01-29.03:10:20.373>
created_at = <Date 2019-03-18.14:23:01.818>
labels = ['interpreter-core', 'expert-C-API', '3.8', 'expert-unicode']
title = 'Prepare for removing the legacy Unicode C API'
updated_at = <Date 2022-01-29.03:10:20.372>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2022-01-29.03:10:20.372>
actor = 'methane'
assignee = 'none'
closed = True
closed_date = <Date 2022-01-29.03:10:20.373>
closer = 'methane'
components = ['Interpreter Core', 'Unicode', 'C API']
creation = <Date 2019-03-18.14:23:01.818>
creator = 'serhiy.storchaka'
dependencies = ['36387']
files = []
hgrepos = []
issue_num = 36346
keywords = ['patch']
message_count = 36.0
messages = ['338228', '338284', '338285', '338286', '338289', '338290', '338331', '338340', '338343', '338344', '338565', '339860', '355535', '368615', '368653', '371730', '371731', '371734', '371735', '371745', '371795', '372656', '372658', '373032', '373035', '373450', '373478', '374855', '374856', '374857', '387513', '387545', '387546', '387828', '412025', '412047']
nosy_count = 13.0
nosy_names = ['lemburg', 'ronaldoussoren', 'pitrou', 'scoder', 'vstinner', 'ezio.melotti', 'methane', 'serhiy.storchaka', 'willingc', 'corona10', 'miss-islington', 'shihai1991', 'iritkatriel']
pr_nums = ['12409', '19653', '20878', '20879', '20927', '20932', '20933', '20934', '20941', '21223', '21336', '21429', '21437', '21479', '21738', '21739', '24625', '24626']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue36346'
versions = ['Python 3.8']


serhiy-storchaka · 2019-03-18T14:23:02Z

The legacy Unicode C API was deprecated in 3.3. Its support consumes resources: more memory usage by Unicode objects, additional code for handling Unicode objects created with the legacy C API. Currently every Unicode object has a cache for the wchar_t representation.

The proposed PR adds two compile time options: HAVE_UNICODE_WCHAR_CACHE and USE_UNICODE_WCHAR_CACHE. Both are set to 1 by default.

If USE_UNICODE_WCHAR_CACHE is set to 0, CPython will not use the wchar_t cache internally. The new wchar_t based C API will be used instead of the Py_UNICODE based C API. This can add a small performance penalty for creating a temporary buffer for the wchar_t representation. On the other hand, this will decrease the long-term memory usage. This build is binary compatible with the standard build, and third-party extensions can still use the legacy Unicode C API.
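To make the trade-off concrete, here is a minimal C sketch. It is not CPython's implementation: the `demo_str` type and all `demo_*` functions are hypothetical names invented for illustration. The cached variant keeps a wchar_t copy alive for as long as the object lives, while the copy variant allocates a temporary buffer on every call and leaves ownership with the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical string object; this is NOT CPython's real layout. */
typedef struct {
    const char *ascii;   /* canonical ASCII data, NUL-terminated */
    size_t length;       /* number of characters */
    wchar_t *wcache;     /* lazily filled wchar_t copy, or NULL */
} demo_str;

/* Widen the ASCII data into a freshly allocated wchar_t buffer. */
static wchar_t *demo_widen(const demo_str *s)
{
    size_t i;
    wchar_t *buf = malloc((s->length + 1) * sizeof(wchar_t));
    if (buf == NULL) {
        return NULL;
    }
    for (i = 0; i < s->length; i++) {
        buf[i] = (wchar_t)(unsigned char)s->ascii[i];
    }
    buf[s->length] = L'\0';
    return buf;
}

/* Cached style: the object owns the buffer for its whole lifetime. */
const wchar_t *demo_as_wide_cached(demo_str *s)
{
    assert(s != NULL);
    if (s->wcache == NULL) {
        s->wcache = demo_widen(s);
    }
    return s->wcache;
}

/* Temporary-buffer style: the caller owns the result and must free() it. */
wchar_t *demo_as_wide_copy(const demo_str *s)
{
    assert(s != NULL);
    return demo_widen(s);
}

/* Release the cache, as the object's destructor would. */
void demo_clear(demo_str *s)
{
    assert(s != NULL);
    free(s->wcache);
    s->wcache = NULL;
}
```

In this sketch the cached call pays for the conversion once but holds the extra memory until `demo_clear()`, whereas the copy call pays for a conversion each time and releases the memory as soon as the caller frees the buffer.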

If HAVE_UNICODE_WCHAR_CACHE is set to 0, the wchar_t cache will be completely removed. The legacy Unicode C API will not be available, and functions that need it (e.g. PyArg_ParseTuple() with the "u" format unit) will always fail. This build is binary incompatible with the standard build if you use the legacy or non-stable Unicode C API.

I hope that these options will help third-party projects prepare for the removal of the legacy Unicode C API in the future.

scoder · 2019-03-18T20:05:26Z

Thanks for implementing this, Serhiy.
Since these C macros are public, should they be named PY_* ?

scoder · 2019-03-18T20:46:26Z

I think this is a good preparation that makes it clear what code will eventually be removed, and allows testing without it.

No idea how happy Windows users will be about all of this, but I consider it quite an overall improvement for the Unicode implementation. Once this gets removed, that is.

Removing the "unicode_internal" codec entirely (which is changed by this PR) is discussed in bpo-36297.

malemburg · 2019-03-18T21:06:40Z

I'd change the title of this bpo item to "Prepare for removing the wchar_t caching in the Unicode C API".

Note that the wchar_t caching was put in place to allow for external applications and C code to easily and efficiently interface with Python. By removing it you will slow down such code significantly, esp. on Linux and Windows where wchar_t code is fairly common (one of the reasons we added UCS4 in Python was to make the interaction with Linux wchar_t code more efficient).

This should be clearly mentioned as part of the change and the compile time flags.

BTW: You have a few other changes in the PR which don't have anything to do with the intended removal:

-    envsize = PySequence_Fast_GET_SIZE(keys);
-    if (PySequence_Fast_GET_SIZE(values) != envsize) {
+    envsize = PyList_GET_SIZE(keys);
+    if (PyList_GET_SIZE(values) != envsize) {

scoder · 2019-03-18T21:33:36Z

I had also looked through the unrelated changes, and while, yes, they are unrelated, they seemed to be correct and reasonable modernisations of the code base while touching it. They could be moved to a separate PR, but there is a relatively high risk of conflicts, so I'm ok with keeping them in here for now.

malemburg · 2019-03-18T21:53:01Z

On 18.03.2019 22:33, Stefan Behnel wrote:

> I had also looked through the unrelated changes, and while, yes, they are unrelated, they seemed to be correct and reasonable modernisations of the code base while touching it. They could be moved to a separate PR, but there is a relatively high risk of conflicts, so I'm ok with keeping them in here for now.

I don't think changing sequence iteration to list iteration only
is something that should be hidden in a wchar_t removal PR.

My guess is that these changes have made it into the PR by mistake.
They deserve a separate PR and discussion.

methane · 2019-03-19T08:58:10Z

I'm not sure we need two options.
Does USE_UNICODE_WCHAR_CACHE=0 really help prepare for the removal?

serhiy-storchaka · 2019-03-19T10:45:21Z

I wrote this PR just to see how much code would have to change after removing the wchar_t cache, and what the performance impact would be. Get it, experiment with it, run tests and benchmarks. I think we could set USE_UNICODE_WCHAR_CACHE to 0 by default. If this causes significant trouble, it is easy to set it back to 1.

I am going to add configure options for switching these options. On Windows you will still need to edit the config file manually.

> I'm not sure we need two options.
> Does USE_UNICODE_WCHAR_CACHE=0 really help prepare for the removal?

Currently some of the legacy functions are not decorated with Py_DEPRECATED, because this would cause compiler warnings in the code that uses these functions. If USE_UNICODE_WCHAR_CACHE is 0, these functions will no longer be used, so we can add compiler warnings for them.
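As a rough illustration of what such decoration does, the sketch below defines a stand-in macro, `DEMO_DEPRECATED`, modeled loosely on the idea behind Py_DEPRECATED. The macro, the `demo_text` type, and both functions are hypothetical and exist only for this example. Any call site of the decorated function produces a compile-time warning on compilers that support the attribute, which is why decorating functions still used internally would be noisy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a deprecation macro such as Py_DEPRECATED.
   The version argument is documentation only and is not expanded. */
#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_DEPRECATED(version_unused) __attribute__((__deprecated__))
#elif defined(_MSC_VER)
#  define DEMO_DEPRECATED(version_unused) __declspec(deprecated)
#else
#  define DEMO_DEPRECATED(version_unused)
#endif

/* Illustrative text object; not a CPython type. */
typedef struct {
    const char *data;
    size_t length;
} demo_text;

/* Replacement API: calling it produces no warning. */
size_t demo_get_length(const demo_text *t)
{
    assert(t != NULL);
    return t->length;
}

/* Legacy API: the declaration carries the attribute, so every caller
   gets a deprecation warning at compile time. */
DEMO_DEPRECATED(3.3) size_t demo_get_size(const demo_text *t);

/* Defining the function is not a use, so this does not warn. */
size_t demo_get_size(const demo_text *t)
{
    assert(t != NULL);
    return t->length;
}
```

Code that still calls `demo_get_size()` keeps compiling and working; it only becomes noisy, which gives callers time to switch to `demo_get_length()`.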

> I don't think changing sequence iteration to list iteration only
> is something that should be hidden in a wchar_t removal PR.

getenvironment() is the function that has been rewritten to the new API without preserving the old variant. Since the code was rewritten so much, I performed some code clean up. PyMapping_Keys() and PyMapping_Values() always return a list now, so that using the PySequence_Fast API is superfluous. They could return a tuple in the past, but this provoked bugs because the user code used PyList API for it.

I'll open a separate issue for this.

> Since these C macros are public, should they be named PY_* ?

CPython configuration macros (like HAVE_ACOSH or USE_COMPUTED_GOTOS) do not have the PY_ prefix.

methane · 2019-03-19T11:21:51Z

FYI, I had created PR 12340 which removes use of deprecated API in ctypes.

ronaldoussoren · 2019-03-19T11:47:08Z

One thing to keep in mind: HAVE_UNICODE_WCHAR_CACHE == 1 and HAVE_UNICODE_WCHAR_CACHE == 0 have a different ABI due to a different struct layout. This should probably affect the ABI tag for extension modules.
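The layout concern can be sketched in plain C. The two header structs below are hypothetical and deliberately simplified; they are not CPython's actual object definitions. They only show that dropping one cache field changes the header size, and therefore the offset at which an extension compiled against the other layout would look for character data.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical header layouts; field names are illustrative only. */
typedef struct {
    size_t refcount;
    size_t length;
    wchar_t *wstr;        /* cache pointer present in this build */
} demo_header_with_cache;

typedef struct {
    size_t refcount;
    size_t length;        /* no cache pointer in this build */
} demo_header_without_cache;

/* In this sketch, character data is stored right after the header. */
const char *demo_data_after(const void *object, size_t header_size)
{
    assert(object != NULL);
    return (const char *)object + header_size;
}

/* Returns 1 only if both builds would place the data at the same offset. */
int demo_layouts_are_abi_compatible(void)
{
    return sizeof(demo_header_with_cache)
        == sizeof(demo_header_without_cache);
}
```

Because the sizes differ, a binary built for one layout reads the wrong bytes under the other, which is the reason for suggesting a distinct ABI tag.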

pitrou · 2019-03-21T19:39:00Z

> The proposed PR adds two compile time options: HAVE_UNICODE_WCHAR_CACHE and USE_UNICODE_WCHAR_CACHE

I don't think this is a good approach. Most projects and developers don't recompile Python. It's especially a chore when you have many dependencies with C extensions, because you'll have to recompile them all as well.

I would recommend simply removing that cache.

methane · 2019-04-10T12:57:16Z

I don't think these ABI-incompatible options will be used by many people.
But it is helpful to find extensions that use legacy APIs before Python 3.10 is released.

I found that ujson and MarkupSafe used legacy APIs. I fixed MarkupSafe.
I don't care about ujson because it is a wrapper of a wchar_t-based C library
and there are enough JSON libraries.

I suppose there are some other packages on PyPI, but I'm not sure.

vstinner · 2019-10-28T11:35:14Z

I closed bpo-38604 as a duplicate. Copy of my messages.

msg355475 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-27 16:02

Python 3.3 deprecated the C API functions using Py_UNICODE type. Examples in the doc:

Currently, removal of these functions is scheduled for Python 4.0, but I would prefer that Python 4.0 not have a longer list of removed features than usual. So I'm trying to remove a few functions in Python 3.9, and to prepare the removal of others.

Py_UNICODE C API was mostly kept for backward compatibility with Python 2. Since Python 2 support ends at the end of the year, can we start to organize Py_UNICODE C API removal?

There are multiple questions:

Should we drop the whole API at once? Or can we/should we start by removing a few functions, and then the others?
Deprecation warnings are emitted at compilation. But I'm not aware of any DeprecationWarning emitted at runtime. IMHO we should emit DeprecationWarning at runtime during at least one release, since most developers ignore compilation warnings.

I propose to:

(Right now) write an exhaustive list of all deprecated APIs: functions, constants, types, etc.
Modify C code to emit DeprecationWarning at runtime in Python 3.9
Experiment with a modified Python without these APIs and test how many projects are broken by this removal: see PEP-608
Schedule the actual removal of all these APIs from Python 3.10

Honestly, if the removal causes too many issues, I'm fine with slowing it down. It's just a matter of clearly communicating our intent.

Maybe we should also announce the scheduled removal in What's New in Python 3.9 and on the capi-sig mailing list.

msg355478 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-27 16:15

> (Right now) write an exhaustive list of all deprecated APIs: functions, constants, types, etc.

I searched "4.0" in the documentation:

Py_UNICODE type
array.array: "u" type
PyArg_ParseTuple, Py_BuildValue: "u", "u#", "Z", "Z#" formats
PyUnicode_FromUnicode()
PyUnicode_GetSize(), PyUnicode_GET_SIZE()
PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_AS_DATA()
PyUnicode_AsUnicodeAndSize()
PyUnicode_AsUnicodeCopy()
PyUnicode_FromObject()
PyLong_FromUnicode()
PyUnicode_TransformDecimalToASCII()
PyUnicode_Encode()
PyUnicode_EncodeUTF7()
PyUnicode_EncodeUTF8()
PyUnicode_EncodeUTF32()
PyUnicode_EncodeUTF16()
PyUnicode_EncodeUnicodeEscape()
PyUnicode_EncodeRawUnicodeEscape()
PyUnicode_EncodeLatin1()
PyUnicode_EncodeASCII()
PyUnicode_EncodeMBCS()
PyUnicode_EncodeCharmap()
PyUnicode_TranslateCharmap()

msg355524 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-28 11:06

A preliminary step was to modify PyUnicode_AsWideChar() and PyUnicode_AsWideCharString() to remove the internal caching: it has been done in Python 3.8.0 with bpo-30863.

methane · 2020-05-11T06:37:32Z

New changeset d5d9a71 by Inada Naoki in branch 'master':
bpo-36346: array: Don't use deprecated APIs (GH-19653)
d5d9a71

vstinner · 2020-05-11T21:18:31Z

> bpo-36346: array: Don't use deprecated APIs (GH-19653)

Thanks INADA-san! Another nail into Py_UNICODE coffin!

methane · 2020-06-17T11:09:49Z

New changeset 2c4928d by Inada Naoki in branch 'master':
bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878)
2c4928d

vstinner · 2020-06-17T12:01:04Z

> bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878)

This change broke test_distutils on multiple buildbots. Examples:

methane · 2020-06-17T12:27:49Z

Oh, why can't I use C99?

/home/buildbot/buildarea/3.x.cstratak-RHEL7-ppc64le/build/Include/cpython/unicodeobject.h: In function ‘Py_UNICODE_FILL’:
/home/buildbot/buildarea/3.x.cstratak-RHEL7-ppc64le/build/Include/cpython/unicodeobject.h:56:5: error: ‘for’ loop initial declarations are only allowed in C99 mode
     for (Py_ssize_t i = 0; i < length; i++) {
     ^
/home/buildbot/buildarea/3.x.cstratak-RHEL7-ppc64le/build/Include/cpython/unicodeobject.h:56:5: note: use option -std=c99 or -std=gnu99 to compile your code

vstinner · 2020-06-17T12:30:50Z

> Oh, why can't I use C99?

PEP-7 requires C99 to build Python, but I think that we can try to keep C89 compatibility for the public header files (Python C API).
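For reference, the kind of change that keeps a header usable from C89 code is small. The sketch below is not the actual patch: `demo_fill` is a hypothetical function that mirrors the shape of the failing loop, with the counter declared at the top of the block instead of inside the `for` statement.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* C99 only:  for (ptrdiff_t i = 0; i < length; i++) { ... }         */
/* C89 style: declare the counter at the top of the enclosing block. */
void demo_fill(wchar_t *target, wchar_t value, ptrdiff_t length)
{
    ptrdiff_t i;  /* declared before any statement, as C89 requires */
    assert(target != NULL || length == 0);
    for (i = 0; i < length; i++) {
        target[i] = value;
    }
}
```

The loop behaves the same in both dialects; only the placement of the declaration differs, so the C89 form costs nothing for C99 users.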

methane · 2020-06-17T14:43:09Z

New changeset 8e34e92 by Inada Naoki in branch 'master':
bpo-36346: Make unicodeobject.h C89 compatible (GH-20934)
8e34e92

methane · 2020-06-18T08:31:23Z

New changeset 610a60c by Inada Naoki in branch '3.9':
bpo-36346: Add Py_DEPRECATED to deprecated unicode APIs (GH-20878)
610a60c

serhiy-storchaka · 2020-06-30T06:03:23Z

New changeset 349f76c by Serhiy Storchaka in branch 'master':
bpo-36346: Prepare for removing the legacy Unicode C API (AC only). (GH-21223)
349f76c

methane · 2020-06-30T06:27:03Z

New changeset 038dd0f by Inada Naoki in branch 'master':
bpo-36346: Raise DeprecationWarning when creating legacy Unicode (GH-20933)
038dd0f

serhiy-storchaka · 2020-07-05T15:13:29Z

There is no need to deprecate _PyUnicode_AsUnicode. It is a private function. Undeprecating it will make the code clearer.

serhiy-storchaka · 2020-07-05T15:53:55Z

New changeset b3dd5cd by Serhiy Storchaka in branch 'master':
bpo-36346: Undeprecate private function _PyUnicode_AsUnicode(). (GH-21336)
b3dd5cd

serhiy-storchaka · 2020-07-10T08:17:26Z

New changeset d878349 by Serhiy Storchaka in branch 'master':
bpo-36346: Do not use legacy Unicode C API in ctypes. (GH-21429)
d878349

serhiy-storchaka · 2020-07-10T20:26:14Z

New changeset 4c8f09d by Serhiy Storchaka in branch 'master':
bpo-36346: Make using the legacy Unicode C API optional (GH-21437)
4c8f09d

methane · 2020-08-05T01:49:18Z

New changeset 270b4ad by Inada Naoki in branch 'master':
bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479)
270b4ad

miss-islington · 2020-08-05T01:56:15Z

New changeset ea68063 by Miss Islington (bot) in branch '3.9':
bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479)
ea68063

miss-islington · 2020-08-05T01:57:14Z

New changeset f0e030c by Miss Islington (bot) in branch '3.8':
bpo-36346: Doc: Update removal schedule of legacy Unicode (GH-21479)
f0e030c

methane · 2021-02-22T13:12:22Z

New changeset 91a639a by Inada Naoki in branch 'master':
bpo-36346: Emit DeprecationWarning for PyArg_Parse() with 'u' or 'Z'. (GH-20927)
91a639a

methane · 2021-02-22T23:06:57Z

New changeset 2d6f2ee by Inada Naoki in branch 'master':
bpo-36346: Document removal schedule of deprecate APIs (GH-20879)
2d6f2ee

miss-islington · 2021-02-22T23:31:11Z

New changeset 93853b7 by Miss Islington (bot) in branch '3.9':
bpo-36346: Document removal schedule of deprecate APIs (GH-20879)
93853b7

willingc · 2021-03-01T02:14:40Z

New changeset 346afeb by Miss Islington (bot) in branch '3.8':
bpo-36346: Document removal schedule of deprecate APIs (GH-20879) (GH-24626)
346afeb

iritkatriel · 2022-01-28T19:05:09Z

Is there anything left to do here?

methane · 2022-01-29T03:09:44Z

No. I am just waiting for Python 3.11 to reach beta.

Deprecate functions:

* PyUnicode_AS_DATA()
* PyUnicode_AS_UNICODE()
* PyUnicode_GET_DATA_SIZE()
* PyUnicode_GET_SIZE()

Previously, these functions were macros and so it wasn't possible to decorate them with Py_DEPRECATED().

gh-80527: Change support.requires_legacy_unicode_capi() (GH-108438)

The decorator must now be called with parentheses: @support.requires_legacy_unicode_capi() instead of @support.requires_legacy_unicode_capi. The implementation now imports _testcapi only when the decorator is called, so "import test.support" no longer imports the _testcapi extension.

The change was backported in #108446 (cherry picked from commit 995f4c4). Co-authored-by: Victor Stinner <vstinner@python.org>

serhiy-storchaka added the 3.8, interpreter-core, and topic-unicode labels Mar 18, 2019

vstinner added the topic-C-API label Dec 9, 2019

methane closed this as completed Jan 29, 2022

ezio-melotti transferred this issue from another repository Apr 10, 2022

vstinner mentioned this issue Apr 26, 2022

[C API] PEP 670: Convert macros to functions in the Python C API #89653

Closed

bedevere-bot mentioned this issue Aug 24, 2023

gh-80527: Change support.requires_legacy_unicode_capi() #108438

Merged

bedevere-bot mentioned this issue Aug 24, 2023

[3.12] gh-80527: Change support.requires_legacy_unicode_capi() (GH-108438) #108446

Merged
