gh-110289: C API: Add PyUnicode_EqualToUTF8() function #110297

Merged
18 changes: 14 additions & 4 deletions Doc/c-api/unicode.rst
@@ -1396,18 +1396,28 @@ They all return ``NULL`` or ``-1`` if an exception occurs.
:c:func:`PyErr_Occurred` to check for errors.


-.. c:function:: int PyUnicode_EqualToUTF8(PyObject *unicode, const char *string)
+.. c:function:: int PyUnicode_EqualToUTF8AndSize(PyObject *unicode, const char *string, Py_ssize_t size)
Member

What do you think about renaming string to utf8_str? The utf8_ prefix would be another way to document that the argument is expected to be UTF-8 encoded, and it would also make it easier (for me) to see that the second argument is a byte string, since the name string is quite generic.

Member Author

It is part of a bigger issue. See #62897.


-Compare a Unicode object with a UTF-8 or ASCII encoded C string
-and return true (``1``) if they are equal, or false (``0``) otherwise.
-If the Unicode object contains null or surrogate characters or
+Compare a Unicode object with a char buffer which is interpreted as
+being UTF-8 or ASCII encoded and return true (``1``) if they are equal,
+or false (``0``) otherwise.
+If the Unicode object contains surrogate characters or
the C string is not valid UTF-8, false (``0``) is returned.

This function does not raise exceptions.

.. versionadded:: 3.13
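
Read as a specification, the description above can be modelled in a few lines of Python. This is only a sketch of the documented semantics, not how the C function is implemented; the name `model_equal_to_utf8_and_size` is hypothetical, and the sketch assumes `0 <= size <= len(buf)`.

```python
from typing import Optional


def model_equal_to_utf8_and_size(text: str, buf: bytes,
                                 size: Optional[int] = None) -> int:
    """Hypothetical pure-Python model of the documented behaviour."""
    if size is None:
        size = len(buf)
    try:
        # Strict UTF-8 encoding rejects lone surrogates, which the
        # documentation says always compare as not equal.
        encoded = text.encode("utf-8")
    except UnicodeEncodeError:
        return 0
    # Only the first `size` bytes take part in the comparison; embedded
    # null bytes are ordinary data here. A buffer that is not valid UTF-8
    # can never match a valid encoding, so it yields 0 as well.
    return int(encoded == buf[:size])
```

In this model a trailing null character compares equal to a trailing null byte, because the explicit size makes the null part of the data.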


+.. c:function:: int PyUnicode_EqualToUTF8(PyObject *unicode, const char *string)
+
+Similar to :c:func:`PyUnicode_EqualToUTF8AndSize`, but compute the string
+length using :c:func:`!strlen`.
+If the Unicode object contains null characters, false (``0``) is returned.
+
+.. versionadded:: 3.13
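
The effect of computing the length with `strlen` can be illustrated with a small Python model. It is a sketch under the assumption that the buffer is treated as a C string ending at its first null byte; `c_strlen` and `model_equal_to_utf8` are hypothetical helper names, not CPython APIs.

```python
def c_strlen(buf: bytes) -> int:
    """Length up to (not including) the first null byte, like C strlen()."""
    end = buf.find(b"\0")
    return len(buf) if end == -1 else end


def model_equal_to_utf8(text: str, buf: bytes) -> int:
    """Hypothetical model of the strlen-based variant."""
    size = c_strlen(buf)
    try:
        encoded = text.encode("utf-8")  # fails for lone surrogates
    except UnicodeEncodeError:
        return 0
    # The buffer is cut at its first null byte, so a Unicode string that
    # contains a null character can never match it.
    return int(encoded == buf[:size])
```

Because the comparison stops at the first null byte, text with an embedded or trailing null character is always reported as not equal in this model.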


.. c:function:: int PyUnicode_CompareWithASCIIString(PyObject *uni, const char *string)

Compare a Unicode object, *uni*, with *string* and return ``-1``, ``0``, ``1`` for less
1 change: 1 addition & 0 deletions Doc/data/stable_abi.dat

Some generated files are not rendered by default.

8 changes: 4 additions & 4 deletions Doc/whatsnew/3.13.rst
@@ -1003,10 +1003,10 @@ New Features
functions on Python 3.11 and 3.12.
(Contributed by Victor Stinner in :gh:`107073`.)

-* Add :c:func:`PyUnicode_EqualToUTF8` function: compare Unicode object with
-a :c:expr:`const char*` UTF-8 encoded string and return true (``1``)
-if they are equal, or false (``0``) otherwise.
-This function does not raise exceptions.
+* Add :c:func:`PyUnicode_EqualToUTF8AndSize` and :c:func:`PyUnicode_EqualToUTF8`
+functions: compare Unicode object with a :c:expr:`const char*` UTF-8 encoded
+string and return true (``1``) if they are equal, or false (``0``) otherwise.
+These functions do not raise exceptions.
(Contributed by Serhiy Storchaka in :gh:`110289`.)

* Add :c:func:`PyThreadState_GetUnchecked()` function: similar to
1 change: 1 addition & 0 deletions Include/unicodeobject.h
@@ -963,6 +963,7 @@ PyAPI_FUNC(int) PyUnicode_CompareWithASCIIString(
This function does not raise exceptions. */

PyAPI_FUNC(int) PyUnicode_EqualToUTF8(PyObject *, const char *);
+PyAPI_FUNC(int) PyUnicode_EqualToUTF8AndSize(PyObject *, const char *, Py_ssize_t);
#endif

/* Rich compare two strings and return one of the following:
58 changes: 53 additions & 5 deletions Lib/test/test_capi/test_unicode.py
@@ -1320,6 +1320,7 @@ def test_equaltoutf8(self):
self.assertEqual(equaltoutf8(s + 'x', b + b'x'), 1)
self.assertEqual(equaltoutf8(s + 'x', b + b'y'), 0)
self.assertEqual(equaltoutf8(s + '\0', b + b'\0'), 0)
+self.assertEqual(equaltoutf8(s + '\0', b), 0)
self.assertEqual(equaltoutf8(s2, b + b'x'), 0)
self.assertEqual(equaltoutf8(s2, b[:-1]), 0)
self.assertEqual(equaltoutf8(s2, b[:-1] + b'x'), 0)
@@ -1337,19 +1338,66 @@ def test_equaltoutf8(self):
self.assertEqual(equaltoutf8('\ud801',
'\ud801'.encode("utf8", "surrogatepass")), 0)

+@support.cpython_only
+@unittest.skipIf(_testcapi is None, 'need _testcapi module')
+def test_equaltoutf8andsize(self):
+"""Test PyUnicode_EqualToUTF8AndSize()"""
+from _testcapi import unicode_equaltoutf8andsize as equaltoutf8andsize
+from _testcapi import unicode_asutf8andsize as asutf8andsize
+
+strings = [
+'abc', '\xa1\xa2\xa3', '\u4f60\u597d\u4e16',
+'\U0001f600\U0001f601\U0001f602',
+'\U0010ffff',
+]
+for s in strings:
+# Call PyUnicode_AsUTF8AndSize() which creates the UTF-8
+# encoded string cached in the Unicode object.
+asutf8andsize(s, 0)
+b = s.encode()
+self.assertEqual(equaltoutf8andsize(s, b), 1) # Use the UTF-8 cache.
+s2 = b.decode() # New Unicode object without the UTF-8 cache.
+self.assertEqual(equaltoutf8andsize(s2, b), 1)
+self.assertEqual(equaltoutf8andsize(s + 'x', b + b'x'), 1)
+self.assertEqual(equaltoutf8andsize(s + 'x', b + b'y'), 0)
+self.assertEqual(equaltoutf8andsize(s + '\0', b + b'\0'), 1)
+self.assertEqual(equaltoutf8andsize(s + '\0', b), 0)
+self.assertEqual(equaltoutf8andsize(s2, b + b'x'), 0)
+self.assertEqual(equaltoutf8andsize(s2, b[:-1]), 0)
+self.assertEqual(equaltoutf8andsize(s2, b[:-1] + b'x'), 0)
+# Not null-terminated,
+self.assertEqual(equaltoutf8andsize(s, b + b'x', len(b)), 1)
+self.assertEqual(equaltoutf8andsize(s2, b + b'x', len(b)), 1)
+self.assertEqual(equaltoutf8andsize(s + '\0', b + b'\0x', len(b) + 1), 1)
+self.assertEqual(equaltoutf8andsize(s2, b, len(b) - 1), 0)
+
+# embedded null chars/bytes
+self.assertEqual(equaltoutf8andsize('abc', b'abc\0def\0'), 0)
+self.assertEqual(equaltoutf8andsize('a\0bc', b'abc'), 0)
+self.assertEqual(equaltoutf8andsize('abc', b'a\0bc'), 0)
+
+# Surrogate characters are always treated as not equal
+self.assertEqual(equaltoutf8andsize('\udcfe',
+'\udcfe'.encode("utf8", "surrogateescape")), 0)
+self.assertEqual(equaltoutf8andsize('\udcfe',
+'\udcfe'.encode("utf8", "surrogatepass")), 0)
+self.assertEqual(equaltoutf8andsize('\ud801',
+'\ud801'.encode("utf8", "surrogatepass")), 0)
+
def check_not_equal_encoding(text, encoding):
-self.assertEqual(equaltoutf8(text, text.encode(encoding)), 0)
+self.assertEqual(equaltoutf8andsize(text, text.encode(encoding)), 0)
self.assertNotEqual(text.encode(encoding), text.encode("utf8"))

# Strings encoded to other encodings are not equal to expected UTF8-encoding string
check_not_equal_encoding('Stéphane', 'latin1')
check_not_equal_encoding('Stéphane', 'utf-16-le') # embedded null characters
check_not_equal_encoding('北京市', 'gbk')

-# CRASHES equaltoutf8(b'abc', b'abc')
-# CRASHES equaltoutf8([], b'abc')
-# CRASHES equaltoutf8(NULL, b'abc')
-# CRASHES equaltoutf8('abc', NULL)
+# CRASHES equaltoutf8andsize('abc', b'abc', -1)
+# CRASHES equaltoutf8andsize(b'abc', b'abc')
+# CRASHES equaltoutf8andsize([], b'abc')
+# CRASHES equaltoutf8andsize(NULL, b'abc')
+# CRASHES equaltoutf8andsize('abc', NULL)
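
The encoding checks in this test rely on the same text producing different bytes under different codecs. A small standalone illustration using only `str.encode` follows; the helper name `differs_from_utf8` is made up for this sketch and does not appear in the test suite.

```python
def differs_from_utf8(text: str, encoding: str) -> bool:
    """True if `text` encoded with `encoding` is not its UTF-8 encoding."""
    return text.encode(encoding) != text.encode("utf-8")


# The same (text, encoding) pairs that the test exercises.
samples = [("Stéphane", "latin1"), ("Stéphane", "utf-16-le"), ("北京市", "gbk")]
results = [differs_from_utf8(text, enc) for text, enc in samples]

# UTF-16-LE output contains null bytes, which is why the test comment
# mentions embedded null characters for that case.
has_null_bytes = b"\0" in "Stéphane".encode("utf-16-le")
```

Pure ASCII text is the exception: it encodes identically under Latin-1 and UTF-8, so it would not be a useful sample here.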

@support.cpython_only
@unittest.skipIf(_testcapi is None, 'need _testcapi module')
1 change: 1 addition & 0 deletions Lib/test/test_stable_abi_ctypes.py

Some generated files are not rendered by default.

@@ -1 +1 @@
-Add :c:func:`PyUnicode_EqualToUTF8` function.
+Add :c:func:`PyUnicode_EqualToUTF8AndSize` and :c:func:`PyUnicode_EqualToUTF8` functions.
2 changes: 2 additions & 0 deletions Misc/stable_abi.toml
@@ -2462,3 +2462,5 @@
added = '3.13'
[function.PyUnicode_EqualToUTF8]
added = '3.13'
+[function.PyUnicode_EqualToUTF8AndSize]
+added = '3.13'
Member

Unrelated question, but is there a plan to generate this file from Doc/data/stable_abi.dat or the reverse?

Member Author

I think that Doc/data/stable_abi.dat is generated from Misc/stable_abi.toml.

Member

Ah, ok, thank you!

24 changes: 24 additions & 0 deletions Modules/_testcapi/unicode.c
@@ -1448,6 +1448,29 @@ unicode_equaltoutf8(PyObject *self, PyObject *args)
return PyLong_FromLong(result);
}

+/* Test PyUnicode_EqualToUTF8AndSize() */
+static PyObject *
+unicode_equaltoutf8andsize(PyObject *self, PyObject *args)
+{
+PyObject *left;
+const char *right = NULL;
+Py_ssize_t right_len;
+Py_ssize_t size = -100;
+int result;
+
+if (!PyArg_ParseTuple(args, "Oz#|n", &left, &right, &right_len, &size)) {
+return NULL;
+}
+
+NULLABLE(left);
+if (size == -100) {
+size = right_len;
+}
+result = PyUnicode_EqualToUTF8AndSize(left, right, size);
+assert(!PyErr_Occurred());
+return PyLong_FromLong(result);
+}
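
The wrapper uses `-100` as a sentinel meaning "no explicit size was passed" and then falls back to the length of the parsed buffer. A rough Python sketch of that optional-argument pattern follows; `SIZE_NOT_GIVEN` and `resolve_size` are hypothetical names for illustration and are not part of `_testcapi`.

```python
SIZE_NOT_GIVEN = -100  # mirrors the sentinel value used by the C test wrapper


def resolve_size(buf: bytes, size: int = SIZE_NOT_GIVEN) -> int:
    """Return the explicit size, or len(buf) when no size was supplied."""
    return len(buf) if size == SIZE_NOT_GIVEN else size


b = "abc".encode()
whole_buffer = resolve_size(b)              # no size given: use len(b)
prefix_only = resolve_size(b + b"x", len(b))  # explicit size: ignore the extra byte
```

A sentinel other than `-1` keeps `-1` available as a deliberately invalid size, which matters for the `# CRASHES ... -1` case listed in the tests.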

/* Test PyUnicode_RichCompare() */
static PyObject *
unicode_richcompare(PyObject *self, PyObject *args)
@@ -2064,6 +2087,7 @@ static PyMethodDef TestMethods[] = {
{"unicode_compare", unicode_compare, METH_VARARGS},
{"unicode_comparewithasciistring",unicode_comparewithasciistring,METH_VARARGS},
{"unicode_equaltoutf8", unicode_equaltoutf8, METH_VARARGS},
+{"unicode_equaltoutf8andsize",unicode_equaltoutf8andsize, METH_VARARGS},
{"unicode_richcompare", unicode_richcompare, METH_VARARGS},
{"unicode_format", unicode_format, METH_VARARGS},
{"unicode_contains", unicode_contains, METH_VARARGS},
36 changes: 23 additions & 13 deletions Objects/unicodeobject.c
@@ -10675,39 +10675,47 @@ PyUnicode_CompareWithASCIIString(PyObject* uni, const char* str)

int
PyUnicode_EqualToUTF8(PyObject *unicode, const char *str)
{
+return PyUnicode_EqualToUTF8AndSize(unicode, str, strlen(str));
+}
+
+int
+PyUnicode_EqualToUTF8AndSize(PyObject *unicode, const char *str, Py_ssize_t size)
+{
assert(_PyUnicode_CHECK(unicode));
assert(str);

if (PyUnicode_IS_ASCII(unicode)) {
-size_t len = (size_t)PyUnicode_GET_LENGTH(unicode);
-return strlen(str) == len &&
+Py_ssize_t len = PyUnicode_GET_LENGTH(unicode);
+return size == len &&
memcmp(PyUnicode_1BYTE_DATA(unicode), str, len) == 0;
}
if (PyUnicode_UTF8(unicode) != NULL) {
-size_t len = (size_t)PyUnicode_UTF8_LENGTH(unicode);
-return strlen(str) == len &&
+Py_ssize_t len = PyUnicode_UTF8_LENGTH(unicode);
+return size == len &&
memcmp(PyUnicode_UTF8(unicode), str, len) == 0;
}

-const unsigned char *s = (const unsigned char *)str;
Py_ssize_t len = PyUnicode_GET_LENGTH(unicode);
+if ((size_t)len >= (size_t)size || (size_t)len < (size_t)size / 4) {
+return 0;
+}
+const unsigned char *s = (const unsigned char *)str;
+const unsigned char *ends = s + (size_t)size;
int kind = PyUnicode_KIND(unicode);
const void *data = PyUnicode_DATA(unicode);
/* Compare Unicode string and UTF-8 string */
for (Py_ssize_t i = 0; i < len; i++) {
Py_UCS4 ch = PyUnicode_READ(kind, data, i);
-if (ch == 0) {
-return 0;
-}
-else if (ch < 0x80) {
-if (s[0] != ch) {
+if (ch < 0x80) {
+if (ends == s || s[0] != ch) {
return 0;
}
s += 1;
}
else if (ch < 0x800) {
-if (s[0] != (0xc0 | (ch >> 6)) ||
+if (ends - s < 2 ||
+s[0] != (0xc0 | (ch >> 6)) ||
s[1] != (0x80 | (ch & 0x3f)))
{
return 0;
@@ -10716,6 +10724,7 @@ PyUnicode_EqualToUTF8(PyObject *unicode, const char *str)
}
else if (ch < 0x10000) {
if (Py_UNICODE_IS_SURROGATE(ch) ||
+ends - s < 3 ||
s[0] != (0xe0 | (ch >> 12)) ||
s[1] != (0x80 | ((ch >> 6) & 0x3f)) ||
s[2] != (0x80 | (ch & 0x3f)))
@@ -10726,7 +10735,8 @@ PyUnicode_EqualToUTF8(PyObject *unicode, const char *str)
}
else {
assert(ch <= MAX_UNICODE);
-if (s[0] != (0xf0 | (ch >> 18)) ||
+if (ends - s < 4 ||
+s[0] != (0xf0 | (ch >> 18)) ||
s[1] != (0x80 | ((ch >> 12) & 0x3f)) ||
s[2] != (0x80 | ((ch >> 6) & 0x3f)) ||
s[3] != (0x80 | (ch & 0x3f)))
@@ -10736,7 +10746,7 @@ PyUnicode_EqualToUTF8(PyObject *unicode, const char *str)
s += 4;
}
}
-return *s == 0;
+return s == ends;
}
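
The control flow of the new size-aware comparison can be followed more easily in a Python transcription. This is a hedged sketch rather than the CPython code: the name `compare_code_points` is hypothetical, the UTF-8 cache fast path is omitted because Python code cannot observe that cache, and the sketch assumes `0 <= size <= len(buf)`.

```python
def compare_code_points(text: str, buf: bytes, size: int) -> int:
    """Compare `text` with the first `size` bytes of `buf`, code point by code point."""
    if text.isascii():
        # Fast path: ASCII text is its own UTF-8 encoding.
        return int(len(text) == size and text.encode("ascii") == buf[:size])

    length = len(text)
    # Non-ASCII text needs strictly more bytes than code points (at least
    # one code point takes 2 or more bytes) and at most 4 bytes per code point.
    if length >= size or length < size // 4:
        return 0

    pos = 0
    for char in text:
        ch = ord(char)
        if ch < 0x80:
            expected = bytes([ch])
        elif ch < 0x800:
            expected = bytes([0xC0 | (ch >> 6), 0x80 | (ch & 0x3F)])
        elif ch < 0x10000:
            if 0xD800 <= ch <= 0xDFFF:
                return 0  # surrogates never compare equal
            expected = bytes([0xE0 | (ch >> 12),
                              0x80 | ((ch >> 6) & 0x3F),
                              0x80 | (ch & 0x3F)])
        else:
            expected = bytes([0xF0 | (ch >> 18),
                              0x80 | ((ch >> 12) & 0x3F),
                              0x80 | ((ch >> 6) & 0x3F),
                              0x80 | (ch & 0x3F)])
        # Bounds check first (the `ends - s < n` tests in the C code),
        # then compare the expected bytes against the buffer.
        n = len(expected)
        if size - pos < n or buf[pos:pos + n] != expected:
            return 0
        pos += n
    # Equal only if the whole `size`-byte buffer was consumed
    # (the `s == ends` test in the C code).
    return int(pos == size)
```

Replacing the old `*s == 0` terminator test with an end-pointer comparison is what lets the buffer contain null bytes and lack a terminating null.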

int
1 change: 1 addition & 0 deletions PC/python3dll.c

Some generated files are not rendered by default.