fix: validate UTF-8 at C++ -> Lean boundary #3963

digama0 · 2024-04-20T20:10:47Z

Continuation of #3958. To ensure that lean code is able to uphold the invariant that Strings are valid UTF-8 (which is assumed by the lean model), we have to make sure that no lean objects are created with invalid UTF-8. #3958 covers the case of lean code creating strings via fromUTF8Unchecked, but there are still many cases where C++ code constructs strings from a const char * or std::string with unclear UTF-8 status.

To address this and minimize accidental missed validation, the (lean_)mk_string function is modified to validate UTF-8. The original function is renamed to mk_string_unchecked, with several other variants depending on whether we know the string is UTF-8 or ASCII and whether we have the length and/or utf8 char count on hand. I reviewed every function which leads to mk_string or its variants in the C code, and used the appropriate validation function, defaulting to mk_string if the provenance is unclear.

This PR adds no new error handling paths, meaning that incorrect UTF-8 will still produce incorrect results in e.g. IO functions, they are just not causing unsound behavior anymore. A subsequent PR will handle adding better error reporting for bad UTF-8.

digama0 · 2024-04-20T20:16:39Z

!bench

digama0 · 2024-04-20T20:17:00Z

(disabled auto-merge to run a benchmark first)

leanprover-community-mathlib4-bot · 2024-04-20T20:29:09Z

Mathlib CI status (docs):

❗ Std/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a --onto 291bb84c972dae470e8f5602729e9d5a5e9433a2. (2024-04-20 20:29:09)
❗ Std/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase fb135b8cfe83c8d9286a9d0198114b7cb306a67a --onto 62cdb51ed5b9d8487877d5a4adbcd4659d81fc6a. (2024-04-23 21:32:17)

leanprover-bot · 2024-04-20T20:34:10Z

Here are the benchmark results for commit 88cad9a.
There were significant changes against commit 62cdb51:

  Benchmark   Metric             Change
  ================================================
+ stdlib      tactic execution    -1.9% (-302.2 σ)

digama0 · 2024-04-20T21:26:13Z

The affected tests are regression tests from #301, #1707, #1690, all of which appear to have been strange behaviors of lean caused by invalid UTF-8. The new behavior is for the C code to throw an unspecific error before even calling lean code; this is inevitable because the lean parser/elaborator assumes it can store the file in a String so we can't really use most of lean's error handling.

nomeata · 2024-04-22T07:38:27Z

src/include/lean/lean.h

@@ -2064,11 +2066,11 @@ static inline uint8_t lean_version_get_is_release(lean_obj_arg _unit) {
 }

 static inline lean_obj_res lean_version_get_special_desc(lean_obj_arg _unit) {
-    return lean_mk_string(LEAN_SPECIAL_VERSION_DESC);
+    return lean_mk_ascii_string(LEAN_SPECIAL_VERSION_DESC);


I’d probably lean towards using mk_utf8_string here. It’s probably not performance critical, and maybe some German developer with a ü in their name builds lean from a branch with v0.7.8-Günnar….

Suggested change

return lean_mk_ascii_string(LEAN_SPECIAL_VERSION_DESC);

return lean_mk_utf8_string(LEAN_SPECIAL_VERSION_DESC);

nomeata · 2024-04-22T07:39:01Z

src/include/lean/lean.h

 }

 static inline lean_obj_res lean_system_platform_target(lean_obj_arg _unit) {
-    return lean_mk_string(LEAN_PLATFORM_TARGET);
+    return lean_mk_ascii_string(LEAN_PLATFORM_TARGET);


Suggested change

return lean_mk_ascii_string(LEAN_PLATFORM_TARGET);

return lean_mk_utf8_string(LEAN_PLATFORM_TARGET);

Unlikely to matter, but better safe than sorry.

well, the "better safe than sorry" option is lean_mk_string. If we have any reason to believe that the data isn't necessarily in the subset it appears to be in, we should just validate it. (Note: is it confusing that the function mk_utf8_string does not validate? Your comments here suggest that you've misunderstood the role of the functions in which case the name should probably be changed to something less confusing.)

Indeed, after I wrote this and was away from the computer I paused wondered if my suggestion is bogus.

It’s not entirely bogus, though: If the there is utf8 in the TARGET, then mk_ascii_string will produce an invalid length. So if LEAN_SPECIAL_VERSION_DESC etc. happens to be utf8, then mk_utf8_string is better than mk_ascii_string, but not as safe as a validating function.

I agree that it’s worth changing the lean_mk_{utf8,utf8}_string function names to include `unchecked to avoid future bugs.

Agreed on better names

nomeata · 2024-04-22T07:40:12Z

src/runtime/platform.cpp

@@ -40,7 +40,7 @@ extern "C" LEAN_EXPORT uint8 lean_system_platform_emscripten(obj_arg) {
 #endif
 }

-extern "C" object * lean_get_githash(obj_arg) { return lean_mk_string(LEAN_GITHASH); }
+extern "C" object * lean_get_githash(obj_arg) { return lean_mk_ascii_string(LEAN_GITHASH); }


Also unlikely, but since the source is out of our (local) control, better safe than sorry.

Suggested change

extern "C" object * lean_get_githash(obj_arg) { return lean_mk_ascii_string(LEAN_GITHASH); }

extern "C" object * lean_get_githash(obj_arg) { return lean_mk_utf8_string(LEAN_GITHASH); }

src/include/lean/lean.h

Kha · 2024-04-22T15:38:58Z

src/include/lean/lean.h

 }

 static inline lean_obj_res lean_system_platform_target(lean_obj_arg _unit) {
-    return lean_mk_string(LEAN_PLATFORM_TARGET);
+    return lean_mk_ascii_string(LEAN_PLATFORM_TARGET);


Agreed on better names

src/runtime/object.cpp

Kha · 2024-04-22T15:43:55Z

src/runtime/object.cpp

+    if (validate_utf8((const uint8_t *)s, len)) {
+        return lean_mk_string_from_bytes(s, len);
+    } else {
+        return lean_mk_string_core("", 0, 0);


Is this really a sensible fallback? I matches the panic value but there is no message. Shouldn't this raise an exception if done as part of I/O?

Not really. I was planning to put in a follow-up PR which does something like rust's String::from_utf8_lossy with replacement characters. In most cases it's not really a good idea to rely on this for error handling, and instead code should call validate separately and throw IO errors etc as appropriate.

I think it would be good to elaborate on this plan before going forward. It would also be helpful to mention such relevant plans in the PR description :) .

If we want to get to the point where no-one, especially not Lean APIs, depends on this branch, should it become a hard assertion?

I have a revision in the works which addresses many of the review comments. I'll post that and then we can talk about how we want to proceed here.

digama0 · 2024-04-23T21:36:20Z

I pushed a new version of the PR. Summary of changes (w.r.t master, not ver.1):

lean_mk_string_unchecked(const char * s, size_t sz, size_t utf8_chars) was previously lean_mk_string_core
- Use this when we have both the length and utf8_length values available and also know the input is valid UTF-8.
lean_mk_string_from_bytes_unchecked(const char * s, size_t sz) was lean_mk_string_from_bytes
- Use this when we have the length and know the input is valid UTF-8, but need to compute the utf8_chars.
lean_mk_string_from_bytes(const char * s, size_t sz) does validation now, and if validation fails it converts invalid UTF-8 sequences to U+FFFD instead of just returning the empty string.
- Use this when we have a (char*, len) pair but do not know it encodes valid UTF-8
lean_mk_string(const char * s) is a wrapper around lean_mk_string_from_bytes
- Use this when we have char*, not necessarily UTF-8 and need strlen to get the length
lean_string_from_utf8_unchecked was lean_string_from_utf8
- This is the implementation of String.fromUTF8, you should probably not call it directly. I added _unchecked to the name because the C version doesn't validate, even though the lean version takes a validity proof witness as an argument.
lean_mk_ascii_string_unchecked(char const * s) was lean_mk_ascii_string
- Use this when you know the string is valid ASCII but need strlen to get the length.
lean::mk_ascii_string_unchecked(std::string const & s) was lean::mk_ascii_string
- Use this when you know the string is valid ASCII and have a std::string (which counts the length).

Other (non-API) changes:

The code generator produces calls to lean_mk_string_unchecked now instead of lean_mk_string_from_bytes, since we don't want to validate string literals, plus we can calculate the utf8_chars in advance.
bool validate_utf8(uint8_t const * str, size_t size, size_t & pos, size_t & i) also returns the byte position of the first bad character, which can be used for better error reporting in addition to its use here as a fast path for the replacement character algorithm.
I removed the special error handling for invalid UTF-8 in lean files. These cases, in addition to the IO uses of mk_string, are handled by the replacement algorithm, meaning that lean will complain in the bad UTF-8 tests with a "more normal" error saying that U+FFFD is not a token; if you stick it inside a comment or guillemets it will be accepted like any other unicode character. This is not necessarily a good choice, but at least it's a less damaging default behavior than erasing the whole string. You can however do separate validation, by calling validate_utf8 and lean_mk_string_unchecked instead of lean_mk_string.

nomeata · 2024-04-24T07:17:25Z

Sounds like a nice and thorough cleanup, thanks a lot!

I'd lean towards rejecting invalid utf8 in lean files (not replacing, not using the empty string). Who knows what kind of surprises will occur when people accidentally have invalid utf8 and lean does something, but not what they expect to.

Kha · 2024-04-24T08:30:24Z

I'm happy as well modulo below -

I'd lean towards rejecting invalid utf8 in lean files (not replacing, not using the empty string)

Agreed, and more generally I/O functions should at least by default throw an error when decoding fails. I'm pretty sure this is the default in all the big stdlibs out there as well, no?

digama0 · 2024-04-24T17:09:07Z

I agree, but can we agree that this is an improvement on the status quo and do that in a follow up PR? Handling UTF-8 errors properly means having new kinds of error reporting functions we don't have currently, as well as touching a bunch more code. The main purpose of this PR is to avoid undefined behavior in the present version caused by invalid lean objects being created, not to add new error reporting mechanisms.

Kha · 2024-04-25T07:59:49Z

Sure, I just want to make sure it is documented and discussed that this is a partial solution in that way. That is very relevant context for reviewing.

nomeata · 2024-04-25T08:10:33Z

Sounds like as soon as the PR description reflects the current state I can press the button?

digama0 · 2024-04-25T16:35:40Z

PR description updated

nomeata · 2024-04-25T18:05:52Z

Linux-debug fails with

1658/1862 Test #1661: leanruntest_subset.lean ...................................***Failed    1.07 sec
../../common.sh: line 42: 48489 Aborted                 (core dumped) LEAN_BACKTRACE=0 "$@" 2>&1
     48490 Done                    | perl -pe 's/(\?(\w|_\w+))\.[0-9]+/\1/g' > "$f.produced.out"
Unexpected return code 134 executing 'lean -Dlinter.all=false subset.lean'; expected 0. Output:
LEAN ASSERTION VIOLATION
File: /home/runner/work/lean4/lean4/build/stage1/include/lean/lean.h
Line: 473
lean_is_string(o)
(C)ontinue, (A)bort/exit, (S)top/trap

Kha · 2024-05-21T09:43:08Z

@digama0 Do you have time to look into the assertion failure?

digama0 · 2024-06-18T21:02:25Z

Sorry for the delay, the assertion failure should be fixed now.

digama0 requested a review from joehendrix as a code owner April 20, 2024 20:10

digama0 changed the title ~~fix: validate UTF-8 at C++ -> Lean gboundary~~ fix: validate UTF-8 at C++ -> Lean boundary Apr 20, 2024

nomeata approved these changes Apr 20, 2024

View reviewed changes

nomeata enabled auto-merge April 20, 2024 20:13

digama0 disabled auto-merge April 20, 2024 20:16

github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Apr 20, 2024

Kha self-requested a review April 20, 2024 20:37

leodemoura requested a review from nomeata April 21, 2024 21:19

nomeata reviewed Apr 22, 2024

View reviewed changes

Kha reviewed Apr 22, 2024

View reviewed changes

digama0 requested a review from leodemoura as a code owner April 23, 2024 21:08

digama0 force-pushed the validate_utf8_c branch from 1e6967c to 248a17a Compare April 23, 2024 21:13

Kha approved these changes Apr 25, 2024

View reviewed changes

Kha added this pull request to the merge queue Apr 25, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 25, 2024

nomeata added the full-ci label Apr 25, 2024

Kha removed the full-ci label May 24, 2024

kim-em self-requested a review as a code owner May 29, 2024 16:23

Kha added the awaiting-author Waiting for PR author to address issues label Jun 17, 2024

digama0 added 4 commits June 17, 2024 22:39

fix: validate UTF-8 at C++ -> Lean gboundary

2ebc9a9

fix tests

9699ec5

ver.2

96c94ce

fix segfault

7f7d041

digama0 force-pushed the validate_utf8_c branch from 248a17a to 7f7d041 Compare June 18, 2024 21:01

digama0 added full-ci and removed awaiting-author Waiting for PR author to address issues labels Jun 18, 2024

leanprover-community-mathlib4-bot added a commit to leanprover-community/batteries that referenced this pull request Jun 18, 2024

Update lean-toolchain for testing leanprover/lean4#3963

5775e9e

leanprover-community-mathlib4-bot added a commit to leanprover-community/mathlib4 that referenced this pull request Jun 18, 2024

Update lean-toolchain for testing leanprover/lean4#3963

29f9310

Kha added this pull request to the merge queue Jun 19, 2024

Merged via the queue into leanprover:master with commit 0a1a855 Jun 19, 2024
16 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: validate UTF-8 at C++ -> Lean boundary #3963

fix: validate UTF-8 at C++ -> Lean boundary #3963

digama0 commented Apr 20, 2024 •

edited

Loading

digama0 commented Apr 20, 2024

digama0 commented Apr 20, 2024

leanprover-community-mathlib4-bot commented Apr 20, 2024 •

edited

Loading

leanprover-bot commented Apr 20, 2024

digama0 commented Apr 20, 2024

nomeata Apr 22, 2024

nomeata Apr 22, 2024

digama0 Apr 22, 2024 •

edited

Loading

nomeata Apr 22, 2024

Kha Apr 22, 2024

nomeata Apr 22, 2024

Kha Apr 22, 2024

Kha Apr 22, 2024

digama0 Apr 22, 2024

Kha Apr 23, 2024

digama0 Apr 23, 2024

digama0 commented Apr 23, 2024 •

edited

Loading

nomeata commented Apr 24, 2024

Kha commented Apr 24, 2024

digama0 commented Apr 24, 2024 •

edited

Loading

Kha commented Apr 25, 2024

nomeata commented Apr 25, 2024

digama0 commented Apr 25, 2024

nomeata commented Apr 25, 2024

Kha commented May 21, 2024

digama0 commented Jun 18, 2024

	return lean_mk_ascii_string(LEAN_SPECIAL_VERSION_DESC);
	return lean_mk_utf8_string(LEAN_SPECIAL_VERSION_DESC);

	return lean_mk_ascii_string(LEAN_PLATFORM_TARGET);
	return lean_mk_utf8_string(LEAN_PLATFORM_TARGET);

	extern "C" object * lean_get_githash(obj_arg) { return lean_mk_ascii_string(LEAN_GITHASH); }
	extern "C" object * lean_get_githash(obj_arg) { return lean_mk_utf8_string(LEAN_GITHASH); }

fix: validate UTF-8 at C++ -> Lean boundary #3963

fix: validate UTF-8 at C++ -> Lean boundary #3963

Conversation

digama0 commented Apr 20, 2024 • edited Loading

digama0 commented Apr 20, 2024

digama0 commented Apr 20, 2024

leanprover-community-mathlib4-bot commented Apr 20, 2024 • edited Loading

leanprover-bot commented Apr 20, 2024

digama0 commented Apr 20, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

digama0 Apr 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

digama0 commented Apr 23, 2024 • edited Loading

nomeata commented Apr 24, 2024

Kha commented Apr 24, 2024

digama0 commented Apr 24, 2024 • edited Loading

Kha commented Apr 25, 2024

nomeata commented Apr 25, 2024

digama0 commented Apr 25, 2024

nomeata commented Apr 25, 2024

Kha commented May 21, 2024

digama0 commented Jun 18, 2024

digama0 commented Apr 20, 2024 •

edited

Loading

leanprover-community-mathlib4-bot commented Apr 20, 2024 •

edited

Loading

digama0 Apr 22, 2024 •

edited

Loading

digama0 commented Apr 23, 2024 •

edited

Loading

digama0 commented Apr 24, 2024 •

edited

Loading