Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around popen not being thread safe on MacOS #695

Merged
merged 79 commits into from
Nov 29, 2022

Conversation

Thomas1664
Copy link
Contributor

@Thomas1664 Thomas1664 commented Sep 4, 2022

Related microsoft/vcpkg#26651

vcpkg occasionally crashed at decompress_in_parallel > cmd_execute_and_capture_output_parallel because threads that are launched by cmd_execute_and_capture_output_parallel were writing to the result vector without locking the vector.

Not a problem because they access different indexes of the vector and don't cause a resize.

@omartijn Thanks for providing the debug information.

max_concurrency()

get_concurrency() returned int which didn't make any sense because there can't be -1 threads.

@@ -120,7 +120,7 @@ namespace vcpkg::Commands::DependInfo
std::string value = iter->second;
try
{
return std::stoi(value);
return std::max(std::stoi(value), NO_RECURSE_LIMIT_VALUE);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who says a user puts only numbers >= -1 in here?

}
catch (std::exception&)
{
Checks::msg_exit_with_message(VCPKG_LINE_INFO, msgOptionMustBeInteger, msg::option = "VCPKG_MAX_CONCURRENCY");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BillyONeal Should we throw if VCPKG_MAX_CONCURRENCY is invalid or should we simply continue with std::thread::hardware_concurrency()

@Thomas1664 Thomas1664 changed the title Fix rare crash Fix bugs related to concurrency Sep 5, 2022
@Thomas1664
Copy link
Contributor Author

Thomas1664 commented Sep 6, 2022

@omartijn Gotcha!

I created a unit test that executes the command echo a where a is a string of the char a with a length that matches the current index, i.e. `` for index 0, a for index 1, `aa` for index 2 and so on.

Here are the results:

...
Index: 36, length: 36
Index: 37, length: 37
Index: 38, length: 18446744073709551615

The problem is that in random positions the string returned from the child process is missing a \0 which leads to a string length of SIZE_T_MAX. I subtracted 1 in my test to align with the current index.

The segfault comes into place when someone tries to read at index > 38 in this case.

@Thomas1664 Thomas1664 marked this pull request as draft September 6, 2022 23:03
@autoantwort
Copy link
Contributor

@Thomas1664 Can the change for the mutex around popen be extracted to another PR? Its very annoying when vcpkg crashes and I don't want to wait longer :)

I think the issue you found with asan already fixes the crash that I tried to fix with this PR?

I fixed another crash, but there are multiple ones: #695 (comment)

@autoantwort
Copy link
Contributor

I only see this:
image

@Thomas1664
Copy link
Contributor Author

Never mind, the crash isn't fixed

@Thomas1664
Copy link
Contributor Author

Thomas1664 commented Nov 19, 2022

@autoantwort If unit tests succeed on OSX in CI and the message Mutex enabled is printed to the console, could you please run the unit tests from this branch on your Mac (IIRC you do have one) 100-1000 times and if it crashes try setting the buffer to 1023 and try again? Seems like CI can't reliably test this.

EDIT: Never mind, I noticed that the debug message I added wasn't printed.

@autoantwort
Copy link
Contributor

EDIT: Never mind, I noticed that the debug message I added wasn't printed.

Because you have to pass --debug but you can't so that for the vcpkg-test executable

@autoantwort
Copy link
Contributor

autoantwort commented Nov 19, 2022

Seems that it fails with std::async because that also calls popen (or pthread_create)

@Thomas1664
Copy link
Contributor Author

Seems that it fails with std::async because that also calls popen (or pthread_create)

@autoantwort Although it shouldn't matter but did you also try to set the buffer back to 1023 at the line you commented about #695 (comment) ?

@autoantwort
Copy link
Contributor

Finally found it. The pclose call also has to be protected by the mutex.

@autoantwort
Copy link
Contributor

autoantwort commented Nov 19, 2022

All the errors I got happens before the fgets line. This has nothing to do with the buffer size. All got all errors while the buffer size was 1023.

@autoantwort
Copy link
Contributor

Apply this patch to fix the pipeline:

diff --git a/src/vcpkg/base/system.process.cpp b/src/vcpkg/base/system.process.cpp
index c6eb7820..c00713c8 100644
--- a/src/vcpkg/base/system.process.cpp
+++ b/src/vcpkg/base/system.process.cpp
@@ -960,7 +960,15 @@ namespace vcpkg
             return format_system_error_message("feof", errno);
         }
 
-        int ec = pclose(pipe);
+        int ec;
+        {
+#if defined(__APPLE__)
+            // `pclose` sometimes returns 127 on OSX when executed in parallel.
+            // Related: https://github.com/microsoft/vcpkg-tool/pull/695#discussion_r973364608
+            std::lock_guard guard(mtx);
+#endif
+            ec = pclose(pipe);
+        }
         if (WIFEXITED(ec))
         {
             ec = WEXITSTATUS(ec);

Copy link
Contributor

@autoantwort autoantwort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve the macOS specific changes

@Thomas1664
Copy link
Contributor Author

Finally found it. The pclose call also has to be protected by the mutex.

@autoantwort Thanks a lot for your help!

@Thomas1664
Copy link
Contributor Author

It is even documented that our use of popen was wrong: Although popen is thread safe, "It is even possible that calling MT-Safe functions in sequence does not yield an MT-Safe combination." (source: https://www.man7.org/linux/man-pages/man7/attributes.7.html ). Note that we might call this function in sequence across multiple threads:

auto work = [&]() {
std::size_t item;
while (item = work_item.fetch_add(1), item < cmd_lines.size())
{
res[item] = cmd_execute_and_capture_output(cmd_lines[item], wd, env);
}
};

@BillyONeal BillyONeal changed the title Fix bugs related to concurrency Work around popen being not thread safe on MacOS Nov 25, 2022
@BillyONeal BillyONeal changed the title Work around popen being not thread safe on MacOS Work around popen not being thread safe on MacOS Nov 25, 2022
@ras0219-msft
Copy link
Contributor

It is even documented that our use of popen was wrong: Although popen is thread safe, "It is even possible that calling MT-Safe functions in sequence does not yield an MT-Safe combination." (source: https://www.man7.org/linux/man-pages/man7/attributes.7.html ).

I think adding more context to the quote makes more sense:

For example, having a thread call two MT-
Safe functions one right after the other does not
guarantee behavior equivalent to atomic execution of a
combination of both functions, since concurrent calls in
other threads may interfere in a destructive way.

My interpretation is that this is restating how

int a = x.fetch_add(1);
int b = x.fetch_add(1);

doesn't guarantee b - a == 1 -- another thread could have modified x between. Thus, our use should still be permissible.

But all of this is largely philosophical. Given the evidence in this thread, I think adding a lock around popen and pclose has presented as a 100% fix to the issue (with a small sample size).

@BillyONeal
Copy link
Member

But all of this is largely philosophical. Given the evidence in this thread, I think adding a lock around popen and pclose has presented as a 100% fix to the issue (with a small sample size).

Does that mean you are OK with this change? Note that it is currently unable to merge because you are marked 'request changes'

@BillyONeal BillyONeal merged commit 478557d into microsoft:main Nov 29, 2022
@BillyONeal
Copy link
Member

Thanks :)

@autoantwort
Copy link
Contributor

Whoop whoop. Thank you! ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants