-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binary cache: async push_success #908
base: main
Are you sure you want to change the base?
Binary cache: async push_success #908
Conversation
How or when should "upload messages" (like |
Doesn't this have the same problem as #694 that the working thread might exit due to calls to |
Kind of. In general we need an option to decide if a binary cache failure should be a hard error or only a warning |
The problem is that we almost never can be sure that there isn't some nested API call that exits on failure. But it seems like #909 at least partially addresses this issue. |
Yeah but in the binary cache are nearly no hard exists. It currently also only prints warnings. |
# Conflicts: # src/vcpkg.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this direction; unblocking I/O work has great potential for making vcpkg much faster.
However we need to be very careful about the impacts of concurrency -- deadlocks suck :(
src/vcpkg.cpp
Outdated
@@ -156,6 +157,7 @@ namespace vcpkg::Checks | |||
// Implements link seam from basic_checks.h | |||
void on_final_cleanup_and_exit() | |||
{ | |||
BinaryCache::wait_for_async_complete(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we can do this here. This is on the critical path for ctrl-c handling and should only be used for extremely fast, emergency tear-down behavior (like restoring the console).
If there happens to be an exit anywhere in any BinaryCache implementation, this would deadlock. Importantly, this include any sort of assertion we might want to do, like checking pointers for null.
Unfortunately, the only path forward I see is to call this (or appropriately scope the BinaryCache
itself) at the relevant callers. The consequence of possibly not uploading some set of binary caches in the case of some unhandled program error (such as permissions issue on a directory expected to be writable) is vastly preferable to deadlocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed the BinaryCache::wait_for_async_complete()
implementation so it does not deadlock anymore.
I also moved the call to Checks::exit_with_code
which is not called when crtl+c is handled. (I personally would like to have a way to terminate vcpkg but wait until the binary cache is done so that I don't lose progress.)
And I prefer it when build packages are uploaded to the binary caches before vcpkg exits because of an error, otherwise I have to build the already build packages again at a later point when there is no cache entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that it is, desirable to finish uploading on "understood" errors. For example, if a package failed to build or failed to be installed.
I was also wrong about my original assessment of a deadlock. My concern was the call path of the binary upload thread calling Checks::unreachable()
or .value_or_exit()
, but it seems that std::thread::join()
does have a carve-out to handle this specific case: it will throw a resource_deadlock_would_occur
if you try to join yourself.
I've put some other concerns below, but I don't want those to distract from my main point: We must make it as trivial / correct-by-construction as possible to guarantee that the binary cache thread NEVER attempts to wait on itself. I think the best approach for vcpkg right now is to add calls from Install::perform()
etc to BinaryCache::wait_for_async_complete()
before any "user-facing" error, such as the exit guarded by result.code != BuildResult::SUCCEEDED && keep_going == KeepGoing::NO
. This is motivated by the perspective that it's always safer to terminate than to join and possibly deadlock / race condition / etc.
There's still a UB data race if the main thread and binary upload thread attempt to exit at the same time:
Concurrently calling join() on the same thread object from multiple threads constitutes a data race that results in undefined behavior.
-- https://en.cppreference.com/w/cpp/thread/thread/join
There's also a serious "scalability" problem if we ever want a second background thread for whatever reason, because BGThread A would join on BGThread B, while BGThread B tries to join on BGThread A. This might be solvable with ever more complex structures, such as a thread ownership DAG that gets threads to join only on their direct children, but I don't think the benefit is worth the cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UB and the joining itself could simply be prevented by doing a if (std::this_thread::get_id() == instance->push_thread.get_id())
. My concern with the explicit approach is that it is easy to forget to call the waiting function of the BinaryCache and every time you want to exit you have to remember to call it. This seems to me to be very prone to human error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now implemented your request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ras0219-msft Is there anything left that is preventing this PR from being merged?
Co-authored-by: Robert Schumacher <roschuma@microsoft.com>
Co-authored-by: Robert Schumacher <roschuma@microsoft.com>
…ages between package installs Co-authored-by: Robert Schumacher <roschuma@microsoft.com>
# Conflicts: # src/vcpkg/build.cpp
# Conflicts: # src/vcpkg/base/messages.cpp
# Conflicts: # include/vcpkg/base/messages.h # src/vcpkg/base/messages.cpp
# Conflicts: # src/vcpkg/commands.ci.cpp
# Conflicts: # src/vcpkg/commands.ci.cpp # src/vcpkg/commands.install.cpp # src/vcpkg/commands.set-installed.cpp # src/vcpkg/commands.upgrade.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly nitpicks.
I think we should use size_t
instead of int
when we're using a variable effectively as an index inside a loop.
Furthermore, I noticed that some variable names are just letters. For me, this is confusing because I don't know what the variable is for a few lines later.
include/vcpkg/binarycaching.h
Outdated
|
||
BGMessageSink m_bg_msg_sink; | ||
BGThreadBatchQueue<ActionToPush> m_actions_to_push; | ||
std::atomic_int m_remaining_packages_to_push = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::atomic_int m_remaining_packages_to_push = 0; | |
std::atomic_size_t m_remaining_packages_to_push = 0; |
I think this should be size_t
because we effectively could have size_t
packages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This member existing at all is a smell to me. The number of items in the queue should be a part of the queue, not something tracked externally. Moreover, the queue already contains locks and stuff so I'm not sure why we need an atomic here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm going to tackle the structural error reporting etc. thing now that I think this change is really exposing / crying out for. In particular, I think we should consider a design which more formally associates chunks of console output with the package in question rather than a free for all on the message sink type requiring a lot of 'reverse engineer from stuff the caller already knew' kind of behavior.
- I think the discussion of how we are achieving thread safety across several large components like this needs the comment like I described in a comment.
- The comment about '
BinaryCache
is supposed to hideunique_ptr
' should get fixed, which will also eliminate a lot of the changes in this PR attempting to adapt to that change.
When I have a solution for (1) are you interested in me just pushing changes into this PR or do you want a PR for your PR?
(By and large I think this change is structurally good)
include/vcpkg/binarycaching.h
Outdated
|
||
BinaryCache(const Filesystem& fs); | ||
BinaryCache(const BinaryCache&) = delete; | ||
BinaryCache(BinaryCache&&) = default; | ||
BinaryCache(BinaryCache&&) = delete; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with this design change. BinaryCache
itself already contains several unique_ptr
-alikes, and intends to firewall that storage decision from its customers. If there are immovable bits the unique_ptr
juggling should happen inside rather than outside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment about 'BinaryCache is supposed to hide unique_ptr' should get fixed, which will also eliminate a lot of the changes in this PR attempting to adapt to that change.
How would you implement that? The function push_thread_main
uses the members of the class, which gets moved away if you move the BinaryCache. So I guess using pimpl is the only solution here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now put all data in an extra struct that is hold via a std::unique_ptr by the BinaryCache class.
@@ -196,23 +200,41 @@ namespace vcpkg | |||
|
|||
struct BinaryCache : ReadOnlyBinaryCache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When adding threading functionality to the codebase that is not effectively 'do a single function but faster', I think there needs to be a discussion in a comment of 'this is how the threads and communication between them work'.
For instance:
// compression and upload of binary cache entries happens on a single 'background' thread, `m_push_thread`
// Thread safety is achieved within the binary cache providers by:
// 1. Only using one thread in the background for this work.
// 2. Forming a queue of work for that thread to consume in `m_actions_to_push`, which maintains its own thread safety
// 3. Sending any replies from the background thread through `m_bg_msg_sink`
// 4. Ensuring any supporting data, such as tool exes, is provided before the background thread is started.
// 5. Ensuring that work is not submitted to the background thread until the corresponding `packages` directory to upload is no longer being actively touched by the foreground thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does your approach here survive #1076 ? Hard links mean that we have a lot less certainty on the packages directory being 'hermetic'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1076 does not have an impact here. The files "copied" to the installed dir are not later changed by another package. Even if they are overwritten, the hard link in the packages folder would still link to the same original file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has an impact on windows if #802 gets merged before this. Because when the packages folder gets compressed by 7z, the next feature gets tested and stuff gets removed from the installed dir and on windows you cant remove a hard link even if the linked file is only opened via another hard link -.-
PS: Not sure, but I noticed this behavior only on the windows dev drive, but maybe the normal filesystem was simply too slow so that this never happened
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was finally able to catch this situation that happens in combination of this PR with #802:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly the catch seems to only happen with this specific rocksdb.lib file and error nearly only happens with lib files.
And in general it only happens on the dev drive and not on a normal NTFS drive. Strage 😕
// buffers messages until newline is reached | ||
// guarded by m_print_directly_lock | ||
std::vector<std::pair<Color, std::string>> m_unpublished; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like that? Maybe? I think what this really argues is that we really need a semi-standardized 'document' error type @ras0219-msft has been asking for ages... we're almost done getting rid of ParseControlErrorInfo
which means we will finally have a unified way of handling errors and can finally look at messing with that.
include/vcpkg/binarycaching.h
Outdated
|
||
BGMessageSink m_bg_msg_sink; | ||
BGThreadBatchQueue<ActionToPush> m_actions_to_push; | ||
std::atomic_int m_remaining_packages_to_push = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This member existing at all is a smell to me. The number of items in the queue should be a part of the queue, not something tracked externally. Moreover, the queue already contains locks and stuff so I'm not sure why we need an atomic here.
|
||
bool empty() const { return forward.empty(); } | ||
|
||
void pop(std::vector<T>& out) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this thing being called queue given that this is how it works. Given that we expect this to be a multi producer single consumer queue, can we instead put the vector inside and note that only one thread may call pop but any number of threads may call push? That would also resolve the criticism over separate tracking atomics below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BatchQueue
alone is not thread safe.
Given that we expect this to be a multi producer single consumer queue, can we instead put the vector inside and note that only one thread may call pop but any number of threads may call push?
I don't get what you have in mind here 😅
I don't like this thing being called queue given that this is how it works.
Do you have an idea for a better name? :)
std::lock_guard<std::mutex> print_lk(m_print_directly_lock); | ||
std::lock_guard<std::mutex> lk(m_published_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::lock_guard<std::mutex> print_lk(m_print_directly_lock); | |
std::lock_guard<std::mutex> lk(m_published_lock); | |
std::lock_guard<std::mutex, std::mutex> lk(m_print_directly_lock, m_published_lock); |
if this survives
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you actually mean std::scoped_lock
.
You can just push to this PR. I will look through the other comments in the following days :) |
# Conflicts: # include/vcpkg/base/message-data.inc.h # locales/messages.json
…oved while the data is accessed from a thread
# Conflicts: # src/vcpkg/binarycaching.cpp
…ome clear that we need a thread safe channel through which errors and warnings can both be reasonably reported. Now that microsoft#1279 is landed and functionally everything in the codebase uses ExpectedL, we can look at what the new thing that fixes issues is. Consider the following: ```c++ ExpectedL<T> example_api(int a); ExpectedL<std::unique_ptr<SourceControlFile>> try_load_port_manifest_text(StringView text, StringView control_path, MessageSink& warning_sink); ``` The reason this can't return the warnings through the ExpectedL channel is that we don't want the 'error' state to be engaged when there are merely warnings. Moreover, that these channels are different channels means that situations that might want to return errors and warnings together, as happens when parsing files, means that order relationships between errors and warnings is lost. It is probably a good idea in general to put warnings and errors about the same location next to each other in the output, but that's hard to do with this interface. Rather than multiplexing everything through the return value, this proposal is to multiplex only the success or failure through the return value, and report any specific error information through an out parameter. 1. Distinguish whether an overall operation succeeded or failed in the return value, but 2. record any errors or warnings via an out parameter. Applying this to the above gives: ```c++ Optional<T> example_api(MessageContext& context, int a); // unique_ptr is already 'optional' std::unique_ptr<SourceControlFile> try_load_port_manifest_text(MessageContext& context, StringView text, StringView control_path); ``` Issues this new mechanism fixes: * Errors and warnings can share the same channel and thus be printed together * The interface between code wanting to report events and the code wanting to consume them is a natural thread synchronization boundary. Other attempts to fix this have been incorrect by synchronizing individual print calls ( microsoft#1290 ) or complex enough that we are not sure they are correct by trying to recover boundaries by reparsing our own error output ( microsoft#908 ) * This shuts down the "error: error:" and similar bugs where it isn't clear who is formatting the overall error message vs. talking about individual components Known issues that are not fixed by this change: * This still doesn't make it easy for callers to programmatically handle specific types of errors. Currently, we have some APIs that still use explicit `std::error_code` because they want to do different things for 'file does not exist' vs. 'there was an I/O error'. Given that this condition isn't well served by the ExpectedL mechanism I don't want to wait until we have a better solution to it to proceed. * Because we aren't making the context parameter the 'success carrier' it's more complex to implement 'warnings as errors' or similar functionality where the caller decides how 'important' something is. I would be in favor of moving all success tests to the context parameter but I'm not proposing that because the other vcpkg maintainers do not like it. * Contextual information / stack problems aren't solved. However, the context parameter might be extended in the future to help with this.
…ome clear that we need a thread safe channel through which errors and warnings can both be reasonably reported. Now that microsoft#1279 is landed and functionally everything in the codebase uses ExpectedL, we can look at what the new thing that fixes issues is. Consider the following: ```c++ ExpectedL<T> example_api(int a); ExpectedL<std::unique_ptr<SourceControlFile>> try_load_port_manifest_text(StringView text, StringView control_path, MessageSink& warning_sink); ``` The reason this can't return the warnings through the ExpectedL channel is that we don't want the 'error' state to be engaged when there are merely warnings. Moreover, that these channels are different channels means that situations that might want to return errors and warnings together, as happens when parsing files, means that order relationships between errors and warnings is lost. It is probably a good idea in general to put warnings and errors about the same location next to each other in the output, but that's hard to do with this interface. Rather than multiplexing everything through the return value, this proposal is to multiplex only the success or failure through the return value, and report any specific error information through an out parameter. 1. Distinguish whether an overall operation succeeded or failed in the return value, but 2. record any errors or warnings via an out parameter. Applying this to the above gives: ```c++ Optional<T> example_api(MessageContext& context, int a); // unique_ptr is already 'optional' std::unique_ptr<SourceControlFile> try_load_port_manifest_text(MessageContext& context, StringView text, StringView control_path); ``` Issues this new mechanism fixes: * Errors and warnings can share the same channel and thus be printed together * The interface between code wanting to report events and the code wanting to consume them is a natural thread synchronization boundary. Other attempts to fix this have been incorrect by synchronizing individual print calls ( microsoft#1290 ) or complex enough that we are not sure they are correct by trying to recover boundaries by reparsing our own error output ( microsoft#908 ) * This shuts down the "error: error:" and similar bugs where it isn't clear who is formatting the overall error message vs. talking about individual components Known issues that are not fixed by this change: * This still doesn't make it easy for callers to programmatically handle specific types of errors. Currently, we have some APIs that still use explicit `std::error_code` because they want to do different things for 'file does not exist' vs. 'there was an I/O error'. Given that this condition isn't well served by the ExpectedL mechanism I don't want to wait until we have a better solution to it to proceed. * Because we aren't making the context parameter the 'success carrier' it's more complex to implement 'warnings as errors' or similar functionality where the caller decides how 'important' something is. I would be in favor of moving all success tests to the context parameter but I'm not proposing that because the other vcpkg maintainers do not like it. * Contextual information / stack problems aren't solved. However, the context parameter might be extended in the future to help with this.
…ome clear that we need a thread safe channel through which errors and warnings can both be reasonably reported. Now that microsoft#1279 is landed and functionally everything in the codebase uses ExpectedL, we can look at what the new thing that fixes issues is. Consider the following: ```c++ ExpectedL<T> example_api(int a); ExpectedL<std::unique_ptr<SourceControlFile>> try_load_port_manifest_text(StringView text, StringView control_path, MessageSink& warning_sink); ``` The reason this can't return the warnings through the ExpectedL channel is that we don't want the 'error' state to be engaged when there are merely warnings. Moreover, that these channels are different channels means that situations that might want to return errors and warnings together, as happens when parsing files, means that order relationships between errors and warnings is lost. It is probably a good idea in general to put warnings and errors about the same location next to each other in the output, but that's hard to do with this interface. Rather than multiplexing everything through the return value, this proposal is to multiplex only the success or failure through the return value, and report any specific error information through an out parameter. 1. Distinguish whether an overall operation succeeded or failed in the return value, but 2. record any errors or warnings via an out parameter. Applying this to the above gives: ```c++ Optional<T> example_api(MessageContext& context, int a); // unique_ptr is already 'optional' std::unique_ptr<SourceControlFile> try_load_port_manifest_text(MessageContext& context, StringView text, StringView control_path); ``` Issues this new mechanism fixes: * Errors and warnings can share the same channel and thus be printed together * The interface between code wanting to report events and the code wanting to consume them is a natural thread synchronization boundary. Other attempts to fix this have been incorrect by synchronizing individual print calls ( microsoft#1290 ) or complex enough that we are not sure they are correct by trying to recover boundaries by reparsing our own error output ( microsoft#908 ) * This shuts down the "error: error:" and similar bugs where it isn't clear who is formatting the overall error message vs. talking about individual components Known issues that are not fixed by this change: * This still doesn't make it easy for callers to programmatically handle specific types of errors. Currently, we have some APIs that still use explicit `std::error_code` because they want to do different things for 'file does not exist' vs. 'there was an I/O error'. Given that this condition isn't well served by the ExpectedL mechanism I don't want to wait until we have a better solution to it to proceed. * Because we aren't making the context parameter the 'success carrier' it's more complex to implement 'warnings as errors' or similar functionality where the caller decides how 'important' something is. I would be in favor of moving all success tests to the context parameter but I'm not proposing that because the other vcpkg maintainers do not like it. * Contextual information / stack problems aren't solved. However, the context parameter might be extended in the future to help with this.
…ome clear that we need a thread safe channel through which errors and warnings can both be reasonably reported. Now that microsoft#1279 is landed and functionally everything in the codebase uses ExpectedL, we can look at what the new thing that fixes issues is. Consider the following: ```c++ ExpectedL<T> example_api(int a); ExpectedL<std::unique_ptr<SourceControlFile>> try_load_port_manifest_text(StringView text, StringView control_path, MessageSink& warning_sink); ``` The reason this can't return the warnings through the ExpectedL channel is that we don't want the 'error' state to be engaged when there are merely warnings. Moreover, that these channels are different channels means that situations that might want to return errors and warnings together, as happens when parsing files, means that order relationships between errors and warnings is lost. It is probably a good idea in general to put warnings and errors about the same location next to each other in the output, but that's hard to do with this interface. Rather than multiplexing everything through the return value, this proposal is to multiplex only the success or failure through the return value, and report any specific error information through an out parameter. 1. Distinguish whether an overall operation succeeded or failed in the return value, but 2. record any errors or warnings via an out parameter. Applying this to the above gives: ```c++ Optional<T> example_api(MessageContext& context, int a); // unique_ptr is already 'optional' std::unique_ptr<SourceControlFile> try_load_port_manifest_text(MessageContext& context, StringView text, StringView control_path); ``` Issues this new mechanism fixes: * Errors and warnings can share the same channel and thus be printed together * The interface between code wanting to report events and the code wanting to consume them is a natural thread synchronization boundary. Other attempts to fix this have been incorrect by synchronizing individual print calls ( microsoft#1290 ) or complex enough that we are not sure they are correct by trying to recover boundaries by reparsing our own error output ( microsoft#908 ) * This shuts down the "error: error:" and similar bugs where it isn't clear who is formatting the overall error message vs. talking about individual components Known issues that are not fixed by this change: * This still doesn't make it easy for callers to programmatically handle specific types of errors. Currently, we have some APIs that still use explicit `std::error_code` because they want to do different things for 'file does not exist' vs. 'there was an I/O error'. Given that this condition isn't well served by the ExpectedL mechanism I don't want to wait until we have a better solution to it to proceed. * Because we aren't making the context parameter the 'success carrier' it's more complex to implement 'warnings as errors' or similar functionality where the caller decides how 'important' something is. I would be in favor of moving all success tests to the context parameter but I'm not proposing that because the other vcpkg maintainers do not like it. * Contextual information / stack problems aren't solved. However, the context parameter might be extended in the future to help with this.
This results in ~10-20% faster build times on my machine.
For example building boost on my M1 mac went down from 2.948 min to 2.375 min