Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS S3 cache miss should not issue warning #1404

Merged

Conversation

petamas
Copy link
Contributor

@petamas petamas commented May 9, 2024

This fixes microsoft/vcpkg#31932

Previously, when using the AWS S3 binary cache, every cache miss triggered the following message:

warning: error: aws failed with exit code: (1).

This was both noisy and misleading (because it was not an actual error).

Now AWS cache misses are treated silently, the same way as cache misses from the local cache are.

Note

I don't have access to either Google Cloud Storage or Tencent Cloud Object Storage sources, so I don't know whether the same kind of issue happens there too; in any case, fixing those should be easy using the newly introduced interface. (Their stat() and download_file() methods need to be adapted to return CacheResult::miss in the relevant cases.)

@petamas
Copy link
Contributor Author

petamas commented May 9, 2024

@microsoft-github-policy-service agree company="Formlabs"

@@ -951,14 +951,22 @@ namespace
static constexpr StringLiteral m_accept_header = "Accept: application/json;api-version=6.0-preview.1";
};

enum class CacheResult
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this use RestoreResult instead of making up a new value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, I just didn't find that enum.

struct IObjectStorageTool
{
virtual ~IObjectStorageTool() = default;

virtual LocalizedString restored_message(size_t count,
std::chrono::high_resolution_clock::duration elapsed) const = 0;
virtual ExpectedL<Unit> stat(StringView url) const = 0;
virtual ExpectedL<Unit> download_file(StringView object, const Path& archive) const = 0;
virtual ExpectedL<CacheResult> stat(StringView url) const = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should stat return CacheAvailability instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean ExpectedL<CacheAvailability>, or plain CacheAvailability? In either case, how should we treat unknown in the call sites?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've found your suggestion for precheck(), but it's still unclear to me how to handle it in AwsStorageTool::download_file(). Should I simply return RestoreResult:: unavailable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean ExpectedL<CacheAvailability>, or plain CacheAvailability? In either case, how should we treat unknown in the call sites?

Sorry, I meant ExpectedL<CacheAvailability> :)

@@ -990,11 +998,11 @@ namespace
const auto& abi = action.package_abi().value_or_exit(VCPKG_LINE_INFO);
auto tmp = make_temp_archive_path(m_buildtrees, action.spec);
auto res = m_tool->download_file(make_object_path(m_prefix, abi), tmp);
if (res)
if (res.value_or(CacheResult::miss) == CacheResult::hit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this value_or is less clear than just spelling it out:

if (auto cache_result = res.get()) {
    if (*cache_result == CacheResult::hit) {
        out_zip_paths[idx].emplace(std::move(tmp), RemoveWhen::always);
        return;
    }
} else {
    out_sink.println_warning(res.error());
}

@@ -1007,7 +1015,8 @@ namespace
{
auto&& action = *actions[idx];
const auto& abi = action.package_abi().value_or_exit(VCPKG_LINE_INFO);
if (m_tool->stat(make_object_path(m_prefix, abi)))
auto res = m_tool->stat(make_object_path(m_prefix, abi));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If stat returns CacheAvailability then this can be:

auto maybe_res = m_tool->stat(make_object_path(m_prefix, abi));
if (auto res = maybe_res.get()) {
    cache_status[idx] = *res;
}

cmd_execute_and_capture_output(Command{m_tool}.string_arg("-q").string_arg("stat").string_arg(url)),
Tools::GSUTIL);
auto cmd = Command{m_tool}.string_arg("-q").string_arg("stat").string_arg(url);
return flatten(cmd_execute_and_capture_output(cmd), Tools::GSUTIL).map(unit_to_cache_hit);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a case where we should be looking at the output to confirm we are in the case we expect?

    ExpectedL<CacheResult> stat_flatten_pseudocode(const ExpectedL<ExitCodeAndOutput>& maybe_exit, StringView tool_name)
    {
        if (auto exit = maybe_exit.get())
        {
            if (exit->exit_code == 0)
            {
                return {CacheResult::hit};
            }
            if (exit->exit_code == 1 && exit->output.empty())
            {
                return {CacheResult::miss};
            }

            return {msg::format_error(
                        msgProgramReturnedNonzeroExitCode, msg::tool_name = tool_name, msg::exit_code = exit->exit_code)
                        .append_raw('\n')
                        .append_raw(exit->output)};
        }

        return {msg::format_error(msgLaunchingProgramFailed, msg::tool_name = tool_name)
                    .append_raw(' ')
                    .append_raw(maybe_exit.error().to_string())};
    }

(this would of course remove the need for unit_to_cache_hit)

I'm a bit nervous to treat all potentially failing exit codes as cache miss. For example, if there's an authentication failure, that probably prints something about what happened or similar, and probably also returns EXIT_FAILURE. But with this PR's suggestion that cache provider would be silently broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might've misread my code. I also wouldn't want to treat all nonzero exit codes as cache misses, that's why I only changed the behaviour in the AWS provider, and only in case the output is empty, and the exit code is 0 or 1.

This specifically is the GCS provider, for which I tried to keep the original behaviour: it's either a cache hit, or an error with an error message. flatten() treats every nonzero exit code as an error, and returns a "right-filled" Expected with an error message, which I return as-is. Zero exit codes result in a Unit, which I treat as a hit like before. This function cannot return a cache miss at all, because I don't know what the right condition for that is in GCS caches.

Your function is almost equivalent to my AWS stat version, except that I treat "0 exit code, but empty stdout" as a cache miss, to handle if AWS maintainers decide to change aws ls behaviour to not return an error when there are no hits. However, I don't know how GCS reacts in the "not found" case, so I didn't want to change the behaviour from what it was before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Yes, I was reviewing what this was trying to do based on what your PR description said and I didn't notice that this wasn't the AWS one. (Or if I did notice, I didn't internalize the difference...)

GitHub only showing like 2 lines of context == review pain :(

Comment on lines 1132 to 1174
// When the file is not found, "aws s3 ls" prints nothing, and returns exit code 1.
// flatten() would treat this as an error, but we want to treat it as a (silent) cache miss instead.
// See https://github.com/aws/aws-cli/issues/5544 for the related aws-cli bug report.
if (auto exit = maybe_exit.get())
{
// We want to return CacheResult::miss even if aws-cli starts to return success
const bool success_or_exit_code_1 = exit->exit_code == 0 || exit->exit_code == 1;
const bool output_is_empty = Strings::trim(exit->output) == "";
if (success_or_exit_code_1 && output_is_empty)
{
return CacheResult::miss;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto stat_flatten_pseudocode above

@BillyONeal
Copy link
Member

Thanks for the fix!

@petamas
Copy link
Contributor Author

petamas commented May 14, 2024

Thanks for your detailed review! I'll address your comments in a couple days. I added some questions in replies where something was unclear to me.

@petamas
Copy link
Contributor Author

petamas commented May 14, 2024

@BillyONeal
What's your preference on changes addressing review comments? Should I force-push the branch to rewrite history with the "correct" version incorporating the changes, or should I simply add my changes on top of the current ones in a separate commit?

@BillyONeal
Copy link
Member

@BillyONeal What's your preference on changes addressing review comments? Should I force-push the branch to rewrite history with the "correct" version incorporating the changes, or should I simply add my changes on top of the current ones in a separate commit?

There isn't widespread surgery here so I think you can do whatever you want :)

@petamas petamas force-pushed the aws-s3-cache-miss-should-not-issue-warning branch from c3c6df5 to d4b3097 Compare May 16, 2024 17:53
@petamas
Copy link
Contributor Author

petamas commented May 16, 2024

@BillyONeal: I've updated my branch. I split the changes into two commits:

  • The first one does not change the behaviour at all, only introduces the ability to return CacheAvailability/RestoreResult, but all providers always either return available or error, just like before.
  • The second one makes AWS's stat() return unavailable instead of an error in case the standard output is empty, and the exit code is either 1 (current behaviour of aws-cli) or 0 (possible future behaviour if they decide the address this issue).

If you have a better name for flatten_generic(), I'm open to suggestions.

Comment on lines 1036 to 1040
auto maybe_res = m_tool->stat(make_object_path(m_prefix, abi));
if (auto res = maybe_res.get())
{
cache_status[idx] = CacheAvailability::unavailable;
cache_status[idx] = *res;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized that this is not an equivalent transformation: if stat() returns an error, the original version sets cache_status[idx] to CacheAvailability::unavailable, while the new version leaves it as CacheAvailability::unknown. Should I add an else branch to keep the behaviour the same as before? (My understanding is that leaving it unknown may at worst cause an extra query to the provider, but I'm not 100% sure I'm reading the rest of the code right.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub's not showing the diff properly.

Original:

                if (m_tool->stat(make_object_path(m_prefix, abi)))
                {
                    cache_status[idx] = CacheAvailability::available;
                }
                else
                {
                    cache_status[idx] = CacheAvailability::unavailable;
                }

Current:

                auto maybe_res = m_tool->stat(make_object_path(m_prefix, abi));
                if (auto res = maybe_res.get())
                {
                    cache_status[idx] = *res;
                }

Possible fix to keep old behaviour:

                auto maybe_res = m_tool->stat(make_object_path(m_prefix, abi));
                if (auto res = maybe_res.get())
                {
                    cache_status[idx] = *res;
                }
                else
                {
                    cache_status[idx] = CacheAvailability::unavailable;
                }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to go with the real equivalent change (i.e. I readded the else branch), and force-pushed.

This commit does not change any behavior, just prepares the way for the next commit.
@petamas petamas force-pushed the aws-s3-cache-miss-should-not-issue-warning branch from d4b3097 to 80dcf3f Compare May 22, 2024 10:18
@petamas petamas requested a review from BillyONeal May 22, 2024 10:19
@BillyONeal
Copy link
Member

BillyONeal commented Jun 5, 2024

github note block

I learned today that you can do this "note" thing

Copy link
Member

@BillyONeal BillyONeal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you!

@BillyONeal BillyONeal enabled auto-merge (squash) June 7, 2024 00:55
@BillyONeal BillyONeal merged commit 071c997 into microsoft:main Jun 7, 2024
5 checks passed
@petamas petamas deleted the aws-s3-cache-miss-should-not-issue-warning branch July 8, 2024 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[vcpkg-tool] AWS CLI tool error when trying to download non-existing cache from AWS
2 participants