Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HOLD for payment 2023-07-03] Investigate why some changes didn't make it to staging as expected #10214

Closed
roryabraham opened this issue Aug 2, 2022 · 30 comments · Fixed by #10378
Assignees
Labels
Awaiting Payment Auto-added when associated PR is deployed to production Engineering Improvement Item broken or needs improvement. Reviewing Has a PR in review Weekly KSv2

Comments

@roryabraham
Copy link
Contributor

Context

Problem

There is a very obscure edge case with our git CI that caused some changes on main to not be correctly propagated to staging during a staging deploy. There were a few affected files, but the one we've focused on as our case study is src/pages/settings/Payments/PaymentsPage/BasePaymentsPage.js. Here's my best attempt at summarizing the history of this file.

  1. The OfflineIndicator was added in cc59b0a (#9589).
  2. That change was deployed to staging in 1.1.86-0
  3. The change was reverted here. Note that this PR was meant to be a revert of Update Global Offline Indicator #9895, but also included an additional (potentially unrelated) change in BasePaymentsPage.
  4. The revert was cherry-picked to staging version 1.1.86-2 in ffd8581 (#10115)
  5. The OfflineIndicator was added back in f9cf909 (#10135)

At this point, main contained the OfflineIndicator in BasePaymentsPage, but staging did not. However, when main was merged into staging, the usage of OfflineIndicator was carried over to main but the import was not.

In order to reproduce this locally and see it in action:

git checkout main
git pull
git checkout staging
git pull
git checkout main
git reset e66fa20d689e704fe6cf30485bc45f1e5e1397d7
git checkout -b fake-main
git reset --hard HEAD
git checkout staging
git reset 751915e541bfad2352237ce9d6030413b23938f0
git checkout -b fake-staging
git reset --hard HEAD
git clean -fd
git checkout -b update-fake-staging-from-fake-main

Then look at the diff between update-fake-staging-from-fake-main:

image (3)

You would think both of these changes would be carried over, but then run:

git merge -Xtheirs fake-main

And you'll see that the import is missing.

Solution

TBD, but we have some additional observations/hypotheses:

  • When we cherry-picked the revert to staging, we somehow muddled the history such that git thinks that change (ffd8581) is newer than the change on main that added it back (f9cf909).
  • An examination of the git log shows that on main the commits from the revert PR are part of a linear history, while on staging the merge commit is included and is treated as a separate branch?

image

  • The next step to investigate how to prevent this from happening in the future is probably to try and reproduce it in a fresh repo (perhaps utilizing this existing test script), and fiddle with the cherry-pick command that's used to see if we can keep these histories in line and prevent this problem.
@roryabraham roryabraham added Engineering Weekly KSv2 Improvement Item broken or needs improvement. labels Aug 2, 2022
@roryabraham
Copy link
Contributor Author

cc @Expensify/mobile-deployers – @johnmlee101, @rafecolton, @yuwenmemon, and I spent a few hours trying to figure this out this morning. We made some good progress but didn't quite get it worked out.

It seems like a pretty extreme edge case, but the main moral for now is be careful with your cherry-picks, particularly with reverts. Hopefully we'll have more tangible guidance or improvements here soon.

@roryabraham roryabraham self-assigned this Aug 2, 2022
@roryabraham
Copy link
Contributor Author

cc @kidroca in case you're interested in investigating

@roryabraham
Copy link
Contributor Author

roryabraham commented Aug 8, 2022

Found another problem scenario in this workflow run:

  • Staging was at b568742

  • ee8989d was CP'd to staging:

    git cherry-pick -S -x --mainline 1 --strategy=recursive -Xtheirs ee8989dc7cb5a1debcbf71ea3227e9105112afe1
  • Then staging included this unrelated change, which was CP'd in 🍒 Cherry pick PR #10229 to staging 🍒 #10295

    ```gradle
    buildConfigField "boolean", "IS_NEW_ARCHITECTURE_ENABLED", isNewArchitectureEnabled().toString()
    
      if (isNewArchitectureEnabled()) {
          // We configure the NDK build only if you decide to opt-in for the New Architecture.
          externalNativeBuild {
              ndkBuild {
                  arguments "APP_PLATFORM=android-21",
                      "APP_STL=c++_shared",
                      "NDK_TOOLCHAIN_VERSION=clang",
                      "GENERATED_SRC_DIR=$buildDir/generated/source",
                      "PROJECT_BUILD_DIR=$buildDir",
                      "REACT_ANDROID_DIR=$rootDir/../node_modules/react-native/ReactAndroid",
                      "REACT_ANDROID_BUILD_DIR=$rootDir/../node_modules/react-native/ReactAndroid/build",
                      "NODE_MODULES_DIR=$rootDir/../node_modules"
                  cFlags "-Wall", "-Werror", "-fexceptions", "-frtti", "-DWITH_INSPECTOR=1"
                  cppFlags "-std=c++17"
                  // Make sure this target name is the same you specify inside the
                  // src/main/jni/Android.mk file for the `LOCAL_MODULE` variable.
                  targets "rndiffapp_appmodules"
              }
          }
          if (!enableSeparateBuildPerCPUArchitecture) {
              ndk {
                  abiFilters (*reactNativeArchitectures())
              }
          }
      }
    ```
    

@roryabraham
Copy link
Contributor Author

I'm able to reproduce the initial problem in a fresh repo. Added unit tests covering this here: #10316

Now need to figure out how to fix this

@roryabraham
Copy link
Contributor Author

I believe this article basically explains exactly what is happening to us

@rafecolton
Copy link
Member

This is the next article in the series where the author describes how to fix the problem. I'm not immediately sure how to do that in code in our automation, but given that they are describing the exact problem we are experiencing, I think this is the solution.

@tgolen
Copy link
Contributor

tgolen commented Aug 9, 2022 via email

@roryabraham
Copy link
Contributor Author

roryabraham commented Aug 10, 2022

It's really hard to create a TL;DR of this problem because it's pretty convoluted, but these diagrams are pretty good. Note that in our example, our equivalent of master is actually staging, and our equivalent of feature is actually main:

Before (with cherry-pick)

image

After (by merging into a "helper" branch)

image


So I pushed a number of changes to this PR to:

  • DRY things up to (hopefully) make the tests more readable
  • Recreate the problem scenario and cover it with some tests / assertions
  • Implement the solution in the article

If you check out that PR, you can run the tests locally with: ./tests/unit/gitTestCI.sh || rm -rf ~/DumDumRepo.

If you comment out the assertions in the tests that look like this:

assert_equal "$output" "[ '7', '5', '1' ]"

Then you'll see the solution outlined in the article indeed fixes the scenario where a revert PR is persisted on staging, even after the same change is un-reverted on main and then main is merged into staging. However, I don't think this really works for us because it breaks other tests. I'm trying hard to understand the problem and solution here, but due to the urgency I could use more eyes.

@roryabraham
Copy link
Contributor Author

Okay, I believe I know how to fix this. I included some diagrams to explain the fix in the draft PR. I also got tests passing but need to implement the changes in the actual GitHub Workflows. Also I just want to carry this out with more examples to try and see if we can find any other cases where it doesn't work. For example, I haven't yet explicitly tested this slightly different scenario that's currently blocking deploys. I'm also not sure if there's anything we'll need to do to get main + staging back into a state where this solution will work.

The TL;DR of this solution is that we actually merge the temporary CP branch into both main and staging, instead of just staging. That way, it becomes a more recent common ancestor of both those branches and fixes the later merge that happens when main is merged back into staging. One sort of annoying side-effect of this is that it will mean one more OSBotify PR merged into main for every PR we cherry-pick. Not a dealbreaker though, that's just what we have to do to auto-commit code into protected branches.

@tgolen
Copy link
Contributor

tgolen commented Aug 10, 2022

My concern with this is that we're digging so deep into the weeds that we're ("we" as in, anyone that isn't Rory 😁 ) never going to be able to maintain this or fix it if it breaks.

Thinking back to web-e deploying, we've never ran into this problem and that's a deploy that's been pretty stable for many years (with a CP process to both staging and production branches).

So, at some point... do we sit back and ask ourselves the question: Is this still the right direction to be pushing ourselves in?

GitHub actions work great for simple repos (@AndrewGable loved implementing them on comp, k2, etc.), but when it comes to very complex deploy flows like this, I think that's when the complexity begins to be more than we have the capacity to manage.

I know that doesn't really help to solve this particular issue, and I'm still very lost on it, but I can't escape the feeling like we've dug ourselves into a hole and we just keep digging deeper :(

@roryabraham
Copy link
Contributor Author

roryabraham commented Aug 10, 2022

@tgolen Feel free to correct me if I'm misinterpreting, but I think that there's an underlying assumption in your concern – namely, some combination of the following:

  • This problem can't happen on Web-Expensify
  • The problem is introduced by code we have in place to automatically resolve conflicts
  • This has something to do with GitHub Actions v.s: a manual deploy process

But I don't think that's true. As a matter of fact, the Git logic in the NewDot deploy process mirrors the Git logic in the Web-E deploy process pretty closely. To demonstrate that I believe this same problem can occur in Web-Expensify, I wrote a little script. You'll notice that it doesn't use any flags to automatically resolve conflicts, and when you run it you'll see that there are no manual conflicts to resolve either:

#!/bin/bash

# Fail immediately if there is an error thrown
set -e

BLUE=$'\e[1;34m'
RED=$'\e[1;31m'
RESET=$'\e[0m'
function info {
  echo "$BLUE$1$RESET"
}
function error {
  echo "💥 $RED$1$RESET"
} 

info "Creating repo"
mkdir ~/DumDumRepo
cd DumDumRepo
git init -b main
echo "some content" >> myFile.txt
git add myFile.txt
git commit -m "Initial commit"

info
info "Creating staging branch"
git checkout -b staging
git checkout main

info
info "Appending and prepending content (initial PR)"
printf "Before:\n\n$(cat myFile.txt)"
printf "Prepended content\n%s" "$(cat myFile.txt)" > myFile.txt
printf "\nAppended content\n" >> myFile.txt
printf "\n\nAfter:\n\n$(cat myFile.txt)\n"

info
info "Committing change to main"
git add myFile.txt
git commit -m "Append and prepend content"

info
info "Merging main into staging"
git checkout staging
git merge main

info
info "Making an unrelated change on main"
git checkout main
printf "File list before:\n"
ls
echo "some content" >> anotherFile.txt
git add anotherFile.txt
git commit -m "Create another file"
printf "File list after:\n"
ls

info
info "Reverting the append + prepend on main"
printf "Before:\n\n$(cat myFile.txt)\n"
echo "some content" > myFile.txt
printf "\nAfter:\n\n$(cat myFile.txt)\n"
git add myFile.txt
git commit -m "Revert append and prepend"

info
info "Cherry-picking the revert to staging"
REVERT_COMMIT=$(git rev-parse HEAD)
git checkout staging
git cherry-pick "$REVERT_COMMIT"

info
info "Verifying that the revert is present on staging, but the unrelated change is not"
printf "myFile.txt on staging:\n\n$(cat myFile.txt)"
printf "\n\nFile list on staging:\n\n"
ls
if [[ "$(cat myFile.txt)" != "some content" ]]; then
    error "Revert did not make it to staging"
fi
if [[ -f anotherFile.txt ]]; then
    error "Unrelated change made it to staging"
fi

info
info "Repeating previously reverted append + prepend on main"
git checkout main
printf "Before:\n\n$(cat myFile.txt)\n\n"
printf "Prepended content\n%s" "$(cat myFile.txt)" > myFile.txt
printf "\nAppended content\n" >> myFile.txt
git add myFile.txt
git commit -m "Append and prepend content again"
printf "\n\nAfter:\n\n$(cat myFile.txt)\n"

info
info "Mering main into staging"
git checkout staging
git merge main

info
info "Verifying that the append + prepend (just added back on main) are now present on staging"
printf "myFile.txt on staging:\n\n$(cat myFile.txt)"
if [[ "$(cat myFile.txt)" != *"Prepended"*  ]]; then
    error "Prepended content not present on staging"
fi
if [[ "$(cat myFile.txt)" != *"Appended"*  ]]; then
    error "Appended content not present on staging"
fi

info
info "Verifying that previously unrelated change is now present on staging"
printf "File list on staging:\n\n"
ls
if [[ ! -f anotherFile.txt  ]]; then
    error "Other file is not present on staging"
fi

info
info "Cleaning up..."
cd ..
rm -rf ~/DumDumRepo

While this script is kind of long, you'll see that it demonstrates this problem:

  1. Change A is merged to main
  2. main is merged into staging
  3. An unrelated change, B is merged to main
  4. Change A is reverted on main (call this change R)
  5. Change R is cherry-picked to staging. At this point A should be reverted on staging and B should not be present on staging
  6. Change A is added back on main (called this A')
  7. main is merged into staging
  8. At this point:
    1. Change B should be present on staging ✅
    2. Change R should be replaced by A' (and in effect, A should be present on staging), but it's not 💥

I encourage creating the script locally and running it to see the output. Unless I'm missing something all this same Git logic is present in Web-Expensify, and the fact that a person is there to resolve conflicts won't really make a difference, because there are no conflicts.

EDIT: In case anyone prefers an abbreviated version of the above script without so many logs, here ya go:

#!/bin/bash

mkdir ~/DumDumRepo
cd DumDumRepo
git init -b main
echo "some content" >> myFile.txt
git add myFile.txt
git commit -m "Initial commit"

git checkout -b staging
git checkout main

printf "Prepended content\n%s" "$(cat myFile.txt)" > myFile.txt
printf "\nAppended content\n" >> myFile.txt

git add myFile.txt
git commit -m "Append and prepend content"

git checkout staging
git merge main

git checkout main
echo "some content" >> anotherFile.txt
git add anotherFile.txt
git commit -m "Create another file"

echo "some content" > myFile.txt
git add myFile.txt
git commit -m "Revert append and prepend"

REVERT_COMMIT=$(git rev-parse HEAD)
git checkout staging
git cherry-pick "$REVERT_COMMIT"

if [[ "$(cat myFile.txt)" != "some content" ]]; then
    echo "Error: Revert did not make it to staging"
fi
if [[ -f anotherFile.txt ]]; then
    echo "Error: Unrelated change made it to staging"
fi

git checkout main
printf "Prepended content\n%s" "$(cat myFile.txt)" > myFile.txt
printf "\nAppended content\n" >> myFile.txt
git add myFile.txt
git commit -m "Append and prepend content again"

git checkout staging
git merge main

if [[ "$(cat myFile.txt)" != *"Prepended"*  ]]; then
    echo "Error: Prepended content not present on staging"
fi
if [[ "$(cat myFile.txt)" != *"Appended"*  ]]; then
    echo "Error: Appended content not present on staging"
fi

if [[ ! -f anotherFile.txt  ]]; then
    echo "Other file is not present on staging"
fi

cd ..
rm -rf ~/DumDumRepo

@tgolen
Copy link
Contributor

tgolen commented Aug 11, 2022

Thanks, Rory!

I totally understand that the problem you're running into can be reproduced on ANY repository. It's not a problem that's specific to a repo. So, while I believe you that the script reproduces the bug, I don't think it's an accurate test.

Can you reproduce the bug using the web-e deploy process?

@tgolen
Copy link
Contributor

tgolen commented Aug 11, 2022

Primarily, I think one of the disconnects with the bug, and the web-e deploy process might be your step 7:

main is merged into staging

We don't merge main into staging during the deploy process. We create a new copy of staging based on main (and destroy the previous version of staging). Typically referred to as the "branch dance".

@roryabraham
Copy link
Contributor Author

roryabraham commented Aug 11, 2022

We don't merge main into staging during the deploy process. We create a new copy of staging based on main (and destroy the previous version of staging). Typically referred to as the "branch dance".

Very astute observation! This does in fact prevent the issue we're seeing here! You can see that by changing the above script(s) to replace:

- git checkout staging
- git merge main
+ git branch -D staging
+ git checkout -b staging

And looking at the Git chart it makes sense why:

GitChartWebE drawio


So my next question is: How do we manage that "branch dance" in Web-Expensify? Does it require input from someone in Ring 0 to delete the (presumably protected?) staging branch? If we can use the same process in NewDot it would be a valuable simplification and solve this problem to boot

@tgolen
Copy link
Contributor

tgolen commented Aug 11, 2022

Ah, cool! I believe the way it works is that the deployer group is given permissions on the repo to force push to staging and production. Then the script does a forced push of the current main branch to the remote's staging branch (I think this is as simple as using the -F flag). It's been a while since I've dug into that code though.

The force push is applied here: https://github.com/Expensify/PHP-Libs/blob/main/src/Deploy/Deployify.php#L199

I think the force parameter is passed from here: https://github.com/Expensify/Web-Expensify/blob/20ba8c1e41eeae72ba158c7fd8173a09cd777475/_tools/deploy/lib.sh#L120-L121

@kidroca
Copy link
Contributor

kidroca commented Aug 11, 2022

@roryabraham I haven't dug that deep into the problem, but could a thing like enforcing PRs to be rebased onto main before the merge fix the problem without changing the current cherry pick flow

I've been asked to rebase my PR when working on other open source projects

Rebase keeps a linear history and this should help with something being cherry picked after merge, since it would be picking from a "straight line". The branch to be merged will be properly up to date with any changes in main

The problem with this strategy is that merging something to main would require all the other PR's to rebase, but maybe we can find some middle-ground?

@roryabraham
Copy link
Contributor Author

roryabraham commented Aug 11, 2022

After discussing in slack we are going to:

  • Give OSBotify permission to force-push to staging and production (done)
  • Instead of merging main into staging, just delete staging and replace it with main

@roryabraham
Copy link
Contributor Author

This is an important fix we need to make, but we're currently prioritizing WAQ and product bugs.

@roryabraham roryabraham added the Daily KSv2 label Jun 14, 2023
@roryabraham roryabraham changed the title [HOLD] Investigate why some changes didn't make it to staging as expected Investigate why some changes didn't make it to staging as expected Jun 14, 2023
@roryabraham
Copy link
Contributor Author

Taking this off HOLD and treating it as my top priority

@melvin-bot melvin-bot bot removed the Overdue label Jun 14, 2023
@roryabraham roryabraham mentioned this issue Jun 15, 2023
57 tasks
@melvin-bot melvin-bot bot added the Reviewing Has a PR in review label Jun 16, 2023
@melvin-bot
Copy link

melvin-bot bot commented Jun 21, 2023

Triggered auto assignment to @nathanmetcalf (ring0), see https://stackoverflow.com/c/expensify/questions/6102 for more details.

@roryabraham
Copy link
Contributor Author

@nathanmetcalf the GitHub settings changes we need were discussed here.

@roryabraham roryabraham removed the ring0 label Jun 22, 2023
@roryabraham
Copy link
Contributor Author

@nathanmetcalf friendly bump so we can hopefully roll out the fix for this tomorrow 🙂

@roryabraham roryabraham removed the Reviewing Has a PR in review label Jun 22, 2023
@nathanmetcalf
Copy link
Contributor

Updated the rules as per Rory's New. messages.

@nathanmetcalf nathanmetcalf reopened this Jun 23, 2023
@nathanmetcalf nathanmetcalf removed their assignment Jun 23, 2023
@roryabraham
Copy link
Contributor Author

Seems my instructions caused some problems. Taking this over to https://github.com/Expensify/Expensify/issues/294557 for auditing purposes.

@melvin-bot melvin-bot bot added the Overdue label Jun 26, 2023
@roryabraham
Copy link
Contributor Author

This is fixed. Most of the testing + context is in this slack thread

@melvin-bot melvin-bot bot removed the Overdue label Jun 26, 2023
@melvin-bot melvin-bot bot added Weekly KSv2 Awaiting Payment Auto-added when associated PR is deployed to production and removed Daily KSv2 labels Jun 26, 2023
@melvin-bot melvin-bot bot changed the title Investigate why some changes didn't make it to staging as expected [HOLD for payment 2023-07-03] Investigate why some changes didn't make it to staging as expected Jun 26, 2023
@melvin-bot
Copy link

melvin-bot bot commented Jun 26, 2023

The solution for this issue has been 🚀 deployed to production 🚀 in version 1.3.32-5 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue:

If no regressions arise, payment will be issued on 2023-07-03. 🎊

After the hold period is over and BZ checklist items are completed, please complete any of the applicable payments for this issue, and check them off once done.

  • External issue reporter
  • Contributor that fixed the issue
  • Contributor+ that helped on the issue and/or PR

As a reminder, here are the bonuses/penalties that should be applied for any External issue:

  • Merged PR within 3 business days of assignment - 50% bonus
  • Merged PR more than 9 business days after assignment - 50% penalty

@melvin-bot melvin-bot bot added the Reviewing Has a PR in review label Jun 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Awaiting Payment Auto-added when associated PR is deployed to production Engineering Improvement Item broken or needs improvement. Reviewing Has a PR in review Weekly KSv2
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants