Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG (?): Large changes to Code-Review scores between 4.10.2 and 4.10.5 #2812

Closed
pnacht opened this issue Mar 31, 2023 · 4 comments · Fixed by #2882
Closed

BUG (?): Large changes to Code-Review scores between 4.10.2 and 4.10.5 #2812

pnacht opened this issue Mar 31, 2023 · 4 comments · Fixed by #2882
Labels
check/Code-Review kind/bug Something isn't working

Comments

@pnacht
Copy link
Contributor

pnacht commented Mar 31, 2023

Describe the bug
4.10.5 gives significantly lower scores for Code-Review than 4.10.2.

I am running Scorecard frequently for some 200 projects. I just upgraded from 4.10.2 to 4.10.5 and noticed that average Code-Review scores are drastically lower: of the 200 projects, 137 saw their score drop. And the average drop (looking only at those projects that dropped) was of ~5 points.

Δ score # projects
-10 3
-9 6
-8 17
-7 14
-6 17
-5 17
-4 8
-3 21
-2 14
-1 20
0 61
1 2

Reproduction steps
Steps to reproduce the behavior:
Run Scorecard for the following sample projects with 4.10.2 and 4.10.5:

Results (trimmed):

# pandas @ 4.10.2
    {
      "score": 9,
      "reason": "28 out of last 30 changesets reviewed before merge -- score normalized to 9",
      "name": "Code-Review",
    },
# pandas @ 4.10.5
    {
      "score": 0,
      "reason": "found 2 unreviewed human changesets",
      "name": "Code-Review",
    },
# systemd/systemd @ 4.10.2
    {
      "score": 8,
      "reason": "18 out of last 21 changesets reviewed before merge -- score normalized to 8",
      "name": "Code-Review",
    },

# systemd/systemd @ 4.10.5
    {
      "score": 0,
      "reason": "found 3 unreviewed human changesets",
      "name": "Code-Review",
    },
# microsoft/typescript @ 4.10.2
    {
      "score": 7,
      "reason": "22 out of last 30 changesets reviewed before merge -- score normalized to 7",
      "name": "Code-Review",
    },

# microsoft/typescript @ 4.10.5
    {
      "score": 0,
      "reason": "found 8 unreviewed human changesets",
      "name": "Code-Review",
    },

I intentionally picked these three projects to display here since they are very popular, active and have mature communities. It intuitively seems more likely that 80% of PRs are reviewed than that none are.

Diving in

Running the CLI with --raw and EXPERIMENTAL_V6=1, I took a look at how commits were handled by both versions.

Here are a few cases from pandas-dev/pandas:

pandas-dev/pandas#52308 was merged without review (author was also merger, so it wasn't an implicit review), but 4.10.5 has a "review: APPROVED by mroeschke" result.

# https://github.com/pandas-dev/pandas/pull/52308 @ 4.10.2
{
   ...
    "results": {
        ...
        "defaultBranchChangesets": [
            ...,
            {
                "number": "52308",
                "reviews": [],
                "authors": [ {
                        "login": "mroeschke"
                    } ],
                "commits": [ {
                        "message": "TYP: Remove unneeded type ignore (#52308)",
                        "sha": "ceef0da443a7bbb608fb0e251f06ae43a809b472",
                        "committer": {
                            "login": "github"
                        }
                    } ]
            },

# https://github.com/pandas-dev/pandas/pull/52308 @ 4.10.5
            {
                "number": "52308",
                "reviews": [ {
                        "state": "APPROVED",
                        "reviewer": {
                            "login": "mroeschke",
                            "isBot": false
                        }
                    } ],
                "authors": [ {
                        "login": "mroeschke",
                        "isBot": false
                    } ],
                "commits": [ {
                        "message": "TYP: Remove unneeded type ignore (#52308)",
                        "sha": "ceef0da443a7bbb608fb0e251f06ae43a809b472",
                        "committer": {
                            "login": "github",
                            "isBot": false
                        }
                    }
                ]
            },

pandas-dev/pandas#52316 was reviewed and approved by mroeschke, who then merged it. 4.10.2 presents this correctly, with a review array describing a few comments and then a single approval. 4.10.5, however, shows two approvals by mroeschke. This behavior can be found in a few other PRs.

# https://github.com/pandas-dev/pandas/pull/52316 @ 4.10.2
{
    "number": "52316",
    "platform": "GitHub",
    "reviews": [
        {
            "state": "COMMENTED",
            "reviewer": {
                "login": "mroeschke"
            }
        },
        {
            "state": "COMMENTED",
            "reviewer": {
                "login": "jbrockmendel"
            }
        },
        {
            "state": "APPROVED",
            "reviewer": {
                "login": "mroeschke"
            }
        }
    ],
    "authors": [
        {
            "login": "jbrockmendel"
        }
    ],
    "commits": [
        {
            "message": "PERF: concat_compat (#52316)",
            "sha": "9ba4ef86325abab6029436dcabbe3fab4a60f6ba",
            "committer": {
                "login": "github"
            }
        }
    ]
},

# https://github.com/pandas-dev/pandas/pull/52316 @ 4.10.5
{
    "number": "52316",
    "platform": "GitHub",
    "reviews": [
        {
            "state": "COMMENTED",
            "reviewer": {
                "login": "mroeschke",
                "isBot": false
            }
        },
        {
            "state": "COMMENTED",
            "reviewer": {
                "login": "jbrockmendel",
                "isBot": false
            }
        },
        {
            "state": "APPROVED",
            "reviewer": {
                "login": "mroeschke",
                "isBot": false
            }
        },
        {
            "state": "APPROVED",
            "reviewer": {
                "login": "mroeschke",
                "isBot": false
            }
        }
    ],
    "authors": [ {
            "login": "jbrockmendel",
            "isBot": false
        } ],
    "commits": [ {
            "message": "PERF: concat_compat (#52316)",
            "sha": "9ba4ef86325abab6029436dcabbe3fab4a60f6ba",
            "committer": {
                "login": "github",
                "isBot": false
            }
        } ]
},

I suspect that, in 4.10.5, merging a PR is always being considered an approval, even when the author is the same as the merger. However, somehow the Code-Review score in 4.10.5 is lower, not higher...

Expected behavior
Unclear? Given the dramatic shift in scores, it seems apparent one of the versions (4.10.2 or 4.10.5) is/was giving incorrect results. I intuitively suspect 4.10.2 is correct, but I could be wrong? 🤷‍♂️

Additional context
#2798 would be useful here. Having information of how different PRs/commits are categorized would make debugging and comprehending this sort of thing much easier.

Shoutout to @gabibguti, who helped me do this initial exploration of this behavior.

@pnacht pnacht added the kind/bug Something isn't working label Mar 31, 2023
@azeemshaikh38
Copy link
Contributor

Most likely due to #2450. @raghavkaul fyi.

I strongly think this change was the wrong direction. Not just due to the scoring changes but also the fact that unreviewed bot PRs are an actual threat vector and shouldn't simply be ignored. My comment on the original thread has more details on the reasoning.

I vote that we undo this change. @ossf/scorecard-maintainers please vote so we can reach consensus here.

@raghavkaul
Copy link
Contributor

Yes, I think it makes sense to move back to proportional scoring for unreviewed changesets. The initial reasoning to leveled scoring was that even a single unreviewed PR introduces risk. However, it's way more frequent than we initially expected, even among large, well-maintained projects, that some code is unreviewed, so to bring them to 3/10 or 0/10 doesn't seem like a useful indicator of how seriously they take code review.

I do think we could answer the question of unreviewed PRs by bots with more data (since #2450 actually was more lenient on those, not less lenient, by deducting a constant value).

@laurentsimon
Copy link
Contributor

using proportional scoring is fine with me. But still considering unreviewed bot PRs as unreviewed. Hopefully configuration and support for more "known" bots will help in the long run for repos that use a different review system based on bots.

@evverx
Copy link
Contributor

evverx commented May 5, 2023

I am running Scorecard frequently for some 200 projects. I just upgraded from 4.10.2 to 4.10.5 and noticed that average Code-Review scores are drastically lower: of the 200 projects, 137 saw their score drop

I was going through the systemd dashboard and found this issue. Given that it isn't the first time scorecard checks have gone down like that I wonder why this sort of testing isn't still preformed before rolling out releases?

(before I forget it's related to systemd/systemd#27530)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
check/Code-Review kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants