BUG (?): Large changes to Code-Review scores between 4.10.2 and 4.10.5 #2812

pnacht · 2023-03-31T22:43:49Z

Describe the bug
4.10.5 gives significantly lower scores for Code-Review than 4.10.2.

I am running Scorecard frequently for some 200 projects. I just upgraded from 4.10.2 to 4.10.5 and noticed that average Code-Review scores are drastically lower: of the 200 projects, 137 saw their score drop. And the average drop (looking only at those projects that dropped) was of ~5 points.

Δ score	# projects
-10	3
-9	6
-8	17
-7	14
-6	17
-5	17
-4	8
-3	21
-2	14
-1	20
0	61
1	2

Reproduction steps
Steps to reproduce the behavior:
Run Scorecard for the following sample projects with 4.10.2 and 4.10.5:

Results (trimmed):

# pandas @ 4.10.2
    {
      "score": 9,
      "reason": "28 out of last 30 changesets reviewed before merge -- score normalized to 9",
      "name": "Code-Review",
    },
# pandas @ 4.10.5
    {
      "score": 0,
      "reason": "found 2 unreviewed human changesets",
      "name": "Code-Review",
    },

# systemd/systemd @ 4.10.2
    {
      "score": 8,
      "reason": "18 out of last 21 changesets reviewed before merge -- score normalized to 8",
      "name": "Code-Review",
    },

# systemd/systemd @ 4.10.5
    {
      "score": 0,
      "reason": "found 3 unreviewed human changesets",
      "name": "Code-Review",
    },

# microsoft/typescript @ 4.10.2
    {
      "score": 7,
      "reason": "22 out of last 30 changesets reviewed before merge -- score normalized to 7",
      "name": "Code-Review",
    },

# microsoft/typescript @ 4.10.5
    {
      "score": 0,
      "reason": "found 8 unreviewed human changesets",
      "name": "Code-Review",
    },

I intentionally picked these three projects to display here since they are very popular, active and have mature communities. It intuitively seems more likely that 80% of PRs are reviewed than that none are.

Diving in

Running the CLI with --raw and EXPERIMENTAL_V6=1, I took a look at how commits were handled by both versions.

Here are a few cases from pandas-dev/pandas:

pandas-dev/pandas#52308 was merged without review (author was also merger, so it wasn't an implicit review), but 4.10.5 has a "review: APPROVED by mroeschke" result.

# https://github.com/pandas-dev/pandas/pull/52308 @ 4.10.2
{
   ...
    "results": {
        ...
        "defaultBranchChangesets": [
            ...,
            {
                "number": "52308",
                "reviews": [],
                "authors": [ {
                        "login": "mroeschke"
                    } ],
                "commits": [ {
                        "message": "TYP: Remove unneeded type ignore (#52308)",
                        "sha": "ceef0da443a7bbb608fb0e251f06ae43a809b472",
                        "committer": {
                            "login": "github"
                        }
                    } ]
            },

# https://github.com/pandas-dev/pandas/pull/52308 @ 4.10.5
            {
                "number": "52308",
                "reviews": [ {
                        "state": "APPROVED",
                        "reviewer": {
                            "login": "mroeschke",
                            "isBot": false
                        }
                    } ],
                "authors": [ {
                        "login": "mroeschke",
                        "isBot": false
                    } ],
                "commits": [ {
                        "message": "TYP: Remove unneeded type ignore (#52308)",
                        "sha": "ceef0da443a7bbb608fb0e251f06ae43a809b472",
                        "committer": {
                            "login": "github",
                            "isBot": false
                        }
                    }
                ]
            },

pandas-dev/pandas#52316 was reviewed and approved by mroeschke, who then merged it. 4.10.2 presents this correctly, with a review array describing a few comments and then a single approval. 4.10.5, however, shows two approvals by mroeschke. This behavior can be found in a few other PRs.

# https://github.com/pandas-dev/pandas/pull/52316 @ 4.10.2
{
    "number": "52316",
    "platform": "GitHub",
    "reviews": [
        {
            "state": "COMMENTED",
            "reviewer": {
                "login": "mroeschke"
            }
        },
        {
            "state": "COMMENTED",
            "reviewer": {
                "login": "jbrockmendel"
            }
        },
        {
            "state": "APPROVED",
            "reviewer": {
                "login": "mroeschke"
            }
        }
    ],
    "authors": [
        {
            "login": "jbrockmendel"
        }
    ],
    "commits": [
        {
            "message": "PERF: concat_compat (#52316)",
            "sha": "9ba4ef86325abab6029436dcabbe3fab4a60f6ba",
            "committer": {
                "login": "github"
            }
        }
    ]
},

# https://github.com/pandas-dev/pandas/pull/52316 @ 4.10.5
{
    "number": "52316",
    "platform": "GitHub",
    "reviews": [
        {
            "state": "COMMENTED",
            "reviewer": {
                "login": "mroeschke",
                "isBot": false
            }
        },
        {
            "state": "COMMENTED",
            "reviewer": {
                "login": "jbrockmendel",
                "isBot": false
            }
        },
        {
            "state": "APPROVED",
            "reviewer": {
                "login": "mroeschke",
                "isBot": false
            }
        },
        {
            "state": "APPROVED",
            "reviewer": {
                "login": "mroeschke",
                "isBot": false
            }
        }
    ],
    "authors": [ {
            "login": "jbrockmendel",
            "isBot": false
        } ],
    "commits": [ {
            "message": "PERF: concat_compat (#52316)",
            "sha": "9ba4ef86325abab6029436dcabbe3fab4a60f6ba",
            "committer": {
                "login": "github",
                "isBot": false
            }
        } ]
},

I suspect that, in 4.10.5, merging a PR is always being considered an approval, even when the author is the same as the merger. However, somehow the Code-Review score in 4.10.5 is lower, not higher...

Expected behavior
Unclear? Given the dramatic shift in scores, it seems apparent one of the versions (4.10.2 or 4.10.5) is/was giving incorrect results. I intuitively suspect 4.10.2 is correct, but I could be wrong? 🤷‍♂️

Additional context
#2798 would be useful here. Having information of how different PRs/commits are categorized would make debugging and comprehending this sort of thing much easier.

Shoutout to @gabibguti, who helped me do this initial exploration of this behavior.

The text was updated successfully, but these errors were encountered:

azeemshaikh38 · 2023-03-31T22:51:35Z

Most likely due to #2450. @raghavkaul fyi.

I strongly think this change was the wrong direction. Not just due to the scoring changes but also the fact that unreviewed bot PRs are an actual threat vector and shouldn't simply be ignored. My comment on the original thread has more details on the reasoning.

I vote that we undo this change. @ossf/scorecard-maintainers please vote so we can reach consensus here.

raghavkaul · 2023-04-04T15:01:31Z

Yes, I think it makes sense to move back to proportional scoring for unreviewed changesets. The initial reasoning to leveled scoring was that even a single unreviewed PR introduces risk. However, it's way more frequent than we initially expected, even among large, well-maintained projects, that some code is unreviewed, so to bring them to 3/10 or 0/10 doesn't seem like a useful indicator of how seriously they take code review.

I do think we could answer the question of unreviewed PRs by bots with more data (since #2450 actually was more lenient on those, not less lenient, by deducting a constant value).

laurentsimon · 2023-04-06T20:30:54Z

using proportional scoring is fine with me. But still considering unreviewed bot PRs as unreviewed. Hopefully configuration and support for more "known" bots will help in the long run for repos that use a different review system based on bots.

evverx · 2023-05-05T17:06:50Z

I am running Scorecard frequently for some 200 projects. I just upgraded from 4.10.2 to 4.10.5 and noticed that average Code-Review scores are drastically lower: of the 200 projects, 137 saw their score drop

I was going through the systemd dashboard and found this issue. Given that it isn't the first time scorecard checks have gone down like that I wonder why this sort of testing isn't still preformed before rolling out releases?

(before I forget it's related to systemd/systemd#27530)

pnacht added the kind/bug Something isn't working label Mar 31, 2023

naveensrinivasan added the check/Code-Review label Apr 13, 2023

naveensrinivasan mentioned this issue Apr 17, 2023

🐛 Code Review: Use proportional scoring #2882

Merged

evverx mentioned this issue May 6, 2023

Debug option ossf/scorecard-action#176

Open

raghavkaul closed this as completed in #2882 Jun 14, 2023

spencerschrock mentioned this issue Aug 10, 2023

✨ scdiff: add basic compare functionality #3363

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG (?): Large changes to Code-Review scores between 4.10.2 and 4.10.5 #2812

BUG (?): Large changes to Code-Review scores between 4.10.2 and 4.10.5 #2812

pnacht commented Mar 31, 2023 •

edited

Loading

azeemshaikh38 commented Mar 31, 2023

raghavkaul commented Apr 4, 2023

laurentsimon commented Apr 6, 2023

evverx commented May 5, 2023 •

edited

Loading

BUG (?): Large changes to Code-Review scores between 4.10.2 and 4.10.5 #2812

BUG (?): Large changes to Code-Review scores between 4.10.2 and 4.10.5 #2812

Comments

pnacht commented Mar 31, 2023 • edited Loading

Results (trimmed):

Diving in

azeemshaikh38 commented Mar 31, 2023

raghavkaul commented Apr 4, 2023

laurentsimon commented Apr 6, 2023

evverx commented May 5, 2023 • edited Loading

pnacht commented Mar 31, 2023 •

edited

Loading

evverx commented May 5, 2023 •

edited

Loading