bot no longer autostarts CI? #235

Closed
richardlau opened this issue Jun 5, 2019 · 19 comments

@richardlau
Member

Opened a new PR (nodejs/node#28076) an hour ago and the bot hasn't autostarted a lite CI run as expected.

@phillipj
Member

phillipj commented Jun 5, 2019

Thanks for reporting!

You're definitely onto something; I'm seeing this in the logs:

19:38:19.872Z DEBUG bot: Triggering Jenkins build (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
19:38:20.366Z DEBUG bot: checking out origin/v7.x-staging... (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
19:38:20.385Z DEBUG bot: (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
    error: pathspec 'origin/v7.x-staging' did not match any file(s) known to git.

19:38:20.386Z DEBUG bot: backport to 7 failed (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
19:38:20.387Z DEBUG bot: Should have added (but temporary disabled): dont-land-on-v7.x (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
19:38:20.387Z DEBUG bot: processing a new backport to v6 (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
19:38:20.387Z DEBUG bot: aborting any previous backport attempt... (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
19:38:20.405Z DEBUG bot: updating git remotes... (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
19:38:20.423Z DEBUG bot: (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
    Fetching origin

19:38:20.784 FATAL bot: Unchaught exception, terminating bot process immediately
    TypeError: Cannot destructure property `jobName` of 'undefined' or 'null'.
        at onBuildTriggered (/../scripts/trigger-jenkins-build.js:85:53)
        at Request.request.post [as _callback] (/home/../scripts/trigger-jenkins-build.js:70:14)
        at Request.self.callback (/home/../node_modules/request/request.js:185:22)
        at Request.emit (events.js:189:13)
        at Request.<anonymous> (/home/../node_modules/request/request.js:1161:10)
        at Request.emit (events.js:189:13)

After that, the bot process dies and starts up again a few seconds later.

Not sure if all the above is related to the fatal explosion at the bottom, but sharing those logs anyway.
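For what it's worth, that kind of TypeError usually means a callback destructured a property from a value that turned out to be undefined or null. A minimal sketch of the defensive pattern, purely hypothetical and not the actual trigger-jenkins-build.js code (the `onBuildTriggered` signature and `build` parameter are assumptions):

// Hypothetical sketch: guard the value before destructuring it, so a missing
// Jenkins response doesn't crash the whole bot process.
function onBuildTriggered (err, build) {
  if (err || build == null) {
    console.error('Jenkins build trigger failed:', err || 'no build data returned')
    return
  }
  // `const { jobName } = undefined` is exactly what throws
  // "Cannot destructure property `jobName` of 'undefined' or 'null'"
  const { jobName } = build
  console.log(`Triggered Jenkins job ${jobName}`)
}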

@sam-github

error: pathspec 'origin/v7.x-staging' did not match any file(s) known to git.

That's my fault, I guess, though I might be triggering a bug elsewhere. I deleted the (I thought) unused staging branches, since we'll never do a v7.x release again (and we can recreate a staging branch if we really need one).

But I'm a bit confused: why should @richardlau's PR (richardlau wants to merge 1 commit into nodejs:master from richardlau:11eol) depend on the existence of v7.x-staging?

@sam-github

These lines are interesting, too. I suspect the bot still contains no-longer-necessary (but still running) code that checks whether commits cherry-pick cleanly onto LTS branches. I think we removed that feature (failing to cherry-pick happens all the time, but it doesn't mean the commits won't pick cleanly when the whole branch is picked down), and we certainly don't care whether commits pick cleanly onto unsupported release lines.

19:38:20.386Z DEBUG bot: backport to 7 failed (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
19:38:20.387Z DEBUG bot: Should have added (but temporary disabled): dont-land-on-v7.x (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)
19:38:20.387Z DEBUG bot: processing a new backport to v6 (req_id=78c9fb30-87c9-11e9-88d8-ed216d688228, pr=nodejs/node/#28078, action=pull_request.opened)

I'm going to repush those staging branches, but I think this is indicative of a code cleanup needed in the bot.

@sam-github

I pushed v6.x-staging and v7.x-staging. Perhaps the issue should be retitled?

@richardlau
Member Author

Those log lines are from https://github.com/nodejs/github-bot/blob/b6e54da9741460fd85fda1931bfb3aeaaba4fc28/scripts/attempt-backport.js. Given that it references 7, 6 and 4, which are long EOL, it either needs updating or removal:

const nodeVersions = [
  { version: 7 },
  { version: 6, lts: true },
  { version: 4, lts: true }
]

@sam-github

probably removal, along with

for (const node of nodeVersions) {
  queueAttemptBackport(options, node.version, !!node.lts)
}
and
// to trigger polling manually
app.get('/attempt-backport/pr/:owner/:repo/:id', (req, res) => {
  const owner = req.params.owner
  const repo = req.params.repo
  const prId = parseInt(req.params.id, 10)
  const options = { owner, repo, prId, logger: req.log }

  if (~enabledRepos.indexOf(repo)) {
    for (const node of nodeVersions) {
      queueAttemptBackport(options, node.version, !!node.lts)
    }
  }

  if (!inProgress) processNextBackport()

  res.end()
})
and
function queueAttemptBackport (options, version, isLTS) {
  queue.push(function () {
    options.logger.debug(`processing a new backport to v${version}`)
    attemptBackport(options, version, isLTS, processNextBackport)
  })
}

function attemptBackport (options, version, isLTS, cb) {
  // Start
  gitAmAbort()

  function wrapCP (cmd, args, opts, callback) {
    let exited = false

    if (arguments.length === 3) {
      callback = opts
      opts = {}
    }
    opts.cwd = global._node_repo_dir

    const cp = spawn(cmd, args, opts)
    const argsString = [cmd, ...args].join(' ')

    cp.on('error', function (err) {
      debug(`child_process err: ${err}`)
      if (!exited) onError()
    })
    cp.on('exit', function (code) {
      exited = true
      if (!cb) {
        debug(`error before exit, code: ${code}, on '${argsString}'`)
        return
      } else if (code > 0) {
        debug(`exit code > 0: ${code}, on '${argsString}'`)
        onError()
        return
      }
      callback()
    })

    // Useful when debugging.
    cp.stdout.on('data', (data) => {
      options.logger.debug(data.toString())
    })
    cp.stderr.on('data', (data) => {
      options.logger.debug(data.toString())
    })

    return cp
  }

  function onError () {
    if (!cb) return
    const _cb = cb
    setImmediate(() => {
      options.logger.debug(`backport to ${version} failed`)
      if (!isLTS) {
        options.logger.debug(`Should have added (but temporary disabled): dont-land-on-v${version}.x`)
      } else {
        getBotPrLabels(options, (err, ourLabels) => {
          if (err) {
            options.logger.error(err, 'Error fetching existing bot labels')
            return
          }
          const label = `lts-watch-v${version}.x`
          if (!ourLabels.includes(label)) return
          removeLabelFromPR(options, label)
        })
      }
      setImmediate(() => {
        inProgress = false
        _cb()
      })
    })
    cb = null
  }
-- I don't think any of that is used anymore.

@richardlau
Member Author

So it looks like jobs are now attempted to be started, but something's still going wrong, e.g. nodejs/node#28098 (comment):

Sadly, an error occurred when I tried to trigger a build. :(

@phillipj
Member

phillipj commented Jun 6, 2019

With #237 rolled out, we luckily have more error details to work with:

Error: Expected 200 from Jenkins, got 403
    at Request.request.post [as _callback] (/home/.../scripts/trigger-jenkins-build.js:70:17)
    at Request.self.callback (/home/.../node_modules/request/request.js:185:22)
    at Request.emit (events.js:189:13)

@phillipj
Member

phillipj commented Jun 6, 2019

IIRC we've seen this happen before when changes have been made in Jenkins, including plugin upgrades. Does that ring a bell for anyone in @nodejs/build?

@rvagg
Member

rvagg commented Jun 7, 2019

I'm not aware of anything that's changed on our end. There were some plugin security updates earlier this week, but I didn't apply them because they didn't seem to impact us. Maybe someone else did, though. There are some updates queued right now, so I've run them and restarted Jenkins. Maybe try again now to see if that's fixed anything.
@phillipj can you log more information about the failed response where you got that 403? I see it's only taking response.statusCode; the body might be informative in these situations.

@phillipj
Member

phillipj commented Jun 7, 2019

Thanks for the quick reply. Good point about also logging the body; I snuck it into master now (7ced2a7). I'll post an example later when I see this error happening again.
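For illustration only (this is not the actual 7ced2a7 diff, just a sketch of the idea using the request library the bot already uses; `buildUrl` and `callback` are placeholders):

const request = require('request')

// Sketch: include the response body in the error, so a 403 from Jenkins is
// actionable instead of only reporting the status code.
request.post({ url: buildUrl, json: true }, (err, response, body) => {
  if (err) return callback(err)
  if (response.statusCode !== 200) {
    return callback(new Error(`Expected 200 from Jenkins, got ${response.statusCode}: ${JSON.stringify(body)}`))
  }
  callback(null, body)
})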

@phillipj
Member

phillipj commented Jun 7, 2019

The 403 response body consists of HTML, and the only human-readable message I see is:

<h1>Access Denied</h1><p class="error">nodejs-github-bot is missing the Job/Build permission</p>

@nodejs/jenkins-admins help?

@rvagg
Member

rvagg commented Jun 11, 2019

sorry, I'm not sure what's up with this, the bot has this config:

JENKINS_JOB_NODE=node-test-pull-request-lite-pipeline
JENKINS_BUILD_TOKEN_NODE=xyz

in the node-test-pull-request-lite-pipeline job, it has that build token (replaced here with xyz but it's a long random string). I suppose, given that error message, that you not only need the token but also user permission.

nodejs-github-bot is only a member of the bots and automation-collaborators teams in the GitHub org, and neither of these is configured in Jenkins to say that it has "Job/Build permissions".

I'm hesitant to just add one of those teams or the user itself to Jenkins without understanding how we got here. There are 3 obvious possibilities that I see:

  • A Jenkins upgrade broke this permission model--perhaps, but I haven't seen evidence that it was upgraded at the time this stopped working, or that Jenkins has changed anything. Someone could investigate this further, I suppose; the Jenkins release notes might be the place to look.
  • A change in Jenkins permissions--I can't find evidence of this and I've been fishing through config.xml history on the server. Perhaps there's something in the node-test-pull-request-lite-pipeline job that I'm not seeing, but I'm pretty sure it defers to global security.
  • A change in team membership of nodejs-github-bot--someone with audit log access on the org (@sam-github maybe?) could look through and see what kinds of changes were made in the timeframe in question. Maybe it got removed from "collaborators"? Maybe there's some team hierarchy stuff going on here.

Aside from that 🤷‍♂
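For context on why both pieces matter: Jenkins' token-based remote trigger authorizes the request against the job's build token, but with matrix-based security the authenticated user also needs Job/Build. Roughly, the trigger looks something like the sketch below; the env var names mirror the config above, while the base URL, the credentials variable and the request shape are assumptions rather than the bot's actual code:

const request = require('request')

// Sketch of a token-based remote trigger. Even with a valid build token on the
// job, Jenkins' authorization matrix can still reject the authenticated user
// (here nodejs-github-bot) with 403 "missing the Job/Build permission".
const jobName = process.env.JENKINS_JOB_NODE            // node-test-pull-request-lite-pipeline
const buildToken = process.env.JENKINS_BUILD_TOKEN_NODE

request.post({
  url: `https://ci.nodejs.org/job/${jobName}/build`,    // assumed Jenkins base URL
  qs: { token: buildToken },
  auth: { user: 'nodejs-github-bot', pass: process.env.JENKINS_API_TOKEN } // assumed credential env var
}, (err, response) => {
  if (err || response.statusCode === 403) {
    console.error('Trigger rejected: check both the build token and the user\'s Job/Build permission')
  }
})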

phillipj added a commit to phillipj/github-bot that referenced this issue Jun 15, 2019
As the bot does not have permission to trigger Jenkins Lite-CI builds
for some reason at the moment, and there's feedback about these error
comments not being useful to collaborators anyway, those comments are
muted for now.

Ideally github-bot maintainers should get notified about these errors,
as they're obviously not meant to be raised at all. But using
node core collaborators as a proxy, manually telling bot maintainers about
errors, surely isn't ideal either.

Refs nodejs#238
Refs nodejs#235 (comment)
phillipj added a commit that referenced this issue Jun 17, 2019 with the same message.
@phillipj
Member

@mmarchini from the sound of your work related to nodejs/node#34089, this might suddenly start working again, since the bot gets enough privileges to start CI? Or wouldn't that affect the existing Jenkins token the bot is already using?

@mmarchini
Contributor

mmarchini commented Jul 30, 2020

Ahhhh, I thought I saw an auto-started CI on one PR. Yeah, my guess is this should be working now.

@mmarchini
Contributor

There it is: nodejs/node#34558 (comment)

The bot is alive!

@phillipj
Member

Hooray! Thanks for diving into getting those privileges fixed 💯

@mmarchini
Contributor

For some reason, this only worked once 🤔

Probably not worth reopening though; once we have the auto-start-via-label on nodejs/node, collaborators will be able to add the label while opening the PR, thereby starting a full CI run right away :)

(we also have Actions now that make the -lite pipeline redundant; we might consider disabling that functionality if the bot starts autorunning -lite again)

@mmarchini
Contributor

Opened a PR to remove Lite-CI since it starts but fails instantly.
