
[Bug] Indexer holding many SSH sessions open #142

Closed
techman83 opened this issue Feb 4, 2020 · 8 comments
Labels: Bug (Something isn't working), Indexer (Receives inflated modules and adds them to CKAN-meta)

Comments

@techman83
Member

Problem

Each run leaves many defunct ssh processes behind, which suggests that however GitPython is being used, it isn't releasing those processes properly.

netkan   17747  0.0  0.0      0     0 ?        Z    01:40   0:00 [ssh] <defunct>
netkan   17835  0.0  0.0      0     0 ?        Z    01:40   0:00 [ssh] <defunct>

A single run

netkan@a43caac9d5dc:~$ ps -waux --sort=start_time|grep "01:"|grep ssh|wc -l
153

< 24 hours of uptime

netkan@a43caac9d5dc:~$ ps -waux --sort=start_time|grep ssh|wc -l
1699

The result is that over time the service starts thrashing the disk and eventually crashes.

[screenshot: 2020-02-04_14-33-40]

techman83 added the Bug label Feb 4, 2020
@HebaruSan
Member

HebaruSan commented Feb 4, 2020

We might be able to switch from SSH to HTTPS by storing the token in ~/.git-credentials in this format:

https://netkan-bot:token@github.com

Or possibly:

https://token@github.com
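If we go down that path, a rough sketch of the setup (assumptions: the local clone lives at CKAN-meta, the real host is github.com, and the token placeholder is whatever secret the bot already holds):

from pathlib import Path
from git import Repo

token = "<github-token>"  # placeholder; assumed to come from the bot's existing secrets

# Write the token URL that git's "store" credential helper reads
cred_file = Path("~/.git-credentials").expanduser()
cred_file.write_text(f"https://netkan-bot:{token}@github.com\n")
cred_file.chmod(0o600)

# Tell the clone to use the stored credentials for its HTTPS remotes
repo = Repo("CKAN-meta")
with repo.config_writer() as cfg:
    cfg.set_value("credential", "helper", "store")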

@techman83
Member Author

Possibly, but I'm not entirely sure it would fix the problem; the SSH sessions are probably just a side effect of the actual problem. GitPython calls git directly, so all of the git interactions, for better or worse, happen within git itself and are separate from Python.
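To illustrate (sketch only, assuming a local clone at CKAN-meta): each of these calls is a separate git subprocess, and for an SSH remote that git process spawns its own ssh child, entirely outside the Python interpreter.

from git import Repo

repo = Repo("CKAN-meta")
repo.remotes.origin.fetch()  # runs a `git fetch` subprocess, which spawns ssh for git@ remotes
repo.remotes.origin.push()   # another git + ssh pair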

HebaruSan added the Indexer label Feb 4, 2020
@HebaruSan
Member

HebaruSan commented Feb 4, 2020

Can we figure out which operation those processes were created to perform? Since we have so few new mods indexed per day, I think we can rule out add, commit, and push. Is it just the pull in MessageHandler._update_master?

Also I wonder if __exit__ is being called consistently; it's hard to know since it's left to the runtime's handling of with blocks. Maybe some garbage collection logic is delaying the clean-up?

def __exit__(self, exc_type, exc_value, traceback):
    self.repo.close()

OK, reading up on context managers, it sounds like the calling of __exit__ is pretty strongly guaranteed, so never mind that.
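For reference, a toy check of that guarantee (not project code): __exit__ runs even when the body of the with block raises.

class Handle:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        print("closed")  # runs before the exception propagates
        return False     # don't swallow the exception

try:
    with Handle():
        raise RuntimeError("boom")
except RuntimeError:
    pass  # "closed" has already been printed by this point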

@HebaruSan
Member

I think we can rule out add, commit, and push

Not quite the case; we do perform one push per batch, after handling the non-staged modules:

def process_ckans(self):
    self._process_queue(self.master)
    self._update_master(push=True)
    self._process_queue(self.staged)

In total, each batch of 10 messages looks like it would do:

  • Two pulls (one in __enter__ and one in process_ckans)
  • One push (in process_ckans)

... regardless of whether any of the modules were changed. We can probably eliminate the push (when there are no changes) and one of the pulls with some light refactoring. The overall problem would remain but in a less severe form.
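Something like this, perhaps (rough sketch only, reusing the names from the snippets above and assuming self.repo is the GitPython Repo that __enter__ already pulled):

def process_ckans(self):
    start = self.repo.head.commit
    self._process_queue(self.master)
    if self.repo.head.commit != start:  # only push when the queue produced commits
        self.repo.remotes.origin.push()
    self._process_queue(self.staged)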

@techman83
Member Author

OK, reading up on context managers, it sounds like the calling of __exit__ is pretty strongly guaranteed, so never mind that.

Yeah, context managers are pretty neat. If we were leaning on threads I could see scenarios where we might trip ourselves up, but that isn't the case here.

I have wondered whether we are tripping GitHub's abuse mechanisms and stalling the ssh connections. Previously there were significant pauses between batches of 10, but with the Inflator improvements we can really rip through an indexing run.

@techman83
Member Author

techman83 commented Feb 5, 2020

Well, it would appear the issue is essentially identical after the last run, which is interesting.

ps waux|grep defunct|wc -l
155

So whatever the problem is, it's going to be very obvious when it's no longer a problem!

@HebaruSan
Member

Apparently related (I didn't find this, @techman83 did):

https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/

Apparently git doesn't wait on the ssh processes it spawns, which leaves them as zombies, and a container without a proper init process as PID 1 has no mechanism for reaping them.
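A toy demonstration of the mechanism (illustration only, not project code): a child that exits but is never waited on shows up exactly like the <defunct> entries above, until something reaps it.

import os
import subprocess
import time

child = subprocess.Popen(["true"])  # stand-in for the ssh process git spawns
time.sleep(1)                       # the child has exited by now...
os.system(f"ps -o pid,stat,comm -p {child.pid}")  # ...but its stat column shows Z (defunct)
child.wait()                        # reaping it clears the zombie entry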

@HebaruSan
Member

Also apparently related: aws/amazon-ecs-agent#852
The --init flag runs a zombie-reaping init process in the container.
(Again credit to @techman83 for finding this)
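For the record, the same behaviour is available from the Docker SDK (sketch only; the image name is a placeholder, and docker-py's init parameter is its equivalent of docker run --init):

import docker

client = docker.from_env()
client.containers.run(
    "netkan/indexer:latest",  # placeholder image name
    init=True,                # run a tiny init as PID 1 that reaps zombie children
    detach=True,
)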
