
[Bug] Indexer holding many SSH sessions open #142

Closed
techman83 opened this issue Feb 4, 2020 · 8 comments
Labels: Bug (Something isn't working), Indexer (Receives inflated modules and adds them to CKAN-meta)

Comments

@techman83
Member

Problem

Each run leaves many defunct ssh processes behind, which suggests that however GitPython is being used, it isn't releasing those processes properly.

netkan   17747  0.0  0.0      0     0 ?        Z    01:40   0:00 [ssh] <defunct>
netkan   17835  0.0  0.0      0     0 ?        Z    01:40   0:00 [ssh] <defunct>

A single run

netkan@a43caac9d5dc:~$ ps -waux --sort=start_time|grep "01:"|grep ssh|wc -l
153

< 24 hours of uptime

netkan@a43caac9d5dc:~$ ps -waux --sort=start_time|grep ssh|wc -l
1699

The result is that over time the service starts thrashing the disk and eventually crashes.

[screenshot: 2020-02-04_14-33-40]

techman83 added the Bug label Feb 4, 2020
@HebaruSan
Member

HebaruSan commented Feb 4, 2020

We might be able to switch from SSH to HTTPS by storing the token in ~/.git-credentials in this format:

https://netkan-bot:token@github.com

Or possibly:

https://token@github.com
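If we go down that path, a rough sketch of the setup (assumptions: the local clone lives at CKAN-meta, the real host is github.com, and the token placeholder is whatever secret the bot already holds):

from pathlib import Path
from git import Repo

token = "<github-token>"  # placeholder; assumed to come from the bot's existing secrets

# Write the token URL that git's "store" credential helper reads
cred_file = Path("~/.git-credentials").expanduser()
cred_file.write_text(f"https://netkan-bot:{token}@github.com\n")
cred_file.chmod(0o600)

# Tell the clone to use the stored credentials for its HTTPS remotes
repo = Repo("CKAN-meta")
with repo.config_writer() as cfg:
    cfg.set_value("credential", "helper", "store")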

@techman83
Member Author

Possibly, but I'm not entirely sure it would fix the problem; the SSH sessions are probably just a side effect of the actual problem. GitPython calls git directly, so all of the git interactions, for better or worse, happen within git itself and are separate from Python.
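To illustrate (sketch only, assuming a local clone at CKAN-meta): each of these calls is a separate git subprocess, and for an SSH remote that git process spawns its own ssh child, entirely outside the Python interpreter.

from git import Repo

repo = Repo("CKAN-meta")
repo.remotes.origin.fetch()  # runs a `git fetch` subprocess, which spawns ssh for git@ remotes
repo.remotes.origin.push()   # another git + ssh pair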

HebaruSan added the Indexer label Feb 4, 2020
@HebaruSan
Member

HebaruSan commented Feb 4, 2020

Can we figure out which operation those processes were created to perform? Since we have so few new mods indexed per day, I think we can rule out add, commit, and push. Is it just the pull in MessageHandler._update_master?

Also I wonder if __exit__ is being called consistently; it's hard to know since it's left to the runtime's handling of with blocks. Maybe some garbage collection logic is delaying the clean-up?

def __exit__(self, exc_type, exc_value, traceback):
    self.repo.close()

OK, reading up on context managers, it sounds like the calling of __exit__ is pretty strongly guaranteed, so never mind that.
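For reference, a toy check of that guarantee (not project code): __exit__ runs even when the body of the with block raises.

class Handle:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        print("closed")  # runs before the exception propagates
        return False     # don't swallow the exception

try:
    with Handle():
        raise RuntimeError("boom")
except RuntimeError:
    pass  # "closed" has already been printed by this point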

@HebaruSan
Member

I think we can rule out add, commit, and push

Not quite the case; we do perform one push per batch, after handling the non-staged modules:

def process_ckans(self):
    self._process_queue(self.master)
    self._update_master(push=True)
    self._process_queue(self.staged)

In total, each batch of 10 messages looks like it would do:

  • Two pulls (one in __enter__ and one in process_ckans)
  • One push (in process_ckans)

... regardless of whether any of the modules were changed. We can probably eliminate the push (when there are no changes) and one of the pulls with some light refactoring. The overall problem would remain but in a less severe form.
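Something like this, perhaps (rough sketch only, reusing the names from the snippets above and assuming self.repo is the GitPython Repo that __enter__ already pulled):

def process_ckans(self):
    start = self.repo.head.commit
    self._process_queue(self.master)
    if self.repo.head.commit != start:  # only push when the queue produced commits
        self.repo.remotes.origin.push()
    self._process_queue(self.staged)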

@techman83
Member Author

OK, reading up on context managers, it sounds like the calling of __exit__ is pretty strongly guaranteed, so never mind that.

Yeah, context managers are pretty neat. If we were leaning on threads I could see scenarios where we might trip ourselves up, but that isn't the case here.

I have wondered whether we are tripping GitHub's abuse mechanisms and stalling the ssh connections. Previously there were significant pauses between batches of 10, but with the Inflator improvements we can really rip through an indexing run.

@techman83
Member Author

techman83 commented Feb 5, 2020

Well, it would appear the issue is essentially identical after the last run, which is interesting.

ps waux|grep defunct|wc -l
155

So whatever the problem is, it's going to be very obvious when it's no longer a problem!

@HebaruSan
Member

Apparently related (I didn't find this, @techman83 did):

https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/

Apparently git doesn't wait on the ssh processes it spawns, which leaves them as zombies, and a container without a proper init process as PID 1 has no mechanism for reaping them.
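A toy demonstration of the mechanism (illustration only, not project code): a child that exits but is never waited on shows up exactly like the <defunct> entries above, until something reaps it.

import os
import subprocess
import time

child = subprocess.Popen(["true"])  # stand-in for the ssh process git spawns
time.sleep(1)                       # the child has exited by now...
os.system(f"ps -o pid,stat,comm -p {child.pid}")  # ...but its stat column shows Z (defunct)
child.wait()                        # reaping it clears the zombie entry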

@HebaruSan
Member

Also apparently related: aws/amazon-ecs-agent#852
The --init flag runs a zombie-reaping init process in the container.
(Again credit to @techman83 for finding this)
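For the record, the same behaviour is available from the Docker SDK (sketch only; the image name is a placeholder, and docker-py's init parameter is its equivalent of docker run --init):

import docker

client = docker.from_env()
client.containers.run(
    "netkan/indexer:latest",  # placeholder image name
    init=True,                # run a tiny init as PID 1 that reaps zombie children
    detach=True,
)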
