g10k cache sometimes gets corrupted #76

Closed
alex-harvey-z3q opened this issue Oct 4, 2017 · 26 comments

Comments

@alex-harvey-z3q

From time to time I find that the g10k cache becomes corrupted and I am forced to delete it. This is a big problem in production and ultimately may mean I can't use g10k in production. A recent example was a failure like this:

executeCommand(): git command failed: git --git-dir /tmp/g10k/git@git.example.com-PUP_puppet-fstab.git remote update --prune exit status 128
Output: fatal: Not a git repository: '/tmp/g10k/git@git.example.com-PUP_puppet-fstab.git'
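
A cheap way to tell an empty or half-created cache directory apart from a real bare repository is to look for the files Git itself expects. The following is a minimal Go sketch, not g10k's code: the function name looksLikeBareRepo and the example path are hypothetical, and it assumes that a HEAD file plus objects/ and refs/ directories are the minimum layout Git needs before it accepts a --git-dir. It does not validate any objects, so it would not catch the "does not point to a valid object" kind of corruption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// looksLikeBareRepo reports whether dir has the minimal layout of a bare
// Git repository: a HEAD file plus objects/ and refs/ directories.
// It only catches missing, empty, or half-created cache directories.
func looksLikeBareRepo(dir string) bool {
	head, err := os.Stat(filepath.Join(dir, "HEAD"))
	if err != nil || head.IsDir() {
		return false
	}
	for _, sub := range []string{"objects", "refs"} {
		info, err := os.Stat(filepath.Join(dir, sub))
		if err != nil || !info.IsDir() {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical cache path, modelled on the error message above.
	dir := "/tmp/g10k/git@git.example.com-PUP_puppet-fstab.git"
	if looksLikeBareRepo(dir) {
		fmt.Println("cache entry has a bare-repository layout:", dir)
	} else {
		fmt.Println("cache entry is missing or incomplete:", dir)
	}
}
```

A caller could use a check like this to decide whether to re-clone before even attempting remote update --prune.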
@alex-harvey-z3q
Author

Next time it happens I will make sure I save a copy of the corrupted cache.

@alex-harvey-z3q
Author

The issue may be that we are simply allowing g10k to fail and give up here:

g10k/helper.go

Lines 164 to 172 in de149b1

if err != nil {
	if !allowFail && !config.UseCacheFallback {
		Fatalf("executeCommand(): git command failed: " + command + " " + err.Error() + "\nOutput: " + string(out) +
			"\nIf you are using GitLab be sure that you added your deploy key to your repository")
	} else {
		er.returnCode = 1
		er.output = fmt.Sprint(err)
	}
}

Really, we can't allow this tool to ever just give up and fail in production - especially if the only problem is a corrupted Git cache. It should just delete the problematic cache and try again.

@alex-harvey-z3q
Author

alex-harvey-z3q commented Oct 4, 2017

An earlier instance of the output when this failed:

executeCommand(): git command failed: git --git-dir /tmp/g10k/git@git.example.com-PUP_puppet-jenkins.git remote update --prune exit status 1
Output: Fetching origin
error: refs/merge-requests/204/head does not point to a valid object!
error: refs/merge-requests/204/head does not point to a valid object!
error: refs/merge-requests/204/head does not point to a valid object!
error: Could not read 9665485307d333eb3faa0e1367c969f5a9adf4c1
error: refs/merge-requests/204/head does not point to a valid object!
From git.example.com:PUP/puppet-jenkins
   e5c1923..9513e8a  master     -> master
error: unable to find 9665485307d333eb3faa0e1367c969f5a9adf4c1
fatal: object 9665485307d333eb3faa0e1367c969f5a9adf4c1 not found
error: Could not fetch origin

@xorpaul
Owner

xorpaul commented Oct 4, 2017

Output: fatal: Not a git repository: '/tmp/g10k/git@git.example.com-PUP_puppet-fstab.git'

could mean that the initial git pull of this repository failed, but the current g10k version does not create the cache directory in that case. Maybe you are using an older version?

I tried this Puppetfile

 mod 'firewall',
 :git => 'https://gthub.com/puppetlabs/puppetlabs-firewall.git',
 :branch => 'master'

If only a handful of your Puppet modules are hosted on an unreliable Git server, you can set :ignore_unreachable directly on each of those modules:

 mod 'firewall',
 :git => 'https://github.com/puppetlabs/puppetlabs-firewall.git',
 :branch => 'master',
 :ignore_unreachable => 'true'

Or you can add a global setting in your g10k config to allow all your Git modules to fail and your g10k run to continue.
#57 (comment)

Really, we can't allow this tool to ever just give up and fail in production

I don't agree with this. In my setup I want g10k to fail if anything is unreachable, because I only sync the g10k-populated environments to my Puppetservers if g10k ran successfully. I'd rather have an older working Puppet environment than a corrupted, half-populated environment in production.

It should just delete the problematic cache and try again.

Checking the local git repository first, clearing it, and retrying could be a solution, but then how often should g10k try this? What should it do if the Git repository is completely unreachable?
I'd rather fail fast and let the user retry the g10k run.

What we can agree on is that the cached git repository should never be empty or corrupted.
It would greatly help if we could find out how it ended up corrupted and fix that.
Otherwise I could add a g10k config setting that always checks the git repository first with git fsck or something similar, then clears it and retries if the check fails.
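
A pre-check along those lines could look roughly like the Go sketch below. It is not g10k's implementation: the names fsckArgs and checkCache and the example path are made up for illustration. It assumes a git binary on the PATH and relies on git fsck exiting non-zero when it finds real damage such as missing objects; as far as I know, dangling objects alone are not reported as an error.

```go
package main

import (
	"fmt"
	"os/exec"
)

// fsckArgs builds the argument list for checking a bare cache repository.
func fsckArgs(gitDir string) []string {
	return []string{"--git-dir", gitDir, "fsck", "--full"}
}

// checkCache runs git fsck against gitDir and returns an error that
// includes git's output if the command could not run or exited non-zero.
func checkCache(gitDir string) error {
	out, err := exec.Command("git", fsckArgs(gitDir)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("git fsck failed for %s: %v\n%s", gitDir, err, out)
	}
	return nil
}

func main() {
	// Hypothetical cache path; replace it with a real entry from your cachedir.
	if err := checkCache("/tmp/g10k/example.git"); err != nil {
		fmt.Println("cache needs attention:", err)
		return
	}
	fmt.Println("cache looks consistent")
}
```

The trade-off is run time: fsck reads every object, so running it before each update adds work proportional to the size of the cached repository.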

@alex-harvey-z3q
Author

Yes. I think the basic principle is that the cached git repository should never be empty or corrupted, whereas I am seeing them corrupted quite often. I estimate g10k is called dozens of times per day across about 50 AWS accounts at my site, and I'm getting a corrupted cache maybe once a fortnight. I can confirm that each time I have seen the cache corrupted, g10k failed repeatedly until I deleted the cache, at which point it always succeeded.

I guess the next thing to do is for me to wait until this happens again, and make a copy of the corrupted cache.

I take it you're saying you haven't actually seen this before?

@alex-harvey-z3q
Author

alex-harvey-z3q commented Nov 2, 2017

I have an example of one of the problematic g10k caches saved now.

Here's the problem:

$ git --git-dir /var/tmp/g10k/git@git.example.com-FOO_puppet-tenant_profile.git remote update --prune
Fetching origin
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
Warning: Permanently added 'git.example.com,10.0.0.10' (RSA) to the list of known hosts.
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: Could not read 5a16bc06490229d809f3b217a8ad3b6db2054355
error: Could not read bdcd583320d464836a146f4d7122453bcb225069
remote: Counting objects: 2, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 2 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (2/2), done.
fatal: bad object 57a1d82e3d5ff67bb774fe40f5719323a64d6b03
error: git.example.com:FOO/puppet-tenant_profile.git did not send all necessary objects

error: Could not fetch origin

I'll see what else I can glean from this tarball.

@alex-harvey-z3q
Author

Reminder to me: I have this saved as a tarball at /var/tmp/tp.tgz on my laptop.

@alex-harvey-z3q
Author

Use of git fsck --full results in:

$ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (4343/4343), done.
error: refs/heads/feature/fitness6-jarfile: invalid sha1 pointer 57a1d82e3d5ff67bb774fe40f5719323a64d6b03
error: refs/heads/feature/talendmetadata: invalid sha1 pointer bdcd583320d464836a146f4d7122453bcb225069
error: refs/merge-requests/30/head: invalid sha1 pointer 57a1d82e3d5ff67bb774fe40f5719323a64d6b03
error: refs/merge-requests/31/head: invalid sha1 pointer bdcd583320d464836a146f4d7122453bcb225069
broken link from  commit 7c52cc078733ce8191867549e40b6870488cf6c8
              to    tree e219e5806d873ee8f21e37f731f84eb369ea03ee
broken link from  commit 7c52cc078733ce8191867549e40b6870488cf6c8
              to  commit 5a16bc06490229d809f3b217a8ad3b6db2054355
broken link from  commit 7c52cc078733ce8191867549e40b6870488cf6c8
              to  commit bdcd583320d464836a146f4d7122453bcb225069
broken link from    tree 0d2bff1d856e679e4e6176bf9adcb235d75abafe
              to    tree 1af132babe82801b3b1816e84cb5addc958e5956
broken link from    tree bb3b9aa85c640065f022f40f5ca92b437fbee9d6
              to    tree 0f58444d1b3ff07465f57c03300bfc3ea48536ba
broken link from    tree bb3b9aa85c640065f022f40f5ca92b437fbee9d6
              to    tree 3007b6433279e2e0ee15489077db6ae97b69efa4
broken link from    tree 01f5f38b3de8ba060ea7d4d9f5e49d2cfab3b186
              to    blob 7f430468dd7c3f22a00621b1a2f2ade211ca77ec
missing blob 7f430468dd7c3f22a00621b1a2f2ade211ca77ec
dangling commit eec3ce2ee3120133c1f95b6a9b960c3dc93f8452
missing tree 3007b6433279e2e0ee15489077db6ae97b69efa4
dangling commit cfa76e03fb97179326076300889b1d830404f5ce
missing commit bdcd583320d464836a146f4d7122453bcb225069
missing tree 1af132babe82801b3b1816e84cb5addc958e5956
dangling tag 2695e2f30947085baf77af21e16b3a01c4ee19cb
missing commit 5a16bc06490229d809f3b217a8ad3b6db2054355
missing tree 0f58444d1b3ff07465f57c03300bfc3ea48536ba
missing tree e219e5806d873ee8f21e37f731f84eb369ea03ee
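
To turn a report like this into a yes/no decision, the lines can be counted by their leading keyword. This is only a rough Go sketch under my own assumptions: the type fsckSummary and the function summarizeFsck are hypothetical names, the prefixes are taken from the output above rather than from any Git specification, and dangling objects are treated as harmless.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// fsckSummary counts fsck report lines by kind.
type fsckSummary struct {
	Missing, Dangling, BrokenLinks, Errors int
}

// Broken reports whether the output indicates real damage. Dangling objects
// are unreferenced but intact, so they are not counted as damage here.
func (s fsckSummary) Broken() bool {
	return s.Missing > 0 || s.BrokenLinks > 0 || s.Errors > 0
}

// summarizeFsck classifies each line of fsck output by its prefix.
// Continuation lines ("to tree ...") match no prefix and are skipped.
func summarizeFsck(output string) fsckSummary {
	var s fsckSummary
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "missing "):
			s.Missing++
		case strings.HasPrefix(line, "dangling "):
			s.Dangling++
		case strings.HasPrefix(line, "broken link from"):
			s.BrokenLinks++
		case strings.HasPrefix(line, "error:"):
			s.Errors++
		}
	}
	return s
}

func main() {
	// A few lines copied from the fsck output above.
	sample := `error: refs/merge-requests/30/head: invalid sha1 pointer 57a1d82e3d5ff67bb774fe40f5719323a64d6b03
broken link from  commit 7c52cc078733ce8191867549e40b6870488cf6c8
              to    tree e219e5806d873ee8f21e37f731f84eb369ea03ee
missing blob 7f430468dd7c3f22a00621b1a2f2ade211ca77ec
dangling commit eec3ce2ee3120133c1f95b6a9b960c3dc93f8452`
	s := summarizeFsck(sample)
	fmt.Printf("%+v broken=%v\n", s, s.Broken())
}
```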

@alex-harvey-z3q
Author

On the other hand if I clone the upstream repo again and run the fsck command:

$ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (2375/2375), done.

@alex-harvey-z3q
Author

This Stack Overflow post seems to describe the same problem for others:
https://stackoverflow.com/questions/30356012/git-gc-displays-error-could-not-read-commit

@alex-harvey-z3q
Author

Even after running the fsck command above, the remote update --prune command still fails.

@alex-harvey-z3q
Author

alex-harvey-z3q commented Nov 2, 2017

@xorpaul , I think if git remote update --prune returns a non-zero exit status, g10k should delete that clone, clone it again, and try again. Only if it still fails should it give up and abort. Otherwise, we just can't use this in production. Thoughts?
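
The control flow proposed here can be sketched in Go as follows. This is an illustration only, not g10k code: updateWithReset and errCorrupt are hypothetical names, and the update and reset steps are passed in as functions so that the example runs without Git. In a real tool, update would run the remote update --prune command and reset would delete the cached clone and clone it again.

```go
package main

import (
	"errors"
	"fmt"
)

// errCorrupt stands in for a failed git command in this simulation.
var errCorrupt = errors.New("simulated corrupted cache")

// updateWithReset tries update once. If that fails, it calls reset (for
// example: delete the cached clone and clone it again) and tries update a
// second time. It gives up after the second failure.
func updateWithReset(update, reset func() error) error {
	firstErr := update()
	if firstErr == nil {
		return nil
	}
	if err := reset(); err != nil {
		return fmt.Errorf("update failed (%v) and reset failed: %w", firstErr, err)
	}
	if err := update(); err != nil {
		return fmt.Errorf("update still failing after reset: %w", err)
	}
	return nil
}

func main() {
	// Simulated cache: corrupted until reset is called.
	corrupted := true
	update := func() error {
		if corrupted {
			return errCorrupt
		}
		return nil
	}
	reset := func() error {
		corrupted = false
		return nil
	}
	fmt.Println("result:", updateWithReset(update, reset))
}
```

Retrying exactly once keeps the fail-fast behaviour for a server that is genuinely unreachable, since the second attempt fails too and the error is returned.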

@alex-harvey-z3q
Author

alex-harvey-z3q commented Nov 2, 2017

@xorpaul Also, if you would like a tarball of the corrupted Git repo I saved, and a copy of the same repo after cloning a fresh copy, let me know where I can send it.

@xorpaul
Owner

xorpaul commented Nov 3, 2017

@alexharv074 Thanks for the debug info.

g10k just calls the git binary to clone and update the local Git repository. If the remote Git server is unable to respond appropriately or sends a corrupted state of the repository, then the only thing g10k can do is retry the checkout.

What Git server are you using? Is it running on a VM or hardware? You should open a ticket with the Git server project and include this information (cloning and updating multiple repositories at the same time is probably overloading the Git server, so that it sends invalid responses). Maybe you can adjust some settings (worker processes, web server processes) so that g10k doesn't overload your server.

I'll have a look at the git clone retry mechanism.

In the meantime you could try limiting the number of parallel checkouts and pulls with the -maxworker parameter.
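
For readers unfamiliar with the idea, limiting parallel work in Go is commonly done with a buffered channel used as a counting semaphore. The sketch below shows that general pattern only. I have not checked how g10k implements -maxworker, and runLimited is a hypothetical name. The sleeping jobs stand in for git clone or fetch calls.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited runs every job, with at most maxWorker jobs in flight at once,
// and returns the highest number of jobs it observed running concurrently.
func runLimited(maxWorker int, jobs []func()) int {
	if maxWorker < 1 {
		maxWorker = 1
	}
	sem := make(chan struct{}, maxWorker) // counting semaphore
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		running int
		peak    int
	)
	for _, job := range jobs {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxWorker jobs are already running
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot after the job has finished

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			job()

			mu.Lock()
			running--
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return peak
}

func main() {
	jobs := make([]func(), 10)
	for i := range jobs {
		// Stand-in for a git clone or fetch against the server.
		jobs[i] = func() { time.Sleep(10 * time.Millisecond) }
	}
	fmt.Println("peak concurrency:", runLimited(3, jobs))
}
```

A lower limit means fewer simultaneous connections to the Git server at the cost of a longer overall run.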

@alex-harvey-z3q
Author

@xorpaul

The Git server is GitLab 8.16.4, running on a RHEL 6 EC2 instance, and the Git client is version 1.7.1.

In any case, a clean & retry mechanism makes a lot of sense to me, whatever the root cause is here. Whether it's the Git server's fault, or whether it's just a random corruption of a cloned Git repo, I still would not expect the tool to give up in production if the problem is that it has corrupted data in its cache.

Not sure how hard it is to implement the feature I proposed of course. I would send a PR if only I knew Golang.

@xorpaul
Owner

xorpaul commented Nov 7, 2017

Try out the new v0.4 release:

https://github.com/xorpaul/g10k/releases/tag/v0.4

You can limit the number of Goroutines with the -maxextractworker parameter or with the maxextractworker: <INT> g10k config setting.

xorpaul added a commit that referenced this issue Nov 8, 2017
@xorpaul
Owner

xorpaul commented Nov 8, 2017

Now you can also retry failed Git commands with 0.4.1

https://github.com/xorpaul/g10k/releases/tag/v0.4.1

Use either the -retrygitcommands CLI parameter or the retry_git_commands g10k config setting.

  • You can let g10k retry the git clone or update of the local repository if a previous attempt failed and left it in a corrupted state:
---
:cachedir: '/tmp/g10k'
retry_git_commands: true

sources:
  example:
    remote: 'https://github.com/xorpaul/g10k-environment.git'
    basedir: '/tmp/example/'

If you then call g10k with this config file and have a corrupted local Git repository, g10k deletes the local cache and retries the Git clone command once:

WARN: git command failed: git --git-dir /tmp/g10k/modules/https-__github.com_puppetlabs_puppetlabs-firewall.git remote update --prune deleting local cached repository and retrying...

@alex-harvey-z3q
Author

alex-harvey-z3q commented Nov 9, 2017

Hi @xorpaul

Thanks very much for implementing the feature.

However, it does not seem to be working in the expected way:

-bash-4.1$ g10k -version
g10k Version 0.4 Build time: 2017-11-08_15:22:31 UTC
-bash-4.1$ g10k -puppetfile -retrygitcommands
executeCommand(): git command failed: git --git-dir /tmp/g10k/git@git.example.com-FOO_puppet-packer.git remote update --prune exit status 128
Output: fatal: Not a git repository: '/tmp/g10k/git@git.example.com-FOO_puppet-packer.git'

If you are using GitLab be sure that you added your deploy key to your repository
-bash-4.1$ ls -ld /tmp/g10k/git\@git.example.com-FOO_puppet-packer.git/
drwxrwxr-x. 7 jenkins jenkins 140 Oct 27 08:41 /tmp/g10k/git@git.example.com-FOO_puppet-packer.git/

@xorpaul
Owner

xorpaul commented Nov 9, 2017

Hi @alexharv074,

ah, sorry forgot to add the new CLI parameter to the Puppetfile mode.
Fixed.
e323b65

Please try:
https://github.com/xorpaul/g10k/releases/tag/v0.4.2

$ ./g10k -puppetfile -verbose -retrygitcommands
2017/11/09 11:00:34 Executing git clone --mirror https://github.com/puppetlabs/puppetlabs-apache.git /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git took 2.70532s
2017/11/09 11:00:34 Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git rev-parse --verify 'master' took 0.00243s
Need to sync .//modules/apache/
2017/11/09 11:00:34 syncToModuleDir(): Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git archive master took 0.04980s
2017/11/09 11:00:34 Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git rev-parse --verify 'master' took 0.00224s
Synced ./Puppetfile with 1 git repositories and 0 Forge modules in 2.8s with git (2.7s sync, I/O 0.0s) and Forge (0.0s query+download, I/O 0.0s) using 50 resolv and 20 extract workers
$ rm -rf /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git/*
$ rm modules/apache/
rm: cannot remove 'modules/apache/': Is a directory
$ ./g10k -puppetfile -verbose -retrygitcommands
2017/11/09 11:00:47 Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git remote update --prune took 0.00189s
WARN: git repository https://github.com/puppetlabs/puppetlabs-apache.git does not exist or is unreachable at this moment!
WARN: git command failed: git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git remote update --prune deleting local cached repository and retrying...
2017/11/09 11:00:49 Executing git clone --mirror https://github.com/puppetlabs/puppetlabs-apache.git /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git took 2.46786s
2017/11/09 11:00:49 Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git rev-parse --verify 'master' took 0.00244s
Synced ./Puppetfile with 1 git repositories and 0 Forge modules in 2.5s with git (2.5s sync, I/O 0.0s) and Forge (0.0s query+download, I/O 0.0s) using 50 resolv and 20 extract workers

@alex-harvey-z3q
Author

@xorpaul

I am very happy to say it's working!

Before:

-bash-4.1$ g10k -puppetfile 
Resolving Git modules (34/52) 2.303s [===========================================>------------------------]  65%
executeCommand(): git command failed: git --git-dir /tmp/g10k/git@git.example.com-FOO_puppet-tenant_profile.git remote update --prune exit status 1
Output: Fetching origin
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: Could not read 5a16bc06490229d809f3b217a8ad3b6db2054355
error: Could not read bdcd583320d464836a146f4d7122453bcb225069
error: unable to find 57a1d82e3d5ff67bb774fe40f5719323a64d6b03
fatal: object 57a1d82e3d5ff67bb774fe40f5719323a64d6b03 not found
error: Could not fetch origin
 
If you are using GitLab be sure that you added your deploy key to your repository

Install new version:

[ec2-user@jenkins ~]$ g10k -version
g10k version 0.4.2 Build time: 2017-11-08_16:01:31 UTC

After:

-bash-4.1$ g10k -puppetfile -retrygitcommands
Resolving Git modules (43/52) 3.605s [=======================================================>------------]  83%
WARN: git repository git@git.example.com:FOO/puppet-tenant_profile.git does not exist or is unreachable at this moment!
Resolving Git modules (52/52) 4.542s [====================================================================] 100%
Synced ./Puppetfile with 52 git repositories and 0 Forge modules in 7.9s with git (7.6s sync, I/O 0.0s) and Forge (0.0s query+download, I/O 0.0s) using 50 resolv and 20 extract workers

And corrupted cache and all, it still copied 52 modules in 7.9 seconds!

Thanks so much Andreas, there will be many happy customers at my site, and best of all, I feel confident to roll out g10k at my next place!

@xorpaul
Owner

xorpaul commented Nov 9, 2017

Glad I could help!

Did you censor your output?
You should've gotten a warning like:

WARN: git command failed: git --git-dir /tmp/g10k/git@git.example.com:FOO/puppet-tenant_profile.git remote update --prune deleting local cached repository and retrying...

@alex-harvey-z3q
Author

No, I did redact sensitive information using search & replace to update the Git server address, and site-specific info in the Git URL, but the output I showed is otherwise unchanged.

To be honest, I was about to see if I could send in a pull request to improve the wording of the error message, but sounds like maybe it's still not behaving the way you expected it to?

@xorpaul
Owner

xorpaul commented Nov 9, 2017

g10k should print a warning that the git command failed and that it retries the git clone command:

https://github.com/xorpaul/g10k/blob/master/git.go#L117


Maybe the progress bar from the default verbosity level is the reason this line was skipped for you.

Can you retry using the -info verbosity level?

@alex-harvey-z3q
Author

You are correct:

-bash-4.1$ g10k -puppetfile -info -retrygitcommands
WARN: git repository git@git.example.com:FOO/puppet-tenant_profile.git does not exist or is unreachable at this moment!
WARN: git command failed: git --git-dir /tmp/g10k/git@git.example.com-FOO_puppet-tenant_profile.git remote update --prune deleting local cached repository and retrying...
Synced ./Puppetfile with 52 git repositories and 0 Forge modules in 7.7s with git (7.4s sync, I/O 0.0s) and Forge (0.0s query+download, I/O 0.0s) using 50 resolv and 20 extract workers

@xorpaul
Owner

xorpaul commented Nov 9, 2017

Alright then.

@xorpaul xorpaul closed this as completed Nov 9, 2017
@xorpaul
Owner

xorpaul commented Nov 9, 2017

I'll update the output in the next release so that only the retrying line is printed when -retrygitcommands is set.
