Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang in Fetch after Clone while converting to mirror #356

Closed
josharian opened this issue Dec 3, 2016 · 7 comments
Closed

Hang in Fetch after Clone while converting to mirror #356

josharian opened this issue Dec 3, 2016 · 7 comments
Assignees

Comments

@josharian
Copy link
Contributor

The following program hangs most times I run it (but not always):

package main

import (
	"log"

	git "gopkg.in/libgit2/git2go.v24"
)

// mirror does a clone --mirror of url into dest.
func mirror(url, dest string) (*git.Repository, error) {
	log.Printf("starting clone of %s into %s", url, dest)
	repo, err := git.Clone(url, dest, &git.CloneOptions{Bare: true})
	if err != nil {
		log.Printf("clone of %s into %s failed: %v", url, dest, err)
		return nil, err
	}
	log.Printf("finished clone of %s into %s", url, dest)

	cfg, err := repo.Config()
	if err != nil {
		return nil, err
	}
	remotes, err := repo.Remotes.List()
	if err != nil || len(remotes) == 0 || remotes[0] != "origin" {
		return nil, err
	}

	log.Printf("converting remote origin to mirror")
	cfg.SetBool("remote.origin.mirror", true)
	cfg.SetString("remote.origin.fetch", "+refs/*:refs/*")
	remote, err := repo.Remotes.Lookup("origin")
	if err != nil {
		return nil, err
	}
	log.Printf("starting mirror fetch of %s into %s", url, dest)
	err = remote.Fetch(nil, &git.FetchOptions{}, "")
	if err != nil {
		log.Printf("mirror fetch of %s into %s failed: %v", url, dest, err)
		return nil, err
	}
	log.Printf("finished mirror fetch of %s into %s", url, dest)

	return repo, nil
}

func main() {
	dest := "/tmp/review_clone_go"
	url := "https://go.googlesource.com/review"
	_, err := mirror(url, dest)
	if err != nil {
		log.Printf("fail: %v", err)
		return
	}
	log.Print("success")
}

It hangs during the call to remote.Fetch. Sample run, killed manually by SIGQUIT when it hung:

$ go run bug.go 
2016/12/02 16:22:39 starting clone of https://go.googlesource.com/review into /tmp/review_clone_go
2016/12/02 16:22:41 finished clone of https://go.googlesource.com/review into /tmp/review_clone_go
2016/12/02 16:22:41 converting remote origin to mirror
2016/12/02 16:22:41 starting mirror fetch of https://go.googlesource.com/review into /tmp/review_clone_go
^\SIGQUIT: quit
PC=0x7fffc1f20f4e m=0
signal arrived during cgo execution

goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x40f7e80, 0xc420051be8, 0xc400000000)
	/Users/josh/go/1.7/src/runtime/cgocall.go:131 +0x110 fp=0xc420051bb8 sp=0xc420051b78
gopkg.in/libgit2/git2go%2ev24._Cfunc_git_remote_fetch(0x4612b40, 0x0, 0x4616a20, 0x0, 0x0)
	??:0 +0x4d fp=0xc420051be8 sp=0xc420051bb8
gopkg.in/libgit2/git2go%2ev24.(*Remote).Fetch(0xc420014320, 0x0, 0x0, 0x0, 0xc42001a200, 0x0, 0x0, 0x0, 0x0)
	/Users/josh/src/gopkg.in/libgit2/git2go.v24/remote.go:655 +0x308 fp=0xc420051c78 sp=0xc420051be8
main.mirror(0x41486f0, 0x22, 0x41456c5, 0x14, 0x0, 0x0, 0x0)
	/Users/josh/src/foggy.co/gopath/bug.go:38 +0x7ef fp=0xc420051ed0 sp=0xc420051c78
main.main()
	/Users/josh/src/foggy.co/gopath/bug.go:51 +0x4f fp=0xc420051f48 sp=0xc420051ed0
runtime.main()
	/Users/josh/go/1.7/src/runtime/proc.go:183 +0x1f4 fp=0xc420051fa0 sp=0xc420051f48
runtime.goexit()
	/Users/josh/go/1.7/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc420051fa8 sp=0xc420051fa0

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
	/Users/josh/go/1.7/src/runtime/asm_amd64.s:2086 +0x1

rax    0x4
rbx    0x0
rcx    0x7fff5fbff018
rdx    0x7fff5fbff0a0
rdi    0x7
rsi    0x7fff5fbff120
rbp    0x7fff5fbff1d0
rsp    0x7fff5fbff018
r8     0x0
r9     0x80ee
r10    0x7fff5fbff020
r11    0x283
r12    0x7fff5fbff1e8
r13    0xf32b
r14    0x5
r15    0x480a400
rip    0x7fffc1f20f4e
rflags 0x283
cs     0x7
fs     0x0
gs     0x0
exit status 2

Reproduced with both Go 1.7.4 and Go 1.8 beta 1.

Doing some horrible manual symbol resolution, I managed to tease out this backtrace:

* thread #1: tid = 0x54e8c, 0x00007fffc1f20f4e, stop reason = signal SIGSTOP
  * frame #0: 0x00007fffc1f20f4e
    frame #1: 0x0000000004368fa9 libgit2.24.dylib`curls_read + 41
    frame #2: 0x00000000043bfe2b libgit2.24.dylib`read_cb + 40
    frame #3: 0x00007fffb2ec541f
    frame #4: 0x00007fffb2e56f07
    frame #5: 0x00007fffb2e5a856
    frame #6: 0x00000000043c00bb libgit2.24.dylib`stransport_read + 21
    frame #7: 0x0000000004392cf9 libgit2.24.dylib`recv_stream + 30
    frame #8: 0x00000000043cd9a9 libgit2.24.dylib`http_stream_read + 438
    frame #9: 0x00000000043cf617 libgit2.24.dylib`git_smart__recv_cb + 53
    frame #10: 0x00000000043d168d libgit2.24.dylib`recv_pkt + 94
    frame #11: 0x00000000043d1808 libgit2.24.dylib`git_smart__download_pack + 313
    frame #12: 0x00000000043b0af0 libgit2.24.dylib`git_remote_download + 556
    frame #13: 0x00000000043b0d74 libgit2.24.dylib`git_remote_fetch + 155
    frame #14: 0x00000000040fb719
    frame #15: 0x000000000404ed20
    frame #16: 0x00000000040ebecd
    frame #17: 0x00000000040f43cc
    frame #18: 0x00000000040f1db2
    frame #19: 0x00000000040f69be
    frame #20: 0x00000000040f6df0
    frame #21: 0x00000000040f6f34
    frame #22: 0x000000000402af6a

Frames 14 to 22 are all in Go world and are uninteresting. I believe that the pcs in the 0x00007fff... range correspond to assertion failures.

libgit2 itself does not appear to be the problem. The following equivalent C program always works successfully.

#include <git2.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
	git_libgit2_init();
	git_repository *repo = NULL;

	const char *url = "https://go.googlesource.com/review";
	const char *dest = "/tmp/review_clone_c";

	git_clone_options clone_opts;
	git_clone_init_options(&clone_opts, GIT_CLONE_OPTIONS_VERSION);
	clone_opts.bare = 1;
	int err = git_clone(&repo, url, dest, &clone_opts);
	if (err != 0) {
		printf("clone err: %d\n", err);
		return 1;
	}

	git_remote *remote = NULL;
	err = git_remote_lookup(&remote, repo, "origin");
	if (err != 0) {
		printf("remote lookup err: %d\n", err);
		return 1;
	}

	git_config *cfg;
	err = git_repository_config(&cfg, repo);
	if (err != 0) {
		printf("config err: %d\n", err);
		return 1;
	}

	err = git_config_set_bool(cfg, "remote.origin.mirror", 1);
	if (err != 0) {
		printf("config bool err: %d\n", err);
		return 1;
	}
	err = git_config_set_string(cfg, "remote.origin.fetch", "+refs/*:refs/*");
	if (err != 0) {
		printf("config str err: %d\n", err);
		return 1;
	}

	git_fetch_options fetch_opts;
	err = git_fetch_init_options(&fetch_opts, GIT_FETCH_OPTIONS_VERSION);
	if (err != 0) {
		printf("fetch options init err: %d\n", err);
		return 1;
	}

	err = git_remote_fetch(remote, NULL, &fetch_opts, NULL);
	if (err != 0) {
		printf("fetch err: %d\n", err);
		return 1;
	}
	printf("success\n");
	return 0;
}

This smells like memory corruption due to a race somewhere, but I don't see where/how. Running under the Go race detector yields no complaints. I'm running darwin, not linux, so I can't test with -msan.

@josharian
Copy link
Contributor Author

Ah. If I let it run long enough, I eventually get curl error: Failed to get recent socket. Looking at dtruss output, I think the problem might indeed be that the clone happens on one system thread and the fetch happens on a different one.

I see a slew of runtime.LockOSThread calls throughout libgit2, but each only lasts as long as an individual git2go function/method call. That means that the Go runtime is free to switch from thread to thread between calls, which is indeed very likely to happen for cgo calls. Looking at the dtruss output (https://gist.github.com/josharian/c8b990eec873f245e30b6f29bfcc9245), it appears that in fact different threads are in use.

I wonder whether a different thread-locking strategy is required. An example alternative is for a Repository to spin up a goroutine that does nothing but read from a chan func() and execute the functions that it gets passed; that goroutine could be locked to an OS thread. That would mean that Repository.Free is no longer optional.

Or maybe I have mis-diagnosed. In any case, this is a blocker for me to be able to use git2go. Please let me know what I can do to help.

@josharian
Copy link
Contributor Author

More data.

I've instrumented libgit2 (now using next / tip) and spied using Wireshark.

The clone succeeds. The fetch begins, and transfers a bunch of data. Then, for no reason I can see, the TCP traffic ceases and QUIC traffic begins. At this point, libgit2 is blocked in a select statement in wait_for, called from curls_read (all in curl_stream.c). Then, about 4 minutes later, the server sends a TCP FIN ACK. The select completes and curls_read returns an error.

None of this explains why the C code succeeds, but this no longer really looks to me like a threading bug.

@carlosmn carlosmn self-assigned this Dec 5, 2016
@carlosmn
Copy link
Member

carlosmn commented Dec 6, 2016

I've been unable to reproduce this locally, but FWIW you don't need the second fetch, we have a callback to override the remote creation precisely for this use-case. The library will call your code instead of its own if you have this override and then you can do whatever during the initial remote creation, like git's own git clone --mirror does.

I've reworked the code from the issue description to a test that does it in a single step to test things out. While it doesn't fix the issue per se, it would avoid triggering it and it avoids the no-transfer network connection.

func createMirrorRemote(repo *Repository, defaultName, url string) (*Remote, ErrorCode) {
	remote, err := repo.Remotes.CreateWithFetchspec(defaultName, url, "+refs/*:refs/*")
	if err != nil {
		return nil, ErrGeneric
	}

	cfg, err := repo.Config()
	if err != nil {
		return nil, ErrGeneric
	}

	err = cfg.SetBool(fmt.Sprintf("remote.%s.mirror", defaultName), true)
	if err != nil {
		return nil, ErrGeneric
	}

	return remote, ErrOk
}

func TestGoogleClone(t *testing.T) {
	t.Parallel()
	url := "https://go.googlesource.com/review"
	dest, err := ioutil.TempDir("/Users/carlos/tmp", "review")
	if err != nil {
		t.Fatal(err)
	}
	t.Log("starting clone of", url, "into", dest, "\n")
	repo, err := Clone(url, dest, &CloneOptions{
		Bare:                 true,
		RemoteCreateCallback: createMirrorRemote,
	})
	if err != nil {
		t.Fatal(err)
	}
	t.Log("finished mirror-clone of", url, "into", dest, repo.Path())
}

@carlosmn
Copy link
Member

As it happens, running a test against go tip https://travis-ci.org/libgit2/git2go/jobs/186415531 (which will become 1.8) errors out in fetch, so it does seem likely we're messing up somehow, though I still don't know exactly where.

@carlosmn
Copy link
Member

carlosmn commented Jul 4, 2017

Coming back to this after a while... I can sometimes reproduce this hang with just the tests we have in the codebase. Sometimes it outright crashes. Via printf-debugging I see in the crashes that the finalizer is running for the Remote object, even though we're in the middle of one of its functions.

It looks like the much more aggressive garbage collector in 1.8 does not consider the receiver to be keeping it alive. It does look exactly like the docs for runtime.KeepAlive show.

@josharian
Copy link
Contributor Author

I'm on vacation for the next few weeks, but that seems plausible.

@carlosmn
Copy link
Member

carlosmn commented Sep 3, 2017

I'm going to consider this closed as of #393, do reopen if it's still broken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants