Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent write of nodes referencing the same node fails with Dgraph v1.1 #4079

Closed
jostillmanns opened this issue Sep 27, 2019 · 11 comments · Fixed by #4454
Closed

Concurrent write of nodes referencing the same node fails with Dgraph v1.1 #4079

jostillmanns opened this issue Sep 27, 2019 · 11 comments · Fixed by #4454
Assignees
Labels
area/mutations Related to mutations JSON or RDF. kind/question Something requiring a response. status/accepted We accept to investigate/work on it.

Comments

@jostillmanns
Copy link

jostillmanns commented Sep 27, 2019

What version of Dgraph are you using?

v1.1

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

16Gb, Linux 4.19.67-1-lts

Steps to reproduce the issue (command/config used to run Dgraph).

run test

package main

import (
	"context"
	"encoding/json"
	"strconv"
	"testing"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"google.golang.org/grpc"
)

func Test_it_writes_concurrently(t *testing.T) {
	conn, err := grpc.Dial(dgraphAddr(), grpc.WithInsecure())
	if err != nil {
		t.Fatalf("conntect: %v", err)
	}

	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
	defer dg.Alter(context.Background(), &api.Operation{DropAll: true})

	schema := `
type node {
  name: string
  child: uid
}

name: string .
child: uid .
`
	err = dg.Alter(context.Background(), &api.Operation{Schema: schema})
	require.NoError(t, err)

	type node struct {
		ID   string `json:"uid"`
		Name string `json:"name"`

		Child *node `json:"child"`
	}

	child := node{ID: "_:blank-0", Name: "child"}
	js, err := json.Marshal(child)
	require.NoError(t, err)

	res, err := dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
	require.NoError(t, err)

	in := []node{}
	for i := 0; i < 2; i++ {
		in = append(in, node{ID: "_:blank-0", Name: strconv.Itoa(i + 1), Child: &node{ID: res.GetUids()["blank-0"]}})
	}

	errChan := make(chan error)
	for i := range in {
		go func(n node) {
			js, err := json.Marshal(n)
			require.NoError(t, err)

			_, err = dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
			errChan <- err
		}(in[i])
	}

	errs := []error{}
	for i := 0; i < len(in); i++ {
		errs = append(errs, <-errChan)
	}

	for _, e := range errs {
		assert.NoError(t, e)
	}
}

Expected behaviour and actual result.

Expected behaviour: no error and two nodes are created
Actual behaviour: One mutate request returns Transaction has been aborted. Please retry

Problem description

I can reproduce this to be working with Dgraph 1.0.16. If the nodes that are written concurrently don't reference the same child node everything works. Hence I guess the issue is with the child node reference.

@campoy campoy added area/mutations Related to mutations JSON or RDF. kind/bug Something is broken. status/accepted We accept to investigate/work on it. priority/P1 Serious issue that requires eventual attention (can wait a bit) kind/question Something requiring a response. and removed kind/bug Something is broken. priority/P1 Serious issue that requires eventual attention (can wait a bit) labels Sep 27, 2019
@campoy
Copy link
Contributor

campoy commented Sep 27, 2019

This is a very interesting issue, thanks for the report.

temporary workaround

For now, you can build a workaround to this by simply retrying, so your for loop would look like this:

errChan := make(chan error)
for i := range in {
	go func(n node) {
		js, err := json.Marshal(n)
		require.NoError(t, err)

		for  {
			_, err = dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
			if err != dgo.ErrAborted {
				break
			}
		}
		errChan <- err
	}(in[i])
}

This, unfortunately requires the next version of dgo which has not been released yet.
If you want to try that with v2.0.0 you could simply check for the string error.

Instead of err != dgo.ErrAborted you should write err == nil || !strings.Contains(err.Error(), "aborted").

Is this a bug?

Regardless of the workaround it is worth talking about whether one of the transactions should indeed fail.

I tested the same code using dgraph@v1.0.16 and dgo@v1.0.0 and I was able to reproduce the error. So no, it's not a regression.

But still, the question remains.

Should this transaction be aborted since it refers to the same ID or is this something that should be accepted?

While they both refer to the same uid (the one you create first in child), they do not really modify any predicate coming out of it.
But I think rejecting it is the expected behavior, since one of the transaction could be deleting the child while the other one adds one more reference to it.

I'm not 100% sure, so I think @manishrjain is the one that should answer this one.

@jostillmanns
Copy link
Author

I think concurrent http requests which end up in a mutation referencing the same node is verly likely, hence I have to disagree on this beeing correct behavior.

@jostillmanns
Copy link
Author

jostillmanns commented Oct 1, 2019

Also your workaround isn't really feasible for me, as some of my mutations share a transaction with other requests. Implementing this workaround would require a huge effort on my side, as I build an abstraction layer around Dgraph that build up these transactions.

@manishrjain
Copy link
Contributor

I’m trying to understand the cause of txn abort, but I don’t see the schema here. Can you please paste the schema you’re using?

But, in general, I think any transactional system can result in aborts. Even if we were to figure out that somehow our txn system is a bit too strict and can be made lenient, aborts won’t go away. So, a way to retry is unavoidable.

Perhaps, a mechanism could be built by Dgraph doing the retry internally if a user asks for it, if CommitNow is set to true.

@jostillmanns
Copy link
Author

I updated the code to also write the schema.

@Kingloko
Copy link

Kingloko commented Oct 1, 2019

I'm having similar issues,

Perhaps, a mechanism could be built by Dgraph doing the retry internally if a user asks for it, if CommitNow is set to true.

I think that would be a great feature.

@manishrjain
Copy link
Contributor

manishrjain commented Oct 9, 2019

So, looking at the conflict generation code in v1.0.16, for how it was treating uids:

dgraph/posting/list.go

Lines 293 to 303 in 0590ee9

} else if x.Parse(l.key).IsData() {
// Unless upsert is specified, we don't check for index conflicts, only
// data conflicts.
// If the data is of type UID, then we use SPO for conflict detection.
// Otherwise, we use SP (for string, date, int, etc.).
typ, err := schema.State().TypeOf(t.Attr)
if err != nil {
glog.V(2).Infof("Unable to find type of attr: %s. Err: %v", t.Attr, err)
// Don't check for conflict.
} else if typ == types.UidID {
conflictKey = getKey(l.key, t.ValueId)

And with my changes: 693e7db24 , which consolidates the logic for conflict detection in list types:

dgraph/posting/list.go

Lines 460 to 476 in 7e74f43

case pk.IsData() && schema.State().IsList(t.Attr):
// Data keys, irrespective of whether they are UID or values, should be judged based on
// whether they are lists or not. For UID, t.ValueId = UID. For value, t.ValueId =
// fingerprint(value) or could be fingerprint(lang) or something else.
//
// For singular uid predicate, like partner: uid // no list.
// a -> b
// a -> c
// Run concurrently, only one of them should succeed.
// But for friend: [uid], both should succeed.
//
// Similarly, name: string
// a -> "x"
// a -> "y"
// This should definitely have a conflict.
// But, if name: [string], then they can both succeed.
conflictKey = getKey(l.key, t.ValueId)

The behavior is the same. If the (sub, pred, obj) where obj = UID, is the same, there would be conflict. That hasn't changed between the two versions. We do this to ensure that if a user has facets on that edge, they don't get overwritten. OR, if one op sets an object, while the other op deletes the object, we only apply one of those and reject the other. Seems like a logical approach.

I have three potential solutions here:

  1. Perhaps, we could add a NO_ABORT flag in a transaction, which would skip the conflict key generation and just write. That would simplify your operations, but at the cost of perhaps some data overwrites. The good thing is, you could control this at the transaction level.

  2. Alternatively, we could do this at the predicate level, where one could specify in the schema if this predicate should not be involved in conflict generation. For examples, predicates which are not crucial to the correctness of the data, like say names, or aliases, etc. Those could be marked in the schema as relaxed and we can just avoid doing conflict detection on those. Might be a better approach than the first one.

  3. Third approach would be to retry the mutation at the server level.

@Kingloko
Copy link

@manishrjain, I can't speak for @jostillmanns but all three of these approaches would work in my use case. Options 1 & 2 would be my preference, but I think option 3 would also solve my problem.

@erhlee-bird
Copy link

erhlee-bird commented Oct 10, 2019

@manishrjain, thanks for sharing the deeper insights here.

I would agree and say that all 3 of those options sound extremely appealing to me 😄.

@Kingloko
Copy link

What is the status of this issue? If this is low priority for you guys I'd appreciate if you can point me in the direction to fix it myself. This particular issue is a huge priority for my team.

@jostillmanns
Copy link
Author

Hi @manishrjain. Sorry it took me so long to write back on this issue. It took us another incident regarding this issue to look back at this 😉 . We discussed this again and would like to advocate for solution 1, as we have knowledge about the cases when overwrites would be okay.

Example: we have scheduled processes, which write data in parallel and we can make sure that there are no conflicts on the client level. On the other hand we accept HTTP requests, where the current behavior (NO_ABORT = false) is desirable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/mutations Related to mutations JSON or RDF. kind/question Something requiring a response. status/accepted We accept to investigate/work on it.
Development

Successfully merging a pull request may close this issue.

5 participants