Concurrent write of nodes referencing the same node fails with Dgraph v1.1 #4079

jostillmanns · 2019-09-27T08:09:38Z

What version of Dgraph are you using?

v1.1

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

16Gb, Linux 4.19.67-1-lts

Steps to reproduce the issue (command/config used to run Dgraph).

run test

package main

import (
	"context"
	"encoding/json"
	"strconv"
	"testing"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"google.golang.org/grpc"
)

func Test_it_writes_concurrently(t *testing.T) {
	conn, err := grpc.Dial(dgraphAddr(), grpc.WithInsecure())
	if err != nil {
		t.Fatalf("conntect: %v", err)
	}

	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
	defer dg.Alter(context.Background(), &api.Operation{DropAll: true})

	schema := `
type node {
  name: string
  child: uid
}

name: string .
child: uid .
`
	err = dg.Alter(context.Background(), &api.Operation{Schema: schema})
	require.NoError(t, err)

	type node struct {
		ID   string `json:"uid"`
		Name string `json:"name"`

		Child *node `json:"child"`
	}

	child := node{ID: "_:blank-0", Name: "child"}
	js, err := json.Marshal(child)
	require.NoError(t, err)

	res, err := dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
	require.NoError(t, err)

	in := []node{}
	for i := 0; i < 2; i++ {
		in = append(in, node{ID: "_:blank-0", Name: strconv.Itoa(i + 1), Child: &node{ID: res.GetUids()["blank-0"]}})
	}

	errChan := make(chan error)
	for i := range in {
		go func(n node) {
			js, err := json.Marshal(n)
			require.NoError(t, err)

			_, err = dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
			errChan <- err
		}(in[i])
	}

	errs := []error{}
	for i := 0; i < len(in); i++ {
		errs = append(errs, <-errChan)
	}

	for _, e := range errs {
		assert.NoError(t, e)
	}
}

Expected behaviour and actual result.

Expected behaviour: no error and two nodes are created
Actual behaviour: One mutate request returns Transaction has been aborted. Please retry

Problem description

I can reproduce this to be working with Dgraph 1.0.16. If the nodes that are written concurrently don't reference the same child node everything works. Hence I guess the issue is with the child node reference.

The text was updated successfully, but these errors were encountered:

campoy · 2019-09-27T22:16:25Z

This is a very interesting issue, thanks for the report.

temporary workaround

For now, you can build a workaround to this by simply retrying, so your for loop would look like this:

errChan := make(chan error)
for i := range in {
	go func(n node) {
		js, err := json.Marshal(n)
		require.NoError(t, err)

		for  {
			_, err = dg.NewTxn().Mutate(context.Background(), &api.Mutation{SetJson: js, CommitNow: true})
			if err != dgo.ErrAborted {
				break
			}
		}
		errChan <- err
	}(in[i])
}

This, unfortunately requires the next version of dgo which has not been released yet.
If you want to try that with v2.0.0 you could simply check for the string error.

Instead of err != dgo.ErrAborted you should write err == nil || !strings.Contains(err.Error(), "aborted").

Is this a bug?

Regardless of the workaround it is worth talking about whether one of the transactions should indeed fail.

I tested the same code using dgraph@v1.0.16 and dgo@v1.0.0 and I was able to reproduce the error. So no, it's not a regression.

But still, the question remains.

Should this transaction be aborted since it refers to the same ID or is this something that should be accepted?

While they both refer to the same uid (the one you create first in child), they do not really modify any predicate coming out of it.
But I think rejecting it is the expected behavior, since one of the transaction could be deleting the child while the other one adds one more reference to it.

I'm not 100% sure, so I think @manishrjain is the one that should answer this one.

jostillmanns · 2019-09-28T10:40:11Z

I think concurrent http requests which end up in a mutation referencing the same node is verly likely, hence I have to disagree on this beeing correct behavior.

jostillmanns · 2019-10-01T01:22:12Z

Also your workaround isn't really feasible for me, as some of my mutations share a transaction with other requests. Implementing this workaround would require a huge effort on my side, as I build an abstraction layer around Dgraph that build up these transactions.

manishrjain · 2019-10-01T03:27:18Z

I’m trying to understand the cause of txn abort, but I don’t see the schema here. Can you please paste the schema you’re using?

But, in general, I think any transactional system can result in aborts. Even if we were to figure out that somehow our txn system is a bit too strict and can be made lenient, aborts won’t go away. So, a way to retry is unavoidable.

Perhaps, a mechanism could be built by Dgraph doing the retry internally if a user asks for it, if CommitNow is set to true.

jostillmanns · 2019-10-01T03:49:56Z

I updated the code to also write the schema.

Kingloko · 2019-10-01T11:52:40Z

I'm having similar issues,

Perhaps, a mechanism could be built by Dgraph doing the retry internally if a user asks for it, if CommitNow is set to true.

I think that would be a great feature.

manishrjain · 2019-10-09T21:42:31Z

So, looking at the conflict generation code in v1.0.16, for how it was treating uids:

dgraph/posting/list.go

Lines 293 to 303 in 0590ee9

    
           } else if x.Parse(l.key).IsData() { 
        
           	// Unless upsert is specified, we don't check for index conflicts, only 
        
           	// data conflicts. 
        
           	// If the data is of type UID, then we use SPO for conflict detection. 
        
           	// Otherwise, we use SP (for string, date, int, etc.). 
        
           	typ, err := schema.State().TypeOf(t.Attr) 
        
           	if err != nil { 
        
           		glog.V(2).Infof("Unable to find type of attr: %s. Err: %v", t.Attr, err) 
        
           		// Don't check for conflict. 
        
           	} else if typ == types.UidID { 
        
           		conflictKey = getKey(l.key, t.ValueId)

And with my changes: 693e7db24 , which consolidates the logic for conflict detection in list types:

dgraph/posting/list.go

Lines 460 to 476 in 7e74f43

    
           case pk.IsData() && schema.State().IsList(t.Attr): 
        
           	// Data keys, irrespective of whether they are UID or values, should be judged based on 
        
           	// whether they are lists or not. For UID, t.ValueId = UID. For value, t.ValueId = 
        
           	// fingerprint(value) or could be fingerprint(lang) or something else. 
        
           	// 
        
           	// For singular uid predicate, like partner: uid // no list. 
        
           	// a -> b 
        
           	// a -> c 
        
           	// Run concurrently, only one of them should succeed. 
        
           	// But for friend: [uid], both should succeed. 
        
           	// 
        
           	// Similarly, name: string 
        
           	// a -> "x" 
        
           	// a -> "y" 
        
           	// This should definitely have a conflict. 
        
           	// But, if name: [string], then they can both succeed. 
        
           	conflictKey = getKey(l.key, t.ValueId)

The behavior is the same. If the (sub, pred, obj) where obj = UID, is the same, there would be conflict. That hasn't changed between the two versions. We do this to ensure that if a user has facets on that edge, they don't get overwritten. OR, if one op sets an object, while the other op deletes the object, we only apply one of those and reject the other. Seems like a logical approach.

I have three potential solutions here:

Perhaps, we could add a NO_ABORT flag in a transaction, which would skip the conflict key generation and just write. That would simplify your operations, but at the cost of perhaps some data overwrites. The good thing is, you could control this at the transaction level.
Alternatively, we could do this at the predicate level, where one could specify in the schema if this predicate should not be involved in conflict generation. For examples, predicates which are not crucial to the correctness of the data, like say names, or aliases, etc. Those could be marked in the schema as relaxed and we can just avoid doing conflict detection on those. Might be a better approach than the first one.
Third approach would be to retry the mutation at the server level.

Kingloko · 2019-10-10T12:22:32Z

@manishrjain, I can't speak for @jostillmanns but all three of these approaches would work in my use case. Options 1 & 2 would be my preference, but I think option 3 would also solve my problem.

erhlee-bird · 2019-10-10T13:05:47Z

@manishrjain, thanks for sharing the deeper insights here.

I would agree and say that all 3 of those options sound extremely appealing to me 😄.

Kingloko · 2019-10-17T17:58:13Z

What is the status of this issue? If this is low priority for you guys I'd appreciate if you can point me in the direction to fix it myself. This particular issue is a huge priority for my team.

jostillmanns · 2019-11-18T11:19:41Z

Hi @manishrjain. Sorry it took me so long to write back on this issue. It took us another incident regarding this issue to look back at this 😉 . We discussed this again and would like to advocate for solution 1, as we have knowledge about the cases when overwrites would be okay.

Example: we have scheduled processes, which write data in parallel and we can make sure that there are no conflicts on the client level. On the other hand we accept HTTP requests, where the current behavior (NO_ABORT = false) is desirable.

campoy assigned manishrjain Sep 27, 2019

arijitAD mentioned this issue Dec 20, 2019

Adds @noconflict directive to prevent conflict detection. #4454

Merged

arijitAD closed this as completed in #4454 Dec 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Concurrent write of nodes referencing the same node fails with Dgraph v1.1 #4079

Concurrent write of nodes referencing the same node fails with Dgraph v1.1 #4079

jostillmanns commented Sep 27, 2019 •

edited

Loading

campoy commented Sep 27, 2019 •

edited

Loading

jostillmanns commented Sep 28, 2019

jostillmanns commented Oct 1, 2019 •

edited

Loading

manishrjain commented Oct 1, 2019

jostillmanns commented Oct 1, 2019

Kingloko commented Oct 1, 2019

manishrjain commented Oct 9, 2019 •

edited

Loading

Kingloko commented Oct 10, 2019

erhlee-bird commented Oct 10, 2019 •

edited

Loading

Kingloko commented Oct 17, 2019

jostillmanns commented Nov 18, 2019

Concurrent write of nodes referencing the same node fails with Dgraph v1.1 #4079

Concurrent write of nodes referencing the same node fails with Dgraph v1.1 #4079

Comments

jostillmanns commented Sep 27, 2019 • edited Loading

What version of Dgraph are you using?

Have you tried reproducing the issue with the latest release?

What is the hardware spec (RAM, OS)?

Steps to reproduce the issue (command/config used to run Dgraph).

Expected behaviour and actual result.

Problem description

campoy commented Sep 27, 2019 • edited Loading

temporary workaround

Is this a bug?

jostillmanns commented Sep 28, 2019

jostillmanns commented Oct 1, 2019 • edited Loading

manishrjain commented Oct 1, 2019

jostillmanns commented Oct 1, 2019

Kingloko commented Oct 1, 2019

manishrjain commented Oct 9, 2019 • edited Loading

Kingloko commented Oct 10, 2019

erhlee-bird commented Oct 10, 2019 • edited Loading

Kingloko commented Oct 17, 2019

jostillmanns commented Nov 18, 2019

jostillmanns commented Sep 27, 2019 •

edited

Loading

campoy commented Sep 27, 2019 •

edited

Loading

jostillmanns commented Oct 1, 2019 •

edited

Loading

manishrjain commented Oct 9, 2019 •

edited

Loading

erhlee-bird commented Oct 10, 2019 •

edited

Loading