Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store strings unescaped #1508

Closed
niemeyer opened this issue Sep 22, 2017 · 8 comments
Closed

Store strings unescaped #1508

niemeyer opened this issue Sep 22, 2017 · 8 comments
Assignees
Labels
kind/bug Something is broken.
Milestone

Comments

@niemeyer
Copy link

niemeyer commented Sep 22, 2017

On an rdf file produced by the internal export routine itself, bulkloader breaks if there were embedded newlines:

2017/09/21 21:47:14 Invalid end of input. Input: [<_:uid14c43> <svalue> "Foo]
while parsing line "<_:uid14c43> <svalue> \"Foo"
main.(*mapper).parseRDF
        /home/travis/gopath/src/github.com/dgraph-io/dgraph/cmd/bulkloader/mapper.go:149

Exported file content holds:

<_:uid14c43> <svalue> "Foo^M
Bar"^^<xs:string> 
@peterstace
Copy link

Thanks for reporting. This will break dgraphloader as well. It's really a bug in the exporter, new lines within strings should be escaped as "\n".

@peterstace peterstace changed the title bulkloader breaks on embedded new lines Exporter should escape newlines in RDF output Sep 22, 2017
@peterstace peterstace added the kind/bug Something is broken. label Sep 22, 2017
@peterstace peterstace added this to the v0.8.3 milestone Sep 22, 2017
@peterstace
Copy link

peterstace commented Sep 22, 2017

@niemeyer do you have a minimum working example? I'm having trouble reproducing this.

This is what I've tried (on a fresh instance of Dgraph):

First add some data containing a newline:

mutation {
  schema {
    pred: string @index(term) .
  }
  set {
    _:aoeu <pred> "Line One\nLine Two" .
  }
}
{
  q(func: allofterms(pred, "Two")) {
    pred
  }
}

Perform the export:

$ curl localhost:8080/admin/export

RDF file seems to be ok:

$ zcat play/dg/export/dgraph-1-2017-09-22-13-38.rdf.gz 
<_:uid1> <pred> "Line One\nLine Two"^^<xs:string> .

Loads ok with bulkloader as well:

$ bulkloader -r dgraph-1-2017-09-22-13-38.rdf.gz -s dgraph-schema-1-2017-09-22-13-38.rdf
{
        "RDFDir": "dgraph-1-2017-09-22-13-38.rdf.gz",
        "SchemaFile": "dgraph-schema-1-2017-09-22-13-38.rdf",
        "DgraphsDir": "out",
        "LeaseFile": "LEASE",
        "TmpDir": "tmp",
        "NumGoroutines": 4,
        "MapBufSize": 134217728,
        "SkipExpandEdges": false,
        "BlockRate": 0,
        "SkipMapPhase": false,
        "CleanupTmp": true,
        "NumShufflers": 1,
        "Version": false,
        "MapShards": 1,
        "ReduceShards": 1
}
REDUCE 00s [100.00%] edge_count:6.000 edge_speed:6.000/sec plist_count:6.000 plist_speed:6.000/sec
Total: 00s

@peterstace
Copy link

The problem can be reproduced when using the Go client.

@peterstace
Copy link

peterstace commented Sep 22, 2017

So this is the situation:

  • Dgraph currently expects all string inputs (via HTTP, via Go client, other clients etc) to be escaped. E.g. If the value intended is "A\tB" (i.e the ASCII sequence 65, 9, 66), then the client should send "A\\t" (i.e. the ASCII sequence 65, 92, 116, 66).
  • The Go client doesn't do this (but it's trivial to add). So the Go client can be 'fixed' by passing all values through val = strings.Replace(val, "\n", "\\n", -1).
  • Dgraph stores all of its string data internally in escaped form (e.g. as the byte sequence 65, 92, 116, 66).
  • When the export is performed, RDFs are generated by printing a quote, then the stored string, and then another quote (which is why the RDF export file is broken when the RDF originated from the Go client).

The quick and easy way to fix this is to escape newline in the Go client.

Ideally though, I think it would be better to transmit to/from and store the strings inside Dgraph in the regular way (e.g. the byte sequence 65, 9, 66). Under that proposal, the only places where escaping/unescaping would have to occur is when parsing RDFs and formating RDFs for the export. That way, when new clients are written the authors won't have to worry about any custom escaping rules. These authors could be third party community authors, so it would be nice not to surprise them.

@niemeyer
Copy link
Author

This sounds related to #1484 as well. It's very error prone to have a SetStringValue function whose parameter needs to be escaped in various cases and also certain predefined strings avoided to not get data corruption. It should really be up to the driver to make sure whatever was passed in is what's actually stored and that it roundtrips back into the same string when consumed via any of the available means.

@peterstace
Copy link

Summary of discussion with @manishrjain and @janardhan1993 :

  • We're going to modify dgraph to store strings unescaped. E.g. as byte sequence 65, 9, 66 for "A\tB".
  • Need to convert to strings using strconv.Unquote during RDF parsing.
  • Need to convert back to quoted/escaped strings using strconv.Quote during RDF export.

@peterstace
Copy link

PR: #1587

@peterstace peterstace changed the title Exporter should escape newlines in RDF output Store strings unescaped Oct 5, 2017
@peterstace
Copy link

Merged to master in 9772a86

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something is broken.
Development

No branches or pull requests

3 participants