Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add (*Reader).CopyNext(w) (int64, error) #167

Merged
merged 6 commits into from
Dec 13, 2016

Conversation

pwaller
Copy link
Contributor

@pwaller pwaller commented Nov 11, 2016

It is useful to be able to efficiently copy objects without
decoding them.

My use case is filtering when I already know the indices of
the objects I want to keep, and for rewriting a dictionary
of objects as a column of objects.

This commit:

  1. Exports GetNextSize.
  2. Adds a method ReadNext(p) to *Reader.

I wasn't sure about exporting GetNextSize, but I can't see the
harm in it, and it may be useful for finer-grained control in
the copying.

I also experimented with a NextReader() io.Reader method which
returned io.LimitReader(m.R, GetNextSize()), but for my use case
this was twice as slow and also did not handle nested objects.
In principle that might be useful for very large simple objects,
but is not included in this PR.


This change is Reviewable

@philhofer
Copy link
Member

I think the errors returned from ReadNext need to implement msgp.Error to preserve that API contract. Beyond that, 👍

@philhofer
Copy link
Member

Actually, as I squint harder at this, it's not clear that ReadNext does what you want it to do. The returned size value from getNextSize for an object encoded in a fixmap, for instance, is 1 byte; you don't get the size of the actual object.

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

Ah. Any ideas on how to fix this or otherwise do it correctly?

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

My implementation was inspired by Skip(), so, does skip have that problem or does it avoid that somehow?

@philhofer
Copy link
Member

Skip operates recursively. You'd have to do the same thing.

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

Oh, crap, I have not pushed the latest commit. Oops.

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

Force pushed.

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

Will fix Error return, too.

@philhofer
Copy link
Member

Ok cool; this looks correct.

I might suggest ReadNext(w io.Writer) (int, error), though. That may make the API easier to use (e.g. buf := bytes.NewBuffer(...); r.ReadNext(buf); ...)

@philhofer
Copy link
Member

Also, given the non-obvious nature of GetNextSize, perhaps it shouldn't be exported as-is.

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

Nice idea for ReadNext would suit me well. How about CopyNext as a name? ReadNext as a thing taking a writer seems confusing.

For GetNextSize I agree, though I may re-propose that later if I find I do need it.

Still fixing the error contract.

@philhofer
Copy link
Member

CopyNext sounds good to me.

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

  1. On second thought, Copy/io.Writer-style rather than Read means that the caller loses control of the buffer which is not good for allocation unless we do something ala io.CopyBuffer. Can we reconsider ReadNext as it is? Thoughts?
  2. I pushed Error interface just now.
  3. I am also unsure about whether the additional size check is necessary, do you have an opinion?
    n, err := m.R.ReadFull(p[:sz])
    if err != nil {
        return 0, err
    }
    if uintptr(n) != sz {
        return 0, fmt.Errorf("wrong # bytes read (%d != %d)", n, int64(sz))
    }

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

(If (3) is necessary, does it want to be yet another named error in errors.go, or something else?)

Thanks for the rapid review by the way.

@philhofer
Copy link
Member

The caller doesn't necessarily lose control of the buffer. bytes.NewBuffer(buf) will do the 'right' thing in the case where your buffer is large enough; you just need to call Reset() before copying the next object.

The size check should be unnecessary; it's a bug if ReadFull returns without error and doesn't fill the buffer.

@philhofer
Copy link
Member

(The Copy API also doesn't require that you hold the object in memory, either; it could be used to proxy objects or sub-objects of arbitrary size.)

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

The caller doesn't necessarily lose control of the buffer.

But in order to call the w.Write from inside CopyNext(w), I need to call ReadFull with a byte slice. From whence does this byte slice come?

@philhofer
Copy link
Member

Yeah; you'd need to refactor the code a bit. (*Reader).Next() gives you a pointer directly into the reader's buffer.

@philhofer
Copy link
Member

philhofer commented Nov 11, 2016

Also, *fwd.Reader implements WriteTo, so io.CopyN() shouldn't allocate a buffer.
EDIT: Never mind, io.CopyN doesn't use io.WriterTo. But still, if it were a bytes.Buffer, for instance, you'd still get the ReadFrom optimization.

It is useful to be able to efficiently copy objects without
decoding them.

My use case is filtering when I already know the indices of
the objects I want to keep, and for rewriting a dictionary
of objects as a column of objects.
@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

That was the bit I was missing. That is freaking awesome:

--- a/msgp/read.go
+++ b/msgp/read.go
@@ -180,6 +180,32 @@ func (m *Reader) ReadNext(p []byte) (int, error) {
    return tot, nil
 }

+// CopyNext reads the next object from m without decoding it and writes it to w.
+// It avoids unnecessary copies internally.
+func (m *Reader) CopyNext(w io.Writer) (int64, error) {
+   sz, o, err := getNextSize(m.R)
+   if err != nil {
+       return 0, err
+   }
+
+   // avoids allocating because m.R implements WriteTo.
+   n, err := io.CopyN(w, m.R, int64(sz))
+   if err != nil {
+       return 0, err
+   }
+
+   // for maps and slices, read elements
+   for x := uintptr(0); x < o; x++ {
+       var n2 int64
+       n2, err = m.CopyNext(w)
+       if err != nil {
+           return n, err
+       }
+       n += n2
+   }
+   return n, nil
+}
+
 // ReadFull implements `io.ReadFull`
 func (m *Reader) ReadFull(p []byte) (int, error) {
    return m.R.ReadFull(p)

I have fully squashed my patches and force pushed.

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

Sorry, please hold, I forgot to remove ReadNext correctly.

Also, do we need to revisit in light of your strikeout in the above comment?

@philhofer
Copy link
Member

I would use (*fwd.Reader).Next(), and then fall back to CopyN when that fails. The reasoning here is that very frequently, Next() will work, and it will only fail when you ask for more bytes than can fit in the read buffer, in which case the cost of CopyN will amortize well, even if it does have to allocate a buffer.

@pwaller
Copy link
Contributor Author

pwaller commented Nov 11, 2016

Pushed. My current status: happy. Taking a break for a bit.

I learned some neat tricks tonight, thanks for that 😎

@pwaller pwaller changed the title Export GetNextSize and implement ReadNext Add (*Reader).CopyNext(w) (int64, error) Nov 11, 2016
@philhofer
Copy link
Member

@pwaller hey, did you want to finish this off? If not, I'm happy to do so myself.

@pwaller
Copy link
Contributor Author

pwaller commented Dec 13, 2016

AFAIK, I think it's done? I've ticked "Allow edits from maintainers" if you would like to make any additional changes.

- only call (*Reader).Next() when we're sure it won't realloc its buffer
- promote io.ErrUnexpectedEOF to msgp.ErrShortBytes
@philhofer
Copy link
Member

Cool. Pushed a small change in the error handling to make it a little more consistent with the rest of the code.

Thanks again!

@philhofer philhofer merged commit d7c86ea into tinylib:master Dec 13, 2016
@pwaller
Copy link
Contributor Author

pwaller commented Dec 13, 2016

👍

@pwaller pwaller deleted the read-next-object branch December 13, 2016 19:52
gobwas pushed a commit to gobwas/msgp that referenced this pull request Nov 20, 2017
* Add (*Reader).CopyNext(w) (int64, error)

It is useful to be able to efficiently copy objects without
decoding them.

My use case is filtering when I already know the indices of
the objects I want to keep, and for rewriting a dictionary
of objects as a column of objects.

* Remove ReadNext

* Add opportunistic optimization with m.R.Next

* Remove unused ReadNextError

* Remove commented code

* small fixup

- only call (*Reader).Next() when we're sure it won't realloc its buffer
- promote io.ErrUnexpectedEOF to msgp.ErrShortBytes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants