
test: fix flaky test-http2-client-upload #29889

Closed

Conversation

@Flarna (Member) commented Oct 8, 2019

Wait for the close event on the server stream before shutting down the server and
client, to avoid races seen on the Windows CI.

Actually, I'm not sure if this just hides a bug; see #20750 (comment) for more details.
I still haven't transformed into an http2/nghttp2/net expert, and no one else has picked this up since I last commented, so I decided to open a PR to at least help CI.

Refs: #20750 (comment)
Refs: #29852

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines

@nodejs-github-bot nodejs-github-bot added the test Issues and PRs related to the tests. label Oct 8, 2019
@addaleax addaleax added the http2 Issues or PRs related to the http2 subsystem. label Oct 8, 2019
@addaleax (Member) left a comment


I’m not sure if this is helpful feedback, but I agree that this looks more like a real bug in the HTTP/2 code.

(Review comment on test/parallel/test-http2-client-upload.js; outdated, resolved.)
@bcoe (Contributor) commented Oct 8, 2019

@Flarna thank you for taking on this work.

@Trott, @jasnell, what do you think? we could perhaps add a TODO comment, referencing someone familiar with the http2 library, to dig further into this issue. But land this work to get Windows tests more stable?

@Trott (Member) commented Oct 9, 2019

@Flarna thank you for taking on this work.

@Trott, @jasnell, what do you think? we could perhaps add a TODO comment, referencing someone familiar with the http2 library, to dig further into this issue. But land this work to get Windows tests more stable?

If anyone on @nodejs/http2 chimes in, I'll be likely to defer to them, but if not, my opinion would be: If we have code that causes the bug to happen reliably, we could add that code as a test to the known_issues directory and have this version in parallel.

@Flarna (Member, Author) commented Oct 9, 2019

Back in July, when I looked into this test, I failed to create a 100% reproducer, most likely because I hadn't fully understood the actual root cause.
I thought I was on a good track once I found that the client sends two GOAWAY frames. Sometimes the second one is enqueued before the first one has actually been passed down to net, and some stream/connection state regarding socket disconnect that is changed by the second one has side effects on the first. But it turned out that the same happens on Linux, where the test is not failing...

@ronag (Member) commented Oct 9, 2019

I haven't looked much into http2 apart from the http2 compat parts, but here is my take on this:

Following stream semantics, the existing test is not entirely correct.

We cannot destroy the client before 'close' and expect no further errors. 'finish'/'end' does not mean that the client is done, just that it is done sending/receiving data to/from the user. There might still be relevant teardown logic that needs to run for a graceful completion. Therefore, we should not kill the client/server before the client has emitted 'close'.

I think the test should be updated with:

@@ -35,7 +35,7 @@ fs.readFile(loc, common.mustCall((err, data) => {
   server.listen(0, common.mustCall(() => {
     const client = http2.connect(`http://localhost:${server.address().port}`);
 
-    const countdown = new Countdown(2, () => {
+    const countdown = new Countdown(3, () => {
       server.close();
       client.close();
     });
@@ -47,6 +47,7 @@ fs.readFile(loc, common.mustCall((err, data) => {
     req.on('end', common.mustCall());
 
     req.on('finish', () => countdown.dec());
+    req.on('close', () => countdown.dec());
     const str = fs.createReadStream(loc);
     str.on('end', common.mustCall());

It would be interesting to see whether it is still flaky after that.

I.e., this PR is on the right track, but on the wrong side.

@Flarna (Member, Author) commented Oct 10, 2019

Tried the change proposed by @ronag, but it doesn't help. The 'close' event on req is always emitted before 'close' on the file stream, so the shutdown timing of client/server doesn't change.

@ronag (Member) commented Oct 10, 2019

@Flarna: What about something like:

const countdown = new Countdown(3, () => {
   client.on('close', () => server.close()); // I'm not sure if this event exists :/
   client.close();
});

Maybe the client shuts down before the server, causing an ECONNRESET.

@Flarna (Member, Author) commented Oct 10, 2019

@ronag tried it, but it doesn't help.
The problem is not triggered by a server shutdown; it's caused by races in the client (see #20750 (comment)).

@ronag (Member) commented Oct 10, 2019

Then maybe the problem lies in close and kMaybeDestroy:

  close(callback) {
    if (this.closed || this.destroyed)
      return;
    debugSessionObj(this, 'marking session closed');
    this[kState].flags |= SESSION_FLAGS_CLOSED;
    if (typeof callback === 'function')
      this.once('close', callback);
    this.goaway();
    this[kMaybeDestroy]();
  }

  // Destroy the session if:
  // * error is not undefined/null
  // * session is closed and there are no more pending or open streams
  [kMaybeDestroy](error) {
    if (error == null) {
      const state = this[kState];
      // Do not destroy if we're not closed and there are pending/open streams
      if (!this.closed ||
          state.streams.size > 0 ||
          state.pendingStreams.size > 0) {
        return;
      }
    }
    this.destroy(error);
  }

Based on my interpretation of @Flarna's debugging, it seems that the return condition in kMaybeDestroy does not take into account whether the GOAWAY frame has been sent/flushed.

Also (possibly unrelated), [kMaybeDestroy] does not seem to take into account whether the session has connected or not. Not quite sure about the implications of that.

@Trott (Member) commented Oct 10, 2019

@nodejs/http2 The code in #29889 (comment) is from lib/internal/http2/core.js. Thoughts on @ronag's comment there?

@ronag (Member) commented Oct 10, 2019

@Flarna My last guess. But maybe? Will try and see if I can sort out a Windows workstation to reproduce.

--- a/src/node_http2.cc
+++ b/src/node_http2.cc
@@ -2576,6 +2576,7 @@ void Http2Session::Goaway(uint32_t code,
   Debug(this, "submitting goaway");
   nghttp2_submit_goaway(session_, NGHTTP2_FLAG_NONE,
                         lastStreamID, code, data, len);
+  SendPendingData();
 }

@addaleax (Member) commented Oct 10, 2019

Based on my interpretation of @Flarna's debugging it seems that the return condition in kMaybeDestroy does not take into account whether the go away frame has been sent/flushed or not.

If that is the case, we should probably wait for all outgoing writes to finish.

There’s also this bit:

// If the socket is alive, use setImmediate to destroy the session on the
// next iteration of the event loop in order to give data time to transmit.
// Otherwise, destroy immediately.
if (!socket.destroyed)
  setImmediate(finishSessionDestroy, this, error);
else
  finishSessionDestroy(this, error);

Which seems invalid – there is no reason to believe that setImmediate() is giving us the right timing in terms of waiting for output data to be sent. So, yes, this looks like it’s definitely a bug, be it the one making the test flaky or not.

@Flarna (Member, Author) commented Oct 10, 2019

I will try that in a free slot during the next days.
I remember I already played around in that area without success, but it's too long ago to remember the details. I was also confused that actually two GOAWAY frames are sent by the client in this use case, and that sometimes enqueuing the second one changes some setting from the first one.
But it turned out that the same happens on Linux, and the test is stable there...

@ronag (Member) commented Oct 10, 2019

I'm a little confused as to why that goaway call is needed there, since Http2Session::Close does nghttp2_session_terminate_session.

@sogaani commented Oct 11, 2019

I could reproduce the same error 100% by making the session.destroy method just call socket.destroy and reading data slowly.
The major difference between the original session.destroy and just calling socket.destroy is that the original tries to send pending data.
The following code eventually calls Http2Session::SendPendingData.

// Destroy the handle if it exists at this point
if (handle !== undefined)
  handle.destroy(code, socket.destroyed);

I guess that Http2Session::SendPendingData can fail, and that makes the tests flaky.

The code to reproduce:

'use strict';

const common = require('../common');
const fixtures = require('../common/fixtures');
const http2 = require('http2');
const fs = require('fs');
const net = require('net');

const tmpdir = require('../common/tmpdir');
tmpdir.refresh();
const loc = fixtures.path('person-large.jpg');

const server = http2.createServer();

server.on('stream', common.mustCall((stream) => {
  let sum = 0;
  stream.pause();
  const slowRead = () => {
    setTimeout(() => {
      const data = stream.read(stream._readableState.highWaterMark / 10);
      sum += data ? data.length : 0;
      console.log('read:' + sum + ' soc:' + socket.bytesWritten);
      slowRead();
    }, 10);
  };
  slowRead();
  stream.respond();
  stream.end();
}));

let socket;
server.listen(0, common.mustCall(() => {
  const options = {
    createConnection: (authority, options) => {
      socket = net.connect(server.address().port, 'localhost');
      return socket;
    }
  };
  const client = http2.connect(`http://localhost:${server.address().port}`, options);
  const req = client.request({ ':method': 'POST' });

  req.on('response', common.mustCall());
  req.resume();

  req.on('close', common.mustCall(() => {
    server.close();
    client.close();
  }));
  const origin_destroy = client.destroy;
  client.destroy = (err) => {
    console.log('session destroy called');
    setImmediate(() => {
      socket.destroy(); // this causes a server-side error because it destroys the socket before the goaway frame is sent
      // origin_destroy.call(client, err);
    });
  };
  const str = fs.createReadStream(loc);
  str.on('end', common.mustCall());
  str.pipe(req);
}));

Edit: Fixed the fixtures path as per #29889 (comment). Thanks @Trott.

@Trott (Member) commented Oct 11, 2019

const loc = fixtures.path('large.pdf');

If anyone else wants to copy/paste the replication above, change this line to use 'person-large.jpg' rather than 'large.pdf' if you don't have a large PDF file lying around to put in the fixtures directory.

The replication code works (as in it shows the problem) on macOS. So if it's a proper replication of the issue, it demonstrates that the problem is cross-platform.

@ronag (Member) commented Oct 11, 2019

@Trott That test no longer fails for me after #29889 (comment)

@ronag (Member) commented Oct 11, 2019

If that is the case, we should probably wait for all outgoing writes to finish.

Since "goaway" is kind of a meta message, I don't think it's considered a "pending" write in the stream. We've got SendPendingData on the C++ side, but I don't see any way to get a callback from the JS side that ensures the "goaway" frame has been sent.

Side note. I know very little about http2 core so please take everything I write with a grain of salt.

@Flarna (Member, Author) commented Oct 11, 2019

I can reproduce it even if I add SendPendingData() in Http2Session::Goaway().

@Flarna (Member, Author) commented Oct 11, 2019

@addaleax Replacing the setImmediate above with a setTimeout(2000) avoids the problem; most likely not a fix for production 😁

@sogaani commented Oct 15, 2019

Maybe I found the difference between TCP on Windows and TCP on Linux that causes the error on Windows when SendPendingData() is called.
I just walked through the source code, so there could be some misunderstandings.

  • Difference
    On Linux, if writing data to a socket fails, libuv retries it.
    On Windows, it doesn't retry.

  • Linux
    SendPendingData eventually calls uv__write(). If uv__write() fails to write data with a transient error (ENOBUFS, EAGAIN, EINTR, ...), it queues the write request and will call uv__write() again.

    static void uv__write(uv_stream_t* stream) {
      struct iovec* iov;
      QUEUE* q;
      uv_write_t* req;
      int iovmax;
      int iovcnt;
      ssize_t n;
      int err;

    start:
      assert(uv__stream_fd(stream) >= 0);

      if (QUEUE_EMPTY(&stream->write_queue))
        return;

      q = QUEUE_HEAD(&stream->write_queue);
      req = QUEUE_DATA(q, uv_write_t, queue);
      assert(req->handle == stream);

      /*
       * Cast to iovec. We had to have our own uv_buf_t instead of iovec
       * because Windows's WSABUF is not an iovec.
       */
      assert(sizeof(uv_buf_t) == sizeof(struct iovec));
      iov = (struct iovec*) &(req->bufs[req->write_index]);
      iovcnt = req->nbufs - req->write_index;

      iovmax = uv__getiovmax();

      /* Limit iov count to avoid EINVALs from writev() */
      if (iovcnt > iovmax)
        iovcnt = iovmax;

      /*
       * Now do the actual writev. Note that we've been updating the pointers
       * inside the iov each time we write. So there is no need to offset it.
       */
      if (req->send_handle) {
        int fd_to_send;
        struct msghdr msg;
        struct cmsghdr *cmsg;
        union {
          char data[64];
          struct cmsghdr alias;
        } scratch;

        if (uv__is_closing(req->send_handle)) {
          err = UV_EBADF;
          goto error;
        }

        fd_to_send = uv__handle_fd((uv_handle_t*) req->send_handle);

        memset(&scratch, 0, sizeof(scratch));

        assert(fd_to_send >= 0);

        msg.msg_name = NULL;
        msg.msg_namelen = 0;
        msg.msg_iov = iov;
        msg.msg_iovlen = iovcnt;
        msg.msg_flags = 0;

        msg.msg_control = &scratch.alias;
        msg.msg_controllen = CMSG_SPACE(sizeof(fd_to_send));

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(fd_to_send));

        /* silence aliasing warning */
        {
          void* pv = CMSG_DATA(cmsg);
          int* pi = pv;
          *pi = fd_to_send;
        }

        do
          n = sendmsg(uv__stream_fd(stream), &msg, 0);
        while (n == -1 && RETRY_ON_WRITE_ERROR(errno));

        /* Ensure the handle isn't sent again in case this is a partial write. */
        if (n >= 0)
          req->send_handle = NULL;
      } else {
        do
          n = uv__writev(uv__stream_fd(stream), iov, iovcnt);
        while (n == -1 && RETRY_ON_WRITE_ERROR(errno));
      }

      if (n == -1 && !IS_TRANSIENT_WRITE_ERROR(errno, req->send_handle)) {
        err = UV__ERR(errno);
        goto error;
      }

      if (n >= 0 && uv__write_req_update(stream, req, n)) {
        uv__write_req_finish(req);
        return;  /* TODO(bnoordhuis) Start trying to write the next request. */
      }

      /* If this is a blocking stream, try again. */
      if (stream->flags & UV_HANDLE_BLOCKING_WRITES)
        goto start;

      /* We're not done. */
      uv__io_start(stream->loop, &stream->io_watcher, POLLOUT);

      /* Notify select() thread about state change */
      uv__stream_osx_interrupt_select(stream);

      return;

    error:
      req->error = err;
      uv__write_req_finish(req);
      uv__io_stop(stream->loop, &stream->io_watcher, POLLOUT);
      if (!uv__io_active(&stream->io_watcher, POLLIN))
        uv__handle_stop(stream);
      uv__stream_osx_interrupt_select(stream);
    }

  • Windows
    SendPendingData() seems to expect that uv__tcp_try_write() returns UV_EAGAIN or UV_ENOSYS when it will be able to retry writing data. But uv__tcp_try_write() could seemingly return other retryable errors (WSAEINTR maps to UV_ECANCELED, WSAENOBUFS maps to UV_ENOBUFS).

    node/deps/uv/src/win/tcp.c, lines 915 to 936 in df0e2e3

    int uv__tcp_try_write(uv_tcp_t* handle,
                          const uv_buf_t bufs[],
                          unsigned int nbufs) {
      int result;
      DWORD bytes;

      if (handle->stream.conn.write_reqs_pending > 0)
        return UV_EAGAIN;

      result = WSASend(handle->socket,
                       (WSABUF*) bufs,
                       nbufs,
                       &bytes,
                       0,
                       NULL,
                       NULL);

      if (result == SOCKET_ERROR)
        return uv_translate_sys_error(WSAGetLastError());
      else
        return bytes;
    }

    If uv__tcp_try_write() fails and returns UV_EAGAIN or UV_ENOSYS, SendPendingData() calls uv_tcp_write() through uv_write2. uv_tcp_write() does not have a retry algorithm, although libuv has one on Linux.

    node/deps/uv/src/win/tcp.c, lines 848 to 912 in df0e2e3

    int uv_tcp_write(uv_loop_t* loop,
                     uv_write_t* req,
                     uv_tcp_t* handle,
                     const uv_buf_t bufs[],
                     unsigned int nbufs,
                     uv_write_cb cb) {
      int result;
      DWORD bytes;

      UV_REQ_INIT(req, UV_WRITE);
      req->handle = (uv_stream_t*) handle;
      req->cb = cb;

      /* Prepare the overlapped structure. */
      memset(&(req->u.io.overlapped), 0, sizeof(req->u.io.overlapped));
      if (handle->flags & UV_HANDLE_EMULATE_IOCP) {
        req->event_handle = CreateEvent(NULL, 0, 0, NULL);
        if (!req->event_handle) {
          uv_fatal_error(GetLastError(), "CreateEvent");
        }
        req->u.io.overlapped.hEvent = (HANDLE) ((ULONG_PTR) req->event_handle | 1);
        req->wait_handle = INVALID_HANDLE_VALUE;
      }

      result = WSASend(handle->socket,
                       (WSABUF*) bufs,
                       nbufs,
                       &bytes,
                       0,
                       &req->u.io.overlapped,
                       NULL);

      if (UV_SUCCEEDED_WITHOUT_IOCP(result == 0)) {
        /* Request completed immediately. */
        req->u.io.queued_bytes = 0;
        handle->reqs_pending++;
        handle->stream.conn.write_reqs_pending++;
        REGISTER_HANDLE_REQ(loop, handle, req);
        uv_insert_pending_req(loop, (uv_req_t*) req);
      } else if (UV_SUCCEEDED_WITH_IOCP(result == 0)) {
        /* Request queued by the kernel. */
        req->u.io.queued_bytes = uv__count_bufs(bufs, nbufs);
        handle->reqs_pending++;
        handle->stream.conn.write_reqs_pending++;
        REGISTER_HANDLE_REQ(loop, handle, req);
        handle->write_queue_size += req->u.io.queued_bytes;
        if (handle->flags & UV_HANDLE_EMULATE_IOCP &&
            !RegisterWaitForSingleObject(&req->wait_handle,
                req->event_handle, post_write_completion, (void*) req,
                INFINITE, WT_EXECUTEINWAITTHREAD | WT_EXECUTEONLYONCE)) {
          SET_REQ_ERROR(req, GetLastError());
          uv_insert_pending_req(loop, (uv_req_t*)req);
        }
      } else {
        /* Send failed due to an error, report it later */
        req->u.io.queued_bytes = 0;
        handle->reqs_pending++;
        handle->stream.conn.write_reqs_pending++;
        REGISTER_HANDLE_REQ(loop, handle, req);
        SET_REQ_ERROR(req, WSAGetLastError());
        uv_insert_pending_req(loop, (uv_req_t*) req);
      }

      return 0;
    }

So, on Windows, SendPendingData() could fail when either of the two errors above happens. On Linux, however, SendPendingData() retries writing data until it succeeds, as long as there is no critical error.

@Trott (Member) commented Oct 15, 2019

@nodejs/libuv

@sogaani

This comment has been minimized.

@addaleax (Member) commented
I wondered why the error happens with http2 but not with http, if the error depends on writing data over TCP.

Both protocols use TCP and uv_try_write() + uv_write().

I found that net.Socket.write() calls uv_fs_write(), which works as blocking IO on a worker thread, rather than uv_write*(), which works as non-blocking IO.

net.Socket.write() does not call uv_fs_write(), at least not on Node’s side.

@sogaani commented Oct 16, 2019

Both protocols use TCP and uv_try_write() + uv_write().

Ah, I missed the following condition.

node/lib/net.js, lines 326 to 328 in 1447a79

if ((fd === 1 || fd === 2) &&
    (this._handle instanceof Pipe) &&
    process.platform === 'win32') {

@addaleax (Member) commented
I tried the approach laid out above, waiting for writes to finish, but it actually makes the test much more flaky:

Diff in the fold
diff --git a/lib/internal/http2/core.js b/lib/internal/http2/core.js
index 95ce8bcdb4ed..23b40565b665 100644
--- a/lib/internal/http2/core.js
+++ b/lib/internal/http2/core.js
@@ -1327,16 +1327,12 @@ class Http2Session extends EventEmitter {
     const handle = this[kHandle];
 
     // Destroy the handle if it exists at this point
-    if (handle !== undefined)
+    if (handle !== undefined) {
+      handle.ondone = finishSessionDestroy.bind(null, this, error);
       handle.destroy(code, socket.destroyed);
-
-    // If the socket is alive, use setImmediate to destroy the session on the
-    // next iteration of the event loop in order to give data time to transmit.
-    // Otherwise, destroy immediately.
-    if (!socket.destroyed)
-      setImmediate(finishSessionDestroy, this, error);
-    else
+    } else {
       finishSessionDestroy(this, error);
+    }
   }
 
   // Closing the session will:
diff --git a/src/node_http2.cc b/src/node_http2.cc
index f3ef8363e4eb..1b347478cf47 100644
--- a/src/node_http2.cc
+++ b/src/node_http2.cc
@@ -764,6 +764,10 @@ void Http2Session::Close(uint32_t code, bool socket_closed) {
 
   flags_ |= SESSION_STATE_CLOSED;
 
+  if (!(flags_ & SESSION_STATE_WRITE_IN_PROGRESS)) {
+    MakeCallback(env()->ondone_string(), 0, nullptr);
+  }
+
   // If there are outstanding pings, those will need to be canceled, do
   // so on the next iteration of the event loop to avoid calling out into
   // javascript since this may be called during garbage collection.
@@ -1554,6 +1558,11 @@ void Http2Session::OnStreamAfterWrite(WriteWrap* w, int status) {
   // Inform all pending writes about their completion.
   ClearOutgoing(status);
 
+  if (flags_ & SESSION_STATE_CLOSED) {
+    MakeCallback(env()->ondone_string(), 0, nullptr);
+    return;
+  }
+
   if ((flags_ & SESSION_STATE_READING_STOPPED) &&
       !(flags_ & SESSION_STATE_WRITE_IN_PROGRESS) &&
       nghttp2_session_want_read(session_)) {

While I think that is the right thing to do, it makes me wonder…

  1. Why do we destroy the socket in the first place, instead of ending it cleanly? Using .end() instead of .destroy() removes all of the introduced flakiness for me. (It makes some other tests fail but I feel like we really should not just RST the connection when we’re done.)
  2. Why do we not handle ECONNRESET errors? We should expect those and treat them like a regular EOF marker, imo.

@sogaani commented Oct 16, 2019

I tried the approach laid out above, waiting for writes to finish, but it actually makes the test much more flaky:

I guess this is because OnStreamAfterWrite() is called by an asynchronous write that was queued before the goaway frame was sent. Should we check that there is no pending data before making the callback?

@@ -1554,6 +1558,11 @@ void Http2Session::OnStreamAfterWrite(WriteWrap* w, int status) {
   // Inform all pending writes about their completion.
   ClearOutgoing(status);
 
+  if (flags_ & SESSION_STATE_CLOSED) {
+    MakeCallback(env()->ondone_string(), 0, nullptr);
+    return;
+  }
+

Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. http2 Issues or PRs related to the http2 subsystem. test Issues and PRs related to the tests.