
Do not background server in ping-pong test #29

Merged
merged 1 commit into main from pr/fix-hanging-ping-pong on Sep 8, 2022

Conversation

armanbilge
Owner

Fixes #28.

I wonder if #21 is related, but in that case it seems it was hanging if there was an error in the client 🤔

_ <- IO(assertEquals(res, "pong"))
} yield ()

serverCh.bind(new InetSocketAddress(0)) *> server.both(client).void
Owner Author

The solution is simple: instead of running the server in the background of the client (where its exceptions are ignored), run both concurrently with equal importance.
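
For illustration, a minimal sketch of the contrast, assuming cats-effect 3 (the names and shapes here are mine, not the test's original code):

import cats.effect.IO

// Before (sketch): the server runs in the background while the client
// runs; if the server dies, its error is ignored unless its outcome is
// explicitly checked.
def backgrounded(server: IO[Unit], client: IO[Unit]): IO[Unit] =
  server.background.surround(client)

// After: both run concurrently with equal importance; if either side
// fails, the other is canceled and the error surfaces.
def concurrent(server: IO[Unit], client: IO[Unit]): IO[Unit] =
  server.both(client).void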

Collaborator

I feel like I am sitting exams in a concurrency course ;-)
Insert the usual humble "I could be wrong. If so, sorry."

I think you have the right concern but are introducing a race condition here.
The server needs to have completed initialization and be
known to be in accept before the client starts (well, before it issues its connect(),
but the client prologue code appears short and fast).

To check my grasshopper understanding of *>, the cats-effect IO API says:
*>[B](that: IO[B]): IO[B] — Runs the current IO, then runs the parameter, keeping its result.

This means that the serverCh.bind has run to completion and written all the way to memory? The client uses serverAddr <- serverCh.localAddress. Is there an easy way for the serverCh.localAddress
to be passed to the client rather than globally? This would guarantee sequencing and make it evident.
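
For what it is worth, a small sketch of that *> semantic, assuming cats-effect 3's IO:

import cats.effect.IO

// For IO, a *> b behaves like a.flatMap(_ => b): b does not begin until
// a has completed, so the bind really has finished before anything to
// the right of *> runs.
val a: IO[Unit] = IO.println("runs first, to completion")
val b: IO[Unit] = IO.println("only then does this run")
val sequenced: IO[Unit] = a *> b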

Collaborator

Yes, #21 is related but different. This Issue reveals the logic error in #21.

When this is fixed, #21 will go underground but still be there.

Owner Author

I think you have the right concern but are introducing a race condition here.

Yes, after writing this I became worried there is a race condition here myself. Let me think through it again 😂

This means that the serverCh.bind has run to completion and written all the way to memory?

Yes, that's right. bind must complete before it continues.

Is there an easy way for the serverCh.localAddress to be passed to the client rather than globally?

Sure, we could def client(serverAddress) and write it like this maybe?

serverCh.bind(new InetSocketAddress(0)) *>
  serverCh.localAddress.flatMap { serverAddress =>
    server.both(client(serverAddress)).void
  }

But I'm not convinced this will make much difference: we are still relying on serverCh.bind to complete, because otherwise won't serverCh.localAddress just return junk to us?

Collaborator

I think we are making progress but there is a mélange of concerns.

  1. You are correct: serverCh.bind must complete.

  2. Using the argument eliminates the requirement that the bind result
    must have reached possibly shared, possibly distributed memory.

  3. Unless there is some semantic that the server part of .both() happens
    before the (client) part, I think there is still a race.

    I believe that there is a complex way to take a lock on a common, shared,
    volatile variable before the .both(): have the client acquire the lock (meaning
    that it does a synchronous wait until it gets it), then do a tiny (2 second?)
    delay, then begin execution. Meanwhile, the server releases the lock immediately
    before its accept. Most examples are written for single-stream, human-speed "start server; start client".
    Here we are dealing with probably-parallel machine speed.

    Much as it is an eyesore, I think the economic fix is an explicit delay: start_server.delay(10).start_client_in_parallel.
    I think the cats-effect IO API has a delay method. Yes, I know that we are trying to do async code here.
    A delay of 10 (5?) seconds is enough.
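
A sketch of that delay idea, assuming cats-effect 3, where the timed combinator is IO.sleep rather than a .delay method (the duration is illustrative):

import scala.concurrent.duration._
import cats.effect.IO

// Illustrative only: run the server immediately, but hold the client
// back for a fixed delay so the server is (hopefully) already in accept.
def withStartupDelay(server: IO[Unit], client: IO[Unit]): IO[Unit] =
  server.both(IO.sleep(2.seconds) *> client).void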

Owner Author

Ah, thanks for that detailed explanation. I understand your concerns better now, specifically about shared memory across threads. However, because the Scala Native runtime is single-threaded, this should not be a problem at this time (although in the future we will have to think hard how to make these implementations thread-safe).

3. Unless there is some semantic that the server part of .both() happens
before the (client) part, I think there is still a race.

I'm confused why this needs to be the case. The very first step in server is to accept the client socket. There is no way for this step to proceed without running the client concurrently.

val server = serverCh.accept.use { ch =>

Collaborator

I will study this more in the light. In another discussion you posted
the code for "local and remote addresses".

There, the server socket is definitely "listening". The same is true
here. I think there is a time window where the client can connect
to the ServerSocket (at the indicated port) and the TCP handshake
is "held" for a short period of time. I need to check, with a clear
head and clear mind, whether my understanding is right and how long
"short" is. Memory has it that it is not forever.

I think a well-defined, if ugly, and obvious happens-before leads to
lower support costs.

More later.

Thank you for chasing this.

@armanbilge
Owner Author

I am happy to hold on this PR until we can resolve #21 as well. I do not want that problem to go underground, as I understand it to be more serious.

@LeeTibbert
Collaborator

#21 is more serious than the testSuite environment: that one means that users in the wild can get
infinite hangs.

re: this one. I went back and studied how a connection is established.
The TL;DR is that .both() is OK here, as long as the server .bind() is guaranteed
to have happened-before.

The longer story is that JVM-ish ServerSocket bind() contains a C bind() followed by a listen()
call. Once that method has completed, incoming TCP connections should complete the
three-way handshake and complete the connection. The connection will be placed on the
listen queue, if there is space, else dropped. I could not discover any time limit on how long
a connection can stay in that listen queue.

When accept() is eventually executed, the connection will be taken off the listen queue
and passed on to the caller of accept().

In the absence of any other timeouts, client writes before accept will continue to
happen until a TCP receive buffer fills on the server; then ???

In the current case, there is only one possible client, writing a small amount
of data before attempting to read. So if the client is running before the server
executes accept, and the client issues a connect(), the connection should
happen and go into the server listen queue. Since there is only one connection,
that queue should be empty and never fill.

All of this depends upon the assured 'happens-before' of bind-with-listen.

I believe that "this is (now) happening in ping-pong`"

I usually write server code which is flooded with requests when the gates open,
so I ensure that I have gotten all the way to accept() before starting possibly
many clients. Those concerns are not relevant here.
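
A minimal JVM-flavored illustration of that listen-queue behavior, using plain java.net rather than the epoll implementation under discussion:

import java.net.{ServerSocket, Socket}

object ListenQueueDemo extends App {
  // The ServerSocket constructor performs bind() plus listen(); once it
  // returns, the kernel completes incoming TCP handshakes on its own.
  val serverSock = new ServerSocket(0) // ephemeral port

  // The client connects *before* accept() is called: the three-way
  // handshake completes and the connection waits on the listen queue.
  val clientSock = new Socket("127.0.0.1", serverSock.getLocalPort)

  // accept() then takes the already-established connection off the queue.
  val accepted = serverSock.accept()
  println(s"accepted: ${accepted.getRemoteSocketAddress}")
}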

@LeeTibbert
Collaborator

serverCh.bind(new InetSocketAddress(0)) *>
  serverCh.localAddress.flatMap { serverAddress =>
    server.both(client(serverAddress)).void
  }

But I'm not convinced this will make much difference: we are still relying on serverCh.bind to complete, because otherwise won't
serverCh.localAddress just return junk to us?

I think this hinges upon "what does it mean to complete". With the argument, it is pretty clear, at least to me, that the
serverAddress is concrete, available, and knowable. The previous code had a shared global (non-volatile) variable. I am always
leery of parallel unsynchronized access to non-atomic variables. Did the change make it all the way to memory, and
were all the caches updated?

I may be excessive here and do not mean to make your life difficult. I am trying to reduce your support costs,
especially when people take your test code for an example and, unintentionally, violate its undocumented
synchronization assumptions.

@armanbilge
Owner Author

I am always leery of parallel unsynchronized access to non-atomic variables. Did the change make it all the way to memory, and were all the caches updated?

I may be excessive here and do not mean to make your life difficult. I am trying to reduce your support costs, especially when people take your test code for an example and, unintentionally, violate its undocumented synchronization assumptions.

These are great points, thank you for the detailed explanations.

I completely understand your concerns about memory publication: we agonize over details like this in Cats Effect JVM (a tangent, but there is a fantastic story/talk about one such investigation :))

The part I still can't wrap my head around is how these issues can come into play in a single-threaded runtime, which is the only option in Scala Native today. Furthermore, if these issues do come into play, then I have no idea how we can solve it, as Scala Native currently offers no way of defining volatile or atomic variables.

@LeeTibbert
Collaborator

Sure as dancing leads to sin, concurrency leads to parallelism.

A lot depends upon how much you believe your abstraction.
I lost my window with the link to the doc which describes .both.
If you believe the abstraction and know nothing about
the underlying implementation, I believe both could
be happening at once.

The part I still can't wrap my head around is how these issues can come into play in a single-threaded runtime, which is
the only option in Scala Native today.

At the level of the .both() abstraction, you/we do not know that; we know only the abstraction.

Things change, and that will probably cost you a long week of panic coming at the worst of times.

Furthermore, if these issues do come into play, then I have no idea how we can solve it, as Scala Native
currently offers no way of defining volatile or atomic variables.

You are right, in the general case, we need volatile support in the SN compiler. There are a bunch of foo_atomic
methods which I stubbed because somebody needed to link. I cringe whenever I think of
how long they have lived and how badly they will break if/when SN becomes multithreaded.

In this specific case, and its relatives in TcpSuite.scala, and in many/most other places,
we can avoid triggering the problem by not using global variables from potentially two
(green-ish) threads. Passing the argument avoids an entire world of concern.

I set out this morning with my hot coffee to test this PR in my "known broken by me"
environment. I hope to be able to do that today.

@armanbilge
Owner Author

and know nothing about
the underlying implementation

That's the thing: we do know the underlying implementation is single-threaded:

  1. The Scala Native runtime is fundamentally single-threaded and offers no synchronization primitives.
  2. The ExecutionContexts provided by this library are serially-executing event loops.

I definitely agree this will be a problem in the future when multithreading is at play. At that time, Scala Native will also give us the necessary tools to solve this problem, such as volatiles and atomics.
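
To make point 2 concrete, here is a toy model of a serially-executing event loop (a sketch for intuition, not this library's actual ExecutionContext):

import scala.collection.mutable
import scala.concurrent.ExecutionContext

// Toy sketch: every submitted task runs on the single thread that calls
// drain(), one at a time, so no two tasks are ever in flight at once and
// cross-thread memory publication never comes into play.
final class SerialEventLoop extends ExecutionContext {
  private val tasks = mutable.Queue.empty[Runnable]
  def execute(runnable: Runnable): Unit = tasks.enqueue(runnable)
  def reportFailure(cause: Throwable): Unit = cause.printStackTrace()
  def drain(): Unit = while (tasks.nonEmpty) tasks.dequeue().run()
}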

@LeeTibbert
Collaborator

LeeTibbert commented Sep 6, 2022

OK, I got some time-on-task and exercised this PR

TL;DR Sorry, it does not solve the problem of the test failing when
the client connection fails. Progress, in that there
is now a stack trace, but the test does not halt as
a result.

  • All this is standard IPv4.

  • I started with the epoll repository dated today 10 AMish EDT.

  • I established that both testsJVM/test and testsNative/test worked as expected.

  • I pulled down the file from this PR and re-ran the tests using it. All checked OK.

  • I injected a fault into connect in EpollAsyncSocketChannel.scala (used a
    hard-coded, known-bad port number; well, the port number is not bad, just
    misunderstood).

  • After about 2 minutes elapsed time, there is a stack traceback in the ping-pong
    test, but it continues to hang (probably the lack of timeout in accept() that
    we are chasing in another issue).

    I am not sure what is giving the 2 minute timeout. I would have to
    write a matching set of C programs to see if it is the TCP protocol.
    Could be.

    The callback is getting called. Issue connect() & timeout #20 recommends an explicit
    and shorter timeout. That is a good placeholder for this thought.

  • for grins & giggles, I implemented the code snippet in a prior reply
    which passes the serverAddress as a parameter. I had to define
    client as a def. As somewhat expected, same results
    as the stock file in this PR.

@armanbilge
Owner Author

I injected a fault into connect in EpollAsyncSocketChannel.scala (used a
hard-coded, known-bad port number; well, the port number is not bad, just
misunderstood).

Would you mind pushing up a branch with this? So I can take a look.

@LeeTibbert
Collaborator

Are you running an HTTPS server on your machine?
If not, the effective change is in EpollAsyncSocketChannel.scala.
Most of this is context; the change is two lines for the port number.

If you are using an HTTPS server, or do not know, you should
change to 8081 (or some high port number < 65K).

def connect[A](
      remote: SocketAddress,
      attachment: A,
      handler: CompletionHandler[Void, _ >: A]
  ): Unit = {
    val addrinfo = stackalloc[Ptr[posix.netdb.addrinfo]]()

    val continue = Zone { implicit z =>
      val addr = remote.asInstanceOf[InetSocketAddress]
      val hints = stackalloc[posix.netdb.addrinfo]()
      hints.ai_family = posix.sys.socket.AF_INET
      hints.ai_flags = posix.netdb.AI_NUMERICHOST | posix.netdb.AI_NUMERICSERV
      hints.ai_socktype = posix.sys.socket.SOCK_STREAM
      val rtn = posix
        .netdb
        .getaddrinfo(
          toCString(addr.getAddress().getHostAddress()),
// Original:
//          toCString(addr.getPort.toString),
// Force an error, see if TcpSuite reports it:
          c"8080",
          hints,
          addrinfo
        )

I can do a full branch if you like. Mine has printf's in it so that I can
determine that the callback is being called.

@LeeTibbert LeeTibbert left a comment
Collaborator

LGTM

Our discussions convinced me that the serverCh.localAddress will have a proper value
by the time it is used.

@armanbilge armanbilge merged commit 9fb9929 into main Sep 8, 2022
@armanbilge armanbilge deleted the pr/fix-hanging-ping-pong branch September 8, 2022 04:30
Successfully merging this pull request may close these issues:

TcpSuite hangs if there is an error in accept