Skip to content
This repository has been archived by the owner on Feb 12, 2022. It is now read-only.

Added the ability to "prod" the WebSocketConnection #45

Closed
wants to merge 1 commit into from

Conversation

dpapavas
Copy link

Adds a method to the SignalServiceMessagePipe, that can be called
periodically, in order to avoid issues when the device enters sleep,
disrupting timing functions like Thread.sleep() and Object.wait().

// FREEBIE

The Problem

This pull request is part of a solution to issue signalapp/Signal-Android#6644. The problem is that when the device enters sleep mode (basically as soon as the power button is pressed, provided no wake locks are held), the uptimeMillis() clock on which most interval timing functions such as Thread.sleep(millls) and Object.wait(millis) depend stops ticking.

The result is that WebSocket-handling code, that performs blocking wait, never awakes as the timeout never expires. This seems to be of no consequence on GCM-enabled devices, as the GCM notification wakes the device, which afterwards acquires a wake lock, until the message is received, but can make the application unusable when GCM is not available.

Note though, that it does happen in the presence of GCM as well: When the wake-lock is released and the device falls asleep (after any messages have been processed), return from readRequest() can be delayed indefinitely and hold the socket open. The keep alive is also missed and the socket is closed with an EOFException. I'm not sure whether this can have any practical effect on the overall operation of the application though. In any case, the solution proposed below can work with GCM-enabled devices as well, if deemed necessary.

This problem arises in three cases:

  1. When waiting to issue the next keep alive inside KeepAliveSender's thread: Here the typical result is that keep-alives are missed, as soon as the device falls asleep and the socket closes soon afterwards. Reception of any messages arriving after that, is delayed for an indefinite amount of time, typically until the user wakes the device (or until some other process wakes the device, long enough for the timeout to expire).

  2. When blocking while waiting for the data to become available on the socket: As far as I can see, the only result of any consequence here, is that the shutdown of the pipe corresponding to the socket can be delayed indefinitely. I'm not sure whether this can have any adverse effects on operation.

  3. When blocking after a disconnection and before attempting to reconnect: This has the obvious consequence that reconnection, say after a socket error, can be delayed indefinitely, possibly leading to connectivity issues.

The attempted solution

The solution, or at least the half that concerns the socket, is to provide a method to jolt the socket-related threads out of sleep periodically (called "proding" the pipe/socket). Note that this is not the same as polling: The prod only substitutes for the expiration of Thread.sleep() and Object.wait(), which will never come. The basic nature of the operation of the socket, remains unchanged.

The only essential purpose of the prod, is to wake the KeepAliveSender's thread and ensure delivery of the keep-alive. The approach is to substitute the call to Thread.sleep(), with a wait() on a lock, that is global to all threads. Each time the socket is prodded, notifyAll() is called on the lock and the threads awake. This approach nicely accommodates the possibility of more than one sockets co-existing at any one time.

When the KeepAliveSender's thread awakens from a prod, it also notifyAll()s the WebSocketConnection, in order to make sure it's not stuck blocking in readRequest() or onClosed(). Since the prods will practically arrive 1 minute apart, we cannot depend on them to wake us when blocking before a reconnection, so this wait, has been converted to a hybrid approach, where the first few attempts are busy-waits, to ensure a fast reconnection. The wait has also been moved, so as to only occur when a reconnection will follow. It therefore has practically no effect on GCM-enabled devices, which generally won't need to reconnect very often, as they close the pipe after each operation is concluded.

Adds a method to the SignalServiceMessagePipe, that can be called
periodically, in order to avoid issues when the device enters sleep,
disrupting timing functions like Thread.sleep() and Object.wait().

// FREEBIE
@moxie0
Copy link
Contributor

moxie0 commented Oct 15, 2017

Nice investigation, but this PR breaks the abstraction layer. The websocket is responsible for keeping itself alive, not the caller. If sleep or wait aren't doing the job, then they need to be replaced with something else which works but maintains the abstraction layer.

@dpapavas
Copy link
Author

Thank you for your prompt review. I'll comment here for the sake of completeness, for anyone else who might stumble upon this (as, from my research, it seems that many projects face similar problems).

Let me start by saying that I don't have much more experience with Android, or Java, than what I've gained in researching this issue, so feel free to take what follows with a grain of salt. Hopefully someone who knows more, will be able to fill in the blanks, or correct any mistakes on my part.

As I've mentioned in signalapp/Signal-Android#6644, but failed to mention here, there seems to be no other call that does the job. All timing-related constructs on the Java level, seem to use the same clock. I assume that this is not coincidental and that it is meant to force applications to use higher-level mechanisms, such as the AlarmManager, which can be better controlled for the purpose of charge conservation.

Although I understand your concerns about braking the abstraction layer of the overall design, in a certain sense it is already broken. Conceptually the socket should take care of itself, but that is not the case. On GCM-enabled devices, it manages to do so, solely because the GCM notification receiver acquires a wake lock, forcing the device to stay awake while a message is received. As far as I can see this is otherwise not really necessary, since my patched GCM-free version works just fine without it. I believe its basic effect is to keep the socket-related threads from falling into the kind of narcoleptic state, they fall into otherwise, but this should only be necessary in order to keep the KeepAliveSender going. If it was necessary in other respects, my version should not function properly. In a sense therefore, the wake lock that is held at the application level, performs the same function as my prodding alarm and presents the same sort of abstraction layer violation, albeit without being explicit about it.

The only alternatives I can see are the following:

  1. Abstract the prodding mechanism, and provide a way to prod the whole Java backend, as a way of enabling its function in the absence of GCM-initiated wake locks in general. The socket can then register to receive such prods. I understand that this is abstraction for the sake of abstraction, but this seems to be the issue and besides, it might prove useful in the future, if supporting GCM-free devices is desirable.

  2. Perhaps it would be possible to establish a separate service, as part of signal-service-android, with the sole purpose of registering an alarm to keep the socket connection awake. I'm not sure that even makes sense, as I'm not very familiar with the structure of Android applications in general and Signal in particular, but you'll be in a better position to judge this.

  3. Perhaps JNI can be used to expose a C-level function such as clock_nanosleep, which allows selection of the clock used to time the block. This is probably a horrible over-complication for the intended purpose, but it would allow the socket to take care of itself. The same caveat about this approach being practically possible, or making any sense at all holds as above.

In the absence of any such mechanism, my understanding is that Signal can generally not function without GCM, so I would recommend removing the existing infrastructure altogether, as it would only constitute broken code that confuses users, expecting Signal to function on their device.

@moxie0
Copy link
Contributor

moxie0 commented Oct 16, 2017

If AlarmManager is the thing that works, then use the AlarmManager, but just maintain the interface.

@dpapavas
Copy link
Author

Would you care to sketch me a broad outline of how that can be done, in an acceptable way, or is it understood that I just go on building ready baked and fully tested PRs, that you can then summarily reject, with one-line apocryphal feedback?

libsignal-service-java is Java-only. It knows nothing about GCM, Android services, or the AlarmManager. So whatever self-contained solution cannot be implemented there.

Perhaps, as I have suggested, a separate service could be incorporated inside libsignal-service-android, for the sole purpose of registering and receiving an alarm to wake the socket. As far as I can tell (and I may well be wrong), that would keep the pipe's API intact, but the service still has to be started from somewhere and that somewhere is the application. Introducing a new service inside a library and having the application start it, could well be construed as a change to the interface. Would that nevertheless be acceptable?

@moxie0
Copy link
Contributor

moxie0 commented Oct 18, 2017

It sounds like you're suggesting that Thread.sleep() doesn't work on Android. If that's true, it makes sense to me that you would want to replace Thread.sleep() with an interface like SignalThread.sleep() or whatever. The java code can implement that interface as Thread.sleep(), and the Android code can implement that interface as a rendevouz that uses the alarm manager. The calling code can then pass in the implementation of the interface they desire on construction.

@dpapavas
Copy link
Author

dpapavas commented Oct 19, 2017

Thanks for getting back to me. I think you misunderstand the issue: Thread.sleep doesn't work at all, on any sleeping Android device. Or rather, it does work in a sense, but not in a way that is useful to us. Thead.sleep(10000) essentially means "block the thread for 10 CPU-seconds". When the CPU is sleeping, it doesn't count towards those 10 seconds.

So when the thread starts sleeping for 55 seconds before sending the next keep-alive, if you push the power button on your device to turn off the screen 5 seconds later, then put it down and pick it up 5 hours later, the thread will wake up 50 seconds after you've picked up the device again and turned the screen on. During those 5 hours the socket will have long closed and Signal will be trapped with a stale pipe, waiting to send a keep-alive in one thread and to read something from the pipe in the other. The effect is of course the same, as if you had no connection during that time. If some other process acquires a wakelock during that period and wakes the CPU up, the clock will run while the lock is held and the threads may wake earlier, but in general, not in time to keep the socket alive.

Note also, that the problem is not necessarily confined to the socket. Any code that depends on sleeping for a more-or-less precise amount of time, without keeping a wake-lock during that time, is inherently broken. That is the case whether the device is GCM-enabled or not. GCM only fixes things in this case, because it keeps a wakelock while the message notified by the push is being handled.

All this has nothing to do with whether the sleep method is called from Signal-Android or signal-service-java. The reason that prompted me to break the interface, by exposing a method to wake the pipe up, is that there is, to my knowledge, no method to do blocking wait in Java, based on a real-time clock (and I think that is intentionally so, so that there really is no call). The only method to do that, that I have found, is to use the AlarmManager to periodically manually wake the thread up, by calling notify() on the lock it's waiting on.

From within libsignal-service-java which is built as a plain, Android-agnostic JRE, there is, as far as I can tell, no way to register an alarm. For that, I have to have a running Android service, and that can only be done from within Signal-Android, the MessageRetrievalService being the obvious candidate. But the listener receiving the alarm, needs to somehow call notifyAll, to wake the keep-alive (and also the thread blocking for read on the pipe), hence the need to introduce a new API call, which I agree is unpleasant, as it exposes internal machinery of the pipe, to the calling code.

If there is a better way to handle this issue, I'm all for it and will happily adopt it. And again, some of the above may be wrong, as my understanding of Java and Android is limited. I'm being a bit verbose in my descriptions both to better explain the problem, and in the hope that you'll spot any incorrect assumptions on my part.

@moxie0
Copy link
Contributor

moxie0 commented Oct 19, 2017

No, I'm pretty sure I understand the issue. You're saying Thread.sleep doesn't work on Android (a minority of devices stop the CPU so quickly after screen off).

The problem is with the implementation of Thread.sleep on Android. The interface is exactly what we want. So all I'm saying is: replace the implementation on Android.

This repository builds two artifacts, java and android. The latter depends on the former. If you define an interface such as SignalThread with a signature of SignalThread.sleep, you can provide two different implementations of that eg AndroidSignalThead and JavaSignalThread, each of which lives in the right artifact. The latter just does Thread.sleep(), the former implements a waiting rendezvous with an alarm manager and a dynamically registered receiver.

The caller can then pass in whichever implementation of the interface they want to use (depending if they are, for instance, a java or android app) and that's it.

@dpapavas
Copy link
Author

Well, it seems then, that it's me who's misunderstanding. Sorry, I hoped that I'd be able to get by with whatever Java/Android I could pick up on the way, in order to make this work, but it's evidently not working out. Hopefully someone will come along who's in a better position to put this fix together. Thank you for your time.

@moxie0
Copy link
Contributor

moxie0 commented Oct 20, 2017

You're pretty close, just move some of your code around and you should be there.

@theBoatman
Copy link

@moxie0 I think I understand what solution you have in mind, but in my opinion that will lead to a solution that is more complex and less readable than it needs to be. Adding an interface that mimics the sleep behavior of java and does Android alarm stuff behind is probably doable, but it adds just another layer between the android app and libsignal.

In my opinion the library should
*) publish the method sendKeepAlive so this can be triggered from outside
*) Add a parameter to specify on creation whether the library should send keepalives itself or I want to trigger them from outside.

Both changes would be very small and would not break any existing behavior. They just add a small additional feature. And with that it would be easy to set up an Alarm receiver in the android app and address that problem strait the android way. Is this something we can talk about?

@moxie0
Copy link
Contributor

moxie0 commented Jan 11, 2018

In my opinion abstraction layers make things more readable, not less. It seems that we disagree about that, but I'm going to want it done that way since I'm the one who's going to have maintain this code long term.

@theBoatman
Copy link

Of course it is your project and your decission. I will try to implement a sleep method based on android alarms to see where that will lead us.

But apart the question which implementation is more readable, fact is that we lose flexibility this way. I have two usecases in mind:

  1. On android, it would be better to not send keep alive packages on a fixed intervall, but define a time frame where to send it. This way android can coordinate different wakeups and process them more efficient.

  2. If we have control over the keep alive algoritm in the app we could develop an adaptive keep alive strategy, which would be clearly a platform dependend task.

If we just mimic Thread.sleep then those two usecases are not (directly) solveable in the future. (Especially for the first one I have some code in mind. The second one is a rather complex task and would need a lot of work and experimenting.)

And as a side note: We do not disagree on whether abstraction layers make things more readable. But in my opinion BOTH variants break the layer, because it is needed in this case. One solution does it by clearly opening the layer, the other one does it through the backdoor by allowing to inject platform dependend code into the library.

But as long as we agree that this is a problem that needs to be solved I am positive that we find a solution.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants