JNI concurrent interface extensions #1215
Conversation
* replaced NativeUtils with Java's own library loader; added a MulticlassMultilabel interface to JNI; generated OS-dependent lib files; modified the pom to include the generated lib file in the Java library path; removed an unused variable in the multiclass learner
* Since an instance of a vw model is not thread-safe for multiple predictions in parallel, we introduce a concurrent learner (for multilabel and multiline multiclass learners) that works with a pool of learners to get high throughput for predictions in an online setting.
* The learner pool now creates vw instances of the same model using the seed_vw_model method instead of initialize. The former reuses shared variables from the seed learner instance and hence has a much smaller memory footprint than the latter, which allocates new memory for every new instance created.
* Fixed a space-alignment issue and changed the getLearner API to return Optional to stop any unwanted NPEs.
* Removed a TODO, as we now use seed_vw_model to instantiate multiple learner instances of the same model.
I’m all for a unified C++ layer...
…-Markus
I think a lot of this stuff sounds pretty reasonable. A few concerns.
- This won't work with the current language level. And bumping the language level should not be taken lightly.
- This also seems to include changes to the library build that I proposed in another PR. In my opinion, the changes to the way this is built should not go in here.
```cpp
JNIEXPORT jlong JNICALL Java_vowpalWabbit_learner_VWLearners_seedVWModel(JNIEnv *env, jclass obj, jlong vwPtr)
{
  jlong cloneVwPtr = 0;
  try
  {
    vw* vwInstance = VW::seed_vw_model((vw*)vwPtr, "");
```
What does seed_vw_model do?
It initializes a new VW instance using the parameters from an existing instance (the seed); these cloned VW instances share internal data with the seed instance. The seed_vw_model method is expected to make the concurrent methods more memory-efficient.
```java
import vowpalWabbit.jni.NativeUtils;

import java.io.IOException;
import java.util.Optional;
```
This won't work. The pom.xml file specifies the language level as Java 6, and this is a Java 8 construct.
Good catch. Optional helped to maintain cleaner and more semantically meaningful code. Shall we upgrade to Java 8, since Java 6 is no longer updated?
```java
    finally {
      STATIC_LOCK.unlock();
    }
  }
}

private static native long initialize(String command);

private static native long seedVWModel(long nativePointer);
```
What's the difference between seeding and initializing? If you are no longer using initialize, that should be removed.
See the comments on seed_vw_model.
Let's say you have a vw model of size 1GB and you create 2 instances of the same model using initialize; the memory footprint in Java is 2 * 1GB. If you use seed_vw_model instead, internal data structures are shared, so the memory footprint is still 1GB.
We still need initialize to create the first instance; seed_vw_model requires a seed model to clone from.
@deaktator could you comment on this too?
@jmorra I have removed the usage of …
I wanted to get the motivation for this PR. Is the main motivation improving throughput, or simply providing timeout semantics in the prediction methods? If it's throughput, I would suggest adding JMH benchmarks with the following:
1. Your code
2. A thread-local model with an executor service backed by a fixed-size thread pool sized to the desired concurrency level
3. Serial execution of a single-threaded model
Is there a concern about the queue growing too large with the threadlocal approach? It seems simpler and possibly faster than using a blocking queue. Also, on closer inspection, you are polling the model queue inside the submitted Callable passed to the ExecutorService. This will add to the queue in the ExecutorService. Therefore the same issue would exist in either your or the threadlocal implementation.
Some other questions:
1. If the predict method is always called by the same thread, won't this likely perform worse than a normal model? By calling the following, you block the calling thread, right? Am I wrong?
```java
O result = future.get(timeoutInMillis, TimeUnit.MILLISECONDS);
```
If so, won't your throughput be no better than serial model performance?
Why is your asynchronous boundary inside predict? Why block in the predict method vs return a Future? If you decide to return a Future, consider Guava ListenableFutures so you can potentially avoid blocking altogether by providing a callback mechanism in the model.
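As a sketch of the non-blocking shape being suggested here (the class and method names are hypothetical, not the PR's actual API), predict could hand back the Future and let the caller decide when, or whether, to block. A plain `Future` is used below to stay within the repo's Java 6 language level; Guava's ListenableFuture would additionally allow callbacks.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncPredictSketch {
    private final ExecutorService exec = Executors.newFixedThreadPool(2);

    // Stand-in for a real (blocking) VW predict call.
    private int predictBlocking(String example) { return example.length(); }

    // Returns the Future instead of blocking inside predict;
    // the caller chooses where the asynchronous boundary lives.
    public Future<Integer> predictAsync(final String example) {
        return exec.submit(new Callable<Integer>() {
            public Integer call() { return predictBlocking(example); }
        });
    }

    public static void main(String[] args) throws Exception {
        AsyncPredictSketch p = new AsyncPredictSketch();
        Future<Integer> f = p.predictAsync("1 | a b c");
        System.out.println(f.get()); // prints 9 (length of the example string)
        p.exec.shutdown();
    }
}
```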
Why provide both a LinkedBlockingQueue and an ExecutorService to the multithreaded learner constructor? If the executor service doesn't have the same number of threads as the size of the queue, you'll be wasting resources. This is especially true when there are fewer threads than elements in the queue: memory will be wasted. In the other case, threads will be unnecessarily created and used without additional throughput, right?
I would update the javadoc to say **null** rather than empty.
I would update the docs, if necessary, to indicate that concurrent models should be read-only (don't learn, or whatever terminology you like). This is so you don't get into a situation where the concurrent model is not idempotent.
Speaking of idempotency, I would add a test to ensure idempotency by predicting many times based on the same input. Stick the results in a Set and test that the size of the set is 1.
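A minimal version of that idempotency check might look like the following. The `predict` stand-in is hypothetical; a real test would call the concurrent JNI learner with the same VW example string each time.

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotencyCheck {
    // Stand-in for the concurrent model's predict; the real test
    // would call the JNI learner on the same VW example string.
    static int predict(String example) { return example.hashCode(); }

    public static void main(String[] args) {
        Set<Integer> results = new HashSet<Integer>();
        for (int i = 0; i < 1000; i++) {
            results.add(predict("1 | a b c"));
        }
        // Idempotent predictions collapse to a single distinct result.
        System.out.println(results.size()); // prints 1
    }
}
```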
If you choose not to change the implementation, I would consider adding an AtomicLong containing the number of interrupted predictions.
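A sketch of that counter, under hypothetical names, wrapping the timeboxed `future.get` from the earlier snippet: a `TimeoutException` bumps an `AtomicLong` so operators can monitor how many predictions were abandoned.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

public class TimeoutCounterSketch {
    static final AtomicLong timedOutPredictions = new AtomicLong();

    // Stand-in predict that takes ~100 ms, bounded by a caller timeout.
    static Integer predict(ExecutorService exec, long timeoutMillis) throws Exception {
        Future<Integer> future = exec.submit(new Callable<Integer>() {
            public Integer call() throws Exception { Thread.sleep(100); return 42; }
        });
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            timedOutPredictions.incrementAndGet(); // count abandoned predictions
            future.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        predict(exec, 1);    // too tight: times out
        predict(exec, 5000); // generous: completes
        System.out.println(timedOutPredictions.get()); // prints 1
        exec.shutdownNow();
    }
}
```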
I was wondering about the shared state in VW. Is this known or guaranteed to be thread-safe? If not, the cloned-seed-learner approach could result in really nasty bugs that will only be observable under heavy load and will likely manifest in ways that are hard to diagnose. The worst case would be a corruption that doesn't raise errors but causes predictions to be "unstable" or nonsensical.
Am I way off on this stuff? If so, let me know. What do you think @jmorra?
Also, if we can guarantee that models will only ever be called on specific threads, maybe we should have non-locking variants of the serial models to be used exclusively inside the concurrent models. This will avoid locking and will use the concurrency mechanisms in the concurrent models to ensure thread safety.
Here's some scala doc that does a quick and dirty benchmark. Shows the multithreaded approach could be up to 3 orders of magnitude slower than serial execution. I would definitely recommend benchmarking this PR's code against serial execution.
```scala
// Scala Code
//
// ------------------------------------------------------------
// Example output
// ------------------------------------------------------------
// sum     unthreaded_time_sec  threaded_time_sec  unthreaded_speedup
// 100000  1.9319E-5            0.017919362         927.55115X
// 100000  1.7781E-5            0.016643956         936.0529X
// 100000  2.0025E-5            0.023529027        1174.9827X
// 100000  1.5941E-5            0.01935222         1213.9904X
// 100000  1.209E-5             0.015618176        1291.8259X
// 100000  2.0106E-5            0.028387927        1411.9132X
// 100000  4.1961E-5            0.018170957         433.0439X
// 100000  8.384E-6             0.017119417        2041.9153X
// 100000  1.9023E-5            0.019923734        1047.3497X
// 100000  1.8086E-5            0.018302908        1011.9931X

import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.Future
import java.util.concurrent.Callable
import scala.annotation.tailrec

class Test(threads: Int) {
  private[this] val exSvc = Executors.newFixedThreadPool(threads)

  /**
   * Return length of string. Compute asynchronously and block to get result.
   * @param s input string
   * @return length of string
   */
  def strLength(s: String) = exSvc.submit(new Test.StrLen(s)).get
}

object Test {
  def strLength(s: String) = s.length

  class StrLen(s: String) extends Callable[Int] {
    def call() = strLength(s)
  }
}

def time[A](a: => A) = {
  val t1 = System.nanoTime
  val r = a
  val t2 = System.nanoTime
  (r, (1.0e-9 * (t2 - t1)).toFloat)
}

@tailrec
def sumThreaded(t: Test, n: Int, i: Int, strs: Array[String], s: Int): Int = {
  if (i < n)
    sumThreaded(t, n, i + 1, strs, s + t.strLength(strs(i)))
  else s
}

@tailrec
def sumUnthreaded(n: Int, i: Int, strs: Array[String], s: Int): Int = {
  if (i < n)
    sumUnthreaded(n, i + 1, strs, s + Test.strLength(strs(i)))
  else s
}

def threadedLengths(n: Int, seed: Long, threads: Int) = {
  val r = new scala.util.Random(seed)
  val strs = Array.fill(n)(r.nextString(100))
  val t = new Test(threads)
  time(sumThreaded(t, strs.length, 0, strs, 0))
}

def unthreadedLengths(n: Int, seed: Long) = {
  val r = new scala.util.Random(seed)
  val strs = Array.fill(n)(r.nextString(100))
  time(sumUnthreaded(strs.length, 0, strs, 0))
}

println("sum\tunthreaded_time_sec\tthreaded_time_sec\tunthreaded_speedup")
1 to 10 foreach { i =>
  val (u, ut) = unthreadedLengths(1000, i)
  val (t, tt) = threadedLengths(1000, i, 2)
  val utInv = 1d / ut
  val ttInv = 1d / tt
  val speedUp = (utInv / ttInv).toFloat
  println(s"$u\t$ut\t$tt\t${speedUp}X")
}
```
I think all of @deaktator's comments seem reasonable. @JohnLangford @zhilians what do you think?
The assumption is that the calling thread itself is not necessarily the main thread, which is the case when you use a webserver where every rest api request is assigned a dedicated thread.
Same as above
I don't completely follow this comment. As a user of this functionality, one can decide how many threads to dedicate to prediction. The ideal rule of thumb is to keep it close to (but less than) the number of cores in the CPU, since predict calls are pure CPU calls (no IO). The LinkedBlockingQueue is there to handle scenarios where the application queues predict requests faster than actual predictions can be made.
That is correct. We need to update the doc and potentially disable learn by default for such read-only, predict-only learners.
testMultilabelsConcurrency is doing that. Also, seed_vw_model is already being used in C#, so we assume the method works as advertised, and our tests indicate that this is so.
We thought about it. The lock call itself doesn't contribute significantly to the latency; it's only when many consumers block on the lock due to contention that it starts contributing significantly to the overall latency.
I agree; that was part of our initial plan until one of our colleagues made us aware of some weird issues ThreadLocal can cause. Besides, for our 10GB model, which took about 30-60 secs to initialize, we had to use a not-so-robust approach to initialize the learners before they could be used for prediction when using thread-locals.
@deaktator can you tell us about the size of the model you used? From the latency numbers you published, it seems like it is even less than 1 msec. We used a 10GB model that produced a minimum latency of around 12 msecs.
Thanks @deaktator for the detailed review. Adding to @atulvkamat 's comments:
The main focus for this PR is to increase the throughput in a high concurrency container without multiplying the memory usage. I think it's not a bad idea to run benchmark tests for different implementations over different sizes of the model. Will schedule time to do so this week.
I understand that you are concerned about the manipulation of the …
You are correct that the caller will be blocked waiting for the result or a timeout. But we picture this being used mostly in a multi-consumer environment where multiple consumers try to get predictions from the same model; they will benefit from having more than one predictor available, in which scenario the whole system scales roughly linearly. If the use case really is predicting from a single thread, one should rather use the original predictor implementation, which will still be available.
It's a good point. The motivation behind this constructor was to make it flexible for people who want to reuse a shared executor service across different concurrent predictor instances. You are right that memory will be wasted if we have more predictor instances than threads in the executor, but not in a very bad way, since we use … But to simplify the API, I will add another simplified constructor that creates an executor and a predictor queue of the same size. Will update the javadoc as well.
@zhilians I get what you're saying, and it seems like you've adequately fleshed out your use case and done a fair amount of performance testing. What I am concerned with is that the API is dictated by your use case. The API (signature, not documentation) doesn't make it clear to less astute users that it must be called in a multi-threaded environment. It's obvious in hindsight, but I had to read the code before it clicked. If instead you returned Futures or, even better, ListenableFutures, I feel like you wouldn't be imposing your use case on all users of the API. I like the "Your Server as a Function" paper by Marius Eriksen (https://monkey.org/~marius/funsrv.pdf) and the reactive paradigm, so I'm likely biased toward Futures that are asynchronously transformable (and non-blocking) into other Futures.
What do you think about the possibility of returning futures?
One other thought. I would make a constructor that takes a timeout and have two predict methods: one that takes a custom timeout and one that doesn't take a timeout and uses the one from the constructor parameter. Is this dumb? I feel like it is more uniform with the current serial models.
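That uniform shape could be sketched as follows (the class and its internals are hypothetical, not the PR's actual API): the constructor fixes a default timeout, and the two predict overloads either use it or accept a one-off override.

```java
// Hypothetical sketch: a predictor with a constructor-supplied default
// timeout and two predict overloads, mirroring the suggestion above.
public class TimeoutPredictorSketch {
    private final long defaultTimeoutMillis;

    public TimeoutPredictorSketch(long defaultTimeoutMillis) {
        this.defaultTimeoutMillis = defaultTimeoutMillis;
    }

    // Uses the timeout fixed at construction time.
    public int predict(String example) throws Exception {
        return predict(example, defaultTimeoutMillis);
    }

    // Caller-supplied timeout for one-off overrides.
    public int predict(String example, long timeoutMillis) throws Exception {
        // Stand-in for submitting to the learner pool and waiting
        // up to timeoutMillis for the result.
        return example.length();
    }

    public static void main(String[] args) throws Exception {
        TimeoutPredictorSketch p = new TimeoutPredictorSketch(50);
        System.out.println(p.predict("abc"));      // prints 3
        System.out.println(p.predict("abcd", 10)); // prints 4
    }
}
```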
We certainly don't want to impose our use case on all users of the API. Our contribution is more focused on providing another way to speed up predictions in a concurrent environment. IMHO, when the user creates a concurrent learner from the factory, it's immediately clear that the user is going to get a pool of predictors rather than a parallelized predictor that could be sped up even in a single-threaded environment. We can certainly polish the documentation a bit to make this clearer.
In addition to our timeboxed predict method, I've added a method that returns a Future to the abstract concurrent predictor.
added method to return Future in abstract concurrent predictor; added method to create concurrent predictor using thread pool and predictor pool of same size in factory
Thank you for another in-depth review. PR comments addressed in 4703f57. I strongly believe we should merge this as soon as possible, as it provides an additional, simple, and memory-efficient wrapper for speeding up concurrent prediction tasks. We could ship it now if there's no major flaw in the implementation, and then work on feedback from the community regarding this new API. @jmorra @deaktator @atulvkamat do you have other concerns regarding the core functionality?
@zhilians Change looks good to me.
The only two outstanding concerns I have are
1. The queue polling inside the Callable on a possibly unbounded queue in the Executor. As @atulvkamat stated previously, big models can have latency on the order of milliseconds, so if you get ~M*500 calls/sec, where M is the number of models in the blocking queue, the current predictors will die under back pressure with an OutOfMemoryError or a heap-size exception, right? I didn't scrutinize the javadoc too much, but I think this should be documented if it isn't already.
2. The thread-safety of the seeded models. I can't speak to this as it lies in the VW internals in the C code. It would be nice to have a sign-off. Can @JohnLangford or someone else attest to the thread safety of this approach? Is the shared state between such models thread-safe? If not, this seeded-model approach will also not be thread-safe. I know @atulvkamat commented on the unit tests checking for this, but 100 examples is most likely not enough to catch a race condition gone bad, and a passing test isn't definitive proof, merely probabilistic (and the ε isn't really known).
Other than those, I think this stuff looks good.
The seeded-models approach is routinely in very heavy use.
-John
Fine by me. @jmorra merge?
Looks good.
Looks good
Merged, thank you everyone :-)
* Add prerequisites: zlib-devel (or zlib1g-dev) (#1226)
  * Added zlib-devel (or zlib1g-dev) to Prerequisites
  * Update README.md
* do not use deprecated sklearn module (#1223)
* Enable the C wrapper to extract model to memory and initialize model from memory (#1189)
  * extract model to memory and initialize model from memory
  * add get_confidence C wrapper function
* JNI concurrent interface extensions (#1215)
  * replaced NativeUtils with Java's own library loader; Added MulticlassMultilabel Interface to JNI; generated OS dependant lib files; modified pom to include generated lib file in java library path; removed unused variable in multiclassleaner
  * Since an instance of vw model is not thread safe for multiple predictions in parallel, we introduce a concurrent learner (for multilabel and multiline multiclass learners) that works with a pool of learners to get high throughput for predictions in an online setting.
  * Learner pool now creates vw instances of the same model using seed_vw_model method instead of initialize. Former method reuses shared variables from the seed learner instance and hence has much less memory footprint in comparison to latter which allocates new memory for all new instances created.
  * Fixed space alignment issue and changed getLearner api to return Optional to stop any unwanted NPEs.
  * Removed TODO as we now use seed_vw_model to instantiate multiple learner instances of the same model.
  * remove java8 components
  * added method to return Future in abstract concurrent predictor; added method to create concurrent predictor using thread pool and predictor pool of same size in factory
  * tweaks
This patch is causing problems and will need to be backed out unless it can be fixed: https://travis-ci.org/JohnLangford/vowpal_wabbit#L2107 Testing locally, I can only reproduce it 5% of the time, so this is some form of race condition. @zhilians can you investigate and nail it down?
Taking a look. I did a quick test and couldn't reproduce this error at all on Linux/Mac. I am now trying to replicate the same CI test environment (https://travis-ci.org/JohnLangford/vowpal_wabbit/jobs/238178163/config).
My local repro was:
repeat 100 (mvn test -f java/pom.xml | grep BUILD)
which generated 5 "BUILD FAILURE" events.
…-John
Just some updates. I've been trying to fix the test failures; they seem really consistent, and I can reproduce them every time on Travis CI. Running the tests 100 times on my virtual machine, set up exactly like the Travis CI environment, didn't throw any errors. The only place where a race condition could exist is when using seed_model to create shallow copies of a given VW instance, which might not be thread-safe. So I tried adding a lock on the seed vw instance. The test still unfortunately fails: https://travis-ci.org/zhilians/vowpal_wabbit/builds/238682875 So I think I can rule that part out. Next step is to look into the IllegalArgumentExceptions thrown.
I've pulled this from the master branch now. I would like to get it back in, so if you can debug that would be great.
This PR introduces concurrent interfaces, cross-platform support improvements, and multilabel multiline prediction support to VW's JNI.
* Added a learner/concurrent/ package containing our concurrent components. We need this because the VW learner in C is not thread-safe, so the VW learner in JNI synchronizes on every predict call, which can hamper the throughput of high-volume, low-latency services that need to make predict calls.
* Replaced NativeUtils, which was used to load the JNI lib from the jar, with Java's own library loader, and added a proper OS-dependent target. This is partially in line with New Java build #1193; we will need to merge fixes from New Java build #1193.