JNI concurrent interface extensions #1215

Merged
merged 8 commits into VowpalWabbit:master
Apr 29, 2017

Conversation

zhilians
Contributor

This PR introduces concurrent interfaces, cross-platform support improvements, and multilabel multiline prediction support to VW's JNI.

  1. Added a learner/concurrent/ package containing our concurrent components. We need this because the native VW learner in C is not thread-safe, so the VW learner in JNI synchronizes on every predict call, which can hamper the throughput of high-volume, low-latency services that need to make predict calls (a rough sketch of the pooling idea follows this list).
  2. Added multiline prediction support to the multi-label predictor interface.
  3. Removed NativeUtils, which was used to load the JNI lib from the jar, and added a proper OS-dependent target. This is partially in line with New Java build #1193; we will need to merge the fixes from New Java build #1193.
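
For orientation, here is a minimal sketch of the pooling idea behind item 1. It is illustrative only: PooledPredictor, SerialPredictor, and the method names are hypothetical stand-ins, not the classes actually added by this PR.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Stand-in for the existing single-threaded JNI predictor.
interface SerialPredictor<E, O> {
    O predict(E example);
}

// Hypothetical pooled predictor: each predict call borrows one single-threaded
// learner from a pool, so several callers can predict in parallel instead of
// serializing on a single learner's lock.
public class PooledPredictor<E, O> {
    private final BlockingQueue<SerialPredictor<E, O>> pool; // pre-created learner clones
    private final ExecutorService executor;                  // worker threads running the predicts

    public PooledPredictor(BlockingQueue<SerialPredictor<E, O>> pool, ExecutorService executor) {
        this.pool = pool;
        this.executor = executor;
    }

    public O predict(final E example, long timeoutMillis) throws Exception {
        Future<O> future = executor.submit(new Callable<O>() {
            public O call() throws Exception {
                SerialPredictor<E, O> learner = pool.take(); // borrow a learner
                try {
                    return learner.predict(example);
                } finally {
                    pool.put(learner);                       // always return it to the pool
                }
            }
        });
        return future.get(timeoutMillis, TimeUnit.MILLISECONDS); // caller blocks up to the timeout
    }
}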

zhilians and others added 5 commits March 14, 2017 10:40
…Multilabel Interface to JNI; generated OS dependant lib files; modified pom to include generated lib file in java library path; removed unused variable in multiclassleaner;
…ions in parallel, we introduce a concurrent learner (for multilabel and multiline multiclass learners) that works with a pool of learners to get high throughput for predictions in an online setting.
…_model method instead of initialize. Former method reuses shared variables from the seed learner instance and hence has much less memory footprint in comparison to latter which allocates new memory for all new instances created.
@JohnLangford
Member

@eisber @jmorra can you take a look at this? It seems plausibly quite valuable, but at the same time I'm wondering if we should do more to push functionality into C++ so the Java interface can be slimmer?

@eisber
Collaborator

eisber commented Mar 29, 2017 via email

Collaborator

@jmorra jmorra left a comment

I think a lot of this stuff sounds pretty reasonable. A few concerns.

  1. This won't work with the current language level. And bumping the language level should not be taken lightly.
  2. This seems to also include changes to the build library that I proposed in another PR. In my opinion the changes to the way this is built should not go in here.

JNIEXPORT jlong JNICALL Java_vowpalWabbit_learner_VWLearners_seedVWModel(JNIEnv *env, jclass obj, jlong vwPtr)
{ jlong cloneVwPtr = 0;
try
{ vw* vwInstance = VW::seed_vw_model((vw*)vwPtr, "");
Collaborator

What does seed_vw_model do?

Contributor Author

ref: https://github.com/JohnLangford/vowpal_wabbit/blob/462ed1624967d99a349527edd108a6f62ed92b9a/vowpalwabbit/parse_args.cc#L1400-L1424

It initializes new VW instances using the parameters from an existing instance (the seed); these cloned VW instances share data with the seed instance.

The seed_vw_model method is expected to make the concurrent methods more memory efficient.

import vowpalWabbit.jni.NativeUtils;

import java.io.IOException;
import java.util.Optional;
Collaborator

This won't work. The pom.xml file specifies the language level as Java 6 and this is a Java 8 construct.

Contributor Author

@zhilians zhilians Mar 30, 2017

Good catch.

Optional helped keep the code cleaner and more semantically meaningful; shall we upgrade to Java 8, since Java 6 is no longer updated?

finally {
STATIC_LOCK.unlock();
}
}
}
private static native long initialize(String command);
private static native long seedVWModel(long nativePointer);
Collaborator

What's the difference between seeding and initializing? If you are no longer using initialize that should be removed.

Contributor Author

see comments on seed_vw_model

Let's say you have a VW model of size 1GB and you create 2 instances of the same model using initialize; the memory footprint in Java is then 2 * 1GB. Now if you use seed_vw_model, the internal data structures are shared and hence the memory footprint would still be about 1GB.

We still need initialize to create the first instance. seed_vw_model requires a seed model to clone from.
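
To make the contrast concrete, here is a minimal sketch of the create-once / seed-many pattern, assuming it lives in the same class that declares the initialize and seedVWModel natives quoted above (createSharedInstances and the command string are made up for illustration):

// Hypothetical helper: load the model once with initialize, then create cheap
// clones with seedVWModel. The clones share the seed's internal data, so the
// native footprint stays close to one copy of the model rather than N copies.
static long[] createSharedInstances(String vwCommand, int clones) {
    long[] pointers = new long[clones + 1];
    pointers[0] = initialize(vwCommand);        // e.g. "--quiet -i model.vw"; full ~1GB load
    for (int i = 1; i <= clones; i++) {
        pointers[i] = seedVWModel(pointers[0]); // shares data with the seed instance
    }
    return pointers;
}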

@jmorra
Collaborator

jmorra commented Mar 29, 2017

@deaktator could you comment on this too?

@zhilians
Contributor Author

@jmorra I have removed the usage of Optional, which was the only Java 8 component used. I think it will be beneficial to get this merged in; then we can iterate on longer-term projects like fixing the Java build or building a unified C++ interface.

@deaktator
Collaborator

deaktator commented Apr 21, 2017 via email

@deaktator
Collaborator

deaktator commented Apr 21, 2017 via email

@deaktator
Collaborator

deaktator commented Apr 22, 2017 via email

@deaktator
Collaborator

deaktator commented Apr 23, 2017 via email

@jmorra
Collaborator

jmorra commented Apr 24, 2017

I think all of @deaktator's comments seem reasonable. @JohnLangford @zhilians what do you think?

@atulvkamat

atulvkamat commented Apr 24, 2017

I wanted to get the motivation for this PR. Is the main motivation improving throughput, or simply providing timeout semantics in the prediction methods?
Yes, the motivation is throughput for large models (~1GB or greater in size). In fact, we tested this on a 10GB model.

If it's throughput, I would suggest adding JMH benchmarks with the following:
Yes, we will look into publishing these benchmark results. It requires us to carry out that work, so it will take some time.

If the predict method is always called by the same thread, won't this likely perform worse than a normal model? By calling the following, you block the calling thread, right? Am I wrong?

O result = future.get(timeoutInMillis, TimeUnit.MILLISECONDS);

If so, won't your throughput be no better than serial model performance?

The assumption is that the calling thread itself is not necessarily the main thread, which is the case when you use a webserver where every REST API request is assigned a dedicated thread.

Why is your asynchronous boundary inside predict? Why block in the predict method vs. returning a Future? If you decide to return a Future, consider Guava ListenableFutures so you can potentially avoid blocking altogether by providing a callback mechanism in the model.

Same as above.

Why provide a LinkedBlockingQueue and an ExecutorService to the multithreaded learner constructor? If the executor service doesn't have the same number of threads as the size of the queue, then you'll be wasting resources. This is especially true when there are fewer threads than elements in the queue: memory will be wasted. In the other case, threads will be unnecessarily created and used without additional throughput, right?

I don't completely follow this comment. As a user of this functionality, one can decide how many threads to dedicate to prediction. A good rule of thumb is to keep it close to (but less than) the number of cores in the CPU, since predict calls are pure CPU work with no IO. The LinkedBlockingQueue handles scenarios where the application queues predict requests faster than the time it takes to make the actual predictions.
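
A small sketch of that sizing rule of thumb (illustrative only; nothing here is part of the PR's API):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Keep the number of prediction worker threads close to, but below, the number
// of cores, since predict calls are CPU-bound; the blocking queue of learners
// is sized to match, and simply absorbs bursts of incoming predict requests.
public final class PredictionPoolSizing {
    public static int recommendedWorkers() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Math.max(1, cores - 1); // leave headroom for caller/application threads
    }

    public static ExecutorService newPredictionExecutor() {
        return Executors.newFixedThreadPool(recommendedWorkers());
    }
}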

I would update the docs if necessary to indicate that concurrent models should be read-only, don't-learn, or whatever terminology you like. This is so that you don't get into a situation where the concurrent model is not idempotent.

That is correct. We need to update the docs and potentially disable learn by default for such read-only, predict-only learners.
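
One possible shape for the "disable learn by default" idea (purely a sketch; ReadOnlyLearner and the Learner interface below are hypothetical, not part of this PR):

// Stand-in for whatever learner type ends up being pooled.
interface Learner<E, O> {
    O predict(E example);
    void learn(E example);
}

// A read-only wrapper: predictions delegate to the wrapped learner, while learn
// fails fast so accidental training against a shared, pooled model surfaces
// immediately instead of silently mutating shared state.
public final class ReadOnlyLearner<E, O> {
    private final Learner<E, O> delegate;

    public ReadOnlyLearner(Learner<E, O> delegate) {
        this.delegate = delegate;
    }

    public O predict(E example) {
        return delegate.predict(example);
    }

    public void learn(E example) {
        throw new UnsupportedOperationException(
            "this learner is part of a read-only prediction pool; learning is disabled");
    }
}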

Speaking of idempotency, I would add a test to ensure idempotency by predicting many times based on the same input. Stick the results in a Set and test that the size of the set is 1.

If you choose not to change the implementation, I would consider adding an AtomicLong containing the number of interrupted predictions.

I was wondering about the shared state in VW. Is this known / guaranteed to be thread-safe? If not, the cloning-the-seed-learner approach could result in really nasty bugs that will only be observable under heavy load and will likely manifest in ways that make the problem hard to diagnose. The worst case would be a corruption that doesn't result in errors but causes predictions to be "unstable" / nonsensical.

testMultilabelsConcurrency is doing that.
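
For reference, the Set-based check suggested above could look roughly like this (a sketch only; the PR's actual coverage is testMultilabelsConcurrency, and the Predictor interface here is a hypothetical stand-in for the learner under test):

import java.util.HashSet;
import java.util.Set;

public final class IdempotencyCheck {
    /** Stand-in for whichever (concurrent) predictor is under test. */
    public interface Predictor {
        String predict(String example) throws Exception;
    }

    // Predict the same example many times; if predictions are idempotent,
    // the set of distinct results must contain exactly one element.
    public static void assertIdempotent(Predictor predictor, String example, int repetitions) throws Exception {
        Set<String> results = new HashSet<String>();
        for (int i = 0; i < repetitions; i++) {
            results.add(predictor.predict(example));
        }
        if (results.size() != 1) {
            throw new AssertionError("expected 1 distinct prediction, got " + results.size());
        }
    }
}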

Also, seed_vw_model is already being used in the C# bindings, so we assume that the method works as advertised, and our tests indicate that this is so.

https://github.com/JohnLangford/vowpal_wabbit/blob/28c080a7df2b3c9a01f056f0957d222132997f46/cs/cli/vw_base.cpp#L66

@atulvkamat

Also, if we can guarantee that models will only ever be called on specific threads, maybe we should have non-locking variants of the serial models to be used exclusively inside the concurrent models. This will avoid locking and will use the concurrency mechanisms in the concurrent models to ensure thread safety.

We thought about it. The lock call itself doesn't contribute significantly to the latency; it's only when a lot of consumers block on the lock due to contention that it starts contributing significantly to the overall latency.

@atulvkamat

Threadlocal model with executor service backed by a fixed-size thread pool related to the desired concurrency level

I agree; that was part of our initial plan until one of our colleagues made us aware of some weird issues ThreadLocal can cause. Besides, for our 10GB model, which took about 30-60 seconds to initialize, we had to use a not-so-robust approach to initialize the learners before they could be used for prediction when using thread-locals.
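
For comparison, a rough sketch of the ThreadLocal alternative mentioned in the quote (illustrative only; SerialPredictor and Seeder are hypothetical). The drawback discussed above shows up in initialValue(): for a model that takes 30-60 seconds to load or seed, every executor thread pays that cost lazily on its first prediction unless the pool is warmed up by hand.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class ThreadLocalPredictorSketch {
    /** Stand-in for the single-threaded predictor. */
    public interface SerialPredictor {
        String predict(String example);
    }

    /** Stand-in for something that seeds a clone from a base model. */
    public interface Seeder {
        SerialPredictor seedClone();
    }

    // Each worker thread lazily gets its own learner clone on first use.
    public static ThreadLocal<SerialPredictor> perThreadLearner(final Seeder seeder) {
        return new ThreadLocal<SerialPredictor>() {
            @Override
            protected SerialPredictor initialValue() {
                return seeder.seedClone(); // runs once per executor thread, on first access
            }
        };
    }

    // Fixed-size pool sized to the desired concurrency level, per the quote.
    public static ExecutorService predictionPool(int concurrencyLevel) {
        return Executors.newFixedThreadPool(concurrencyLevel);
    }
}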

@atulvkamat

@deaktator can you tell us the size of the model you used? From the latency numbers you have published, it seems to be even less than 1 ms. We used a 10GB model that produced a minimum latency of around 12 ms.

@zhilians
Contributor Author

zhilians commented Apr 25, 2017

Thanks @deaktator for the detailed review. Adding to @atulvkamat 's comments:

I wanted to get the motivation for this PR. Is the main motivation improving throughput, or simply providing timeout semantics in the prediction methods? If it's throughput, I would suggest adding JMH benchmarks with the following:

  1. Your code
  2. Threadlocal model with executor service backed by fixed size thread pool related to the desired concurrency level
  3. Serial execution of single thread model

The main focus of this PR is to increase throughput in a high-concurrency container without multiplying the memory usage. I think it's not a bad idea to run benchmark tests for the different implementations over different model sizes. Will schedule time to do so this week.

Also, on closer inspection, you are polling the model queue inside the submitted Callable passed to the ExecutorService. This will add to the queue in the ExecutorService. Therefore the same issue would exist in either your or the threadlocal implementation.

I understand that you are concerned about the manipulation of the LinkedBlockingQueue within the Callable submitted to the ExecutorService. It's true that it will increase the latency of the operation in the ExecutorService, but we are only talking about a fraction of the prediction time here, depending on the model size and the actual predictor. The model queue is only touched when retrieving and returning a model, which should be done promptly.

If the predict method is always called by the same thread, won't this likely perform worse than a normal model? By calling the following, you block the calling thread, right? Am I wrong?

You are correct that the caller will be blocked waiting for the result or a timeout. But we picture this being used mostly in a multi-consumer environment where multiple consumers are trying to get predictions from the same model; they will benefit from having more than one predictor available, in which case the whole system should be sped up roughly linearly.

If the use case really is predicting from a single thread, one should rather use the original predictor implementation, which will still be available.

Why provide a LinkedBlockingQueue and an ExecutorService to the multithreaded learner constructor? If the executor service doesn't have the same number of threads as the size of the queue, then you'll be wasting resources. This is especially true when there are fewer threads than elements in the queue: memory will be wasted. In the other case, threads will be unnecessarily created and used without additional throughput, right?

It's a good point. The motivation behind this constructor was to make it flexible for people who want to reuse a shared executor service across different concurrent predictor instances. You are right that memory will be wasted if we have more predictor instances than threads in the executor, but not badly so, since we use seed_vw_model, which greatly reduces the memory footprint of new VW instances. It's up to the user to decide and control the trade-off between the number of threads and the number of predictors.

To simplify the API, I will add another constructor that creates an executor and a predictor queue of the same size. Will update the Javadoc as well.

@deaktator
Collaborator

deaktator commented Apr 25, 2017 via email

@zhilians
Contributor Author

zhilians commented Apr 27, 2017

What I am concerned with is that the API is dictated by your use case. The API (signature, not documentation) doesn't make it clear to less astute users the requirement that it must be called in a multi-threaded environment. It's obvious in hindsight, but I had to read the code before it clicked.

We certainly don't want to impose our use case on all users of the API. Our contribution is focused on providing another way to speed up predictions in a concurrent environment. IMHO, when the user creates a concurrent learner from the factory, it's immediately clear that they are getting a pool of predictors rather than a parallelized predictor that could be sped up even in a single-threaded environment. We can certainly work on the documentation a bit to make that clearer.

If instead you returned Futures or, even better, ListenableFutures, I feel like you wouldn't be imposing your use case on all users of the API. I like the "Your Server as a Function" paper by Marius Eriksen (https://monkey.org/~marius/funsrv.pdf) and the reactive paradigm so I'm likely biased toward Futures that are asynchronously transformable (and non-blocking) to other Futures.

What do you think about the possibility of returning futures?

One other thought. I would make a constructor that takes a timeout and have two predict methods: one that takes a custom timeout and one that doesn't take a timeout and uses the one from the constructor parameter. Is this dumb? I feel like it is more uniform with the current serial models.

In addition to our timeboxed predict method, I've added another public Future<O> predict(final E example) to provide flexibility for users who want to react to the prediction results asynchronously.

On the ListenableFutures idea, I personally think it's a nice extension to Future, but it's out of scope for this PR because we are talking about introducing a third-party dependency into this relatively low-level, barebones API. That seems a bit pricey to me for just one feature.

I think the timeout parameter and the Future<O> return type make it clear that these are methods of a concurrent predictor and should be used accordingly, so I am reluctant to match the API of the current serial models.
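
Putting the two entry points side by side, the surface being discussed looks roughly like this (a sketch, not the PR's exact class or signatures): a blocking predict with an explicit timeout, and a non-blocking predict that returns the Future directly.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public abstract class ConcurrentPredictorShape<E, O> {
    private final ExecutorService executor;

    protected ConcurrentPredictorShape(ExecutorService executor) {
        this.executor = executor;
    }

    /** Runs on a borrowed, pooled learner; supplied by the concrete predictor. */
    protected abstract O predictOnPooledLearner(E example) throws Exception;

    /** Non-blocking: the caller decides when and how to wait on the Future. */
    public Future<O> predict(final E example) {
        return executor.submit(new Callable<O>() {
            public O call() throws Exception {
                return predictOnPooledLearner(example);
            }
        });
    }

    /** Blocking, with a caller-supplied timeout. */
    public O predict(E example, long timeout, TimeUnit unit) throws Exception {
        return predict(example).get(timeout, unit);
    }
}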

… method to create concurrent predictor using thread pool and predictor pool of same size in factory;
@zhilians
Contributor Author

Thank you for another in-depth review.

PR comments addressed in 4703f57

I strongly believe that we should merge this in as soon as possible, as it provides an additional simple and memory-efficient wrapper for speeding up concurrent prediction tasks. We could ship it now if there's no major flaw in the implementation and then work on feedback from the community regarding this new API.

@jmorra @deaktator @atulvkamat Do you have other concerns regarding the core functionalities?

@atulvkamat

@zhilians Change looks good to me.

@deaktator
Collaborator

deaktator commented Apr 27, 2017 via email

@JohnLangford
Member

JohnLangford commented Apr 28, 2017 via email

@deaktator
Collaborator

deaktator commented Apr 29, 2017 via email

Contributor

@jon-morra-zefr jon-morra-zefr left a comment

Looks good.

Collaborator

@jmorra jmorra left a comment

Looks good

@JohnLangford JohnLangford merged commit ab8f09c into VowpalWabbit:master Apr 29, 2017
@JohnLangford
Member

Merged, thank you everyone :-)

JohnLangford pushed a commit that referenced this pull request May 3, 2017
* Add prerequisites: zlib-devel (or zlib1g-dev) (#1226)

* Added zlib-devel (or zlib1g-dev) to Prerequisites

* Update README.md

* do not use deprecated sklearn module (#1223)

* Enable the C wrapper to extract model to memory and initialize model from memory (#1189)

* extract model to memory and initialize model from memory

* add get_confidence C wrapper function

* JNI concurrent interface extensions (#1215)

* replaced NativeUtils with Java's own library loader; Added MulticlassMultilabel Interface to JNI; generated OS dependant lib files; modified pom to include generated lib file in java library path; removed unused variable in multiclassleaner;

* Since an instance of vw model is not thread safe for multiple predictions in parallel, we introduce a concurrent learner (for multilabel and multiline multiclass learners) that works with a pool of learners to get high throughput for predictions in an online setting.

* Learner pool now creates vw instances of the same model using seed_vw_model method instead of initialize. Former method reuses shared variables from the seed learner instance and hence has much less memory footprint in comparison to latter which allocates new memory for all new instances created.

* Fixed space alignment issue and changed getLearner api to return Optional to stop any unwanted NPEs.

* Removed TODO as we now use seed_vw_model to instantiate multiple learner instances of the same model.

* remove java8 components

* added method to return Future in abstract concurrent predictor; added method to create concurrent predictor using thread pool and predictor pool of same size in factory;

* tweaks
@JohnLangford
Member

This patch is causing problems and will need to be backed out unless it can be fixed. https://travis-ci.org/JohnLangford/vowpal_wabbit#L2107

Testing locally, I can only reproduce it 5% of the time, so this is some form of race condition.

@zhilians can you investigate and nail it down?

@zhilians
Contributor Author

zhilians commented Jun 1, 2017

Taking a look. I did a quick test and couldn't reproduce this error at all on Linux/Mac. I am now trying to replicate the same CI test environment (https://travis-ci.org/JohnLangford/vowpal_wabbit/jobs/238178163/config).

@JohnLangford
Member

JohnLangford commented Jun 1, 2017 via email

@zhilians
Contributor Author

zhilians commented Jun 5, 2017

Just some updates.

I've been trying to fix the test failures; they seem really consistent and reproduce every time on Travis CI. Running the tests 100 times on my virtual machine, set up exactly like the Travis CI environment, didn't throw any errors.

The only place where a race condition could exist is when using seed_vw_model to create shallow copies of a given VW instance, which might not be thread-safe. So I tried adding a lock on the seed VW instance.

The test still unfortunately fails: https://travis-ci.org/zhilians/vowpal_wabbit/builds/238682875

So I think I can rule that part out.

The next step is to look into the IllegalArgumentExceptions being thrown.

@JohnLangford
Member

I've pulled this from the master branch now. I would like to get it back in, so if you can debug that would be great.
