Added support for parallel executions. Also fixed a bug in enhanced semantic roles. #422
base: master
Conversation
@kwalcock: can you please take a look at this PR? I tried to implement transparent support for parallel executions. You may start with the Metal.getInferenceModel() method, and double check the Layers.clone() method next. |
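As a sketch of the idea behind Metal.getInferenceModel() (everything here except Layers.clone() is an invented name used only for illustration), each worker thread can be handed its own cloned copy of the model, so parallel executions never share mutable DyNet state:

```scala
// Illustrative only: not the actual Metal.getInferenceModel() implementation.
// Each worker thread gets its own cloned copy of the model, created lazily on
// first access, so threads never touch each other's state.
class PerThreadProvider[T](makeCopy: () => T) {
  private val local = new ThreadLocal[T] {
    override def initialValue(): T = makeCopy()
  }

  // Returns this thread's private copy.
  def get: T = local.get()
}

// Hypothetical usage: new PerThreadProvider(() => layers.map(_.clone()))
```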
Thank you for the use case. I keep updating my example with your good ideas, and the documentation is still catching up. In a later version of fatdynet than the one you probably have, one should be able to do

def cloneBuilder(builder: RnnBuilder): RnnBuilder = {
  val newBuilder = builder.clone()
  newBuilder.newGraph()
  newBuilder
}

The newBuilder.newGraph() call is needed or else a runtime exception is thrown. I'm still working on the theory as to why. The code looks good. I will try out Metal.main as soon as I can. |
Thanks @kwalcock! |
Being dependent on a locally published snapshot, it's not going to work on Jenkins. I will test locally for now. So far I've made just one change to the C++ code that would necessitate recompiling. Things can't be removed from Maven Central, so I'm not really enthusiastic about sending it (fatdynet) there yet. |
The tests do still pass locally. If I understand correctly, the parallel ability is not yet tested. |
Not yet. A test of the parallel execution would be awesome :) |
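A test along these lines might annotate the same texts serially and in parallel and compare the results. This is only a sketch, assuming Scala 2.12-style parallel collections and the standard annotate(text) entry point; it is not the test that was eventually written:

```scala
import org.clulab.processors.clu.CluProcessor

// Sketch of a parallel-consistency check: annotate the same texts serially and
// in parallel, then verify the two runs agree. Not the actual test from the PR.
object ParallelConsistencySketch extends App {
  val proc = new CluProcessor()
  val texts = Seq("John eats cake.", "Mary reads a book.", "The dog barks loudly.")

  val serial = texts.map(text => proc.annotate(text))
  val parallel = texts.par.map(text => proc.annotate(text)).seq

  // A minimal notion of "same answer": identical sentence and token counts.
  serial.zip(parallel).foreach { case (s, p) =>
    assert(s.sentences.length == p.sentences.length)
    assert(s.sentences.map(_.words.length).sameElements(p.sentences.map(_.words.length)))
  }
  println("Serial and parallel runs agree.")
}
```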
In cases where clone calls copy without changing any of the values, like

override def clone(): GreedyForwardLayer = copy()

one could just return the original:

override def clone(): GreedyForwardLayer = this

It would save some memory. |
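To illustrate the point with an invented, simplified layer (these are not the real GreedyForwardLayer fields): when the instance is immutable, copy() with no arguments only allocates a structurally identical object, so returning this is equivalent and cheaper.

```scala
// Simplified, invented layer used only to illustrate the clone()/copy() point.
case class GreedyLayerSketch(inputSize: Int, outputSize: Int) {
  override def clone(): GreedyLayerSketch = this // instead of copy()
}
```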
Someone left helpful hints in the C++ code that explain how newGraph fits in:
Not following these rules results in messages like
|
Do away with a few casts
Thanks for catching the clone() issue. I pushed the fix. |
ViterbiForwardLayer is another opportunity. That's the only other I noticed. |
another good catch. fixed and pushed. Thanks! |
For the ParallelProcessorExample example, I get Serial: 57 seconds, Parallel: 17 seconds, on 8 threads. |
Thanks! What about FastNLPProcessor? And FastNLPProcessorWithSemanticRoles?
|
Thanks for the reminder. I was just curious about whether it was working or not. Here are the times for the different processors on a collection of just 10 documents. Each time, the serial and parallel versions got the same answer, but the outputs of the various processors always differed:
|
Thanks! I think the correct comparison is between CluProcessor and FastNLPProcessorWithSemanticRoles. Good to see the times are about the same. |
So, ready to release fatdynet, and then merge in master for processors? |
I checked all these methods, and none are synchronized. I do use one synchronized block in ConstEmbeddingsGlove.apply(), but that one is called just once. I have an idea for how to debug this issue, based on the observation that, if we sum up the runtimes of the CluProcessor methods during the parallel execution, the sum should not be higher than the sequential times. If it is, it means the threads are blocking each other somewhere, and we have to dig deeper. Thanks!! |
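A minimal sketch of that measurement idea (MethodTimer is an invented helper, not part of processors) would accumulate per-method wall time across all threads:

```scala
import java.util.concurrent.atomic.AtomicLong

// Accumulates the time spent inside a block, summed across all threads.
object MethodTimer {
  private val totalNanos = new AtomicLong(0L)

  def timed[T](block: => T): T = {
    val start = System.nanoTime()
    try block
    finally totalNanos.addAndGet(System.nanoTime() - start)
  }

  def totalSeconds: Double = totalNanos.get() / 1e9
}

// Hypothetical use inside a CluProcessor method:
//   val tags = MethodTimer.timed { tagSentence(sentence) }
// If the parallel total is much higher than the sequential total, the threads
// are blocking each other somewhere.
```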
I uploaded some Flight Recorder information, but so far it has not been especially helpful to me: https://drive.google.com/file/d/1oviO6z2RgtjDPa5tRYFLOEDrMHVErSoD/view?usp=sharing |
Hmm, good point. Maybe we should try 8 files of the same size, so we know that the maximum speedup is 8? If the current files are dominated by one big one, parallelism is limited... |
FWIW I'm reading up on Scala parallelization at places like https://docs.scala-lang.org/overviews/parallel-collections/overview.html and https://docs.scala-lang.org/overviews/parallel-collections/performance.html#how-big-should-a-collection-be-to-go-parallel. I've been using the easiest thing that works/compiles, nothing that was based on actual effectiveness. |
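For reference, a minimal sketch of the "easiest thing that works" with Scala 2.12-style parallel collections, including pinning the worker count to the 8 threads used in the timings above (the directory name and annotateFile are placeholders):

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

object ParallelFilesSketch extends App {
  // Placeholder for the real per-file work (e.g., annotating a document).
  def annotateFile(file: java.io.File): Unit = println(s"annotating ${file.getName}")

  // "docs" is a placeholder directory name.
  val files = new java.io.File("docs").listFiles().toSeq.par
  // Pin the pool to 8 workers, matching the thread count in the timings above.
  files.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
  files.foreach(annotateFile)
}
```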
I suspect you called .par on a list of files, which is OK. But because the whole collection is waited on at the end, all the short jobs have to wait for the big one (if it exists). |
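One possible mitigation, not something from this PR: submit the largest files first to an explicit fixed-size pool, so a single big document starts early instead of becoming the tail every idle worker waits on. A hedged sketch with placeholder names:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}

object LargestFirstSketch extends App {
  // Placeholder for the real per-file work.
  def annotateFile(file: java.io.File): Unit = println(s"annotating ${file.getName}")

  implicit val ec: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

  // Submit the biggest files first; tasks are pulled off the queue as workers
  // free up. "docs" is a placeholder directory name.
  val files = new java.io.File("docs").listFiles().toSeq.sortBy(-_.length())
  val futures = files.map(file => Future(annotateFile(file)))

  Await.result(Future.sequence(futures), Duration.Inf)
  ec.shutdown()
}
```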
stating the obvious here :) |
Thanks! Have you tried using the same number of threads, but an increasing amount of RAM available to DyNet? |
No. One thing that had occurred to me is that the memory in DyNet could be fragmenting so that malloc takes longer to find an appropriate space and free takes longer to combine regions. I doubt that's the problem, though. I'll see if more memory can bend the line up. |
@kwalcock: In any case, should we merge this into master? I am not sure we should wait. |
While observing execution related to that crash issue, I noticed in the IntelliJ debugger a lot of threads in the MONITOR state. Something isn't quite right, and it seems to be a clue. |
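For what it's worth, a small sketch of one way to check that observation programmatically, listing threads blocked waiting to enter a monitor (the names here are illustrative, and Scala 2.12-style converters are assumed):

```scala
import scala.collection.JavaConverters._

// Prints the threads that are currently blocked on a monitor, roughly what the
// IntelliJ debugger reports as the MONITOR state.
object BlockedThreadReport {
  def report(): Unit = {
    Thread.getAllStackTraces.asScala.keys
      .filter(_.getState == Thread.State.BLOCKED)
      .foreach(thread => println(s"${thread.getName} is blocked on a monitor"))
  }
}
```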
Good finding. So, that means we did miss a sync somewhere in the Java/Scala code, no?
|
It looks like a false alarm. It went away on the next run. I think something was stale. Still making sure. |
I've been trying to run this in an Ubuntu 18 virtual machine with 16 GB of memory. It will run serially but not in parallel. The error message seems almost unrelated, but it must be some side effect of the memory shortage.
This doesn't look familiar, does it? |
It's probably coming from here: But I have no idea why... |
The test program I'm running in processors seems to max out at a 6x speedup or so (graph above). I wanted to make sure that isn't some general limit based on something like the memory bus bandwidth in the servers. This memory intensive C++ program was able to get 20x. I'm now suspicious of the malloc that is part of the forward pass in dynet and trying to get a profiler to show some insight. |
This is encouraging! Are you running the C++ and Scala tests on the same machine? |
The two graphs above are both from Jenny. I'm not able to do the C-level profiling there, though, and that's the reason for yesterday's failed attempt to get processors to run locally on Ubuntu 18. It was probably just something I ate/installed. Processors just started working on Ubuntu 20, though, so I'll see if the running dynet can explain itself. It wasn't anything obvious to me at the Java level. |
BTW the graph above showing the 20x speedup is for a C++ program unrelated to Dynet, and that was probably responsible for the confusion this morning. I was about to compare it to speedup for a C++ Dynet program and may see something closer to the Java 6x speedup than to this 20x. That would mean that the problem isn't with Java but is already in the C++ version, possibly in the synchronized memory allocation. Details would be investigated initially with perf tools that aren't installed for the servers, but they are running locally now. |
Got it. Thank you!
|
This PR will be abandoned when it is finally replaced by updated parallel processing code. |
@MihaiSurdeanu, there are graphs for TorchScript multi-threaded performance at kwalcock/torchscript#1.