First draft of pure-scalding memory backend #1697
Conversation
cc @ianoc
```scala
def loop(): R = {
  val init = ref.get
  val (next, res) = fn(init)
  if (ref.compareAndSet(init, next)) res
```
Just to clarify: this is like STM, where you only apply the change if the conditions that held when `fn` was called still hold? Could this loop forever if another thread consistently gets scheduled between the read and the write?
Yes. But I don't think that can happen in practice: we only mutate the box when jobs finish, and only a finite number of jobs run at a time, so it is exponentially unlikely that a thread spins forever.
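For readers following along, here is a minimal, self-contained sketch of the compare-and-set retry pattern under discussion. The `update` signature and the `else loop()` branch are assumptions inferred from the quoted lines and the `@annotation.tailrec` annotation shown later in this thread:

```scala
import java.util.concurrent.atomic.AtomicReference

// Sketch of the lock-free box discussed above. `update` recomputes `fn`
// against the latest value and retries until compareAndSet succeeds.
class AtomicBox[T](initValue: T) {
  private[this] val ref = new AtomicReference[T](initValue)

  def update[R](fn: T => (T, R)): R = {
    @annotation.tailrec
    def loop(): R = {
      val init = ref.get
      val (next, res) = fn(init)
      if (ref.compareAndSet(init, next)) res
      else loop() // another thread won the race; retry against the new value
    }
    loop()
  }

  def get(): T = ref.get
}
```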
```scala
def openForRead(config: Config, tap: Tap[_, _, _]): TupleEntryIterator = ???
def fileExists(filename: String): Boolean = ???
def newFlowConnector(props: Config): FlowConnector = ???
```
Are these just not implemented yet? Is there an exception we could throw instead of leaving them undefined, or is this the standard in Scala?
These cannot be implemented because this platform does not know about cascading. Ultimately I'd like to remove these methods from `Mode` and move them down to a class like `CascadingBackedMode`, but I want to do that with your help, since you likely have code somewhere that assumes `Mode` has these methods.
@johnynek could you add a comment or a specific exception with this explanation?
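As a sketch of what that could look like (the message wording is illustrative, not the PR's actual text):

```scala
// Sketch: fail with an exception that carries the explanation, instead of
// a bare `???`. The message wording here is illustrative.
def openForRead(config: Config, tap: Tap[_, _, _]): TupleEntryIterator =
  throw new UnsupportedOperationException(
    "MemoryMode does not use cascading, so openForRead cannot be implemented " +
      "here; it should eventually move to a cascading-backed Mode.")
```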
```scala
  def get(): T = ref.get
}

final class MemoryMode private (srcs: Map[TypedSource[_], Iterable[_]], sinks: Map[TypedSink[_], AtomicBox[Option[Iterable[_]]]]) extends Mode {
```
Will this `Map` work if we read from and write to the same path at different points during the flow?
Yes, because the sinks and sources are kept separate. On this platform you can't make a source out of anything except something that is already a list.
```scala
def result(implicit cec: ConcurrentExecutionContext): Future[ArrayBuffer[(K, V2)]] =
  input.result.map { kvs =>
    val m = MMap[K, ArrayList[V1]]()
```
Can we rename this for clarity, to something like `valuesGroupedByKey`?
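For illustration, a sketch of the grouping step with that rename, wrapped in a standalone method (the loop body is an assumption, not the PR's code):

```scala
import java.util.ArrayList
import scala.collection.mutable.{ Map => MMap }

// Sketch: group (K, V) pairs by key, with a descriptive name for the
// accumulator as suggested above.
def groupByKey[K, V](kvs: TraversableOnce[(K, V)]): MMap[K, ArrayList[V]] = {
  val valuesGroupedByKey = MMap[K, ArrayList[V]]()
  kvs.foreach { case (k, v) =>
    valuesGroupedByKey.getOrElseUpdate(k, new ArrayList[V]()).add(v)
  }
  valuesGroupedByKey
}
```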
```scala
object MemoryPlanner {

  sealed trait Op[+O] {
    def result(implicit cec: ConcurrentExecutionContext): Future[ArrayBuffer[_ <: O]]
```
Would this new mode be used only for testing? I think a simpler synchronous implementation without parallelism would suffice for that.
I want to make this production grade as we move forward, and I don't think that will be very hard. So I don't want to back out of the design of using `Future`s, since a better system will have more explicit parallelism.
Fair enough. I think `Future` might not be the best tool for parallelism given its high execution overhead, but it's fine to use `Future` for now and revisit the decision if necessary.
```scala
case class Concat[O](left: Op[O], right: Op[O]) extends Op[O] {
  def result(implicit cec: ConcurrentExecutionContext) = {
    // start both futures in parallel
```
Why not inline these? They will run in parallel regardless.
That's a stale comment from when I used a `for { ... }` comprehension, which is not parallel unless the futures are started first. Now that this uses `zip`, it doesn't matter.
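For anyone unfamiliar with the subtlety: in a for-comprehension the second future is not created until the first completes, whereas `zip` (or binding the futures to vals first) lets them run concurrently. A sketch with illustrative names:

```scala
import scala.concurrent.{ ExecutionContext, Future }

// Sequential: `fb` is only evaluated inside fa's flatMap, i.e. after fa
// completes, so the two computations do not overlap.
def sequential[A, B](fa: => Future[A], fb: => Future[B])(implicit ec: ExecutionContext): Future[(A, B)] =
  for { a <- fa; b <- fb } yield (a, b)

// Parallel: both futures are started before being combined with zip.
def parallel[A, B](fa: => Future[A], fb: => Future[B])(implicit ec: ExecutionContext): Future[(A, B)] = {
  val ra = fa
  val rb = fb
  ra.zip(rb)
}
```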
```scala
sealed trait Op[+O] {
  def result(implicit cec: ConcurrentExecutionContext): Future[ArrayBuffer[_ <: O]]

  def flatMap[O1](fn: O => TraversableOnce[O1]): Op[O1] =
```
I find the naming here confusing. `flatMap` indicates a monad, but the function doesn't return a monad instance. Also, it's strange that the implementation creates a `Map` for a `flatMap`. Maybe the method could be called `mapValues` and the implementation built on top of `mapAll`?
I hear you on `flatMap`. I can rename it `concatMap`, but we used `flatMap` in scalding to make it a bit easier for Scala newcomers (it is like `flatMap` on `List`, in that `List` accepts a similarly broad return type). `TypedPipe[T]` is definitely not a monad (it is Applicative!). But I can change the name here.
thanks!
```scala
  def flatMap[O1](fn: O => TraversableOnce[O1]): Op[O1] =
    Op.Map(this, fn)

  def mapAll[O1 >: O, O2](fn: IndexedSeq[O1] => ArrayBuffer[O2]): Op[O2] =
```
wdyt about `transform` instead of `mapAll`?
sure.
```scala
case f@MapValues(_, _) =>
  def go[K, V, U](node: MapValues[K, V, U]) = {
    // don't capture node, which is a TypedPipe, which we avoid serializing
```
unnecessary?
```scala
case Mapped(input, fn) =>
  val (m1, op) = plan(m, input)
  (m1, op.flatMap { t => fn(t) :: Nil })
```
It might be worth adding a `map` method to avoid creating one list for each element.
+1.
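A sketch of such a `map` on `Op`, built on the `mapAll`/`transform` combinator quoted earlier (the exact implementation is an assumption):

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch: transform the whole buffer in one pass instead of allocating a
// one-element list per record via `fn(t) :: Nil`. This would live on Op[+O].
def map[O2](fn: O => O2): Op[O2] =
  mapAll[O, O2] { elems =>
    val out = new ArrayBuffer[O2](elems.size)
    elems.foreach { e => out += fn(e) }
    out
  }
```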
```scala
(m, Op.Source({ cec =>
  mem.readSource(src) match {
    case Some(iter) => Future.successful(iter)
    case None => Future.failed(new Exception(s"Source: $src not wired"))
```
wdyt about explaining how the user can fix the error in the exception message?
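For example, something along these lines (the wiring instructions in the message are an assumption about the `MemoryMode` API):

```scala
// Sketch: make the failure actionable by telling the user how to wire
// the source. The exact fix described in the message is illustrative.
case None =>
  Future.failed(new Exception(
    s"Source: $src not wired. Wire an in-memory Iterable to this source " +
      "on the MemoryMode before running the Execution."))
```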
```scala
      go(imr)
  }

  case class State(
```
Could it be `private[this]`?
yes.
```scala
    forced: Map[TypedPipe[_], Future[TypedPipe[_]]]
  )

  val state = new AtomicBox[State](State(0, MemoryPlanner.Memo.empty, Map.empty))
```
Could it be `private[this]`?
yes.
Will send an update addressing these comments. Thank you for taking the time to look.
Really cool stuff. It would be nice to add more scenarios to the memory test, but we can follow up on that in a future PR.
```scala
def newWriter(): Writer =
  new MemoryWriter(this)

def openForRead(config: Config, tap: Tap[_, _, _]): TupleEntryIterator = ???
```
Throw an exception instead? This will evaluate to a `NotImplementedError`, which might lead the caller/user to think it's an implementation bug.
```scala
@annotation.tailrec
def loop(): R = {
  val init = ref.get
  val (next, res) = fn(init)
```
Do we need to worry about `fn` being computationally expensive? We could call it once and cache the value in the enclosing method if that's the case.
If I understood the purpose of this class correctly, caching isn't possible: `fn` depends on the current value and must be recomputed on each retry (the value will be different each time). @johnynek, maybe renaming `loop` to `retry` would be clearer?
```scala
sealed trait Op[+O] {
  def result(implicit cec: ConcurrentExecutionContext): Future[ArrayBuffer[_ <: O]]

  def flatMap[O1](fn: O => TraversableOnce[O1]): Op[O1] =
```
Could we name the generic type something other than `O1`?
```scala
  def flatMap[O1](fn: O => TraversableOnce[O1]): Op[O1] =
    Op.Map(this, fn)

  def mapAll[O1 >: O, O2](fn: IndexedSeq[O1] => ArrayBuffer[O2]): Op[O2] =
```
Something other than `O1`, `O2` here as well?
```scala
  }
  sum(slk)

case tp@TrappedPipe(_, _, _) => ???
```
Throw a not-yet-implemented exception?
`???` means not implemented.
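For reference, `???` comes from `scala.Predef`, defined essentially as:

```scala
// Evaluating `???` throws, so any code path reaching these cases
// fails fast at runtime rather than returning a bogus value.
def ??? : Nothing = throw new NotImplementedError
```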
```scala
case WithDescriptionTypedPipe(pipe, description, dedup) =>
  plan(m, pipe)

case WithOnComplete(pipe, fn) => ???
```
here too?
same.
```scala
 * with multi-way or we need to keep
 * the decomposed series of joins
 */
???
```
Should this be a TODO?
Okay, sorry for the delay. Can you all take another look? As you know, this is a long line of changes and I am trying to keep each one somewhat digestible (shooting for ~400 lines of diff). This one is slightly longer, so I'm hoping we can address any outstanding issues in a follow-up. I'd love to get the optimizations in place so we can think about releasing scalding 0.18 with this change to typed pipe. In fact, this memory platform is just a proof of concept that you can run without cascading. We can polish it and make it as nice as we like, but the main purpose is to have a realistic example proving that the API basically works, without getting into the weeds of Spark or Flink.
Looks good to me. It seems the CI build has been hitting the 50-minute timeout on the hadoop tests (noticed that on #1700 as well). We'll need to either bump the timeout or break out the tests in that suite.
LGTM
This follows up the thread of work leading to #1682
This gives an in-memory backend without using cascading (which for the basic tests is MUCH faster).
This is not a production-quality backend yet.
The main point is to exercise the execution API without cascading in the loop. I think this proof of concept shows that a Spark backend would not be very hard at this point, and the memory backend should be a guide for anyone looking to build one.
I think we should merge this despite it not being complete because the PR is already dense enough. I'd like to improve the quality of the test coverage and support all the cases in later PRs.
r? @fwbrasil @piyushnarang