-
Notifications
You must be signed in to change notification settings - Fork 266
Support for creating a store-service for Storm platform #563
Changes from 6 commits
9dd349e
79a5077
5d115da
e2aeb02
77cee16
cd13fec
0d0524a
107ea8b
74999a7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
/* | ||
Copyright 2013 Twitter, Inc. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package com.twitter.summingbird.online | ||
|
||
import com.twitter.summingbird.batch.{ Batcher, BatchID } | ||
import com.twitter.storehaus.ReadableStore | ||
import com.twitter.storehaus.algebra.Mergeable | ||
|
||
trait CombinedServiceStoreFactory[-K, V] extends MergeableStoreFactory[(K, BatchID), V] with OnlineServiceFactory[K, V] | ||
|
||
object CombinedServiceStoreFactory { | ||
|
||
def apply[K, V](mStore: () => Mergeable[(K, BatchID), V], b: Batcher) = { | ||
new CombinedServiceStoreFactory[K, V] { | ||
def mergeableStore = mStore | ||
def mergeableBatcher = b | ||
def serviceStore = () => ReadableStore.empty | ||
} | ||
} | ||
|
||
def apply[K, V](mStore: () => Mergeable[(K, BatchID), V], b: Batcher, sStore: () => ReadableStore[K, V]) = { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we want You can always ignore the BatchID if you want on the read side. We kind of messed this up in the past with summingbird and I've been trying to fix it (see #547 ) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, you could use a lazy parameter here and make the function below. This can be easier for users than writing the functions: def apply[K, V](mStore: => Mergable[(K, BatchID), V], b: Batcher, sStore: => ReadableStore[K, V]) = ... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have serviceStore function in Storm platform that takes the lazy parameters and turns them into functions https://github.com/twitter/summingbird/pull/563/files#diff-36ff64139544fb94baad353028959a1eR83. I was assuming users to be using that function like Storm.store and Storm.service. For lookups, wouldn't we always want the latest value for a specific key? I agree that it would be nice to pass just one MergeableStore to remove the possibility of getting two different stores. ClientMergeableStore would work in this case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we want the latest key, we want the latest that is before the timestamp of tuple coming in. Consider the fact that the queues get delayed, and the fact that we have batch boundaries. You don't want some part of the topology (some queue for instance that is low traffic), to get ahead update a service so the the backed up queue is looking into the future. It is not a huge source of errors, but I think it is more consistent to give a BatchID in the key when doing a service read, but some stores might ignore that (using composeKeyMapping) to discard the BatchID and just look up the key. But if we bake into the API the fact that the BatchID is not available at lookup time, there is no way for a user to make a more precise store. What do you think? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well, if we constrain the lookup to use the BatchID then we would be limiting the error on the value to be up to that BatchID since we don't have the actual timestamp in the store? So it's more correct but still not exactly correct. Unless we use very small batches (which we normally don't and they probably wouldn't make sense anyway). Also this change will require some significant changes in the online left join since right now we don't have access to the Batcher there but we would need to in order to cast the timestamp to the BatchID. Do you think this is a common enough usecase for us to change this behavior? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point that we don't have the batcher at the service lookup. This inconsistency bothers me, but I guess fixing it is bigger than this. Added: #568 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess here is where I want it to be easier to use. Can't we add a method that deals with the clientstore business here? That is confusing as to how to wire up, and I'd rather see it documented in code that works for most cases, than a wiki doc that explains the boilerplate to write. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea, good point. I will try to wire the ClientStore in here to simplify it for the users There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We still need to distinguish between a service that uses online-only results (ClientStore(online)) and the one that combines online and offline (ClientStore(online, offline)). So, in user code, this would be:
or
And we would have to document this so it is clear when to use what. Does this sound reasonable? |
||
new CombinedServiceStoreFactory[K, V] { | ||
def mergeableStore = mStore | ||
def mergeableBatcher = b | ||
def serviceStore = sStore | ||
} | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -115,7 +115,7 @@ object FlatMapOperation { | |
storeSupplier: OnlineServiceFactory[K, JoinedV]): FlatMapOperation[T, (K, (V, Option[JoinedV]))] = | ||
new FlatMapOperation[T, (K, (V, Option[JoinedV]))] { | ||
lazy val fm = fmSupplier | ||
lazy val store = storeSupplier.create | ||
lazy val store = storeSupplier.serviceStore | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd put the type on this, i suspect each reference your doing here to store is actually a different store which isn't what we want. (Is store a readable store or a () => Readable store? ) --> Possibly just putting storeSupplier.serviceStore() here instead might be all thats needed There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it's a ()=>ReadableStore, yes, we need the () at the end, thanks |
||
override def apply(t: T) = | ||
fm.apply(t).flatMap { trav: TraversableOnce[(K, V)] => | ||
val resultList = trav.toSeq // Can't go through this twice | ||
|
@@ -125,15 +125,15 @@ object FlatMapOperation { | |
Future.value(Map.empty) | ||
else { | ||
// Do the lookup | ||
val mres: Map[K, Future[Option[JoinedV]]] = store.multiGet(keySet) | ||
val mres: Map[K, Future[Option[JoinedV]]] = store().multiGet(keySet) | ||
val resultFutures = resultList.map { case (k, v) => mres(k).map { k -> (v, _) } }.toIndexedSeq | ||
Future.collect(resultFutures) | ||
} | ||
} | ||
|
||
override def close { | ||
fm.close | ||
Await.result(store.close) | ||
Await.result(store().close) | ||
} | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,5 +25,5 @@ import com.twitter.storehaus.ReadableStore | |
|
||
case class ReadableServiceFactory[-K, +V]( | ||
store: () => ReadableStore[K, V]) extends OnlineServiceFactory[K, V] { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this can just be: case class ReadableServiceFactory[-K, +V](override val serviceStore: () => ReadableStore[K, V]) extends OnlineServiceFactory[K, V] There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no love for my simplification. :) ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. didn't see this note :) will add the simplification |
||
def create = store() | ||
def serviceStore = store | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you write a return type on this. It is ambiguous (are we applying the function or no?). Also, all public methods should have types. |
||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -80,6 +80,9 @@ object Storm { | |
|
||
def service[K, V](serv: => ReadableStore[K, V]): ReadableServiceFactory[K, V] = ReadableServiceFactory(() => serv) | ||
|
||
def storeService[K, V](mStore: => Mergeable[(K, BatchID), V], sStore: => ReadableStore[K, V])(implicit batcher: Batcher): CombinedServiceStoreFactory[K, V] = | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. how do you use this? What do you put in for sStore? Don't we need the ClientMergeable for this? If so, shouldn't we just take either that or the inputs in needs (like the offline store)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right now, I have it setup with ClientStore like so:
In this example we only use the online store, but the user can pass an offline store to it too. This is not great since the two stores (mStore and sStore) can be different (unless we somehow check that ClientStore wraps the same mergeable store), so we would have to be explicit about that. ClientMergeable extends Mergeable[(K, BatchID), V] and ReadableStore[(K, BatchID), V] right? So we still would need to handle the BatchID somehow in the leftjoin There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
def storeAndService[K, V](m: MergeableStore[(K, BatchID), V], batchesToKeep: Int)(implicit b: Batcher): CombinedServciceStoreFactory[K, V] =
// you can get the semigroup/monoid from the MergeableStore, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. actually, that would make clientstore depend on summingbird-storm. Can we put that function in the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are you talking about the function:
? It is already in object Storm |
||
CombinedServiceStoreFactory(() => mStore, batcher, () => sStore) | ||
|
||
def toStormSource[T](spout: Spout[T], | ||
defaultSourcePar: Option[Int] = None)(implicit timeOf: TimeExtractor[T]): StormSource[T] = | ||
SpoutSource(spout.map(t => (Timestamp(timeOf(t)), t)), defaultSourcePar.map(SourceParallelism(_))) | ||
|
@@ -182,7 +185,7 @@ abstract class Storm(options: Map[String, Options], transformConfig: Summingbird | |
private def scheduleSummerBolt[K, V](jobID: JobId, stormDag: Dag[Storm], node: StormNode)(implicit topologyBuilder: TopologyBuilder) = { | ||
val summer: Summer[Storm, K, V] = node.members.collect { case c: Summer[Storm, K, V] => c }.head | ||
implicit val semigroup = summer.semigroup | ||
implicit val batcher = summer.store.batcher | ||
implicit val batcher = summer.store.mergeableBatcher | ||
val nodeName = stormDag.getNodeName(node) | ||
|
||
type ExecutorKeyType = (K, BatchID) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -40,5 +40,5 @@ class WritableStoreSink[K, V](writable: => WritableStore[K, V]) extends StormSin | |
class StormBuffer[K, V](supplier: => Store[K, V]) extends StormSink[(K, V)] with OnlineServiceFactory[K, V] { | ||
private lazy val constructed = supplier // only construct it once | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why aren't we doing this trick on the other store factories? Is it because this is possibly called twice, once with toFn the other with serviceStore? Can we add a comment here while we are working on updating this set of items? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure on the rationale behind having this lazy construct here and not in other places. AFAICT StormBuffer is not used anywhere either? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I guess it was just defensive. :/ Let's leave it for the time being. |
||
def toFn = { (kv: (K, V)) => constructed.put((kv._1, Some(kv._2))) } | ||
def create = constructed | ||
def serviceStore = () => constructed | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. return type please. |
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a sane constructor to have? left join would always come back None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, yea, we should only have the constructor with both stores defined