REPL: Add toIterator (and related methods) #929
…ng Manifests available note: toIterator doesn't work for tuples, supposedly can't find deserializer, even though "write"/"save" both can deserialize, run things, and write back out
import cascading.tuple.Fields

class TypedSequenceFile[T](path: String)(
  implicit val mf: Manifest[T], tget: TupleGetter[T], tset: TupleSetter[T]) extends SequenceFile(path, 0) with Mappable[T] with TypedSink[T] {
why are any of these needed? You are not using tset. And tget can fall back to the default, I think, without an issue. Where is mf needed?
In addition to Oscar's comment: for the ones you do actually need, if you don't reference them explicitly, you can use context bounds here: class TypedSequenceFile[T: Manifest: TupleGetter: TupleSetter]
@johnynek: I added the Manifests as part of trying to solve an issue I'm having when T is a tuple. Everything with the
I thought maybe it didn't have enough type information somewhere. These sequence files seem to work fine when I use
@bholt I see what is happening here. There is no serialization configured in this: I think this can be fixed by adding on line 22:
// make sure to set up the required serialization for scalding:
Config.default.toMap.foreach { case (k, v) =>
  conf.set(k, v)
}
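For illustration, the suggested fix amounts to copying every default key/value pair into the configuration before it is used. Below is a minimal sketch that uses plain maps as stand-ins; `ConfCopySketch`, `defaults`, `applyDefaults`, and the example values are hypothetical and are not part of Scalding or Hadoop.

```scala
import scala.collection.mutable

object ConfCopySketch {
  // Hypothetical stand-in for the defaults that Config.default.toMap would supply.
  val defaults: Map[String, String] = Map(
    "io.serializations" -> "example.SerializationA,example.SerializationB",
    "example.other.key" -> "value"
  )

  // Copies every default into `conf`, overwriting keys that already exist.
  def applyDefaults(conf: mutable.Map[String, String]): Unit =
    defaults.foreach { case (k, v) => conf.update(k, v) }
}
```

Note that in this sketch the defaults win over anything already present in `conf`, which mirrors an unconditional `conf.set(k, v)` loop.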
}

object TypedSequenceFile {
  def apply[T](path: String)(implicit mf: Manifest[T]): TypedSequenceFile[T] =
same point as above. This can be apply[T: Manifest]
Thanks @johnynek, that fixed it for HDFS mode. Any idea what the equivalent is in CascadingLocal mode?
@bholt I guess see this: and and make a similar change.
@bholt actually, for cascading local mode, if you are going in-memory anyway, why not use the MemorySink I implemented: Then you should not need to serialize anyway.
Cool, I'll give that a try. Should be fine since we're already saying snapshot files should only be valid in the current session.
@bholt you'll have to pattern match on the Mode to check for CascadingLocal; otherwise, MemorySink won't work.
…serialization problem
(please excuse the extra commits from merging repl+execution)

TypedPipe.fromSingleField[T](SequenceFile(tmpSeq))
mode match {
  case Local(_) =>
if you do: cl: CascadingLocal
you will get both Local and Test modes, if we want to support Test (which might be nice).
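To illustrate the point about matching on the shared trait, here is a minimal sketch. The types below are simplified, hypothetical stand-ins for Scalding's mode hierarchy (the real classes take different constructor arguments), and `sinkFor` is an illustrative name only.

```scala
object ModeMatchSketch {
  // Simplified stand-ins for the mode hierarchy; not the real Scalding classes.
  sealed trait Mode
  sealed trait CascadingLocal extends Mode
  final case class Local(strict: Boolean) extends CascadingLocal
  final case class Test(label: String) extends CascadingLocal
  final case class Hdfs(strict: Boolean) extends Mode

  // Matching on the trait covers both Local and Test with a single case.
  def sinkFor(mode: Mode): String = mode match {
    case _: CascadingLocal => "memory"
    case _                 => "sequence-file"
  }
}
```

Matching on `Local(_)` alone would send `Test` down the fallback branch, which is the difference the comment above is pointing at.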
… can import that additionally if desired
import cascading.flow.FlowDef
import cascading.tuple.Fields

//import com.twitter.scalding.ReplImplicits._
no commented code, please.
Brandon, can you merge with develop and we can merge this? Scalding needs to go out, and this looks like an improvement to me.
Was just doing that. Also moved TypedSequenceFile to new directory.
sweet.
one last merge needed. Sorry. :) I keep changing that ExecutionContext.
*/
object ReplImplicitContext {
  /** Implicit flowDef for this Scalding shell session. */
  implicit var fd = ReplImplicits.flowDef
why aren't these implicit def flowDef: FlowDef = ...?
I like to minimize the vars. Can we just make the ones in ReplImplicits vars?
Oops, definitely didn't intend to make them "var".
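A minimal sketch of the implicit def approach suggested above, assuming a single var that holds the session state. `FlowDef` here is a hypothetical stand-in class, and `Session`, `Context`, and `currentName` are illustrative names, not the REPL's actual objects.

```scala
object ImplicitDefSketch {
  // Hypothetical stand-in for cascading.flow.FlowDef.
  final class FlowDef(val name: String)

  object Session {
    // The only mutable state lives here.
    var flowDef: FlowDef = new FlowDef("first")
  }

  object Context {
    // Re-reads Session.flowDef on every implicit lookup, so no second var is needed.
    implicit def flowDef: FlowDef = Session.flowDef
  }

  // Any method that needs the current FlowDef takes it implicitly.
  def currentName(implicit fd: FlowDef): String = fd.name
}
```

With `import ImplicitDefSketch.Context._` in scope, reassigning `Session.flowDef` is picked up by later calls to `currentName`, whereas an implicit val or var initialized from it would keep the value it copied at initialization.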
def toOption(implicit fd: FlowDef, md: Mode): Option[T] = vp match {
  case _: EmptyValue => None
  case LiteralValue(v) => Some(v)
  case ComputedValue(tp) => tp.snapshot.toList match {
tp.snapshot.toIterator.take(2).toList
is a bit safer, in case there is some bug. It will still error, but it won't blow up the memory if there are 2 or more items in the list.
actually, why not tp.toIterator.take(2).toList? Why should we explicitly call snapshot?
It is an error to have more than one value in a ValuePipe, right? In the match, should I do a sys.error for case _?
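The take(2) guard and the sys.error fallback discussed above could look roughly like this on a plain Iterator. This is a sketch under the assumption that more than one value is a bug; `SingleValueSketch` and `single` are hypothetical names, not Scalding API.

```scala
object SingleValueSketch {
  // Pulls at most two elements, so an oversized input fails fast
  // instead of being materialized in full.
  def single[T](it: Iterator[T]): Option[T] = it.take(2).toList match {
    case Nil      => None
    case v :: Nil => Some(v)
    case _        => sys.error("expected at most one value")
  }
}
```

Because only two elements are ever pulled, the check stays cheap even when the underlying iterator is very large or unbounded.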
merge when green.
REPL: Add toIterator (and related methods)
Btw, added a tutorial for the new REPL to the wiki: https://github.com/twitter/scalding/wiki/Scalding-REPL
Love all the new features on the REPL. Doesn't the Wiki page make this obsolete: I would rather not have the same info in two places.
Did a quick edit, to fix a minor error with the old one. But yes. +1 to
On Thu, Jul 3, 2014 at 2:46 PM, Sriram Krishnan notifications@github.com
Oscar Boykin :: @posco :: http://twitter.com/posco
The open source wiki seemed like a better place for this documentation, so that it can be more easily updated (not tied to release schedules). Perhaps we should just put a pointer to the wiki page in the README?
A pointer to the Wiki page sounds good. Having said that, I actually prefer the docs to be closer to the code - and indeed tied to release schedules (since the docs may actually be different for different releases). For instance, the Wiki page is now inconsistent with the 0.9.0 release.
snapshot into TemporarySequenceFile class (possibly separated out into TemporaryFile and TypedSequenceFile)
toIterator that works at least for snapshots
toIterator
toList and dump are trivially implementable from toIterator
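As a sketch of the last item, toList and dump can be derived from toIterator alone. The `Enumerable` trait and `fromSeq` helper below are hypothetical and only illustrate the layering; they are not the PR's actual classes.

```scala
object IteratorOpsSketch {
  // Hypothetical minimal interface; only toIterator must be supplied.
  trait Enumerable[T] {
    def toIterator: Iterator[T]

    // Materializes every element in memory.
    def toList: List[T] = toIterator.toList

    // Prints each element on its own line.
    def dump(): Unit = toIterator.foreach(println)
  }

  // Wraps an in-memory sequence so the derived methods can be tried out.
  def fromSeq[T](items: Seq[T]): Enumerable[T] = new Enumerable[T] {
    def toIterator: Iterator[T] = items.iterator
  }
}
```

For example, `IteratorOpsSketch.fromSeq(Seq(1, 2, 3)).toList` yields the same elements as a List, and `dump()` prints them one per line.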