REPL: Add toIterator (and related methods) #929

bholt · 2014-07-01T01:40:37Z

- Refactor ad-hoc code from snapshot into a TemporarySequenceFile class (possibly separated out into TemporaryFile and TypedSequenceFile)
- Add toIterator that works at least for snapshots
  - If called on a TypedPipe that is not snapshotted, generate a snapshot and call toIterator on it
  - Ideally, would allow running simple flatMappable operations without creating new snapshots (almost working)
  - toList and dump are trivially implementable from toIterator
- Add documentation for new functionality, including changes made in "Snapshot a pipe in the REPL" (#918)
  - Add a note about the REPL to the README
  - Add a "Repl Walkthrough" wiki page
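
To illustrate the point that toList and dump follow directly from toIterator, here is a minimal sketch. The Snapshot trait, the InMemory class, and the render helper are hypothetical stand-ins invented for this example; they are not the ShellPipe API from this PR.

```scala
// Hypothetical stand-in for a snapshotted pipe; not Scalding's actual ShellPipe API.
object IteratorHelpers {
  trait Snapshot[T] {
    // The only primitive: a fresh iterator over the stored values.
    def toIterator: Iterator[T]

    // Materialize everything in memory; only sensible for small results.
    def toList: List[T] = toIterator.toList

    // Render each value on its own line; dump prints that rendering.
    def render: String = toIterator.mkString("\n")
    def dump(): Unit = println(render)
  }

  // An in-memory implementation, used here only to exercise the helpers.
  final class InMemory[T](values: Seq[T]) extends Snapshot[T] {
    def toIterator: Iterator[T] = values.iterator
  }
}
```

Keeping toIterator as the single primitive means any source that can produce an iterator gets the other two methods for free.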

johnynek · 2014-07-01T01:59:34Z

scalding-repl/src/main/scala/com/twitter/scalding/ShellPipe.scala

+import cascading.tuple.Fields
+
+class TypedSequenceFile[T](path: String)(
+  implicit val mf: Manifest[T], tget: TupleGetter[T], tset: TupleSetter[T]) extends SequenceFile(path, 0) with Mappable[T] with TypedSink[T] {


why are any of these needed? You are not using tset. And tget can fall back to the default, I think, without an issue. Where is mf needed?

In addition to Oscar's comment: for the ones you do actually need, if you don't reference them explicitly, you can use context bounds here: class TypedSequenceFile[T: Manifest: TupleGetter: TupleSetter]
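
The context-bound suggestion above can be sketched as follows. This example uses scala.reflect.ClassTag in place of Manifest, TupleGetter, and TupleSetter so that it compiles without Scalding on the classpath; the object, method, and class names are made up for illustration.

```scala
import scala.reflect.ClassTag

object ContextBounds {
  // Explicit form: a named implicit parameter, needed only if the body refers to it by name.
  def elementNameExplicit[T](implicit ct: ClassTag[T]): String =
    ct.runtimeClass.getSimpleName

  // Context-bound form: the same implicit parameter, but anonymous.
  // implicitly retrieves it on the rare occasion it is needed directly.
  def elementNameBound[T: ClassTag]: String =
    implicitly[ClassTag[T]].runtimeClass.getSimpleName

  // A class can use the bound too; the evidence is passed on to anything that asks for it.
  class TypedBox[T: ClassTag](values: Seq[T]) {
    def toArray: Array[T] = values.toArray // array creation consumes the ClassTag
  }
}
```

Both method forms compile to the same signature; the bound is just shorter when the evidence is only forwarded and never named.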

bholt · 2014-07-01T02:14:37Z

@johnynek: I added the Manifests as part of trying to solve an issue I'm having when T is a tuple. Everything with the singleField stuff seems to work fine, but when I try to run toIterator on a snapshot of a TypedSequenceFile[(String,Int)], it errors with:

Caused by: cascading.CascadingException: unable to load deserializer for: scala.Tuple2 from: org.apache.hadoop.io.serializer.SerializationFactory
    at cascading.tuple.hadoop.TupleSerialization.getNewDeserializer(TupleSerialization.java:470)
    at cascading.tuple.hadoop.TupleSerialization$SerializationElementReader.getDeserializerFor(TupleSerialization.java:654)
    at cascading.tuple.hadoop.TupleSerialization$SerializationElementReader.read(TupleSerialization.java:621)
    at cascading.tuple.hadoop.io.HadoopTupleInputStream.readType(HadoopTupleInputStream.java:105)
    at cascading.tuple.hadoop.io.HadoopTupleInputStream.getNextElement(HadoopTupleInputStream.java:52)
    at cascading.tuple.io.TupleInputStream.readTuple(TupleInputStream.java:78)
    at cascading.tuple.hadoop.io.TupleDeserializer.deserialize(TupleDeserializer.java:40)
    at cascading.tuple.hadoop.io.TupleDeserializer.deserialize(TupleDeserializer.java:28)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1879)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1852)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
    at cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:61)
    at cascading.scheme.hadoop.SequenceFile.source(SequenceFile.java:93)
    at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:140)
    at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:120)
    ... 28 more

I thought maybe it didn't have enough type information somewhere. These sequence files seem to work fine when I use save or write to write out to a typed sink (TypedTsv). I only get this error when running toIterator.

johnynek · 2014-07-01T17:42:21Z

@bholt I see what is happening here. There is no serialization configured in this:

https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/Mode.scala#L119

I think this can be fixed by adding the following on line 22:

  // make sure to set up the required serialization for scalding:
  Config.default.toMap.foreach { case (k, v) =>
    conf.set(k, v)
  }

jcoveney · 2014-07-01T18:21:29Z

scalding-repl/src/main/scala/com/twitter/scalding/ShellPipe.scala

+}
+
+object TypedSequenceFile {
+  def apply[T](path: String)(implicit mf: Manifest[T]): TypedSequenceFile[T] =


same point as above. This can be apply[T: Manifest]

bholt · 2014-07-01T18:32:19Z

Thanks @johnynek, that fixed it for HDFS mode. Any idea what the equivalent is in CascadingLocal mode?

johnynek · 2014-07-01T18:50:13Z

@bholt I guess see this:
https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/Mode.scala#L136

and

http://docs.cascading.org/cascading/2.0/javadoc/cascading/flow/local/LocalFlowProcess.html#LocalFlowProcess(java.util.Properties)

and make a similar change.
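
A rough sketch of what "a similar change" might look like for local mode, assuming the goal is to copy the same defaults into the java.util.Properties that the linked LocalFlowProcess constructor accepts. The map contents below are placeholders, not real Scalding or Hadoop keys, and how Mode.scala actually builds the local flow process has not been checked here.

```scala
import java.util.Properties

object LocalDefaults {
  // Placeholder settings; the real keys and values would come from Scalding's
  // Config.default.toMap, which is not reproduced here.
  val assumedDefaults: Map[String, String] = Map(
    "example.serializations" -> "com.example.TupleSerialization",
    "example.buffer.size" -> "4096"
  )

  // Copy every default into a fresh Properties object, leaving the input untouched.
  def toProperties(defaults: Map[String, String]): Properties = {
    val props = new Properties()
    defaults.foreach { case (k, v) => props.setProperty(k, v) }
    props
  }
}
```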

johnynek · 2014-07-01T23:37:30Z

@bholt actually, for cascading local mode, if you are going in-memory anyway, why not use the MemorySink I implemented:
https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/MemorySink.scala#L33

Then you should not need to serialize anyway.

bholt · 2014-07-01T23:43:33Z

Cool, I'll give that a try. Should be fine since we're already saying snapshot files should only be valid in the current session.

johnynek · 2014-07-01T23:46:03Z

@bholt you'll have to pattern match on the Mode to check that it is a CascadingLocal; otherwise, MemorySink won't work.

bholt · 2014-07-02T21:56:04Z

(please excuse the extra commits from merging repl+execution)

johnynek · 2014-07-02T22:33:10Z

scalding-repl/src/main/scala/com/twitter/scalding/ShellPipe.scala

-
-    TypedPipe.fromSingleField[T](SequenceFile(tmpSeq))
+    mode match {
+      case Local(_) =>


If you match with case cl: CascadingLocal, you will get both Local and Test modes, if we want to support Test (which might be nice).
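
The difference between matching on the case class and matching on the shared trait can be sketched like this. The types below are simplified stand-ins that borrow Scalding's mode names; the real classes take different constructor arguments, and the returned strings are just labels for this example.

```scala
object ModeMatching {
  // Simplified stand-ins that borrow Scalding's mode names.
  sealed trait Mode
  trait CascadingLocal extends Mode
  final case class Local(strictSources: Boolean) extends CascadingLocal
  final case class Test(name: String) extends CascadingLocal
  final case class Hdfs(strict: Boolean) extends Mode

  // Matching on the case class only catches Local.
  def sinkByCaseClass(mode: Mode): String = mode match {
    case Local(_) => "memory"
    case _        => "sequence file"
  }

  // Matching on the shared trait catches Local and Test alike.
  def sinkByTrait(mode: Mode): String = mode match {
    case _: CascadingLocal => "memory"
    case _                 => "sequence file"
  }
}
```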

johnynek · 2014-07-03T18:08:17Z

scalding-repl/src/main/scala/com/twitter/scalding/ShellPipe.scala

+import cascading.flow.FlowDef
+import cascading.tuple.Fields
+
+//import com.twitter.scalding.ReplImplicits._


no commented code, please.

johnynek · 2014-07-03T20:06:21Z

Brandon, can you merge with develop and we can merge this? Scalding needs to go out, and this looks like an improvement to me.

bholt · 2014-07-03T20:13:56Z

Was just doing that. Also moved TypedSequenceFile to new directory.

johnynek · 2014-07-03T20:15:43Z

sweet.

johnynek · 2014-07-03T20:52:35Z

one last merge needed. Sorry. :) I keep changing that ExecutionContext.

johnynek · 2014-07-03T20:57:13Z

scalding-repl/src/main/scala/com/twitter/scalding/ReplImplicits.scala

+ */
+object ReplImplicitContext {
+  /** Implicit flowDef for this Scalding shell session. */
+  implicit var fd = ReplImplicits.flowDef


Why aren't these implicit def flowDef: FlowDef = ...?

I like to minimize the vars. Can we just make the ones in ReplImplicits vars?

Oops, definitely didn't intend to make them "var".
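
A small sketch of the pattern being asked for: the mutable state lives in one object, and the importable implicits are defs that read it. Flow, Session, and Context are hypothetical names standing in for FlowDef, ReplImplicits, and ReplImplicitContext; none of this is the actual REPL code.

```scala
object ImplicitContext {
  // Hypothetical session state; FlowDef is replaced by a plain labelled case class.
  final case class Flow(label: String)

  object Session {
    // Mutable state stays in one place...
    var current: Flow = Flow("initial")
  }

  object Context {
    // ...and the implicit is a def, so it re-reads the state on every lookup
    // and cannot itself be reassigned by whoever imports it.
    implicit def flow: Flow = Session.current
  }

  // Any method that needs the ambient flow picks it up implicitly.
  def describe(implicit f: Flow): String = f.label

  // Example caller: importing Context brings the implicit into scope.
  def currentLabel: String = {
    import Context._
    describe
  }
}
```

Because the implicit is a def, resetting Session.current is immediately visible to every later implicit lookup, whereas a copied var or val would keep pointing at the old value.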

johnynek · 2014-07-03T21:04:08Z

scalding-repl/src/main/scala/com/twitter/scalding/ShellPipe.scala

+  def toOption(implicit fd: FlowDef, md: Mode): Option[T] = vp match {
+    case _: EmptyValue => None
+    case LiteralValue(v) => Some(v)
+    case ComputedValue(tp) => tp.snapshot.toList match {


tp.snapshot.toIterator.take(2).toList

is a bit safer in case there is some bug. It will still error, but it won't blow up the memory if there are 2 or more items in the list.

Actually, why not tp.toIterator.take(2).toList? Why should we explicitly call snapshot?

It is an error to have more than one value in a ValuePipe, right? In the match, should I do a sys.error for case _?
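
The take(2) idea, combined with the sys.error fallback asked about above, can be sketched on a plain Iterator. SingleValue and its toOption are illustrative names only; the real method matches on ValuePipe cases and takes an implicit FlowDef and Mode.

```scala
object SingleValue {
  // Reads at most two elements, so an unexpectedly large input is never fully materialized.
  def toOption[T](values: Iterator[T]): Option[T] =
    values.take(2).toList match {
      case Nil      => None
      case v :: Nil => Some(v)
      case _        => sys.error("expected at most one value, but found more")
    }
}
```

Two elements are enough to tell "exactly one" apart from "more than one", which is why the sketch works even on an unbounded iterator.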

johnynek · 2014-07-03T21:23:54Z

merge when green.

bholt · 2014-07-04T00:32:56Z

Btw, added a tutorial for the new REPL to the wiki: https://github.com/twitter/scalding/wiki/Scalding-REPL

sriramkrishnan · 2014-07-04T00:46:49Z

Love all the new features on the REPL. Doesn't the Wiki page make this obsolete:
https://github.com/twitter/scalding/blob/develop/scalding-repl/README.md

I would rather not have the same info in two places.

johnynek · 2014-07-04T00:55:27Z

Did a quick edit to fix a minor error with the old one. But yes: +1 to more features, -1 to duplicate docs.


Oscar Boykin :: @posco :: http://twitter.com/posco

bholt · 2014-07-04T02:07:36Z

The open source wiki seemed like a better place for this documentation, so that it can be more easily updated (not tied to release schedules). Perhaps we should just put a pointer to the wiki page in the README?

sriramkrishnan · 2014-07-04T02:36:43Z

A pointer to the Wiki page sounds good. Having said that, I actually prefer the docs to be closer to the code - and indeed tied to release schedules (since the docs may actually be different for different releases). For instance, the Wiki page is now inconsistent with the 0.9.0 release.

sriramkrishnan · 2014-07-04T02:50:19Z

#941

bholt added 2 commits June 30, 2014 16:01

mostly working toIterator

b768ce4

refactor to use TypedSequenceFile (single field), which requires having Manifests available; note: toIterator doesn't work for tuples, supposedly can't find deserializer, even though "write"/"save" both can deserialize, run things, and write back out

c2ed7b3

johnynek reviewed Jul 1, 2014

jcoveney reviewed Jul 1, 2014

fix tuple reading for Hadoop mode

6e365d4

Brandon Holt added 7 commits July 2, 2014 09:24

Merge branch 'repl+execution' into repl+toiterator

9121d4a

Merge branch 'repl+execution' into repl+toiterator

9848c61

pattern match for head pipe with Converter

5e2801d

add TextLine.toIterator test

b886748

get rid of unnecessary implicit Manifest, TupleConverter, TupleSetters

5cb3605

add tests for toIterator, 'tuple' one currently breaks it because of serialization problem

906fd18

snapshot: use MemorySink for Local mode, handle this case in toIterator

81f8088

Brandon Holt added 2 commits July 2, 2014 15:29

add implicit to enrich ValuePipe -> ShellPipe

4962ac0

test that implicit snapshot on toIterator works in all the cases

d88284e

johnynek reviewed Jul 2, 2014

bholt added 2 commits July 3, 2014 00:42

put implicit versions of FlowDef and Mode in a separate object so you can import that additionally if desired

0f16fa2

add imports to make './sbt scalding-repl/console' work

e442240

johnynek reviewed Jul 3, 2014

Brandon Holt added 3 commits July 3, 2014 11:22

some cleanup, fix path suffix

64331ba

add 'ValuePipe.toOption' instead of toIterator

26d9b69

merge develop (includes Execution improvements)

de55cb3

move TypedSequenceFile to com.twitter.scalding.source

5aa2a12

merge with 'develop' to pick up more ExecutionContext changes

326e9ad

johnynek reviewed Jul 3, 2014

make implicits 'defs' in ReplImplicitContext

ea6971b

johnynek reviewed Jul 3, 2014

Brandon Holt added 2 commits July 3, 2014 14:08

handle EmptyPipe

0006f73

take 2 on ValuePipe iterator to be safe, throw error if more elements

044e5ca

merge 'develop', conflicted with project/Build on my additions to 'repl' module

dcf97e1

ianoc added a commit that referenced this pull request Jul 3, 2014

Merge pull request #929 from bholt/repl+toiterator

c480067

REPL: Add toIterator (and related methods)

ianoc merged commit c480067 into twitter:develop Jul 3, 2014

bholt deleted the repl+toiterator branch July 3, 2014 22:56
