TypeDescriptor could support Either #1368

johnynek · 2015-07-16T11:53:29Z

We have product types (tuples, case classes) in TypeDescriptor, but not sum types (e.g. Either). We could support Either in the following way:

If we have Either[L, R] and we have support for both L and R, then we take max(fields[L].size, fields[R].size) as the number of value columns and null-pad the smaller side. Then we add an extra column to store either "L" or "R" (or any fixed token) to tell which side follows.

Then, for the types, if the flattened position has the same type on both L and R, use that type, otherwise go to AnyRef.
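A minimal sketch of the tagged encoding described above, in plain Scala with no scalding dependency. The names `TaggedEither`, `Row`, `pad`, `encode` and `decode` are hypothetical and only illustrate the column layout; this is not how TypeDescriptor is implemented.

```scala
object TaggedEither {
  // Hypothetical stand-in for a flattened row: one boxed value per column.
  type Row = Vector[Any]

  // Pad a row on the right with nulls up to `width` columns.
  def pad(row: Row, width: Int): Row =
    row ++ Vector.fill(width - row.size)(null)

  // One tag column, then max(leftWidth, rightWidth) value columns.
  def encode(e: Either[Row, Row], leftWidth: Int, rightWidth: Int): Row = {
    val width = math.max(leftWidth, rightWidth)
    e match {
      case Left(l)  => "L" +: pad(l, width)
      case Right(r) => "R" +: pad(r, width)
    }
  }

  // Read the tag, then keep only the columns that belong to that side.
  def decode(row: Row, leftWidth: Int, rightWidth: Int): Either[Row, Row] =
    row.head match {
      case "L"   => Left(row.slice(1, 1 + leftWidth))
      case "R"   => Right(row.slice(1, 1 + rightWidth))
      case other => throw new IllegalArgumentException(s"unknown tag: $other")
    }
}
```

Because each value column may hold a value from either side, its declared type has to be a common supertype of the two, which is where AnyRef comes in when the sides disagree.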

This may be a bit academic, since I'm not sure anyone has ever wanted to do this. Also, the more complex the encoding, the less portable it is, which is a main value of the Text encoding.


sid-kap · 2015-07-16T17:24:17Z

Why not take fields[L].size + fields[R].size as the size and null pad whichever one is not being used? That way we would not need AnyRef.

johnynek · 2015-07-16T18:03:10Z

Consider:

Either[Int, (Int, Long)]

We care not only about the number of columns but also about the types of those columns. So here we could turn it into:

Id: Char, A: Int, B: AnyRef

And encode:
Left(2) => ('L', 2, null)
Right((4, 5L)) => ('R', 4, Long.valueOf(5L))

In this case we could have just used (Int, Long) for both and put a junk Long in the Left case, but what about Either[String, (Int, Long)]?

Now we need columns like:

Id: Char, A: AnyRef, B: Long

Even if we take the idea of putting in a junk Long, the common superclass of String and (boxed) Int here is AnyRef.

Since an encoded null is potentially cheaper than 0L, especially for text where we can write the empty string, I proposed the AnyRef encoding.

Oscar Boykin :: @posco :: http://twitter.com/posco

johnynek · 2015-07-27T21:21:24Z

Actually the better way, since the goal here is probably to use scalding data in another system, would be to add columns for all of L and all of R and write all nulls/empty to one or the other. Then:

Either[Int, String]
Left(1) => "1,"
Right("yo") => ",yo"

As long as one or the other side is evident in the language of #1387, we could handle this case.
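The side-by-side layout above can be sketched as follows, assuming each side has already been flattened to string columns. `ConcatEither`, `columns` and `line` are hypothetical names, and the comma join is deliberately naive (no quoting or escaping).

```scala
object ConcatEither {
  // Left columns first, then right columns; the unused side is all empty strings.
  def columns(e: Either[Seq[String], Seq[String]],
              leftWidth: Int,
              rightWidth: Int): Seq[String] =
    e match {
      case Left(l)  => l ++ Seq.fill(rightWidth)("")
      case Right(r) => Seq.fill(leftWidth)("") ++ r
    }

  // Naive comma-separated rendering of the columns.
  def line(e: Either[Seq[String], Seq[String]],
           leftWidth: Int,
           rightWidth: Int): String =
    columns(e, leftWidth, rightWidth).mkString(",")
}
```

For Either[Int, String] this gives "1," for Left(1) and ",yo" for Right("yo"). One limitation of this sketch is that a side whose real value is the empty string cannot be told apart from an absent side.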

Gabriella439 · 2015-07-27T21:24:36Z

What will the decoder do if both columns are present (i.e. "1,yo")?

johnynek · 2015-07-27T22:25:52Z

@Gabriel439 Undefined, or it could throw. This is the question of what to do when we get bad data. Sometimes we can't detect bad data. Sometimes we can, but validating all data would be expensive when corruption is assumed to be rare. And sometimes we want correctness so much that we will pay to check even if errors happen at a vanishing rate.

It is not clear what you would want here. Thrift is a similar case: with CompactThrift you can misparse quite a lot of data (and perhaps not notice that the data is corrupted).
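As one possible answer to the "1,yo" question, here is a sketch of a strict decoder for the two-column text form of Either[Int, String] that rejects rows where both or neither column is filled. `DecodeEither` is a hypothetical name, and reporting errors as a `Left` message instead of throwing is an assumption made for illustration, not a decision from this thread.

```scala
object DecodeEither {
  // Outer Left carries an error message; outer Right carries the decoded value.
  def decode(line: String): Either[String, Either[Int, String]] =
    // Limit -1 keeps trailing empty fields, so "1," splits into ("1", "").
    line.split(",", -1) match {
      case Array(l, "") if l.nonEmpty =>
        scala.util.Try(l.toInt).toOption match {
          case Some(i) => Right(Left(i))
          case None    => Left(s"left column is not an Int: $l")
        }
      case Array("", r) if r.nonEmpty => Right(Right(r))
      case Array("", "")              => Left("both columns empty")
      case Array(_, _)                => Left("both columns present")
      case _                          => Left("expected exactly two columns")
    }
}
```

This check is cheap because it only looks at which columns are empty; a cheaper but less safe decoder could skip it and simply prefer one side.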
