TypeDescriptor could support Either #1368

Open
johnynek opened this issue Jul 16, 2015 · 5 comments
Comments

@johnynek
Collaborator

We have product types (tuples, case classes) in TypeDescriptor, but not sum types (e.g. Either). We could support Either in the following way:

If we have Either[L, R] and we have support for L and R, then we take max(fields[L].size, fields[R].size) as the number of value columns and null-pad the smaller side. Then we add an extra column that stores either "L" or "R" (or any fixed token) to tell which side follows.

Then, for the types: if a flattened position has the same type on both L and R, use that type; otherwise fall back to AnyRef.
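A minimal sketch of the column-merging rule described above, in plain Scala. Everything here is hypothetical: `EitherSchema` and `merge` are illustration-only names, not part of scalding's TypeDescriptor, and column types are modelled as plain strings. It also assumes that a position present on only one side becomes AnyRef so it can hold the null padding.

```scala
// Hypothetical helper for illustration only; not scalding's TypeDescriptor API.
object EitherSchema {
  val Tag = "Char"        // type of the extra "L"/"R" discriminator column
  val Fallback = "AnyRef" // used when the two sides disagree or one side is padding

  // left and right are the flattened column types of L and R.
  def merge(left: List[String], right: List[String]): List[String] = {
    // Pad the shorter side with None so both have max(left.size, right.size) slots.
    val padded = left.map(Option(_)).zipAll(right.map(Option(_)), None, None)
    val columns = padded.map {
      case (Some(l), Some(r)) if l == r => l // same type at this position on both sides
      case _                            => Fallback
    }
    Tag :: columns
  }
}
```

For example, `EitherSchema.merge(List("Int"), List("Int", "Long"))` yields the tag column, then `Int`, then `AnyRef`.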

This may be a bit academic, since I'm not sure anyone has ever wanted to do this. Also, the more complex the encoding, the less portable it is, which is a main value of the Text encoding.

@sid-kap
Contributor

sid-kap commented Jul 16, 2015

Why not take fields[L].size + fields[R].size as the size and null pad whichever one is not being used? That way we would not need AnyRef.

@johnynek
Collaborator Author

Consider:

Either[Int, (Int, Long)]

We care not only about the number of columns but also about the types of those columns. So here we could turn it into:

Id: Char, A: Int, B: AnyRef

And encode:
Left(2) => ('L', 2, null)
Right((4, 5L)) => ('R', 4, Long.valueOf(5L))
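The two rows above can be reproduced with a small round-trip sketch. `MaxWidthEncoding`, `Row`, `encode` and `decode` are hypothetical names used only for illustration, and the Long is boxed with `java.lang.Long.valueOf` so that it fits in an AnyRef slot.

```scala
// Hypothetical sketch of the (Id: Char, A: Int, B: AnyRef) layout; not scalding code.
object MaxWidthEncoding {
  type Row = (Char, Int, AnyRef)

  def encode(e: Either[Int, (Int, Long)]): Row = e match {
    case Left(a)       => ('L', a, null)                        // null-pad the unused slot
    case Right((a, b)) => ('R', a, java.lang.Long.valueOf(b))   // box the Long into AnyRef
  }

  def decode(row: Row): Either[Int, (Int, Long)] = row match {
    case ('L', a, _)                 => Left(a)
    case ('R', a, b: java.lang.Long) => Right((a, b.longValue))
    case other =>
      throw new IllegalArgumentException(s"unrecognized row: $other")
  }
}
```

Throwing on an unrecognized row is one possible choice; the thread does not settle how bad rows should be handled.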

In this case we could have just used (Int, Long) for both and put a junk Long in the Left case, but what about Either[String, (Int, Long)]?

Now we need columns like:

Id: Char, A: AnyRef, B: Long

That holds even if we take the idea of putting in a junk Long: here the common superclass of String and Int is AnyRef.

Since an encoded null is potentially cheaper than 0L, especially for text where we can write the empty string, I proposed the AnyRef encoding.


@johnynek
Collaborator Author

Actually the better way, since the goal here is probably to use scalding data in another system, would be to add columns for all of L and all of R and write nulls (or empty strings) into whichever side is unused. Then:

Either[Int, String]
Left(1) => "1,"
Right("yo") => ",yo"

As long as one or the other side is evident in the language of #1387, we could handle this case.
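As a rough sketch, the side-by-side text encoding for Either[Int, String] could look like the following. `SideBySideEncoding` is a hypothetical name, and the sketch assumes the String contains no commas and is non-empty, since no escaping or quoting is attempted.

```scala
// Hypothetical sketch of the "one column per side" text encoding; not scalding code.
object SideBySideEncoding {
  // Left fills the first column and Right the second; the unused column is left empty.
  def encode(e: Either[Int, String]): String = e match {
    case Left(i)  => s"$i,"
    case Right(s) => s",$s"
  }
}
```

Note that under this layout `Right("")` would be written as `","`, which has neither side populated, so a real encoding would need some way to tell an empty string from an absent one.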

@Gabriella439
Contributor

What will the decoder do if both columns are present (i.e. "1,yo")?

@johnynek
Collaborator Author

@Gabriel439 It would be undefined, or it could throw. This is the question of what to do when we get bad data. Sometimes we can't detect bad data; sometimes we can, but validating all data would be expensive when corruption is assumed to be rare; and sometimes we want correctness so much that we will pay to check even if errors occur at a vanishingly small rate.

It's not clear what you would want here. Thrift is a similar case: with CompactThrift you can misparse quite a lot of data (and perhaps not notice that it is corrupted).
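To make the "it could throw" option concrete, here is one possible strict decoder for the two-column Either[Int, String] text layout. It is only a sketch: `SideBySideDecoding` and `decodeStrict` are hypothetical names, it reports bad rows as an error value rather than throwing, and it assumes the String side never contains a comma.

```scala
import scala.util.Try

// Hypothetical strict decoder for rows like "1," and ",yo"; not scalding code.
object SideBySideDecoding {
  // Outer Left carries an error message; outer Right carries the decoded value.
  def decodeStrict(line: String): Either[String, Either[Int, String]] =
    // A limit of -1 keeps trailing empty strings, so "1," splits into Array("1", "").
    line.split(",", -1) match {
      case Array("", "") => Left(s"neither column is populated: '$line'")
      case Array(l, "") =>
        Try(l.toInt).toOption match {
          case Some(i) => Right(Left(i))
          case None    => Left(s"left column is not an Int: '$line'")
        }
      case Array("", r) => Right(Right(r))
      case Array(_, _)  => Left(s"both columns are populated: '$line'")
      case _            => Left(s"expected exactly two columns: '$line'")
    }
}
```

With this choice, `decodeStrict("1,yo")` returns an error instead of silently picking a side. A cheaper, lenient decoder could skip the check and, say, always prefer the first populated column, which is the trade-off between validation cost and correctness discussed above.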
