TypeDescriptor could support Either #1368
Comments
Why not take |
Consider: Either[Int, (Int, Long)]
We care about not only the number of columns but also the type of those columns. So
Id: Char, A: Int, B: AnyRef
And encode:
In this case we could have just used (Int, Long) for both and put junk Long
Now we need columns like:
Id: Char, A: AnyRef, B: Long
Even if we take the idea of putting in a junk long. Here the common
Since encoded null is cheaper than 0L potentially, especially for text
On Thursday, July 16, 2015, Sidharth Kapur notifications@github.com wrote:
Oscar Boykin :: @posco :: http://twitter.com/posco |
Actually the better way, since the goal here is probably to use scalding data in another system, would be to add columns for all of Either[Int, String]
Left(1) => "1,"
Right("yo") => ",yo"
As long as one or the other side is |
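A minimal sketch of the two examples above, assuming one text column per side with the absent side left empty. The object name `EitherText` is hypothetical and not part of Scalding; the sketch also assumes the String side is non-empty and contains no comma.

```scala
// Hypothetical helper, not a Scalding API.
object EitherText {
  // One comma-separated column per side; the side that is absent stays empty.
  // Assumption: the Right value is non-empty and contains no comma.
  def encode(e: Either[Int, String]): String = e match {
    case Left(i)  => s"$i,"
    case Right(s) => s",$s"
  }
}
```

Because every row has the same two columns, another system can read the output without knowing about a tag column.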
What will the decoder do if both columns are present (i.e. |
@Gabriel439 Undefined, or it could throw. This is the question of what to do when we get bad data. Sometimes we can't detect bad data; sometimes we can, but validating all data would be expensive when corruption is assumed to be rare; and sometimes we want correctness so much that we will pay to check even if errors occur at a vanishing rate. It's not clear what you would want here. Thrift is a similar case: with CompactThrift you can misparse quite a lot of data (and perhaps not notice you have corrupted data).
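The two options mentioned here (throw, or leave the result unspecified) could look roughly like this for the "1," / ",yo" text format from the earlier comment. This is only a sketch: `EitherTextDecoder`, `decodeStrict`, and `decodeLenient` are hypothetical names, and letting the left column win in the lenient case is an arbitrary choice made for illustration.

```scala
// Hypothetical decoders, not a Scalding API.
object EitherTextDecoder {
  // Strict: pay for validation and reject rows where both or neither column is filled.
  def decodeStrict(line: String): Either[Int, String] =
    line.split(",", -1) match {
      case Array(l, "") if l.nonEmpty => Left(l.toInt)
      case Array("", r) if r.nonEmpty => Right(r)
      case _ =>
        throw new IllegalArgumentException(s"malformed Either row: $line")
    }

  // Lenient: no validation. On a corrupt row with both columns filled,
  // the left column silently wins (an arbitrary, illustrative choice).
  def decodeLenient(line: String): Either[Int, String] = {
    val idx = line.indexOf(',')
    if (idx > 0) Left(line.substring(0, idx).toInt)
    else Right(line.substring(idx + 1))
  }
}
```

The strict version catches a row like "1,yo" at the cost of checking every row; the lenient one is cheaper but can misparse corrupt data without noticing.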
We have product types (tuples, case classes) in TypeDescriptor, but not sum types (e.g. Either). We could support Either in the following way:
If we have Either[L, R] and we have support for L and R, then we take max(fields[L].size, fields[R].size) as the size and null-pad the smaller one. Then we add an extra column to store either "L" or "R" (or any fixed token) to tell which side comes next.
Then, for the types, if the flattened position has the same type on both L and R, use that type; otherwise fall back to AnyRef.
This may be a bit academic, since I'm not sure anyone has ever wanted to do this. Also, the more complex the encoding, the less portable it is, and portability is a main value of the Text encoding.
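As a rough illustration of the proposal, here is a hand-written sketch for the concrete type Either[Int, (Int, Long)]. It does not use or extend the real TypeDescriptor machinery; `EitherColumns`, `Row`, `width`, `encode`, and `decode` are hypothetical names, and rows are modeled as plain Vector[Any] with null as the padding value.

```scala
// Hypothetical sketch, not Scalding's TypeDescriptor API.
object EitherColumns {
  // A flattened row: tag column first, then the padded value columns.
  type Row = Vector[Any]

  // Shared column count: max(fields[L].size, fields[R].size).
  def width(leftArity: Int, rightArity: Int): Int =
    math.max(leftArity, rightArity)

  // L = Int has 1 field, R = (Int, Long) has 2, so every row carries 2 value columns.
  // Position 0 is Int on both sides; position 1 is absent on the left, so it
  // must admit null (AnyRef in the proposal).
  def encode(e: Either[Int, (Int, Long)]): Row = {
    val w = width(1, 2)
    e match {
      case Left(i)       => "L" +: Vector[Any](i).padTo(w, null)
      case Right((i, l)) => "R" +: Vector[Any](i, l).padTo(w, null)
    }
  }

  // The tag column tells the decoder which side to rebuild.
  def decode(row: Row): Either[Int, (Int, Long)] = row match {
    case Vector("L", i: Int, _)       => Left(i)
    case Vector("R", i: Int, l: Long) => Right((i, l))
    case other =>
      throw new IllegalArgumentException(s"malformed Either row: $other")
  }
}
```

For example, `EitherColumns.encode(Left(1))` yields a three-column row with the tag "L", the value 1, and a null pad, and `decode` round-trips both sides.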