Fix schema/type inference issue #216 #244
Conversation
Thanks a lot for submitting this. The code looks good. Would you add another unit test that verifies it works end-to-end? Maybe use the input you provided in the issue.
This reverts commit 9be0313.
Added an end-to-end test case to validate the proper schema/type inference. @falaki Take a look and merge if everything is fine.
@@ -52,4 +83,14 @@ class InferSchemaSuite extends FunSuite {
      Array(LongType)).deep == Array(DoubleType).deep)
  }

  test("Type/Schema inference works as expected for the simple sparse dataset.")
Hm.. shouldn't this go to CsvSuite, and shouldn't we remove `with BeforeAndAfterAll`, `beforeAll()` and `afterAll()`, as this test is an end-to-end test?
I agree that CsvSuite performs all end-to-end tests, but since there was a dedicated suite for schema inference, I preferred to put the schema tests there. Do you see any issue with that?
Reverted the refactoring part and kept the tests as they are. If more people feel CsvSuite is the right place for these tests, I will make that change.
@HyukjinKwon Created #248 for the refactoring part.
@@ -40,6 +63,14 @@ class InferSchemaSuite extends FunSuite {
    assert(InferSchema.inferField(LongType, "2015-08 14:49:00") == StringType)
  }

  test("Merging NullTypes should yield NullType.") {
    assert(
Nit: the indent is off:

    assert(
      InferSchema.mergeRowTypes(Array(NullType),
        Array(NullType)).deep == Array(NullType).deep)
@tanwanirahul Left some more comments. Once you address them I am going to merge this. After that, would you mind submitting a patch to Spark? spark-csv is now inlined in Spark 2.0.
@HyukjinKwon Moved the end-to-end tests to CsvSuite; @falaki also recommended the same. @falaki Take a look at CsvSuite for the test case and at InferSchemaSuite for the indent comments you left. Also let me know your thoughts on #248. I will push the patch to the Spark repo soon.
@@ -717,6 +717,16 @@ abstract class AbstractCsvSuite extends FunSuite with BeforeAndAfterAll {
    assert(results.size === numCars)
  }

  test("Type/Schema inference works as expected for the simple sparse dataset.") {
I think this is ignorable though.. I believe this `{` might better move up to the previous line.
@HyukjinKwon @falaki Is there any Eclipse formatting profile that I could load and apply to the code I am writing?
@falaki Could we re-trigger the build for this commit? I don't think the build failed because of a code issue.
@falaki @HyukjinKwon Could you please look at what is causing the build to fail? Looking at the build error, I don't think it's a code issue.
@falaki Could you please suggest the next steps for this? Also, is there a date planned for the next release? We are currently using this change by adding it as an unmanaged dependency.
Would you please try that combination of Spark and OpenJDK versions to investigate what goes wrong?
Let me take a look when I have some time.
@falaki Hm.. it looks like this failed due to cached artifacts in Travis CI. This works fine both on my local machine and on Travis with a new build here. I will create a new PR for this. Would you please make the author @tanwanirahul when you merge that PR by
Can we maybe close this? |
The issue with the current implementation is that it bumps up NullType to StringType in the reduce phase of each partition. When the final reduce across partitions takes place, the column type is inferred as StringType even though the column has only one integer value, because of this incorrect bumping up of NullTypes.
In this pull request, we do not bump up NullType to StringType as part of the reduce operation (either within or across partitions). Only after the reduction, when we create the fields of the StructType, do we convert any remaining NullTypes to StringTypes.
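The merge behavior described above can be sketched in isolation. This is a hedged, self-contained illustration, not the actual spark-csv code: the type names mirror Spark SQL's, but `SchemaSketch`, `mergeType`, and `toSchemaTypes` are hypothetical names introduced here for clarity.

```scala
// Minimal stand-ins for Spark SQL's data types (illustration only).
sealed trait DataType
case object NullType extends DataType
case object IntegerType extends DataType
case object StringType extends DataType

object SchemaSketch {
  // Merge two inferred column types during the reduce.
  // The fix: NullType merged with a concrete type yields the concrete type,
  // and NullType with NullType stays NullType -- no early bump to StringType.
  def mergeType(a: DataType, b: DataType): DataType = (a, b) match {
    case (NullType, t)            => t
    case (t, NullType)            => t
    case (t1, t2) if t1 == t2     => t1
    case _                        => StringType // incompatible concrete types
  }

  // Only after the final reduce, when building the StructType fields, is a
  // column that stayed NullType (i.e. was empty everywhere) mapped to String.
  def toSchemaTypes(columnTypes: Seq[DataType]): Seq[DataType] =
    columnTypes.map { case NullType => StringType; case t => t }
}
```

With this ordering, reducing partition results `[NullType, IntegerType]` and `[IntegerType, NullType]` gives `[IntegerType, IntegerType]`, whereas the old behavior would have forced both columns to StringType.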