Fix schema/type inference issue #216 #244

tanwanirahul · 2016-01-29T20:03:23Z

The issue with the current implementation is that it promotes NullType to StringType during the reduce phase of each partition. When the final reduce across partitions takes place, the column type is inferred as StringType, even though the only value in the column is an integer, because the NullTypes were promoted too early.

In this pull request, the reduce operation (within and across partitions) no longer promotes NullType to StringType. Only after the reduction, when we create the fields of the StructType, do we convert any remaining NullTypes to StringTypes.
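
As a rough illustration of the description above, here is a minimal sketch that runs without Spark. Everything in it is a hypothetical stand-in: `ColType`, `NullT`, `IntT`, `StringT`, `mergeOld`, `mergeNew`, `finalise` and `infer` are not spark-csv names, and `mergeOld` only approximates the old behaviour by assuming that a merge of two nulls yields a string type.

```scala
// Hypothetical stand-ins for Spark SQL data types, so the sketch runs without Spark.
object NullMergeSketch {
  sealed trait ColType
  case object NullT extends ColType
  case object IntT extends ColType
  case object StringT extends ColType

  // Assumed old behaviour: merging two nulls already yields a string type.
  def mergeOld(a: ColType, b: ColType): ColType = (a, b) match {
    case (NullT, NullT)   => StringT
    case (NullT, t)       => t
    case (t, NullT)       => t
    case (x, y) if x == y => x
    case _                => StringT
  }

  // Behaviour described in this pull: a null survives the merge unchanged.
  def mergeNew(a: ColType, b: ColType): ColType = (a, b) match {
    case (NullT, t)       => t
    case (t, NullT)       => t
    case (x, y) if x == y => x
    case _                => StringT
  }

  // Applied once, after every partition has been reduced.
  def finalise(t: ColType): ColType = if (t == NullT) StringT else t

  // Reduce within each (non-empty) partition, then across the partition results.
  def infer(partitions: Seq[Seq[ColType]], merge: (ColType, ColType) => ColType): ColType =
    finalise(partitions.map(_.reduce(merge)).reduce(merge))
}
```

With one all-null partition and one partition holding a single integer, `infer(..., mergeOld)` ends up with the string stand-in, while `infer(..., mergeNew)` keeps the integer stand-in; a column that is null everywhere still falls back to the string stand-in through `finalise`.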

codecov-io · 2016-01-29T20:03:30Z

Current coverage is `85.74%`

Merging #244 into master will decrease coverage by 0.33% as of 8979a5a

@@            master    #244   diff @@
======================================
  Files           12      12       
  Stmts          517     519     +2
  Branches       149     149       
  Methods          0       0       
======================================
  Hit            445     445       
  Partial          0       0       
- Missed          72      74     +2

Review entire Coverage Diff as of 8979a5a

Powered by Codecov. Updated on successful CI builds.

falaki · 2016-01-29T23:34:39Z

Thanks a lot for submitting this. The code looks good. Would you add another unit test that verifies it works end-to-end? Maybe use the input you provided in the issue:

A,B,C,D
1,,,
,1,,
,,1,
,,,1
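
To make the expectation for this input concrete, here is a small Spark-free sketch of column-wise inference over the rows above. The names (`SparseInferenceSketch`, `inferField`, `merge`, `inferSchema`, and the `ColType` stand-ins) are hypothetical and only model the idea; they are not the spark-csv API, and the field inference is deliberately reduced to "empty, integer, or string".

```scala
// Hypothetical toy model of per-column type inference; not the spark-csv implementation.
object SparseInferenceSketch {
  sealed trait ColType
  case object NullT extends ColType
  case object IntT extends ColType
  case object StringT extends ColType

  // Empty cell -> NullT, integer literal -> IntT, anything else -> StringT.
  def inferField(cell: String): ColType =
    if (cell.isEmpty) NullT
    else if (scala.util.Try(cell.toInt).isSuccess) IntT
    else StringT

  // Nulls do not influence the merged type.
  def merge(a: ColType, b: ColType): ColType = (a, b) match {
    case (NullT, t)       => t
    case (t, NullT)       => t
    case (x, y) if x == y => x
    case _                => StringT
  }

  // First line is the header; expects at least one data row.
  def inferSchema(lines: Seq[String]): Seq[(String, ColType)] = {
    val header = lines.head.split(",", -1).toSeq // limit -1 keeps trailing empty cells
    val rowTypes = lines.tail.map(_.split(",", -1).map(inferField).toSeq)
    val merged = rowTypes.reduce { (r1: Seq[ColType], r2: Seq[ColType]) =>
      r1.zip(r2).map { case (a, b) => merge(a, b) }
    }
    // Only at the very end are leftover nulls turned into strings.
    header.zip(merged).map { case (name, t) => (name, if (t == NullT) StringT else t) }
  }
}
```

Calling `SparseInferenceSketch.inferSchema(Seq("A,B,C,D", "1,,,", ",1,,", ",,1,", ",,,1"))` gives the integer stand-in for each of the four columns in this toy model, which is the outcome the end-to-end test is meant to check.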

tanwanirahul · 2016-01-30T17:17:38Z

Added an end-to-end test case to validate the schema/type inference. @falaki Take a look and merge if everything is fine.

HyukjinKwon · 2016-02-01T01:18:08Z

src/test/scala/com/databricks/spark/csv/util/InferSchemaSuite.scala

@@ -52,4 +83,14 @@ class InferSchemaSuite extends FunSuite {
      Array(LongType)).deep == Array(DoubleType).deep)
  }

+  test("Type/Schema inference works as expected for the simple parse dataset.")


Hm.. Shouldn't this go to CsvSuite, removing with BeforeAndAfterAll, beforeAll() and afterAll(), as this is an end-to-end test?

I agree that CsvSuite holds all the end-to-end tests, but since there is a dedicated suite for schema inference, I preferred to put the schema tests there. Do you see any issues with that?

No, I think it is okay. I only mentioned it because I have seen suites being added this way.

For example, #235 and #224

Reverted the refactoring part and kept the tests as they are. If more people feel CsvSuite is the right place for these tests, I will make that change.

tanwanirahul · 2016-02-01T10:22:24Z

@HyukjinKwon Created #248 for the refactoring part.

falaki · 2016-02-03T18:51:05Z

src/test/scala/com/databricks/spark/csv/util/InferSchemaSuite.scala

@@ -40,6 +63,14 @@ class InferSchemaSuite extends FunSuite {
    assert(InferSchema.inferField(LongType, "2015-08 14:49:00") == StringType)
  }

+  test("Merging Nulltypes should yeild Nulltype.")
+  {
+      assert(


Nit: the indent is off:

assert( InferSchema.mergeRowTypes(Array(NullType), Array(NullType)).deep == Array(NullType).deep)

falaki · 2016-02-03T19:23:44Z

@tanwanirahul I left some more comments. Once you address them, I am going to merge this. After that, would you mind submitting a patch to Spark? Spark-csv is now inlined in Spark 2.0.

tanwanirahul · 2016-02-04T11:24:18Z

@HyukjinKwon Moved the end-to-end tests to CsvSuite; @falaki also recommended the same.

@falaki Take a look at CsvSuite for the test case and at InferSchemaSuite for the indentation comments you left. Also let me know your thoughts on #248.

I will push the patch to the Spark repo soon.

HyukjinKwon · 2016-02-04T11:40:36Z

src/test/scala/com/databricks/spark/csv/CsvSuite.scala

@@ -717,6 +717,16 @@ abstract class AbstractCsvSuite extends FunSuite with BeforeAndAfterAll {

    assert(results.size === numCars)
  }
+
+  test("Type/Schema inference works as expected for the simple sparse dataset.")
+  {


I think this is ignorable, but I believe this { would be better moved up to the previous line.

tanwanirahul · 2016-02-04T12:13:36Z

@HyukjinKwon @falaki Is there an Eclipse formatting profile that I could load and apply to the code I am writing?

tanwanirahul · 2016-02-04T13:21:23Z

@falaki Could we re-run the build for this commit? I don't think the build failed because of a code issue.

tanwanirahul · 2016-02-06T09:03:33Z

@falaki @HyukjinKwon Could you please look at what is causing the build to fail? Looking at the build error, I don't think it's a code issue.

tanwanirahul · 2016-02-10T19:44:01Z

@falaki Could you please suggest next steps for this? Also, is there any date planned for the next release? We are currently using this change by adding an unmanaged dependency.

falaki · 2016-02-10T21:16:37Z

Would you please try that combination of spark and openjdk version to investigate what goes wrong?

HyukjinKwon · 2016-02-11T03:52:52Z

Let me take a look when I have some time.

HyukjinKwon · 2016-02-11T06:18:40Z

@falaki Hm.. It looks like this failed due to cached files on Travis CI. This works fine both locally and on Travis with a new build here.

I will create a new PR for this. Would you please set @tanwanirahul as the author when you merge that PR with merge_pr.py?

HyukjinKwon · 2016-02-11T07:07:09Z

Can we maybe close this?

#244 This is re-opened due to the build failure in travis.
Author: Rahul Tanwani <tanwanirahul@gmail.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #261 from HyukjinKwon/ISSUE-216.

tanwanirahul added 2 commits January 29, 2016 19:55

Fix schema/type inference issue #216

a275120

Add testcases for #216

8b4dbdd

tanwanirahul added 8 commits January 30, 2016 14:31

Adding a simple parse dataset for testing type/schema inference #216

b150c55

Bit of refactoring

9be0313

Adding an end to end test case for type inference of simple parse dat…

ff96172

…aset #216

Revert "Bit of refactoring"

d36b4a8

This reverts commit 9be0313.

Fix scalastyle issue - max length 100

c20b852

Bit of refactoring

18957e2

Fix scalastyle issue - max length 100

d41d5fb

Fix scalastyle issue - max length 100

9a1f428

HyukjinKwon reviewed Feb 1, 2016

tanwanirahul added 2 commits February 1, 2016 06:38

Fix type

80fbac4

Reverting the refactoring part

bb510a6

tanwanirahul mentioned this pull request Feb 1, 2016

Refactoring CsvParser. #248

Open

falaki reviewed Feb 3, 2016

tanwanirahul added 6 commits February 4, 2016 09:51

Move end to end test case to CSV suite; Fix indentation comments

2c24965

Merge branch 'master' of github.com:tanwanirahul/spark-csv

d6424ef

Resolving merge conflicts

890fadd

Resolving merge conflicts

52cd74f

Fix indentation issue

6f90a6b

Revert build.sbt changes

3d07d82

HyukjinKwon reviewed Feb 4, 2016

Fix code style

718b467

Indentation

5ef29b8

Fix indentation issue

6771184

HyukjinKwon mentioned this pull request Feb 11, 2016

Fix schema/type inference issue #261

Closed

tanwanirahul closed this Feb 11, 2016
