Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix schema/type inference issue #216 #244

Closed
wants to merge 21 commits into from

Conversation

tanwanirahul
Copy link
Contributor

The issue with current implementation is that, it bumps up the NullType to StringType in reduce phase of each partition. When the final reduce across partitions takes place the column type is inferred as StringType even though the column has only one integer value because of incorrect bumping up of NullTypes.

Here, in this pull, we don't bump up the Null to String as a part of reduce operation (within and across partitions). After the reduction, when we create Fields of StructType, we convert NullTypes to StringTypes.

@codecov-io
Copy link

Current coverage is 85.74%

Merging #244 into master will decrease coverage by -0.33% as of 8979a5a

@@            master    #244   diff @@
======================================
  Files           12      12       
  Stmts          517     519     +2
  Branches       149     149       
  Methods          0       0       
======================================
  Hit            445     445       
  Partial          0       0       
- Missed          72      74     +2

Review entire Coverage Diff as of 8979a5a

Powered by Codecov. Updated on successful CI builds.

@falaki
Copy link
Member

falaki commented Jan 29, 2016

Thanks a lot for submitting this. The code looks good. Would you add another unit test that verify it works end-to-end. Maybe use the input you provided in the issue:

A,B,C,D
1,,,
,1,,
,,1,
,,,1

@tanwanirahul
Copy link
Contributor Author

Added an end to end test case to validate the proper schema/type inference. @falaki Take a look and merge if everything is fine.

@@ -52,4 +83,14 @@ class InferSchemaSuite extends FunSuite {
Array(LongType)).deep == Array(DoubleType).deep)
}

test("Type/Schema inference works as expected for the simple parse dataset.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. Shouldn't this go to CsvSuite and remove with BeforeAndAfterAll ,beforeAll() and afterAll() as this test is a end-to-end test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that CsvSuite performs all end to end tests but since there was a dedicated suite for SchemaInference, I did prefer to put the schema tests in there. Do you see any issues in that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think it is okay though. I just said this because I see suites have been added in this way.

For example, #235 and #224

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the refactoring part. Tests I kept as is. If more people feel CsvSuite would rather be the right place to put these tests in, I will make that change.

@tanwanirahul
Copy link
Contributor Author

@HyukjinKwon Created #248 for the refactoring part.

@@ -40,6 +63,14 @@ class InferSchemaSuite extends FunSuite {
assert(InferSchema.inferField(LongType, "2015-08 14:49:00") == StringType)
}

test("Merging Nulltypes should yeild Nulltype.")
{
assert(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the indent is off:

assert(
  InferSchema.mergeRowTypes(Array(NullType),
  Array(NullType)).deep == Array(NullType).deep)

@falaki
Copy link
Member

falaki commented Feb 3, 2016

@tanwanirahul left some more comments. Once you address I am going to merge this. After that would you mind submitting a patch to spark. Spark-csv is now inlined in Spark 2.0.

@tanwanirahul
Copy link
Contributor Author

@HyukjinKwon Moved end to end tests to CSVSuite, @falaki Also recommended the same.

@falaki Take a look at the CSVSuite for test case and InferSchemaSuite for Indent comments you left. Also let me know your thoughts on #248

I will push the patch to spark repo soon.

@@ -717,6 +717,16 @@ abstract class AbstractCsvSuite extends FunSuite with BeforeAndAfterAll {

assert(results.size === numCars)
}

test("Type/Schema inference works as expected for the simple sparse dataset.")
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ignoerable though.. I believe this { might better move up to the previous line.

@tanwanirahul
Copy link
Contributor Author

@HyukjinKwon @falaki Is there any eclipse formatting profile that I could load and apply it for the code I am writing?

@tanwanirahul
Copy link
Contributor Author

@falaki Could we re-build the commit? I don't think build failed because of the code issue.

@tanwanirahul
Copy link
Contributor Author

@falaki @HyukjinKwon Could you please look at what is causing the build to fail? Looking at the build error, I don't think its code issue.

@tanwanirahul
Copy link
Contributor Author

@falaki Could you please suggest next steps for this? Also, is there any date planned for the next release? We are currently using this change by adding an unmanaged dependency.

@falaki
Copy link
Member

falaki commented Feb 10, 2016

Would you please try that combination of spark and openjdk version to investigate what goes wrong?

@HyukjinKwon
Copy link
Member

Let me take a look when I have some time.

@HyukjinKwon
Copy link
Member

@falaki Hm.. It looks like this was failed due to the cached ones in travis.ci. This works fine both on my local and travis with a new build here.

I will create a new PR for this. Would you please make the author @tanwanirahul when you merge that PR by merge_pr.py?

@HyukjinKwon
Copy link
Member

Can we maybe close this?

falaki pushed a commit that referenced this pull request Feb 13, 2016
#244
This is re-opened due to the build failure in travis.

Author: Rahul Tanwani <tanwanirahul@gmail.com>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #261 from HyukjinKwon/ISSUE-216.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants