Add a Note on jsonFile having separate JSON objects per line #3517

petervandenabeele · 2014-11-30T16:53:19Z

This commit hopes to avoid the confusion I faced when trying
to submit a regular, valid multi-line JSON file. See also:

http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html

AmplabJenkins · 2014-11-30T16:57:11Z

Can one of the admins verify this patch?

JoshRosen · 2014-12-01T03:39:00Z

Instead of adding this note, what do you think about changing the existing documentation to not say "JSON file" (since that brings along confusing connotations)? How about something like this:

`jsonFile` - loads data from a directory of text files where each line of the files is a JSON object.
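To make the proposed wording concrete, here is a minimal Python sketch of what "each line of the files is a JSON object" means. It only illustrates the input format and is not how Spark parses the data; the helper name `parse_json_lines` and the sample records are assumptions made for illustration.

```python
import json

# Illustrative input: every line is one complete, self-contained JSON object.
SAMPLE = (
    '{"name": "Tom", "character": "cat"}\n'
    '{"name": "Jerry", "character": "mouse"}\n'
)


def parse_json_lines(text):
    """Parse text in which every non-empty line is one JSON object (hypothetical helper)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(SAMPLE)
```

Each line is parsed independently of the others, which is why a record that spans several lines cannot be read this way.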

petervandenabeele · 2014-12-01T17:27:26Z

@JoshRosen Good idea. Interestingly, the existing text already says a bit lower:

// The path can be either a single text file or a directory storing text files.      
val path = "examples/src/main/resources/people.json"

I would suggest to then also rename the example file to

val path = "examples/src/main/resources/people.txt"

to make clear it is not really a .json file.
I will think about it and may submit a next version of the patch
(which will result in a smaller diff then).

Would it not be better to start a new branch (pv-docs-note-on-jsonFile-format/02)
that I rebase on current master and that only has the actual change (and not the
initial change that was too verbose)?

JoshRosen · 2014-12-01T17:30:51Z

> Would it not be better to start a new branch (pv-docs-note-on-jsonFile-format/02) that I rebase on current master and that only has the actual change (and not the initial change that was too verbose)?

That isn't necessary; when we merge pull requests, we use a script which squashes all commits in the PR down to a single combined commit, so it's fine to have many intermediate commits on this pull request's branch. I'd actually prefer it if you pushed your new commit to this branch so that the discussion can stay on the same PR / page.

* remove the long Note
* rename the example file to `people.txt`
* inspired by feedback from @JoshRosen

JoshRosen · 2014-12-02T19:44:39Z

/cc @marmbrus, since this is a SQL change.

petervandenabeele · 2014-12-02T19:51:51Z

Thx @JoshRosen for your follow-up.

I locally verified a squashed version of my 2 commits. The squashed change is now very limited, affecting 6 lines with a replace of (JSON)|(json) by txt.

I hope it avoids the confusion I faced in trying to feed a genuine "json" file to sqlContext.jsonFile(path).

marmbrus · 2014-12-04T22:17:52Z

docs/sql-programming-guide.md

@@ -621,7 +621,7 @@ val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 // A JSON dataset is pointed to by path.
 // The path can be either a single text file or a directory storing text files.
-val path = "examples/src/main/resources/people.json"
+val path = "examples/src/main/resources/people.txt"


We need to move the file too and update the other places that reference it:

examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java:    String path = "examples/src/main/resources/people.json";
examples/src/main/python/sql.py:    path = os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")

marmbrus · 2014-12-04T22:18:30Z

LGTM once my comment is addressed. Thanks!

JoshRosen · 2014-12-04T22:19:41Z

One thought: will the changed example file name / location be confusing for people reading documentation versions that don't match their Spark version?

marmbrus · 2014-12-05T03:41:15Z

Hmm, that is a good point. I have used this in quite a few presentations as well. Perhaps we can just change the error that gets printed when we encounter data that we can't parse?
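As a rough illustration of the kind of error message this could produce, the Python sketch below wraps a per-line parse failure in a hint about the expected format. It is not Spark code; the function name `parse_line_with_hint` and the message wording are hypothetical.

```python
import json


def parse_line_with_hint(line, line_number):
    """Parse one line as a JSON object, adding a format hint on failure (hypothetical helper)."""
    try:
        return json.loads(line)
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        raise ValueError(
            "line %d could not be parsed as JSON (%s); "
            "the input is expected to contain one JSON object per line, "
            "not a multi-line JSON document" % (line_number, err)
        ) from err
```

Feeding it the first line of a pretty-printed document, such as `[`, raises a `ValueError` that names the offending line and states the expected format.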

petervandenabeele · 2014-12-05T20:05:30Z

More problematic (and sorry I had not seen that before) ... there already is an example file named people.txt with a different format:

$ spark git:(pv-docs-note-on-jsonFile-format/01) cat examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19

In that case, I could rename the example jsonFile to people.jsons. It is a weird name, but it's reasonably accurate (following the xs pattern from Scala, as it is like a list of json objects).

I would then indeed also need to change the name in all other locations where a reference to people.json is made (confirming the list mentioned by @marmbrus):

spark git:(pv-docs-note-on-jsonFile-format/01) grep -r 'people\.json' * | grep -v Binary | grep -v _site     
examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java:    String path = "examples/src/main/resources/people.json";
examples/src/main/python/sql.py:    path = os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")

On a more fundamental note: seen from the outside, it would have followed the "principle of least astonishment" (POLA) if the input to this function were a standard, valid JSON file formatted as an array of hashes with an identical "schema", e.g.:

[
  {"name": "Tom",
   "character":"cat"},
  {"name":"Jerry",
   "character":"mouse"}
]

This would have allowed us to simply import data generated from any other language with array.to_json.
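Until something like that is supported, a standard JSON array can be converted to the one-object-per-line layout before loading it. A minimal Python sketch, assuming the whole array fits in memory; the helper name `array_to_json_lines` is hypothetical and not part of Spark:

```python
import json


def array_to_json_lines(array_text):
    """Convert a JSON array of objects into text with one JSON object per line."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without an indent argument emits each record on a single line.
    return "\n".join(json.dumps(record) for record in records) + "\n"


ARRAY_TEXT = """[
  {"name": "Tom",
   "character":"cat"},
  {"name":"Jerry",
   "character":"mouse"}
]"""

json_lines = array_to_json_lines(ARRAY_TEXT)
```

The resulting text can be written to a file and passed to `jsonFile` as `path`.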

I hear the proposal from @marmbrus to also improve the error message (that would also have helped us understand the issue more quickly), but I would suggest putting that in a separate JIRA issue (it needs some real programming and testing work).

I look forward to directions on how to best fix at least the documentation to avoid this confusion for others.

Thanks.

petervandenabeele · 2014-12-14T18:17:55Z

Bump ...

I suggest we revert to something close to my original proposal:

* no change in filenames (too complex for now)
* add a small(er) note in the doc about the non-standard format

In our DataScienceBe project, I just got this message from a new Spark user:

"to reiterate (and make sure I understand correctly), the jsonFile function does not read valid JSON files, but rather special files containing a valid JSON object on each line."

Just making this clear to the users will already avoid some frustration.

Could you please confirm that I can go ahead with this proposal (or suggest a different path to resolve this)?

marmbrus · 2014-12-14T18:46:39Z

Sure, I'm happy with clarifications to the documentation.

petervandenabeele · 2014-12-15T17:37:46Z

I committed a revert that limits the squashed diff to a small addition of a Note for the 3 tabs of Scala, Java and Python.

If anything more needs to happen, glad to look into it.

Is no rebase required? I could do it in a separate PR if useful.

* This commit hopes to avoid the confusion I faced when trying to submit a regular, valid multi-line JSON file, also see http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html

Author: Peter Vandenabeele <peter@vandenabeele.com>

Closes #3517 from petervandenabeele/pv-docs-note-on-jsonFile-format/01 and squashes the following commits:

1f98e52 [Peter Vandenabeele] Revert to people.json and simple Note text
6b6e062 [Peter Vandenabeele] Change the "JSON" connotation to "txt"
fca7dfb [Peter Vandenabeele] Add a Note on jsonFile having separate JSON objects per line

(cherry picked from commit 1a9e35e)
Signed-off-by: Michael Armbrust <michael@databricks.com>

marmbrus · 2014-12-16T22:03:53Z

Thanks! Merged to master and 1.2.

BTW, in general there is no need to rebase or anything. Our script for merging PRs will always squash to a single linear commit.

asfgit closed this in 1a9e35e Dec 16, 2014
