Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Note on jsonFile having separate JSON objects per line #3517

Conversation

petervandenabeele
Copy link
Contributor

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@JoshRosen
Copy link
Contributor

Instead of adding this note, what do you think about changing the existing documentation to not say "JSON file" (since that brings along confusing connotations)? How about something like this:

`jsonFile` - loads data from a directory of text files where each line of the files is a JSON object.

@petervandenabeele
Copy link
Contributor Author

@JoshRosen Good idea. Interestingly, the existing text already says a bit lower:

// The path can be either a single text file or a directory storing text files.      
val path = "examples/src/main/resources/people.json"

I would suggest to then also rename the example file to

val path = "examples/src/main/resources/people.txt"

to make clear it is not really a .json file.
I will think about it and may submit a next version of the patch
(which will result in a smaller diff then).

Would it not be better to start a new branch (pv-docs-note-on-jsonFile-format/02)
that I rebase of current master and only has the actual change (and not the
initial change that was too verbose) ?

@JoshRosen
Copy link
Contributor

Would it not be better to start a new branch (pv-docs-note-on-jsonFile-format/02) that I rebase of current master and only has the actual change (and not the initial change that was too verbose) ?

That isn't necessary; when we merge pull requests, we use a script which squashes all commits in the PR down to a single combined commit, so it's fine to have many intermediate commits on this pull request's branch. I'd actually prefer if it if you pushed your new commit to this branch so that the discussion can stay on the same PR / page.

* remove the long Note
* rename the example file to `people.txt`
* inspired by feedback from @JoshRosen
@JoshRosen
Copy link
Contributor

/cc @marmbrus, since this is a SQL change.

@petervandenabeele
Copy link
Contributor Author

Thx @JoshRosen for your follow-up.

I locally verified a squashed version of my 2 commits. The squashed change is now very limited, affecting 6 lines with a replace of (JSON)|(json) by txt.

I hope it avoids the confusion I faced in trying to feed a genuine "json" file to sqlContext.jsonFile(path).

@@ -621,7 +621,7 @@ val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
val path = "examples/src/main/resources/people.txt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to move the file too and update the other places that reference it:

examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java:    String path = "examples/src/main/resources/people.json";
examples/src/main/python/sql.py:    path = os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")

@marmbrus
Copy link
Contributor

marmbrus commented Dec 4, 2014

LGTM once my comment is addressed. Thanks!

@JoshRosen
Copy link
Contributor

One thought: will the changed example file name / location be confusing for people reading documentation versions that don't match their Spark version?

@marmbrus
Copy link
Contributor

marmbrus commented Dec 5, 2014

Hmm, that is a good point. I have used this in quite a few presentation as well. Perhaps we can just change the error that gets printed when we encounter data that we can't parse?

@petervandenabeele
Copy link
Contributor Author

More problematic (and sorry I had not seen that before) ... there already is an example file named people.txt with a different format:

$ spark git:(pv-docs-note-on-jsonFile-format/01) cat examples/src/main/resources/people.txt
Michael, 29
Andy, 30
Justin, 19

In that case, I could rename the example jsonFile to people.jsons. It is a weird name, but it's reasonably accurate (following the xs pattern from Scala, as it is like a list of json objects).

I would then indeed also need to change the name in all other locations where a reference to people.json is made (confirming the list mentioned by @marmbrus):

spark git:(pv-docs-note-on-jsonFile-format/01) grep -r 'people\.json' * | grep -v Binary | grep -v _site     
examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java:    String path = "examples/src/main/resources/people.json";
examples/src/main/python/sql.py:    path = os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")

On a more fundamental note, from the outside, I would have perceived it following the "principle of least astonishment" (POLA) if the import to this function required a standard valid json file that needs to be formatted as an array of hashes with identical "schema", like e.g.

[
  {"name": "Tom",
   "character":"cat"},
  {"name":"Jerry",
   "character":"mouse"}
]

This would have allowed us to simply import data generated from any other language with array.to_json.

I hear the proposal from @marmbrus to also improve the error message (that would also have helped us in more quickly understanding the issue), but it would suggest to put that in a different JIRA issue (that needs some real programming and testing work).

I look forward to directions on how to best fix at least the documentation to avoid this confusion for others.

Thanks.

@petervandenabeele
Copy link
Contributor Author

Bump ...

I suggest we revert to something close to my original proposal:

  • no change in filenames (too complex for now)
  • add a small(er) note in the doc about the non-standard format

In our DataScienceBe project, I just got this message from a new Spark user:

"to reitarate (and make sure I understand correctly), the jsonFilefunction does not read valid JSON files, but rather special files containing a valid JSON object on each line."

Just making this clear to the users will already avoid some frustration.

Could you please confirm that I can make this proposal (or a different path to resolve this).

@marmbrus
Copy link
Contributor

Sure, I'm happy with clarifications to the documentation.

@petervandenabeele
Copy link
Contributor Author

I committed a revert that limits the squashed diff to a small addition of a Note for the 3 tabs of Scala, Java and Python.

If anything more needs to happen, glad to look into it.

There is no rebase required ? I could do it in a separate PR if useful.

@asfgit asfgit closed this in 1a9e35e Dec 16, 2014
asfgit pushed a commit that referenced this pull request Dec 16, 2014
* This commit hopes to avoid the confusion I faced when trying
  to submit a regular, valid multi-line JSON file, also see

  http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html

Author: Peter Vandenabeele <peter@vandenabeele.com>

Closes #3517 from petervandenabeele/pv-docs-note-on-jsonFile-format/01 and squashes the following commits:

1f98e52 [Peter Vandenabeele] Revert to people.json and simple Note text
6b6e062 [Peter Vandenabeele] Change the "JSON" connotation to "txt"
fca7dfb [Peter Vandenabeele] Add a Note on jsonFile having separate JSON objects per line

(cherry picked from commit 1a9e35e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@marmbrus
Copy link
Contributor

Thanks! Merged to master and 1.2.

BTW, in general there is no need to rebase or anything. Our script for merging PRs will always squash to a single linear commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants