Add a Note on jsonFile having separate JSON objects per line #3517
* This commit hopes to avoid the confusion I faced when trying to submit a regular, valid multi-line JSON file, also see http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html
Can one of the admins verify this patch?
Instead of adding this note, what do you think about changing the existing documentation to not say "JSON file" (since that brings along confusing connotations)? How about something like this:
@JoshRosen Good idea. Interestingly, the existing text already says a bit lower:
I would then also suggest renaming the example file to
to make clear it is not really a .json file. Would it not be better to start a new branch (pv-docs-note-on-jsonFile-format/02)?
That isn't necessary; when we merge pull requests, we use a script that squashes all commits in the PR down to a single combined commit, so it's fine to have many intermediate commits on this pull request's branch. I'd actually prefer it if you pushed your new commit to this branch so that the discussion can stay on the same PR / page.
* remove the long Note
* rename the example file to `people.txt`
* inspired by feedback from @JoshRosen
/cc @marmbrus, since this is a SQL change.
Thx @JoshRosen for your follow-up. I locally verified a squashed version of my 2 commits. The squashed change is now very limited, affecting 6 lines with a replace of I hope it avoids the confusion I faced in trying to feed a genuine "json" file to |
@@ -621,7 +621,7 @@ val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 // A JSON dataset is pointed to by path.
 // The path can be either a single text file or a directory storing text files.
-val path = "examples/src/main/resources/people.json"
+val path = "examples/src/main/resources/people.txt"
We need to move the file too and update the other places that reference it:
examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java: String path = "examples/src/main/resources/people.json";
examples/src/main/python/sql.py: path = os.path.join(os.environ['SPARK_HOME'], "examples/src/main/resources/people.json")
LGTM once my comment is addressed. Thanks!
One thought: will the changed example file name / location be confusing for people reading documentation versions that don't match their Spark version?
Hmm, that is a good point. I have used this in quite a few presentations as well. Perhaps we can just change the error that gets printed when we encounter data that we can't parse?
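To illustrate what a more helpful parse error might look like, here is a minimal sketch in plain Python. It is not Spark's parser; `check_json_lines` is a hypothetical helper that only reports which lines fail to stand alone as a JSON object, which is the requirement this PR's Note describes.

```python
import json


def check_json_lines(text):
    """Return (line_number, message) pairs for lines that are not standalone JSON objects.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # this sketch simply ignores blank lines
        try:
            value = json.loads(line)
        except ValueError as error:  # json.JSONDecodeError subclasses ValueError
            problems.append((number, "not valid JSON on its own: %s" % error))
            continue
        if not isinstance(value, dict):
            problems.append((number, "valid JSON but not an object"))
    return problems


# A regular, pretty-printed JSON document: valid as a whole, but not line by line.
pretty = '{\n  "name": "Andy",\n  "age": 30\n}'
# The same kind of data with one self-contained object per line.
line_delimited = '{"name": "Andy", "age": 30}\n{"name": "Justin"}'

for number, message in check_json_lines(pretty):
    print("line %d: %s" % (number, message))
```

Reporting the offending line number, rather than only the underlying parser exception, is the design choice here: it points the user at the one-object-per-line requirement directly.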
More problematic (and sorry I had not seen that before) ... there already is an example file named
In that case, I could rename the example jsonFile to I would then indeed also need to change the name in all other locations where a reference to
On a more fundamental note, from the outside, I would have perceived it following the "principle of least astonishment" (POLA) if the import to this function required a standard valid json file that needs to be formatted as an array of hashes with identical "schema", like e.g.
This would have allowed us to simply import data generated from any other language with I hear the proposal from @marmbrus to also improve the error message (that would also have helped us understand the issue more quickly), but I would suggest putting that in a different JIRA issue (it needs some real programming and testing work). I look forward to directions on how best to fix at least the documentation to avoid this confusion for others. Thanks.
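For readers who already have a standard JSON file holding an array of objects, one workaround is to rewrite it with one object per line before loading it. The following is a minimal sketch using only Python's standard `json` module; `json_array_to_lines` and the sample records are illustrative assumptions, not part of Spark or of this PR.

```python
import json


def json_array_to_lines(array_text):
    """Convert the text of a JSON array of objects into one JSON object per line.

    Hypothetical helper for illustration only.
    """
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without an indent argument emits each record on a single line.
    return "\n".join(json.dumps(record) for record in records)


# Illustrative input: a regular multi-line JSON document.
sample = """[
  {"name": "Michael"},
  {"name": "Andy", "age": 30}
]"""

converted = json_array_to_lines(sample)
print(converted)
```

The converted text can then be written to a file and used as the path in the documented example, assuming the records share a compatible schema.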
Bump ... I suggest we revert to something close to my original proposal:
In our DataScienceBe project, I just got this message from a new Spark user: "to reiterate (and make sure I understand correctly), the Just making this clear to the users will already avoid some frustration. Could you please confirm that I can go ahead with this proposal (or suggest a different path to resolve this)?
Sure, I'm happy with clarifications to the documentation.
I committed a revert that limits the squashed diff to a small addition of a Note for the 3 tabs of Scala, Java and Python. If anything more needs to happen, I'm glad to look into it. Is a rebase required? I could do it in a separate PR if useful.
* This commit hopes to avoid the confusion I faced when trying to submit a regular, valid multi-line JSON file, also see http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html

Author: Peter Vandenabeele <peter@vandenabeele.com>

Closes #3517 from petervandenabeele/pv-docs-note-on-jsonFile-format/01 and squashes the following commits:

1f98e52 [Peter Vandenabeele] Revert to people.json and simple Note text
6b6e062 [Peter Vandenabeele] Change the "JSON" connotation to "txt"
fca7dfb [Peter Vandenabeele] Add a Note on jsonFile having separate JSON objects per line

(cherry picked from commit 1a9e35e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Thanks! Merged to master and 1.2. BTW, in general there is no need to rebase or anything. Our script for merging PRs will always squash to a single linear commit.