Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to preserve the original ordering of columns #75

Merged
merged 2 commits into from
Nov 11, 2021

Conversation

kdeggelman
Copy link
Contributor

Summary:

  • It's nice to have the option of keeping the columns in the same order
    as the input file for json and dict input formats.
  • On a related note, I'm not sure what BigQuery's autodetect functionality
    uses to determine order, but as of today it is not alphabetical.

Summary:
 - It's nice to have the option of keeping the columns in the same order
 as the input file for json and dict input formats.
 - On a related note, I'm not sure what BigQuery's `autodetect` functionality
 uses to determine order, but as of today it is not alphabetical.
@kdeggelman
Copy link
Contributor Author

@bxparks -- thanks for all your work on this tool, it's quickly becoming an integral part of our data pipeline.

I probably should have opened an issue to discuss this change before opening a pull request, but figured it was easy enough to make the code change and we can discuss from here.

Copy link
Owner

@bxparks bxparks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor formatting nits below.

I guess the bigger question is, does the sorting order matter for JSON input data? Because the order of keys in JSON file is undefined:

https://datatracker.ietf.org/doc/html/rfc7159#section-1
" An object is an unordered collection of zero or more name/value
pairs, where a name is a string and a value is a string, number,
boolean, null, object, or array."

So the BQ schema, as represented by a JSON file, should be considered identical no matter how the keys are ordered.

I guess I had sorted the keys when printing out the schema, because bq show --schema used to sort the keys. But if you are saying that it no longer sorts the keys, then what does it do? If the order is random, or a non-deterministic outcome of how the autodetect feature worked, then I am not sure I can replicate the exact order of those keys in this script. In other words, I'm not convinced that --preserve_input_sort_order will produce a JSON schema file that is identical to the one produced by bg show --schema.

@@ -1042,6 +1043,12 @@ def main():
' This can be fetched with:'
' `bq show --schema <project_id>:<dataset>:<table_name>',
default=None)
parser.add_argument(
'--keep-input-sort-order',
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use --preserve_input_sort_order using underscore for consistency with other flags.

@@ -113,7 +114,7 @@ def __init__(
# If CSV, preserve the original ordering because 'bq load` matches the
# CSV column with the respective schema entry using the position of the
# column in the schema.
self.sorted_schema = (input_format in {'json', 'dict'})
self.sorted_schema = (input_format in {'json', 'dict'}) and not keep_input_sort_order
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you run $ make flake8 at the root of the project, you'll see that it will complain about being longer than 80 columns. Break this into smaller lines using something like:

self.sorted_schema = (
  (input_format in {'json', 'dict'})
  and not keep_input_sort_order
)

@@ -1042,6 +1043,12 @@ def main():
' This can be fetched with:'
' `bq show --schema <project_id>:<dataset>:<table_name>',
default=None)
parser.add_argument(
'--keep-input-sort-order',
help='Preserve the original ordering of columns from input instead of sorting alphabetically.'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break long line < 80 columns

@@ -86,6 +86,7 @@ def __init__(
debugging_map=False,
sanitize_names=False,
ignore_invalid_lines=False,
keep_input_sort_order=False,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use preserve_input_sort_order

@bxparks
Copy link
Owner

bxparks commented Nov 11, 2021

My guess is that at some point, Google changed their internal schema generator tool from Python to Go, and Go's JSON serializer does not preserve ordering of the keys, in keeping with the JSON specification:
golang/go#27179
So the ordering of the JSON keys is now non-deterministic.

@kdeggelman
Copy link
Contributor Author

I think the critical piece to all of this is that a BigQuery schema is a JSON array [] and not a JSON object {}. The next line from the RFC you quoted says "An array is an ordered sequence of zero or more values."

I think the behavior boils down to these questions:

When you provide a json schema to bq load, does the order of the columns match the order of json?
Yes, this is the behavior that I've seen consistently for the reason stated above.

When you don't provide a json schema, what does autodetect use to order the columns? Is it deterministic?
I'm not sure what they use, it doesn't seem to match the order of columns in the source json, but it's also not alphabetical. When I've run autodetect on the same source file multiple times (5 times), it has resulted in the same ordering each time, but I'm not confident enough to say that the autodetect ordering is deterministic.

Is bq show --schema deterministic and if so, what ordering does it use?
Yes, the ordering matches the order of the columns.

@bxparks
Copy link
Owner

bxparks commented Nov 11, 2021

I agree the BQ schema JSON is ordered because it uses an array. And I think this flag is worth adding. But I have a concern that we are making some assumptions that may be transitory implementation details of the bq command. Because what I meant was that the ordering of the keys in the JSON input data is undefined when it is read into memory. Let's say I have a line like:

{ "s": "string", "x": 3.2, "i": 3, "b": true }
  • If I read that into memory using C++ std::map, the ordering of the keys is undefined.
  • If read that in into memory in Python using json.load(), in Python < 3.7 the ordering was considered implementation defined, and in Python >= 3.7 the ordering was changed to be input-order.
  • If I read that using Java HashMap, the ordering of the keys is undefined.
  • In Go lang, the key ordering is apparently undefined.

If subsequent lines in the input data had a different key ordering, like this:

{ "s": "string", "x": 3.2, "i": 3, "b": true }
...
{  "b": true, "i": 3, "s": "string", "x": 3.2 }

does bq load --autodetect use the ordering of the first line, or the last line?

Do you know any BigQuery documentation that explicitly defines the ordering of the keys in the BQ schema in relation to the ordering of the keys in the input JSON data?

@bxparks
Copy link
Owner

bxparks commented Nov 11, 2021

It also seems tome that the input ordering has further subtleties, because each JSON data line does not need to contain all the keys. So, in the following:

{ "s": "string", "i": 3 }
{ "x": 3.2, "i": 3, "s": "string" }
{  "b": true, "i": 3, "s": "string", "x": 3.2 }

I think your --preserve_input_sort_order flag would generate a schema with the keys in the following order:

["s", "i", "x", "b"]

@kdeggelman
Copy link
Contributor Author

Do you know any BigQuery documentation that explicitly defines the ordering of the keys in the BQ schema in relation to the ordering of the keys in the input JSON data?

No, I couldn't find anything that dealt with the ordering of columns. https://cloud.google.com/bigquery/docs/schema-detect

I don't think we need to be overly concerned with the behavior of bq load --autodetect as the purpose of this tool is to avoid using that instead use --schema with the output of this tool. Maybe it's worth adding a note to the help message of preserve_input_sort_order that says something like 'this is not intended to replicate the order you'd get by using `bq load --autodetect`'

However...

If read that in into memory in Python using json.load(), in Python < 3.7 the ordering was considered implementation defined, and in Python >= 3.7 the ordering was changed to be input-order.

seems like a big deal to us, as setup.py currently states python_requires='>=3.6',. This leads me to believe that my new flag will not be guaranteed work correctly for python 3.6.

@bxparks
Copy link
Owner

bxparks commented Nov 11, 2021

(Correction, C++ std::map is an ordered map, so the ordering is defined. It's hard to remember the implementation details of various languages and libraries.)

@bxparks
Copy link
Owner

bxparks commented Nov 11, 2021

I don't think we need to be overly concerned with the behavior of bq load --autodetect as the purpose of this tool is to avoid using that instead use --schema with the output of this tool.

That's actually a good point.

Maybe it's worth adding a note to the help message of preserve_input_sort_order that says something like 'this is not intended to replicate the order you'd get by using `bq load --autodetect`'

I'll have to add an entry in the README.md for this flag, so I think we put that warning there.

If read that in into memory in Python using json.load(), in Python < 3.7 the ordering was considered implementation defined, and in Python >= 3.7 the ordering was changed to be input-order.

seems like a big deal to us, as setup.py currently states python_requires='>=3.6',. This leads me to believe that my new flag will not be guaranteed work correctly for python 3.6.

Python 3.6 should be a decreasing percentage of users of this library, so I'm ok with not guaranteeing that this flag will work in 3.6. Apparently the implementation detail of dict in CPython 3.6 is to preserve ordering. They just made it guaranteed in 3.7. https://stackoverflow.com/questions/5629023/the-order-of-keys-in-dictionaries

I want to write some unit tests to verify edge cases, but I think this PR is worth adding.

@kdeggelman
Copy link
Contributor Author

Thanks for working through this with me. I'm happy to update the README and add tests, but if you'd prefer to do that yourself then fine by me 😄

@bxparks
Copy link
Owner

bxparks commented Nov 11, 2021

It'll take me longer to explain the administrative work, than just doing it myself. Thanks for offering though.

@bxparks bxparks merged commit 02e09dd into bxparks:develop Nov 11, 2021
@bxparks
Copy link
Owner

bxparks commented Nov 11, 2021

I'll let this simmer for a few days, before pushing a new release.

bxparks added a commit that referenced this pull request Nov 11, 2021
@kdeggelman kdeggelman deleted the keep-original-order branch November 12, 2021 04:33
@bxparks bxparks mentioned this pull request Nov 14, 2021
@bxparks
Copy link
Owner

bxparks commented Nov 14, 2021

Pushed v1.5 to PyPI. Thanks for the work on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants