Add option to preserve the original ordering of columns #75

kdeggelman · 2021-11-10T23:35:33Z

Summary:

It's nice to have the option of keeping the columns in the same order
as the input file for json and dict input formats.
On a related note, I'm not sure what BigQuery's autodetect functionality
uses to determine order, but as of today it is not alphabetical.

Summary: - It's nice to have the option of keeping the columns in the same order as the input file for json and dict input formats. - On a related note, I'm not sure what BigQuery's `autodetect` functionality uses to determine order, but as of today it is not alphabetical.

kdeggelman · 2021-11-10T23:37:43Z

@bxparks -- thanks for all your work on this tool, it's quickly becoming an integral part of our data pipeline.

I probably should have opened an issue to discuss this change before opening a pull request, but figured it was easy enough to make the code change and we can discuss from here.

bxparks

Some minor formatting nits below.

I guess the bigger question is, does the sorting order matter for JSON input data? Because the order of keys in JSON file is undefined:

https://datatracker.ietf.org/doc/html/rfc7159#section-1
" An object is an unordered collection of zero or more name/value
pairs, where a name is a string and a value is a string, number,
boolean, null, object, or array."

So the BQ schema, as represented by a JSON file, should be considered identical no matter how the keys are ordered.

I guess I had sorted the keys when printing out the schema, because bq show --schema used to sort the keys. But if you are saying that it no longer sorts the keys, then what does it do? If the order is random, or a non-deterministic outcome of how the autodetect feature worked, then I am not sure I can replicate the exact order of those keys in this script. In other words, I'm not convinced that --preserve_input_sort_order will produce a JSON schema file that is identical to the one produced by bg show --schema.

bxparks · 2021-11-11T05:34:32Z

bigquery_schema_generator/generate_schema.py

@@ -1042,6 +1043,12 @@ def main():
        ' This can be fetched with:'
        ' `bq show --schema <project_id>:<dataset>:<table_name>',
        default=None)
+    parser.add_argument(
+        '--keep-input-sort-order',


Let's use --preserve_input_sort_order using underscore for consistency with other flags.

bxparks · 2021-11-11T05:36:26Z

bigquery_schema_generator/generate_schema.py

@@ -113,7 +114,7 @@ def __init__(
        # If CSV, preserve the original ordering because 'bq load` matches the
        # CSV column with the respective schema entry using the position of the
        # column in the schema.
-        self.sorted_schema = (input_format in {'json', 'dict'})
+        self.sorted_schema = (input_format in {'json', 'dict'}) and not keep_input_sort_order


If you run $ make flake8 at the root of the project, you'll see that it will complain about being longer than 80 columns. Break this into smaller lines using something like:

self.sorted_schema = ( (input_format in {'json', 'dict'}) and not keep_input_sort_order )

bxparks · 2021-11-11T05:37:26Z

bigquery_schema_generator/generate_schema.py

@@ -1042,6 +1043,12 @@ def main():
        ' This can be fetched with:'
        ' `bq show --schema <project_id>:<dataset>:<table_name>',
        default=None)
+    parser.add_argument(
+        '--keep-input-sort-order',
+        help='Preserve the original ordering of columns from input instead of sorting alphabetically.'


break long line < 80 columns

bxparks · 2021-11-11T05:38:01Z

bigquery_schema_generator/generate_schema.py

@@ -86,6 +86,7 @@ def __init__(
        debugging_map=False,
        sanitize_names=False,
        ignore_invalid_lines=False,
+        keep_input_sort_order=False,


Let's use preserve_input_sort_order

bxparks · 2021-11-11T05:59:42Z

My guess is that at some point, Google changed their internal schema generator tool from Python to Go, and Go's JSON serializer does not preserve ordering of the keys, in keeping with the JSON specification:
golang/go#27179
So the ordering of the JSON keys is now non-deterministic.

kdeggelman · 2021-11-11T15:36:42Z

I think the critical piece to all of this is that a BigQuery schema is a JSON array [] and not a JSON object {}. The next line from the RFC you quoted says "An array is an ordered sequence of zero or more values."

I think the behavior boils down to these questions:

When you provide a json schema to bq load, does the order of the columns match the order of json?
Yes, this is the behavior that I've seen consistently for the reason stated above.

When you don't provide a json schema, what does autodetect use to order the columns? Is it deterministic?
I'm not sure what they use, it doesn't seem to match the order of columns in the source json, but it's also not alphabetical. When I've run autodetect on the same source file multiple times (5 times), it has resulted in the same ordering each time, but I'm not confident enough to say that the autodetect ordering is deterministic.

Is bq show --schema deterministic and if so, what ordering does it use?
Yes, the ordering matches the order of the columns.

bxparks · 2021-11-11T16:19:27Z

I agree the BQ schema JSON is ordered because it uses an array. And I think this flag is worth adding. But I have a concern that we are making some assumptions that may be transitory implementation details of the bq command. Because what I meant was that the ordering of the keys in the JSON input data is undefined when it is read into memory. Let's say I have a line like:

{ "s": "string", "x": 3.2, "i": 3, "b": true }

If I read that into memory using C++ std::map, the ordering of the keys is undefined.
If read that in into memory in Python using json.load(), in Python < 3.7 the ordering was considered implementation defined, and in Python >= 3.7 the ordering was changed to be input-order.
If I read that using Java HashMap, the ordering of the keys is undefined.
In Go lang, the key ordering is apparently undefined.

If subsequent lines in the input data had a different key ordering, like this:

{ "s": "string", "x": 3.2, "i": 3, "b": true }
...
{  "b": true, "i": 3, "s": "string", "x": 3.2 }

does bq load --autodetect use the ordering of the first line, or the last line?

Do you know any BigQuery documentation that explicitly defines the ordering of the keys in the BQ schema in relation to the ordering of the keys in the input JSON data?

bxparks · 2021-11-11T16:34:54Z

It also seems tome that the input ordering has further subtleties, because each JSON data line does not need to contain all the keys. So, in the following:

{ "s": "string", "i": 3 }
{ "x": 3.2, "i": 3, "s": "string" }
{  "b": true, "i": 3, "s": "string", "x": 3.2 }

I think your --preserve_input_sort_order flag would generate a schema with the keys in the following order:

["s", "i", "x", "b"]

kdeggelman · 2021-11-11T16:40:52Z

Do you know any BigQuery documentation that explicitly defines the ordering of the keys in the BQ schema in relation to the ordering of the keys in the input JSON data?

No, I couldn't find anything that dealt with the ordering of columns. https://cloud.google.com/bigquery/docs/schema-detect

I don't think we need to be overly concerned with the behavior of bq load --autodetect as the purpose of this tool is to avoid using that instead use --schema with the output of this tool. Maybe it's worth adding a note to the help message of preserve_input_sort_order that says something like 'this is not intended to replicate the order you'd get by using `bq load --autodetect`'

However...

If read that in into memory in Python using json.load(), in Python < 3.7 the ordering was considered implementation defined, and in Python >= 3.7 the ordering was changed to be input-order.

seems like a big deal to us, as setup.py currently states python_requires='>=3.6',. This leads me to believe that my new flag will not be guaranteed work correctly for python 3.6.

bxparks · 2021-11-11T16:41:13Z

(Correction, C++ std::map is an ordered map, so the ordering is defined. It's hard to remember the implementation details of various languages and libraries.)

bxparks · 2021-11-11T16:54:53Z

I don't think we need to be overly concerned with the behavior of bq load --autodetect as the purpose of this tool is to avoid using that instead use --schema with the output of this tool.

That's actually a good point.

Maybe it's worth adding a note to the help message of preserve_input_sort_order that says something like 'this is not intended to replicate the order you'd get by using `bq load --autodetect`'

I'll have to add an entry in the README.md for this flag, so I think we put that warning there.

If read that in into memory in Python using json.load(), in Python < 3.7 the ordering was considered implementation defined, and in Python >= 3.7 the ordering was changed to be input-order.

seems like a big deal to us, as setup.py currently states python_requires='>=3.6',. This leads me to believe that my new flag will not be guaranteed work correctly for python 3.6.

Python 3.6 should be a decreasing percentage of users of this library, so I'm ok with not guaranteeing that this flag will work in 3.6. Apparently the implementation detail of dict in CPython 3.6 is to preserve ordering. They just made it guaranteed in 3.7. https://stackoverflow.com/questions/5629023/the-order-of-keys-in-dictionaries

I want to write some unit tests to verify edge cases, but I think this PR is worth adding.

kdeggelman · 2021-11-11T16:58:15Z

Thanks for working through this with me. I'm happy to update the README and add tests, but if you'd prefer to do that yourself then fine by me 😄

bxparks · 2021-11-11T18:17:05Z

It'll take me longer to explain the administrative work, than just doing it myself. Thanks for offering though.

bxparks · 2021-11-11T18:24:27Z

I'll let this simmer for a few days, before pushing a new release.

bxparks · 2021-11-14T16:36:28Z

Pushed v1.5 to PyPI. Thanks for the work on this!

bxparks reviewed Nov 11, 2021

View reviewed changes

Update option name and fix line lengths

cb73385

bxparks merged commit 02e09dd into bxparks:develop Nov 11, 2021

bxparks added a commit that referenced this pull request Nov 11, 2021

Add unit tests for --preserve_input_sort_order (see #75)

7673bb5

bxparks added a commit that referenced this pull request Nov 11, 2021

Add documentation for --preserve_input_sort_order (see #75)

264dfbd

bxparks added a commit that referenced this pull request Nov 11, 2021

README.md: Update --help output (see #75)

7b4115d

kdeggelman deleted the keep-original-order branch November 12, 2021 04:33

bxparks mentioned this pull request Nov 14, 2021

merge v1.5 into master #76

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to preserve the original ordering of columns #75

Add option to preserve the original ordering of columns #75

kdeggelman commented Nov 10, 2021

kdeggelman commented Nov 10, 2021

bxparks left a comment

bxparks Nov 11, 2021

bxparks Nov 11, 2021

bxparks Nov 11, 2021

bxparks Nov 11, 2021

bxparks commented Nov 11, 2021

kdeggelman commented Nov 11, 2021

bxparks commented Nov 11, 2021 •

edited

Loading

bxparks commented Nov 11, 2021

kdeggelman commented Nov 11, 2021

bxparks commented Nov 11, 2021

bxparks commented Nov 11, 2021

kdeggelman commented Nov 11, 2021

bxparks commented Nov 11, 2021

bxparks commented Nov 11, 2021

bxparks commented Nov 14, 2021

Add option to preserve the original ordering of columns #75

Add option to preserve the original ordering of columns #75

Conversation

kdeggelman commented Nov 10, 2021

kdeggelman commented Nov 10, 2021

bxparks left a comment

Choose a reason for hiding this comment

bxparks Nov 11, 2021

Choose a reason for hiding this comment

bxparks Nov 11, 2021

Choose a reason for hiding this comment

bxparks Nov 11, 2021

Choose a reason for hiding this comment

bxparks Nov 11, 2021

Choose a reason for hiding this comment

bxparks commented Nov 11, 2021

kdeggelman commented Nov 11, 2021

bxparks commented Nov 11, 2021 • edited Loading

bxparks commented Nov 11, 2021

kdeggelman commented Nov 11, 2021

bxparks commented Nov 11, 2021

bxparks commented Nov 11, 2021

kdeggelman commented Nov 11, 2021

bxparks commented Nov 11, 2021

bxparks commented Nov 11, 2021

bxparks commented Nov 14, 2021

bxparks commented Nov 11, 2021 •

edited

Loading