Commit

Add documentation for --preserve_input_sort_order (see #75)
bxparks committed Nov 11, 2021
1 parent 7673bb5 commit 264dfbd
Showing 2 changed files with 92 additions and 1 deletion.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,10 @@
# Changelog

* Unreleased
    * Make the column order in the BQ schema file match the order of appearance
      in the JSON data file using the `--preserve_input_sort_order` flag.
      Thanks to kdeggelman@ in
      [PR#75](https://github.com/bxparks/bigquery-schema-generator/pull/75).
* 1.4.1 (2021-08-23)
    * Add documentation for the `input_format='dict'` option.
    * Add additional input format 'json' and 'dict' test cases.
89 changes: 88 additions & 1 deletion README.md
@@ -38,6 +38,8 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
* [Sanitize Names (`--sanitize_names`)](#SanitizedNames)
* [Ignore Invalid Lines (`--ignore_invalid_lines`)](#IgnoreInvalidLines)
* [Existing Schema Path (`--existing_schema_path`)](#ExistingSchemaPath)
* [Preserve Input Sort Order
(`--preserve_input_sort_order`)](#PreserveInputSortOrder)
* [Using as a Library](#UsingAsLibrary)
* [`SchemaGenerator.run()`](#SchemaGeneratorRun)
* [`SchemaGenerator.deduce_schema()`](#SchemaGeneratorDeduceSchema)
@@ -547,6 +549,88 @@ See discussion in
[PR #57](https://github.com/bxparks/bigquery-schema-generator/pull/57) for
more details.
<a name="PreserveInputSortOrder"></a>
#### Preserve Input Sort Order (`--preserve_input_sort_order`)
By default, the columns in the BQ schema file are sorted lexicographically,
which matched the original behavior of `bq load --autodetect`. If the
`--preserve_input_sort_order` flag is given, the columns in the resulting
schema file are not sorted, but preserve the order of appearance in the input
JSON data. For example, the following JSON data with the
`--preserve_input_sort_order` flag produces:
```bash
$ generate-schema --preserve_input_sort_order
{ "s": "string", "i": 3, "x": 3.2, "b": true }
^D
[
  {
    "mode": "NULLABLE",
    "name": "s",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "i",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "x",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "b",
    "type": "BOOLEAN"
  }
]
```
It is possible that each JSON record line contains only a subset of the
columns in the data set. The columns in the BQ schema will then appear in the
order in which each column was first *seen* by the script:
```bash
$ generate-schema --preserve_input_sort_order
{ "s": "string", "i": 3 }
{ "x": 3.2, "s": "string", "i": 3 }
{ "b": true, "x": 3.2, "s": "string", "i": 3 }
^D
[
  {
    "mode": "NULLABLE",
    "name": "s",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "i",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "x",
    "type": "FLOAT"
  },
  {
    "mode": "NULLABLE",
    "name": "b",
    "type": "BOOLEAN"
  }
]
```
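The first-seen ordering described above can be illustrated with a minimal Python sketch. This is not the project's implementation: the helper `first_seen_column_order()` and its `preserve_input_sort_order` parameter are hypothetical names used only for this example, and the sketch tracks column names only, not types or modes.

```python
import json


def first_seen_column_order(lines, preserve_input_sort_order=True):
    """Hypothetical helper: return column names in first-seen order,
    or in lexicographic order when the flag is off."""
    seen = {}  # a dict keeps insertion order (guaranteed from Python 3.7)
    for line in lines:
        record = json.loads(line)
        for key in record:
            # setdefault() leaves the position of an already-seen key unchanged
            seen.setdefault(key, None)
    names = list(seen)
    return names if preserve_input_sort_order else sorted(names)


records = [
    '{ "s": "string", "i": 3 }',
    '{ "x": 3.2, "s": "string", "i": 3 }',
    '{ "b": true, "x": 3.2, "s": "string", "i": 3 }',
]
print(first_seen_column_order(records))         # ['s', 'i', 'x', 'b']
print(first_seen_column_order(records, False))  # ['b', 'i', 's', 'x']
```

A column that first appears at the front of a later record (`x`, then `b`) is still appended after the columns already seen, which matches the schema output shown above.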
**Note**: In Python 3.6 (the earliest version of Python supported by this
project), a `dict` preserved the insertion order of its keys, but this
ordering was an implementation detail and not guaranteed. In Python 3.7, that
ordering became a guaranteed part of the language. So the
`--preserve_input_sort_order` flag **should** work in Python 3.6 but is not
guaranteed to.
See discussion in
[PR #75](https://github.com/bxparks/bigquery-schema-generator/pull/75) for
more details.
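As a side illustration of the Python 3.6 caveat, the standard `json` module can be asked to build an `OrderedDict` for each JSON object, which keeps key order on any supported Python version. This is only a sketch of the language behavior; it is an assumption of this example, not a description of how the script parses its input.

```python
import json
from collections import OrderedDict

line = '{ "s": "string", "i": 3, "x": 3.2, "b": true }'

# object_pairs_hook receives the (key, value) pairs in document order,
# so the resulting OrderedDict preserves the order of appearance.
record = json.loads(line, object_pairs_hook=OrderedDict)
print(list(record))  # ['s', 'i', 'x', 'b']
```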
<a name="UsingAsLibrary"></a>
### Using As a Library
@@ -572,6 +656,7 @@ generator = SchemaGenerator(
    debugging_map=debugging_map,
    sanitize_names=sanitize_names,
    ignore_invalid_lines=ignore_invalid_lines,
    preserve_input_sort_order=preserve_input_sort_order,
)
generator.run(input_file=input_file, output_file=output_file)
```
@@ -936,4 +1021,6 @@ people ask similar questions later.
by Austin Brogle (abroglesc@) and Bozo Dragojevic (bozzzzo@).
* Allow `SchemaGenerator.deduce_schema()` to accept a list of native Python
`dict` objects, by Zigfrid Zvezdin (ZiggerZZ@).
* Make the column order in the BQ schema file match the order of appearance in
the JSON data file using the `--preserve_input_sort_order` flag. By Kevin
Deggelman (kdeggelman@).
