-
Notifications
You must be signed in to change notification settings - Fork 2
Jobs (Functions) (Syntax and Usage)
The search_and_flatten_csv
function is a type of job that allows you to flatten JSON data from files into a CSV format based on the search configurations defined. The purpose of the search_and_flatten_csv function is to simplify this process of JSON to CSV while also having low memory and the ability to handle files that are HUGE (even 100GB) by using a "streaming" approach. This allows you to analyze and process your JSON data more effectively, especially when dealing with large datasets that can't easily be explored in local tools.
{
"jobs": [
{
"type": "search_and_flatten_csv",
"name": "search",
"search_config_path": "searches.json",
"searchconfigs": {
"search_and_flatten1": "example1.json",
"search_and_flatten2": "example2.json"
},
"delimiter": ",",
"mode": "normal",
"num_test_rows": 100,
"verbose": false
},
...more jobs here...
]
}
Below is an explanation of the configuration parameters in the jobs configuration file:
-
type
: The type of job. For this function, it should be"search_and_flatten_csv"
. -
name
: The name of the job. -
search_config_path
: The path to the JSON file that contains the search configuration. -
searchconfigs
: An object containing pairs of search configuration names and corresponding JSON file paths. These files contain the specific search criteria for each job. -
delimiter
: The delimiter used in the resulting CSV file. Typically this is a comma (,
) but can be set to other characters as required. -
mode
: The running mode. Can be"normal"
or"test"
. -
num_test_rows
: The number of test rows to be generated when the mode is"test"
. -
verbose
: A boolean value indicating whether to log additional information during the execution. Set totrue
to enable,false
to disable.
The search configuration is defined in a separate JSON file as specified in search_config_path
. This JSON file contains the keys to be flattened from the JSON data.
Below are the properties used in the search configuration:
-
root_key
: The root key from where the search should begin in the JSON structure. If this is an empty string, the search begins from the top level. If the input JSON is one single object, this will result in one row in a CSV. If the root key points to an array of objects, then the result will be multiple rows in the outputted CSV. -
search_config
: An array of keys or a string"*"
indicating that all keys should be included. This can also be an object that defines specific properties for each key. -
similarity_threshold
: A decimal value ranging from 0 to 1 indicating the threshold of similarity when comparing keys. -
array_handling
: Defines how array data should be handled. The"stringify"
option converts the entire array into a single string, while"explode"
treats each element in the array as a separate entity, and"vertical"
makes new columns for up to the max # of items in that array over the whole JSON file. -
object_handling
: Defines how object data should be handled. The"stringify"
option converts the applicable object or value into a single string, while"explode"
flattens the object into all of its children keys (using dot notation - "parent_key.child.grandchild"). -
allow_dot_notation
: A boolean indicating whether dot notation should be use to find keys. Iftrue
, keys can be accessed as"object.key"
. Iffalse
, the program will go slower, but it can utilizesimilarity_threshold
to use "jaccard index" to find keys that approximately match. -
max_string_length
: The maximum string length allowed for a value. If a string exceeds this length, it will be handled according to thelong_string_handling
option. -
long_string_handling
: Defines how strings exceedingmax_string_length
should be handled."truncate"
will cut off the string at the maximum length, while"split"
will break the string into multiple keys.
look at search_config
options (below) for configuring the search in detail
{
"search_and_flatten1":{
"root_key":"",
"search_config":[ # list search
"account_name",
"asset_unique_id",
...
],
...
},
"search_and_flatten2":{
"root_key":"",
"search_config":"*", # wildcard search
...
},
"search_and_flatten3":{
"root_key":"",
"search_config": # granular search
"key1": {
"similarity_threshold": 1.0,
"array_handling": "stringify",
"object_handling": "stringify",
"allow_dot_notation": true
},
"key2": {
"similarity_threshold": 0.9,
"array_handling": "stringify",
"object_handling": "recurse",
"allow_dot_notation": false
},
...
},
...
}
In this example, 3 separate configurations for search_config
have been chosen: one that "tabulizes" all of the keys in the JSON file, and the other which does it with a select list of keys.
This is the most important parameter for search configuration!
The search_config parameter in the search configuration can take a wildcard ("*"), list, or granular (dict) search options.
The wildcard search configuration uses an asterisk ("*") to expand all keys in the JSON depending on the base level options. This is a quick way to flatten all data, but is not recommended for large JSON files with many keys due to performance considerations and issues that still haven't been solved with the "DynamicDictWriter
" function and "searchAndFlatten
".
{
"all_the_data": {
"root_key": "",
"search_config": "*",
"similarity_threshold": 1.0,
"array_handling": "stringify",
"object_handling": "recurse",
"allow_dot_notation": true,
"max_string_length": 32750,
"long_string_handling": "truncate"
}
}
The list search configuration allows you to apply base level options to a specific list of keys. This is useful when you want to flatten a subset of the data.
{
"specific_keys_only": {
"root_key": "",
"search_config": [
"key1",
"key2",
"key3",
"key4.subkey",
"key5.subkey.subsubkey"
],
"similarity_threshold": 1.0,
"array_handling": "stringify",
"object_handling": "stringify",
"allow_dot_notation": true,
"max_string_length": 32750,
"long_string_handling": "truncate"
}
}
The granular search configuration applies specific options to each key that is searched for. This provides the most flexibility as you can customize how each key is handled.
{
"granular_control": {
"root_key": "",
"search_config": {
"key1": {
"similarity_threshold": 1.0,
"array_handling": "stringify",
"object_handling": "stringify",
"allow_dot_notation": true
},
"key2": {
"similarity_threshold": 0.9,
"array_handling": "stringify",
"object_handling": "recurse",
"allow_dot_notation": false
}
},
"similarity_threshold": 1.0,
"array_handling": "stringify",
"object_handling": "recurse",
"allow_dot_notation": true,
"max_string_length": 32750,
"long_string_handling": "truncate"
}
}
In the granular search, you can still specify the base level options, which will act as the default for all keys. However, the specific options within each key will override these base level options.