Support newline-delimited JSON for seeds #2365

Open
jml opened this issue Apr 28, 2020 · 24 comments
Labels
enhancement (New feature or request) · paper_cut (A small change that impacts lots of users in their day-to-day) · seeds (Issues related to dbt's seed functionality)

Comments

@jml

jml commented Apr 28, 2020

Describe the feature

Allow dbt users to load data from newline-delimited JSON files, rather than just CSV.

CSV is barely a format, and has lots of ambiguities around quoting, names, and so forth. Supporting newline-delimited JSON would make things simpler and less error-prone.

Describe alternatives you've considered

  • handling this outside of dbt (what we do now)
  • converting JSON data to CSV as part of the build process (a bit of a faff; see the sketch after this list)
  • supporting other formats, like Avro (not likely to store binary data in a git repo)
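
As a rough illustration of the conversion alternative above, a minimal sketch (assuming flat records with a consistent set of keys per file; the helper name and file paths are made up):

import csv
import json


def ndjson_to_csv(src_path: str, dest_path: str) -> None:
    """Convert a newline-delimited JSON file to a CSV file."""
    with open(src_path) as src:
        rows = [json.loads(line) for line in src if line.strip()]
    if not rows:
        return
    with open(dest_path, "w", newline="") as dest:
        writer = csv.DictWriter(dest, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# e.g. ndjson_to_csv("data/users.ndjson", "data/users.csv")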

Additional context

Loading newline-delimited data into BigQuery is really easy: there's an API for it.
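
For illustration, a minimal sketch of that API call (the dataset, table, and file names here are made up):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
with open("data/users.ndjson", "rb") as f:
    # Stream the NDJSON file straight into the destination table.
    client.load_table_from_file(f, "my_dataset.users", job_config=job_config).result()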

Who will this benefit?

Anyone looking into seeds who already has their data in newline-delimited JSON format (hi!). Anyone who has ever tried to debug dodgy quoting issues in CSV.

@jml jml added the enhancement and triage labels Apr 28, 2020
@drewbanin drewbanin removed the triage label Apr 29, 2020
@drewbanin
Contributor

hey @jml - check out this issue for some prior art: #2276

I'm going to close this issue out as a dupe, but this is a good suggestion and I'm keen to support non-csv seeds in the future!

@jml jml mentioned this issue Apr 29, 2020
@stewartbryson

hey @jml - check out this issue for some prior art: #2276

I'm going to close this issue out as a dupe, but this is a good suggestion and I'm keen to support non-csv seeds in the future!

@drewbanin Are you saying this request is a dupe of #2276? I don’t understand the relationship. Asking for newline-delimited JSON as a seed format doesn’t seem related to Jinja blocks at all.

@drewbanin
Contributor

Hey @stewartbryson - the issue I linked to includes an example of a newline-delimited json seed file. The included example looks like:

{% seed my_json_seed type=json %}

{"col1": "value", "col2": "other value"},
{"col1": "value2", "col2": "other value2"},

{% endseed %}

I do think there's an opportunity to support newline-delimited json seed files without building jinja seed blocks, but it's probably not a change we're going to prioritize in isolation. Can you tell me more about what you're looking to use newline-delimited json seed files for? That might help us better prioritize an issue like this one

@jtcohen6 jtcohen6 reopened this Sep 10, 2020
@aspfohl

aspfohl commented Nov 5, 2020

@jml Would you be able to share how you currently handle it outside of dbt? I agree this would be useful: it could help with CSV quoting issues, readability (if there were support for other delimiters), and representing unstructured data.

@jml
Author

jml commented Nov 6, 2020

Sure. We use BigQuery-specific APIs.

import logging
import time
from typing import IO, Dict, List, Sequence

from google.api_core.exceptions import BadRequest
from google.cloud import bigquery
from google.cloud.bigquery import LoadJob, SchemaField

# `config` and `get_bigquery_client` below are internal helpers of ours, not shown here.


def load_file_to_bigquery(
    client: bigquery.Client,
    source_file: IO[bytes],
    destination_table: str,
    load_config: bigquery.LoadJobConfig,
    rewind: bool = False,
) -> LoadJob:
    logging.info("Starting load into %s in BigQuery", destination_table)
    job = client.load_table_from_file(
        source_file, destination_table, job_config=load_config, rewind=rewind
    )
    logging.info("Loading data into BigQuery")
    duration = _wait_for_job(job)
    logging.info(
        "Done in %0.1f seconds! %s rows and %s bytes were loaded",
        duration,
        job.output_rows,
        job.output_bytes,
    )
    return job

def _wait_for_job(job: bigquery.LoadJob) -> float:
    start_time = time.monotonic()
    try:
        # Starts the load asynchronously and polls until completed, raising an exception in case of problems
        job.result()
    except BadRequest:
        logging.error(
            "Errors occurred during loading: %s", job.errors, extra={"errors": job.errors}
        )
    return time.monotonic() - start_time

def load_json_line_data_to_bigquery(
    project: str,
    location: str,
    destination_table: str,
    filename: str,
    table_description: str,
    field_descriptions: Dict[str, str],
) -> None:  # pragma: no cover
    """Load line-separated JSON files from data/ to BigQuery."""
    client = get_bigquery_client(project=project, location=location)
    path = config.RAW_DATA_PATH.joinpath(filename)
    with path.open(mode="rb") as f:
        load_file_to_bigquery(
            client=client,
            source_file=f,
            destination_table=destination_table,
            load_config=bigquery.LoadJobConfig(
                autodetect=True,
                source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
                write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
            ),
        )
    update_schema_documentation(client, destination_table, table_description, field_descriptions)

def update_schema_documentation(
    client: bigquery.Client,
    table_name: str,
    table_description: str,
    field_descriptions: Dict[str, str],
) -> None:  # pragma: no cover
    """Update the documentation for a BigQuery table.

    Known limitations:

    - Does not support nested field definitions

    Feel free to fix these limitations!
    """
    table = client.get_table(table_name)
    table.description = table_description
    table.schema = _document_schema(table_name, table.schema, field_descriptions)
    client.update_table(table, ["description", "schema"])


def _document_schema(
    table_name: str, schema: Sequence[SchemaField], field_descriptions: Dict[str, str]
) -> List[SchemaField]:
    """Create a documented version of the given BigQuery schema."""
    existing_fields = set()
    new_schema = []
    for field in schema:
        description = field_descriptions.get(field.name)
        new_schema.append(
            SchemaField(
                name=field.name,
                field_type=field.field_type,
                mode=field.mode,
                description=description,
                fields=field.fields,
            )
        )
        existing_fields.add(field.name)
    undescribed_fields = existing_fields - set(field_descriptions)
    # TODO: Also raise an exception if we have described fields that don't exist.
    if undescribed_fields:
        # TODO: Raise a more specific exception so we don't have to pass table_name in.
        raise RuntimeError(f"Unexpected fields defined in {table_name}: {undescribed_fields}")
    return new_schema

Happy to answer any questions. Memrise Limited is making this code available under the Apache License 2.0.

@jtcohen6
Contributor

jtcohen6 commented Oct 4, 2021

If we're willing to stick with agate as our dataframe provider—very much an open question, since there are some very good reasons to rip it out (#3413)—I think this might be a fairly trivial addition, since agate supports a from_json method that looks quite similar to the from_csv method we use for seeds today.

We'd need to update the load_agate_table context method, and/or turn the agate_helper.from_csv method into a generalized from_file method.
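
As a rough sketch (not an actual implementation; table_from_file is a hypothetical name, and this assumes agate's Table.from_json with newline=True handles newline-delimited JSON):

from pathlib import Path

import agate


def table_from_file(path: str) -> agate.Table:
    """Load a seed file into an agate Table, dispatching on file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return agate.Table.from_csv(path)
    if suffix in (".json", ".jsonl", ".ndjson"):
        # newline=True parses the file as newline-delimited JSON
        return agate.Table.from_json(path, newline=True)
    raise ValueError(f"Unsupported seed file format: {suffix}")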

Those are implementation details. I'd still like to hear a bit more from users about the rationale for supporting NDJSON—ease of quoting/escaping? ease of debugging?—rather than requiring conversion to CSV as a preprocessing step. Unless I'm missing something big, I don't think JSON seeds would have a leg up in terms of support for semi-/unstructured data, since the ultimate aim is always conversion to tabular format (data frame).

@jml
Author

jml commented Oct 4, 2021

I'd still like to hear a bit more from users about the rationale for supporting NDJSON—ease of quoting/escaping? ease of debugging?—rather than requiring conversion to CSV as a preprocessing step.

For us it's about having a harder-to-get-wrong format. It's really easy to get CSV wrong (mostly quoting/escaping, as you say).

There's also a migration factor. Before we switched to dbt, we were loading data from NDJSON files. Now, switching to dbt, we have to make those files worse (FSVO 'worse'), which feels off, given that everything else about dbt has been an improvement.

@rumbin

rumbin commented Oct 4, 2021

Our use case:
We use JSON in our "seeds" in order to reflect some nested relationships. Something like one row per user containing a JSON list of groups or roles that this user belongs to. We do not want to flatten such lists immediately but rather keep them as VARIANT or ARRAY type.

Currently we maintain such JSON "seeds" as a string literal of a JSON array of these objects, parse them via Snowflake's parse_json() and flatten() the outer array.

It would be much more convenient to be able to use NDJSON here, so the editor can provide syntax highlighting etc.
However, as said, our use case requires nested lists and maps to stay nested.

@rumbin

rumbin commented Oct 4, 2021

If a flat table is the result that we are aiming for after loading the seed, we could live with the nested fields being included as escaped text, though, as that could be easily converted downstream.

So, yes, type and quoting safety are the big benefits of JSON here. CSV is error-prone and not standardized enough, imho.
Totally agree with @jml here.

@amardeep

amardeep commented Oct 4, 2021

Same as what some others have mentioned. We use BigQuery and miss having support for nested data in seeds.

One use case was testing, where it would have been nice if the seeds mirrored the structure of what was being tested - without nesting in seeds, we needed an extra step to first build that nested structure from the CSV we read.

@jtcohen6 jtcohen6 added the seeds label Oct 5, 2021
@frosty5689

Hello,

We have JSONB columns in our Postgres database that we use as a source for our data. Managing JSON inside CSVs is painful, whereas having it inside NDJSON or YAML would make it much easier to manage. That is our use case for wanting something other than CSV as the seed file format.

@naga-kondapaturi-zefr

We use Snowflake source tables that heavily use VARIANT columns. Using CSV seeds to add sample data for testing transformations results in SQL compilation error: Expression type does not match column data type, expecting VARIANT but got VARCHAR(210) for column PAYLOAD. I am not entirely sure how to use the seed JSON example given above inside a sources.yml file, as the tables are already created in our EL (extract and load) process and we don't define them as models in dbt.

We are overriding the Snowflake load macro snowflake__load_csv_rows to use parse_json for variant columns, but it seems like an ugly hack. I am not sure whether NDJSON support would resolve our issue, but a way to load text into Snowflake variant columns would be quite useful.

@jtcohen6
Contributor

jtcohen6 commented Mar 2, 2022

@naga-kondapaturi-zefr Are you trying to use seed-related functionality for a table that already exists in your data warehouse? Or in order to load a CSV containing a column with a JSON value? (Perhaps loading it as a string and casting it within a staging model?)

@boxysean
Contributor

I worked with a dbt user recently who uses Snowflake, who had an interesting way to solve the lack of JSON seeds (ND or otherwise):

They create dbt models that use the PARSE_JSON function to persist VARIANT-type tables, which they can then use as static datasets. For example:

-- models/animals.sql
select parse_json('{"cats": "the best", "dogs": "the worst"}') as json_blob

@brad-meru

@drewbanin is this one on the roadmap? This would be a helpful feature for Meru for the same reason @jml mentioned:

For us it's about having a harder-to-get-wrong format. It's really easy to get CSV wrong (mostly quoting/escaping, as you say).

@jhsb25

jhsb25 commented Nov 22, 2022

We use Snowflake source tables that heavily use VARIANT columns. Using CSV seeds to add sample data for testing transformations results in SQL compilation error: Expression type does not match column data type, expecting VARIANT but got VARCHAR(210) for column PAYLOAD. I am not entirely sure how to use the seed JSON example given above inside a sources.yml file, as the tables are already created in our EL (extract and load) process and we don't define them as models in dbt.

We are overriding the Snowflake load macro snowflake__load_csv_rows to use parse_json for variant columns, but it seems like an ugly hack. I am not sure whether NDJSON support would resolve our issue, but a way to load text into Snowflake variant columns would be quite useful.

@naga-kondapaturi-zefr would you be able to share the override implementation of snowflake__load_csv_rows that uses parse_json for variant columns? We are facing the exact same problem.

@jhsb25

jhsb25 commented Nov 23, 2022

For anyone in need of the hack solution overriding snowflake__load_csv_rows, here it is:


{% macro snowflake__load_csv_rows(model, agate_table) %}
    {% set batch_size = get_batch_size() %}
    {% set cols_sql = get_seed_column_quoted_csv(model, agate_table.column_names) %}
    {% set bindings = [] %}
    {% set column_list = [] %}
    {% set variant_columns = [] %}
    {% set statements = [] %}

    {#  Get all columns that have type variant  #}
    {% for i in model['config']['column_types'].items() %}
        {% if i[1] == 'variant' %}
            {% do variant_columns.append(i[0]) %}
        {% endif %}
    {% endfor %}

    {#  Create column list expression for columns using parse_json for variant types  #}
    {% for i in agate_table.column_names %}
        {% if i in variant_columns %}
            {% do column_list.append('parse_json(column{0})'.format(loop.index)) %}
        {% else %}
            {% do column_list.append('column{0}'.format(loop.index)) %}
        {% endif %}
    {% endfor %}

    {#  Create insert query  #}
    {% for chunk in agate_table.rows | batch(batch_size) %}
        {% set bindings = [] %}

        {% for row in chunk %}
            {% do bindings.extend(row) %}
        {% endfor %}

        {% set sql %}
            insert into {{ this.render() }} select {{ ', '.join(column_list) }} from values
            {% for row in chunk -%}
                ({%- for column in agate_table.column_names -%}
                    %s
                    {%- if not loop.last%},{%- endif %}
                {%- endfor -%})
                {%- if not loop.last%},{%- endif %}
            {%- endfor %}
        {% endset %}

        {% do adapter.add_query('BEGIN', auto_begin=False) %}
        {% do adapter.add_query(sql, bindings=bindings, abridge_sql_log=True) %}
        {% do adapter.add_query('COMMIT', auto_begin=False) %}

        {% if loop.index0 == 0 %}
            {% do statements.append(sql) %}
        {% endif %}
    {% endfor %}

    {# Return SQL so we can render it out into the compiled files #}
    {{ return(statements[0]) }}
{% endmacro %}

{% materialization seed, adapter='snowflake' %}
    {% set original_query_tag = set_query_tag() %}

    {% set relations = materialization_seed_default() %}

    {% do unset_query_tag(original_query_tag) %}

    {{ return(relations) }}
{% endmaterialization %}
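
Note that this relies on the seed's column_types config marking the relevant columns as variant: the macro reads model['config']['column_types'] and only wraps those columns in parse_json().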

@jtcohen6 jtcohen6 added the paper_cut label Nov 28, 2022
@ruslan-marian

ruslan-marian commented Feb 14, 2023

I'd still like to hear a bit more from users about the rationale for supporting NDJSON—ease of quoting/escaping? ease of debugging?—rather than requiring conversion to CSV as a preprocessing step. Unless I'm missing something big, I don't think JSON seeds would have a leg up in terms of support for semi-/unstructured data, since the ultimate aim is always conversion to tabular format (data frame).

@jtcohen6 , my use case is very similar to what @naga-kondapaturi-zefr mentioned above.

I have a raw table in Snowflake that receives events through Kafka from an operational system. The table has only two columns, both of VARIANT type.

create table if not exists PROD.RAW.EVENTS (
	RECORD_METADATA VARIANT,
	RECORD_CONTENT VARIANT
); 

The first column is generated by a Snowflake Connector for Kafka and the second contains the actual JSON events. We are using dbt to extract the info that we need from events and transform it to analytics. Now, I need to create an environment for testing. It is located in a separate Snowflake account, so I can't clone the original raw table. I was looking for options on how to load JSON to Snowflake, and when I learned about dbt seed I thought that it would be the perfect tool for the job, but then it was disappointing to realize that only CSV was supported (and I'm getting the same SQL compilation error as mentioned by @naga-kondapaturi-zefr).

One thing to mention here: in my use case I have huge JSON events, with lots of nested levels of varying depth, that should be loaded to the VARIANT column as they are, without being parsed and loaded into specific columns, as is the case for CSV files (and I assume is the case for some people in this thread who would like to get this feature). Perhaps that is something for whoever works on this feature to bear in mind: load the data differently depending on the destination column type (VARIANT or not).

This message includes some details that may seem irrelevant, but they're there to provide more context on the use case.

@adamcunnington-mlg

adamcunnington-mlg commented May 5, 2023

@jtcohen6 / @dbeatty10 can you please advise on the current status of supporting JSON seeds? This issue is pretty long so I've not read all of the proposals / use cases, but wanting to load seed data that contains nested data seems like a very reasonable and generic thing to do. At which point, you're dead out of the water with our primitive friend, CSV.

As far as I can tell, the only workaround right now is to create a CSV with some JSON columns and then to create an intermediate model on top of the seed that parses the JSON - it makes for ugly maintenance of the CSV data though, as developers would find it much easier to just maintain a JSON doc.

@mk-oliver-s

Not sure if it's worth adding another issue for this, but YAML would also be good here for multiline strings.

e.g. I want to store prompts in a database and use seeds to add them; the YAML would look something like this:

- id: 9ryuf89wh
  prompt: |
       your task is to do x
       you should do it via y
       here is the data: {data}
- id: i8d3h89
  prompt: |
       your task is to do c
       you should do it via b
       here is the data: {data}

@dbeatty10 dbeatty10 mentioned this issue Jan 31, 2024

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale label Feb 27, 2024
@dbeatty10 dbeatty10 removed the stale label Feb 27, 2024
@dbeatty10
Contributor

I removed the stale label to keep this issue open.


This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale label Aug 26, 2024
@brad-meru

Commenting on the issue to keep it open. We'd still like this functionality.

@github-actions github-actions bot removed the stale label Aug 27, 2024