inspecting and selecting large json files #415
Comments
@c-nv-s Thanks, I'm looking into this now. What would I do without you? 🙏
@c-nv-s... Hmmm, I'm seeing different behavior:

$ sq inspect @cities
SOURCE DRIVER NAME FQ NAME SIZE TABLES VIEWS LOCATION
@cities json cities.large.json cities.large.json 19.9MB 1 0 /Users/neilotoole/work/sq/sq/drivers/json/testdata/cities.large.json
NAME TYPE ROWS COLS
data table 146994 name, lat, lng, country, admin1, admin2
$ sq '@cities.data | .[]'
name lat lng country admin1 admin2
Sant Julià de Lòria 42.46372 1.49129 AD 6 NULL
Pas de la Casa 42.54277 1.73361 AD 3 NULL
[SNIP]

Standalone period '.'

$ sq '@cities.data | .'

So, that's not currently valid jsonl. Also, the [...]

Debugging

So, back to the hanging part. What happens when you do the following (note that I've named the file cities.large.json):

$ cat cities.large.json | sq '.data | .[0:3]'

This is what I see: [...]

And then after processing, I get the expected result. To debug this, can you provide the output of: [...]

We'll get to the bottom of this!
Sorry Neil, please disregard that last comment. I deleted it because I was testing it in a container that was still on v47.
@c-nv-s If running in a container, you probably won't see the progress bar, I suppose? And yes, it does take its time to ingest (although thankfully this is now a one-time cost, with the ingest caching that was implemented a few versions ago). It takes about 35s on my beefy MacBook; I imagine the container could be significantly less beefy? Having been digging around in the JSON ingest process recently, I wouldn't be surprised if I could speed it up by an order of magnitude. That's a story for another day, though.
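To illustrate the one-time ingest cost mentioned above, here's a minimal sketch reusing the @cities handle from the inspect output earlier (timings are illustrative, not measured):

# First query against the newly added source pays the full ingest cost.
$ time sq '@cities.data | count'
# Re-running the same query against the unchanged source should be served
# from the ingest cache and return almost immediately.
$ time sq '@cities.data | count'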
I do see the progress bar; that is working fine. And yes, the container is resource-limited, so patience is required.
@c-nv-s The JSON ingester definitely needs some serious love. Compare it with the CSV ingester (which admittedly is the simplest possible data format):

$ time cat cities.large.json | sq '.data | count'
count
146994
cat cities.large.json 0.00s user 0.03s system 0% cpu 34.013 total
sq '.data | count' 3.27s user 26.91s system 88% cpu 34.154 total
$ time cat cities.large.csv | sq '.data | count'
count
146994
cat cities.large.csv 0.00s user 0.00s system 2% cpu 0.271 total
sq '.data | count' 0.46s user 0.06s system 184% cpu 0.283 total

So, that's 34 seconds for JSON, and 0.28 seconds for CSV... 😬

There is a lot of low-hanging fruit for performance improvements in the JSON ingester. But the main reason it's so slow is that it's designed to deal with any JSON that's thrown at it: nested objects, arrays, fields that change type (e.g. from integer to text, which actually happens in that cities.json file). If the ingester knew ahead of time that the JSON was flat, that fields were always the same type, and so on, then it could go way faster. I'm certain there's a good strategy to be found that combines the best of both worlds, but I haven't spent any time on it yet. Plus, I figured you'd rather see #143 happen first...
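As a concrete illustration of the type-drift problem described above, here is a hypothetical two-record file (not taken from cities.json; the field names just mirror its columns) in which the same field holds an integer in one record and text in the next:

# Hypothetical data: "admin1" is an integer in the first record, text in the second.
$ cat <<'EOF' > drift.json
[
  {"name": "A", "admin1": 6},
  {"name": "B", "admin1": "06"}
]
EOF
# An ingester that fixed the column type from the first record alone would
# misjudge this; handling such cases generically is part of what makes ingest slow.
$ cat drift.json | sq '.data | .[]'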
Yes, sadly it looks like I'm going to have to leave this operation running overnight to see the outcome, which is a shame but not the end of the world.
FYI, it looks like the import of the jsonl still doesn't work. As you mentioned, the determination of the schema type is tricky because it changes from [...]

The schema on the sqlite side was created before the attempt to import the data (see below for the schema).

The existing schema for the sqlite table on my system looks like this: [...]
@c-nv-s: Apologies for the delay in responding; family vacation for spring break.

So, the problem as I see it is that the [...]. There's a basic data incompatibility here. You either need to change the type of your table's [...]

Also, something I'm a little confused about... Are you sure that the schema you're showing is for your target table ([...])? [...] indicates that the type of [...]
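Since the actual schema isn't quoted above, the following is only a generic sketch of the "change the type of your table's column" option, using hypothetical table and column names borrowed from the cities data. SQLite can't change a column's type in place, so the usual approach is to rebuild the table:

$ sqlite3 my.db <<'EOF'
-- Hypothetical table whose "admin1" column must accept text as well as numbers.
ALTER TABLE cities RENAME TO cities_old;
CREATE TABLE cities (name TEXT, lat REAL, lng REAL, country TEXT, admin1 TEXT, admin2 TEXT);
INSERT INTO cities SELECT * FROM cities_old;
DROP TABLE cities_old;
EOF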
No worries at all, I hope you enjoyed your break. I did eventually manage to migrate all the data; however, your response leaves me with a couple of doubts about my previous mental model. Note that when I split up the file into 2 parts so that the second part begins at the BRC row, then it does work... which again led me to suspect that either the target schema was still being determined by the first row of the data, or your logic to switch schema types during the ingest phase was not working as intended. I think I remember also testing it by just creating the target table but not specifying a schema, and in that case it worked, so there might also be a subtle warning about the differences in those situations.
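For what it's worth, the split described above can be sketched roughly like this, assuming the data had already been converted to jsonl (one object per line); the line numbers are hypothetical, chosen so that the second file starts at the offending row:

$ head -n 100000 cities.jsonl > part1.jsonl
$ tail -n +100001 cities.jsonl > part2.jsonl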
Just wanted to mention that a nice "alternative" for this would be if the [...] could be [...]
Describe the bug
I was preparing to copy a json file to sqlite (similar to https://sq.io/docs/cookbook#import-json-array-to-database ) but could not because I encountered some buggy behaviour.
sample data: https://github.com/lutangar/cities.json/raw/master/cities.json
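The cookbook recipe referenced above follows roughly this pattern (a sketch only: the handles and table name are hypothetical, and the --handle and --insert flags reflect my reading of the linked recipe rather than commands taken from this issue):

# Add the target SQLite database and the JSON file as sources, then insert
# the query results into a table in the database.
$ sq add ./my.db --handle @sqlite_db
$ sq add ./cities.json --handle @cities
$ sq '@cities.data | .[]' --insert @sqlite_db.cities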
To Reproduce
Expected behavior
Not actually sure, to be honest, but definitely not what was received, haha.
sq version
"version": "v0.47.2",
"commit": "135318f",
"timestamp": "2024-01-30T05:13:40Z",
"latest_version": "v0.48.1",