
crash on binary data, native support for compressed csv? #2301

Closed
wardi opened this issue Nov 20, 2024 · 7 comments · Fixed by #2304
Assignees
Labels
bug Something isn't working

Comments

@wardi

wardi commented Nov 20, 2024

Describe the bug

qsv crashes if given binary data

To Reproduce

$ qsv stats mybigdata.csv.gz 
thread 'main' panicked at /home/runner/.cargo/git/checkouts/rust-csv-4524c5d96b17e863/7dc2760/src/byte_record.rs:277:56:
range end index 3569017630560 out of range for slice of length 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected behavior
Report an error with the expected format, or for bonus points handle .gz, .bz2, .xz etc automatically

Screenshots/Backtrace/Sample Data

Desktop (please complete the following information):

  • OS: Ubuntu Linux
  • qsv Version 0.138.0

Additional context

Happily qsv does work fine in a pipeline like zcat mybigdata.csv.gz | qsv stats
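
The decompress-then-pipe pattern generalizes to other compressors (bzcat for .bz2, xzcat for .xz). A minimal sketch with a tiny generated file (file names are hypothetical, and the final line only runs if qsv happens to be on PATH):

```shell
# Round-trip a small CSV through gzip, then stream it the same way
# "zcat mybigdata.csv.gz | qsv stats" does.
printf 'a,b\n1,2\n3,4\n' > sample.csv
gzip -f sample.csv                      # replaces sample.csv with sample.csv.gz
gzip -dc sample.csv.gz | head -n 1      # the header row survives the round trip
# Same idea, piped into qsv (guarded so the sketch runs without qsv too):
if command -v qsv >/dev/null; then gzip -dc sample.csv.gz | qsv stats; fi
```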

@ondohotola

I would consider this a feature :-)-O; on Linux and macOS you can easily decompress the file and pipe the result into qsv.

The binary format qsv supports is Snappy. While I am personally unimpressed by it, it is very fast, so for very large data sets I would first gunzip and then Snappy-compress for repeated use.

While I like gzip, I am not sure the feature bloat is worthwhile when the same thing can easily be done with a pipe or a shell script.

@wardi
Author

wardi commented Nov 20, 2024

Crashing with a thread 'main' panicked is a feature? 🤔

@jqnatividad
Collaborator

jqnatividad commented Nov 20, 2024

Hi @wardi ,
As stats is a central qsv command and the main engine behind DataPusher+, I've tweaked it over time to squeeze as much performance as possible from it to enable the "Automagical Metadata" qualities we're both working on in CKAN.

As such, its top goal is performance.

That's why I chose to support Snappy instead of more popular compression formats like gz and zip.

Another goal of qsv is composability, so as you and @ondohotola pointed out, qsv can easily be used with other purpose-built command-line tools.

But you're right, qsv should at least check for supported formats and fail gracefully rather than panic.

Currently, it already has logic to detect CSV, TSV/TAB and SSV formats and their Snappy-compressed variants (csv.sz, tsv.sz, tab.sz and ssv.sz), set the default delimiter accordingly, and compress/decompress automatically; this could easily be extended.
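
That extension-based detection can be sketched in a few lines of shell. This is illustrative only, not qsv's actual code; the function name and the assumption that SSV means semicolon-separated are mine:

```shell
# Map a file name to a default delimiter, honoring a trailing ".sz"
# (Snappy) extension so "data.tsv.sz" is detected as TSV.
detect_delim() {
  name=${1%.sz}                  # strip an optional ".sz" suffix
  case $name in
    *.csv)       echo comma ;;
    *.tsv|*.tab) echo tab ;;
    *.ssv)       echo semicolon ;;   # assumption: SSV = semicolon-separated
    *)           echo unsupported ;;
  esac
}

detect_delim data.csv       # comma
detect_delim data.tsv.sz    # tab
detect_delim data.bin       # unsupported
```

An "unsupported" result is the point where a real implementation would fail gracefully instead of panicking.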

In the meantime, you may want to run validate upstream of stats in your pipeline. That's what DP+ and qsv pro do as the first step when ingesting a dataset. If no JSON Schema is provided, validate falls back to RFC 4180 validation mode and also checks that the file is UTF-8 encoded.

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Nov 20, 2024
@jqnatividad jqnatividad self-assigned this Nov 20, 2024
@wardi
Author

wardi commented Nov 20, 2024

Thanks @jqnatividad, maybe when I finally get into Rust I could send a PR adding more automatic stream compression/decompression formats.

@jqnatividad jqnatividad added bug Something isn't working and removed enhancement New feature or request. Once marked with this label, its in the backlog. labels Nov 21, 2024
@jqnatividad
Collaborator

jqnatividad commented Nov 21, 2024

Hi @wardi ,
Went the extra mile and added mime-type inferencing using the file-format crate, which is already used by the sniff command. (sniff may be of interest to you too, as it was created to support next-gen CKAN harvesting: harvesting remote CSVs' metadata by just sampling them.)

Also added a more human-friendly panic handler with the human-panic crate.

@jqnatividad
Collaborator

jqnatividad commented Nov 25, 2024

Ended up simplifying input format checking to just looking at supported file extensions, and removed the mime-type sniffing as it was causing false-positive failures in CI property tests.

#2308
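
For contrast, content sniffing of the kind that was removed inspects magic bytes rather than the file name; gzip streams always begin with the two bytes 0x1f 0x8b. A rough sketch of the idea (not the file-format crate's logic; demo.csv.gz is a generated throwaway file):

```shell
# Create a small gzip file, then identify it by its magic bytes alone.
printf 'a,b\n1,2\n' | gzip -c > demo.csv.gz
magic=$(head -c 2 demo.csv.gz | od -An -tx1 | tr -d ' \n')
if [ "$magic" = "1f8b" ]; then echo "gzip detected"; fi
```

The trade-off seen in this issue: magic bytes catch misnamed files, but extension checks are cheaper and, as noted above, less prone to false positives.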

@jqnatividad
Collaborator

jqnatividad commented Dec 1, 2024

The sqlp command now supports auto-decompression of gzip-, zstd- and zlib-compressed CSV files when using the read_csv() table function.

   qsv sqlp SKIP_INPUT "select * from read_csv('data.csv.gz')"
   qsv sqlp SKIP_INPUT "select * from read_csv('data.csv.zst')"
   qsv sqlp SKIP_INPUT "select * from read_csv('data.csv.zlib')"

This was made possible by enabling the Polars decompress-fast feature in #2315.

For suite-wide auto decompression support beyond snappy, PRs are still welcome @wardi. 😄
