
crash on binary data, native support for compressed csv? #2301

Closed
wardi opened this issue Nov 20, 2024 · 7 comments · Fixed by #2304
Assignees
Labels
bug Something isn't working

Comments

@wardi

wardi commented Nov 20, 2024

Describe the bug

qsv crashes if given binary data

To Reproduce

$ qsv stats mybigdata.csv.gz 
thread 'main' panicked at /home/runner/.cargo/git/checkouts/rust-csv-4524c5d96b17e863/7dc2760/src/byte_record.rs:277:56:
range end index 3569017630560 out of range for slice of length 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected behavior
Report an error with the expected format, or for bonus points handle .gz, .bz2, .xz etc automatically

Screenshots/Backtrace/Sample Data

Desktop (please complete the following information):

  • OS: Ubuntu Linux
  • qsv Version 0.138.0

Additional context

Happily qsv does work fine in a pipeline like zcat mybigdata.csv.gz | qsv stats
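
The decompress-then-pipe pattern generalizes to other compressors (bzcat for .bz2, xzcat for .xz). A minimal sketch with a tiny generated file (file names are hypothetical, and the final line only runs if qsv happens to be on PATH):

```shell
# Round-trip a small CSV through gzip, then stream it the same way
# "zcat mybigdata.csv.gz | qsv stats" does.
printf 'a,b\n1,2\n3,4\n' > sample.csv
gzip -f sample.csv                      # replaces sample.csv with sample.csv.gz
gzip -dc sample.csv.gz | head -n 1      # the header row survives the round trip
# Same idea, piped into qsv (guarded so the sketch runs without qsv too):
if command -v qsv >/dev/null; then gzip -dc sample.csv.gz | qsv stats; fi
```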

@ondohotola

I would consider this a feature :-)-O; on Linux and macOS you can easily decompress the file and pipe the result into qsv.

The binary format qsv supports is Snappy. While I am personally unimpressed by it, it is very fast, so for very large data sets I would first gunzip and then Snappy-compress for repeated use.

While I like gzip, I am not sure the feature bloat is worthwhile when the same thing can easily be done with a pipe or a shell script.

@wardi
Author

wardi commented Nov 20, 2024

Crashing with a thread 'main' panicked is a feature? 🤔

@jqnatividad
Collaborator

jqnatividad commented Nov 20, 2024

Hi @wardi ,
As stats is a central qsv command and the main engine behind DataPusher+, I've tweaked it over time to squeeze as much performance as possible from it to enable the "Automagical Metadata" qualities we're both working on in CKAN.

As such, its top goal is performance.

That's why I chose to support Snappy instead of more popular compression formats like gz and zip.

Another goal of qsv is composability, so as you and @ondohotola pointed out, qsv can easily be used with other purpose-built command-line tools.

But you're right, qsv should at least check for supported formats and fail gracefully rather than panic.

Currently, it already has logic to detect CSV, TSV/TAB and SSV formats and their Snappy-compressed variants (csv.sz, tsv.sz, tab.sz and ssv.sz), set the default delimiter accordingly, and compress/decompress automatically; this could easily be extended.
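
That extension-based detection can be sketched in a few lines of shell. This is illustrative only, not qsv's actual code; the function name and the assumption that SSV means semicolon-separated are mine:

```shell
# Map a file name to a default delimiter, honoring a trailing ".sz"
# (Snappy) extension so "data.tsv.sz" is detected as TSV.
detect_delim() {
  name=${1%.sz}                  # strip an optional ".sz" suffix
  case $name in
    *.csv)       echo comma ;;
    *.tsv|*.tab) echo tab ;;
    *.ssv)       echo semicolon ;;   # assumption: SSV = semicolon-separated
    *)           echo unsupported ;;
  esac
}

detect_delim data.csv       # comma
detect_delim data.tsv.sz    # tab
detect_delim data.bin       # unsupported
```

An "unsupported" result is the point where a real implementation would fail gracefully instead of panicking.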

In the meantime, you may want to run validate upstream of stats in your pipeline. That's what DP+ and qsv pro do as the first step when ingesting a dataset. If no JSON Schema is provided, validate falls back to RFC 4180 validation mode and also checks that the file is UTF-8 encoded.

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Nov 20, 2024
@jqnatividad jqnatividad self-assigned this Nov 20, 2024
@wardi
Author

wardi commented Nov 20, 2024

Thanks @jqnatividad, maybe when I finally get into Rust I could send a PR adding more automatic stream compression/decompression formats.

@jqnatividad jqnatividad added bug Something isn't working and removed enhancement New feature or request. Once marked with this label, its in the backlog. labels Nov 21, 2024
@jqnatividad
Collaborator

jqnatividad commented Nov 21, 2024

Hi @wardi ,
Went the extra mile and added mime-type inferencing using the file-format crate, which is already used by the sniff command. (sniff may be of interest to you too, as it was created to support next-gen CKAN harvesting: harvesting remote CSVs' metadata by just sampling them.)

Also added a more human-friendly panic handler with the human-panic crate.

@jqnatividad
Collaborator

jqnatividad commented Nov 25, 2024

Ended up simplifying input format checking to just looking at supported file extensions, and removed the mime-type sniffing as it was causing false-positive failures in CI property tests.

#2308
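
For contrast, content sniffing of the kind that was removed inspects magic bytes rather than the file name; gzip streams always begin with the two bytes 0x1f 0x8b. A rough sketch of the idea (not the file-format crate's logic; demo.csv.gz is a generated throwaway file):

```shell
# Create a small gzip file, then identify it by its magic bytes alone.
printf 'a,b\n1,2\n' | gzip -c > demo.csv.gz
magic=$(head -c 2 demo.csv.gz | od -An -tx1 | tr -d ' \n')
if [ "$magic" = "1f8b" ]; then echo "gzip detected"; fi
```

The trade-off seen in this issue: magic bytes catch misnamed files, but extension checks are cheaper and, as noted above, less prone to false positives.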

@jqnatividad
Collaborator

jqnatividad commented Dec 1, 2024

The sqlp command now supports auto-decompression of gzip-, zstd- and zlib-compressed CSV files when using the read_csv() table function.

   qsv sqlp SKIP_INPUT "select * from read_csv('data.csv.gz')"
   qsv sqlp SKIP_INPUT "select * from read_csv('data.csv.zst')"
   qsv sqlp SKIP_INPUT "select * from read_csv('data.csv.zlib')"

This was made possible by enabling the Polars decompress-fast feature in #2315.

For suite-wide auto decompression support beyond snappy, PRs are still welcome @wardi. 😄
