Releases: dathere/qsv
2.2.1
[2.2.1] - 2025-01-27
Changed
- deps: bumped polars to 0.46.0. This will allow us to publish qsv to crates.io as qsv was using features that were not enabled in polars 0.45.1 275b2b8
Fixed
- stats: fix cache json processing bug. Fixes #2476 #2477
- benchmarks: v6.1.0 - ensured all stats cache benchmarks actually used the stats cache even if the default --cache-threshold is 5 seconds - too high to trigger stats cache creation ac33010
Full Changelog: 2.2.0...2.2.1
2.2.0
[2.2.0] - 2025-01-26
Highlights:
- stats - the ❤️ of qsv, got a little tune-up:
  - It got a tad faster now that we only compute string length stats for string types. Previously, we were also computing length for numbers, thinking it would be useful for storage sizing purposes (as everything is stored as a string in CSV). But as performance is goal number 1, we're no longer doing so. Besides, this sizing info can be derived using other stats.
  - Fixed the problem with the stats cache being deleted/ignored even when not necessary. This bug snuck in while implementing the --cache-threshold cache suppression option. With stats getting its cache mojo back, expect near-instant cache-backed responses not only for stats but also for other "automagical" smart commands.
- diff - @janriemer squashed some bugs without sacrificing diff's ludicrous speed!
- validate: added dynamicEnum custom JSON Schema keyword column specifier support.
  You can now specify which column to validate against (by name or by 0-based column index), instead of always using the first column. This works for local & remote lookup files using the http/s://, ckan:// and dathere:// URL schemes.
- extdedup now actually uses a proper on-disk hash table backed by a memory-mapped file.
  Previously, it was only deduping in-memory as the odht crate was not properly wired to a memory-mapped file 🤦 (I took the name of the odht crate literally and thought it was handling it 🤷). Thanks for the detailed bug report @Svenskunganka!
- JSON query parsing overhaul.
  The fetch, fetchpost & json commands now use the latest jaq engine, making for faster performance especially now that we're precompiling and caching the jaq filter.
- Polars engine upgraded. 🐻‍❄️
  By two versions! py-polars 1.20.0 and 1.21.0 - giving the sqlp, joinp, pivotp & count commands a little boost.
NOTE: qsv v2.2.0 is not available on crates.io, as crates.io does not allow enabling unreleased features and we are still awaiting a new version of Polars. As soon as Polars 0.46.0 is published, a new qsv patch release will be published to crates.io.
This means that installation option 3 using cargo install will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.2.0 still work.
Added
- diff: add --delimiter "convenience" option. Fulfills #2447 #2464
- slice: add stdin and snappy compressed file support ab34a62
- validate: add dynamicEnum column specifier support. Fulfills #2470 #2472
Changed
- fetch, fetchpost & json: jaq dependency upgrade - from jaq-interpret & jaq-parse to jaq-core/jaq-json/jaq-std #2458
- fetch & fetchpost: cache compiled jaq filter #2467
- joinp: adjust asofby test to reflect Polars py-1.20.0 behavior 853a266
- stats: compute string length stats for string type only #2471
- sqlp: wordsmith fastpath explanation 4e3f853
- refactor: standardize -q and -Q shortcut options. Fulfills #2466 #2468
- deps: bump polars to 0.45.1 at py-polars-1.20.0 tag #2448
- deps: bump polars to 0.45.1 at py-polars-1.21.0 tag 4525d00
- deps: Bump csv-diff to 0.1.1 by @janriemer in #2456
- deps: Bump csvlens to latest upstream 27a723e
- deps: use latest strum upstream 2ca1b0d
- build(deps): bump base62 from 2.2.0 to 2.2.1 by @dependabot in #2440
- build(deps): bump chrono-tz from 0.10.0 to 0.10.1 by @dependabot in #2449
- build(deps): bump data-encoding from 2.6.0 to 2.7.0 by @dependabot in #2444
- build(deps): bump indexmap from 2.7.0 to 2.7.1 by @dependabot in #2461
- build(deps): bump jsonschema from 0.28.1 to 0.28.2 by @dependabot in #2469
- build(deps): bump jsonschema from 0.28.2 to 0.28.3 by @dependabot in #2473
- build(deps): bump log from 0.4.22 to 0.4.25 by @dependabot in #2439
- build(deps): bump semver from 1.0.24 to 1.0.25 by @dependabot in #2459
- build(deps): bump serde_json from 1.0.135 to 1.0.136 by @dependabot in #2455
- build(deps): bump serde_json from 1.0.136 to 1.0.137 by @dependabot in #2460
- build(deps): bump simple-home-dir from 0.4.5 to 0.4.6 by @dependabot in #2445
- build(deps): bump uuid from 1.11.1 to 1.12.0 by @dependabot in #2441
- build(deps): bump uuid from 1.12.0 to 1.12.1 by @dependabot in #2465
- tests: enabled Windows CI caching for faster CI tests
- bumped numerous indirect dependencies to latest versions
- applied select clippy lint suggestions
Fixed
- count: Sometimes, polars count returns zero even if there are rows. Fixed by doing a regular csv reader count when polars count returns zero abcd365
- diff: Fix name to index conversion by @janriemer. Fixes #2443 #2457
- extdedup: refactor/fix to actually have on-disk hash table backed by a mem-mapped file. Fixes #2462 #2475
- stats: fix stats caching as it was inadvertently deleting the stats cache even when not necessary 96e6d28
Removed
- foreach: refactored to remove unmaintained local-encoding dependency #2454
- remove polars feature from qsvdp binary variant. We'll use py-polars from DP+ directly.
Full Changelog: 2.1.0...2.2.0
2.1.0
[2.1.0] - 2025-01-12
Highlights:
- join & joinp fine-tuning continues, with several join key transformation options (--ignore-leading-zeros & --norm-unicode); join fixes for --right-anti and --right-semi joins; and reverting a join performance regression with 2.0.0.
- pivotp uses more summary statistics for even smarter aggregation suggestions
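The join key transformations named above can be pictured with a small Python sketch. This is not qsv's implementation: the helper name `normalize_key` is hypothetical, and the use of NFC is an assumption, since these notes do not say which Unicode normalization form --norm-unicode applies.

```python
import unicodedata


def normalize_key(key: str, ignore_leading_zeros: bool = False,
                  norm_unicode: bool = False) -> str:
    """Hypothetical helper: transform a join key before it is compared."""
    if norm_unicode:
        # Assumption: NFC; the form qsv actually uses is not stated in the notes.
        key = unicodedata.normalize("NFC", key)
    if ignore_leading_zeros and key:
        # Strip leading zeros, but keep a single "0" for all-zero keys.
        key = key.lstrip("0") or "0"
    return key


left_keys = ["00123", "caf\u00e9"]    # precomposed "é"
right_keys = ["123", "cafe\u0301"]    # "e" followed by a combining acute accent

# Keys that differ byte-for-byte still match once both sides are normalized.
matches = [
    (left, right)
    for left in left_keys
    for right in right_keys
    if normalize_key(left, True, True) == normalize_key(right, True, True)
]
```

Without the transformations neither pair would join, because the raw strings are not equal.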
NOTE: qsv v2.1.0 is not available on crates.io. This was caused by qsv's use of a brand new string_normalize Polars feature that is not yet available on the latest release of Polars - v0.45.1. Once a new version of Polars is published with this feature, a new qsv patch release will be published to crates.io.
This means that installation option 3 using cargo install will be limited to 1.0.0 - the last qsv version available on crates.io. All other installation and update options to install/update qsv 2.1.0 still work.
Added
- join: add --ignore-leading-zeros option #2430
- joinp: add --norm-unicode option to unicode normalize join keys #2436
- pivotp: added more smart aggregation suggestions #2428
- template: added to qsvdp binary variant 9df85e6
- benchmarks: added pivotp benchmark 92e4c51
Changed
- joinp: refactored --ignore-leading-zeros handling #2433
- Migrate from unmaintained dynfmt to dynfmt2 #2421
- deps: bump csvlens to latest upstream 52c766d
- deps: bump to latest csv qsv-optimized fork 58ac650
- deps: bumped MiniJinja to 2.6.0 8176368
- deps: bump to latest Polars upstream
- deps: bump qsv-stats to 0.26.0
- build(deps): bump azure/trusted-signing-action from 0.5.0 to 0.5.1 by @dependabot in #2420
- build(deps): bump base62 from 2.0.3 to 2.1.0 by @dependabot in #2419
- build(deps): bump base62 from 2.1.0 to 2.2.0 by @dependabot in #2426
- build(deps): bump phf from 0.11.2 to 0.11.3 by @dependabot in #2417
- build(deps): bump pyo3 from 0.23.3 to 0.23.4 by @dependabot in #2431
- build(deps): bump serde_json from 1.0.134 to 1.0.135 by @dependabot in #2416
- build(deps): bump tokio from 1.42.0 to 1.43.0 by @dependabot in #2423
- build(deps): bump uuid from 1.11.0 to 1.11.1 by @dependabot in #2427
- apply several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped Rust nightly from 2024-12-19 to 2025-01-05 (same version used by Polars)
- bump MSRV to latest Rust stable - v1.84.0
Fixed
- join: revert optimization that actually resulted in a performance regression e42af2b
- join: --right-anti and --right-semi joins didn't swap headers properly #2435
- count: polars-powered count didn't use the right data type for SQL count(*) d8c1524
Full Changelog: 2.0.0...2.1.0
2.0.0
qsv v2.0.0 is here!
It took 193 releases to get to v1.0.0, and we're already at v2.0.0 a month later!?!
Yes! We wanted a running start for 2025, and qsv 2.0.0 marks qsv's biggest release yet!
- It fully enables the "Data Resource Upload First (DRUF)" workflow, allowing Datapusher+ to infer "automagical metadata" from the data itself. It exposes two Domain Specific Language (DSL) options - Luau and MiniJinja - to enable powerful data transformation and validation capabilities. This allows data stewards to upload data first, then use qsv's DSL capabilities inside DP+ to automatically generate rich metadata - including data dictionaries, field descriptions, data quality rules, and data validation schemas. This "automagical metadata" approach dramatically reduces the friction in compiling high-quality, high-resolution metadata (using the DCAT-US 3.0 specification as a reference) that would otherwise be a manual, laborious, and error-prone process.
  Under the hood, the fetchpost, template, stats, validate and luau commands now have the necessary scaffolding to fully support this workflow inside Datapusher+ and ckanext-scheming.
- It adds a new "smart" pivotp command, powered by Polars, to enable fast pivot operations on large datasets. It's "smart" as it uses the stats cache to automatically suggest an aggregation based on a column's data type and summary statistics. You can now pivot your data in seconds by simply specifying the columns to pivot on while blowing past Excel's PivotTable limitations.
- stats now computes geometric mean and harmonic mean and adds string length stats, all while getting a performance boost.
- join and joinp got a lot of love in this release, with several new options:
  - joinp: non-equi join support! 💯🥳
    See "Lightning Fast and Space Efficient Inequality Joins" paper and this Polars non-equi join tracking issue.
  - join & joinp: --right-anti and --right-semi joins
  - joinp: --ignore-leading-zeros option for join keys
  - joinp: --maintain-order option to maintain the order of either the left or right dataset in the output
  - joinp: expanded --cache-schema options to make joinp smarter/faster by leveraging the stats cache
  - join: --keys-output option to write successfully joined keys to a separate output file.
This release lays the groundwork for the outliers "smart" command to quickly identify outliers using stats/frequency info.
It also sets the stage for an initial implementation of our "Data Concierge" that leverages all the high-quality, high-res metadata we automagically compile with DRUF to enable Metadata Gardening Agents to proactively link seemingly unrelated data and glean insights as it constantly grooms the Data Catalog - effectively making it a FAIR Data Factory.
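For readers unfamiliar with the two means that stats now reports, the following Python sketch compares them with the arithmetic mean. It uses Python's standard statistics module purely for illustration; it says nothing about how qsv computes these values, and the sample data is made up.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Made-up sample column of positive numbers.
values = [2.0, 8.0]

arithmetic = mean(values)           # (2 + 8) / 2
geometric = geometric_mean(values)  # square root of 2 * 8
harmonic = harmonic_mean(values)    # 2 / (1/2 + 1/8)

# For positive data the three means are always ordered like this.
ordered = harmonic <= geometric <= arithmetic
```

The geometric mean suits multiplicative data such as growth rates, while the harmonic mean suits averages of rates such as speeds.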
Added
- fetchpost: add --globals-json option #2357
- fixlengths: add --remove-empty option; refactored for performance. Fulfills #2391. #2411
- join: add --keys-output option. Fulfills #2407. #2408
- join: add --right-anti and --right-semi options. Fulfills #2379. #2380
- joinp: add non-equi join support! 💯🥳 #2409
- joinp: add --ignore-leading-zeros option. Fulfills #2398. #2400
- joinp: add --maintain-order option #2338
- joinp: add --right-anti and --right-semi options. Fulfills #2377. #2378
- luau: additional helper functions. Fulfills #1782. #2362
- luau: add qsv_writejson helper #2375
- pivotp: new Polars-powered command. Fulfills #799. #2364
- pivotp: "smart" pivotp. #2367
- stats: add geometric mean and harmonic mean. Fulfills #2227. #2342
- stats: add string length stats to set stage for upcoming outliers "smart" command to quickly identify outliers using stats/frequency info #2390
- template: add --globals-json option #2356
- tojsonl: add --quiet option. Fulfills #2335. #2336
- validate: add --validate-schema option to check if the JSON Schema itself is valid #2393
- contrib(completions): add joinp --ignore-case and slice --invert by @rzmk in #2322
- contrib(completions): add --quiet to tojsonl by @rzmk in #2337
- ci: add qsv_glibc_2.31-headless to action by @rzmk in #2330
- Add license to MSI installer by @rzmk in #2321
Changed
- lens: optimized csvlens library usage, dropping clap dependency #2403
- pivotp: an even smarter pivotp #2368
- stats: performance boost 51349ba
- Update deb package by @tino097 in #2226
- ci: attempt using files-folder instead of files by @rzmk in #2320
- Setting QSV_FREEMEMORY_HEADROOM_PCT to 0 disables memory availability check #2353
- build(deps): bump actix-governor from 0.7.0 to 0.8.0 by @dependabot in #2351
- build(deps): bump bytemuck from 1.20.0 to 1.21.0 by @dependabot in #2361
- build(deps): bump chrono from 0.4.38 to 0.4.39 by @dependabot in #2345
- build(deps): bump crossbeam-channel from 0.5.13 to 0.5.14 by @dependabot in #2354
- build(deps): bump flexi_logger from 0.29.6 to 0.29.7 by @dependabot in #2348
- build(deps): bump governor from 0.7.0 to 0.8.0 by @dependabot in #2347
- build(deps): bump itertools from 0.13.0 to 0.14.0 by @dependabot in #2413
- build(deps): bump jsonschema from 0.26.1 to 0.26.2 by @dependabot in #2355
- build(deps): bump jsonschema from 0.26.2 to 0.27.0 by @dependabot in #2371
- build(deps): bump jsonschema from 0.27.1 to 0.28.0 by @dependabot in #2389
- build(deps): bump jsonschema from 0.28.0 to 0.28.1 by @dependabot in #2396
- bump polars from 0.44.2 to 0.45 #2340
- build(deps): bump polars from 0.45.0 to 0.45.1 by @dependabot in #2344
- bump pyo3 from 0.22 to 0.23 now that Polars supports it #2352
- build(deps): bump redis from 0.27.5 to 0.27.6 by @dependabot in #2331
- build(deps): bump reqwest from 0.12.9 to 0.12.11 by @dependabot in #2385
- build(deps): bump reqwest from 0.12.11 to 0.12.12 by @dependabot in #2395
- build(deps): bump rfd from 0.15.1 to 0.15.2 by @dependabot in #2404
- build(deps): bump serde from 1.0.215 to 1.0.216 by @dependabot in #2349
- build(deps): bump serde from 1.0.216 to 1.0.217 by @dependabot in #2384
- build(deps): bump serde_json from 1.0.133 to 1.0.134 by @dependabot in #2365
- build(deps): bump sysinfo from 0.32.1 to 0.33.0 by @dependabot in #2334
- build(deps): bump sysinfo from 0.33.0 to 0.33.1 by @dependabot in #2383
- deps: bump tabwriter to 1.4.1 bbcbeba
- build(deps): bump tokio from 1.41.1 to 1.42.0 by @dependabot in #2333
- build(deps): bump xxhash-rust from 0.8.12 to 0.8.13 by @dependabot in #2359
- build(deps): bump xxhash-rust from 0.8.13 to 0.8.14 by @dependabot in #2372
- build(deps): bump xxhash-rust from 0.8.14 to 0.8.15 by @dependabot in #2392
- apply several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped Rust nightly from 2024-11-28 to 2024-12-19 (same version used by Polars)
Fixed
1.0.0
qsv v1.0.0 is here!
After over 3 years of development, nearly 200 releases, and 11,000+ commits, qsv has finally reached v1.0.0!
What started as a hobby project to learn Rust during COVID has evolved into a powerful data wrangling tool used in multiple datHere products, open source projects, and even in several mission-critical production environments!
To mark this major milestone, this larger than usual release includes major performance improvements, new features, and various optimizations!
Added
- joinp: add --ignore-case option #2287
- py: add ability to load python expression from file #2295
- replace: add --not-one flag (resolves #2305) by @rzmk in #2307
- slice: add --invert option #2298
- stats: add dataset-level stats #2297
- sqlp: auto-decompression of gzip, zstd & zlib compressed csv files with read_csv table function (implements suggestion from @wardi in #2301) #2315
- template: add lookup support #2313
- added ui feature to make it easier to make a headless build of qsv #2289
- added better panic handling #2304
- added new benchmark for template command cd7e480
- added lookup support legend b46de73
Changed
- move qsv from personal GitHub repo to datHere GitHub org #2317
- template: parallelized template rendering for significant speedups #2273
- simplify input format check #2309
- bump embedded luau from 0.650 to 0.653 986a1d3
- deps: Switch back to simple-home-dir from simple-expand-tilde #2319
- deps: Add minijinja contrib #2276
- deps: bump pyo3 down to 0.21.2 because polars-mem-engine is not compatible with pyo3 0.23.x yet 7f9fc8a
- build(deps): bump base62 from 2.0.2 to 2.0.3 by @dependabot in #2281
- build(deps): bump bytemuck from 1.19.0 to 1.20.0 by @dependabot in #2299
- build(deps): bump bytes from 1.8.0 to 1.9.0 by @dependabot in #2314
- build(deps): bump file-format from 0.25.0 to 0.26.0 by @dependabot in #2277
- build(deps): bump hashbrown from 0.15.1 to 0.15.2 by @dependabot in #2310
- build(deps): bump itoa from 1.0.11 to 1.0.12 by @dependabot in #2300
- build(deps): bump itoa from 1.0.12 to 1.0.13 by @dependabot in #2302
- build(deps): bump itoa from 1.0.13 to 1.0.14 by @dependabot in #2311
- build(deps): bump mlua from 0.10.0 to 0.10.1 by @dependabot in #2280
- build(deps): bump mlua from 0.10.1 to 0.10.2 by @dependabot in #2316
- build(deps): bump serial_test from 3.1.1 to 3.2.0 by @dependabot in #2279
- build(deps): bump minijinja from 2.4.0 to 2.5.0 by @dependabot in #2284
- build(deps): bump minijinja-contrib from 2.3.1 to 2.5.0 by @dependabot in #2283
- build(deps): bump rfd from 0.15.0 to 0.15.1 by @dependabot in #2291
- build(deps): bump sanitize-filename from 0.5.0 to 0.6.0 by @dependabot in #2275
- build(deps): bump serde from 1.0.214 to 1.0.215 by @dependabot in #2286
- build(deps): bump serde_json from 1.0.132 to 1.0.133 by @dependabot in #2292
- build(deps): bump tempfile from 3.13.0 to 3.14.0 by @dependabot in #2278
- build(deps): bump tokio from 1.41.0 to 1.41.1 by @dependabot in #2274
- build(deps): bump url from 2.5.3 to 2.5.4 by @dependabot in #2306
- applied several clippy suggestions
- bumped numerous indirect dependencies to latest versions
- bumped MSRV to latest Rust stable (1.83.0)
- bumped Rust nightly from 2024-11-01 to 2024-11-28, the same version used by Polars
Fixed
- fix get_stats_records() helper to handle input files with embedded spaces (fixes #2294) #2296
- added better panic handling (fixes #2301) #2304
- implement simple format check for input files (fixes #2301) #2308
Removed
- removed simple-expand-tilde dependency in favor of simple-home-dir #2318
- removed patched fork of indicatif now that 0.17.9 is released, fixing GH unmaintained advisory for instant 33fa54a
- removed clipboard command from qsvlite binary variant 9c663d8
Full Changelog: 0.138.0...1.0.0
0.138.0
Highlights:
- New template command for rendering templates with CSV data.
  Generate complex documents from CSVs (Form letters, HTML, JSON, XML files, etc.) with the powerful MiniJinja template engine (Example template).
- New lookup module for fetching reference data from remote and local files.
  In addition to the typical http/https schemes for remote files, qsv adds two additional schemes - CKAN:// and datHere://, fetching lookup data from a CKAN site or datHere maintained reference data respectively. The lookup module has simple file-based caching as well to minimize repeated fetching of typically static reference data (default cache age: 600 seconds).
  The lookup module is now being used by the luau (for its qsv_register_lookup helper) and validate (for its dynamicEnum custom JSON Schema keyword) commands. More commands will take advantage of this module over time (e.g. apply, geocode, template, sqlp, etc.) to do extended lookups (e.g. lookup Census information given spatiotemporal data - like demographic info of a Census tract).
- ✨ Enhanced fetchpost with MiniJinja templating for payload construction.
  Previously, fetchpost was limited to posting url-encoded HTML Form data with content type application/x-www-form-urlencoded. Now with the new --payload-tpl and --content-type options, users can post request bodies rendered with MiniJinja and specify other content types (typically application/json, text/plain, multipart/form-data) as well.
- ✨ Improved Polars integration with automatic schema detection
  The joinp and sqlp commands now use qsv's stats cache to automatically determine column data types, rather than having Polars scan a sample of rows. This provides two key benefits:
  - Faster execution by skipping Polars' schema inference step
  - GUARANTEED data type inferencing since the stats cache analyzes the entire dataset, not just a sample
- fast-float2 crate for faster float parsing
  Casting string/bytes to float is now much faster (2 to 8x faster than Rust's standard library) with fast-float2.
- Major dependency updates including Polars 0.44.2, Luau 0.650, mlua 0.10.0 and jsonschema 0.26.1
  These core crates underpin qsv's advanced commands. Using the latest versions of these crates allows qsv to stay true to its goal of being the fastest and most comprehensive data-wrangling toolkit.
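The file-based caching described for the lookup module can be sketched roughly as follows in Python. This is only an illustration of an age-based file cache, not qsv's code: the helper names `is_cache_fresh` and `get_lookup` and the `fetch` callback are hypothetical, and only the 600-second default comes from the notes above.

```python
import os
import time

DEFAULT_CACHE_AGE_SECS = 600  # default cache age stated in the release notes


def is_cache_fresh(path, max_age_secs=DEFAULT_CACHE_AGE_SECS, now=None):
    """Hypothetical helper: True if the cached file exists and is young enough."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < max_age_secs


def get_lookup(path, fetch, max_age_secs=DEFAULT_CACHE_AGE_SECS):
    """Return lookup data, calling fetch() only when the cache is missing or stale."""
    if is_cache_fresh(path, max_age_secs):
        with open(path, encoding="utf-8") as cached:
            return cached.read()
    data = fetch()  # e.g. a download of the remote reference file
    with open(path, "w", encoding="utf-8") as cached:
        cached.write(data)
    return data
```

Calling `get_lookup` twice within the cache age invokes `fetch` only once, which is the point of caching typically static reference data.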
Added
- added lookup module - enabling fetching and caching of reference data from remote and local files #2262
- fetchpost: add --payload-tpl <file> and --content-type options to construct payload using MiniJinja with the appropriate content-type #2268 5921498
- joinp: derive polars schema from stats cache 86fe22e
- sqlp: derive polars schema from stats cache #2256
- template: new command to render MiniJinja templates with CSV data #2267
- validate: add dynamicEnum lookup support #2265
- contrib(completions): add template command and update fetchpost by @rzmk in #2269
- add fast-float2 dependency for faster bytes to float conversion 7590e4e 3ca30aa
- added more benchmarks for new/updated commands f8a1d4f cd7e480
Changed
- luau: adapt to mlua 0.10 API changes 268cb45
- luau: refactored stage management 31ef58a
- luau: now uses the lookup module 2f4be34
- stats: minor perf refactoring 6cdd6ea
- build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #2243
- build(deps): bump azure/trusted-signing-action from 0.4.0 to 0.5.0 by @dependabot in #2239
- build(deps): bump bytes from 1.7.2 to 1.8.0 by @dependabot in #2231
- build(deps): bump cached from 0.53.1 to 0.54.0 by @dependabot in #2272
- build(deps): bump flexi_logger from 0.29.3 to 0.29.4 by @dependabot in #2229
- build(deps): bump flexi_logger from 0.29.4 to 0.29.5 by @dependabot in #2261
- build(deps): bump flexi_logger from 0.29.5 to 0.29.6 by @dependabot in #2266
- build(deps): bump hashbrown from 0.15.0 to 0.15.1 by @dependabot in #2270
- build(deps): bump jsonschema from 0.24.0 to 0.24.1 by @dependabot in #2234
- build(deps): bump jsonschema from 0.24.1 to 0.24.2 by @dependabot in #2238
- build(deps): bump jsonschema from 0.24.2 to 0.24.3 by @dependabot in #2240
- build(deps): bump jsonschema from 0.25.0 to 0.25.1 by @dependabot in #2244
- build(deps): bump jsonschema from 0.26.0 to 0.26.1 by @dependabot in #2260
- build(deps): bump regex from 1.11.0 to 1.11.1 by @dependabot in #2242
- build(deps): bump reqwest from 0.12.8 to 0.12.9 by @dependabot in #2258
- build(deps): bump serde from 1.0.210 to 1.0.211 by @dependabot in #2232
- build(deps): bump serde from 1.0.211 to 1.0.213 by @dependabot in #2236
- build(deps): bump serde from 1.0.213 to 1.0.214 by @dependabot in #2259
- build(deps): bump simd-json from 0.14.1 to 0.14.2 by @dependabot in #2235
- build(deps): bump tokio from 1.40.0 to 1.41.0 by @dependabot in #2237
- deps: updated our fork of the csv crate with more perf optimizations eae7d76
- deps: use calamine upstream with unreleased fixes 4cc7f37
- deps: use our csvlens fork until PR removing unneeded arboard features is merged bb32322
- deps: bump jsonschema from 0.25 to 0.26 #2251
- deps: bump embedded Luau from 0.640 to 0.650 8c54b87 aca30b0
- deps: bump mlua from 0.9 to 0.10 #2249
- deps: bump Polars from 0.43.1 at py-1.11.0 tag to latest 0.44.2 upstream #2255 0e40a44
- apply select clippy lint suggestions
- updated indirect dependencies
- aligned Rust nightly to Polars nightly - 2024-10-28 - 245bcb5
Fixed
Removed
- removed need to set RAYON_NUM_THREADS env var and just call the Rayon API directly aa6ef89
- removed unneeded create_dir_all_threadsafe helper now that std::create_dir_all is threadsafe d0af83b
Full Changelog: 0.137.0...0.138.0
0.137.0
Highlights:
- extdedup & extsort now support two modes - LINE mode and CSV mode. Previously, both commands only worked on a line-by-line basis (LINE mode).
  With the addition of CSV mode, you can now deduplicate or sort CSV files on a column-by-column basis, with the powerful --select option to specify which columns to deduplicate or sort on.
  This is especially useful for large CSV files with many columns, where you only want to deduplicate or sort on a subset of columns. And since both commands use disk-backed algorithms (an on-disk hash table for extdedup, and an external merge sort for extsort), they can handle files larger than memory.
- sqlp now has a --cache-schema option that caches the inferred schema of the input CSV file, which can significantly speed up subsequent queries on the same file, as the initial schema inferencing step is skipped.
- fetch and fetchpost have been updated to use the jaq crate instead of the jql crate. This change was made to improve performance and to make the commands consistent with the json command which also uses jaq. Furthermore, jaq is a clone of jq - a widely used JSON parsing tool, so it should be more familiar to users.
- stats is a tad faster as we keep squeezing more performance from this central command.
Added
- extdedup: now supports two modes - LINE mode and CSV mode #2208
- extsort: now also has two modes - CSV mode and LINE mode #2210
- sqlp: add --cache-schema option #2224
- added sqlp --cache-schema benchmarks
Changed
- apply & applydp: use smallvec for operations vector & other minor performance optimizations #2219 & bc837ae
- apply & applydp: specify min_length for parallel iterators 7d6ce5e
- fetch & fetchpost: replace jql with jaq #2222
- stats: performance optimizations f205809 e26c27f 4579c1b
- validate: specify min_length for parallel iterators a5b8185
- deps: updated polars to 0.43.1 at the py-1.10.0 tag.
- build(deps): bump calamine from 0.26.0 to 0.26.1 by @dependabot in #2204
- build(deps): bump csvs_convert from 0.8.14 to 0.9.0 by @dependabot in #2215
- build(deps): bump flexi_logger from 0.29.2 to 0.29.3 by @dependabot in #2209
- build(deps): bump jsonschema from 0.23.0 to 0.24.0 by @dependabot in #2223
- build(deps): bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #2207
- build(deps): bump pyo3 from 0.22.4 to 0.22.5 by @dependabot in #2212
- build(deps): bump redis from 0.27.3 to 0.27.4 by @dependabot in #2202
- build(deps): bump redis from 0.27.4 to 0.27.5 by @dependabot in #2217
- build(deps): bump serde_json from 1.0.129 to 1.0.130 by @dependabot in #2218
- build(deps): bump serde_json from 1.0.131 to 1.0.132 by @dependabot in #2220
- build(deps): bump uuid from 1.10.0 to 1.11.0 by @dependabot in #2213
- apply select clippy lints
- bumped indirect dependencies
- bumped MSRV to 1.82
Fixed
- fix performance regression in batched commands by refactoring optimal_batch_size to require indexed CSV files #2206
Removed
- fetch & fetchpost: removed jql options; replaced with jaq #2222
Full Changelog: 0.136.0...0.137.0
0.136.0
qsv pro is now available in the Microsoft Store!
It's Data Wrangling Democratized on the Desktop, featuring:
- Familiar Spreadsheet Interface
  tap the power of qsv to query, analyze, enrich, scrub and transform huge Excel files and multi-gigabyte CSV files in seconds, without having to deal with the command-line.
- CKAN desktop client
  designed to make data publishing easier for portal operators and data stewards using the CKAN platform.
- Flow
  allows you to build custom node-based flows and data pipelines using a visual interface.
- Toolbox
  features an ever-expanding library of reusable scripts for common data-wrangling use cases.
- and more!
  Natural Language Interface (RAG), Polars SQL query support, an API, Python/Luau support, automatic Data Dictionaries, DCAT 3 metadata profile inferencing, along with a retinue of other cloud-based services (e.g. customizable street-level geocoding, data feeds, reference data lookups, geo-ip lookups, cloud storage support, .qsv file format, etc.) that will be unveiled in future versions.
Like qsv, we're iterating rapidly with qsv pro, so your feedback is essential. Give it a try!
Other highlights:
- excel: new --table option for XLSX files; new --header-row option; expanded --range option, adding support for Named Ranges and absolute ranges (e.g. Sheet2!$A$1:$J$10); and expanded metadata export now including Named Ranges and Tables (for XLSX files)
- Improved performance for several commands (apply, datefmt, tojsonl and validate) through automatic batch size optimization
- validate: dynamicEnum custom JSON Schema keyword in validate command (renamed from dynenum) and enhanced email validation
- schema: automatic JSON Schema const inferencing for columns with just one value
- Significant dependency updates, including latest upstream versions of Polars, jsonschema, and serde_json with unreleased performance upgrades, new features and fixes
NOTE: You can see qsv & qsv pro in action in our "The Problem with Data Portals" webinar Wed, Oct 23, 2024. 1-2pm EDT
Added
- qsv pro is now in the Microsoft Store!!!
- apply, datefmt, tojsonl, validate: added logic to automatically determine optimal batch size for better parallelization #2178
- enum: added --new-column support for all enum modes, not just --increment #2173
- excel: new --table option for XLSX files #2194
- excel: new --header-row option 458f79a
- excel: expanded range and metadata options #2195
- schema: added JSON Schema automatic const inferencing #2180
- Add signing step to qsv MSI installer GitHub Action by @rzmk in #2182
- contrib(completions): add --table option to qsv excel by @rzmk in #2197
- completions: add --header-row option to qsv excel e8794d5
- added new apply operations sentiment benchmark b745e64
- docs: added indexing section to PERFORMANCE.md 804145a
Changed
- stats: various minor micro-optimizations 62d95fc 2c2862a
- validate: renamed custom keyword dynenum to dynamicEnum to be more consistent with JSON schema naming conventions 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
- validate: optimizations for increased performance; replace serde_json with simd_json 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
- apply new clippy::ref_option lint to Config::new API #2192
- Update debian package readme by @tino097 in #2187
- deps: bump calamine from 0.25 to 0.26 b42279a
- deps: jsonschema use latest 0.22.3 upstream with unreleased features/fixes
- deps: polars use latest 0.43.1 upstream with unreleased features/fixes
- deps: created our own fork of unmaintained vader_sentiment crate b426761
- deps: use serde_json upstream with unreleased perf improvement/fixes https://github.com/jqnatividad/qsv/blob/1c1174b3b8b65d9dfd9c841597366fb09d0a047c/Cargo.toml#L221
- build(deps): bump flate2 from 1.0.33 to 1.0.34 by @dependabot in #2171
- build(deps): bump flexi_logger from 0.29.0 to 0.29.1 by @dependabot in #2189
- build(deps): bump flexi_logger from 0.29.1 to 0.29.2 by @dependabot in #2196
- build(deps): bump hashbrown from 0.14.5 to 0.15.0 by @dependabot in #2186
- build(deps): bump jsonschema from 0.20.0 to 0.21.0 by @dependabot in #2177
- build(deps): bump jsonschema from 0.22.1 to 0.22.2 by @dependabot in #2191
- build(deps): bump regex from 1.10.6 to 1.11.0 by @dependabot in #2176
- build(deps): bump reqwest from 0.12.7 to 0.12.8 by @dependabot in #2183
- build(deps): bump simd-json from 0.14.0 to 0.14.1 #2199
- build(deps): bump simple-expand-tilde from 0.4.2 to 0.4.3 by @dependabot in #2190
- build(deps): bump sysinfo from 0.31.4 to 0.32.0 by @dependabot in #2193
- build(deps): bump tempfile from 3.12.0 to 3.13.0 by @dependabot in #2175
- apply select clippy lints
- bumped indirect dependencies
- aligned Rust nightly to Polars nightly - 2024-09-29 7cd2de1
Fixed
- schema: fix enum so it only adds a list when the number of unique values > --enum-threshold #2180
- Upload artifact fix for Debian package publishing by @tino097 in #2168
- fixed typos configuration 627de89
- fixed various GitHub Actions publishing workflow issues
Full Changelog: 0.135.0...0.136.0
0.135.0
Highlights
JSON Schema validation just got a whole lot more powerful with the introduction of qsv's custom dynenum keyword!
With dynenum, you can now dynamically look up valid enum values from a CSV (on the filesystem or on a URL), allowing for more flexible and responsive data validation.
Unlike the standard enum keyword, dynenum does not require hardcoding valid values at schema definition time, and can be used to validate data against a changing set of valid values.
For an example, see #1872 (reply in thread).
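To make the idea concrete, here is a minimal Python sketch of the kind of lookup dynenum performs conceptually: the set of valid values is read from a CSV at validation time rather than hardcoded in the schema. This is not qsv's implementation or its schema syntax. The function names, the sample lookup data, and the inline CSV string (standing in for a file or URL) are all assumptions made for illustration, and the sketch reads valid values from the first column only.

```python
import csv
import io


def load_valid_values(csv_text: str) -> set[str]:
    """Collect valid values from the first column of a lookup CSV (header skipped)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return {row[0] for row in reader if row}


def validate_column(values: list[str], valid: set[str]) -> list[tuple[int, str]]:
    """Return (row_index, value) pairs for values missing from the lookup set."""
    return [(i, v) for i, v in enumerate(values) if v not in valid]


# Hypothetical lookup data; a real lookup would be loaded from a file or URL.
LOOKUP_CSV = "code,name\nNY,New York\nCA,California\nTX,Texas\n"

valid_states = load_valid_values(LOOKUP_CSV)
errors = validate_column(["NY", "ZZ", "TX"], valid_states)
print(errors)
```

Because the valid set is rebuilt from the lookup data on each run, updating the CSV changes what passes validation without touching the schema.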
In an upcoming qsv pro release, we're planning on making dynenum even more powerful by allowing you to easily specify high-value reference data (e.g. US Census data, World Bank data, data.gov, etc.) that is maintained at data.dathere.com and other CKAN instances.
This release also adds the custom currency JSON Schema format, which enables currency validation according to the ISO 4217 standard.
The Polars engine was also upgraded to 0.43.1 at the py-1.81.1 tag - making for various under-the-hood improvements for the sqlp, joinp and count commands, as we set the stage for more Polars-powered features in future releases.
Added
- foreach: enabled foreach command on Windows prebuilt binaries def9c8f
- lens: added support for QSV_SNIFF_DELIMITER env var and snappy auto-decompression 8340e89
- sample: add --max-size option e845a3c
- validate: added dynenum custom JSON Schema keyword for dynamic validation lookups #2166
- tests: add tests for https://100.dathere.com/lessons/2 by @rzmk in #2141
- added stats_sorted and frequency_sorted benchmarks
- added validate_dynenum benchmarks
Changed
- json: add error for empty key and update usage text by @rzmk in #2167
- prompt: gate prompt command behind prompt feature #2163
- validate: expanded currency JSON Schema custom format to support ISO 4217 currency codes and alternate formats 5202508
- validate: migrate to new jsonschema crate api 5d65054
- Update ubuntu version for deb package by @tino097 in #2126
- contrib(completions): update completions for qsv v0.134.0 and fix subcommand options by @rzmk in #2135
- contrib(completions): add --max-size completion for sample by @rzmk in #2142
- deps: bump to polars 0.43.1 at py-1.81.1 #2130
- deps: switch back to calamine upstream instead of our fork 677458f
- build(deps): bump actix-governor from 0.5.0 to 0.6.0 by @dependabot in #2146
- build(deps): bump anyhow from 1.0.87 to 1.0.88 by @dependabot in #2132
- build(deps): bump arboard from 3.4.0 to 3.4.1 by @dependabot in #2137
- build(deps): bump bytes from 1.7.1 to 1.7.2 by @dependabot in #2148
- build(deps): bump geosuggest-core from 0.6.3 to 0.6.4 by @dependabot in #2153
- build(deps): bump geosuggest-utils from 0.6.3 to 0.6.4 by @dependabot in #2154
- build(deps): bump jql-runner from 7.1.13 to 7.2.0 by @dependabot in #2165
- build(deps): bump jsonschema from 0.18.1 to 0.18.2 by @dependabot in #2127
- build(deps): bump jsonschema from 0.18.2 to 0.18.3 by @dependabot in #2134
- build(deps): bump jsonschema from 0.18.3 to 0.19.1 by @dependabot in #2144
- build(deps): bump jsonschema from 0.19.1 to 0.20.0 by @dependabot in #2152
- build(deps): bump pyo3 from 0.22.2 to 0.22.3 by @dependabot in #2143
- build(deps): bump rfd from 0.14.1 to 0.15.0 by @dependabot in #2151
- build(deps): bump simple-expand-tilde from 0.4.0 to 0.4.2 by @dependabot in #2129
- build(deps): bump qsv_currency from 0.6.0 to 0.7.0 by @dependabot in #2159
- build(deps): bump qsv_docopt from 1.7.0 to 1.8.0 by @dependabot in #2136
- build(deps): bump redis from 0.26.1 to 0.27.0 by @dependabot in #2133
- build(deps): bump simdutf8 from 0.1.4 to 0.1.5 by @dependabot in #2164
- bump indirect dependencies
- apply select clippy lint suggestions
- several usage text/documentation improvements
- bump MSRV to 1.81.0
Fixed
- validate: correct fail_validation_error! macro; reformat error messages to use hyphens as the JSONschema error message already starts with "error:" 9a25524
- moved --help output from stderr to stdout as per GNU CLI guidelines #2138
- lens: fixed parsing of lens options 1cdd1bc
- searchset: fixed usage text for <regexset-file> 9a60fb0
- used patched forks of arrow, csvlens and xlsxwriter crates that replace a dependency on an old version of lexical-core with known soundness issues - https://rustsec.org/advisories/RUSTSEC-2023-0086. Once those crates have updated their lexical-core dependency, we will revert to the original crates.
Removed
- removed prompt command from qsvlite #2163
- publish: remove lens feature from i686 targets as it does not compile 959ca76
- deps: remove anyhow dependency #2150
Full Changelog: 0.134.0...0.135.0
0.134.0
qsv pro v1 is here!
If you've been using qsv for a while, even if you're a command-line ninja, you'll find a lot of new capabilities in qsv pro that can make your data wrangling experience even better!
Apart from making qsv easier to use, qsv pro has a multitude of features including: view interactive data tables; browse stats/frequency/metadata; run recipes and tools (scripts); run Polars SQL queries; use Natural Language queries (using Retrieval Augmented Generation (RAG) techniques); regular expression search; export to multiple file formats; download/upload from/to compatible CKAN instances; design custom node-based flows and data pipelines; interact with a local API from external programs including the qsv pro command; run various qsv commands in a graphical user interface; and the list goes on!
And that's just the beginning, there's more to come! You just have to try it!
Download qsv pro v1 now at qsvpro.dathere.com.
Other highlights include:
- pro: new command to allow qsv to interact with the qsv pro API to tap into qsv pro exclusive features.
- lens: new command to interactively view CSVs using the csvlens crate.
- The ludicrously fast diff command is now easier to use with its --drop-equal-fields option. @janriemer continues to work on his csv-diff crate, and there are more diff UX improvements coming soon!
- stats adds sum_length and avg_length "streaming" statistics in addition to the existing min_length and max_length metrics. These are especially useful for datasets with a lot of "free text" columns.
- stats also got "smarter" and "faster" by dog-fooding its own statistics to make it run faster!
  It's a little complicated, but the way stats works is that it compiles the "streaming" statistics on the fly first as it multiplex loads the data across several threads, and the more expensive advanced statistics are "lazily" computed at the end.
  Since we now compile "sort order" in a streaming manner, we use this info when deriving cardinality at the end to see if we can skip sorting - an otherwise necessary step to get cardinality, which is done by "scanning" all the sorted values of a column. Every time two neighboring values differ in a sorted column, it increments the cardinality count.
  Apart from this "sort order" optimization, we also improved the "cardinality scan" algorithm - halving its memory footprint and making it faster still for larger datasets by parallelizing the computation. This, in turn, makes the frequency command faster and more memory efficient.
  It's performance tweaks like these that, despite the addition of six metrics (is_ascii, sort_order, sum_length, avg_length, sem - standard error of the mean & cv - coefficient of variation) in recent releases, let stats still compile 35 statistics and do GUARANTEED data type inferences of a million row, 41 column, 520 MB sample of NYC's 311 data in 1.327 seconds (753,580 records per second)!¹
- we now also use our own fork of the csv crate, featuring SIMD-accelerated UTF-8 validation and other minor perf tweaks, making the entire qsv suite faster still!
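The "sort order" shortcut and the neighbor-comparison "cardinality scan" described above can be sketched roughly as follows. This is an illustrative Python sketch, not qsv's Rust implementation. The function names are hypothetical, the sortedness check is done as a separate pass here rather than tracked while streaming, only ascending order is recognized, and the parallel variant is omitted.

```python
def is_sorted(values):
    """Return True if the column is already in ascending order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def cardinality(values):
    """Count distinct values by comparing neighbors in sorted order."""
    if not values:
        return 0
    # Skip the sort entirely when the column is already sorted.
    ordered = values if is_sorted(values) else sorted(values)
    count = 1  # the first value always starts a new run
    for prev, cur in zip(ordered, ordered[1:]):
        if prev != cur:  # a change between neighbors means a new distinct value
            count += 1
    return count


already_sorted = cardinality(["a", "b", "b", "c"])
needs_sorting = cardinality(["c", "a", "b", "a"])
print(already_sorted, needs_sorting)
```

The saving comes from the first branch: when the sort order is already known, the scan runs directly over the column and the sort is never paid for.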
Added
- pro: add qsv pro command to interact with qsv pro API by @rzmk in #2039
- lens: new command to interactively view CSVs using the csvlens crate #2117
- apply: add crc32 operation #2121
- count: add --delimiter option #2120
- diff: add flag --drop-equal-fields by @janriemer in #2114
- stats: add sum_length and avg_length columns #2113
- stats: smarter cardinality computation - added new parallel algorithm for large datasets (10,000+ rows) and updated sequential algorithm for smaller datasets 4e63fec
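As a rough illustration of why sum_length and avg_length count as "streaming" statistics, the sketch below keeps a handful of running values per column and never stores the column itself. It is a Python approximation under stated assumptions, not qsv's code: the LengthStats class is hypothetical, and it measures length in characters via len(), which may differ from how qsv measures length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LengthStats:
    """Running length statistics for one column (hypothetical helper)."""
    count: int = 0
    sum_length: int = 0
    min_length: Optional[int] = None
    max_length: Optional[int] = None

    def add(self, value: str) -> None:
        """Fold one value into the running totals in constant time."""
        n = len(value)
        self.count += 1
        self.sum_length += n
        self.min_length = n if self.min_length is None else min(self.min_length, n)
        self.max_length = n if self.max_length is None else max(self.max_length, n)

    @property
    def avg_length(self) -> Optional[float]:
        """Average length, or None when no values have been seen."""
        return self.sum_length / self.count if self.count else None


stats = LengthStats()
for value in ["a", "hello", "qsv"]:
    stats.add(value)
print(stats.sum_length, stats.avg_length, stats.min_length, stats.max_length)
```

Each value is folded in once and then discarded, so memory use stays constant regardless of the number of rows.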
Changed
- count: added comment to justify magic number 5241e39
- stats: use simdjson for faster JSONL parsing; micro-optimize compute hot loop 0e8b734
- stats: standardized OVERFLOW and UNDERFLOW messages 38c6128
- sort: renamed symbol to eliminate devskim lint false positive warning 12db739
- enable lens feature in GH workflows #2122
- deps: bump polars 0.42.0 to latest upstream at time of release 3c17ed1
- deps: use our own optimized fork of csv crate, with simdutf8 validation and other minor perf tweaks e4bcd71
- build(deps): bump serde from 1.0.209 to 1.0.210 by @dependabot in #2111
- build(deps): bump serde_json from 1.0.127 to 1.0.128 by @dependabot in #2106
- build(deps): bump qsv-stats from 0.19.0 to 0.22.0 #2107 #2112 cb1eb60
- apply select clippy lint suggestions
- updated several indirect dependencies
- made various doc and usage text improvements
Fixed
- schema: Print an error if the qsv stats invocation fails by @abrauchli in #2110
New Contributors
- @abrauchli made their first contribution in #2110
Full Changelog: 0.133.1...0.134.0