Releases: eBay/tsv-utils
v2.2.1
v2.2.0 Release: Line buffering; New tsv-filter features (--count, --label)
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.2.0/tsv-utils-v2.2.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.2.0/tsv-utils-v2.2.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the `ReleasePackageReadme.txt` file in the release package.
To be notified of new releases:
GitHub supports notification of new releases. Click the "Watch" button on the repository page and select "Releases Only".
Release 2.2.0 Changes:
- `tsv-filter`: New feature, count matches rather than filtering (`--c|count`). This option causes the number of matching lines to be printed rather than the individual matching lines.
- `tsv-filter`: New feature, marking records rather than filtering (`--label`). This option causes every record to be marked with an indication of whether it satisfied the test. Marking is done by appending a new field with an indicator value. See PR #338 for details.
- New option: Line buffering, available in most tools (`--line-buffered`). This option causes each line to be read and written as soon as it is available, overriding the default buffering behavior. This is useful when reading from slow input streams. See PR #336 for details.
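To make the difference between filtering, counting, and labeling concrete, here is a rough sketch that emulates the three behaviors with awk instead of calling `tsv-filter` itself. The function names, the sample data, the `pass` header, and the `1`/`0` indicator values are assumptions made for illustration; PR #338 and the tool's online help describe the actual output format.

```shell
# Emulation only: these awk functions mimic, on a TSV stream with a header
# line, what filtering, counting, and labeling mean. They are not tsv-filter.

# Keep the header and the records whose field 2 is greater than 100.
filter_gt() { awk -F'\t' 'NR == 1 || $2 > 100'; }

# Print only the number of matching records (the header is not counted here).
count_gt() { awk -F'\t' 'NR > 1 && $2 > 100 { n++ } END { print n + 0 }'; }

# Keep every record and append an indicator field (hypothetical values 1/0).
label_gt() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { print $0, "pass"; next }   # hypothetical header for the new field
        { print $0, ($2 > 100 ? 1 : 0) }'
}

# Hypothetical sample data: a header line and three records.
sample() { printf 'name\telapsed\nt1\t50\nt2\t150\nt3\t300\n'; }

sample | filter_gt
sample | count_gt
sample | label_gt
```

Counting is handy for quick checks in scripts, while labeling keeps every record so the pass/fail indicator can be used by later steps in a pipeline.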
Other Changes
- Prebuilt binaries have been updated to use LDC compiler version ldc-1.24.0.
- Changes to the LDC build parameters to better support Archlinux and other platforms. See PR #329.
v2.1.2 Minor Release
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.2/tsv-utils-v2.1.2_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.2/tsv-utils-v2.1.2_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the `ReleasePackageReadme.txt` file in the release package.
Release 2.1.2 Changes
- Small performance improvement in several tools by switching from `File.write` to `File.rawWrite`. See PR #316.
- Stopped using LDC option `-disable-fp-elim`. This option is no longer available starting with LDC 1.24.0 (the next version), so the change is required. See PR #316.
Prebuilt binaries have been built using the latest LDC compiler (ldc-1.23.0).
v2.1.1 Minor Release
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.1/tsv-utils-v2.1.1_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.1/tsv-utils-v2.1.1_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the `ReleasePackageReadme.txt` file in the release package.
Release 2.1.1 Changes
- Improved `csv2tsv` buffer utilization. Enables better performance of subsequent tasks in a pipeline due to more frequent writes to standard output (better parallelization). Minor performance benefits to `csv2tsv` by itself. See PR #305.
- Code change to support an upcoming D language change (minor). A tagged release with this change is needed to support `tsv-utils` use in the D Language project tester. See PR #306.
Prebuilt binaries have been built using the latest LDC compiler (ldc-1.23.0).
v2.1.0 Release: csv2tsv updates
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.0/tsv-utils-v2.1.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.0/tsv-utils-v2.1.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the `ReleasePackageReadme.txt` file in the release package.
Release 2.1.0 Changes: csv2tsv
- Performance improvements: `csv2tsv` is significantly faster as a result of switching to a buffer-based conversion algorithm. The `2.1.0` version runs 40-60% faster than the `2.0.0` version in tests on Mac OS, depending on the type of file. See PR #301 for details.
- UTF-8 Byte Order Marks (BOMs) found in CSV input files are discarded when producing TSV output. See PR #302 for details.
- TAB and newline replacement strings can now be specified separately. Previously, only one replacement string was allowed for both newline and TAB characters in the CSV data. Now different replacements can be provided. This uses the new command line arguments `--r|tab-replacement` and `--n|newline-replacement`. See PR #303 for details.
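As a small illustration of the BOM handling described above, the sketch below drops a leading UTF-8 BOM (bytes EF BB BF) from the first line of a stream. It is an awk stand-in written for illustration, not how `csv2tsv` is implemented; the function name `strip_bom` and the sample data are hypothetical.

```shell
# Drop a UTF-8 byte order mark (bytes EF BB BF, octal 357 273 277) if it
# starts the first line; pass all other data through unchanged.
# LC_ALL=C makes awk treat the input as plain bytes.
strip_bom() {
    LC_ALL=C awk 'NR == 1 { sub(/^\357\273\277/, "") } { print }'
}

# A CSV header line preceded by a BOM, followed by one data line.
printf '\357\273\277id,name\n1,alpha\n' | strip_bom
```

Without this step the BOM bytes would end up glued to the first header field, which is why discarding them matters for tools that look up fields by name.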
Other Changes
- Prebuilt binaries have been updated to use the latest LDC compiler (ldc-1.23.0).
v2.0.0 Release: Named Fields
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the `ReleasePackageReadme.txt` file in the release package.
Release 2.0.0 Changes: Named Field Support
Release 2.0.0 adds named field support to all tools in the tsv-utils toolkit. This is a significant usability improvement.
Named fields can be used with any file or data stream that has a header line. Named fields are enabled by the `--H|header` option. Field numbers can be used as well, just as in the prior versions of the toolkit. Glob-style wildcards are supported, and escapes can be used to specify field names containing special characters.
Details are available in the Field Syntax section of the Tools Reference manual.
Examples - Assume a file with the header fields:
1 test_name
2 run
3 elapsed_time
4 user_time
5 system_time
6 max_memory
Commands like the following can be used:
$ # Select individual fields, like 'cut'
$ tsv-select data.tsv -H -f user_time # Field 4
$ tsv-select data.tsv -H -f test_name,user_time # Fields 1,4
$ tsv-select data.tsv -H -f '*_time' # Fields 3,4,5
$ # Filter lines using numeric comparisons against individual fields
$ tsv-filter data.tsv -H --lt elapsed_time:100
$ tsv-filter data.tsv -H --gt elapsed_time:100 --lt system_time:20
$ # Statistical summaries
$ tsv-summarize data.tsv -H --median elapsed_time
$ tsv-summarize data.tsv -H --median '*_time'
$ tsv-summarize data.tsv -H --group-by test_name --median '*_time'
$ # Uniq'ing on a field
$ tsv-uniq data.tsv -H -f test_name
$ # Joins - Assume another file 'test_info.tsv' with 'test_name' and
$ # 'expected_time' fields. A join can be performed using column names.
$ tsv-join -H -f test_info.tsv data.tsv --key-fields test_name --append-fields expected_time
See the reference docs or online help for details on specific tools. There is also documentation in the Tools Overview section of the main project README file.
Named field support addresses enhancement request #25. It was implemented via PRs #284 through #300.
Other Changes
- Prebuilt binaries have been updated to use the latest LDC compiler (ldc-1.22.0).
v1.6.1 Minor Release
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the `ReleasePackageReadme.txt` file in the release package.
Release 1.6.1 Changes:
v1.6.0 Release
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.0/tsv-utils-v1.6.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.0/tsv-utils-v1.6.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the `ReleasePackageReadme.txt` file in the release package.
Release 1.6.0 Changes:
- Prebuilt binaries have been updated to use the latest LDC compiler (1.20.1).
- `tsv-select`: New feature, the ability to exclude fields (PR #267).
  Fields to exclude are specified with the `--e|exclude` option. Examples:
    $ # Drop the first field, keep everything else.
    $ # Equivalent to `cut -f 2- file.tsv`
    $ tsv-select --exclude 1 file.tsv
    $ # Drop fields 3-10, keep everything else
    $ tsv-select --exclude 3-10 file.tsv
  See the tsv-select reference for more information.
- New tool: `tsv-split` (PR #270)
  `tsv-split` is used to split one or more input files into multiple output files. There are three modes of operation:
  - Fixed number of lines per file (`--l|lines-per-file NUM`): Each input block of NUM lines is written to a new file. This is similar to the Unix `split` utility.
  - Random assignment (`--n|num-files NUM`): Each input line is written to a randomly selected output file. Random selection is from NUM files.
  - Random assignment by key (`--n|num-files NUM, --k|key-fields FIELDS`): Input lines are written to output files using fields as a key. Each unique key is randomly assigned to one of NUM output files. All lines with the same key are written to the same file.
  Examples:
    $ # Split a file into files of 10,000 lines each.
    $ tsv-split data.txt --lines-per-file 10000 --dir split_files
    $ # Split a file into 1000 files with lines randomly assigned.
    $ tsv-split data.txt --num-files 1000 --dir split_files
    $ # Randomly assign lines to 1000 files using field 3 as a key.
    $ tsv-split data.tsv --num-files 1000 --key-fields 3 --dir split_files
  See the tsv-split reference for more information.
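The "random assignment by key" mode can be pictured with the following sketch, which emulates the idea in awk rather than using `tsv-split`. Each unique key gets a random bucket the first time it is seen, and that bucket is reused for later lines with the same key. The function name `assign_by_key`, its arguments, and the choice to print a bucket number instead of writing files are assumptions for illustration; the tool's own assignment method may differ.

```shell
# Emulation only: assign each unique key (field number $1) to one of $2
# buckets at random, and reuse that bucket for every later line with the
# same key. Prints "bucket<TAB>line" instead of writing output files.
assign_by_key() {
    awk -F'\t' -v OFS='\t' -v k="$1" -v n="$2" '
        BEGIN { srand() }
        {
            if (!($k in bucket)) bucket[$k] = int(rand() * n)   # first sighting of this key
            print bucket[$k], $0
        }'
}

# Hypothetical data: field 1 is the key; keys "a" and "b" repeat.
printf 'a\t1\nb\t2\na\t3\nc\t4\nb\t5\n' | assign_by_key 1 4
```

Bucket numbers vary from run to run, but lines sharing a key always share a bucket, which is the property that keeps related records together in one output file.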
v1.5.0 Release
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.5.0/tsv-utils-v1.5.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.5.0/tsv-utils-v1.5.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the ReleasePackageReadme.txt
file in the release package.
To be notified of new releases:
GitHub supports notification of new releases. Click the "Watch" button on the repository page and select "Releases Only".
Release 1.5.0 Changes:
- Prebuilt binaries have been updated to use the latest LDC compiler (1.20.0).
- `tsv-filter`: Field list support (PR #259).
  Field lists provide a compact way to specify multiple fields for a command. Most tsv-utils tools already support field lists; now `tsv-filter` does as well. Examples:
    $ # Select lines where fields 1-10 are not empty.
    $ tsv-filter --not-empty 1-10 data.tsv
    $ # Select lines where fields 1-5 and 17 are less than 100
    $ tsv-filter --lt 1-5,17:100 data.tsv
- `tsv-filter`: New field length tests based on either characters or bytes (PR #258).
  The new operators allow filtering on field length. Field length can be measured in either characters or bytes. (Characters can occupy multiple bytes in UTF-8.) Examples:
    $ # Keep only lines where field 3 is less than 50 characters
    $ tsv-filter --char-len-lt 3:50 data.tsv
    $ # Find lines where field 5 is more than 20 bytes
    $ tsv-filter --byte-len-gt 5:20
  Character length tests have names of the form: `--char-len-[eq|ne|lt|le|gt|ge]`. Byte length tests have names of the form: `--byte-len-[eq|ne|lt|le|gt|ge]`.
tsv-filter
: Improved error messages when invalid regular expressions are used.The error message printed by
tsv-filter
now includes the error text provided by the D regular expression engine. This is helpful when trying to debug complex regular expressions. Examples:$ # Old error message (tsv-filter 1.4.4) $ tsv-filter --regex 4:'abc(d|e' data.tsv [tsv-filter] Error processing command line arguments: Invalid values in option: '--regex 4:abc(d|e'. Expected: '--regex <field>:<val>' where <field> is a number and <val> is a regular expression. $ # New error message (tsv-filter 1.5.0) [tsv-filter] Error processing command line arguments: Invalid regular expression: '--regex 4:abc(d|e'. no matching ')' Pattern with error: `abc(d|e` <--HERE-- `` Expected: '--regex <field>:<val>' or '--regex <field-list>:<val>' where <val> is a regular expression.
The formatting of the message can be improved and is likely to be updated in the future.
- `tsv-uniq`: Performance improvements (PRs #234, #235).
  Better memory management and other changes improved `tsv-uniq` performance by 5-35% depending on the operation.
tsv-sample
: Performance improvements reading large data blocks from standard input (PR #238).Sampling and shuffling operations requiring that all data be read into memory were unnecessarily slow when large amounts of data was read from standard input. Performance issues were noticed with data sizes larger than 10 GB. This is now fixed.
- Sample bash scripts included in release package (PR #254).
  Sample versions of the `tsv-sort` and `tsv-sort-fast` scripts described on the Tips and Tricks page are now included in the repository and in prebuilt binary packages.
v1.4.4 Minor Release
Changes:
- New `tsv-sample` option `--i|inorder`
  This option preserves input order when using simple or weighted random sampling. These sampling modes are engaged when a sample size is selected via the `--n|num NUM` option. Documentation was updated to better reflect the distinction between shuffling the full data set and random sampling, which selects a subset of lines. (PR #226)
tsv-summarize
--min
and--max
operators changed to preserve original input stringThe prior behavior of the operators was to read the values to a double, then use numeric formatting to print the recorded double. In some cases this would cause the original input to change, especially if it was a long format number, for example, 16 digits long. (PR #220)
The prior behavior makes sense for calculations like mean and median, but not for min and max. In particular, preserving the original values allows them to be joined with or compared to the original data.
- Prebuilt binaries have been updated to use the latest LDC compiler (1.17.0).
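The `--min` and `--max` change described above can be illustrated with a small awk emulation; it does not use `tsv-summarize`. Both functions compare values numerically, but one prints the stored number (re-formatted by awk's default output format, loosely analogous to the old behavior) while the other prints the original input text. The function names and sample values are hypothetical.

```shell
# Emulation only: find the minimum of field 1 using numeric comparison.

# Prints the stored number; awk re-formats it on output, so long inputs
# may not come back exactly as they were written.
min_reformatted() {
    awk 'NR == 1 || $1 + 0 < min { min = $1 + 0 } END { print min }'
}

# Remembers the original text alongside the number and prints the text.
min_preserved() {
    awk 'NR == 1 || $1 + 0 < min { min = $1 + 0; text = $1 } END { print text }'
}

printf '0.5\n0.10000000000000001\n' | min_reformatted
printf '0.5\n0.10000000000000001\n' | min_preserved
```

Only the second form returns a value that can be matched back against the input with an exact string comparison, which is the motivation given for the change.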
To download and unpack the prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.4/tsv-utils-v1.4.4_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.4/tsv-utils-v1.4.4_osx-x86_64_ldc2.tar.gz | tar xz