feature request: csvtk split by number of lines per chunk #122

avilella · 2021-01-19T16:40:29Z

This is a feature request for the csvtk split command to have an additional --nlines option, so that it behaves like GNU split --lines (https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html) but handles the header row in a nice way.

E.g. we have a file with 5 entries:
a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8

We run csvtk split --nlines 2, which produces chunks of 2 entries each, every chunk starting with the header:
##file1
a,b,c,d
1,2,3,4
2,3,4,5
##file2
a,b,c,d
3,4,5,6
4,5,6,7
##file3
a,b,c,d
5,6,7,8
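
Until such an option exists, the requested behaviour can be approximated with plain awk. This is only a sketch, not part of csvtk: it assumes no quoted field contains an embedded newline (awk reads physical lines), and the names a.csv and chunk-<k>.csv are illustrative.

```shell
#!/bin/sh
# Sketch: split a CSV into chunks of n data rows, repeating the header in each.
# Assumption: no quoted field spans several lines.

# Recreate the example input from above.
cat > a.csv <<'EOF'
a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8
EOF

awk -v n=2 '
    NR == 1 { header = $0; next }      # remember the header line
    (NR - 2) % n == 0 {                # first data row of a new chunk
        if (out != "") close(out)      # finish the previous chunk file
        out = sprintf("chunk-%d.csv", ++k)
        print header > out             # every chunk starts with the header
    }
    { print > out }                    # append the data row to the current chunk
' a.csv
```

Changing -v n=2 sets the number of data rows per chunk; the last chunk simply holds whatever rows remain.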

Thanks in advance

san-r · 2021-07-25T09:36:33Z

I need this feature when working with very large CSV files, which I usually keep compressed with gzip or zstd (which decompresses significantly faster). For the moment, I use xsv (https://github.com/BurntSushi/xsv), which does exactly what is asked above. However, it only outputs uncompressed CSV chunks, and I haven't figured out a way to get chunks compressed with gzip or zstd. This feature would be a very useful addition to csvtk.
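
One possible way to get compressed chunks with the header repeated, without csvtk or xsv, is GNU split's --filter option, which pipes each chunk through a shell command. The sketch below assumes GNU coreutils and gzip are installed and that no field contains an embedded newline; big.csv.gz and the part- prefix are made-up names for illustration.

```shell
#!/bin/sh
# Sketch: header-aware splitting of a gzip-compressed CSV into gzip-compressed chunks.
# Assumptions: GNU split (for --filter), gzip, no multi-line quoted fields.

# Build a small compressed input for demonstration.
printf 'a,b,c,d\n1,2,3,4\n2,3,4,5\n3,4,5,6\n4,5,6,7\n5,6,7,8\n' | gzip > big.csv.gz

# Read the header once and export it so the shell started by split can see it.
HEADER=$(gzip -dc big.csv.gz | head -n 1)
export HEADER

# split runs the filter once per chunk and sets $FILE to the chunk name.
# SHELL=/bin/sh makes sure the filter is interpreted by a POSIX shell.
gzip -dc big.csv.gz | tail -n +2 |
    SHELL=/bin/sh split -l 2 -d \
        --filter='{ printf "%s\n" "$HEADER"; cat; } | gzip > "$FILE.csv.gz"' \
        - part-
```

This writes part-00.csv.gz, part-01.csv.gz and part-02.csv.gz. Swapping gzip for zstd in the filter (and in the decompression step) should work the same way, though that variant is untested here.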

zhanxw · 2024-09-28T05:17:23Z

Here is a workaround.

E.g. put 2 lines per output file.

Input file

$ cat a.csv
a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8

Command

$ cat a.csv | csvtk grep -p '.*' -r -n -N | csvtk mutate2 -n chunk0 -e '($row+1)/2' | csvtk mutate -f chunk0 -n chunk -p '([0-9]+).*$' | csvtk cut -f -row,-chunk0 | csvtk split -f chunk -o split
$ sed -i 's/,\(chunk\|[0-9]\+\)$//' split/stdin*.csv

You can check the outputs:

$ cat split/stdin-1.csv
a,b,c,d
1,2,3,4
2,3,4,5
$ cat split/stdin-2.csv
a,b,c,d
3,4,5,6
4,5,6,7
$ cat split/stdin-3.csv
a,b,c,d
5,6,7,8

To put N lines in each output file, change '($row+1)/2' to '($row+N-1)/N', e.g. '($row+2)/3' for 3 lines per file.
The idea is to:

csvtk grep -p '.*' -r -n -N : add a column row with row numbers.
csvtk mutate2 -n chunk0 -e '($row+1)/2' : compute the chunk as a floating-point number.
csvtk mutate -f chunk0 -n chunk -p '([0-9]+).*$' : keep only the integer part of chunk0 (its floor), which equals ceiling($row/2).
csvtk cut -f -row,-chunk0 : remove the extra row and chunk0 columns.
csvtk split -f chunk -o split : split the input by the chunk column.
As all split files still have an extra chunk column, we remove it using sed -i 's/,\(chunk\|[0-9]\+\)$//' split/stdin*.csv
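
The chunk arithmetic can be checked in isolation. The helper below is a hypothetical illustration (chunk_of is not a csvtk command); it relies on shell integer division truncating toward zero, which plays the role of the regex that keeps the integer part of chunk0.

```shell
#!/bin/sh
# Sketch: chunk index of a row when packing n rows per chunk.
# chunk_of ROW N prints floor((ROW + N - 1) / N), which equals ceiling(ROW / N).
chunk_of() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Rows 1..5 with 2 rows per chunk map to chunks 1 1 2 2 3.
for row in 1 2 3 4 5; do
    echo "row=$row chunk=$(chunk_of "$row" 2)"
done
```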

shenwei356 added the new feature label Jan 19, 2021
