feature request: csvtk split by number of lines per chunk #122

Open
avilella opened this issue Jan 19, 2021 · 2 comments

Comments

@avilella

This is a feature request for the csvtk split command to have an additional --nlines option, so that it behaves similarly to GNU coreutils split --lines (https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html) but handles the header row properly by repeating it in every chunk.

E.g. we have a file with 5 entries:
a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8

We would run csvtk split --nlines 2, which would produce chunks of at most 2 entries per file, each starting with the header:
##file1
a,b,c,d
1,2,3,4
2,3,4,5
##file2
a,b,c,d
3,4,5,6
4,5,6,7
##file3
a,b,c,d
5,6,7,8
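
For reference, the requested behavior can be sketched with plain awk. This is only an illustration of the expected output, not a csvtk feature: the chunk-N.csv file names are made up for the example, and the script assumes that no quoted field contains an embedded newline.

```shell
# Sketch only: split a.csv into files of at most n data rows, repeating the header.
# Assumes one record per line; the chunk-N.csv names are illustrative.
workdir=$(mktemp -d)
cd "$workdir"
printf 'a,b,c,d\n1,2,3,4\n2,3,4,5\n3,4,5,6\n4,5,6,7\n5,6,7,8\n' > a.csv

n=2
awk -v n="$n" '
    NR == 1 { header = $0; next }          # remember the header line
    (NR - 2) % n == 0 {                    # first data row of a new chunk
        if (out != "") close(out)          # finish the previous chunk file
        out = sprintf("chunk-%d.csv", ++k)
        print header > out                 # every chunk starts with the header
    }
    { print > out }                        # append the data row to the current chunk
' a.csv

head chunk-*.csv
```

A CSV-aware tool is still preferable for real data, since awk splits on physical lines rather than on CSV records.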

Thanks in advance

@san-r

san-r commented Jul 25, 2021

I need this feature when working with very large CSV files, which I usually keep compressed with gzip or zstd (the latter decompresses significantly faster). For the moment, I use xsv from https://github.com/BurntSushi/xsv, which does exactly what has been asked above. However, it only outputs uncompressed CSV chunks, and I haven't figured out a way to have it write chunks compressed with gzip or zstd. This feature would be a very useful addition to csvtk.
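
One possible way to get compressed chunks outside of csvtk and xsv is the --filter option of GNU split, which pipes each chunk through a command and exposes the output name as $FILE. The following is a rough sketch under the assumptions that GNU coreutils and gzip are installed and that no quoted field contains a newline; the chunk- prefix and the sample data are illustrative.

```shell
# Sketch only: requires GNU split (for --filter) and gzip.
workdir=$(mktemp -d)
cd "$workdir"
printf 'a,b,c,d\n1,2,3,4\n2,3,4,5\n3,4,5,6\n4,5,6,7\n5,6,7,8\n' | gzip > a.csv.gz

# Read the header once and export it so the filter's shell can see it.
header=$(gzip -dc a.csv.gz | head -n 1)
export header

# Stream the data rows (everything after the header) into split.
# For each group of 2 lines, split runs the filter with $FILE set to
# chunk-00, chunk-01, ...; the filter prepends the header and compresses.
gzip -dc a.csv.gz | tail -n +2 |
    split -l 2 -d --filter='{ printf "%s\n" "$header"; cat; } | gzip > "$FILE.csv.gz"' - chunk-

gzip -dc chunk-00.csv.gz
```

Replacing gzip in the filter with another stream compressor, e.g. zstd -c, should work the same way, since the filter only needs to read stdin and write stdout.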

@zhanxw

zhanxw commented Sep 28, 2024

Here is a workaround.

For example, to put 2 data rows in each output file:

Input file

$ cat a.csv
a,b,c,d
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8

Command

$ cat a.csv |csvtk grep -p '.*' -r -n -N |csvtk mutate2 -n chunk0 -e '($row+1)/2' |csvtk mutate -f chunk0 -n chunk -p '([0-9]+).*$'  |csvtk cut -f -row,-chunk0 |csvtk split -f chunk -o split
$ sed -i 's/,\(chunk\|[0-9]\+\)$//' split/stdin*.csv

You can check the outputs:

$ cat split/stdin-1.csv
a,b,c,d
1,2,3,4
2,3,4,5
$ cat split/stdin-2.csv
a,b,c,d
3,4,5,6
4,5,6,7
$ cat split/stdin-3.csv
a,b,c,d
5,6,7,8

To use a different chunk size n, change '($row+1)/2' to '($row+n-1)/n', where n is a positive integer, e.g. '($row+2)/3' for 3 rows per file.
The idea is to:

  1. csvtk grep -p '.*' -r -n -N : add a column row with row numbers
  2. csvtk mutate2 -n chunk0 -e '($row+1)/2' : obtain the chunk as a float number
  3. csvtk mutate -f chunk0 -n chunk -p '([0-9]+).*$' : keep only the integer part of chunk0, i.e. floor($chunk0), which equals ceiling($row/2).
  4. csvtk cut -f -row,-chunk0 : remove the extra row and chunk0 columns.
  5. csvtk split -f chunk -o split : split the input file.
    As all split files still have the extra chunk column, we remove it afterwards using sed -i 's/,\(chunk\|[0-9]\+\)$//' split/stdin*.csv (the \| and \+ operators and the bare -i flag assume GNU sed).
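
The chunk arithmetic in steps 2 and 3 can be checked without csvtk. The sketch below uses shell integer division, which truncates just like the regular expression in step 3; chunk_of is a hypothetical helper name used only for this illustration.

```shell
# Sketch only: chunk number for a 1-based data row and a chunk size n,
# computed as floor((row + n - 1) / n), which equals ceiling(row / n).
chunk_of() {
    row=$1
    n=$2
    echo $(( (row + n - 1) / n ))
}

# Show the assignment for the 5-row example with 2 rows per file.
for row in 1 2 3 4 5; do
    echo "row=$row chunk=$(chunk_of "$row" 2)"
done
```

With n=2, rows 1 and 2 land in chunk 1, rows 3 and 4 in chunk 2, and row 5 in chunk 3, matching the three output files shown above.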
