Skip to content

A cross-platform, efficient and practical CSV/TSV toolkit in Golang (a fork with our "robusta" customizations)

License

Notifications You must be signed in to change notification settings

RobustaStudio/csvtk

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

csvtk - a cross-platform, efficient and practical CSV/TSV toolkit

Introduction

Similar to FASTA/Q format in field of Bioinformatics, CSV/TSV formats are basic and ubiquitous file formats in both Bioinformatics and data sicence.

People usually use spreadsheet softwares like MS Excel to do process table data. However it's all by clicking and typing, which is not automatically and time-consuming to repeat, especially when we want to apply similar operations with different datasets or purposes.

You can also accomplish some CSV/TSV manipulations using shell commands, but more codes are needed to handle the header line. Shell commands do not support selecting columns with column names either.

csvtk is convenient for rapid data investigation and also easy to be integrated into analysis pipelines. It could save you much time of writing Python/R scripts.

With the latest updates (RobustaVersion), you can use javascript to hack/workaround csv files!

Robusta Notes & Changes

  • Notes:

    • This repo is a fork of the main repo with some our fixes.
  • Updates

    • 11-Oct-2018
      • Forced the parser to ignore the malformed csv line(s)
      • Added a new sub-command replace2, so we can replace using custom javascript function on the fly ;)

Table of Contents

Features

  • Cross-platform (Linux/Windows/Mac OS X/OpenBSD/FreeBSD)
  • Light weight and out-of-the-box, no dependencies, no compilation, no configuration
  • Fast, multiple-CPUs supported
  • Practical functions supported by N subcommands
  • Support STDIN and gziped input/output file, easy being used in pipe
  • Most of the subcommands support unselecting fields and fuzzy fields, e.g. -f "-id,-name" for all fields except "id" and "name", -F -f "a.*" for all fields with prefix "a.".
  • Support some common plots (see usage)
  • Seamlessly support for data with meta line (e.g., sep=,) of separator declaration used by MS Excel

Subcommands

32 subcommands in total.

Information

  • headers print headers
  • stats summary of CSV file
  • stats2 summary of selected digital fields

Format conversion

  • pretty convert CSV to readable aligned table
  • csv2tab convert CSV to tabular format
  • tab2csv convert tabular format to CSV
  • space2tab convert space delimited format to CSV
  • transpose transpose CSV data
  • csv2md convert CSV to markdown format
  • xlsx2csv convert XLSX to CSV format

Set operations

  • head print first N records
  • concat concatenate CSV/TSV files by rows
  • sample sampling by proportion
  • cut select parts of fields
  • grep grep data by selected fields with patterns/regular expressions
  • uniq unique data without sorting
  • freq frequencies of selected fields
  • inter intersection of multiple files
  • filter filter rows by values of selected fields with arithmetic expression
  • filter2 filter rows by awk-like arithmetic/string expressions
  • join join multiple CSV files by selected fields
  • split split CSV/TSV into multiple files according to column values
  • splitxlsx split XLSX sheet into multiple sheets according to column values
  • collapse collapse one field with selected fields as keys

Edit

  • rename rename column names
  • rename2 rename column names by regular expression
  • replace replace data of selected fields by regular expression
  • replace2 replace data of selected fields using javascript function.
  • mutate create new columns from selected fields by regular expression
  • mutate2 create new column from selected fields by awk-like arithmetic/string expressions
  • gather gather columns into key-value pairs

Ordering

  • sort sort by selected fields

Ploting

  • plot see usage
    • plot hist histogram
    • plot box boxplot
    • plot line line plot and scatter plot

Misc

  • version print version information and check for update
  • genautocomplete generate shell autocompletion script

Installation

Download Page

csvtk is implemented in Go programming language, executable binary files for most popular operating systems are freely available in release page.

Method 1: Download binaries

Just download compressed executable file of your operating system, and decompress it with tar -zxvf *.tar.gz command or other tools. And then:

  1. For Linux-like systems

    1. If you have root privilege simply copy it to /usr/local/bin:

       sudo cp csvtk /usr/local/bin/
      
    2. Or add the current directory of the executable file to environment variable PATH:

       echo export PATH=\$PATH:\"$(pwd)\" >> ~/.bashrc
       source ~/.bashrc
      
  2. For windows, just copy csvtk.exe to C:\WINDOWS\system32.

Method 2: Install via conda Install-with-conda Anaconda Cloud downloads

conda install -c bioconda csvtk

Method 3: For Go developer

go get -u github.com/shenwei356/csvtk/csvtk

Method 4: For ArchLinux AUR users

yaourt -S csvtk

Bash-completion

Note: The current version supports Bash only. This should work for *nix systems with Bash installed.

Howto:

  1. run: csvtk genautocomplete

  2. create and edit ~/.bash_completion file if you don't have it.

     nano ~/.bash_completion
    

    add the following:

     for bcfile in ~/.bash_completion.d/* ; do
       . $bcfile
     done
    

Compared to csvkit

csvkit

Features csvtk csvkit Note
Read Gzip Yes Yes read gzip files
Fields ranges Yes Yes e.g. -f 1-4,6
Unselect fileds Yes -- e.g. -1 for excluding first column
Fuzzy fields Yes -- e.g. ab* for columns with name prefix "ab"
Reorder fields Yes Yes it means -f 1,2 is different from -f 2,1
Rename columns Yes -- rename with new name(s) or from existed names
Sort by multiple keys Yes Yes bash sort like operations
Sort by number Yes -- e.g. -k 1:n
Multiple sort Yes -- e.g. -k 2:r -k 1:nr
Pretty output Yes Yes convert CSV to readable aligned table
Unique data Yes -- unique data of selected fields
frequency Yes -- frequencies of selected fields
Sampling Yes -- sampling by proportion
Mutate fields Yes -- create new columns from selected fields
Repalce Yes -- replace data of selected fields

Similar tools:

  • csvkit - A suite of utilities for converting to and working with CSV, the king of tabular file formats. http://csvkit.rtfd.org/
  • xsv - A fast CSV toolkit written in Rust.
  • miller - Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV and tabular JSON http://johnkerl.org/miller
  • tsv-utils-dlang - Command line utilities for tab-separated value files written in the D programming language.

Examples

More examples and tutorial.

Attention

  1. The CSV parser requires all the lines have same number of fields/columns. Even lines with spaces will cause error.
  2. By default, csvtk thinks your files have header row, if not, switch flag -H on.
  3. Column names better be unique.
  4. By default, lines starting with # will be ignored, if the header row starts with #, please assign flag -C another rare symbol, e.g. '$'.
  5. By default, csvtk handles CSV files, use flag -t for tab-delimited files.
  6. If " exists in tab-delimited files, use flag -l.

Examples

  1. Pretty result

     $ csvtk pretty names.csv
     id   first_name   last_name   username
     11   Rob          Pike        rob
     2    Ken          Thompson    ken
     4    Robert       Griesemer   gri
     1    Robert       Thompson    abc
     NA   Robert       Abel        123
    
  2. Summary of selected digital fields: num, sum, min, max, mean, stdev (stat2)

     $ cat digitals.tsv
     4       5       6
     1       2       3
     7       8       0
     8       1,000   4
    
     $ cat digitals.tsv | csvtk stat2 -t -H -f 1-3
     field   num     sum   min     max     mean    stdev
     1         4      20     1       8        5     3.16
     2         4   1,015     2   1,000   253.75   497.51
     3         4      13     0       6     3.25      2.5
    
  3. Select fields/columns (cut)

    • By index: csvtk cut -f 1,2
    • By names: csvtk cut -f first_name,username
    • Unselect: csvtk cut -f -1,-2 or csvtk cut -f -first_name
    • Fuzzy fields: csvtk cut -F -f "*_name,username"
    • Field ranges: csvtk cut -f 2-4 for column 2,3,4 or csvtk cut -f -3--1 for discarding column 1,2,3
    • All fields: csvtk cut -F -f "*"
  4. Search by selected fields (grep) (matched parts will be highlighted as red)

    • By exactly matching: csvtk grep -f first_name -p Robert -p Rob
    • By regular expression: csvtk grep -f first_name -r -p Rob
    • By pattern list: csvtk grep -f first_name -P name_list.txt
    • Remore rows containing missing data (NA): csvtk grep -F -f "*" -r -p "^$" -v
  5. Rename column names (rename and rename2)

    • Setting new names: csvtk rename -f A,B -n a,b or csvtk rename -f 1-3 -n a,b,c
    • Replacing with original names by regular express: cat ../testdata/c.csv | ./csvtk rename2 -F -f "*" -p "(.*)" -r 'prefix_$1' for adding prefix to all column names.
  6. Edit data with regular expression (replace)

    • Remove Chinese charactors: csvtk replace -F -f "*_name" -p "\p{Han}+" -r ""
  7. Create new column from selected fields by regular expression (mutate)

    • In default, copy a column: csvtk mutate -f id
    • Extract prefix of data as group name (get "A" from "A.1" as group name): csvtk mutate -f sample -n group -p "^(.+?)\."
  8. Sort by multiple keys (sort)

    • By single column : csvtk sort -k 1 or csvtk sort -k last_name
    • By multiple columns: csvtk sort -k 1,2 or csvtk sort -k 1 -k 2 or csvtk sort -k last_name,age
    • Sort by number: csvtk sort -k 1:n or csvtk sort -k 1:nr for reverse number
    • Complex sort: csvtk sort -k region -k age:n -k id:nr
  9. Join multiple files by keys (join)

    • All files have same key column: csvtk join -f id file1.csv file2.csv
    • Files have different key columns: csvtk join -f "username;username;name" names.csv phone.csv adress.csv -k
  10. Filter by numbers (filter)

    • Single field: csvtk filter -f "id>0"
    • Multiple fields: csvtk filter -f "1-3>0"
    • Using --any to print record if any of the field satisfy the condition: csvtk filter -f "1-3>0" --any
    • fuzzy fields: csvtk filter -F -f "A*!=0"
  11. Filter rows by awk-like arithmetic/string expressions (filter2)

    • Using field index: csvtk filter2 -f '$3>0'
    • Using column names: csvtk filter2 -f '$id > 0'
    • Both arithmetic and string expressions: csvtk filter2 -f '$id > 3 || $username=="ken"'
    • More complicated: csvtk filter2 -H -t -f '$1 > 2 && $2 % 2 == 0'
  12. Ploting

    • plot histogram with data of the second column: csvtk -t plot hist testdata/grouped_data.tsv.gz -f 2 | display histogram.png
    • plot boxplot with data of the "GC Content" (third) column, group information is the "Group" column. csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" -f "GC Content" --width 3 | display boxplot.png
    • plot horiz boxplot with data of the "Length" (second) column, group information is the "Group" column. csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" -f "Length" --height 3 --width 5 --horiz --title "Horiz box plot" | display boxplot2.png
    • plot line plot with X-Y data csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group | display lineplot.png
    • plot scatter plot with X-Y data csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group --scatter | display scatter.png

Acknowledgements

We are grateful to Zhiluo Deng and Li Peng for suggesting features and reporting bugs.

Contact

create an issue to report bugs, propose new functions or ask for help.

Or leave a comment.

License

MIT License

About

A cross-platform, efficient and practical CSV/TSV toolkit in Golang (a fork with our "robusta" customizations)

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Go 98.2%
  • Shell 1.8%