A simple command-line tool to see the difference between two CSV files.
This tool reports in the following style, and you can choose how to report.
- Report the number of differences and line numbers
- Report diff marks along with the contents of each CSV line
- You can choose the following report styles
- Horizontal (Side-by-side) display style
- Vertical display style
- You can choose to report only the lines with differences or all lines
- You can choose the following report styles
๐ด DEMO
- Why csv-diff?
- Feature
- Requirements
- Installation
- Run
- How to use
- Notices
- Known Issues
- Contributing
- License
The diff
command that compares files is unaware of key columns (like primary keys in a database).
Therefore, it may give undesired results in detecting differences in CSV files that have key columns.
For example, consider comparing the contents of tables in two databases that are inaccessible to each other.
One way is to output each database's data as a CSV file and compare it.
In this case, the diff
command does not pay attention to the key columns, so lines with different keys may be compared.
It is not possible to make an accurate judgment of the difference with the key in mind.
This tool, on the other hand, recognizes key columns and detects differences. Specify the key columns as an argument at the time of execution. You can get the comparison result you want.
- CSV delimiter, line feed character, presence/absence of header, etc. are automatically determined (can be specified)
- Make a comparison after matching with the key columns
- You can specify columns that are not compared
- Differences can be displayed side-by-side (more suitable when the number of columns is small)
- Differences can be displayed in vertical order (more suitable when the number of columns is large)
- Differences are indicated by the following marks, which we call DIFF-MARK
!
: There is a difference<
: Exists only on the left side>
: Exists only on the right side
- It is also possible to display only the number of differences and the line number with the difference
- It is possible to compare one file with commas and one file with tabs
- Low memory consumption
- Only Python standard modules are used and provided as a single file, so it is easy to install even on an isolated environment
-
Python2.7.18 or later
If you want to use it with Python3, please use csv-diff-python3.
- Must be sorted by key columns
pip install git+https://github.com/blue-monk/csv-diff-python2
It may be safer to install it on a virtual environment created with virtualenv.
Place csvdiff.py
in any directory on the machine where Python 2.7 is installed.
It will be easier to use if you place it in a directory defined on PATH.
$ csvdiff2 -h
$ python csvdiff.py -h
or
$ chmod +x csvdiff.py
$ ./csvdiff.py -h
See the Wiki for more details.
$ ./csvdiff.py -h
Here is one example with the following sample data in appendix/csv_samples/
.
See the Wiki for more details.
Suppose the keys are the 0th column and the 2nd column.
-
sample_lhs.csv
head1, head2, head3, head4, head5 key1-2, value1-2, key2-2, value2-2, 20201224T035908 key1-3, value1-3, key2-3, value2-3, 20201224T180527 key1-4, value1-4, key2-4, value2-4, 20201225T104851 key1-5, value1-5, key2-5, value2-5, 20201225T142142
-
sample_rhs.csv
head1, head2, head3, head4, head5 key1-1, value1-1, key2-1, value2-1, 20210108T142358 key1-2, value1-3, key2-2, value2-z, 20210108T174216 key1-4, value1-4, key2-4, value2-4, 20210109T090245 key1-5, value1-v, key2-5, value2-5, 20210109T111231
To view the contents of different lines, Use the -d
(--show-difference-only
) option.
If you also want to see the number of differences, put the -c
option (--show-count
).
$ ../../src/csvdiff2/csvdiff.py sample_lhs.csv sample_rhs.csv -k 0,2 -dc
============ Report ============
* Differences
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_lhs.csv sample_rhs.csv Column indices with difference
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 2 ['key1-1', 'value1-1', 'key2-1', 'value2-1', '20210108T142358']
2 ['key1-2', 'value1-2', 'key2-2', 'value2-2', '20201224T035908'] ! 3 ['key1-2', 'value1-3', 'key2-2', 'value2-z', '20210108T174216'] @ [1, 3, 4]
3 ['key1-3', 'value1-3', 'key2-3', 'value2-3', '20201224T180527'] <
4 ['key1-4', 'value1-4', 'key2-4', 'value2-4', '20201225T104851'] ! 4 ['key1-4', 'value1-4', 'key2-4', 'value2-4', '20210109T090245'] @ [4]
5 ['key1-5', 'value1-5', 'key2-5', 'value2-5', '20201225T142142'] ! 5 ['key1-5', 'value1-v', 'key2-5', 'value2-5', '20210109T111231'] @ [1, 4]
* Count & Row number
same lines : 0
left side only (<): 1 :-- Row Numbers -->: [3]
right side only (>): 1 :-- Row Numbers -->: [2]
with differences (!): 3 :-- Row Number Pairs -->: [(2, 3), (4, 4), (5, 5)]
-
Differences are indicated by the following DIFF-MARKs
!
: There is a difference<
: Exists only on the left side>
: Exists only on the right side
-
The number displayed before each CSV line data is the line number of the actual file
- line number is 1 based
-
For rows with differences, the column indices with differences will be displayed after
@
- column index is 0 based
-
For large amounts of data
In the case of a horizontal report,
it takes longer than a vertical report because all lines are scanned in advance to collect information for report formatting.
For large amounts of data, consider vertical reports.
-
Problems with reporting Japanese characters (and similar multibyte characters, maybe)
For example,
in the case of a UTF-8 CSV file containing Japanese,
the Japanese part will be displayed in UTF-8 byte string representation in the report.
For now, I'm not sure how to handle Japanese in Python2.
However, it seems that the difference judgment is performed without any problem. -
Workaround for only one line
If the CSV file contains only one line, it will be recognized as a header.
You need to specify the option-H n
to be recognized as CSV without a header.
Python 2.7 reached the end of its life on January 1st, 2020.
We don't intend to do any maintenance on the python2 version.
Consider using csv-diff-python3.
However, if it is a minor bug, it may be addressed, so please raise it in the issue.
csv-diff-python2 is released under the MIT license. Please read the LICENSE file for more information.