Rewrote uniq, frequency and select commands. Updated stats command to support CSV and JSON lines. Added conversion of CSV and JSON lines to Parquet
Ivan Begtin committed Jan 29, 2022
1 parent dcdd9bf commit a129cd3
Showing 16 changed files with 12,304 additions and 266 deletions.
2 changes: 1 addition & 1 deletion .idea/datum.iml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 0 additions & 1 deletion .idea/misc.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 5 additions & 1 deletion HISTORY.rst
@@ -3,11 +3,15 @@
History
=======

1.0.10 (2022-01-29)
-------------------
* Added encoding and delimiter detection for the uniq, select, frequency and headers commands, which were completely rewritten. If the encoding and delimiter options are set, they override the detected values; otherwise the detected delimiter and encoding are used.
* Added support for converting to .parquet files. It is done in the simplest way, using the pandas "to_parquet" function.

1.0.9 (2022-01-18)
------------------
* Added support for CSV and BSON files for "stats" command


1.0.8 (2021-07-14)
------------------
* Replaced json with orjson for some operations. Will keep watching performance changes and plan to replace other json library calls with orjson.
19 changes: 15 additions & 4 deletions README.rst
@@ -21,7 +21,7 @@ Main features

* Common data operations against CSV, JSON lines and BSON files
* Built-in data filtering
-* Conversion between CSV, JSONl, BSON, XML, XLS, XLSX file types
+* Conversion between CSV, JSONl, BSON, XML, XLS, XLSX, Parquet file types
* Low memory footprint
* Support for compressed datasets
* Advanced statistics calculations
@@ -176,7 +176,8 @@ Commands

Frequency command
-----------------
-Field value frequency calculator. Returns frequency table for certain field
+Field value frequency calculator. Returns frequency table for certain field.
This command autodetects the delimiter and encoding of CSV files and the encoding of JSON lines files by default. You may override them by providing the "-d" delimiter and "-e" encoding parameters.
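
The detect-then-override behaviour described above can be sketched in Python roughly as follows. This is only an illustration, not the tool's actual code: the function names, the short list of candidate encodings and the use of the standard-library ``csv.Sniffer`` are all assumptions, and the real command may rely on a dedicated detection library.

```python
import codecs
import csv
from collections import Counter


def detect_encoding(raw, candidates=("utf-8-sig", "cp1251", "latin-1")):
    """Return the first candidate that decodes the byte sample without errors."""
    for name in candidates:
        try:
            # An incremental decoder tolerates a multi-byte character that is
            # cut off at the end of the sample.
            codecs.getincrementaldecoder(name)().decode(raw)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits the sample")


def frequency(path, field, delimiter=None, encoding=None, sample_size=65536):
    """Count values of `field`; explicit options override detected ones."""
    with open(path, "rb") as f:
        raw = f.read(sample_size)
    encoding = encoding or detect_encoding(raw)
    if delimiter is None:
        sample = codecs.getincrementaldecoder(encoding)().decode(raw)
        # Sniffer raises csv.Error when it cannot decide on a delimiter.
        delimiter = csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        counts = Counter(row[field] for row in reader)
    # Most frequent values first.
    return counts.most_common()
```

Calling ``frequency("data.csv", "GovSystem")`` detects both settings, while ``frequency("data.csv", "GovSystem", delimiter=";", encoding="cp1251")`` skips detection, mirroring the "-d" and "-e" overrides.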

Get frequencies of values for field *GovSystem* in the list of Russian federal government domains from `govdomains repository <https://github.com/infoculture/govdomains/tree/master/refined>`_

@@ -192,6 +193,7 @@ Uniq command

Returns all unique values of certain field(s). Accepts parameter *fields* with comma-separated field names whose unique values should be returned.
Provide a single field name to get unique values of this field, or provide a list of fields to get combined unique values.
This command autodetects the delimiter and encoding of CSV files and the encoding of JSON lines files by default. You may override them by providing the "-d" delimiter and "-e" encoding parameters.
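
As a rough picture of what "combined unique values" means for a JSON lines file, here is a minimal Python sketch. The ``uniq`` function is hypothetical and only illustrates the idea; it assumes each line is a JSON object and that the selected fields hold scalar (hashable) values.

```python
import json


def uniq(lines, fields):
    """Return distinct value tuples for `fields`, in first-seen order."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # A missing field is represented as None in the combination.
        key = tuple(record.get(name) for name in fields)
        seen.setdefault(key, None)
    return list(seen)
```

A comma-separated *fields* value such as ``status,regions`` would map to ``uniq(open_file, "status,regions".split(","))``. Because the input is consumed line by line, only the distinct combinations are kept in memory.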


Returns all unique values of field *regions* in selected JSONl file
@@ -210,7 +212,7 @@ Returns all unique combinations of fields *status* and *regions* in selected JSO
Convert command
---------------

-Converts data from one format to another.
+Converts data from one format to another. Supports most common data files.
Supports conversions:

* XML to JSON lines
@@ -221,6 +223,8 @@ Supports conversions:
* CSV to BSON
* XLS to BSON
* JSON lines to CSV
* CSV to Parquet
* JSON lines to Parquet
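
The changelog for this release says the Parquet conversion uses the pandas "to_parquet" function. A minimal sketch of that approach could look like the code below. The ``to_parquet`` helper, its parameters and the extension check are assumptions made for illustration; pandas also needs a Parquet engine such as pyarrow or fastparquet installed.

```python
def to_parquet(src, dst, delimiter=",", encoding="utf-8"):
    """Convert a CSV or JSON lines file to Parquet and return the row count."""
    # Imported lazily because pandas is a third-party dependency.
    import pandas as pd

    if src.endswith(".jsonl"):
        frame = pd.read_json(src, lines=True, encoding=encoding)
    else:
        frame = pd.read_csv(src, sep=delimiter, encoding=encoding)
    # DataFrame.to_parquet delegates to the installed Parquet engine.
    frame.to_parquet(dst, index=False)
    return len(frame)
```

Note that this sketch loads the whole source file into memory before writing, which is the price of the simple approach.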

Conversion between XML and JSON lines requires the *tagname* flag, which names the tag that should be converted into a single JSON record.

@@ -236,6 +240,12 @@ Converts JSON lines file roszdravvendors_final.jsonl to CSV file roszdravvendors
    $ undatum convert examples/roszdravvendors_final.jsonl examples/roszdravvendors_final.csv

Converts CSV file feddomains.csv to Parquet file feddomains.parquet

.. code-block:: bash

    $ undatum convert examples/feddomains.csv examples/feddomains.parquet

Validate command
----------------
@@ -260,6 +270,7 @@ Headers command
---------------
Returns the field names of the file. Supports CSV, JSON lines and BSON file types.
For a CSV file it takes the first line of the file; for JSON lines and BSON files it processes the number of records given by the *limit* parameter (default 10000).
This command autodetects the delimiter and encoding of CSV files and the encoding of JSON lines files by default. You may override them by providing the "-d" delimiter and "-e" encoding parameters.
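
For JSON lines input, the *limit* behaviour can be pictured with the following Python sketch. The ``jsonl_headers`` function is hypothetical: it assumes every line is a JSON object, collects only top-level keys, and counts the limit in lines read. Whether the actual command flattens nested keys is not stated here.

```python
import json
from itertools import islice


def jsonl_headers(lines, limit=10000):
    """Collect top-level field names from at most `limit` lines, in first-seen order."""
    names = {}
    for line in islice(lines, limit):
        if line.strip():
            # dict.fromkeys keeps key order and ignores the values.
            names.update(dict.fromkeys(json.loads(line)))
    return list(names)
```

Fields that first appear after the limit are not reported, which is the trade-off for not scanning the whole file.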

Returns headers of JSON lines file with top 10 000 records (default value)

@@ -403,4 +414,4 @@ Data types
JSONl
-----
-JSON lines is a replacement to CSV and JSON files, with JSON flexibility and ability to process data line by line, without loading everithing into memory.
+JSON lines is a replacement to CSV and JSON files, with JSON flexibility and ability to process data line by line, without loading everything into memory.
3,424 changes: 3,424 additions & 0 deletions examples/budgetgovru-fbpgu.jsonl

Large diffs are not rendered by default.

1,000 changes: 1,000 additions & 0 deletions examples/trudvac_final_s.jsonl

Large diffs are not rendered by default.
