Utility to inspect Parquet files.
parquet-tools supports the following installation methods:
- Download pre-built binaries
- brew install on Mac
- Container image
- Install from source
- Prebuilt packages
Once it is installed, refer to the usage page for details on how to use the tool.
This project is inspired by:
- parquet-go/parquet-tools: https://github.com/xitongsys/parquet-go/tree/master/tool/parquet-tools/
- Python parquet-tools: https://pypi.org/project/parquet-tools/
- Java parquet-tools: https://mvnrepository.com/artifact/org.apache.parquet/parquet-tools
- Makefile: https://github.com/cisco-sso/kdk/blob/master/Makefile
Some test cases are from:
- https://registry.opendata.aws/binding-db/
- https://github.com/xitongsys/parquet-go/tree/master/example/
- https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
- https://azure.microsoft.com/en-us/services/open-datasets/catalog/
- https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Tools used:
- https://golang.org/
- https://github.com/golangci/golangci-lint
- https://github.com/jstemmer/go-junit-report
- https://circleci.com/
The TODO list is tracked as enhancements in issues.
- parquet-tools
- Installation and Usage of parquet-tools
Choose any one of the installation methods below; the functionality is mostly the same.
Good for people who are familiar with Go; you need Go 1.21 or newer.
$ go install github.com/hangxie/parquet-tools@latest
It installs the latest stable version of parquet-tools to $GOPATH/bin. If you have not set the GOPATH environment variable explicitly, its default value can be obtained by running go env GOPATH; usually it is the go/ directory under your home directory.
parquet-tools installed from source will not report a proper version and build time, so if you run parquet-tools version, it will just give you an empty line. All other functions are not affected.
Installing a specific version with go install may not work, as go.mod contains replace directives from time to time to address issues that have not been fixed upstream.
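Because go install places the binary in $GOPATH/bin, the command is only found if that directory is on PATH. The sketch below is one way to check this; the function name dir_on_path is made up for illustration, and it does a plain string comparison only (no symlink or trailing-slash normalization).

```shell
# dir_on_path DIR PATH_VALUE
# Succeeds (exit 0) when DIR is one entry of the colon-separated PATH_VALUE.
dir_on_path() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible use is `dir_on_path "$(go env GOPATH)/bin" "$PATH" || echo "GOPATH/bin is not on PATH"`.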
Good for people who do not want to build from source, or when the other installation approaches do not work.
Go to the release page, pick the release and platform you want, download the corresponding gz/zip file, and extract it to your local disk. Make sure the execution bit is set if you are running on Linux or Mac, then run the program.
For Windows 10 on ARM (like Surface Pro X), use either the windows-arm64 or the windows-386 build; if you are in the Windows Insider program, the windows-amd64 build should work too.
Mac users can install with Homebrew. It is not part of the core formulae yet, but you can run:
$ brew uninstall parquet-tools
$ brew tap hangxie/tap
$ brew install go-parquet-tools
The parquet-tools formula installed by brew is a similar tool built in Java; however, it is deprecated. Both packages install the same parquet-tools utility, so you need to remove one before installing the other.
Whenever you want to upgrade to the latest version, which you should:
$ brew upgrade go-parquet-tools
Starting from v1.22.1, go-parquet-tools is installed from bottles by default, which can dramatically reduce installation time in environments with a slow connection. The binary build comes from the releases of this repository. If you do not feel comfortable with prebuilt binaries and still want to install from source, you can use the --build-from-source flag:
$ brew install --build-from-source go-parquet-tools
The same flag is needed for reinstall. Also, brew upgrade may bring in bottles, so you may want to run uninstall followed by install instead.
The container image supports amd64, arm64, and arm/v7. It is hosted in two registries:
- Docker Hub (hangxie/parquet-tools)
- GitHub Packages (ghcr.io/hangxie/parquet-tools)
You can pull the image from either location:
$ docker run --rm hangxie/parquet-tools version
v1.22.2
$ podman run --rm ghcr.io/hangxie/parquet-tools version
v1.22.2
RPM and deb packages can be found on the release page. Only the amd64/x86_64 and arm64/aarch64 architectures are available at this moment; download the proper package and run the corresponding installation command:
- On Debian/Ubuntu:
$ sudo dpkg -i parquet-tools_1.22.2_amd64.deb
Preparing to unpack parquet-tools_1.22.2_amd64.deb ...
Unpacking parquet-tools (1.22.2) ...
Setting up parquet-tools (1.22.2) ...
- On CentOS/Fedora:
$ sudo rpm -Uhv parquet-tools-1.22.2-1.x86_64.rpm
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:parquet-tools-1.22.2-1 ################################# [100%]
parquet-tools provides help information through the -h flag. Whenever you are not sure about the parameters of a command, add -h to the end of the line and it will list all available options, for example:
$ parquet-tools meta -h
Usage: parquet-tools meta <uri>
Prints the metadata.
Arguments:
<uri> URI of Parquet file.
Flags:
-h, --help Show context-sensitive help.
--http-multiple-connection (HTTP URI only) use multiple HTTP connection.
--http-ignore-tls-error (HTTP URI only) ignore TLS error.
--http-extra-headers=KEY=VALUE,... (HTTP URI only) extra HTTP headers.
--object-version=STRING (S3 URI only) object version.
--anonymous (S3 and Azure only) object is publicly accessible.
-b, --base64 Encode min/max value.
Most commands can output results in JSON format, which can be processed by utilities like jq or an online JSON parser.
parquet-tools can read and write parquet files from these locations:
- file system
- AWS Simple Storage Service (S3) bucket
- Google Cloud Storage (GCS) bucket
- Azure Storage Container
- HDFS
parquet-tools can read parquet files from these locations:
- HTTP/HTTPS URL
You need to have proper permission on the file you are going to process.
For files on the local file system, you can specify the file:// scheme or simply omit it:
$ parquet-tools row-count testdata/good.parquet
3
$ parquet-tools row-count file://testdata/good.parquet
3
$ parquet-tools row-count file://./testdata/good.parquet
3
Use the full S3 URL to indicate the S3 object location; it starts with s3://. You need to make sure you have permission to read or write the S3 object. The easiest way to verify that is with the AWS CLI:
$ aws sts get-caller-identity
{
"UserId": "REDACTED",
"Account": "123456789012",
"Arn": "arn:aws:iam::123456789012:user/redacted"
}
$ aws s3 ls s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
2024-05-06 08:33:48 362267322 20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
$ parquet-tools row-count s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
2405462
If an S3 object is publicly accessible and you do not have AWS credentials, you can use the --anonymous flag to bypass AWS authentication:
$ aws sts get-caller-identity
Unable to locate credentials. You can configure credentials by running "aws configure".
$ aws s3 --no-sign-request ls s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
2024-05-06 08:33:48 362267322 20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
$ parquet-tools row-count --anonymous s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0
2405462
Optionally, you can specify the object version with --object-version when you perform a read operation (like cat, row-count, schema, etc.) on S3. parquet-tools will access the current version if this parameter is omitted. If the requested version of the S3 object does not exist, or the bucket does not have versioning enabled, parquet-tools will report an error:
$ parquet-tools row-count s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0 --object-version non-existent-version
parquet-tools: error: failed to open S3 object [s3://daylight-openstreetmap/parquet/osm_features/release=v1.46/type=way/20240506_151445_00143_nanmw_fb5fe2f1-fec8-494f-8c2e-0feb15cedff0] version [non-existent-version]: operation error S3: HeadObject, https response error StatusCode: 400, RequestID: 75GZZ1W5M4KMAK1H, HostID: hgDGBOolDqLgH+CHRuZU+dXZXv4CB+mmSpjEfGxF5fLnKhNkJCWEAZBSS0kbT/k2gFotuoWNLX+zaWNWzHR49w==, api error BadRequest: Bad Request
According to the HeadObject and GetObject documentation, the status code for a non-existent object or version is 403 instead of 404 if the caller does not have permission to ListBucket, or 400 if the bucket does not have versioning enabled.
Thanks to parquet-go-source, parquet-tools loads only the necessary data from the S3 bucket, which in most cases is the footer only, so it is much faster than downloading the file from the S3 bucket and running parquet-tools on a local file. The S3 object used in the above sample is about 362 MB (362267322 bytes according to aws s3 ls), but the row-count command takes just several seconds to finish.
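Reading only the footer is cheap because of how Parquet files are laid out: an unencrypted Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic PAR1, so a reader can locate the metadata from the last 8 bytes. The following is a rough sketch for a local file, not a description of how parquet-tools is implemented; the function name footer_length is hypothetical.

```shell
# footer_length FILE
# Prints the footer (metadata) length stored in the last 8 bytes of a Parquet file:
# 4 bytes of little-endian length, then the magic "PAR1".
footer_length() {
    file=$1
    # The last 4 bytes must be the Parquet magic.
    [ "$(tail -c 4 "$file")" = "PAR1" ] || { echo "not a Parquet file: $file" >&2; return 1; }
    # Read the 4 length bytes as unsigned decimals and combine them, least significant first.
    set -- $(tail -c 8 "$file" | head -c 4 | od -An -tu1)
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

For a remote object the same 8 bytes can be fetched with a ranged read, which is why a row count does not require downloading the whole file.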
Use the full gsutil URI to point to the GCS object location; it starts with gs://. You need to make sure you have permission to read or write the GCS object, using either application default credentials or GOOGLE_APPLICATION_CREDENTIALS. Refer to the Google Cloud documentation for more details.
$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service/account/key.json
$ parquet-tools import -s testdata/csv.source -m testdata/csv.schema gs://REDACTED/csv.parquet
$ parquet-tools row-count gs://REDACTED/csv.parquet
7
Similar to S3, parquet-tools downloads only the necessary data from the GCS bucket.
For Azure blobs, parquet-tools uses the HDFS URL format:
- starts with wasbs:// (wasb:// is not supported), followed by
- container as user name, followed by
- storage account as host, followed by
- blob name as path
For example:
wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
means the parquet file is at:
- storage account azureopendatastorage
- container laborstatisticscontainer
- blob lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
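The mapping above (container as user name, storage account as host, blob as path) can be sketched as a small helper. The function name wasbs_uri is invented for illustration. It assumes the public Azure endpoint suffix blob.core.windows.net and does no URL-encoding of the blob name.

```shell
# wasbs_uri ACCOUNT CONTAINER BLOB
# Prints wasbs://CONTAINER@ACCOUNT.blob.core.windows.net/BLOB
wasbs_uri() {
    printf 'wasbs://%s@%s.blob.core.windows.net/%s\n' "$2" "$1" "$3"
}
```

For example, `wasbs_uri azureopendatastorage laborstatisticscontainer lfs/file.parquet` prints a URI in the same shape as the one shown above.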
parquet-tools uses the AZURE_STORAGE_ACCESS_KEY environment variable to authenticate access:
$ AZURE_STORAGE_ACCESS_KEY=REDACTED parquet-tools import -s testdata/csv.source -m testdata/csv.schema wasbs://REDACTED@REDACTED.blob.core.windows.net/test/csv.parquet
$ AZURE_STORAGE_ACCESS_KEY=REDACTED parquet-tools row-count wasbs://REDACTED@REDACTED.blob.core.windows.net/test/csv.parquet
7
If the blob is publicly accessible, either unset AZURE_STORAGE_ACCESS_KEY or use the --anonymous option to indicate that anonymous access is expected:
$ AZURE_STORAGE_ACCESS_KEY= parquet-tools row-count wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
6582726
$ parquet-tools row-count --anonymous wasbs://laborstatisticscontainer@azureopendatastorage.blob.core.windows.net/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
6582726
Similar to S3 and GCS, parquet-tools downloads only the necessary data from the blob.
parquet-tools supports URIs with the http or https scheme. The remote server needs to support the Range header, particularly with the unit of bytes.
An HTTP endpoint does not support write operations, so it cannot be used as the destination of the import command.
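One way to check the Range requirement before pointing parquet-tools at a URL is to look for an Accept-Ranges: bytes response header. The helper below is only a sketch and its name advertises_byte_ranges is hypothetical. Some servers honor Range requests without advertising it, so a negative result is not conclusive.

```shell
# advertises_byte_ranges
# Reads HTTP response headers on stdin; succeeds if the server sent "Accept-Ranges: bytes".
advertises_byte_ranges() {
    # Strip carriage returns from CRLF line endings, then match case-insensitively.
    tr -d '\r' | grep -qi '^accept-ranges:[[:space:]]*bytes'
}
```

It can be fed from a HEAD request, for example `curl -sI "$url" | advertises_byte_ranges && echo "byte ranges advertised"`, where `$url` is whatever endpoint you intend to read.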
These options can be used along with HTTP endpoints:
- --http-multiple-connection will enable dedicated transport for concurrent requests; parquet-tools will establish multiple TCP connections to the remote server. This may or may not affect performance, depending on how the remote server handles concurrent connections. It is recommended to leave it at the default false value for all commands except cat, and to test performance carefully with the cat command.
- --http-extra-headers takes values in the format of key1=value1,key2=value2,... which will be used as extra HTTP headers. A use case is authentication/authorization required by the remote server.
- --http-ignore-tls-error will ignore TLS errors.
$ parquet-tools row-count https://azureopendatastorage.blob.core.windows.net/laborstatisticscontainer/lfs/part-00000-tid-6312913918496818658-3a88e4f5-ebeb-4691-bfb6-e7bd5d4f2dd0-63558-c000.snappy.parquet
6582726
$ parquet-tools size https://dpla-provider-export.s3.amazonaws.com/2021/04/all.parquet/part-00000-471427c6-8097-428d-9703-a751a6572cca-c000.snappy.parquet
4632041101
Similar to S3 and other remote endpoints, parquet-tools downloads only the necessary data from the remote server through the Range header.
parquet-tools will use HTTP/2 if the remote server supports it; however, you can disable this if things are not working well by setting the environment variable GODEBUG to http2client=0:
$ parquet-tools row-count https://huggingface.co/datasets/laion/laion2B-en/resolve/main/part-00047-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet
2022/09/05 09:54:52 protocol error: received DATA after END_STREAM
2022/09/05 09:54:52 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
2022/09/05 09:54:53 protocol error: received DATA after END_STREAM
18141856
$ GODEBUG=http2client=0 parquet-tools row-count https://huggingface.co/datasets/laion/laion2B-en/resolve/main/part-00047-5114fd87-297e-42b0-9d11-50f1df323dfa-c000.snappy.parquet
18141856
parquet-tools can read and write files under HDFS with a URI in the form hdfs://username@hostname:port/path/to/file. If username is not provided, the current OS user will be used.
$ parquet-tools import -f jsonl -m testdata/jsonl.schema -s testdata/jsonl.source hdfs://localhost:9000/temp/good.parquet
parquet-tools: error: failed to create JSON writer: failed to open HDFS source [hdfs://localhost:9000/temp/good.parquet]: create /temp/good.parquet: permission denied
$ parquet-tools import -f jsonl -m testdata/jsonl.schema -s testdata/jsonl.source hdfs://root@localhost:9000/temp/good.parquet
$ parquet-tools row-count hdfs://localhost:9000/temp/good.parquet
7
Similar to cloud storage, parquet-tools downloads only the necessary data from HDFS.
The cat command outputs the data in a parquet file; it supports JSON, JSONL, CSV, and TSV formats. Since most parquet files are rather large, you should use the row-count command to get a rough idea of how many rows are in the parquet file, then use the --skip, --limit and --sample-ratio flags to reduce the output to a manageable size. These flags can be used together.
There are two parameters that you will probably never touch:
- --read-page-size tells how many rows parquet-tools reads from the parquet file at a time; you can adjust it if you hit a performance or resource problem.
- --skip-page-size tells how many rows parquet-tools skips at a time if --skip is specified; you can adjust it if you hit a memory issue, read xitongsys/parquet-go#545 for more details.
$ parquet-tools cat --format jsonl testdata/good.parquet
{"Shoe_brand":"nike","Shoe_name":"air_griffey"}
{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"}
{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}
You can set the --fail-on-int96 option to make the cat command fail for parquet files that contain fields with the INT96 type, which is deprecated. The default value for this option is false so you can still read the INT96 type, but this behavior may change in the future.
$ parquet-tools cat --fail-on-int96 testdata/all-types.parquet
parquet-tools: error: field Int96 has type INT96 which is not supported
exit status 1
$ parquet-tools cat testdata/all-types.parquet
[{"Bool":true,"ByteArray":"ByteArray-0","Date":1640995200,...
--skip is similar to OFFSET in SQL: parquet-tools will skip this many rows from the beginning of the parquet file before applying any other logic.
$ parquet-tools cat --skip 2 --format jsonl testdata/good.parquet
{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}
parquet-tools will not report an error if --skip is greater than the total number of rows in the parquet file.
$ parquet-tools cat --skip 20 testdata/good.parquet
[]
There is no standard for the CSV and TSV formats. parquet-tools utilizes Go's encoding/csv package to maximize compatibility; however, there is no guarantee that the output can be interpreted by other utilities, especially if they are written in other programming languages.
$ parquet-tools cat -f csv testdata/good.parquet
Shoe_brand,Shoe_name
nike,air_griffey
fila,grant_hill_2
steph_curry,curry7
nil values will be presented as empty strings:
$ parquet-tools cat -f csv --limit 2 testdata/int96-nil-min-max.parquet
Utf8,Int96
UTF8-0,
UTF8-1,
By default, CSV and TSV output contains a header line with field names; you can use the --no-header option to remove it from the output.
$ parquet-tools cat -f csv --no-header testdata/good.parquet
nike,air_griffey
fila,grant_hill_2
steph_curry,curry7
CSV and TSV do not support parquet files with complex schema:
$ parquet-tools cat -f csv testdata/all-types.parquet
parquet-tools: error: field [Map] is not scalar type, cannot output in csv format
exit status 1
--limit is similar to LIMIT in SQL, or head in a Linux shell: parquet-tools will stop after outputting this many rows.
$ parquet-tools cat --limit 2 testdata/good.parquet
[{"Shoe_brand":"nike","Shoe_name":"air_griffey"},{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"}]
--sample-ratio enables sampling; the ratio is a number between 0.0 and 1.0 inclusive. 1.0 means output everything in the parquet file, while 0.0 means nothing. If you want 1 row out of every 10 rows, use 0.1.
This feature picks rows in the parquet file randomly, so only 0.0 and 1.0 produce deterministic results; any other ratio may generate a data set smaller or larger than you expect.
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[{"Shoe_brand":"nike","Shoe_name":"air_griffey"}]
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[]
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}]
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[{"Shoe_brand":"nike","Shoe_name":"air_griffey"},{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"}]
$ parquet-tools cat --sample-ratio 0.34 testdata/good.parquet
[{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"}]
$ parquet-tools cat --sample-ratio 1.0 testdata/good.parquet
[{"Shoe_brand":"nike","Shoe_name":"air_griffey"},{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"},{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}]
$ parquet-tools cat --sample-ratio 0.0 testdata/good.parquet
[]
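The varying row counts above are what independent per-row sampling produces. The following sketch imitates that behavior on plain text lines, under the assumption that each row is kept independently with probability equal to the ratio; the text only states that rows are picked randomly, so the tool's actual method may differ. The function name sample_lines is made up.

```shell
# sample_lines RATIO
# Copies each stdin line to stdout independently with probability RATIO.
sample_lines() {
    # rand() returns a value in [0, 1), so 1.0 keeps every line and 0.0 keeps none.
    awk -v ratio="$1" 'BEGIN { srand() } rand() < ratio'
}
```

With 3 input lines and a ratio of 0.34, any count from 0 to 3 lines can come out, which matches the spread seen in the runs above. Note that awk seeds srand() from the clock, so runs within the same second may repeat.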
--skip, --limit and --sample-ratio can be used together to achieve certain goals, for example, to get the 3rd row from the parquet file:
$ parquet-tools cat --skip 2 --limit 1 testdata/good.parquet
[{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}]
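For readers who think in shell pipelines, the OFFSET/LIMIT semantics can be mimicked on line-oriented data with tail and head. This is only an analogy to show the order of operations (skip first, then limit); the helper name skip_limit is hypothetical.

```shell
# skip_limit SKIP LIMIT
# Drops the first SKIP lines of stdin, then prints at most LIMIT lines.
skip_limit() {
    tail -n +"$(( $1 + 1 ))" | head -n "$2"
}
```

Applied to three lines, `skip_limit 2 1` prints only the third one, and a skip larger than the input prints nothing, mirroring the --skip behavior described earlier. The built-in flags remain preferable because the skipped rows are never printed in the first place.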
cat supports two JSON output formats. One is the default JSON format, which wraps all JSON objects in an array. This works well with small output and is compatible with most JSON toolchains; however, since almost all JSON libraries load the full JSON document into memory to parse and process it, this leads to memory pressure if you dump a huge amount of data.
$ parquet-tools cat testdata/good.parquet
[{"Shoe_brand":"nike","Shoe_name":"air_griffey"},{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"},{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}]
cat also supports the line-delimited JSON streaming format, selected with --format jsonl, which allows a reader of the output to process it in a streaming manner and greatly reduces the memory footprint. Note that there is always a newline at the end of the output.
When you want to filter data, use JSONL format output and pipe it to jq.
$ parquet-tools cat --format jsonl testdata/good.parquet
{"Shoe_brand":"nike","Shoe_name":"air_griffey"}
{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"}
{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}
You can read data line by line and parse every single line as a JSON object if you do not have a toolchain to process JSONL format.
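A minimal sketch of that line-by-line approach is shown below. It only handles the streaming part and leaves the parsing of each line to whatever JSON parser you have at hand; the function name for_each_row is invented for this example.

```shell
# for_each_row COMMAND [ARGS...]
# Runs COMMAND once per non-empty stdin line, passing the line as the last argument.
for_each_row() {
    # The "|| [ -n ... ]" part also handles a final line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] && "$@" "$line"
    done
    return 0
}
```

Only one row is held in memory at a time. The command passed in should not read from stdin itself, otherwise it would consume the remaining rows.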
The import command creates a parquet file from data in another format. The target file can be on the local file system or a cloud storage object such as S3; you need to have permission to write to the target location. An existing file or cloud storage object will be overwritten.
The command takes 3 parameters: --source tells which file (file system only) to load source data from, --format tells the format of the source data file (json, jsonl or csv), and --schema points to the file that holds the schema. Optionally, you can use --compression to specify the compression codec (UNCOMPRESSED/SNAPPY/GZIP/LZ4/LZ4_RAW/ZSTD); the default is "SNAPPY". If the CSV file contains a header line, you can use --skip-header to skip the first line of the CSV file.
Each source data file format has its own dedicated schema format:
- CSV: you can refer to sample in this repo.
- JSON: you can refer to sample in this repo.
- JSONL: use same schema as JSON format.
You cannot import INT96 data at this moment, more details can be found at #149.
$ parquet-tools import -f csv -s testdata/csv.source -m testdata/csv.schema /tmp/csv.parquet
$ parquet-tools row-count /tmp/csv.parquet
7
$ parquet-tools import -f json -s testdata/json.source -m testdata/json.schema -z GZIP /tmp/json.parquet
$ parquet-tools row-count /tmp/json.parquet
1
As with most JSON processing utilities, the whole JSON file needs to be loaded into memory and is treated as a single object, so the memory footprint may be significant if you try to load a large JSON file. You should use the JSONL format if you deal with a large amount of data.
JSONL is line-delimited JSON streaming format, use JSONL if you want to load multiple JSON objects into parquet.
$ parquet-tools import -f jsonl -s testdata/jsonl.source -m testdata/jsonl.schema /tmp/jsonl.parquet
$ parquet-tools row-count /tmp/jsonl.parquet
7
The merge command merges several parquet files with the same schema into one parquet file. The source files and the target file can be in different storage locations.
$ parquet-tools merge --sources testdata/good.parquet,testdata/good.parquet /tmp/doubled.parquet
$ parquet-tools cat -f jsonl testdata/good.parquet
{"Shoe_brand":"nike","Shoe_name":"air_griffey"}
{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"}
{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}
$ parquet-tools cat -f jsonl /tmp/doubled.parquet
{"Shoe_brand":"nike","Shoe_name":"air_griffey"}
{"Shoe_brand":"nike","Shoe_name":"air_griffey"}
{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"}
{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}
{"Shoe_brand":"fila","Shoe_name":"grant_hill_2"}
{"Shoe_brand":"steph_curry","Shoe_name":"curry7"}
You can use --read-page-size to configure how many rows are read from the source file and written to the target file each time. You can also use --compression to specify the compression codec (UNCOMPRESSED/SNAPPY/GZIP/LZ4/LZ4_RAW/ZSTD) for the target parquet file; the default is "SNAPPY". Other read options like --http-multiple-connection, --http-ignore-tls-error, --http-extra-headers, --object-version, and --anonymous can still be used, but since they are applied to all source files, some of them may not make sense, e.g. --object-version.
You can set the --fail-on-int96 option to make the merge command fail for parquet files that contain fields with the INT96 type, which is deprecated. The default value for this option is false so you can still read the INT96 type, but this behavior may change in the future.
The meta command shows the metadata of every row group in a parquet file.
Note that MinValue and MaxValue always show values with the base type instead of the converted type, e.g. INT32 instead of UINT_8. The --base64 flag applies to columns with type BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY only; it tells parquet-tools to output base64-encoded MinValue and MaxValue of a column, otherwise those values are shown as UTF8 strings.
$ parquet-tools meta testdata/good.parquet
{"NumRowGroups":1,"RowGroups":[{"NumRows":3,"TotalByteSize":438,"Columns":[{"PathInSchema":["Shoe_brand"],"Type":"BYTE_ARRAY","Encodings":["RLE","BIT_PACKED","PLAIN"],"CompressedSize":269,"UncompressedSize":194,"NumValues":3,"NullCount":0,"MaxValue":"steph_curry","MinValue":"fila","CompressionCodec":"GZIP"},{"PathInSchema":["Shoe_name"],"Type":"BYTE_ARRAY","Encodings":["RLE","BIT_PACKED","PLAIN"],"CompressedSize":319,"UncompressedSize":244,"NumValues":3,"NullCount":0,"MaxValue":"grant_hill_2","MinValue":"air_griffey","CompressionCodec":"GZIP"}]}]}
$ parquet-tools meta --base64 testdata/good.parquet
{"NumRowGroups":1,"RowGroups":[{"NumRows":3,"TotalByteSize":438,"Columns":[{"PathInSchema":["Shoe_brand"],"Type":"BYTE_ARRAY","Encodings":["RLE","BIT_PACKED","PLAIN"],"CompressedSize":269,"UncompressedSize":194,"NumValues":3,"NullCount":0,"MaxValue":"c3RlcGhfY3Vycnk=","MinValue":"ZmlsYQ==","CompressionCodec":"GZIP"},{"PathInSchema":["Shoe_name"],"Type":"BYTE_ARRAY","Encodings":["RLE","BIT_PACKED","PLAIN"],"CompressedSize":319,"UncompressedSize":244,"NumValues":3,"NullCount":0,"MaxValue":"Z3JhbnRfaGlsbF8y","MinValue":"YWlyX2dyaWZmZXk=","CompressionCodec":"GZIP"}]}]}
Note that MinValue, MaxValue and NullCount are optional, if they do not show up in output then it means parquet file does not have that section.
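To turn a base64-encoded MinValue or MaxValue back into its raw bytes, a standard base64 decoder is enough. The helper below is a small sketch; its name decode_stat is hypothetical, and it assumes a base64 utility that accepts -d (GNU coreutils and recent macOS do).

```shell
# decode_stat VALUE
# Decodes a base64-encoded MinValue/MaxValue string back to raw bytes.
decode_stat() {
    printf '%s' "$1" | base64 -d
}
```

Using the values from the --base64 output above, `decode_stat 'ZmlsYQ=='` yields fila and `decode_stat 'c3RlcGhfY3Vycnk='` yields steph_curry, which match the MinValue and MaxValue of Shoe_brand in the plain output.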
You can set the --fail-on-int96 option to make the meta command fail for parquet files that contain fields with the INT96 type, which is deprecated. The default value for this option is false so you can still read the INT96 type, but this behavior may change in the future.
The row-count command provides the total number of rows in the parquet file:
$ parquet-tools row-count testdata/good.parquet
3
schema
command shows schema of the parquet file in different formats.
JSON format schema can be used directly in parquet-go based golang program like this example:
$ parquet-tools schema testdata/good.parquet
{"Tag":"name=Parquet_go_root","Fields":[{"Tag":"name=Shoe_brand, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=Shoe_name, type=BYTE_ARRAY, convertedtype=UTF8"}]}
Default settings are omitted to make the output shorter, e.g.:
- convertedtype=LIST
- convertedtype=MAP
- repetitiontype=REQUIRED
- type=STRUCT
Raw format is the schema directly dumped from parquet file, all other formats are derived from raw format.
$ parquet-tools schema --format raw testdata/good.parquet
{"repetition_type":"REQUIRED","name":"Parquet_go_root","num_children":2,"children":[{"type":"BYTE_ARRAY","type_length":0,"repetition_type":"REQUIRED","name":"Shoe_brand","converted_type":"UTF8","scale":0,"precision":0,"field_id":0,"logicalType":{"STRING":{}}},{"type":"BYTE_ARRAY","type_length":0,"repetition_type":"REQUIRED","name":"Shoe_name","converted_type":"UTF8","scale":0,"precision":0,"field_id":0,"logicalType":{"STRING":{}}}]}
The go struct format generates a Go struct definition snippet that can be used in Go code:
$ parquet-tools schema --format go testdata/good.parquet | gofmt
type Parquet_go_root struct {
Shoe_brand string `parquet:"name=Shoe_brand, type=BYTE_ARRAY, convertedtype=UTF8"`
Shoe_name string `parquet:"name=Shoe_name, type=BYTE_ARRAY, convertedtype=UTF8"`
}
Based on your use case, type Parquet_go_root may need to be renamed.
parquet-go does not support composite types as map keys or values in Go struct tags for now, so parquet-tools will report an error if there is such a field. You can still output in raw or JSON format:
$ parquet-tools schema -f go testdata/map-composite-value.parquet
parquet-tools: error: go struct does not support composite type as map value in field [Parquet_go_root.Scores]
exit status 1
$ parquet-tools schema testdata/map-composite-value.parquet
{"Tag":"name=Parquet_go_root","Fields":[{"Tag":"name=Name, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=Age, type=INT32"},{"Tag":"name=Id, type=INT64"},{"Tag":"name=Weight, type=FLOAT"},{"Tag":"name=Sex, type=BOOLEAN"},{"Tag":"name=Classes, type=LIST","Fields":[{"Tag":"name=Element, type=BYTE_ARRAY, convertedtype=UTF8"}]},{"Tag":"name=Scores, type=MAP","Fields":[{"Tag":"name=Key, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=Value, type=LIST","Fields":[{"Tag":"name=Element, type=FLOAT"}]}]},{"Tag":"name=Friends, type=LIST","Fields":[{"Tag":"name=Element","Fields":[{"Tag":"name=Name, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=Id, type=INT64"}]}]},{"Tag":"name=Teachers, repetitiontype=REPEATED","Fields":[{"Tag":"name=Name, type=BYTE_ARRAY, convertedtype=UTF8"},{"Tag":"name=Id, type=INT64"}]}]}
CSV format is the schema that can be used to import from CSV files:
$ parquet-tools schema --format csv testdata/csv-good.parquet
name=Id, type=INT64
name=Name, type=BYTE_ARRAY, convertedtype=UTF8
name=Age, type=INT32
name=Temperature, type=FLOAT
name=Vaccinated, type=BOOLEAN
Since CSV is a flat 2D format, a CSV schema cannot be generated for nested or optional columns:
$ parquet-tools schema -f csv testdata/csv-optional.parquet
parquet-tools: error: CSV does not support optional column
exit status 1
$ parquet-tools schema -f csv testdata/csv-nested.parquet
parquet-tools: error: CSV supports flat schema only
exit status 1
The parquet-go package uses "PARGO_PREFIX_" to deal with field names starting with non-alphabetic characters, hence the output schema will also have this prefix. To restore the original field name, you can specify the --pargo-prefix option with a value of "PARGO_PREFIX_"; this applies to all output formats.
$ parquet-tools schema -f csv testdata/pargo-prefix.parquet
name=PARGO_PREFIX__shoe_brand, type=BYTE_ARRAY, convertedtype=UTF8
name=Shoe_name, type=BYTE_ARRAY, convertedtype=UTF8
$ parquet-tools schema -f csv --pargo-prefix PARGO_PREFIX_ testdata/pargo-prefix.parquet
name=_shoe_brand, type=BYTE_ARRAY, convertedtype=UTF8
name=Shoe_name, type=BYTE_ARRAY, convertedtype=UTF8
You need to change the field name to start with an uppercase alphabetic character if you use this with a Go struct, otherwise the field will not be exported:
$ parquet-tools schema -f go testdata/pargo-prefix.parquet | gofmt
type Parquet_go_root struct {
PARGO_PREFIX__shoe_brand string `parquet:"name=PARGO_PREFIX__shoe_brand, type=BYTE_ARRAY, convertedtype=UTF8"`
Shoe_name string `parquet:"name=Shoe_name, type=BYTE_ARRAY, convertedtype=UTF8"`
}
$ parquet-tools schema -f go --pargo-prefix PARGO_PREFIX_ testdata/pargo-prefix.parquet | gofmt
type Parquet_go_root struct {
_shoe_brand string `parquet:"name=_shoe_brand, type=BYTE_ARRAY, convertedtype=UTF8"`
Shoe_name string `parquet:"name=Shoe_name, type=BYTE_ARRAY, convertedtype=UTF8"`
}
shell-completions updates the shell's rcfile with the proper shell completion settings. This is an experimental feature at this moment; only bash is tested.
To install shell completions, run:
$ parquet-tools shell-completions
You will not get any output if everything runs well. You can check the shell's rcfile, for example .bash_profile or .bashrc for bash, to see what it added.
This command will return an error if the same line is already in the shell's rcfile.
To uninstall shell completions, run:
$ parquet-tools shell-completions --uninstall
You will not get any output if everything runs well. You can check the shell's rcfile, for example .bash_profile or .bashrc for bash, to see what it removed.
This command will return an error if the line does not exist in the shell's rcfile.
Hit the <TAB> key on the command line when you need a hint or want to auto-complete the current option.
The size command provides various size information: raw (compressed) data size, uncompressed data size, or footer (metadata) size.
$ parquet-tools size testdata/good.parquet
588
$ parquet-tools size --query footer --json testdata/good.parquet
{"Footer":323}
$ parquet-tools size -q all -j testdata/good.parquet
{"Raw":588,"Uncompressed":438,"Footer":323}
The split command distributes data in the source file into multiple parquet files. The number of output files is either the --file-count parameter, or the total number of rows in the source file divided by the --record-count parameter, rounded up.
The name of the output files is determined by --name-format, which is passed to fmt.Sprintf. The default value is result-%06d.parquet, which means output files will be under the current directory with names result-000000.parquet, result-000001.parquet, etc. You can use any file location that supports write operations, e.g. S3 or HDFS.
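To preview which file names a given --name-format value would produce, the shell's own printf can stand in for fmt.Sprintf, since both treat simple verbs such as %d and %06d the same way; other verbs may differ between the two. The helper name preview_names is made up, and the index starting at 0 is taken from the file names mentioned above.

```shell
# preview_names FORMAT COUNT
# Prints the first COUNT file names that FORMAT expands to, starting from index 0.
preview_names() {
    i=0
    while [ "$i" -lt "$2" ]; do
        # FORMAT is deliberately used as the printf format string.
        printf "$1\n" "$i"
        i=$(( i + 1 ))
    done
}
```

For instance, `preview_names 'result-%06d.parquet' 2` lists result-000000.parquet and result-000001.parquet.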
Other useful parameters include:
- --fail-on-int96 to fail the command if the source parquet file contains INT96 fields
- --compression to specify the compression codec for output files, default is SNAPPY
- --read-page-size to tell how many rows will be read per batch from the source
$ parquet-tools row-count testdata/all-types.parquet
10
$ parquet-tools split --file-count 3 testdata/all-types.parquet
$ parquet-tools row-count result-000000.parquet
4
$ parquet-tools row-count result-000001.parquet
3
$ parquet-tools row-count result-000002.parquet
3
$ parquet-tools row-count testdata/all-types.parquet
10
$ parquet-tools split --record-count 3 --name-format %d.parquet testdata/all-types.parquet
$ parquet-tools row-count 0.parquet
3
$ parquet-tools row-count 1.parquet
3
$ parquet-tools row-count 2.parquet
3
$ parquet-tools row-count 3.parquet
1
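The file counts and row counts in the two runs above follow from simple integer arithmetic. The sketch below reproduces them; the function names are invented, and handing the leftover rows to the first files is an assumption that happens to match the output shown here, not a documented guarantee.

```shell
# files_for_record_count ROWS RECORD_COUNT
# Number of output files when each file holds at most RECORD_COUNT rows (ceiling division).
files_for_record_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# rows_per_file ROWS FILE_COUNT
# Prints one row count per output file, giving the leftover rows to the first files.
rows_per_file() {
    base=$(( $1 / $2 ))
    extra=$(( $1 % $2 ))
    i=0
    while [ "$i" -lt "$2" ]; do
        if [ "$i" -lt "$extra" ]; then echo $(( base + 1 )); else echo "$base"; fi
        i=$(( i + 1 ))
    done
}
```

With 10 rows, `files_for_record_count 10 3` gives 4 files and `rows_per_file 10 3` gives 4, 3 and 3 rows, in line with the examples.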
The version command provides the version, build time, git hash, and source of the executable, which is helpful when you are troubleshooting a problem with the tool itself. The source of the executable can be "source" (or "") which means it was built from source code, "github" which indicates it came from a GitHub release (including container images and deb/rpm packages, as they share the same build result), or "bottle" if it came from Homebrew bottles.
$ parquet-tools version
v1.22.2
$ parquet-tools version -bgs
v1.22.2
2024-09-09T20:36:44+00:00
0bcba77
bottle
$ parquet-tools version --build-time --json
{"Version":"v1.22.2","BuildTime":"2024-09-09T20:36:44+00:00","GitHash":"0bcba77","Source":"bottle"}
$ parquet-tools version -j
{"Version":"v1.22.2"}