- Converts CSV files into Parquet.
- Creates partitions.
Project Status: Dev Testing. The tool appears to work but still needs more testing; do not use it in production.
Designed to run on a Hadoop edge node.
```
java -jar toparquet.jar -if myfile1.csv myfile2.csv -o myparquet
java -jar toparquet.jar -ip folder/mydir/ -o myparquet -s myschema.par
java -jar toparquet.jar -h
```
Advanced/Json Configuration Guide
The motivation for this project is to let an in-memory data processing framework, such as Spark, process a large file that might not fit into a node's memory as a CSV. In some unfortunate circumstances, Spark may not have enough memory to perform the conversion itself.
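As a rough illustration of why this kind of conversion can stay within a small memory budget, the sketch below streams a CSV row by row into a Parquet writer using the parquet-avro library; rows are only buffered up to the writer's row-group size before being flushed to disk. The schema, column names, and file paths are made up for the example, and this is not the project's actual implementation.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class CsvToParquetSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical two-column schema; a real run would derive this from a schema file.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"row\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"string\"}]}");

        try (BufferedReader in = new BufferedReader(new FileReader("myfile1.csv"));
             ParquetWriter<GenericRecord> out = AvroParquetWriter
                 .<GenericRecord>builder(new Path("myparquet/part-00000.parquet"))
                 .withSchema(schema)
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .build()) {

            String line;
            while ((line = in.readLine()) != null) {
                // Naive split; a real converter needs a proper CSV parser (quoting, escapes).
                String[] cols = line.split(",", -1);
                if (cols.length < 2) {
                    continue;
                }
                GenericRecord record = new GenericData.Record(schema);
                record.put("id", cols[0]);
                record.put("value", cols[1]);
                // Rows accumulate into a row group and are flushed, so memory stays bounded
                // even when the input file is far larger than the edge node's RAM.
                out.write(record);
            }
        }
    }
}
```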
- Quickly and easily convert large files into Parquet with minimal memory usage.
- Preprocess files so they are easier to consume from Spark (see the sketch after this list).
- This project does not intend to become a data transformation/processing tool in its own right.
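For the Spark side of that goal, here is a minimal sketch of consuming the converted output; the application name and output path are assumptions for the example. Because Parquet files carry their own schema and column statistics, Spark can prune columns and row groups instead of re-parsing raw CSV text.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadConvertedOutput {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-toparquet-output")   // hypothetical app name
            .getOrCreate();

        // Load the directory produced by the converter; the schema travels with the data.
        Dataset<Row> df = spark.read().parquet("myparquet");
        df.printSchema();
        df.show(10);
    }
}
```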
- Stable CSV to Parquet Conversion
- Streaming Conversion via Command Line Pipe
- Failure Tolerance
- Add Calculated Columns to Output
- Basic Filtering
- Explode Records (to deal with high-skew joins; see the sketch after this list)
- Dedupe Records
- Isolate Large Columns into their own Files
- Multiple Readers (to relieve input bottlenecks)
- Multiple Writers (to relieve output bottlenecks)
- Array Support
- Per-Partition Ordering
- Option to use Cloud Object Storage instead of Hadoop
- Multi-Node Processing?
- Other Input Formats?
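On the record-explosion item above: the usual trick for high-skew joins is salting, where each record on the smaller side is replicated N times with a salt column while the skewed side's join key is salted into the same 0..N-1 range, spreading one hot key across N partitions. Below is a minimal sketch of the replication step; the column layout and helper names are hypothetical, not the planned feature's design.

```java
import java.util.ArrayList;
import java.util.List;

public class ExplodeSketch {
    /**
     * Emit N copies of a row, each tagged with a distinct salt value.
     * Joining against a table whose keys were salted randomly into the same
     * 0..N-1 range spreads a single hot key across N partitions.
     */
    static List<String[]> explode(String[] row, int saltBuckets) {
        List<String[]> copies = new ArrayList<>();
        for (int salt = 0; salt < saltBuckets; salt++) {
            String[] copy = new String[row.length + 1];
            System.arraycopy(row, 0, copy, 0, row.length);
            copy[row.length] = Integer.toString(salt);  // extra "salt" column
            copies.add(copy);
        }
        return copies;
    }

    public static void main(String[] args) {
        String[] row = {"key42", "some-value"};
        for (String[] r : explode(row, 4)) {
            System.out.println(String.join(",", r));
        }
    }
}
```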