This project is an exploration vehicle for developing composable, robust, and fast data pipelines over Apache Spark while maximizing user productivity. It consists of multiple sub-projects:
- a specification to describe data pipelines in a language-agnostic manner, and a communication protocol to submit these pipelines to Spark. The specification is defined in this repository using Protocol Buffers 3 (which is also compatible with JSON).
- a thin Python client that emulates a subset of the Spark and Pandas APIs (a sketch of the programming model follows this list).
- a serving library, called karps-server, that implements this specification on top of Spark. It is written in Scala and is loaded as a standard Spark package.
- an optimizing compiler that takes descriptions of data pipelines in the above specification and produces improved, lower-level representations more amenable to Spark or Pandas.
- a Haskell client, which generates such computation graphs using a DSL. This is the reference client.
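To give a flavor of the programming model, here is a minimal sketch of what a session with the Python client could look like. The module name `karps` and the `session`/`dataframe`/`collect` helpers are illustrative assumptions, not the confirmed API; consult the actual client for the real names.

```python
# A minimal, illustrative sketch: the module name `karps` and every helper
# below (`session`, `dataframe`, `collect`) are assumptions, not the real API.
import karps as ks  # hypothetical import

session = ks.session("demo")  # hypothetical: a named compute session

# Operations only extend a computation graph; nothing is executed yet.
df = session.dataframe({"x": [1.0, 2.0, 3.0, 4.0]})
total = (df["x"] * 2).sum()

# `collect` would serialize the graph (Protocol Buffers / JSON), submit it
# to the server over the communication protocol, and fetch the result.
print(session.collect(total))  # -> 20.0
```

The key design point, shared with TensorFlow, is that the client only builds a description of the computation; all execution is delegated to the server.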
There is also a separate set of utilities to visualize such pipelines using Jupyter notebooks, IPython and IHaskell.
The name is a play on a tasty fish of the family Cyprinidae (the carp), and an anagram of Spark. The programming model is strongly influenced by the TensorFlow project and follows a similar design.
This is a preview; the API may (and will) change in the future.
This is the server part of the Karps project. It implements a gRPC server on top of Spark and serves requests to evaluate graphs of computations. Users should not build graphs themselves and are strongly encouraged to use the Python client or the Haskell client instead. Developers can look at the implementation of the Python client to add support for another language; the shape of such a call is sketched below.
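Concretely, a client binding has little to do beyond serializing a computation graph and making one RPC. The sketch below shows that shape using Python's grpcio; the generated module, service, message, and method names are hypothetical placeholders (the real ones come from the .proto files in this repository), and the port is an assumption.

```python
import grpc

# Hypothetical modules generated by protoc from the spec in this repository;
# the real module, service, and message names come from the .proto files.
import karps_pb2
import karps_pb2_grpc

def submit(graph_proto, address="localhost:8081"):  # port is an assumption
    """Submit a serialized computation graph and return the evaluation."""
    channel = grpc.insecure_channel(address)
    stub = karps_pb2_grpc.KarpsStub(channel)                   # hypothetical service
    request = karps_pb2.ComputationRequest(graph=graph_proto)  # hypothetical message
    return stub.PerformGraph(request)                          # hypothetical RPC
```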
To run the server, you need:
- a distribution of Spark >= 2.1
- the karps compiler in your PATH. If you are using macOS or Linux, the easiest way is to download the latest prebuilt version following the instructions here. A quick way to check both prerequisites is sketched after this list.
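A minimal, standard-library-only sanity check for the two prerequisites above (the binary name `karps` is an assumption; adjust it if the compiler is named differently):

```python
import os
import shutil

# The compiler binary name `karps` is an assumption; adjust if it differs.
if shutil.which("karps") is None:
    raise SystemExit("karps compiler not found on your PATH")
if "SPARK_HOME" not in os.environ:
    raise SystemExit("SPARK_HOME is not set; point it at a Spark >= 2.1 distribution")
print("prerequisites look good")
```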
The simplest way to run the server is to use the published Spark package. The package has been tested with Spark 2.0, Spark 2.1, and Spark 2.2; due to some bugs in Spark 2.0, using Spark 2.1 or later is strongly recommended. To build the assembly from source and launch the server locally:
```bash
./build/sbt ks_testing/assembly && $SPARK_HOME/bin/spark-submit \
  --name karps-server --class org.karps.Boot --master "local[1]" -v \
  ./target/testing/scala-2.11/ks_testing-assembly-0.2.0.jar
```
(Note that spark-submit options such as `--class` and `--master` must come before the application jar; anything after the jar is passed to the application itself.)
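Once spark-submit reports the server as started, a quick liveness check can be done from Python with grpcio. The address and port below are assumptions; match them to whatever karps-server actually binds.

```python
import grpc

# The address/port are assumptions -- adjust to your karps-server configuration.
channel = grpc.insecure_channel("localhost:8081")
try:
    # Block until the channel is connected, or give up after ten seconds.
    grpc.channel_ready_future(channel).result(timeout=10)
    print("karps-server is accepting gRPC connections")
except grpc.FutureTimeoutError:
    print("server not reachable; check the spark-submit output for errors")
```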