Releases: zero-one-group/geni

v0.0.31 - Spark Doc Scraper

07 Oct 06:06
3d41d6a
Pre-release
  • Spark Doc Scraper: scripts/scrape-spark-docs.clj can scrape the relevant docs for the four modules.
  • Partial Docstrings: docstrings are available for the core.column and ml.regression namespaces.

v0.0.30 - Some Basic Support for Spark Streaming

30 Sep 01:10
84fca65
  • Basic Spark Streaming functionality: added some low-hanging fruit in terms of JavaDStream and JavaStreamingContext methods.
  • More robust Spark Streaming testing function: now expects an :expected key and automatically retries, making the tests less flaky (see the sketch below).
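
The retry-based assertion itself is internal to Geni's test suite; what follows is only a minimal sketch of the idea, with a hypothetical name (eventually-equal?) and illustrative defaults, not Geni's actual helper:

(defn eventually-equal?
  "Hypothetical sketch: re-evaluates actual-fn until it returns the value under
  :expected or the retry budget runs out. Name and defaults are illustrative."
  [{:keys [expected max-retries sleep-ms]
    :or   {max-retries 10 sleep-ms 100}} actual-fn]
  (loop [attempt 0]
    (cond
      (= expected (actual-fn))  true
      (>= attempt max-retries)  false
      :else (do (Thread/sleep sleep-ms)
                (recur (inc attempt))))))

;; Usage idea: keep polling the collected DStream output until it matches.
(comment
  (eventually-equal? {:expected [["hello" "world"]]}
                     #(collected-output)))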

v0.0.29 - Start of Spark Streaming Support

23 Sep 00:06
613b5cf
  • DStream Testing Function: a more reliable and repeatable way to test Spark Streaming's StreamingContext and DStream methods.
  • Automated Version Bump: done with Babashka.
  • Updated Contributing Guide: thanks to @erp12 for pointing out certain gotchas in the guide.

v0.0.27 - Excel Support and Version Bumps

17 Sep 01:16
95a6e90
  • Excel Support: basic read-xlsx! and write-xlsx! functions are now available, backed by zero.one/fxl (see the sketch after this list).
  • Version Bumps: Spark and nrepl bumped to their latest versions.
  • CI Install Steps: Dockerless installs are now tested on Ubuntu and macOS.
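
For example, assuming the new functions follow the same shape as Geni's other read/write functions (write takes a dataframe and a path, read takes a path), a round trip might look like this; the file paths are illustrative:

(require '[zero-one.geni.core :as g])

;; Write an existing dataframe out to xlsx and read it back in.
;; The paths below are illustrative.
(def dataframe (g/read-csv! "data/example.csv"))

(g/write-xlsx! dataframe "target/example.xlsx")
(g/read-xlsx! "target/example.xlsx")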

v0.0.26 - Better RDDs, EDN Support and Data-Oriented Schemas

09 Sep 01:55
59c16f2
  • Schema option for read functions: all read functions now support a :schema option, which can be an actual Spark schema or its data-oriented version (see the sketch after this list).
  • Basic support for EDN: read-edn! and write-edn! are now available, with metosin/jsonista added as a dependency. The functions may not be performant, but they can come in handy for small-data compositions.
  • More RDD functions: this closes the RDD function gaps relative to sparkplug and adds variadicity to functions that take more than one RDD.
  • RDD name unmangling: this follows sparkplug's model of unmangling RDD names after each transformation.
  • Version bump for dependencies: nrepl bumped to 0.8.1.
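
A sketch of the two new options: the exact shape of the data-oriented schema (shown here as a map of column keyword to type keyword) is an assumption, and the file paths are illustrative:

(require '[zero-one.geni.core :as g])

;; Read with an explicit schema instead of relying on inference.
;; The map-based schema shape is an assumption; an actual Spark schema
;; is also accepted according to the notes above.
(g/read-csv! "data/example.csv" {:schema {:id :int, :name :string}})

;; Round-trip a small dataset through EDN (paths are illustrative):
(def dataframe (g/read-csv! "data/example.csv"))
(g/write-edn! dataframe "target/example.edn")
(g/read-edn! "target/example.edn")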

v0.0.25 - RDD Serialisation Model and More Methods

02 Sep 02:16
e9d0eb4
  • RDD Function Serialisation Model: changed from the sparkling model to the sparkplug model. Slack user @g on clojurians/geni mentioned that the sparkplug model results in fewer serialisation problems than the sparkling one.
  • More RDD Methods: added methods related to partitioners and to JavaSparkContext.
  • Community Guidelines: added a code of conduct and an issue template.
  • Design Goals Docs: first draft of the design goals, outlining some of the main focuses of the project.

v0.0.24 - Basic RDD and PairRDD Support

26 Aug 04:15
eea20f8
Pre-release
  • RDD and PairRDD Support: basic actions and transformations are supported, but passing serialisable functions to RDDs' higher-order functions requires AOT compilation, so the RDD REPL experience is rather poor.
  • Isolated Docker Runs: all Docker operations in the Makefile now run in a temporary directory, so that there are no race conditions when writing to the target directory. This means that make ci --jobs 3 is now possible on a single machine.

v0.0.23 - Basic RDD Support + Spark ML Cookbook

19 Aug 03:22
be42842

Preliminary RDD support, with only certain transformations completed, and two new parts of the Spark ML cookbook.

  • Basic RDD support: mainly basic operations such as map, reduce, map-to-pair and reduce-by-key (see the sketch after this list). The main challenge has been function serialisation, with the approach largely taken from sparkling and sparkplug.
  • Spark ML cookbook: added two chapters on Spark ML pipelines and ported a customer-segmentation blog post that uses non-negative matrix factorisation.
  • Better Geni CLI: new --submit command-line argument to emulate spark-submit.
  • Better CI steps: automated Geni CLI tests to avoid manual testing of the Geni REPL.
  • Completed benchmark results: added results from dplyr, data.table, tablecloth and tech.ml.dataset.
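
As a word-count-style sketch of the operations named in the first bullet: the namespace (zero-one.geni.rdd, aliased as rdd), the pair representation (a two-element vector) and the input path are assumptions, and the serialisation caveats mentioned in the v0.0.24 notes above still apply:

(require '[zero-one.geni.rdd :as rdd])   ; assumed namespace for the RDD API

;; One word per line; the path is illustrative.
(def words (rdd/text-file "data/words.txt"))

(-> words
    (rdd/map-to-pair (fn [word] [word 1]))   ; pair each word with a count of 1
    (rdd/reduce-by-key +)                    ; sum the counts per word
    rdd/collect)                             ; bring the results to the driver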

v0.0.22 - Basic Geni CLI + Namespace Alignments

11 Aug 13:17
6ae6f5d

Better getting-started experience with the new geni command and better alignment of Geni namespaces with Spark packages.

  • New geni script with install instructions and a new asciinema screencast. This will be the main way to use Geni for small, one-off analyses and throwaway scripts.
  • Created another layer of namespaces with zero-one.geni.core and zero-one.geni.ml. The idea is that the core namespaces should refer only to Spark SQL and the ml namespaces only to Spark ML. This makes it easier to map Geni functions to the original Spark functions.
  • Added a simple benchmark piece that compares the performance of Pandas vs. Geni on a particular problem.
  • An asciinema screencast for downloading the uberjar and interacting with the Geni REPL.

v0.0.21 - First Alpha Release

06 Aug 06:40
34d1d99
Pre-release

Initial alpha release documented here on cljdoc.

The release includes an uberjar that should provide a Geni REPL (i.e. a Clojure spark-shell) within seconds. Download the uberjar, and simply try out the REPL with java -jar geni-repl-uberjar-0.0.21.jar! An nREPL server is automatically started with an .nrepl-port file, so that common Clojure text editors should be able to jack in automatically.

The initial namespace automatically requires:

(require '[zero-one.geni.core :as g]
         '[zero-one.geni.ml :as ml])

so that functions such as g/read-csv! and ml/logistic-regression are immediately available.

The Spark session is available as a Clojure Future object, which can be dereferenced with @spark. To see the full default spark config, invoke (g/spark-conf @spark)!
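
A short REPL session sketch, assuming the file path and parameter values below are stand-ins (they are purely illustrative):

;; Inspect the default config of the automatically created session:
(g/spark-conf @spark)

;; Read a dataset and take a look (the path is illustrative):
(def dataframe (g/read-csv! "data/example.csv"))
(g/print-schema dataframe)
(g/show dataframe)

;; ML estimators such as logistic regression are available via the ml alias
;; (the parameter map is illustrative):
(ml/logistic-regression {:max-iter 10})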