Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should Geni have support for Delta Lake or should that be a separate library? #314

Open
erp12 opened this issue Feb 23, 2021 · 6 comments
Open

Comments

@erp12
Copy link
Collaborator

erp12 commented Feb 23, 2021

Delta Lake brings a lot of crucial features into the Spark ecosystem. Some of the highlights include:

  1. ACID transactions.
  2. Time travel between data versions.
  3. Safe in-place updates, inserts, and upserts.
  4. Storage optimization.

In some toy projects, I have begun using Delta with Geni using Scala interop. I haven't worked out all the issues yet, but I think I am getting close to something that could be useful to others.

It would be great to get people's thoughts on the best way to manage Delta support. My specific question is: Would it be a good idea to add Delta Lake as an optional dep (like we do with xgboost) and implement the Delta Lake API directly in Geni?

If yes, I can move my code into a PR. If no, I can create a geni compatible Clojure lib specifically for Delta lake.

Some things to consider:

  • Delta is technically a separate project from Spark. It has separate versioning and their APIs don't move in sync. Adding Delta to Geni would be another dependency to manage.
  • Geni is already using with-dynamic-import when reaching into optional dependencies, but that is usually for 1 or 2 functions. Adding Delta would create a set of functions to mirror the public API of an entire Scala library. It could be a little awkward to with-dymanic-import so much stuff. My editor has a hard time handling it, but maybe that is just me.
  • The primary way to use Delta is via a new "delta" format for DataFrame readers/writers. Geni's provides a nice data-driven way to create readers and writers. If Delta support was provided by a separate library, it wouldn't be possible to create delta reader/writers using the Geni API and the Delta lib wouldn't be able to benefit from Geni's utilities for configuring readers/writers.

Personally, I think option 1 is best (integrate into Geni) but I don't want to create more work for other people if I am one of the only people who would benefit. :)

@anthony-khong
Copy link
Member

Hi @erp12, thank you for the issue! Sorry for the delayed reply, I only just realised that you wrote this issue. It must have fallen between the cracks!

I'm not super familiar with Delta Lake, but it honestly sounds like a really good idea and slots it nicely with the general Spark API, so it makes sense to support it as an optional dependency!

As for with-dynamic-import, on suggestion would be to develop it without making it optional, then at the last minute, just add the with-dynamic-import. But you still run into the same problem when maintaining it.

I also wonder if it's possible to add an optional parent namespace, and have the dynamic import in another namespace..? 🤔 For instance, instead of having the zero-one.geni.ml.xgb namespace define the functions, we have zero-one.geni.optional.xgb do the regular import and functions, then have zero-one.geni.ml.xgb require zero-one.geni.optional.xgb wrapped in dynamic import. Not sure if this is even possible, because importing is quite funky.

@erp12
Copy link
Collaborator Author

erp12 commented Apr 22, 2021

I have finished a branch that provides support of Delta. Unfortunately, Delta 0.8 is incompatible with Spark 3.1.x which prevents me from testing some of the core functionality without downgrading geni to spark 3.0.2.

I'm not sure what the best way forward is. We could revisit this after the next Delta release, or we could disable the incompatible functionality and leave a todo for re-enabling in the future.

@anthony-khong
Copy link
Member

Hmm, honestly I'm not so sure myself if Geni should stay toe-to-toe with the latest version of Spark. I wonder if we should keep tabs on the latest, most compatible version of Spark with the ecosystem. @behrica brought this up as well previously with potential incompatibility with Spark NLP before they supported Spark 3.

So, I'm not against downgrading to Spark 3.0.2 if that means unlocking awesome other features from the ecosystem. I'd be very interested in what you and other Geni users think about this!

@dakra
Copy link

dakra commented Apr 23, 2021

Very recently I started looking at switching our parquet lake to delta, so I appreciate this change very much.

Maybe it makes sense to have the Spark version the highest possible that works with all of the features that Geni provides (I guess mainly Spark NLP and Delta then?!) ?
So, only if all libs can work with Spark 3.1 we upgrade.

Not too relevant for Geni but AWS EMR currently ships with max Spark 3.0.1 and since most of my other coworkers use pyspark we mainly develop and test against that.

@erp12
Copy link
Collaborator Author

erp12 commented Apr 25, 2021

There are definitely trade-offs, and I'm not confident enough in any particular approach to make a concrete pitch.

In my opinion, the appropriate strategy is different depending on the scope of Geni. What problems is Geni trying to solve vs not trying to solve? I will share my interpretation based on my reading of the design goals, the "why" page, and the Scicloj interview with @anthony-khong.

Geni is:

  • DataFrame's for Clojure, built on Spark.
  • A "fully loaded" workstation for data scientists.
    • Interactive work with fast feedback.
    • Feature parity with pandas, numpy, R, tech.ml.dataset, and of course Spark. :)
  • Idiomatic for Clojure users.
    • REPL friendly.
    • no ugly interop or Scala knowledge needed.
    • Better leverage of data as an interface.

Geni is not:

  • A plain "Spark for Clojure" wrapper.
  • Tech for data engineering or DataOps.

Feel free to correct me on this interpretation! Assuming I am correct that the primary goal of Geni is to do data science, I would say that Geni does not need to stay in sync with Spark, as long as it is built against a Spark version that enables its data science features. Furthermore, it might make sense to keep the Spark version as stable as possible so that data scientists can safely upgrade Geni without needing to upgrade their infrastructure.

Alternatively, if the primary goal of Geni is simply providing "Spark for Clojure" then it is important to be agnostic to the specific Spark version. At a minimum, the Spark versions that have a decent market share should be supported.


Stepping back from the question of Spark versions, I think there is a related topic that could use some more clarity as the project matures. Data science is a broad field, and it means different things to different people. In addition to stating the kind of person Geni is going to help (data scientists), can we describe the specific kinds of tasks that Geni is trying to help people do?

For example, when I first opened this issue I wasn't sure if the functionality that Delta provides makes sense for Geni or not. I know that Delta (and the similar Iceberg project) have become important to the Spark ecosystem (especially for data engineering and DataOps), but do the problems that are solved by these technologies overlap with the problems Geni is trying to solve?

It seems like "should Geni do ___?" is a common question. In addition to this Delta discussion, we have open discussions for GraphFrames (#321) and Spark NLP (#323). Also, Geni has already been integrated with other with non-spark tools like tech.ml.dataset, Excel, and XGBoost.

Most tools pick a narrower, less fuzzy, scope than "data science" in general. Even within the Spark project they decompose into independent modules based on problem domain (SparkSQL, mllib, streaming, graphx). That said, Geni doesn't have to do the same. I don't think it is bad thing if Geni is an all-in-one, batteries included, kitchen-sink of a platform that provides a bunch of stuff that is typically useful to people who call themselves data scientists. There are lots of choices with lots of trade-offs that will inform future development.

@anthony-khong I would be interested to hear what your vision for Geni's scope. What problems is Geni trying to solve, and what problems is Geni not trying to solve? I know that might seem like a vague question, and I can provide more specific probing questions if it would be helpful. :)

@dakra
Copy link

dakra commented May 25, 2021

jFYI Delta has released version 1.0 (https://delta.io/news/delta-lake-1-0-0-released/) with support for Spark 3.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants