
Parallel computations #723

Open
koperagen opened this issue Jun 4, 2024 · 0 comments
Labels: help wanted (Extra attention is needed, feel free to help :)), research (This requires a deeper dive to gather a better understanding)
Milestone: Backlog

@koperagen (Collaborator) commented:

Object properties can sometimes be heavy or lazily computed, and the overall conversion can take minutes for somewhat big lists.
One can write this fairly simple code to speed up the conversion:

val df = runBlocking {
    list
        .chunked(workload)
        .map {
            async(Dispatchers.IO) { it.toDataFrame() }
        }.awaitAll().concat()
}

Although the code is simple, it seems hard to properly make this parallelism part of the toDataFrame implementation.
Only list.toDataFrame(maxDepth = int) and list.toDataFrame { properties(maxDepth = int) { } } are side effect free, so it's (mostly) safe to split the list into chunks, run the conversion in parallel and concat the results. But even the computation of the properties might not be parallel friendly, and then there is the question of how the workload should be split, and so on.
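
For reference, the chunked pattern above could be packaged as a small extension function. This is only a sketch under the assumptions above (property access is side effect free, the caller picks a sensible chunk size); parallelToDataFrame is a hypothetical name, not an existing API:

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.concat
import org.jetbrains.kotlinx.dataframe.api.toDataFrame

// Hypothetical helper, not part of the library: split the list into chunks,
// convert each chunk on the IO dispatcher and concatenate the partial frames.
// Only safe when the property getters are side effect free.
suspend inline fun <reified T> List<T>.parallelToDataFrame(chunkSize: Int): DataFrame<T> =
    coroutineScope {
        chunked(chunkSize)
            .map { chunk -> async(Dispatchers.IO) { chunk.toDataFrame() } }
            .awaitAll()
            .concat()
    }

Usage would then be runBlocking { list.parallelToDataFrame(chunkSize = 1000) }, with the chunk size tuned to how expensive the property getters are.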

add and convert can be heavy and involve IO too. For this I have something like the following in mind:

fun DataFrame<*>.awaitAll(selector: ColumnSelector<*, Deferred<*>>) = runBlocking {
    // Await every Deferred in the selected column and replace the column
    // with the computed values, inferring the new column type.
    val column = getColumn(selector)
    val values = column.toList().awaitAll()
    replace(selector).with(values.toColumn(column.name(), infer = Infer.Type))
}

Usage:

val df = runBlocking {
    otherDf.add("col") {
        async(Dispatchers.IO) {
            heavyCompute()
        }
    }.awaitAll { "col"() }
}
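
The same pattern should work for convert, since it also rewrites a column value by value. A sketch building on the awaitAll extension above; enrich here is just a stand-in for some IO-bound transformation:

val converted = runBlocking {
    otherDf.convert("col").with {
        // enrich is a hypothetical IO-bound transformation of the current value;
        // the column temporarily holds Deferred values until awaitAll resolves them
        async(Dispatchers.IO) { enrich(it) }
    }.awaitAll { "col"() }
}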

These two approaches can speed up dataframe code significantly in certain scenarios, so we can give them some visibility in the documentation.

koperagen added the research and help wanted labels on Jun 4, 2024
zaleslaw added this to the Backlog milestone on Jul 19, 2024