PERF: Try width and length caches before materializing all partition lengths/widths in Modin frame #4493

mvashishtha · 2022-05-25T10:46:52Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS Monterey 12.2.1
Modin version (modin.__version__): 0f70e82
Python Version: 3.9.12

Describe the problem

Currently I see two points in the Modin frame where we compute either all partition lengths for the first column of partitions, or all column widths for the first row of partitions:

1) modin/modin/core/dataframe/pandas/dataframe/dataframe.py, line 2380 in c736def:

def get_axis_lengths(partitions, axis):

2) modin/modin/core/dataframe/pandas/dataframe/dataframe.py, line 2152 in c736def:

def get_len(part):

The first of these, in _copartition, recently caused single-threaded execution for a frame with partitions of Decimal objects. Each frame had a transpose on the queue. Executing a multiply then caused the widths to be computed serially, so each partition's call queue was drained in sequence. As a result, the partitions put their transpose results in the object store one at a time, and slowly. (The objects took a while to put in the object store because Decimal data is slow to serialize.) However, in this case the lengths and widths were already cached, so there was no need to compute them at all.

Attached is a ray timeline and here is an image of the single threaded execution for the transpose in the middle (from a similar script).

Reproduction script

import numpy as np
import modin.pandas as pd
from decimal import Decimal

height = 50_000
width = 751
im = pd.DataFrame(np.random.randint(0, 2, size=(height, width)))
im = im.applymap(lambda x: Decimal(str(x)))
weight_vec = pd.Series(np.random.rand(height)).apply(lambda x: Decimal(str(x)))
print(im.T.multiply(weight_vec))


mvashishtha · 2022-05-25T11:24:09Z

It turns out that in case 1), reindexed_base has unknown axis lengths, because we might have to add elements along the axis to align with the other frame, e.g. for

import pandas as pd

A = pd.DataFrame([[1]])
B = pd.DataFrame([[2]], index=['b'])
print(A+B)

the new reindexed_base has length 2 ([0, 'b']) instead of 1

To fix the single-threadedness there we will need #4494. I don't see an easy fix. We could maybe add some extra code for the case where we don't expect the union with the other frames' indices to change the partition sizes.
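One way to picture the "extra code" idea is a cheap check on the index labels before any partition work is done. The sketch below is only an illustration under assumptions: the helper name can_reuse_axis_lengths is hypothetical and is not part of Modin, and it covers only the simplest case, where every other frame has exactly the same labels in the same order as the base frame, so aligning cannot add or reorder labels.

```python
import pandas as pd


def can_reuse_axis_lengths(base_index, other_indexes):
    """Hypothetical helper (not a Modin API).

    Return True when every other index has the same labels in the same
    order as the base index. In that case aligning the frames cannot add
    or reorder labels along the axis, so the base frame's cached
    partition sizes along that axis would still be valid.
    """
    return all(base_index.equals(other) for other in other_indexes)


# The example from this comment: aligning A with B adds the label 'b',
# so the cached length of A along the row axis cannot be reused.
A = pd.DataFrame([[1]])
B = pd.DataFrame([[2]], index=["b"])
print(can_reuse_axis_lengths(A.index, [B.index]))

# Identical indices: alignment would leave the axis unchanged.
C = pd.DataFrame([[3]])
print(can_reuse_axis_lengths(A.index, [C.index]))
```

A real check would probably need to be less strict than exact equality, but anything beyond that is outside what this comment establishes.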

For 2) I think we really can use self._column_widths and self._row_lengths. I will make a PR for that.
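As a rough illustration of the cache-first pattern, here is a toy sketch. It does not use Modin's classes: ToyPartition, ToyFrame, and the _row_lengths_cache attribute are all made up for this example, and the actual change in the PR may look different. The point is only that the per-partition length call (which in the real frame may force a call queue to drain) is skipped whenever a cached value exists.

```python
class ToyPartition:
    """Hypothetical stand-in for a partition whose length is costly to get."""

    def __init__(self, rows):
        self._rows = rows
        self.length_calls = 0  # counts how often the costly path is taken

    def length(self):
        # In a real frame this is the step that may block on pending work.
        self.length_calls += 1
        return len(self._rows)


class ToyFrame:
    """Hypothetical frame holding one column of partitions."""

    def __init__(self, partitions, row_lengths_cache=None):
        self._partitions = partitions
        self._row_lengths_cache = row_lengths_cache

    def row_lengths(self):
        # Try the cache first; only materialize lengths when it is empty.
        if self._row_lengths_cache is None:
            self._row_lengths_cache = [p.length() for p in self._partitions]
        return self._row_lengths_cache


parts = [ToyPartition([1, 2, 3]), ToyPartition([4, 5])]

# Lengths already known (e.g. carried over from a previous operation):
warm = ToyFrame(parts, row_lengths_cache=[3, 2])
warm_lengths = warm.row_lengths()   # no partition is touched

# Lengths unknown: computed once, then served from the cache.
cold = ToyFrame(parts)
cold_lengths = cold.row_lengths()
cold_lengths_again = cold.row_lengths()
```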


Co-authored-by: Devin Petersohn <devin-petersohn@users.noreply.github.com> Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: mvashishtha <mahesh@ponder.io>

mvashishtha added Performance 🚀 Performance related issues and pull requests. Internals Internal modin functionality labels May 25, 2022

mvashishtha mentioned this issue May 25, 2022

PERF: get all partition widths/lengths in parallel instead of serially. #4494

Closed

mvashishtha mentioned this issue May 25, 2022

PERF-#4493: Use partition size caches more in Modin dataframe. #4495

Merged

8 tasks

mvashishtha self-assigned this May 27, 2022

This was referenced May 27, 2022

PERF: Improve ray performance on Decimal data #4506

Open

BUG: Ray remote tasks internally blocking on object IDs of their dependencies can lead to deadlock #4507

Closed

mvashishtha pushed a commit to mvashishtha/modin that referenced this issue May 27, 2022

PERF-modin-project#4493: Use partition size caches more in Modin data…

83d786d

…frame. Signed-off-by: mvashishtha <mahesh@ponder.io>

YarShev closed this as completed in #4495 Jun 2, 2022

mvashishtha mentioned this issue Jul 21, 2022

PERF-#4494: Get partition widths/lengths in parallel instead of serially #4683

Draft

8 tasks
