Commit
FEAT-modin-project#2013: Merge remote-tracking branch 'upstream/master' into merge_asof/2013

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst committed Dec 1, 2020
2 parents 69c1031 + f571f69 commit 7ce40e7
Showing 23 changed files with 469 additions and 316 deletions.
5 changes: 0 additions & 5 deletions ci/teamcity/Dockerfile.modin-base

This file was deleted.

28 changes: 22 additions & 6 deletions ci/teamcity/Dockerfile.teamcity-ci
@@ -1,4 +1,12 @@
FROM modin-project/modin-base
# Create images from this container like this (in modin repo root):
#
# git rev-parse HEAD > ci/teamcity/git-rev
#
# tar cf ci/teamcity/modin.tar .
#
# docker build --build-arg ENVIRONMENT=environment.yml -t modin-project/teamcity-ci:${BUILD_NUMBER} -f ci/teamcity/Dockerfile.teamcity-ci ci/teamcity

FROM rayproject/ray:1.0.1

ARG ENVIRONMENT=environment.yml

@@ -10,13 +18,21 @@ WORKDIR /modin
# Make RUN commands use `bash --login`:
SHELL ["/bin/bash", "--login", "-c"]

RUN conda env create -f ${ENVIRONMENT}

# Initialize conda in bash config fiiles:
# Initialize conda in bash config files:
RUN conda init bash
ENV PATH /opt/conda/envs/modin/bin:$PATH
ENV PATH /root/anaconda3/envs/modin/bin:$PATH

RUN conda update python -y
RUN conda env create -f ${ENVIRONMENT}
RUN conda install curl PyGithub

# Activate the environment, and make sure it's activated:
# The following line also removes conda initialization from
# ~/.bashrc, so conda starts complaining that it should be
# initialized for bash. This is necessary because activation
# is not always executed when "docker exec" is used; in that
# case conda initialization would overwrite PATH with its base
# environment, where python doesn't have any packages installed.
RUN echo "conda activate modin" > ~/.bashrc
RUN echo "Make sure environment is activated"
RUN conda list
RUN conda list -n modin
2 changes: 1 addition & 1 deletion commitlint.config.js
@@ -2,7 +2,7 @@ module.exports = {
plugins: ['commitlint-plugin-jira-rules'],
extends: ['jira'],
rules: {
"header-max-length": [2, "always", 70],
"header-max-length": [2, "always", 88],
"signed-off-by": [2, "always", "Signed-off-by"],
"jira-task-id-max-length": [0, "always", 10],
"jira-task-id-project-key": [2, "always", ["FEAT", "DOCS", "FIX", "REFACTOR", "TEST"]],
69 changes: 67 additions & 2 deletions docs/comparisons/pandas.rst
@@ -1,4 +1,69 @@
Modin vs. Pandas
Modin vs. pandas
================

Coming Soon...
Modin exposes the pandas API through ``modin.pandas``, but it does not inherit the same
pitfalls and design decisions that make it difficult to scale. This page will discuss
how Modin's dataframe implementation differs from pandas, and how Modin scales pandas.

Scalability of implementation
-----------------------------

The pandas implementation is inherently single-threaded. This means that only one of
your CPU cores can be utilized at any given time. On a laptop, it would look something
like this with pandas:

.. image:: /img/pandas_multicore.png
:alt: pandas is single threaded!
:align: center
:scale: 80%

However, Modin's implementation enables you to use all of the cores on your machine, or
all of the cores in an entire cluster. On a laptop, it will look something like this:

.. image:: /img/modin_multicore.png
:alt: modin uses all of the cores!
:align: center
:scale: 80%

The additional utilization leads to improved performance. If you want to scale out
to an entire cluster, Modin looks something like this:

.. image:: /img/modin_cluster.png
:alt: modin works on a cluster too!
:align: center
:scale: 30%

Modin is able to efficiently make use of all of the hardware available to it!
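The core idea behind this scaling is simple to sketch: partition the data, apply the
operation to every partition in parallel, and stitch the results back together in
order. The following is a toy illustration of that idea using only the standard
library (`partitioned_apply` is a hypothetical name; Modin's real execution engines,
Ray and Dask, are far more sophisticated):

```python
from concurrent.futures import ThreadPoolExecutor


def partitioned_apply(values, func, n_partitions=4):
    """Split `values` into partitions, apply `func` to each element of each
    partition concurrently, and recombine the results in original order."""
    if not values:
        return []
    size = -(-len(values) // n_partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # pool.map preserves chunk order, so the recombined result is ordered.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(lambda chunk: [func(v) for v in chunk], chunks)
    return [v for chunk in results for v in chunk]
```

For example, `partitioned_apply(list(range(10)), lambda x: x * 2)` produces the same
result as a serial map, but each of the four partitions is processed concurrently.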

Memory usage and immutability
-----------------------------

The pandas API contains many cases of "inplace" updates, which are known to be
controversial. This is due in part to the way pandas manages memory: the user may
think they are saving memory, but pandas is usually copying the data whether an
operation was inplace or not.

Modin allows for inplace semantics, but the underlying data structures within Modin's
implementation are immutable, unlike pandas. This immutability gives Modin the ability
to internally chain operators and better manage memory layouts, because they will not
be changed. This leads to improvements over pandas in memory usage in many common cases,
due to the ability to share common memory blocks among all dataframes.

Modin provides the inplace semantics by having a mutable pointer to the immutable
internal Modin dataframe. This pointer can change, but the underlying data cannot, so
when an inplace update is triggered, Modin will treat it as if it were not inplace and
just update the pointer to the resulting Modin dataframe.
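The pointer-swap mechanism described above can be sketched in a few lines. This is a
minimal illustration of the pattern, not Modin's actual classes (`ImmutableFrame`,
`replace`, and `with_replaced` are hypothetical names):

```python
class ImmutableFrame:
    """Stand-in for Modin's internal immutable dataframe."""

    def __init__(self, data):
        self._data = tuple(data)  # immutable storage

    def with_replaced(self, old, new):
        # Operations never mutate; they always return a new frame.
        return ImmutableFrame(new if v == old else v for v in self._data)


class DataFrame:
    """Stand-in for the user-facing object: a mutable pointer to an
    immutable internal frame."""

    def __init__(self, data):
        self._frame = ImmutableFrame(data)

    def replace(self, old, new, inplace=False):
        result = self._frame.with_replaced(old, new)
        if inplace:
            self._frame = result  # "inplace" just swaps the pointer
            return None
        out = DataFrame([])
        out._frame = result
        return out
```

An inplace `replace` leaves the old internal frame untouched, so other dataframes
sharing it are unaffected; only this object's pointer moves to the new frame.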

API vs implementation
---------------------

It is well known that the pandas API contains many duplicate ways of performing the same
operation. Modin instead enforces that any one behavior have one and only one
implementation internally. This guarantee enables Modin to focus on and optimize a
smaller code footprint while still guaranteeing that it covers the entire pandas API.
Modin has an internal algebra of roughly 15 operators, narrowed down from the more
than 200 operators that exist in pandas. The algebra is grounded in both practical and
theoretical work; learn more in our `VLDB 2020 paper`_. More information about this
algebra can be found in the :doc:`../developer/architecture` documentation.

.. _VLDB 2020 paper: https://arxiv.org/abs/2001.00888
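The one-behavior, one-implementation idea can be illustrated with a toy "map"
operator that backs several distinct pandas-style API methods. These function names
are hypothetical and the real operators work over partitioned dataframes, but the
shape of the design is the same: many API entry points, one internal implementation.

```python
# One internal elementwise operator...
def map_operator(partition, func):
    """Apply an elementwise function to every value in a partition."""
    return [func(v) for v in partition]


# ...backs several distinct API-level methods, so only map_operator
# ever needs to be optimized or ported to a new execution engine.
def isna(partition):
    return map_operator(partition, lambda v: v is None)


def abs_(partition):
    return map_operator(partition, abs)


def negate(partition):
    return map_operator(partition, lambda v: -v)
```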
Binary file added docs/img/modin_cluster.png
Binary file added docs/img/modin_multicore.png
Binary file added docs/img/pandas_multicore.png
6 changes: 2 additions & 4 deletions docs/supported_apis/series_supported.rst
@@ -474,10 +474,8 @@ the related section on `Defaulting to pandas`_.
+-----------------------------+---------------------------------+----------------------------------------------------+
| ``valid`` | D | |
+-----------------------------+---------------------------------+----------------------------------------------------+
| ``value_counts`` | Y | The indices of resulting object will be in |
| | | descending (ascending, if ascending=True) order for|
| | | equal values. |
| | | In pandas indices are located in random order. |
| ``value_counts`` | Y | The indices order of resulting object may differ |
| | | from pandas. |
+-----------------------------+---------------------------------+----------------------------------------------------+
| ``values`` | Y | |
+-----------------------------+---------------------------------+----------------------------------------------------+
6 changes: 2 additions & 4 deletions docs/supported_apis/utilities_supported.rst
@@ -21,10 +21,8 @@ default to pandas.
+---------------------------+---------------------------------+----------------------------------------------------+
| `pd.unique`_ | Y | |
+---------------------------+---------------------------------+----------------------------------------------------+
| ``pd.value_counts`` | Y | The indices of resulting object will be in |
| | | descending (ascending, if ascending=True) order for|
| | | equal values. |
| | | In pandas indices are located in random order. |
| ``pd.value_counts`` | Y | The indices order of resulting object may differ |
| | | from pandas. |
+---------------------------+---------------------------------+----------------------------------------------------+
| `pd.cut`_ | D | |
+---------------------------+---------------------------------+----------------------------------------------------+
2 changes: 1 addition & 1 deletion modin/backends/pandas/parsers.py
@@ -221,7 +221,7 @@ def parse(fname, **kwargs):
ws = Worksheet(wb)
# Read the raw data
with ZipFile(fname) as z:
with z.open("xl/worksheets/{}.xml".format(sheet_name.lower())) as file:
with z.open("xl/worksheets/{}.xml".format(sheet_name)) as file:
file.seek(start)
bytes_data = file.read(end - start)

53 changes: 1 addition & 52 deletions modin/backends/pandas/query_compiler.py
@@ -750,58 +750,7 @@ def reduce_func(df, *args, **kwargs):
if normalize:
result = result / df.squeeze(axis=1).sum()

result = result.sort_values(ascending=ascending) if sort else result

# We want to sort both values and indices of the result object.
# This function will sort indices for equal values.
def sort_index_for_equal_values(result, ascending):
"""
Sort indices for equal values of result object.
Parameters
----------
result : pandas.Series or pandas.DataFrame with one column
The object whose indices for equal values is needed to sort.
ascending : boolean
Sort in ascending (if it is True) or descending (if it is False) order.
Returns
-------
pandas.DataFrame
A new DataFrame with sorted indices.
"""
is_range = False
is_end = False
i = 0
new_index = np.empty(len(result), dtype=type(result.index))
while i < len(result):
j = i
if i < len(result) - 1:
while result[result.index[i]] == result[result.index[i + 1]]:
i += 1
if is_range is False:
is_range = True
if i == len(result) - 1:
is_end = True
break
if is_range:
k = j
for val in sorted(
result.index[j : i + 1], reverse=not ascending
):
new_index[k] = val
k += 1
if is_end:
break
is_range = False
else:
new_index[j] = result.index[j]
i += 1
return pandas.DataFrame(
result, index=new_index, columns=["__reduced__"]
)

return sort_index_for_equal_values(result, ascending)
return result.sort_values(ascending=ascending) if sort else result

return MapReduceFunction.register(
map_func, reduce_func, axis=0, preserve_index=False
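The deleted helper existed to make the tie order of `value_counts` deterministic:
after sorting by count, indices with equal counts were themselves sorted in the same
direction. This change drops that guarantee (the docs now say only that tie order may
differ from pandas). A pure-Python sketch of the ordering the removed code enforced
(`value_counts_sorted` is a hypothetical name, not Modin's API):

```python
def value_counts_sorted(values, ascending=False):
    """Count occurrences and return (index, count) pairs sorted by count,
    breaking ties on equal counts by sorting the indices themselves in
    the same direction -- the behavior the removed helper guaranteed."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # Sort by (count, index); for descending output, reverse both keys.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]),
                  reverse=not ascending)
```

For `["a", "b", "b", "c", "c"]` this yields `("c", 2)` before `("b", 2)` in the
default descending order, whereas plain `sort_values` leaves the order of equal-count
indices unspecified.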
8 changes: 4 additions & 4 deletions modin/config/envvars.py
@@ -16,7 +16,7 @@
import warnings
from packaging import version

from .pubsub import Parameter, _TYPE_PARAMS
from .pubsub import Parameter, _TYPE_PARAMS, ExactStr


class EnvironmentVariable(Parameter, type=str, abstract=True):
@@ -112,7 +112,7 @@ class IsRayCluster(EnvironmentVariable, type=bool):
varname = "MODIN_RAY_CLUSTER"


class RayRedisAddress(EnvironmentVariable, type=str):
class RayRedisAddress(EnvironmentVariable, type=ExactStr):
"""
What Redis address to connect to when running in Ray cluster
"""
@@ -142,7 +142,7 @@ class Memory(EnvironmentVariable, type=int):
varname = "MODIN_MEMORY"


class RayPlasmaDir(EnvironmentVariable, type=str):
class RayPlasmaDir(EnvironmentVariable, type=ExactStr):
"""
Path to Plasma storage for Ray
"""
@@ -158,7 +158,7 @@ class IsOutOfCore(EnvironmentVariable, type=bool):
varname = "MODIN_OUT_OF_CORE"


class SocksProxy(EnvironmentVariable, type=str):
class SocksProxy(EnvironmentVariable, type=ExactStr):
"""
SOCKS proxy address if it is needed for SSH to work
"""
12 changes: 12 additions & 0 deletions modin/config/pubsub.py
@@ -22,11 +22,23 @@ class TypeDescriptor(typing.NamedTuple):
help: str


class ExactStr(str):
"""
To be used in type params where no transformations are needed
"""


_TYPE_PARAMS = {
str: TypeDescriptor(
decode=lambda value: value.strip().title(),
normalize=lambda value: value.strip().title(),
verify=lambda value: True,
help="a case-insensitive string",
),
ExactStr: TypeDescriptor(
decode=lambda value: value,
normalize=lambda value: value,
verify=lambda value: True,
help="a string",
),
bool: TypeDescriptor(
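The motivation for `ExactStr` is that the plain `str` descriptor normalizes values
with `.strip().title()`, which would corrupt case-sensitive settings such as Redis
addresses and Plasma paths. A self-contained sketch mirroring the pattern in this
diff (the class and dict names follow the diff, but this is a simplified model, not
the real `modin.config` module):

```python
from typing import Callable, NamedTuple


class TypeDescriptor(NamedTuple):
    decode: Callable[[str], str]
    normalize: Callable[[str], str]
    verify: Callable[[str], bool]
    help: str


class ExactStr(str):
    """Marker type for params where no transformation should be applied."""


TYPE_PARAMS = {
    # Plain str values are case-normalized, which suits enum-like settings.
    str: TypeDescriptor(
        decode=lambda v: v.strip().title(),
        normalize=lambda v: v.strip().title(),
        verify=lambda v: True,
        help="a case-insensitive string",
    ),
    # ExactStr values pass through untouched, preserving paths and addresses.
    ExactStr: TypeDescriptor(
        decode=lambda v: v,
        normalize=lambda v: v,
        verify=lambda v: True,
        help="a string",
    ),
}
```

With this split, `TYPE_PARAMS[str].normalize("  ray  ")` yields `"Ray"`, while
`TYPE_PARAMS[ExactStr].normalize` leaves a value like a filesystem path exactly as
given.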
10 changes: 5 additions & 5 deletions modin/config/test/test_envvars.py
@@ -14,7 +14,7 @@
import os
import pytest

from modin.config.envvars import EnvironmentVariable, _check_vars
from modin.config.envvars import EnvironmentVariable, _check_vars, ExactStr


@pytest.fixture
@@ -25,9 +25,9 @@ def make_unknown_env():
del os.environ[varname]


@pytest.fixture
def make_custom_envvar():
class CustomVar(EnvironmentVariable, type=str):
@pytest.fixture(params=[str, ExactStr])
def make_custom_envvar(request):
class CustomVar(EnvironmentVariable, type=request.param):
""" custom var """

default = 10
@@ -40,7 +40,7 @@ class CustomVar(EnvironmentVariable, type=str):
@pytest.fixture
def set_custom_envvar(make_custom_envvar):
os.environ[make_custom_envvar.varname] = " custom "
yield "Custom"
yield "Custom" if make_custom_envvar.type is str else " custom "
del os.environ[make_custom_envvar.varname]


9 changes: 4 additions & 5 deletions modin/engines/base/frame/axis_partition.py
@@ -11,13 +11,12 @@
# ANY KIND, either express or implied. See the License for the specific language
# governing permissions and limitations under the License.

from abc import ABC
import pandas
from modin.data_management.utils import split_result_of_axis_func_pandas

NOT_IMPLMENTED_MESSAGE = "Must be implemented in child class"


class BaseFrameAxisPartition(object): # pragma: no cover
class BaseFrameAxisPartition(ABC): # pragma: no cover
"""An abstract class that represents the Parent class for any `ColumnPartition` or `RowPartition` class.
This class is intended to simplify the way that operations are performed.
@@ -73,7 +72,7 @@ def apply(
-------
A list of `BaseFramePartition` objects.
"""
raise NotImplementedError(NOT_IMPLMENTED_MESSAGE)
pass

def shuffle(self, func, lengths, **kwargs):
"""Shuffle the order of the data in this axis based on the `lengths`.
@@ -86,7 +85,7 @@ def shuffle(self, func, lengths, **kwargs):
-------
A list of RemotePartition objects split by `lengths`.
"""
raise NotImplementedError(NOT_IMPLMENTED_MESSAGE)
pass

# Child classes must have these in order to correctly subclass.
instance_type = None
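The change in this file swaps `raise NotImplementedError` stubs for an `ABC` base
class whose hook methods simply `pass`, so concrete engine classes override them. A
minimal sketch of that revised pattern (hypothetical class names, not Modin's actual
partition classes; note that without `@abstractmethod` the base remains
instantiable, and an un-overridden hook silently returns `None`):

```python
from abc import ABC


class BasePartition(ABC):
    """Abstract parent; concrete execution engines override these hooks."""

    def apply(self, func, *args, **kwargs):
        pass  # overridden by each engine; returns None if not overridden

    # Child classes set these in order to correctly subclass.
    instance_type = None


class ListPartition(BasePartition):
    """A toy concrete engine that stores its partition as a list."""

    instance_type = list

    def __init__(self, data):
        self.data = list(data)

    def apply(self, func, *args, **kwargs):
        return func(self.data, *args, **kwargs)
```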