Commit
FEAT-modin-project#2013: Merge remote-tracking branch 'upstream/master' into merge_asof/2013

Signed-off-by: Itamar Turner-Trauring <itamar@itamarst.org>
itamarst committed Dec 1, 2020
2 parents 69c1031 + f571f69 commit 7ce40e7
Showing 23 changed files with 469 additions and 316 deletions.
5 changes: 0 additions & 5 deletions ci/teamcity/Dockerfile.modin-base

This file was deleted.

28 changes: 22 additions & 6 deletions ci/teamcity/Dockerfile.teamcity-ci
@@ -1,4 +1,12 @@
FROM modin-project/modin-base
# Create images from this container like this (in modin repo root):
#
# git rev-parse HEAD > ci/teamcity/git-rev
#
# tar cf ci/teamcity/modin.tar .
#
# docker build --build-arg ENVIRONMENT=environment.yml -t modin-project/teamcity-ci:${BUILD_NUMBER} -f ci/teamcity/Dockerfile.teamcity-ci ci/teamcity

FROM rayproject/ray:1.0.1

ARG ENVIRONMENT=environment.yml

@@ -10,13 +18,21 @@ WORKDIR /modin
# Make RUN commands use `bash --login`:
SHELL ["/bin/bash", "--login", "-c"]

RUN conda env create -f ${ENVIRONMENT}

# Initialize conda in bash config fiiles:
# Initialize conda in bash config files:
RUN conda init bash
ENV PATH /opt/conda/envs/modin/bin:$PATH
ENV PATH /root/anaconda3/envs/modin/bin:$PATH

RUN conda update python -y
RUN conda env create -f ${ENVIRONMENT}
RUN conda install curl PyGithub

# Activate the environment, and make sure it's activated:
# The following line also removes conda initialization from
# ~/.bashrc, so conda starts complaining that it should be
# initialized for bash. This is necessary because activation
# is not always executed when "docker exec" is used; in that
# case conda initialization would overwrite PATH with its base
# environment, where python doesn't have any packages installed.
RUN echo "conda activate modin" > ~/.bashrc
RUN echo "Make sure environment is activated"
RUN conda list
RUN conda list -n modin
2 changes: 1 addition & 1 deletion commitlint.config.js
@@ -2,7 +2,7 @@ module.exports = {
plugins: ['commitlint-plugin-jira-rules'],
extends: ['jira'],
rules: {
"header-max-length": [2, "always", 70],
"header-max-length": [2, "always", 88],
"signed-off-by": [2, "always", "Signed-off-by"],
"jira-task-id-max-length": [0, "always", 10],
"jira-task-id-project-key": [2, "always", ["FEAT", "DOCS", "FIX", "REFACTOR", "TEST"]],
69 changes: 67 additions & 2 deletions docs/comparisons/pandas.rst
@@ -1,4 +1,69 @@
Modin vs. Pandas
Modin vs. pandas
================

Coming Soon...
Modin exposes the pandas API through ``modin.pandas``, but it does not inherit the same
pitfalls and design decisions that make it difficult to scale. This page will discuss
how Modin's dataframe implementation differs from pandas, and how Modin scales pandas.

Scalability of implementation
-----------------------------

The pandas implementation is inherently single-threaded. This means that only one of
your CPU cores can be utilized at any given time. On a laptop, it would look something
like this with pandas:

.. image:: /img/pandas_multicore.png
:alt: pandas is single threaded!
:align: center
:scale: 80%

However, Modin's implementation enables you to use all of the cores on your machine, or
all of the cores in an entire cluster. On a laptop, it will look something like this:

.. image:: /img/modin_multicore.png
:alt: modin uses all of the cores!
:align: center
:scale: 80%

The additional utilization leads to improved performance. If you want to scale out
to an entire cluster, Modin looks something like this:

.. image:: /img/modin_cluster.png
:alt: modin works on a cluster too!
:align: center
:scale: 30%

Modin is able to efficiently make use of all of the hardware available to it!
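The core idea behind this scaling is simple to sketch: partition the data, apply the
operation to every partition in parallel, and stitch the results back together in
order. The following is a toy illustration of that idea using only the standard
library (`partitioned_apply` is a hypothetical name; Modin's real execution engines,
Ray and Dask, are far more sophisticated):

```python
from concurrent.futures import ThreadPoolExecutor


def partitioned_apply(values, func, n_partitions=4):
    """Split `values` into partitions, apply `func` to each element of each
    partition concurrently, and recombine the results in original order."""
    if not values:
        return []
    size = -(-len(values) // n_partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # pool.map preserves chunk order, so the recombined result is ordered.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(lambda chunk: [func(v) for v in chunk], chunks)
    return [v for chunk in results for v in chunk]
```

For example, `partitioned_apply(list(range(10)), lambda x: x * 2)` produces the same
result as a serial map, but each of the four partitions is processed concurrently.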

Memory usage and immutability
-----------------------------

The pandas API contains many cases of "inplace" updates, which are known to be
controversial. This is due in part to the way pandas manages memory: the user may
think they are saving memory, but pandas is usually copying the data whether an
operation was inplace or not.

Modin allows for inplace semantics, but the underlying data structures within Modin's
implementation are immutable, unlike pandas. This immutability gives Modin the ability
to internally chain operators and better manage memory layouts, because they will not
be changed. This leads to improvements over pandas in memory usage in many common cases,
due to the ability to share common memory blocks among all dataframes.

Modin provides the inplace semantics by having a mutable pointer to the immutable
internal Modin dataframe. This pointer can change, but the underlying data cannot, so
when an inplace update is triggered, Modin will treat it as if it were not inplace and
just update the pointer to the resulting Modin dataframe.
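The pointer-swap mechanism described above can be sketched in a few lines. This is a
minimal illustration of the pattern, not Modin's actual classes (`ImmutableFrame`,
`replace`, and `with_replaced` are hypothetical names):

```python
class ImmutableFrame:
    """Stand-in for Modin's internal immutable dataframe."""

    def __init__(self, data):
        self._data = tuple(data)  # immutable storage

    def with_replaced(self, old, new):
        # Operations never mutate; they always return a new frame.
        return ImmutableFrame(new if v == old else v for v in self._data)


class DataFrame:
    """Stand-in for the user-facing object: a mutable pointer to an
    immutable internal frame."""

    def __init__(self, data):
        self._frame = ImmutableFrame(data)

    def replace(self, old, new, inplace=False):
        result = self._frame.with_replaced(old, new)
        if inplace:
            self._frame = result  # "inplace" just swaps the pointer
            return None
        out = DataFrame([])
        out._frame = result
        return out
```

An inplace `replace` leaves the old internal frame untouched, so other dataframes
sharing it are unaffected; only this object's pointer moves to the new frame.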

API vs implementation
---------------------

It is well known that the pandas API contains many duplicate ways of performing the same
operation. Modin instead enforces that any one behavior have one and only one
implementation internally. This guarantee enables Modin to focus on and optimize a
smaller code footprint while still guaranteeing that it covers the entire pandas API.
Modin has an internal algebra of roughly 15 operators, narrowed down from the more
than 200 operators that exist in pandas. The algebra is grounded in both practical and
theoretical work; learn more in our `VLDB 2020 paper`_. More information about this
algebra can be found in the :doc:`../developer/architecture` documentation.

.. _VLDB 2020 paper: https://arxiv.org/abs/2001.00888
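The one-behavior, one-implementation idea can be illustrated with a toy "map"
operator that backs several distinct pandas-style API methods. These function names
are hypothetical and the real operators work over partitioned dataframes, but the
shape of the design is the same: many API entry points, one internal implementation.

```python
# One internal elementwise operator...
def map_operator(partition, func):
    """Apply an elementwise function to every value in a partition."""
    return [func(v) for v in partition]


# ...backs several distinct API-level methods, so only map_operator
# ever needs to be optimized or ported to a new execution engine.
def isna(partition):
    return map_operator(partition, lambda v: v is None)


def abs_(partition):
    return map_operator(partition, abs)


def negate(partition):
    return map_operator(partition, lambda v: -v)
```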
Binary file added docs/img/modin_cluster.png
Binary file added docs/img/modin_multicore.png
Binary file added docs/img/pandas_multicore.png
6 changes: 2 additions & 4 deletions docs/supported_apis/series_supported.rst
@@ -474,10 +474,8 @@ the related section on `Defaulting to pandas`_.
+-----------------------------+---------------------------------+----------------------------------------------------+
| ``valid`` | D | |
+-----------------------------+---------------------------------+----------------------------------------------------+
| ``value_counts`` | Y | The indices of resulting object will be in |
| | | descending (ascending, if ascending=True) order for|
| | | equal values. |
| | | In pandas indices are located in random order. |
| ``value_counts`` | Y | The indices order of resulting object may differ |
| | | from pandas. |
+-----------------------------+---------------------------------+----------------------------------------------------+
| ``values`` | Y | |
+-----------------------------+---------------------------------+----------------------------------------------------+
6 changes: 2 additions & 4 deletions docs/supported_apis/utilities_supported.rst
@@ -21,10 +21,8 @@ default to pandas.
+---------------------------+---------------------------------+----------------------------------------------------+
| `pd.unique`_ | Y | |
+---------------------------+---------------------------------+----------------------------------------------------+
| ``pd.value_counts`` | Y | The indices of resulting object will be in |
| | | descending (ascending, if ascending=True) order for|
| | | equal values. |
| | | In pandas indices are located in random order. |
| ``pd.value_counts`` | Y | The indices order of resulting object may differ |
| | | from pandas. |
+---------------------------+---------------------------------+----------------------------------------------------+
| `pd.cut`_ | D | |
+---------------------------+---------------------------------+----------------------------------------------------+
2 changes: 1 addition & 1 deletion modin/backends/pandas/parsers.py
@@ -221,7 +221,7 @@ def parse(fname, **kwargs):
ws = Worksheet(wb)
# Read the raw data
with ZipFile(fname) as z:
with z.open("xl/worksheets/{}.xml".format(sheet_name.lower())) as file:
with z.open("xl/worksheets/{}.xml".format(sheet_name)) as file:
file.seek(start)
bytes_data = file.read(end - start)

53 changes: 1 addition & 52 deletions modin/backends/pandas/query_compiler.py
@@ -750,58 +750,7 @@ def reduce_func(df, *args, **kwargs):
if normalize:
result = result / df.squeeze(axis=1).sum()

result = result.sort_values(ascending=ascending) if sort else result

# We want to sort both values and indices of the result object.
# This function will sort indices for equal values.
def sort_index_for_equal_values(result, ascending):
"""
Sort indices for equal values of result object.
Parameters
----------
result : pandas.Series or pandas.DataFrame with one column
The object whose indices for equal values is needed to sort.
ascending : boolean
Sort in ascending (if it is True) or descending (if it is False) order.
Returns
-------
pandas.DataFrame
A new DataFrame with sorted indices.
"""
is_range = False
is_end = False
i = 0
new_index = np.empty(len(result), dtype=type(result.index))
while i < len(result):
j = i
if i < len(result) - 1:
while result[result.index[i]] == result[result.index[i + 1]]:
i += 1
if is_range is False:
is_range = True
if i == len(result) - 1:
is_end = True
break
if is_range:
k = j
for val in sorted(
result.index[j : i + 1], reverse=not ascending
):
new_index[k] = val
k += 1
if is_end:
break
is_range = False
else:
new_index[j] = result.index[j]
i += 1
return pandas.DataFrame(
result, index=new_index, columns=["__reduced__"]
)

return sort_index_for_equal_values(result, ascending)
return result.sort_values(ascending=ascending) if sort else result

return MapReduceFunction.register(
map_func, reduce_func, axis=0, preserve_index=False
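The deleted helper existed to make the tie order of `value_counts` deterministic:
after sorting by count, indices with equal counts were themselves sorted in the same
direction. This change drops that guarantee (the docs now say only that tie order may
differ from pandas). A pure-Python sketch of the ordering the removed code enforced
(`value_counts_sorted` is a hypothetical name, not Modin's API):

```python
def value_counts_sorted(values, ascending=False):
    """Count occurrences and return (index, count) pairs sorted by count,
    breaking ties on equal counts by sorting the indices themselves in
    the same direction -- the behavior the removed helper guaranteed."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # Sort by (count, index); for descending output, reverse both keys.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]),
                  reverse=not ascending)
```

For `["a", "b", "b", "c", "c"]` this yields `("c", 2)` before `("b", 2)` in the
default descending order, whereas plain `sort_values` leaves the order of equal-count
indices unspecified.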
8 changes: 4 additions & 4 deletions modin/config/envvars.py
@@ -16,7 +16,7 @@
import warnings
from packaging import version

from .pubsub import Parameter, _TYPE_PARAMS
from .pubsub import Parameter, _TYPE_PARAMS, ExactStr


class EnvironmentVariable(Parameter, type=str, abstract=True):
@@ -112,7 +112,7 @@ class IsRayCluster(EnvironmentVariable, type=bool):
varname = "MODIN_RAY_CLUSTER"


class RayRedisAddress(EnvironmentVariable, type=str):
class RayRedisAddress(EnvironmentVariable, type=ExactStr):
"""
What Redis address to connect to when running in Ray cluster
"""
@@ -142,7 +142,7 @@ class Memory(EnvironmentVariable, type=int):
varname = "MODIN_MEMORY"


class RayPlasmaDir(EnvironmentVariable, type=str):
class RayPlasmaDir(EnvironmentVariable, type=ExactStr):
"""
Path to Plasma storage for Ray
"""
@@ -158,7 +158,7 @@ class IsOutOfCore(EnvironmentVariable, type=bool):
varname = "MODIN_OUT_OF_CORE"


class SocksProxy(EnvironmentVariable, type=str):
class SocksProxy(EnvironmentVariable, type=ExactStr):
"""
SOCKS proxy address if it is needed for SSH to work
"""
12 changes: 12 additions & 0 deletions modin/config/pubsub.py
@@ -22,11 +22,23 @@ class TypeDescriptor(typing.NamedTuple):
help: str


class ExactStr(str):
"""
To be used in type params where no transformations are needed
"""


_TYPE_PARAMS = {
str: TypeDescriptor(
decode=lambda value: value.strip().title(),
normalize=lambda value: value.strip().title(),
verify=lambda value: True,
help="a case-insensitive string",
),
ExactStr: TypeDescriptor(
decode=lambda value: value,
normalize=lambda value: value,
verify=lambda value: True,
help="a string",
),
bool: TypeDescriptor(
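The motivation for `ExactStr` is that the plain `str` descriptor normalizes values
with `.strip().title()`, which would corrupt case-sensitive settings such as Redis
addresses and Plasma paths. A self-contained sketch mirroring the pattern in this
diff (the class and dict names follow the diff, but this is a simplified model, not
the real `modin.config` module):

```python
from typing import Callable, NamedTuple


class TypeDescriptor(NamedTuple):
    decode: Callable[[str], str]
    normalize: Callable[[str], str]
    verify: Callable[[str], bool]
    help: str


class ExactStr(str):
    """Marker type for params where no transformation should be applied."""


TYPE_PARAMS = {
    # Plain str values are case-normalized, which suits enum-like settings.
    str: TypeDescriptor(
        decode=lambda v: v.strip().title(),
        normalize=lambda v: v.strip().title(),
        verify=lambda v: True,
        help="a case-insensitive string",
    ),
    # ExactStr values pass through untouched, preserving paths and addresses.
    ExactStr: TypeDescriptor(
        decode=lambda v: v,
        normalize=lambda v: v,
        verify=lambda v: True,
        help="a string",
    ),
}
```

With this split, `TYPE_PARAMS[str].normalize("  ray  ")` yields `"Ray"`, while
`TYPE_PARAMS[ExactStr].normalize` leaves a value like a filesystem path exactly as
given.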
10 changes: 5 additions & 5 deletions modin/config/test/test_envvars.py
@@ -14,7 +14,7 @@
import os
import pytest

from modin.config.envvars import EnvironmentVariable, _check_vars
from modin.config.envvars import EnvironmentVariable, _check_vars, ExactStr


@pytest.fixture
@@ -25,9 +25,9 @@ def make_unknown_env():
del os.environ[varname]


@pytest.fixture
def make_custom_envvar():
class CustomVar(EnvironmentVariable, type=str):
@pytest.fixture(params=[str, ExactStr])
def make_custom_envvar(request):
class CustomVar(EnvironmentVariable, type=request.param):
""" custom var """

default = 10
@@ -40,7 +40,7 @@ class CustomVar(EnvironmentVariable, type=str):
@pytest.fixture
def set_custom_envvar(make_custom_envvar):
os.environ[make_custom_envvar.varname] = " custom "
yield "Custom"
yield "Custom" if make_custom_envvar.type is str else " custom "
del os.environ[make_custom_envvar.varname]


9 changes: 4 additions & 5 deletions modin/engines/base/frame/axis_partition.py
@@ -11,13 +11,12 @@
# ANY KIND, either express or implied. See the License for the specific language
# governing permissions and limitations under the License.

from abc import ABC
import pandas
from modin.data_management.utils import split_result_of_axis_func_pandas

NOT_IMPLMENTED_MESSAGE = "Must be implemented in child class"


class BaseFrameAxisPartition(object): # pragma: no cover
class BaseFrameAxisPartition(ABC): # pragma: no cover
"""An abstract class that represents the Parent class for any `ColumnPartition` or `RowPartition` class.
This class is intended to simplify the way that operations are performed.
@@ -73,7 +72,7 @@ def apply(
-------
A list of `BaseFramePartition` objects.
"""
raise NotImplementedError(NOT_IMPLMENTED_MESSAGE)
pass

def shuffle(self, func, lengths, **kwargs):
"""Shuffle the order of the data in this axis based on the `lengths`.
@@ -86,7 +85,7 @@ def shuffle(self, func, lengths, **kwargs):
-------
A list of RemotePartition objects split by `lengths`.
"""
raise NotImplementedError(NOT_IMPLMENTED_MESSAGE)
pass

# Child classes must have these in order to correctly subclass.
instance_type = None
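The change in this file swaps `raise NotImplementedError` stubs for an `ABC` base
class whose hook methods simply `pass`, so concrete engine classes override them. A
minimal sketch of that revised pattern (hypothetical class names, not Modin's actual
partition classes; note that without `@abstractmethod` the base remains
instantiable, and an un-overridden hook silently returns `None`):

```python
from abc import ABC


class BasePartition(ABC):
    """Abstract parent; concrete execution engines override these hooks."""

    def apply(self, func, *args, **kwargs):
        pass  # overridden by each engine; returns None if not overridden

    # Child classes set these in order to correctly subclass.
    instance_type = None


class ListPartition(BasePartition):
    """A toy concrete engine that stores its partition as a list."""

    instance_type = list

    def __init__(self, data):
        self.data = list(data)

    def apply(self, func, *args, **kwargs):
        return func(self.data, *args, **kwargs)
```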