REFACTOR-#4513: Fix spelling mistakes in docs and docstrings (#4514)
Co-authored-by: Rehan Sohail Durrani <rdurrani@berkeley.edu>
Signed-off-by: jeffreykennethli <jkli@ponder.io>
jeffreykennethli and RehanSD authored Jun 6, 2022
1 parent c1d5dbd commit 57e29bc
Showing 43 changed files with 152 additions and 151 deletions.
2 changes: 1 addition & 1 deletion docs/development/contributing.rst
@@ -13,7 +13,7 @@ want to review in order to get started.

Also, feel free to join the discussions on the `developer mailing list`_.

-If you want a quick guide to getting your development enviroment setup, please
+If you want a quick guide to getting your development environment setup, please
use `the contributing instructions on GitHub`_.

Certificate of Origin
2 changes: 1 addition & 1 deletion docs/development/partition_api.rst
@@ -9,7 +9,7 @@ from raw futures objects.
Partition IPs
-------------
For finer grained placement control, Modin also provides an API to get the IP addresses of the nodes that hold each partition.
-You can pass the partitions having needed IPs to your function. It can help with minimazing of data movement between nodes.
+You can pass the partitions having needed IPs to your function. It can help with minimizing data movement between nodes.
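
A hedged sketch of what this looks like in practice, using the ``unwrap_partitions`` helper from Modin's partition API (treat the exact signature as an assumption and check the API reference for your Modin version):

.. code-block:: python

    import modin.pandas as pd
    from modin.distributed.dataframe.pandas import unwrap_partitions

    df = pd.read_csv("large.csv")

    # Ask for (IP, partition) pairs so work can be scheduled near the data.
    partitions_with_ips = unwrap_partitions(df, axis=0, get_ip=True)
    for ip, partition in partitions_with_ips:
        print(ip, partition)  # e.g. route `partition` to a worker on `ip`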

Partition API implementations
-----------------------------
2 changes: 1 addition & 1 deletion docs/flow/modin/config.rst
@@ -5,7 +5,7 @@ Modin Configuration Settings

To adjust Modin's default behavior, you can set the value of Modin
configs by setting an environment variable or by using the
-``modin.config`` API. To list all avaliable configs in Modin, please
+``modin.config`` API. To list all available configs in Modin, please
run ``python -m modin.config`` to print all
Modin configs with descriptions.
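
For example, a minimal sketch of both approaches (``Engine`` and ``NPartitions`` are existing Modin configs; consult the generated list for everything else):

.. code-block:: python

    import modin.config as cfg

    cfg.Engine.put("dask")    # same effect as: export MODIN_ENGINE=dask
    cfg.NPartitions.put(16)   # same effect as: export MODIN_NPARTITIONS=16
    print(cfg.Engine.get())   # -> "dask"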

2 changes: 1 addition & 1 deletion docs/flow/modin/core/dataframe/index.rst
@@ -3,7 +3,7 @@
Core Modin Dataframe Objects
============================

-Modin paritions data to scale efficiently.
+Modin partitions data to scale efficiently.
To keep track of everything a few key classes are introduced: ``Dataframe``, ``Partition``, ``AxisPartiton`` and ``PartitionManager``.

* ``Dataframe`` is the class conforming to Dataframe Algebra.
2 changes: 1 addition & 1 deletion docs/flow/modin/core/dataframe/pandas/dataframe.rst
@@ -8,7 +8,7 @@ The class serves as the intermediate level
between ``pandas`` query compiler and conforming partition manager. All queries formed
at the query compiler layer are ingested by this class and then conveyed jointly with the stored partitions
into the partition manager for processing. Direct partitions manipulation by this class is prohibited except
-cases if an operation is striclty private or protected and called inside of the class only. The class provides
+cases if an operation is strictly private or protected and called inside of the class only. The class provides
significantly reduced set of operations that fit plenty of pandas operations.

Main tasks of :py:class:`~modin.core.dataframe.pandas.dataframe.dataframe.PandasDataframe` are storage of partitions, manipulation with labels of axes and
@@ -18,7 +18,7 @@ Partition manager can apply user-passed (arbitrary) function in different modes:

* block-wise (apply a function to individual block partitions):

-* optinally accepting partition indices along each axis
+* optionally accepting partition indices along each axis
* optionally accepting an item to be split so parts of it would be sent to each partition

* along a full axis (apply a function to an entire column or row made up of block partitions when user function needs information about the whole axis)
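
The difference between the two modes can be pictured with plain pandas (a conceptual toy, not the actual ``PandasDataframePartitionManager`` API):

.. code-block:: python

    import numpy as np
    import pandas as pd

    # A 2x2 grid of pandas DataFrames standing in for block partitions.
    blocks = [[pd.DataFrame(np.arange(4).reshape(2, 2)) for _ in range(2)]
              for _ in range(2)]

    # Block-wise mode: the function sees one block at a time.
    doubled = [[blk * 2 for blk in row] for row in blocks]

    # Full-axis mode: a cumulative sum needs the entire column, so the
    # column of blocks is glued together before the function is applied.
    first_column = pd.concat([row[0] for row in blocks], axis=0)
    cumulative = first_column.cumsum()
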
@@ -2,7 +2,7 @@ PandasOnDaskDataframePartitionManager
"""""""""""""""""""""""""""""""""""""

This class is the specific implementation of :py:class:`~modin.core.dataframe.pandas.partitioning.partition_manager.PandasDataframePartitionManager`
-using Dask as the execution engine. This class is responsible for partition manipulation and applying a funcion to
+using Dask as the execution engine. This class is responsible for partition manipulation and applying a function to
block/row/column partitions.

Public API
@@ -33,7 +33,7 @@ PandasOnPython Dataframe implementation
This page describes implementation of :doc:`Modin PandasDataframe Objects </flow/modin/core/dataframe/pandas/index>`
specific for `PandasOnPython` execution. Since Python engine doesn't allow computation parallelization,
operations on partitions are performed sequentially. The absence of parallelization doesn't give any
-perfomance speed-up, so ``PandasOnPython`` is used for testing purposes only.
+performance speed-up, so ``PandasOnPython`` is used for testing purposes only.

* :doc:`PandasOnPythonDataframe <dataframe>`
* :doc:`PandasOnPythonDataframePartition <partitioning/partition>`
@@ -3,7 +3,7 @@ PandasOnPythonDataframePartition

The class is specific implementation of :py:class:`~modin.core.dataframe.pandas.partitioning.partition_manager.PandasDataframePartitionManager`
using Python as the execution engine. This class is responsible for partitions manipulation and applying
-a funcion to block/row/column partitions.
+a function to block/row/column partitions.

Public API
----------
@@ -2,7 +2,7 @@ PandasOnRayDataframePartitionManager
""""""""""""""""""""""""""""""""""""

This class is the specific implementation of :py:class:`~modin.core.execution.ray.generic.partitioning.GenericRayDataframePartitionManager`
-using Ray distributed engine. This class is responsible for partition manipulation and applying a funcion to
+using Ray distributed engine. This class is responsible for partition manipulation and applying a function to
block/row/column partitions.

Public API
37 changes: 18 additions & 19 deletions docs/flow/modin/core/io/index.rst
@@ -6,34 +6,33 @@ IO Module Description
Dispatcher Classes Workflow Overview
''''''''''''''''''''''''''''''''''''

-Call from ``read_*`` function of execution-specific IO class (for example, ``PandasOnRayIO`` for
-Ray engine and pandas storage format) is forwarded to the ``_read`` function of file
+Calls from ``read_*`` functions of execution-specific IO classes (for example, ``PandasOnRayIO`` for
+Ray engine and pandas storage format) are forwarded to the ``_read`` function of the file
format-specific class (for example ``CSVDispatcher`` for CSV files), where function parameters are
-preprocessed to check if they are supported (otherwise default pandas implementation
-is used) and compute some metadata common for all partitions. Then file is splitted
-into chunks (mechanism of splitting is described below) and using this data, tasks
-are launched on the remote workers. After remote tasks are finished, additional
-results postprocessing is performed, and new query compiler with imported data will
+preprocessed to check if they are supported (defaulting to pandas if not)
+and common metadata is computed for all partitions. The file is then split
+into chunks (splitting mechanism described below) and the data is used to launch tasks
+on the remote workers. After the remote tasks finish, additional
+postprocessing is performed on the results, and a new query compiler with the imported data will
be returned.
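
From the user's side, all of this is triggered by an ordinary pandas-style call (the step comments map onto the workflow above; a sketch of the flow, not the internal call chain verbatim):

.. code-block:: python

    import modin.pandas as pd

    # 1. pd.read_csv -> execution-specific IO class -> CSVDispatcher._read
    # 2. parameters checked; unsupported ones fall back to default pandas
    # 3. file split into chunks, parse tasks launched on remote workers
    # 4. results postprocessed into a new query compiler backing `df`
    df = pd.read_csv("large_dataset.csv")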

Data File Splitting Mechanism
'''''''''''''''''''''''''''''

-Modin file splitting mechanism differs depending on the data format type:
+Modin's file splitting mechanism differs depending on the data format type:

-* text format type - file is splitted into bytes according user specified needs.
+* text format type - the file is split into bytes according to user-specified arguments.
In the simplest case, when no row related parameters (such as ``nrows`` or
-``skiprows``) are passed, data chunks limits (start and end bytes) are derived
-by just roughly dividing the file size by the number of partitions (chunks can
+``skiprows``) are passed, data chunk limits (start and end bytes) are derived
+by dividing the file size by the number of partitions (chunks can
slightly differ between each other because usually end byte may occurs inside a
line and in that case the last byte of the line should be used instead of initial
-value). In other cases the same splitting into bytes is used, but chunks sizes are
+value). In other cases the same splitting mechanism is used, but chunk sizes are
defined according to the number of lines that each partition should contain.

-* columnar store type - file is splitted by even distribution of columns that should
-be read between chunks.
+* columnar store type - the file is split so that each chunk contains approximately the same number of columns.

-* SQL type - chunking is obtained by wrapping initial SQL query into query that
+* SQL type - chunking is obtained by wrapping the initial SQL query with a query that
specifies initial row offset and number of rows in the chunk.
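
A rough sketch of the byte-splitting idea for the text format type (illustrative only; ``compute_chunk_offsets`` is a made-up name, not a Modin internal):

.. code-block:: python

    import os

    def compute_chunk_offsets(path, num_partitions):
        """Split a file into byte ranges, aligning boundaries to newlines."""
        file_size = os.path.getsize(path)
        target = file_size // num_partitions  # rough chunk size
        offsets, start = [], 0
        with open(path, "rb") as f:
            for _ in range(num_partitions - 1):
                f.seek(start + target)  # jump to the rough boundary
                f.readline()            # advance to the end of the current line
                end = min(f.tell(), file_size)
                offsets.append((start, end))
                start = end
        offsets.append((start, file_size))  # last chunk takes the remainder
        return offsets

    # For the SQL type the same idea lives in the query itself, e.g.:
    #   SELECT * FROM (<initial query>) LIMIT <rows_per_chunk> OFFSET <row_offset>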

After file splitting is complete, chunks data is passed to the parser functions
@@ -121,10 +120,10 @@ of ``header`` and ``skiprows`` parameters:
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=2)
In the examples above list-like ``skiprows`` values are fixed and ``header`` is varied. In the first
-example with no ``header`` provided, rows 2, 3, 4 are skipped and row 0 is considered as a header.
-In the second example ``header == 1``, so 0th row is skipped and the next available row is
-considered as a header. The third example shows the case when ``header`` and ``skiprows`` parameters
-values are intersected - in this case skipped rows are dropped first and only then ``header`` is got
+example with no ``header`` provided, rows 2, 3, 4 are skipped and row 0 is considered as the header.
+In the second example ``header == 1``, so the zeroth row is skipped and the next available row is
+considered the header. The third example illustrates when the ``header`` and ``skiprows`` parameters
+values are both present - in this case ``skiprows`` rows are dropped first and then the ``header`` is derived
from the remaining rows (rows before header are skipped too).

In the examples above only list-like ``skiprows`` and integer ``header`` parameters are considered,
5 changes: 3 additions & 2 deletions docs/flow/modin/core/storage_formats/index.rst
@@ -13,8 +13,8 @@ limited to the objects that conform to pandas API. There are formats that are ab
SQL-like databases (:doc:`OmniSci storage format </flow/modin/experimental/core/storage_formats/omnisci/index>`)
inside Modin Dataframe's partitions.

-An honor of converting high-level pandas API calls to the ones that are understandable
-by the corresponding execution implementation belongs to the Query Compiler (QC) object.
+The storage format + execution engine (Ray, Dask, etc.) form the execution backend.
+The Query Compiler (QC) converts high-level pandas API calls to queries that are understood
+by the execution backend.

.. _query_compiler_def:

7 changes: 3 additions & 4 deletions docs/flow/modin/core/storage_formats/pandas/parsers.rst
@@ -8,10 +8,9 @@ and util functions for handling parsing results. ``PandasParser`` is base class
classes with pandas storage format, that contains methods common for all child classes. Other
module classes implement ``parse`` function that performs parsing of specific format data
basing on the chunk information computed in the ``modin.core.io`` module. After
-chunk data parsing is completed, resulting ``DataFrame``-s will be splitted into smaller
-``DataFrame``-s according to ``num_splits`` parameter, data type and number or
-rows/columns in the parsed chunk, and then these frames and some additional metadata will
-be returned.
+the chunk is parsed, the resulting ``DataFrame``-s will be split into smaller
+``DataFrame``-s according to the ``num_splits`` parameter, data type, or number of
+rows/columns in the parsed chunk. These frames, along with some additional metadata, are then returned.
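
The row-wise part of that split can be pictured like this (a toy sketch, not the actual parser code):

.. code-block:: python

    import numpy as np
    import pandas as pd

    def split_parsed_chunk(df: pd.DataFrame, num_splits: int):
        """Split one parsed chunk row-wise; return pieces plus metadata."""
        splits = np.array_split(df, num_splits)  # num_splits smaller frames
        metadata = (df.index, df.dtypes)         # metadata returned alongside
        return splits, metadata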

.. note::
If you are interested in the data parsing mechanism implementation details, please refer
@@ -108,14 +108,14 @@ e.g. validating a parameter from the query and defining specific intermediate va
to provide more context to the query compiler.

The :py:class:`~modin.experimental.core.storage_formats.omnisci.query_compiler.DFAlgQueryCompiler`
-is responsible for reducing the recieved query to the pre-defined Dataframe algebra operators
-and pass their execution to the
+is responsible for reducing the query to the pre-defined Dataframe algebra operators
+and triggering execution on the
:py:class:`~modin.experimental.core.execution.native.implementations.omnisci_on_native.dataframe.dataframe.OmnisciOnNativeDataframe`.

-When :py:class:`~modin.experimental.core.execution.native.implementations.omnisci_on_native.dataframe.dataframe.OmnisciOnNativeDataframe`
-recieves a query it determines whether the operation requires data materialization
-or can be performed lazily. Depending on that the operation is either appended to a
-lazy computation tree or executed.
+When the :py:class:`~modin.experimental.core.execution.native.implementations.omnisci_on_native.dataframe.dataframe.OmnisciOnNativeDataframe`
+receives a query, it determines whether the operation requires data materialization
+or whether it can be performed lazily. The operation is then either appended to a
+lazy computation tree or executed immediately.

Lazy execution
""""""""""""""
@@ -1,7 +1,7 @@
:orphan:

-IO module Description For Pandas-on-Ray Excecution
-""""""""""""""""""""""""""""""""""""""""""""""""""
+IO module Description For Pandas-on-Ray Execution
+"""""""""""""""""""""""""""""""""""""""""""""""""

High-Level Module Overview
''''''''''''''''''''''''''
@@ -25,8 +25,8 @@ statement as follows:
Submodules Description
''''''''''''''''''''''

-``modin.experimental.core.execution.ray.implementations.pandas_on_ray`` module is used mostly for storing utils and
-functions for experimanetal IO class:
+The ``modin.experimental.core.execution.ray.implementations.pandas_on_ray`` module primarily houses utils and
+functions for the experimental IO class:

* ``io.py`` - submodule containing IO class and parse functions, which are responsible
for data processing on the workers.
@@ -18,10 +18,10 @@ by the pandas creator, pandas internal architecture is not optimal and sometimes
needs up to ten times more memory than the original dataset size
(note, that pandas rule of thumb: `have 5 to 10 times as much RAM as the size of your
dataset`). In order to fix this issue (or at least to reduce needed memory amount and
-needed data copying), ``PyArrow-on-Ray`` module was added. Due to optimized architecture
-of PyArrow Tables, number of needed copies can be decreased `down to zero
+needed data copying), ``PyArrow-on-Ray`` module was added. Due to the optimized architecture
+of PyArrow Tables, `no additional copies are needed
<https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions>`_ in some
-corner cases, that can signifficantly improve Modin performance. The downside of this approach
-is that PyArrow and pandas do not support the same APIs and some functions/parameters can have
-incompatibilities or output different results, so for now ``PyArrow-on-Ray`` engine is
+corner cases, which can significantly improve Modin performance. The downside of this approach
+is that PyArrow and pandas do not support the same APIs and some functions/parameters may have
+different signatures or output different results, so for now the ``PyArrow-on-Ray`` engine is
under development and marked as experimental.
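
The zero-copy behavior referenced above can be observed directly in PyArrow (assuming a column type pandas can view without copying, such as null-free float64):

.. code-block:: python

    import pyarrow as pa

    arr = pa.array([1.0, 2.0, 3.0])
    # Raises instead of silently copying when zero-copy is impossible.
    series = arr.to_pandas(zero_copy_only=True)
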
7 changes: 4 additions & 3 deletions docs/flow/modin/experimental/xgboost.rst
@@ -48,9 +48,10 @@ Internal functions :py:func:`~modin.experimental.xgboost.xgboost_ray._train` and
Training
********

-1. The data is passed to :py:func:`~modin.experimental.xgboost.xgboost_ray._train`
-function as a :py:class:`~modin.experimental.xgboost.DMatrix` object. Using an iterator of
-:py:class:`~modin.experimental.xgboost.DMatrix`, lists of ``ray.ObjectRef`` with row partitions of Modin DataFrame are exctracted. Example:
+1. The data is passed to the :py:func:`~modin.experimental.xgboost.xgboost_ray._train`
+function as a :py:class:`~modin.experimental.xgboost.DMatrix` object. Lists of ``ray.ObjectRef``
+corresponding to row partitions of Modin DataFrames are extracted by iterating over the
+:py:class:`~modin.experimental.xgboost.DMatrix`. Example:

.. code-block:: python
4 changes: 2 additions & 2 deletions docs/getting_started/quickstart.rst
@@ -137,8 +137,8 @@ create the large dataframe, while pandas took close to a minute.
Faster ``apply`` over a single column
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-The performance benefits of Modin becomes aparent when we operate on large
-gigabyte-scale datasets. For example, let's say that we want to round up the number
+The performance benefits of Modin become apparent when we operate on large
+gigabyte-scale datasets. Let's say we want to round up values
across a single column via the ``apply`` operation.

.. code-block:: python