diff --git a/docs/flow/modin/core/execution/dask/implementations/pandas_on_dask/index.rst b/docs/flow/modin/core/execution/dask/implementations/pandas_on_dask/index.rst index fe4ec1a423b..641051a9c91 100644 --- a/docs/flow/modin/core/execution/dask/implementations/pandas_on_dask/index.rst +++ b/docs/flow/modin/core/execution/dask/implementations/pandas_on_dask/index.rst @@ -1,10 +1,36 @@ :orphan: +PandasOnDask Execution +====================== + +Queries that perform data transformation, data ingress or data egress using the `pandas on Dask` execution +pass through the Modin components detailed below. + +To enable `pandas on Dask` execution, please refer to the usage section in :doc:`pandas on Dask `. + +Data Transformation +''''''''''''''''''' + +.. image:: /img/pandas_on_dask_data_transform.svg + :align: center + +When a user calls any :py:class:`~modin.pandas.dataframe.DataFrame` API, a query starts forming at the `API` layer +to be executed at the `Execution` layer. The `API` layer is responsible for processing the query appropriately, +for example, determining whether the final result should be a ``DataFrame`` or ``Series`` object. This layer is also responsible for sanitizing the input to the +:py:class:`~modin.core.storage_formats.pandas.query_compiler.PandasQueryCompiler`, e.g. validating a parameter from the query +and defining specific intermediate values to provide more context to the query compiler. +The :py:class:`~modin.core.storage_formats.pandas.query_compiler.PandasQueryCompiler` is responsible for +processing the query, received from the :py:class:`~modin.pandas.dataframe.DataFrame` `API` layer, +to determine how to apply it to a subset of the data - either cell-wise or along an axis-wise partition backed by the `pandas` +storage format. The :py:class:`~modin.core.storage_formats.pandas.query_compiler.PandasQueryCompiler` maps the query to one of the :doc:`Core Algebra Operators ` of +the :py:class:`~modin.core.execution.dask.implementations.pandas_on_dask.dataframe.dataframe.PandasOnDaskDataframe` which inherits +generic functionality from the :py:class:`~modin.core.dataframe.pandas.dataframe.dataframe.PandasDataframe`. + PandasOnDask Dataframe implementation -===================================== +------------------------------------- -This page describes the implementation of :doc:`Modin PandasDataframe Objects ` -specific for `PandasOnDask` execution. +Modin implements ``Dataframe``, ``PartitionManager``, ``AxisPartition`` and ``Partition`` classes +specifically for the `PandasOnDask` execution. * :doc:`PandasOnDaskDataframe ` * :doc:`PandasOnDaskDataframePartition ` @@ -18,3 +44,40 @@ specific for `PandasOnDask` execution. partitioning/partition partitioning/axis_partition partitioning/partition_manager + + +Data Ingress +'''''''''''' + +.. image:: /img/pandas_on_dask_data_ingress.svg + :align: center + +Data Egress +''''''''''' + +.. image:: /img/pandas_on_dask_data_egress.svg + :align: center + + +When a user calls any IO function from the ``modin.pandas.io`` module, the `API` layer queries the +:py:class:`~modin.core.execution.dispatching.factories.dispatcher.FactoryDispatcher` which defines a factory specific for +the execution, namely, the :py:class:`~modin.core.execution.dispatching.factories.factories.PandasOnDaskFactory`. The factory, in turn, +exposes the :py:class:`~modin.core.execution.dask.implementations.pandas_on_dask.io.PandasOnDaskIO` class +whose responsibility is to perform a parallel read/write from/to a file. + +When reading data from a CSV file, for example, the :py:class:`~modin.core.execution.dask.implementations.pandas_on_dask.io.io.PandasOnDaskIO` class forwards +the user query to the :meth:`~modin.core.io.text.CSVDispatcher._read` method of :py:class:`~modin.core.io.text.CSVDispatcher`, where the query's parameters are preprocessed +to check if they are supported by the execution (defaulting to pandas if they are not) and computes some metadata +common for all partitions to be read. Then, the file is split into row chunks, and this data is used to launch remote tasks on the Dask workers +via the :meth:`~modin.core.execution.dask.common.task_wrapper.DaskTask.deploy` method of :py:class:`~modin.core.execution.dask.common.task_wrapper.DaskTask`. +On each Dask worker, the :py:class:`~modin.core.storage_formats.pandas.parsers.PandasCSVParser` parses data. +After the remote tasks are finished, additional result postprocessing is performed, +and a new query compiler with the data read is returned. + +When writing data to a CSV file, for example, the :py:class:`~modin.core.execution.dask.implementations.pandas_on_dask.io.PandasOnDaskIO` processes +the user query to execute it on Dask workers. Then, the :py:class:`~modin.core.execution.dask.implementations.pandas_on_dask.io.PandasOnDaskIO` asks the +:py:class:`~modin.core.execution.dask.implementations.pandas_on_dask.io.PandasOnDaskDataframe` to decompose the data into row-wise partitions +that will be written into the file in parallel in Dask workers. + +.. note:: + Currently, data egress uses default `pandas` implementation for `pandas on Dask` execution. \ No newline at end of file diff --git a/docs/img/pandas_on_dask_data_egress.svg b/docs/img/pandas_on_dask_data_egress.svg new file mode 100644 index 00000000000..93ff933e2a7 --- /dev/null +++ b/docs/img/pandas_on_dask_data_egress.svg @@ -0,0 +1,4 @@ + + + +
Query for IO class
Query for IO class
PandasOnDask Core Modin Dataframe
PandasOnDask Core Modin Dataframe
New file with data written
New file with data...

...
…......
pandas API
pandas API
Factory Dispatcher
Factory Dispatcher
PandasOnDaskFactory
PandasOnDaskFactory
PandasOnDaskIO
PandasOnDaskIO
PandasOnDaskIO
PandasOnDaskIO
Dataframe
Dataframe
Partition Manager
Partition Ma...
Axis Partition
Axis Partiti...
Partition
Partition
Query for
Dispatcher
Query for...
Viewer does not support full SVG 1.1
\ No newline at end of file diff --git a/docs/img/pandas_on_dask_data_ingress.svg b/docs/img/pandas_on_dask_data_ingress.svg new file mode 100644 index 00000000000..f90031f3259 --- /dev/null +++ b/docs/img/pandas_on_dask_data_ingress.svg @@ -0,0 +1,4 @@ + + + +
New Modin Dataframe with data read
New Modin Dataframe...

...
…......
pandas API
pandas API
Factory Dispatcher
Factory Dispatcher
PandasOnDaskFactory
PandasOnDaskFactory
PandasOnDaskIO
PandasOnDaskIO
PandasOnDaskIO
PandasOnDaskIO
Pandas file-specific Parser
Pandas file-sp...
Remote DaskTask
Remote DaskTas...
File-specific Dispatcher
File-specific...
Query for
Dispatcher
Query for...
Query for IO class
Query for IO class
Viewer does not support full SVG 1.1
\ No newline at end of file diff --git a/docs/img/pandas_on_dask_data_transform.svg b/docs/img/pandas_on_dask_data_transform.svg new file mode 100644 index 00000000000..721ef27108c --- /dev/null +++ b/docs/img/pandas_on_dask_data_transform.svg @@ -0,0 +1,4 @@ + + + +
pandas API
pandas API
Base
Query Compiler
Base...
pandas
Query Compiler
pandas...

Base Modin Dataframe classes

Base Modin Dataframe classes
Dataframe
Dataframe
Partition
Manager
Partition...
Axis Partition
Axis Part...
Partition
Partition

Modin PandasDataframe classes

Modin PandasDataframe class...
Dataframe
Dataframe
Partition
Manager
Partition...
Axis Partition
Axis Part...
Partition
Partition

User Query

User Query

Query for QC

Query for QC

Query for Execution

Query for...

Execution

Execution
Dask Worker
Dask Worker
Dask Worker
Dask Worker
Dask Worker
Dask Worker
Dask Worker
Dask Worker
Dask Worker
Dask Worker
Dask Worker
Dask Worker
Dataframe
Dataframe
Partition
Manager
Partition...
Axis Partition
Axis Part...
Partition
Partition

Modin PandasOnDaskDataframe classes

Modin PandasOnDaskDataframe cla...
Viewer does not support full SVG 1.1
\ No newline at end of file