Skip to content

Commit

Permalink
DOCS-#3610: Add section about handling of read_csv skiprows param…
Browse files Browse the repository at this point in the history
…eter (#3611)

Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Co-authored-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
  • Loading branch information
3 people authored Nov 29, 2021
1 parent edc1608 commit 045eb93
Showing 1 changed file with 94 additions and 0 deletions.
94 changes: 94 additions & 0 deletions docs/flow/modin/core/io/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -97,3 +97,97 @@ classes for reading files of different formats.
* sql - directory for storing SQL dispatcher class

* ``sql_dispatcher.py`` - class for reading SQL queries or database tables.

Handling ``skiprows`` Parameter
'''''''''''''''''''''''''''''''

Handling ``skiprows`` parameter by pandas import functions can be very tricky, especially
for ``read_csv`` function because of interconnection with ``header`` parameter. In this section
the techniques of ``skiprows`` processing by both pandas and Modin are covered.

Processing ``skiprows`` by pandas
=================================

Let's consider a simple snippet with ``pandas.read_csv`` in order to understand interconnection
of ``header`` and ``skiprows`` parameters:

.. code-block:: python
import pandas
from io import StringIO
data = """0
1
2
3
4
5
6
7
8
"""
# `header` parameter absence is equivalent to `header="infer"` or `header=0`
# rows 1, 5, 6, 7, 8 are read with header "0"
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4])
# rows 5, 6, 7, 8 are read with header "1", row 0 is skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=1)
# rows 6, 7, 8 are read with header "5", rows 0, 1 are skipped additionally
df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=2)
In the examples above list-like ``skiprows`` values are fixed and ``header`` is varied. In the first
example with no ``header`` provided, rows 2, 3, 4 are skipped and row 0 is considered as a header.
In the second example ``header == 1``, so 0th row is skipped and the next available row is
considered as a header. The third example shows the case when ``header`` and ``skiprows`` parameters
values are intersected - in this case skipped rows are dropped first and only then ``header`` is got
from the remaining rows (rows before header are skipped too).

In the examples above only list-like ``skiprows`` and integer ``header`` parameters are considered,
but the same logic is applicable for other types of the parameters.

Processing ``skiprows`` by Modin
================================

As it can be seen, skipping rows in the pandas import functions is complicated and distributing
this logic across multiple workers can complicate it even more. Thus in some rare corner cases
default pandas implementation is used in Modin to avoid excessive Modin code complication.

Modin uses two techniques for skipping rows:

1) During file partitioning (setting file limits that should be read by each partition)
exact rows can be excluded from partitioning scope, thus they won't be read at all and can be
considered as skipped. This is the most effective way of skipping rows since it doesn't require
any actual data reading and postprocessing, but in this case ``skiprows`` parameter can be an
integer only. When it is possible Modin always uses this approach.

2) Rows for skipping can be dropped after full dataset import. This is more expensive way since
it requires extra IO work and postprocessing afterwards, but ``skiprows`` parameter can be of any
non-integer type supported by ``pandas.read_csv``.

In some cases, if ``skiprows`` is uniformly distributed array (e.g. [1, 2, 3]), ``skiprows`` can be
"squashed" and represented as an integer to make a fastpath by skipping these rows during file partitioning
(using the first option). But if there is a gap between the first row for skipping
and the last line of the header (that will be skipped too since header is read by each partition
to ensure metadata is defined properly), then this gap should be assigned for reading first
by assigning the first partition to read these rows by setting ``pre_reading`` parameter.

Let's consider an example of skipping rows during partitioning when ``header="infer"`` and
``skiprows=[3, 4, 5]``. In this specific case fastpath can be done since ``skiprows`` is uniformly
distributed array, so we can "squash" it to an integer and set "partitioning" skiprows to 3. But
if no additional action is done, these three rows will be skipped right after header line,
that corresponds to ``skiprows=[1, 2, 3]``. To avoid this discrepancy, we need to assign the first
partition to read data between header line and the first row for skipping by setting special
``pre_reading`` parameter to 2. Then, after the skipping of rows considered to be skipped during
partitioning, the rest data will be divided between the rest of partitions, see rows assignment
below:

.. code-block::
0 - header line (skip during partitioning)
1 - pre reading (assign to read by the first partition)
2 - pre reading (assign to read by the first partition)
3 - "partitioning" skiprows (skip during partitioning)
4 - "partitioning" skiprows (skip during partitioning)
5 - "partitioning" skiprows (skip during partitioning)
6 - data to partition (divide between the rest of partitions)
7 - data to partition (divide between the rest of partitions)

0 comments on commit 045eb93

Please sign in to comment.