DEPR: error_bad_lines and warn_bad_lines for read_csv #40413

Merged May 28, 2021 (30 commits)
Changes from 13 commits
Commits
4e867e9
Deprecate
lithomas1 Mar 13, 2021
d230035
stacklevel/warnings fixes
lithomas1 Mar 13, 2021
ce5bf29
Fixes
lithomas1 Mar 13, 2021
8629f87
Fix tests
lithomas1 Mar 13, 2021
06f87a1
Doc fixes for green
lithomas1 Mar 13, 2021
edeef7e
Merge branch 'master' into depr-bad-lines
lithomas1 Mar 15, 2021
f806a4d
Merge branch 'master' into depr-bad-lines
lithomas1 Mar 16, 2021
5b08a88
Update test_common_basic.py
lithomas1 Mar 16, 2021
af3fd15
Update test_common_basic.py
lithomas1 Mar 16, 2021
f70f34e
Merge branch 'master' into depr-bad-lines
lithomas1 Mar 28, 2021
0c76180
Merge branch 'master' of https://github.com/pandas-dev/pandas into de…
lithomas1 Apr 1, 2021
a0406b5
Address Code Review
lithomas1 Apr 6, 2021
89fdc70
oops
lithomas1 Apr 6, 2021
f7265a3
Merge branch 'master' into depr-bad-lines
lithomas1 Apr 27, 2021
1e20b53
Address code review
lithomas1 Apr 28, 2021
2e79f9a
Update io.rst
lithomas1 Apr 28, 2021
fe7541c
Address code review
lithomas1 Apr 29, 2021
772c13f
manual pre-commit
lithomas1 Apr 29, 2021
d00e601
Merge branch 'master' into depr-bad-lines
lithomas1 May 5, 2021
e267aa4
Consolidate
lithomas1 May 23, 2021
e724d0b
Clarify behavior
lithomas1 May 23, 2021
fdef68e
Merge branch 'master' into depr-bad-lines
lithomas1 May 23, 2021
a220293
Merge branch 'master' into depr-bad-lines
lithomas1 May 23, 2021
9b8468a
typing
lithomas1 May 23, 2021
a6af9aa
Fix failed test
lithomas1 May 24, 2021
2f70edc
Clean code
lithomas1 May 24, 2021
93c37df
Merge branch 'master' into depr-bad-lines
lithomas1 May 27, 2021
cf3201c
Update v1.3.0.rst
lithomas1 May 27, 2021
4911b27
Update readers.py
lithomas1 May 27, 2021
f28316b
Fix stacklevel
lithomas1 May 28, 2021
30 changes: 29 additions & 1 deletion doc/source/user_guide/io.rst
@@ -349,10 +349,38 @@ error_bad_lines : boolean, default ``True``
returned. If ``False``, then these "bad lines" will be dropped from the
``DataFrame`` that is returned. See :ref:`bad lines <io.bad_lines>`
below.

.. deprecated:: 1.3
The ``on_bad_lines`` parameter takes precedence over this parameter
when specified and should be used instead to specify behavior upon
encountering a bad line.
warn_bad_lines : boolean, default ``True``
If error_bad_lines is ``False``, and warn_bad_lines is ``True``, a warning for
each "bad line" will be output.

.. deprecated:: 1.3
The ``on_bad_lines`` parameter takes precedence over this parameter
when specified and should be used instead to specify behavior upon
encountering a bad line.
on_bad_lines : {{None, 'error', 'warn', 'skip'}}, default ``None``
Specifies what to do upon encountering a bad line (a line with too many fields).
Allowed values are :

- ``None``, default option, defers to ``error_bad_lines`` and ``warn_bad_lines``.

Note: This option is only present for backwards-compatibility reasons and will
be removed after the removal of ``error_bad_lines`` and ``warn_bad_lines``.
Please do not specify it explicitly.

- 'error', raise an Exception when a bad line is encountered.
- 'warn', raise a warning when a bad line is encountered and skip that line.
- 'skip', skip bad lines without raising or warning when they are encountered.

This parameter takes precedence over parameters ``error_bad_lines`` and ``warn_bad_lines``
if specified.

.. versionadded:: 1.3

.. _io.dtypes:

Specifying column data types
@@ -1244,7 +1272,7 @@ You can elect to skip bad lines:

.. code-block:: ipython

In [29]: pd.read_csv(StringIO(data), error_bad_lines=False)
In [29]: pd.read_csv(StringIO(data), on_bad_lines="warn")
Skipping line 3: expected 3 fields, saw 4

Out[29]:
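
For readers migrating the documentation example above, here is a minimal sketch of the new spelling. It assumes pandas 1.3 or later is installed; the sample `data` string is illustrative (a three-field header followed by one row with four fields).

```python
from io import StringIO

import pandas as pd

# Illustrative data: the third line has four fields but the header has three.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"

# "skip" drops the malformed line silently; "warn" would also report it,
# and "error" would raise instead of returning a DataFrame.
df = pd.read_csv(StringIO(data), on_bad_lines="skip")
print(df)
```

Only the two well-formed rows should remain in `df`.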
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -480,6 +480,7 @@ Deprecations
- Deprecated casting ``datetime.date`` objects to ``datetime64`` when used as ``fill_value`` in :meth:`DataFrame.unstack`, :meth:`DataFrame.shift`, :meth:`Series.shift`, and :meth:`DataFrame.reindex`, pass ``pd.Timestamp(dateobj)`` instead (:issue:`39767`)
- Deprecated :meth:`.Styler.set_na_rep` and :meth:`.Styler.set_precision` in favour of :meth:`.Styler.format` with ``na_rep`` and ``precision`` as existing and new input arguments respectively (:issue:`40134`, :issue:`40425`)
- Deprecated allowing partial failure in :meth:`Series.transform` and :meth:`DataFrame.transform` when ``func`` is list-like or dict-like and raises anything but ``TypeError``; ``func`` raising anything but a ``TypeError`` will raise in a future version (:issue:`40211`)
- Deprecated arguments ``error_bad_lines`` and ``warn_bad_lines`` in :meth:`pd.read_csv` in favor of ``on_bad_lines`` (:issue:`15122`)
- Deprecated support for ``np.ma.mrecords.MaskedRecords`` in the :class:`DataFrame` constructor, pass ``{name: data[name] for name in data.dtype.names}`` instead (:issue:`40363`)
- Deprecated the use of ``**kwargs`` in :class:`.ExcelWriter`; use the keyword argument ``engine_kwargs`` instead (:issue:`40430`)

21 changes: 9 additions & 12 deletions pandas/_libs/parsers.pyx
@@ -149,6 +149,11 @@ cdef extern from "parser/tokenizer.h":

enum: ERROR_OVERFLOW

ctypedef enum BadLineHandleMethod:
ERROR,
WARN,
SKIP

ctypedef void* (*io_callback)(void *src, size_t nbytes, size_t *bytes_read,
int *status, const char *encoding_errors)
ctypedef int (*io_cleanup)(void *src)
@@ -201,8 +206,7 @@ cdef extern from "parser/tokenizer.h":
int usecols

int expected_fields
int error_bad_lines
int warn_bad_lines
BadLineHandleMethod on_bad_lines

# floating point options
char decimal
@@ -351,8 +355,7 @@ cdef class TextReader:
thousands=None,
dtype=None,
usecols=None,
bint error_bad_lines=True,
bint warn_bad_lines=True,
on_bad_lines = ERROR,
bint na_filter=True,
na_values=None,
na_fvalues=None,
@@ -436,9 +439,7 @@ cdef class TextReader:
raise ValueError('Only length-1 comment characters supported')
self.parser.commentchar = ord(comment)

# error handling of bad lines
self.parser.error_bad_lines = int(error_bad_lines)
self.parser.warn_bad_lines = int(warn_bad_lines)
self.parser.on_bad_lines = on_bad_lines

self.skiprows = skiprows
if skiprows is not None:
@@ -455,8 +456,7 @@ cdef class TextReader:

# XXX
if skipfooter > 0:
self.parser.error_bad_lines = 0
self.parser.warn_bad_lines = 0
self.parser.on_bad_lines = SKIP

self.delimiter = delimiter
self.delim_whitespace = delim_whitespace
@@ -571,9 +571,6 @@ cdef class TextReader:
kh_destroy_str_starts(self.false_set)
self.false_set = NULL

def set_error_bad_lines(self, int status):
self.parser.error_bad_lines = status

def _set_quoting(self, quote_char, quoting):
if not isinstance(quoting, int):
raise TypeError('"quoting" must be an integer')
7 changes: 3 additions & 4 deletions pandas/_libs/src/parser/tokenizer.c
@@ -93,8 +93,7 @@ void parser_set_default_options(parser_t *self) {
self->allow_embedded_newline = 1;

self->expected_fields = -1;
self->error_bad_lines = 0;
self->warn_bad_lines = 0;
self->on_bad_lines = ERROR;

self->commentchar = '#';
self->thousands = '\0';
@@ -457,7 +456,7 @@ static int end_line(parser_t *self) {
self->line_fields[self->lines] = 0;

// file_lines is now the actual file line number (starting at 1)
if (self->error_bad_lines) {
if (self->on_bad_lines == ERROR) {
self->error_msg = malloc(bufsize);
snprintf(self->error_msg, bufsize,
"Expected %d fields in line %" PRIu64 ", saw %" PRId64 "\n",
@@ -468,7 +467,7 @@ static int end_line(parser_t *self) {
return -1;
} else {
// simply skip bad lines
if (self->warn_bad_lines) {
if (self->on_bad_lines == WARN) {
// pass up error message
msg = malloc(bufsize);
snprintf(msg, bufsize,
9 changes: 7 additions & 2 deletions pandas/_libs/src/parser/tokenizer.h
@@ -84,6 +84,12 @@ typedef enum {
QUOTE_NONE
} QuoteStyle;

typedef enum {
ERROR,
WARN,
SKIP
} BadLineHandleMethod;

typedef void *(*io_callback)(void *src, size_t nbytes, size_t *bytes_read,
int *status, const char *encoding_errors);
typedef int (*io_cleanup)(void *src);
@@ -136,8 +142,7 @@ typedef struct parser_t {
int usecols; // Boolean: 1: usecols provided, 0: none provided

int expected_fields;
int error_bad_lines;
int warn_bad_lines;
BadLineHandleMethod on_bad_lines;

// floating point options
char decimal;
25 changes: 25 additions & 0 deletions pandas/io/parsers/base_parser.py
@@ -1,6 +1,7 @@
from collections import defaultdict
import csv
import datetime
from enum import Enum
import itertools
from typing import (
Any,
@@ -114,6 +115,11 @@


class ParserBase:
class BadLineHandleMethod(Enum):
ERROR = 0
WARN = 1
SKIP = 2

def __init__(self, kwds):

self.names = kwds.get("names")
@@ -202,6 +208,25 @@ def __init__(self, kwds):

self.handles: Optional[IOHandles] = None

# Bad line handling
on_bad_lines = kwds.get("on_bad_lines")
if on_bad_lines is not None:
if on_bad_lines == "error":
self.on_bad_lines = self.BadLineHandleMethod.ERROR
elif on_bad_lines == "warn":
self.on_bad_lines = self.BadLineHandleMethod.WARN
elif on_bad_lines == "skip":
self.on_bad_lines = self.BadLineHandleMethod.SKIP
else:
raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")
else:
if kwds.get("error_bad_lines"):
Contributor: these need a deprecation warning

Member Author: _deprecated_defaults handles this for us.


self.on_bad_lines = self.BadLineHandleMethod.ERROR
elif kwds.get("warn_bad_lines"):
self.on_bad_lines = self.BadLineHandleMethod.WARN
else:
self.on_bad_lines = self.BadLineHandleMethod.SKIP

def _open_handles(self, src: FilePathOrBuffer, kwds: Dict[str, Any]) -> None:
"""
Let the readers open IOHandles after they are done with their potential raises.
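
The precedence rule added to `ParserBase.__init__` above can be sketched as a standalone function. This is a simplified illustration, not the pandas implementation: `resolve_bad_line_method` is a hypothetical name, and the defaults mirror the `read_csv` signature shown in this PR.

```python
from enum import Enum


class BadLineHandleMethod(Enum):
    ERROR = 0
    WARN = 1
    SKIP = 2


# Hypothetical standalone helper; in the PR this logic lives in ParserBase.__init__.
def resolve_bad_line_method(
    on_bad_lines=None, error_bad_lines=True, warn_bad_lines=True
):
    if on_bad_lines is not None:
        # The new parameter wins whenever it is given explicitly.
        if on_bad_lines == "error":
            return BadLineHandleMethod.ERROR
        if on_bad_lines == "warn":
            return BadLineHandleMethod.WARN
        if on_bad_lines == "skip":
            return BadLineHandleMethod.SKIP
        raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")
    # Otherwise fall back to the deprecated pair of booleans.
    if error_bad_lines:
        return BadLineHandleMethod.ERROR
    if warn_bad_lines:
        return BadLineHandleMethod.WARN
    return BadLineHandleMethod.SKIP
```

With both deprecated flags left at their defaults the result is `ERROR`, which matches the pre-existing behavior of raising on a bad line.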
16 changes: 12 additions & 4 deletions pandas/io/parsers/c_parser_wrapper.py
@@ -25,7 +25,18 @@ def __init__(self, src: FilePathOrBuffer, **kwds):
# open handles
self._open_handles(src, kwds)
assert self.handles is not None
for key in ("storage_options", "encoding", "memory_map", "compression"):

# Have to pass int, would break tests using TextReader directly otherwise :(
kwds["on_bad_lines"] = self.on_bad_lines.value

for key in (
"storage_options",
"encoding",
"memory_map",
"compression",
"error_bad_lines",
"warn_bad_lines",
):
kwds.pop(key, None)
if self.handles.is_mmap and hasattr(self.handles.handle, "mmap"):
# error: Item "IO[Any]" of "Union[IO[Any], RawIOBase, BufferedIOBase,
@@ -155,9 +166,6 @@ def _set_noconvert_columns(self):
for col in noconvert_columns:
self._reader.set_noconvert(col)

def set_error_bad_lines(self, status):
self._reader.set_error_bad_lines(int(status))

def read(self, nrows=None):
try:
data = self._reader.read(nrows)
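
The hand-off to the C reader in `CParserWrapper.__init__` passes the enum's integer value and drops the deprecated keys. A minimal sketch of that step follows, assuming the Python values 0/1/2 line up with the declaration order of the C `BadLineHandleMethod` enum (C numbers unassigned enumerators from 0). `prepare_c_kwds` is a hypothetical helper name, not part of pandas.

```python
from enum import Enum


class BadLineHandleMethod(Enum):
    # Values chosen to match the order ERROR, WARN, SKIP in tokenizer.h.
    ERROR = 0
    WARN = 1
    SKIP = 2


# Hypothetical helper mirroring the kwds handling shown in the diff.
def prepare_c_kwds(kwds, method):
    out = dict(kwds)  # copy so the caller's dict is left untouched
    # The C-level reader receives a plain int rather than the Enum member.
    out["on_bad_lines"] = method.value
    # The deprecated flags are already folded into `method`, so drop them.
    for key in ("error_bad_lines", "warn_bad_lines"):
        out.pop(key, None)
    return out
```

For example, `prepare_c_kwds({"error_bad_lines": True, "warn_bad_lines": True}, BadLineHandleMethod.WARN)` yields a dict whose only key is `on_bad_lines`, set to `1`.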
26 changes: 15 additions & 11 deletions pandas/io/parsers/python_parser.py
@@ -75,9 +75,6 @@ def __init__(self, f: Union[FilePathOrBuffer, List], **kwds):
self.quoting = kwds["quoting"]
self.skip_blank_lines = kwds["skip_blank_lines"]

self.warn_bad_lines = kwds["warn_bad_lines"]
self.error_bad_lines = kwds["error_bad_lines"]

self.names_passed = kwds["names"] or None

self.has_index_names = False
@@ -664,10 +661,11 @@ def _next_line(self):

def _alert_malformed(self, msg, row_num):
"""
Alert a user about a malformed row.
Alert a user about a malformed row, depending on value of
`self.on_bad_lines` enum.

If `self.error_bad_lines` is True, the alert will be `ParserError`.
If `self.warn_bad_lines` is True, the alert will be printed out.
If `self.on_bad_lines` is ERROR, the alert will be `ParserError`.
If `self.on_bad_lines` is WARN, the alert will be printed out.

Parameters
----------
@@ -676,9 +674,9 @@ def _alert_malformed(self, msg, row_num):
Because this row number is displayed, we 1-index,
even though we 0-index internally.
"""
if self.error_bad_lines:
if self.on_bad_lines == self.BadLineHandleMethod.ERROR:
raise ParserError(msg)
elif self.warn_bad_lines:
elif self.on_bad_lines == self.BadLineHandleMethod.WARN:
base = f"Skipping line {row_num}: "
sys.stderr.write(base + msg + "\n")

@@ -699,7 +697,10 @@ def _next_iter_line(self, row_num):
assert self.data is not None
return next(self.data)
except csv.Error as e:
if self.warn_bad_lines or self.error_bad_lines:
if (
self.on_bad_lines == self.BadLineHandleMethod.ERROR
or self.on_bad_lines == self.BadLineHandleMethod.WARN
):
msg = str(e)

if "NULL byte" in msg or "line contains NUL" in msg:
@@ -896,11 +897,14 @@ def _rows_to_cols(self, content):
actual_len = len(l)

if actual_len > col_len:
if self.error_bad_lines or self.warn_bad_lines:
if (
self.on_bad_lines == self.BadLineHandleMethod.ERROR
or self.on_bad_lines == self.BadLineHandleMethod.WARN
):
row_num = self.pos - (content_len - i + footers)
bad_lines.append((row_num, actual_len))

if self.error_bad_lines:
if self.on_bad_lines == self.BadLineHandleMethod.ERROR:
break
else:
content.append(l)
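
The three-way behavior that the Python parser now keys off `self.on_bad_lines` can be illustrated with the standard `csv` module. This is a simplified sketch under stated assumptions, not the pandas code path: `read_rows` is a hypothetical function, a "bad line" is taken to mean a row with more fields than the header, and `ValueError` stands in for `ParserError`.

```python
import csv
import io
import sys
from enum import Enum


class BadLineHandleMethod(Enum):
    ERROR = 0
    WARN = 1
    SKIP = 2


# Hypothetical reader; the warning text imitates the message in the docs example.
def read_rows(text, on_bad_lines=BadLineHandleMethod.ERROR, warn_stream=None):
    stream = sys.stderr if warn_stream is None else warn_stream
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return [], []
    rows = []
    for row in reader:
        if len(row) > len(header):
            msg = f"expected {len(header)} fields, saw {len(row)}"
            if on_bad_lines is BadLineHandleMethod.ERROR:
                # Stands in for ParserError in this sketch.
                raise ValueError(f"line {reader.line_num}: {msg}")
            if on_bad_lines is BadLineHandleMethod.WARN:
                stream.write(f"Skipping line {reader.line_num}: {msg}\n")
            continue  # WARN and SKIP both drop the row
        rows.append(row)
    return header, rows
```

The only difference between `WARN` and `SKIP` is the message written to the stream; both leave the bad row out of the result.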
40 changes: 37 additions & 3 deletions pandas/io/parsers/readers.py
@@ -325,9 +325,38 @@
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these "bad lines" will be dropped from the DataFrame that is
returned.

.. deprecated:: 1.3
The ``on_bad_lines`` parameter takes precedence over this parameter
when specified and should be used instead to specify behavior upon
encountering a bad line.
warn_bad_lines : bool, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
"bad line" will be output.

.. deprecated:: 1.3
The ``on_bad_lines`` parameter takes precedence over this parameter
when specified and should be used instead to specify behavior upon
encountering a bad line.
on_bad_lines : {{None, 'error', 'warn', 'skip'}}, default ``None``
Specifies what to do upon encountering a bad line (a line with too many fields).
Allowed values are :

- ``None``, default option, defer to ``error_bad_lines`` and ``warn_bad_lines``.

Note: This option is only present for backwards-compatibility reasons and will
be removed after the removal of ``error_bad_lines`` and ``warn_bad_lines``.
Please do not specify it explicitly.

- 'error', raise an Exception when a bad line is encountered.
- 'warn', raise a warning when a bad line is encountered and skip that line.
- 'skip', skip bad lines without raising or warning when they are encountered.

This parameter takes precedence over parameters
``error_bad_lines`` and ``warn_bad_lines`` if specified.

.. versionadded:: 1.3

delim_whitespace : bool, default False
Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be
used as the sep. Equivalent to setting ``sep='\\s+'``. If this option
@@ -382,6 +411,7 @@
"memory_map": False,
"error_bad_lines": True,
"warn_bad_lines": True,
"on_bad_lines": None,
Contributor: yeah should remove error/warn_bad_lines from here

Member Author (lithomas1, Apr 6, 2021): _get_options_with_defaults is really spaghetti-fied right now, so removing this would mean the args are not passed to the parser. I will try to clean up _get_options_with_defaults in a future PR if I have time.

"float_precision": None,
}

@@ -390,8 +420,8 @@
_c_unsupported = {"skipfooter"}
_python_unsupported = {"low_memory", "float_precision"}

_deprecated_defaults: Dict[str, Any] = {}
_deprecated_args: Set[str] = set()
_deprecated_defaults: Dict[str, Any] = {"error_bad_lines": True, "warn_bad_lines": True}
_deprecated_args: Set[str] = {"error_bad_lines", "warn_bad_lines"}


def validate_integer(name, val, min_val=0):
@@ -533,6 +563,8 @@ def read_csv(
# Error Handling
error_bad_lines=True,
warn_bad_lines=True,
# TODO: disallow and change None to 'error' in on_bad_lines in 2.0
on_bad_lines=None,
# Internal
delim_whitespace=False,
low_memory=_c_parser_defaults["low_memory"],
@@ -613,6 +645,8 @@ def read_table(
# Error Handling
error_bad_lines=True,
warn_bad_lines=True,
# TODO: disallow and change None to 'error' in on_bad_lines in 2.0
on_bad_lines=None,
encoding_errors: Optional[str] = "strict",
# Internal
delim_whitespace=False,
@@ -924,7 +958,7 @@ def _clean_options(self, options, engine):
f"The {arg} argument has been deprecated and will be "
"removed in a future version.\n\n"
)
warnings.warn(msg, FutureWarning, stacklevel=2)
warnings.warn(msg, FutureWarning, stacklevel=6)
else:
result[arg] = parser_default
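
The `stacklevel` change above controls which frame the `FutureWarning` is attributed to. The toy call chain below sketches the idea; the function names are hypothetical and the chain is only three frames deep, so it uses `stacklevel=3`, whereas the PR uses 6 for pandas' own, deeper chain.

```python
import warnings


def _clean_options(deprecated_arg):
    msg = (
        f"The {deprecated_arg} argument has been deprecated and will be "
        "removed in a future version."
    )
    # stacklevel=1 is this function, 2 is read_csv_like, 3 is its caller.
    warnings.warn(msg, FutureWarning, stacklevel=3)


def read_csv_like(**kwargs):
    # Hypothetical stand-in for read_csv: warn for each deprecated keyword.
    for name in kwargs:
        if name in {"error_bad_lines", "warn_bad_lines"}:
            _clean_options(name)


def user_code():
    read_csv_like(error_bad_lines=False)  # the warning should point at this line


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    user_code()

print(caught[0].lineno, caught[0].message)
```

If the level is too small, the warning is attributed to a line inside the library rather than to the user's call, which is the kind of problem the "Fix stacklevel" commits address.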
