
usecols doesn't help with unclean CSVs #9549

Closed
harshnisar opened this issue Feb 25, 2015 · 18 comments
Labels
API Design, IO CSV (read_csv, to_csv)

Comments

@harshnisar
Contributor

So I have a lot of CSVs which are clean up to, say, 7 columns, with no missing values, but have strings at random places after column 7.

I know the file is clean only up to column 7. So when I pass usecols and list those 7 columns, I want pandas to ignore the other columns, ideally truncating the rest of each row while reading. Shouldn't that be the behaviour?
I don't want to skip over bad lines.

Is there a way I can force pandas to read only 7 columns, expect 7 fields per row, and hence not raise an exception?

Another method is to use names=range(35), an arbitrarily large number. But then I lose the real headers in my file and can't tell what the columns refer to. The columns are not fixed across files.
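A minimal sketch of that workaround (the file name and the width of 35 are placeholders):

import pandas as pd

# Over-wide integer column names let every row tokenize without error,
# but the real header line now shows up as the first data row.
df = pd.read_csv('myfile.csv', names=range(35), index_col=False)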

edit: This is my first issue report for a big Python package. Please bear with me if I didn't follow protocol.

@jreback
Copy link
Contributor

jreback commented Feb 25, 2015

There were a couple of issues w.r.t. this that have been fixed in master, for the upcoming 0.16.0 release. Want to give it a try? Otherwise, please post pd.show_versions() and a small reproducible example.

@jreback jreback added the IO CSV (read_csv, to_csv) label Feb 25, 2015
@harshnisar
Contributor Author

!cat pandastrial.csv
A,B,C
1,2,3
3,4,5
1,2,4,5,1,6
1,2,3,,,1,
1,2,3
5,6,7

In [3]: df = pd.read_csv('pandastrial.csv', error_bad_lines = False)

Skipping line 4: expected 3 fields, saw 6
Skipping line 5: expected 3 fields, saw 7

In [5]: df = pd.read_csv('pandastrial.csv', usecols = ['A','B','C'])

CParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 6

Basically, I know there is an issue with the other columns and I only trust three of them. How do I tell pandas to read just the first three values in each row and truncate the rest?

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: x86
processor: x86 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.2
nose: None
Cython: None
numpy: 1.9.1
scipy: 0.15.1
statsmodels: None
IPython: 2.3.1
sphinx: None
patsy: None
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.2
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None


@harshnisar
Contributor Author

I came up with this hack.
n_clean_cols is the number of leading columns I trust;
maxcols is the maximum length of a bad row, ideally infinity.

import pandas as pd

def clean_csv(filename, n_clean_cols, maxcols=20):
    # Over-wide integer names let every row parse without a tokenizer error.
    df = pd.read_csv(filename, names=range(maxcols), index_col=False)
    # The real header arrived as the first data row; promote it to column names.
    df = df.rename(columns={c: df[c].iloc[0] for c in df.columns})
    # Drop the header row and keep only the trusted leading columns.
    return df.iloc[1:, :n_clean_cols]
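A quick check against the pandastrial.csv listed above:

df = clean_csv('pandastrial.csv', 3)
# keeps columns A, B, C for all six data rows, including the two
# over-long lines that read_csv rejects when usecols is passed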

@harshnisar
Contributor Author

@jreback ?

@jreback
Contributor

jreback commented Feb 27, 2015

This looks like a good solution to me. I don't think usecols should by default ignore bad lines, which is what you are suggesting; it's pretty error-prone to do that.

In [2]: read_csv(StringIO(data),error_bad_lines=False,usecols=['A','B','C'])
Skipping line 5: expected 3 fields, saw 6
Skipping line 6: expected 3 fields, saw 7

Out[2]: 
   A  B  C
0  1  2  3
1  3  4  5
2  1  2  3
3  5  6  7

@selasley
Contributor

I added a process_bad_lines option to the C parser for times when you want to get as much data as possible out of a csv file with bad lines, without pre-processing the file. The docstring mentions the danger of using this option. Using it outputs WARNING lines for the bad lines that are processed:

pd.read_csv(StringIO(data),process_bad_lines=True)
WARNING! Processing bad line 3: expected 3 fields, saw 6
WARNING! Processing bad line 4: expected 6 fields, saw 7

Out[8]:
   A  B  C
0  1  2  3
1  3  4  5
2  1  2  4
3  1  2  3
4  1  2  3
5  5  6  7

It's still easy to get bad data from bad lines with this option, so it's not a cure for bad data:
pd.read_csv(StringIO(data),process_bad_lines=True, header=False, names=range(6))
   0  1  2    3    4    5
0  1  2  3    3    4    5
1  3  4  5    1    2    4
2  1  2  4    5    1    6
3  1  2  3  NaN  NaN    1
4  1  2  3  NaN  NaN  NaN
5  5  6  7  NaN  NaN  NaN

If this is an option that would be useful I can clean up the code and create a pull request.
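For comparison, the manual pre-processing that such an option would avoid looks roughly like the following (a sketch only; truncate_csv is an illustrative helper, not part of any pandas API): truncate every raw line to the expected number of fields before read_csv ever sees it.

import csv
import io
import pandas as pd

def truncate_csv(path, n_fields):
    # Rewrite each row keeping only the first n_fields values so that
    # read_csv never encounters an over-long line.
    buf = io.StringIO()
    writer = csv.writer(buf)
    with open(path, newline='') as fh:
        for row in csv.reader(fh):
            writer.writerow(row[:n_fields])
    buf.seek(0)
    return pd.read_csv(buf)

On the pandastrial.csv above, truncate_csv('pandastrial.csv', 3) would yield all six data rows under the A, B, C header.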

@jreback
Contributor

jreback commented Feb 27, 2015

how does this differ from my example above?

@selasley
Contributor

error_bad_lines and warn_bad_lines skip over bad rows; process_bad_lines attempts to extract data from the bad rows. So the dataframe in your example with error_bad_lines=False contains data from lines 1, 2, 5, and 6, while the dataframe with process_bad_lines=True has data from all 6 lines. I believe this is what the OP was looking for. process_bad_lines would be useful when you know the data in the columns you are interested in is OK even on lines in the file that contain extra columns you don't care about.

@jreback
Contributor

jreback commented Feb 28, 2015

Maybe don't add a new arg then; instead
make error_bad_lines='skip' do this.

@selasley
Contributor

OK, so something like this for the bad_lines options?

error_bad_lines : True, False, "process", default True
    Lines with too many fields (e.g. a csv line with too many commas) will by
    default cause an exception to be raised, and no DataFrame will be returned.
    If False, these "bad lines" will be dropped from the DataFrame that is
    returned. If "process", these "bad lines" will be processed and added to
    the DataFrame that is returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
    If error_bad_lines is False or "process" and warn_bad_lines is True, a
    warning for each "bad line" will be output. (Only valid with C parser)

@jreback
Contributor

jreback commented Feb 28, 2015

I think I would change things a bit

why don't we make a new argument (and deprecate error_bad_lines, warn_bad_lines)

bad_lines='error'|'warn'|'skip'|'process'

default would be error (to replicate current behaviour for back compat),
or warn if error_bad_lines=False and warn_bad_lines=True.

skip would simply skip them.
process would only make a difference if usecols were specified and those
columns were then valid.

too confusing? (or less confusing)?

(obviously not all of these apply to c/python parsers but that is a separate issue)
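One way to wire up the back compat described above (a sketch only; resolve_bad_lines is a hypothetical helper, and a bad_lines argument never shipped under this name):

def resolve_bad_lines(bad_lines=None, error_bad_lines=None, warn_bad_lines=True):
    # Prefer the new argument; otherwise map the two deprecated
    # booleans onto the proposed values, matching the defaults above.
    if bad_lines is not None:
        return bad_lines
    if error_bad_lines is False:
        return 'warn' if warn_bad_lines else 'skip'
    return 'error'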

@harshnisar
Contributor Author

@selasley Yes, that is what I wanted. My work involved checking CSVs which, over the years, various people had edited and added comments to, etc. But the data was clean, at least the parts I wanted.

I finally used the hack I mentioned earlier, but it was a little slow and dirty. How did you implement it initially?

@jreback Your last comment makes sense; I didn't find it that confusing. Maybe something other than process. Maybe force?

@selasley
Contributor

selasley commented Mar 3, 2015

@harshnisar I am modifying the Python, Cython, and C files for the parser and tokenizer, so bad_lines will eventually be available in a release version of pandas.

What do you think about this?

bad_lines : string, {'error', 'skip', or 'keep'}, default 'error'
    Specifies how to handle lines with too many fields in the input file,
    e.g. a csv line with too many commas.
    'error' will raise an exception when the first bad line is encountered
    and will not return a DataFrame.
    'skip' will skip bad lines and continue processing the input file.
    'keep' will attempt to process the bad lines and include the data in
    the DataFrame that is returned. The 'keep' option requires the usecols
    option to specify the good columns in the input file.
    (Only valid with C parser)
error_bad_lines : deprecated, boolean, default True
    Lines with too many fields (e.g. a csv line with too many commas) will by
    default cause an exception to be raised, and no DataFrame will be returned.
    If False, these "bad lines" will be dropped from the DataFrame that is
    returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
    If warn_bad_lines is True, a warning for each "bad line" will be output.
    (Only valid with C parser)

I vote to keep warn_bad_lines in order to turn warnings on or off with the skip and keep options.

@jreback jreback added this to the Next Major Release milestone Mar 4, 2015
@jreback
Contributor

jreback commented Mar 4, 2015

@selasley why would you need warn_bad_lines in addition to the 6 options above?

Maybe enumerate what the other 3, e.g. skipwarn/keepwarn/errorwarn, mean/do? (I think the last one is a bit odd; maybe make it just warn.)

@harshnisar
Contributor Author

@selasley
Why do we need to always specify usecols?
Can't we by default take the number of columns in the first row as the expected width? This would go with the assumption that the first row is always clean.

The heuristic way would be to take the most frequent row length as the default. But is this overkill?
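That heuristic would need an extra pass over the file to find the modal field count, something like this (a sketch; modal_field_count is illustrative and nothing like it exists in read_csv):

import csv
from collections import Counter

def modal_field_count(path):
    # First pass over the file: tally the number of fields per row
    # and return the most common width.
    with open(path, newline='') as fh:
        counts = Counter(len(row) for row in csv.reader(fh))
    return counts.most_common(1)[0][0]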

Also, keeping the warning as a separate argument seems more intuitive. I vote for that too, if it counts :)

@selasley
Contributor

selasley commented Mar 4, 2015

@jreback
My second post was a possible scheme for deprecating both error_bad_lines and warn_bad_lines when adding bad_lines. It seemed hacky to me even as I was writing it, so I deleted it. If you don't have strong objections, I would like to keep warn_bad_lines, deprecate error_bad_lines, and add bad_lines with three options: error, skip, keep.

@harshnisar
I misread jreback's post regarding usecols. The code I've written so far does not require it, so I'll remove that section of the docstring for the keep option. Using the most frequent row length would require reading the file twice or keeping data for each row in memory, which could be a problem for very large files. If a user needs information from the extra columns in the bad rows, they can do something like this for the data string used above:

 read_csv(StringIO(data), bad_lines='keep', names=list('ABCDEF'), skiprows=1)
    A  B  C   D   E   F
 0  1  2  3 NaN NaN NaN
 1  3  4  5 NaN NaN NaN
 2  1  2  4   5   1   6
 3  1  2  3 NaN NaN   1
 4  1  2  3 NaN NaN NaN
 5  5  6  7 NaN NaN NaN

I'm still working on the code. For example, I would like to make header=False work as well as skiprows=1 when they are used with bad_lines='keep'.

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 12, 2017
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, Next Major Release Jan 12, 2017
@jorisvandenbossche
Member

@selasley I think the addition described here for processing bad lines would be very useful. Do you still want to make a PR for this? Or just push your code to a branch on your fork so somebody else could use it?

@jorisvandenbossche
Member

Closing this in favor of #15122; the discussion here was a bit buried under the original (related but different) issue report.
