
usecols doesn't help with unclean CSVs #9549

Closed
harshnisar opened this issue Feb 25, 2015 · 18 comments
Labels
API Design, IO CSV (read_csv, to_csv)

Comments

@harshnisar
Contributor

So I have a lot of CSVs which are clean up to, say, 7 columns, with no missing values, but have strings at random places after column 7.

I know the file is clean only up to column 7. So when I pass usecols and list those 7 columns, I want pandas to ignore the other columns, ideally truncating the rest of each row while reading. Shouldn't that be the behaviour?
I don't want to skip over bad lines.

Is there a way I can force pandas to read only 7 columns, expect 7 fields per row, and hence not raise an exception?

Another method is to use names=range(35), an arbitrarily large number. But then I lose the real headers in my file and can't tell what the columns refer to. The columns are not fixed across files.
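A minimal sketch of that workaround (the file name and the width of 35 are placeholders):

import pandas as pd

# Over-wide integer column names let every row tokenize without error,
# but the real header line now shows up as the first data row.
df = pd.read_csv('myfile.csv', names=range(35), index_col=False)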

edit: This is my first issue report for a big Python package. Please bear with me if I didn't follow protocol.

@jreback
Copy link
Contributor

jreback commented Feb 25, 2015

There were a couple of issues w.r.t. this that have been fixed in master, for the upcoming 0.16.0 release. Want to give it a try? Otherwise, please post pd.show_versions() and a small reproducible example.

@jreback jreback added the IO CSV (read_csv, to_csv) label Feb 25, 2015
@harshnisar
Contributor Author

!cat pandastrial.csv
A,B,C
1,2,3
3,4,5
1,2,4,5,1,6
1,2,3,,,1,
1,2,3
5,6,7

In [3]: df = pd.read_csv('pandastrial.csv', error_bad_lines = False)

Skipping line 4: expected 3 fields, saw 6
Skipping line 5: expected 3 fields, saw 7

In [5]: df = pd.read_csv('pandastrial.csv', usecols = ['A','B','C'])

CParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 6

Basically, I know there is an issue with the other columns and I only trust three of them. How do I tell pandas to read just the first three values in each row and truncate the rest?

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: x86
processor: x86 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.2
nose: None
Cython: None
numpy: 1.9.1
scipy: 0.15.1
statsmodels: None
IPython: 2.3.1
sphinx: None
patsy: None
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.2
openpyxl: None
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None


@harshnisar
Contributor Author

I came up with this hack.
n_clean_cols is the number of leading columns I trust;
maxcols is the maximum length of a bad row, ideally infinity.

import pandas as pd

def clean_csv(filename, n_clean_cols, maxcols=20):
    # Over-wide integer names let every row parse without a tokenizer error.
    df = pd.read_csv(filename, names=range(maxcols), index_col=False)
    # The real header arrived as the first data row; promote it to column names.
    df = df.rename(columns={c: df[c].iloc[0] for c in df.columns})
    # Drop the header row and keep only the trusted leading columns.
    return df.iloc[1:, :n_clean_cols]
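A quick check against the pandastrial.csv listed above:

df = clean_csv('pandastrial.csv', 3)
# keeps columns A, B, C for all six data rows, including the two
# over-long lines that read_csv rejects when usecols is passed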

@harshnisar
Contributor Author

@jreback ?

@jreback
Contributor

jreback commented Feb 27, 2015

This looks like a good solution to me. I don't think usecols should by default ignore bad lines, which is what you are suggesting; it's pretty error-prone to do that.

In [2]: read_csv(StringIO(data),error_bad_lines=False,usecols=['A','B','C'])
Skipping line 5: expected 3 fields, saw 6
Skipping line 6: expected 3 fields, saw 7

Out[2]: 
   A  B  C
0  1  2  3
1  3  4  5
2  1  2  3
3  5  6  7

@selasley
Contributor

I added a process_bad_lines option to the C parser for times when you want to get as much data as possible out of a csv file with bad lines, without pre-processing the file. The docstring mentions the danger of using this option. Using it outputs WARNING lines for the bad lines that are processed:

pd.read_csv(StringIO(data),process_bad_lines=True)
WARNING! Processing bad line 3: expected 3 fields, saw 6
WARNING! Processing bad line 4: expected 6 fields, saw 7

Out[8]:
   A  B  C
0  1  2  3
1  3  4  5
2  1  2  4
3  1  2  3
4  1  2  3
5  5  6  7

It's still easy to get bad data from bad lines with this option, so it's not a cure for bad data:
pd.read_csv(StringIO(data),process_bad_lines=True, header=False, names=range(6))
   0  1  2    3    4    5
0  1  2  3    3    4    5
1  3  4  5    1    2    4
2  1  2  4    5    1    6
3  1  2  3  NaN  NaN    1
4  1  2  3  NaN  NaN  NaN
5  5  6  7  NaN  NaN  NaN

If this is an option that would be useful I can clean up the code and create a pull request.
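For comparison, the manual pre-processing that such an option would avoid looks roughly like the following (a sketch only; truncate_csv is an illustrative helper, not part of any pandas API): truncate every raw line to the expected number of fields before read_csv ever sees it.

import csv
import io
import pandas as pd

def truncate_csv(path, n_fields):
    # Rewrite each row keeping only the first n_fields values so that
    # read_csv never encounters an over-long line.
    buf = io.StringIO()
    writer = csv.writer(buf)
    with open(path, newline='') as fh:
        for row in csv.reader(fh):
            writer.writerow(row[:n_fields])
    buf.seek(0)
    return pd.read_csv(buf)

On the pandastrial.csv above, truncate_csv('pandastrial.csv', 3) would yield all six data rows under the A, B, C header.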

@jreback
Contributor

jreback commented Feb 27, 2015

how does this differ from my example above?

@selasley
Contributor

error_bad_lines and warn_bad_lines skip over bad rows; process_bad_lines attempts to extract data from the bad rows. So the dataframe in your example with error_bad_lines=False contains data from lines 1, 2, 5, and 6, while the dataframe with process_bad_lines=True has data from all 6 lines. I believe this is what the OP was looking for. process_bad_lines would be useful when you know the data in the columns you are interested in is OK even on lines in the file that contain extra columns you don't care about.

@jreback
Contributor

jreback commented Feb 28, 2015

Maybe don't add a new arg then; instead
make error_bad_lines='skip' do this.

@selasley
Contributor

OK, so something like this for the bad_lines options?

error_bad_lines : True, False, "process", default True
    Lines with too many fields (e.g. a csv line with too many commas) will by
    default cause an exception to be raised, and no DataFrame will be returned.
    If False, these "bad lines" will be dropped from the DataFrame that is
    returned. If "process", these "bad lines" will be processed and added to
    the DataFrame that is returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
    If error_bad_lines is False or "process" and warn_bad_lines is True, a
    warning for each "bad line" will be output. (Only valid with C parser)

@jreback
Contributor

jreback commented Feb 28, 2015

I think I would change things a bit

why don't we make a new argument (and deprecate error_bad_lines, warn_bad_lines)

bad_lines='error'|'warn'|'skip'|'process'

default would be error (to replicate current behaviour for back compat),
or warn if error_bad_lines=False and warn_bad_lines=True.

skip would simply skip them.
process would only make a difference if usecols were specified and those
columns were then valid.

too confusing? (or less confusing)?

(obviously not all of these apply to c/python parsers but that is a separate issue)
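One way to wire up the back compat described above (a sketch only; resolve_bad_lines is a hypothetical helper, and a bad_lines argument never shipped under this name):

def resolve_bad_lines(bad_lines=None, error_bad_lines=None, warn_bad_lines=True):
    # Prefer the new argument; otherwise map the two deprecated
    # booleans onto the proposed values, matching the defaults above.
    if bad_lines is not None:
        return bad_lines
    if error_bad_lines is False:
        return 'warn' if warn_bad_lines else 'skip'
    return 'error'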

@harshnisar
Contributor Author

@selasley Yes, that is what I wanted. My work involved checking CSVs which, over the years, various people had edited and added comments to, etc. But the data was clean, at least the parts I wanted.

I finally used the hack I mentioned earlier, but it was a little slow and dirty. How did you implement it initially?

@jreback Your last comment makes sense; I didn't find it that confusing. Maybe something other than process. Maybe force?

@selasley
Contributor

selasley commented Mar 3, 2015

@harshnisar I am modifying the Python, Cython, and C files for the parser and tokenizer, so bad_lines will eventually be available in a release version of pandas.

What do you think about this?

bad_lines : string, {'error', 'skip', or 'keep'}, default 'error'
    Specifies how to handle lines with too many fields in the input file,
    e.g. a csv line with too many commas.
    'error' will raise an exception when the first bad line is encountered
    and will not return a DataFrame.
    'skip' will skip bad lines and continue processing the input file.
    'keep' will attempt to process the bad lines and include the data in
    the DataFrame that is returned. The 'keep' option requires the usecols
    option to specify the good columns in the input file.
    (Only valid with C parser)
error_bad_lines : deprecated, boolean, default True
    Lines with too many fields (e.g. a csv line with too many commas) will by
    default cause an exception to be raised, and no DataFrame will be returned.
    If False, these "bad lines" will be dropped from the DataFrame that is
    returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
    If warn_bad_lines is True, a warning for each "bad line" will be output.
    (Only valid with C parser)

I vote to keep warn_bad_lines in order to turn warnings on or off with the skip and keep options.

@jreback jreback added this to the Next Major Release milestone Mar 4, 2015
@jreback
Contributor

jreback commented Mar 4, 2015

@selasley why would you need warn_bad_lines in addition to the 6 options above?

Maybe enumerate what the other 3, e.g. skipwarn/keepwarn/errorwarn, mean/do? (I think the last one is a bit odd; maybe make it just warn.)

@harshnisar
Contributor Author

@selasley
Why do we need to always specify usecols?
Can't we by default take the number of columns in the first row as the expected width? This would go with the assumption that the first row is always clean.

The heuristic way would be to take the most frequent row length as the default. But is this overkill?
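That heuristic would need an extra pass over the file to find the modal field count, something like this (a sketch; modal_field_count is illustrative and nothing like it exists in read_csv):

import csv
from collections import Counter

def modal_field_count(path):
    # First pass over the file: tally the number of fields per row
    # and return the most common width.
    with open(path, newline='') as fh:
        counts = Counter(len(row) for row in csv.reader(fh))
    return counts.most_common(1)[0][0]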

Also, keeping the warning as a separate argument seems more intuitive. I vote for that too, if it counts :)

@selasley
Contributor

selasley commented Mar 4, 2015

@jreback
My second post was a possible scheme for deprecating both error_bad_lines and warn_bad_lines when adding bad_lines. It seemed hacky to me even as I was writing it, so I deleted it. If you don't have strong objections, I would like to keep warn_bad_lines, deprecate error_bad_lines, and add bad_lines with three options: error, skip, keep.

@harshnisar
I misread jreback's post regarding usecols. The code I've written so far does not require it, so I'll remove that section of the docstring for the keep option. Using the most frequent row length would require reading the file twice or keeping data for each row in memory, which could be a problem for very large files. If a user needs information from the extra columns in the bad rows, they can do something like this for the data string used above:

 read_csv(StringIO(data), bad_lines='keep', names=list('ABCDEF'), skiprows=1)
    A  B  C   D   E   F
 0  1  2  3 NaN NaN NaN
 1  3  4  5 NaN NaN NaN
 2  1  2  4   5   1   6
 3  1  2  3 NaN NaN   1
 4  1  2  3 NaN NaN NaN
 5  5  6  7 NaN NaN NaN

I'm still working on the code. For example, I would like to make header=False work as well as skiprows=1 when they are used with bad_lines='keep'.

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 12, 2017
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, Next Major Release Jan 12, 2017
@jorisvandenbossche
Member

@selasley I think the addition described here for processing bad lines would be very useful. Do you still want to make a PR for this? Or just push your code to a branch on your fork so somebody else could use it?

@jorisvandenbossche
Member

Closing this in favor of #15122; the discussion here was a bit buried under the original (related but different) issue report.
