
Indicate index of rows for which an apply() statement fails #614

Closed
hammer opened this issue Jan 12, 2012 · 5 comments
hammer commented Jan 12, 2012

I'm writing some code to transform a column of my data frame. I expect all string values in this column, and I'm using x.startswith() on the column contents as part of the transformation logic. When I try to apply this transformation to each column in the data frame using df.apply(), the transformation is failing, claiming that it's trying to look up the startswith attribute on a float object. It would be useful for me to know on which row this transformation is failing; could that information be added to the traceback?
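A minimal sketch reproducing the failure mode described above (the column name and values here are illustrative, not from the original data):

```python
import numpy as np
import pandas as pd

# A string column that picked up a missing value, e.g. from a CSV import.
df = pd.DataFrame({"name": ["foo", "bar", np.nan, "baz"]})

# Applying a plain string method blows up on the NaN, which is a float,
# and the plain traceback doesn't say which row it happened at.
try:
    df["name"].apply(lambda x: x.startswith("ba"))
except AttributeError as e:
    print(e)  # 'float' object has no attribute 'startswith'
```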

wesm added a commit that referenced this issue Jan 12, 2012

wesm commented Jan 12, 2012

I like it, and it's an easy change too. Here's a self-contained example:

import numpy as np
from pandas import DataFrame, notnull

data = DataFrame({'A' : ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                  'B' : ['one', 'one', 'one', 'two',
                         'one', 'one', 'one', 'two',
                         'two', 'two', 'one'],
                  'C' : ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                  'D' : np.random.randn(11),
                  'E' : np.random.randn(11),
                  'F' : np.random.randn(11)})

data['C'][4] = np.nan

def transform(row):
    if row['C'].startswith('shin') and row['A'] == 'foo':
        row['D'] = 7
    return row

def transform2(row):
    if (notnull(row['C']) and  row['C'].startswith('shin')
        and row['A'] == 'foo'):
        row['D'] = 7
    return row

Then the traceback looks like:

In [2]: data.apply(transform, axis=1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/home/wesm/code/pandas/<ipython-input-2-cae6247f68ab> in <module>()
----> 1 data.apply(transform, axis=1)

/home/wesm/code/pandas/pandas/core/frame.pyc in apply(self, func, axis, broadcast, raw, args, **kwds)
   2665                     return self._apply_raw(f, axis)
   2666                 else:
-> 2667                     return self._apply_standard(f, axis)
   2668             else:
   2669                 return self._apply_broadcast(f, axis)

/home/wesm/code/pandas/pandas/core/frame.pyc in _apply_standard(self, func, axis, ignore_failures)
   2719             try:
   2720                 for k, v in series_gen:
-> 2721                     results[k] = func(v)
   2722             except Exception, e:
   2723                 if hasattr(e, 'args'):

/home/wesm/code/pandas/<ipython-input-1-696d0fa0400e> in transform(row)
     15 
     16 def transform(row):
---> 17         if row['C'].startswith('shin') and row['A'] == 'foo':
     18                 row['D'] = 7
     19         return row

AttributeError: ("'float' object has no attribute 'startswith'", 'occurred at index 4')

I also noticed that if you apply a function to the rows of a mixed-type DataFrame (as in the example above), you lose the type information. I added a small type-inference hack to convert things back. df.apply(f, axis=1) isn't especially fast right now; that's something I should fix at some point.


wesm commented Jan 12, 2012

Aside: it'd be nice to add vectorized string functions to pandas, similar to hadley's stringr package. They could also be made NA-friendly.
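For context, this idea later landed in pandas as the vectorized Series.str methods, which propagate missing values instead of raising; a quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(["dull", "shiny", np.nan, "shiny"])

# The .str accessor applies the string method element-wise and is
# NA-friendly: the missing entry stays NaN instead of raising.
print(s.str.startswith("shin"))
# 0    False
# 1     True
# 2      NaN
# 3     True
```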


hammer commented Jan 12, 2012

Great! I should note that, looking into the problem a bit further, the float values were np.nan, generated when I imported a CSV with some blank values in a column. In other words, I suspect anyone applying string functions to a column with missing values will hit this issue. Which brings me to another potential improvement: a verbose data-import mode that reports the number of missing values automatically filled in.
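Until then, the usual workaround for this class of bug is to guard the string method behind a null check; a sketch with an illustrative column name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["foo", "bar", np.nan, "baz"]})

# pd.notnull short-circuits the string call for missing values,
# so NaN rows simply come out False instead of raising.
mask = df["name"].apply(lambda x: pd.notnull(x) and x.startswith("ba"))
print(df[mask])  # keeps only the 'bar' and 'baz' rows
```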


ghost commented Jan 12, 2012

"Aside: it'd be nice to add vectorized string functions to pandas, similar to hadley's stringr package. They could also be made NA-friendly"
This would be great.

wesm added a commit that referenced this issue Jan 12, 2012
…values filled in non-numeric columns per comment on #614

wesm commented Jan 12, 2012

@hammer, OK, I'll bite on that (this would have been useful information to me in the past). It's hard to add a lot of verbosity without sacrificing performance, but getting a basic NA count for non-numeric columns seems pretty useful:


from pandas import *
from StringIO import StringIO

def f(verbose=True):
    text = """a,b,c,d
one,1,2,3
one,1,2,3
,1,2,3
one,1,2,3
,1,2,3
,1,2,3
one,1,2,3
two,1,2,3"""
    data = StringIO(text)
    result = read_csv(data, verbose=verbose)
    return result

result = f()

The output looks like:


In [2]: result = read_csv(data, verbose=True)
Filled 3 NA values in column a
Out[2]: 
  a    b  c  d
0 one  1  2  3
1 one  1  2  3
2 NaN  1  2  3
3 one  1  2  3
4 NaN  1  2  3
5 NaN  1  2  3
6 one  1  2  3
7 two  1  2  3

Something that can definitely be fleshed out over time.

@wesm wesm closed this as completed Jan 12, 2012