
Indicate index of rows for which an apply() statement fails #614

Closed
hammer opened this issue Jan 12, 2012 · 5 comments
hammer commented Jan 12, 2012

I'm writing some code to transform a column of my data frame. I expect all string values in this column, and I'm using x.startswith() on the column contents as part of the transformation logic. When I try to apply this transformation to each column in the data frame using df.apply(), the transformation is failing, claiming that it's trying to look up the startswith attribute on a float object. It would be useful for me to know on which row this transformation is failing; could that information be added to the traceback?
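A minimal sketch reproducing the failure mode described above (the column name and values here are illustrative, not from the original data):

```python
import numpy as np
import pandas as pd

# A string column that picked up a missing value, e.g. from a CSV import.
df = pd.DataFrame({"name": ["foo", "bar", np.nan, "baz"]})

# Applying a plain string method blows up on the NaN, which is a float,
# and the plain traceback doesn't say which row it happened at.
try:
    df["name"].apply(lambda x: x.startswith("ba"))
except AttributeError as e:
    print(e)  # 'float' object has no attribute 'startswith'
```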

wesm added a commit that referenced this issue Jan 12, 2012

wesm commented Jan 12, 2012

I like it, and it's an easy change too. Here's a self-contained example:

import numpy as np
from pandas import DataFrame, notnull

data = DataFrame({'A' : ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                  'B' : ['one', 'one', 'one', 'two',
                         'one', 'one', 'one', 'two',
                         'two', 'two', 'one'],
                  'C' : ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                  'D' : np.random.randn(11),
                  'E' : np.random.randn(11),
                  'F' : np.random.randn(11)})

data['C'][4] = np.nan

def transform(row):
    if row['C'].startswith('shin') and row['A'] == 'foo':
        row['D'] = 7
    return row

def transform2(row):
    if (notnull(row['C']) and  row['C'].startswith('shin')
        and row['A'] == 'foo'):
        row['D'] = 7
    return row

Then the traceback looks like:

In [2]: data.apply(transform, axis=1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/home/wesm/code/pandas/<ipython-input-2-cae6247f68ab> in <module>()
----> 1 data.apply(transform, axis=1)

/home/wesm/code/pandas/pandas/core/frame.pyc in apply(self, func, axis, broadcast, raw, args, **kwds)
   2665                     return self._apply_raw(f, axis)
   2666                 else:
-> 2667                     return self._apply_standard(f, axis)
   2668             else:
   2669                 return self._apply_broadcast(f, axis)

/home/wesm/code/pandas/pandas/core/frame.pyc in _apply_standard(self, func, axis, ignore_failures)
   2719             try:
   2720                 for k, v in series_gen:
-> 2721                     results[k] = func(v)
   2722             except Exception, e:
   2723                 if hasattr(e, 'args'):

/home/wesm/code/pandas/<ipython-input-1-696d0fa0400e> in transform(row)
     15 
     16 def transform(row):
---> 17         if row['C'].startswith('shin') and row['A'] == 'foo':
     18                 row['D'] = 7
     19         return row

AttributeError: ("'float' object has no attribute 'startswith'", 'occurred at index 4')

I also noticed that if you apply a function to the rows of a mixed-type DataFrame (as in the example above), you lose the type information. I added a small type-inference hack to convert things back. df.apply(f, axis=1) isn't especially fast right now; that's something I should fix at some point.


wesm commented Jan 12, 2012

Aside: it'd be nice to add vectorized string functions to pandas, similar to hadley's stringr package. They could also be made NA-friendly.
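For context, this idea later landed in pandas as the vectorized Series.str methods, which propagate missing values instead of raising; a quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(["dull", "shiny", np.nan, "shiny"])

# The .str accessor applies the string method element-wise and is
# NA-friendly: the missing entry stays NaN instead of raising.
print(s.str.startswith("shin"))
# 0    False
# 1     True
# 2      NaN
# 3     True
```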


hammer commented Jan 12, 2012

Great! I should note that, looking into the problem a bit further, the float values were np.nan, generated when I imported a CSV with some blank values in a column. In other words, I suspect anyone applying string functions to a column with missing values will hit this issue. Which brings me to another potential improvement: a verbose data-import mode that reports the number of missing values automatically filled in.
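Until then, the usual workaround for this class of bug is to guard the string method behind a null check; a sketch with an illustrative column name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["foo", "bar", np.nan, "baz"]})

# pd.notnull short-circuits the string call for missing values,
# so NaN rows simply come out False instead of raising.
mask = df["name"].apply(lambda x: pd.notnull(x) and x.startswith("ba"))
print(df[mask])  # keeps only the 'bar' and 'baz' rows
```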


ghost commented Jan 12, 2012

"Aside: it'd be nice to add vectorized string functions to pandas, similar to hadley's stringr package. They could also be made NA-friendly"
This would be great.

wesm added a commit that referenced this issue Jan 12, 2012
…values filled in non-numeric columns per comment on #614

wesm commented Jan 12, 2012

@hammer, OK, I'll bite on that (this would have been useful information to me in the past). It's hard to add a lot of verbosity without sacrificing performance, but getting a basic NA count for non-numeric columns seems pretty useful:


from pandas import *
from StringIO import StringIO

def f(verbose=True):
    text = """a,b,c,d
one,1,2,3
one,1,2,3
,1,2,3
one,1,2,3
,1,2,3
,1,2,3
one,1,2,3
two,1,2,3"""
    data = StringIO(text)
    result = read_csv(data, verbose=verbose)
    return result

result = f()

The output looks like:


In [2]: result = read_csv(data, verbose=True)
Filled 3 NA values in column a
Out[2]: 
  a    b  c  d
0 one  1  2  3
1 one  1  2  3
2 NaN  1  2  3
3 one  1  2  3
4 NaN  1  2  3
5 NaN  1  2  3
6 one  1  2  3
7 two  1  2  3

Something that can definitely be fleshed out over time.

@wesm wesm closed this as completed Jan 12, 2012