
Feature Request: regex for drop #4818

Closed
jseabold opened this issue Sep 11, 2013 · 19 comments
Labels: API Design · Closing Candidate (may be closeable, needs more eyeballs) · Enhancement

Comments

@jseabold (Contributor)

Don't have time to implement this, but I wanted to float the idea and park it. It's pretty trivial and you can achieve the same thing with filter, but it might be nice if drop had a regex keyword. E.g., these would be equivalent

df = df.filter(regex="^(?!var_start)")
df = df.drop(regex="^var_start", axis=1)
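A quick sketch of the requested equivalence, using made-up column names (the `var_start` prefix from the example above) since `drop` has no `regex` keyword to demonstrate:

```python
import pandas as pd

# Toy frame, not from the issue, just to illustrate the two spellings.
df = pd.DataFrame({"var_start_a": [1], "var_start_b": [2], "other": [3]})

# What works today: keep columns NOT matching the prefix via a negative lookahead.
kept = df.filter(regex="^(?!var_start)")

# What drop(regex=...) would do, spelled out by hand.
to_drop = [c for c in df.columns if c.startswith("var_start")]
dropped = df.drop(to_drop, axis=1)

assert list(kept.columns) == ["other"]
assert list(dropped.columns) == ["other"]
```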
@nehalecky (Contributor)

Nice. Perhaps on a more fundamental level, we could simply expose the Series.str methods to all pandas index classes and allow for pattern searches across labels that way? I tried explaining (rather poorly) use cases for this type of functionality a while back, don't really know how well it came across:
#2922 (comment)

@jtratner (Contributor)

We've talked a few times about moving min, max, & friends to a mixin so they can be used for Index as well as Series, etc. We could try to do the same thing for str too.

@jseabold (Contributor, Author)

I'm often doing things like

var_names = df.filter(regex="pat").columns.tolist()

Would be great if I could just do df.columns.select("pat").tolist() or something.
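For the record, later pandas versions did grow a `.str` accessor on `Index`, which gets close to the `df.columns.select("pat")` wished for above (column names below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(columns=["pat_a", "pat_b", "other"])

# Label pattern searches without the filter round-trip:
var_names = df.columns[df.columns.str.contains("pat")].tolist()
assert var_names == ["pat_a", "pat_b"]
```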

@jreback (Contributor)

jreback commented Sep 21, 2013

fyi...you don't need the tolist usually, as the returned index is already pretty list-like

@jreback (Contributor)

jreback commented Sep 21, 2013

wouldn't df.columns.filter('pat') be better?

@jseabold (Contributor, Author)

Yeah, sure. Except regex isn't the default arg for filter (unfortunately). It's what I want 95% of the time.

Also, not list-like enough for me

["a"] + pd.Index(["b", "c"])

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 11, 2014
@jreback (Contributor)

jreback commented Mar 11, 2014

@jseabold

I think it's trivial to allow filter to work on a frame (on the specified axis, defaulting to 1, e.g. columns) with a regex if it is passed directly, e.g.

df.filter('a_reg_ex'), while if it's list-like it will match exactly, e.g. df.filter(['A','B'])

Is there a reason we didn't do this before? (This is not even a big API change and is backwards compatible.)

At the same time, is there a reason at all for select? (which filter actually uses, but could be folded in)

@hayd @jorisvandenbossche

@hayd (Contributor)

hayd commented Mar 11, 2014

It is a slight API change, as atm you can do (though I guess this is break-able):

In [11]: df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])

In [12]: df.filter('AB')  # equivalent to ['A', 'B']
Out[12]:
   A  B
0  1  2
1  1  3
2  5  6

In [13]: df.filter(regex='AB')  # works
Out[13]:
Empty DataFrame
Columns: []
Index: [0, 1, 2]

I had no idea what the like arg did without checking the source; it's basically just a subset of regex (in the spirit of SQL's LIKE)...

👍 on taking a string regex, a list-like, or a crit (a la select), and maybe deprecating the other args. I reckon alias select to filter and deprecate it (but not remove it). Happy to take this and do drop at the same time.
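The "like is a subset of regex" point above can be checked directly; `like=` is plain substring matching on the labels, equivalent to an unanchored regex (toy columns for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["AB", "BC", "CD"])

# Substring match vs. the equivalent unanchored regex:
by_like = list(df.filter(like="B").columns)
by_regex = list(df.filter(regex="B").columns)

assert by_like == ["AB", "BC"]
assert by_regex == ["AB", "BC"]
```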

@jreback (Contributor)

jreback commented Mar 11, 2014

Go for it!

@hayd (Contributor)

hayd commented Mar 11, 2014

Ah, like is subtly different in that it converts everything to string first (regex raises horribly if there are any non-strings)... e.g. if you have df.columns = [0, 'a'], should regex='0' find the first col? (Basically, should we convert to string?) Should there be a way to choose not to grab ints?

@hayd (Contributor)

hayd commented Mar 11, 2014

Also, select says above it "# TODO: Check if this was clearer in 0.12"...

@hayd (Contributor)

hayd commented Mar 11, 2014

This function can be significantly simplified, and also work with dupe colnames.

Any thoughts on a good argname (items is not that great)?

    def filter(self, items, axis=None, **kwargs):
        """
        Restrict the info axis to set of items or wildcard

        Parameters
        ----------
        items : Either function, regex or list-like
            Boolean function to be called on each index (label)
            Regular expression to be tested against each index
            List of info axis to restrict to

        axis : int

maybe labels (a la drop)

@jreback (Contributor)

jreback commented Mar 11, 2014

crit? matcher? selection?

@hayd (Contributor)

hayd commented Mar 11, 2014

I was thinking of unifying the args for filter and drop, so it perhaps makes sense to use drop's args. See PR.

Is there an argument to keep filter rather than select?

@jreback (Contributor)

jreback commented Mar 11, 2014

In theory one should apply to data (think query/select) and one to labels.

I think it's somewhat arbitrary, but filter already does labels.

@jreback (Contributor)

jreback commented Mar 11, 2014

Also, you may want to post on the mailing list to get some more feedback on this once you have a nice proposed API.

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 5, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.1 Jun 5, 2014
@jreback jreback modified the milestones: 0.15.0, 0.15.1 Jul 6, 2014
@jreback jreback modified the milestones: 0.16, 0.15.0 Sep 14, 2014
@jreback jreback removed this from the 0.16 milestone Oct 7, 2014
@jreback jreback modified the milestones: 0.15.1, 0.16 Oct 7, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@tik0

tik0 commented Mar 13, 2019

I just want to leave a simple recipe here (almost a one-liner, if one merges everything together) for how one could do this. Drop all columns starting with "enc_":

import pandas as pd
import re

def drop_matching(df, regex):
    # collect the column labels that match the pattern...
    matches = [col for col in df.columns if re.search(regex, col)]
    # ...and drop them (renamed so as not to shadow the builtin filter)
    return df.drop(matches, axis=1)

d = {'enc_1': [1, 2], 'enc_2': [3, 4], 'dec_2': [3, 4]}
df = pd.DataFrame(data=d)
drop_matching(df, r'enc_.*')

@WillAyd (Member)

WillAyd commented Mar 14, 2019

@tik0 you can use a regex to do that. Something like:

df.filter(regex=r'^(?!enc_)')

@jreback this is a pretty old issue but I think it's duplicative of what's available already in filter. Any reason to keep this one open?
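For completeness, checking the negative-lookahead suggestion on the frame from the previous comment:

```python
import pandas as pd

df = pd.DataFrame({'enc_1': [1, 2], 'enc_2': [3, 4], 'dec_2': [3, 4]})

# ^(?!enc_) keeps every column whose name does NOT start with "enc_",
# which is the drop-by-regex behaviour requested in this issue.
result = df.filter(regex=r'^(?!enc_)')
assert list(result.columns) == ['dec_2']
```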

@rhshadrach rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Feb 12, 2021
@MarcoGorelli (Member)

Closing as there hasn't been any uptake here (and I agree with the assessment that df.filter(regex=r'^(?!enc_)') is simple enough), though please do ping if you have a use-case where that isn't practical.
