get start and end of regex match in dataframe #8747

Open
teese opened this issue Nov 6, 2014 · 9 comments
Labels
Enhancement, Indexing, Strings

Comments


teese commented Nov 6, 2014

What about including a method to get the start and end of the match after a regex search of items in a DataFrame, perhaps using .str.extract?

Returning the start as a new column would perhaps be as follows:

df['start'] = df['string'].str.extract(pattern, output = 'start')

An alternative suggestion from jkitchen on StackOverflow was to use start_index = True or end_index = True:

df['start'] = df['string'].str.extract(pattern, start_index = True)

For multiple parameters (e.g. start and end) as outputs, there needs to be a way to avoid running the search twice. One solution would be to give the output as a tuple:

df['regex_output_tuple'] = df['string'].str.extract(pattern, output = ('start','end'))

I don't use regex very often, so I don't know if there are other parameters that people want after a regex search. If the only outputs of interest are the text in the groups, the start and the end, perhaps there's a way to put the output directly into new columns?

df['groups'], df['start'], df['end']  = df['string'].str.extract(pattern, output = ('groups','start','end'))

I think it makes sense that non-matches return a NaN, just as in the regular extract function. This would mix integer and float datatypes in the df['start'] column, but I guess we all know about that situation :)
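The int/float mixing mentioned above can be avoided with the nullable integer dtype that later pandas versions provide. A minimal sketch, assuming pandas 0.24 or newer; `match_spans` is a hypothetical helper name, not an existing pandas method:

```python
import re
import pandas as pd

def match_spans(series, pattern):
    """Return the start and end of the first match per row (hypothetical helper)."""
    regex = re.compile(pattern)
    starts, ends = [], []
    for value in series:
        # Non-string entries (e.g. NaN) are treated as non-matches.
        m = regex.search(value) if isinstance(value, str) else None
        starts.append(m.start() if m else None)
        ends.append(m.end() if m else None)
    # "Int64" is the nullable integer dtype: non-matches become <NA>
    # and the matched rows stay integers instead of turning into floats.
    return pd.DataFrame(
        {
            "start": pd.array(starts, dtype="Int64"),
            "end": pd.array(ends, dtype="Int64"),
        },
        index=series.index,
    )

example_spans = match_spans(
    pd.Series(["MPIMGSS", "xxx", "MP-IMGSS"]), "P-*I-*M-*G-*S-*S-*"
)
```

The regex search runs once per string, and both indices come from the same match object.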

I'm not an experienced programmer, so sorry if I misunderstood some basic concepts.

Please see the question in StackOverflow for example code and comments:
http://stackoverflow.com/questions/26658213/how-can-i-find-the-start-and-end-of-a-regex-match-using-a-python-pandas-datafram

A block of example data and code is below, as requested by jreback.

import pandas as pd
import re
#some example query sequences, markup strings, hit sequences.
q1,q2,q3 = 'MPIMGSSVYITVELAIAVLAILG','MPIMGSSVYITVELAIAVLAILG','MPI-MGSSVYITVELAIAVLAIL'
m1,m2,m3 = '|| ||  ||||||||||||||||','||   | ||| :|| || |:: |','||:    ::|: :||||| |:: '
h1,h2,h3 = 'MPTMGFWVYITVELAIAVLAILG','MP-NSSLVYIGLELVIACLSVAG','MPLETQDALYVALELAIAALSVA' 
#create a pandas dataframe to hold the aligned sequences
df = pd.DataFrame({'query':[q1,q2,q3],'markup':[m1,m2,m3],'hit':[h1,h2,h3]})

#create a regex search string to find the appropriate subset in the query sequence, 
desired_region_from_query = 'PIMGSS'
regex_desired_region_from_query = '(P-*I-*M-*G-*S-*S-*)'

#Pandas has a nice extract function to slice out the matched sequence from the query:
df['extracted'] = df['query'].str.extract(regex_desired_region_from_query)

#However I need the start and end of the match in order to extract the equivalent regions 
#from the markup and hit columns. For a single string, this is done as follows:
match = re.search(regex_desired_region_from_query, df.loc[2,'query'])
sliced_hit = df.loc[2,'hit'][match.start():match.end()]
print('sliced_hit, non-vectorized example: ', sliced_hit)

#HERE the new syntax is necessary
#e.g. df['start'], df['end']  = df['string'].str.extract(pattern, output = ('start','end'))

#My current workaround in pandas is as follows.
#define function to obtain regex output (start, stop, etc) as a tuple
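def get_regex_output(x):
    m = re.search(regex_desired_region_from_query, x)
    #return NaN for both indices if there is no match, to avoid an AttributeError on None
    return (m.start(), m.end()) if m else (float('nan'), float('nan'))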
#apply function
df['regex_output_tuple'] = df['query'].apply(get_regex_output)
#convert the tuple into two separate columns
columns_from_regex_output = ['start','end']      
for n, col in enumerate(columns_from_regex_output):
    df[col] = df['regex_output_tuple'].apply(lambda x: x[n])
#delete the unnecessary column
df = df.drop('regex_output_tuple', axis=1)
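Once the match object is available, the equivalent region of the hit column can be sliced in the same pass, row by row. A minimal sketch using two of the sequence pairs from the example data; `slice_hit` is a hypothetical helper name, and the row-wise `apply` is not vectorized, so it may be slow on large frames:

```python
import re
import pandas as pd

# Two of the aligned query/hit pairs from the example data.
df_example = pd.DataFrame({
    "query": ["MPIMGSSVYITVELAIAVLAILG", "MPI-MGSSVYITVELAIAVLAIL"],
    "hit": ["MPTMGFWVYITVELAIAVLAILG", "MPLETQDALYVALELAIAALSVA"],
})
regex = re.compile("(P-*I-*M-*G-*S-*S-*)")

def slice_hit(row):
    """Slice the hit sequence at the position of the regex match in the query."""
    m = regex.search(row["query"])
    if m is None:
        return None  # no match in the query, so nothing to slice
    return row["hit"][m.start():m.end()]

df_example["sliced_hit"] = df_example.apply(slice_hit, axis=1)
print(df_example["sliced_hit"].tolist())  # ['PTMGFW', 'PLETQDA']
```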

jreback commented Nov 7, 2014

Can you provide a short but specific example of exactly what is needed here? Make it runnable as far as possible and indicate where the new syntax is needed.

jreback added the Indexing, Strings and API Design labels Nov 7, 2014

teese commented Nov 8, 2014

I've added the code to the question as requested. Sorry it's not short, but it contains some real data to help explain why it is necessary to obtain the regex start and end. The code should work in both Python 2.7 and 3.4, with the latest pandas release (0.15.0). In my case, I will apply the above workaround to ~5000 dataframes, each containing ~5000 rows, with significantly longer sequences (~500 characters in each string).


jreback commented Nov 8, 2014

So it actually sounds like you want a function like extract, but one that returns the matched indices,

e.g. df['query'].indices(regex_desired_region_from_query, outtype='list|frame')

A subtle issue is whether the match should return just (start, end) or a list of matches (I'm not sure what that would look like).
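One possible shape for the list-of-matches case is a long-format frame with one row per match, similar in spirit to what str.extractall returns. A minimal sketch built on `re.finditer`; `all_match_spans` and the `row`/`match` index level names are hypothetical, not a proposed pandas API:

```python
import re
import pandas as pd

def all_match_spans(series, pattern):
    """Return one row per match, indexed by (original row label, match number)."""
    regex = re.compile(pattern)
    records = []
    for label, value in series.items():
        # finditer yields every non-overlapping match, left to right.
        for n, m in enumerate(regex.finditer(value)):
            records.append((label, n, m.start(), m.end()))
    out = pd.DataFrame(records, columns=["row", "match", "start", "end"])
    return out.set_index(["row", "match"])

texts = pd.Series([
    "He was carefully disguised but captured quickly by police.",
    "no adverbs",
])
adverb_spans = all_match_spans(texts, r"\w+ly")
```

Rows without any match simply do not appear in the result, so no NaN placeholder is needed in this layout.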

jreback added this to the 0.15.2 milestone Nov 8, 2014

teese commented Nov 8, 2014

Create .indices as another function?
It's an interesting idea, but I'd have to admit I'm already confused with the .match, .extract, .contains functions that already exist.

Beginners learn to apply regex to single strings using the following syntax
(from https://docs.python.org/3.4/library/re.html):

text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

As a beginner, I am happiest when the syntax in pandas matches the original syntax as closely as possible. The .extract function works great, but after looking at the discussion in #5075, I would probably have voted to keep the name .match, replace the legacy code with the new extract function, and change the output (group, bool, index, or a combination) based on various arguments.

Currently, when someone wants to get three things (the groups, the start index and the end index), the only way to do this without repeating the regex search is to get the indices first and then apply some lambda functions to slice out the group. This is a very different process from what people are accustomed to when using the original re module.
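For comparison, a single match object from the re module already carries the group text and both indices. A minimal sketch of collecting all three from one search per string, assuming a pattern with at least one capturing group; `extract_with_span` is a hypothetical helper name, not a pandas method:

```python
import re
import pandas as pd

def extract_with_span(series, pattern):
    """Return the first group plus its start and end from one search per string."""
    regex = re.compile(pattern)

    def one(value):
        m = regex.search(value)
        if m is None:
            return (None, None, None)  # non-match: all three outputs are missing
        # Group 1, with the span of that same group.
        return (m.group(1), m.start(1), m.end(1))

    rows = [one(value) for value in series]
    return pd.DataFrame(rows, columns=["group", "start", "end"], index=series.index)

extracted = extract_with_span(
    pd.Series(["MPIMGSSVY", "AAAA"]), "(P-*I-*M-*G-*S-*S-*)"
)
```

Because missing values are represented with None here, the start and end columns may be stored as floats when some rows do not match.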

So in summary, in order of my preferences:

  1. incorporate extract and proposed get-indices into str.match (to me the simplest for new users, but involves reopening an old discussion and worrying about backwards compatibility)
  2. incorporate get-indices function into str.match, but leave the current default output as 'bool' (as planned)
  3. create a new str.indices function

What're your thoughts concerning the first two options?

Regarding your second comment as to whether the match can return just (start,end) or a list of matches, I still have to sit down and think about that one :)

jreback modified the milestones: 0.16.0, 0.15.2 Nov 30, 2014
jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

edumotya commented Feb 3, 2021

This is my workaround for named groups:

import re
import pandas as pd

class SpanExtractor:
    def __init__(self, pattern):
        self._pattern = re.compile(pattern)
        self._groups = list(self._pattern.groupindex.keys())

    def __call__(self, x):
        """
        Utility function to extract the start and end indices.
        """
        m = self._pattern.search(x)
        if m:
            span_groups = {g: m.span(g) for g in self._groups}
        else:
            span_groups = {g: (float("nan"), float("nan")) for g in self._groups}
        return pd.Series(span_groups)


def _extract_spans(ds: pd.Series, pattern: str) -> pd.DataFrame:
    span_extractor = SpanExtractor(pattern)
    spans = ds.apply(span_extractor)
    spans = pd.concat(
        [
            pd.DataFrame(
                spans[col].to_list(),
                columns=["start_index_" + col, "end_index_" + col],
                index=spans.index,
            )
            for col in spans
        ],
        axis="columns",
    )
    return spans

spans = _extract_spans(df["text"], pattern)
spans["start_index_{your_named_group_1}"] 
spans["end_index_{your_named_group_1}"] 
spans["start_index_{your_named_group_2}"] 

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
vsocrates commented

Hi, I'd love to see a solution to this and it seems like fairly expected functionality, given the way that the re module works.

It looks like an edit to the _str_extract function here may be the start to a fix, but it seems like there would be issues with backwards compatibility or impacts on other functions.

if not expand:

    def g(x):
        m = regex.search(x)
        return m.groups()[0] if m else na_value


    return self._str_map(g, convert=False)

I'd be willing to take a stab at it if someone can provide me with some more direction (unless there are plans to implement this in a future release that I missed)?

vsocrates commented

Hi, following up on this! @mroeschke, wondering why the "Contributions Welcome" milestone was taken off and if this is still up for contributions, thanks!

GolAGitHub commented

I frequently wish I had access to regex match object methods when using str.extract/str.extractall. Is this still under consideration for a new release?


delucca commented Jun 27, 2023

+1 for this
